<img src="kegalign_logo.png" width="300">

This is @galaxyproject's modified fork of the original SegAlign.

<a name="overview"></a> Overview

Precise genome aligner efficiently leveraging GPUs.

<a name="changes"></a> Changes from the original implementation

<a name="installation"></a> Installation

For standalone installation use Conda: conda install conda-forge::kegalign

For standalone installation with additional tools use Bioconda: conda install bioconda::kegalign-full

For installation in Galaxy we currently use the wrappers richard-burhans:kegalign and richard-burhans:batched_lastz from the Main Tool Shed. Try the tools at usegalaxy.org: kegalign, batched_lastz

# create and activate the standard conda environment
git clone https://github.com/galaxyproject/KegAlign.git
cd KegAlign
./scripts/make-conda-env.bash
source ./conda-env.bash

# alternatively, create and activate the development environment
git clone https://github.com/galaxyproject/KegAlign.git
cd KegAlign
./scripts/make-conda-env.bash -dev
source ./conda-env-dev.bash

# build from source
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

<a name="dependencies"></a> Dependencies

The following dependencies are required by KegAlign:

<a name="usage"></a> Usage

<a name="alignment"></a> Alignment

Running a Sample Alignment

# install kegalign
git clone https://github.com/galaxyproject/KegAlign.git
cd KegAlign
./scripts/make-conda-env.bash
source ./conda-env.bash

# convert target (ref) and query to 2bit
mkdir work
faToTwoBit <(gzip -cdfq ./test-data/apple.fasta.gz) work/ref.2bit
faToTwoBit <(gzip -cdfq ./test-data/orange.fasta.gz) work/query.2bit

# generate LASTZ keg
python ./scripts/runner.py --diagonal-partition --format maf- --num-cpu 16 --num-gpu 1 --output-file data_package.tgz --output-type tarball --tool_directory ./scripts test-data/apple.fasta.gz test-data/orange.fasta.gz
python ./scripts/package_output.py --format_selector maf --tool_directory ./scripts

# run LASTZ keg
python ./scripts/run_lastz_tarball.py --input=data_package.tgz --output=apple_orange.maf --parallel=16

# check output
diff apple_orange.maf <(gzip -cdfq ./test-data/apple_orange.maf.gz)

# command-line kegalign
kegalign test-data/apple.fasta.gz test-data/orange.fasta.gz work/ --num_gpu 1 --num_threads 16 > lastz-commands.txt
bash lastz-commands.txt
(echo "##maf version=1"; cat *.maf-) > apple_orange.maf

Running with MIG/MPS

GPU utilization can be increased by using MIG and/or MPS, which can make alignments up to 20% faster.

The provided split_input.py script distributes the individual chromosomes of the input genome into separate FASTA files (at most --max_chunks), each holding roughly --goal_bp base pairs. These files are then aligned in parallel on the same GPU(s). Because individual chromosomes are never split, --goal_bp should not be significantly smaller than the largest chromosome in the input file; otherwise the chunks will not be similarly sized. A good --goal_bp value for the human genome is 200 million base pairs.
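To make the effect of --goal_bp and --max_chunks more concrete, here is a minimal sketch of one way whole sequences could be grouped into chunks. It is an illustration only and is not taken from split_input.py: the function name `plan_chunks`, the greedy "longest sequence into the smallest chunk" strategy, and the toy sequence lengths are all assumptions.

```python
import heapq
import math


def plan_chunks(seq_lengths, goal_bp, max_chunks):
    """Group whole sequences into chunks of roughly goal_bp base pairs.

    Hypothetical helper for illustration; split_input.py may use a
    different strategy. Sequences are never split across chunks.
    """
    total_bp = sum(seq_lengths.values())
    # Aim for total_bp / goal_bp chunks, but never more than max_chunks.
    n_chunks = max(1, min(max_chunks, math.ceil(total_bp / goal_bp)))
    # Min-heap of (current chunk size, chunk index).
    heap = [(0, i) for i in range(n_chunks)]
    chunks = [[] for _ in range(n_chunks)]
    # Place the longest sequences first, each into the currently smallest chunk.
    for name, length in sorted(seq_lengths.items(), key=lambda kv: -kv[1]):
        size, idx = heapq.heappop(heap)
        chunks[idx].append(name)
        heapq.heappush(heap, (size + length, idx))
    return [chunk for chunk in chunks if chunk]


# Toy lengths (assumed values, not real chromosome sizes).
lengths = {"chr1": 50, "chr2": 40, "chr3": 30, "chr4": 20, "chr5": 10}
chunks = plan_chunks(lengths, goal_bp=50, max_chunks=30)
print(chunks)
```

In this toy case the largest sequence fills a chunk on its own, and the remaining sequences are paired up to reach the same size. If `goal_bp` were much smaller than the largest sequence, that sequence would still occupy a single chunk and dominate the run time, which is the imbalance the paragraph above warns about.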

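# split both inputs into chunks and convert them to 2bit
mkdir query_split target_split
./scripts/mps-mig/split_input.py --input ./test-data/apple.fasta.gz --out query_split --to_2bit --goal_bp 20000000 --max_chunks 30
./scripts/mps-mig/split_input.py --input ./test-data/orange.fasta.gz --out target_split --to_2bit --goal_bp 20000000 --max_chunks 30

# create a directory for temporary files and MPS pipes
mkdir tmp

# list the available GPUs and MIG instances with their UUIDs
nvidia-smi -L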

Each KegAlign instance, with default settings, uses around 12 to 16 GiB of GPU memory. The chosen GPUs or MIG instances should each have enough GPU memory to run the number of KegAlign instances defined by the --MPS parameter.
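As a rough sizing aid, the arithmetic behind this constraint can be sketched as follows. The helper name `max_instances` and the 40 and 80 GiB device sizes are assumptions chosen for illustration; the 16 GiB figure is the upper end of the range quoted above, and the memory actually available on a device is usually somewhat below its nominal capacity.

```python
import math


def max_instances(gpu_memory_gib, per_instance_gib=16):
    """Upper bound on concurrent KegAlign instances for one GPU or MIG instance.

    Hypothetical helper: uses the conservative 16 GiB per-instance estimate.
    """
    return math.floor(gpu_memory_gib / per_instance_gib)


# Example device sizes (assumed values).
for memory_gib in (40, 80):
    print(f"{memory_gib} GiB device: --MPS of at most {max_instances(memory_gib)}")
```

Under these assumptions a 40 GiB device would fit two instances and an 80 GiB device five, so an --MPS value of 4 would only be a safe choice on the larger device.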

python ./scripts/mps-mig/run_mig.py [GPU-UUID1],[GPU-UUID2] --MPS 4 --target ./target_split --query ./query_split --tmp_dir ./tmp/ --mps_pipe_dir ./tmp/ --output ./apples_oranges.maf --num_threads 64

<a name="scoring"></a> Scoring Options

By default, the HOXD70 substitution scores are used (Chiaromonte et al., 2002):

bad_score          = X:-1000  # used for sub['X'][*] and sub[*]['X']
fill_score         = -100     # used when sub[*][*] is not defined
gap_open_penalty   =  400
gap_extend_penalty =   30

     A     C     G     T
A   91  -114   -31  -123
C -114   100  -125   -31
G  -31  -125   100  -114
T -123   -31  -114    91

A custom scoring matrix can be supplied through the --scoring parameter. A substitution matrix can also be inferred from your own data using another LASTZ-based tool (LASTZ_D: Infer substitution scores).
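To illustrate how the score file shown above is laid out, the following sketch parses it and scores two equal-length sequences position by position. This is not KegAlign or LASTZ code: the names `parse_scores` and `ungapped_score` are hypothetical, gaps are not modeled, and the special handling of `bad_score` for `X` is omitted.

```python
SCORE_FILE = """\
bad_score          = X:-1000  # used for sub['X'][*] and sub[*]['X']
fill_score         = -100     # used when sub[*][*] is not defined
gap_open_penalty   =  400
gap_extend_penalty =   30

     A     C     G     T
A   91  -114   -31  -123
C -114   100  -125   -31
G  -31  -125   100  -114
T -123   -31  -114    91
"""


def parse_scores(text):
    """Split a score file into its 'key = value' settings and its matrix."""
    settings, matrix, columns = {}, {}, None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            settings[key] = value
        elif columns is None:
            columns = line.split()  # header row: A C G T
        else:
            row, *values = line.split()
            matrix[row] = dict(zip(columns, map(int, values)))
    return settings, matrix


def ungapped_score(matrix, a, b, fill_score):
    """Sum substitution scores over aligned positions (no gaps)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(matrix.get(x, {}).get(y, fill_score) for x, y in zip(a, b))


settings, matrix = parse_scores(SCORE_FILE)
fill = int(settings["fill_score"])
print(ungapped_score(matrix, "ACGT", "ACGT", fill))  # 91 + 100 + 100 + 91 = 382
print(ungapped_score(matrix, "ACGT", "AGGT", fill))  # 91 - 125 + 100 + 91 = 157
```

A single C/G mismatch lowers the score of this four-base alignment from 382 to 157, which shows how strongly HOXD70 penalizes transversions relative to the reward for a match.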

<a name="output"></a> Output Options

The default output is a MAF alignment file. Other formats can be selected with the --format parameter. See the LASTZ manual for a description of the available formats.

<a name="cite_kegalign"></a> Citing KegAlign

B Gulhan, R Burhans, R Harris, M Kandemir, M Haeussler, A Nekrutenko. KegAlign: Optimizing pairwise alignments with diagonal partitioning. bioRxiv, 2024. doi: 10.1101/2024.09.02.610839