Kalign
Kalign is a fast multiple sequence alignment program for biological sequences.
Installation
Release Tarball
Download the tarball from the releases page, then:
tar -zxvf kalign-<version>.tar.gz
cd kalign-<version>
mkdir build
cd build
cmake ..
make
make test
make install
On macOS, install Homebrew (brew), then:
brew install cmake
git clone https://github.com/TimoLassmann/kalign.git
cd kalign
mkdir build
cd build
cmake ..
make
make test
make install
Usage
The command line interface of Kalign accepts the following options:
Usage: kalign -i <seq file> -o <out aln>
Options:
--format : Output format. [Fasta]
--type : Alignment type (rna, dna, internal). [rna]
Options: protein, divergent (protein)
rna, dna, internal (nuc).
--gpo : Gap open penalty. []
--gpe : Gap extension penalty. []
--tgpe : Terminal gap extension penalty. []
-n/--nthreads : Number of threads. [4]
--version (-V/-v) : Prints version. [NA]
Kalign expects the input to be a set of unaligned sequences in FASTA format, or aligned sequences in aligned FASTA, MSF or Clustal format. If the sequences are already aligned, Kalign removes all gap characters and re-aligns the sequences.
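As a minimal sketch of what an unaligned FASTA input looks like, the snippet below writes three made-up DNA sequences to a temporary file and aligns them only if kalign is found on the PATH. The file names and sequences are illustrative assumptions; only the -i and -o options come from the usage line above.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
input="$workdir/example_input.fa"     # hypothetical file name
output="$workdir/example_output.afa"  # hypothetical file name

# Each FASTA record is a '>' header line followed by the sequence.
# These three sequences are made up for illustration.
cat > "$input" <<'EOF'
>seq1
ACGTACGTACGTTAGC
>seq2
ACGTACGACGTTAGC
>seq3
ACGTTCGTACGTAGC
EOF

# Count the records by counting header lines.
n_seqs=$(grep -c '^>' "$input")
echo "wrote $n_seqs sequences to $input"

# Run the alignment only when kalign is installed.
if command -v kalign >/dev/null 2>&1; then
    kalign -i "$input" -o "$output"
else
    echo "kalign not found on PATH; skipping alignment"
fi
```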
By default, Kalign automatically detects whether the input sequences are protein or DNA and selects appropriate alignment parameters.
The --type option gives users more direct control over the alignment parameters. Currently there are five core options:
- protein: uses the CorBLOSUM66_13plus substitution matrix (default for protein sequences)
- divergent: uses the Gonnet 250 substitution matrix
- dna: default DNA parameters:
  - 5 match score
  - -4 mismatch score
  - -8 gap open penalty
  - -6 gap extension penalty
  - 0 terminal gap extension penalty
- internal: same as dna, but with terminal gaps set to 8 to encourage gaps within the sequences
- rna: parameters optimised for RNA alignments
The --gpo, --gpe and --tgpe options can be used to further fine-tune the parameters.
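One way to see what the --type presets do is to align the same nucleotide input under each of them and compare the results. The loop below is a sketch under assumptions: input.fa and the output file names are hypothetical, and the commands are only printed when kalign or the input file is missing.

```shell
# The three nucleotide presets described above.
types="dna internal rna"
input="input.fa"   # hypothetical input file with unaligned nucleotide sequences

n=0
for t in $types; do
    out="aligned_${t}.afa"   # hypothetical output name, one file per preset
    if command -v kalign >/dev/null 2>&1 && [ -f "$input" ]; then
        kalign -i "$input" --type "$t" -o "$out"
    else
        # Dry run: show the command that would be executed.
        echo "would run: kalign -i $input --type $t -o $out"
    fi
    n=$((n + 1))
done
echo "prepared $n alignment runs"
```

The resulting files can then be compared with a tool such as diff to see how the preset changes gap placement.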
Examples
Passing sequences via stdin:
cat input.fa | kalign -f fasta > out.afa
Combining multiple input files:
kalign seqsA.fa seqsB.fa seqsC.fa -f fasta > combined.afa
Align sequences and output the alignment in MSF format:
kalign -i BB11001.tfa -f msf -o out.msf
Align sequences and output the alignment in clustal format:
kalign -i BB11001.tfa -f clu -o out.clu
Re-align sequences in an existing alignment:
kalign -i BB11001.msf -o out.afa
Reformat existing alignment:
kalign -i BB11001.msf -r afa -o out.afa
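The single-file commands above extend naturally to a batch run. The following sketch aligns every .fa file in a directory and writes one MSF file per input. The directory layout, file names and sequences are made up for illustration, and kalign is only invoked if it is installed.

```shell
# Set up a throwaway input directory with two tiny, made-up FASTA files.
in_dir=$(mktemp -d)
out_dir="$in_dir/aligned"
mkdir -p "$out_dir"
printf '>a1\nACGTACGT\n>a2\nACGTTCGT\n' > "$in_dir/setA.fa"
printf '>b1\nGGCATTAC\n>b2\nGGCTTAC\n'  > "$in_dir/setB.fa"

count=0
for f in "$in_dir"/*.fa; do
    [ -e "$f" ] || continue            # skip the literal pattern if nothing matches
    base=$(basename "$f" .fa)          # e.g. setA
    if command -v kalign >/dev/null 2>&1; then
        kalign -i "$f" -f msf -o "$out_dir/$base.msf"
    else
        echo "would run: kalign -i $f -f msf -o $out_dir/$base.msf"
    fi
    count=$((count + 1))
done
echo "processed $count file(s)"
```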
Kalign library
To incorporate Kalign into your own projects you can link to the library like this:
find_package(kalign)
target_link_libraries(<target> kalign::kalign)
Alternatively, you can include the kalign code directly in your project and link with:
if (NOT TARGET kalign)
add_subdirectory(<path_to_kalign>/kalign EXCLUDE_FROM_ALL)
endif ()
target_link_libraries(<target> kalign::kalign)
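To show where the find_package variant fits in a complete build file, the sketch below writes a minimal CMakeLists.txt for a consumer project. The project name my_tool, the source file main.c and the minimum CMake version are assumptions chosen for illustration; only find_package(kalign) and the kalign::kalign target come from the text above.

```shell
# Create a throwaway project directory for the example.
proj_dir=$(mktemp -d)

# Write a minimal consumer CMakeLists.txt.
# my_tool, main.c and the CMake version are hypothetical placeholders.
cat > "$proj_dir/CMakeLists.txt" <<'EOF'
cmake_minimum_required(VERSION 3.18)
project(my_tool C)

# Locate an installed Kalign; REQUIRED stops configuration early if it is missing.
find_package(kalign REQUIRED)

add_executable(my_tool main.c)
target_link_libraries(my_tool PRIVATE kalign::kalign)
EOF

echo "wrote $proj_dir/CMakeLists.txt"
```

Using REQUIRED is a design choice rather than something the library mandates: it reports a missing installation at configure time instead of at link time.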
Benchmark results
Here are some benchmark results. The code to reproduce these figures can be found here.
Balibase
Bralibase
Please cite:
- Lassmann, Timo. Kalign 3: multiple sequence alignment of large data sets. Bioinformatics (2019).
Other papers:
- Lassmann, Timo, Oliver Frings, and Erik LL Sonnhammer. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Research 37.3 (2008): 858-865.
- Lassmann, Timo, and Erik LL Sonnhammer. Kalign: an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6.1 (2005): 298.