Home

Awesome

GraphAligner

Seed-and-extend program for aligning long error-prone reads to genome graphs. To cite, see https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02157-2. For a description of the bitvector alignment extension algorithm, see https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz162/5372677

Installation

Install via bioconda:

Compilation

Bioconda is the recommended installation method. If you however want to compile GraphAligner yourself, run these:

Note that miniconda is only required during compilation and not during runtime. After compilation you can run the binary without the miniconda environment or copy the binary elsewhere.

If you want to compile without miniconda, you will need to install boost, protobuf and protoc, sdsl, jemalloc and sparsehash.

Running

Quickstart: GraphAligner -g test/graph.gfa -f test/read.fa -a test/aln.gaf -x vg

See Parameters, the option GraphAligner --help and the subsections below for more information and options

Test example

The command above outputs the following alignment in test/aln.gaf:

read 71 0 71 + >1>2>4 87 3 73 67 72 60 NM:i:5 AS:f:56.3 dv:f:0.0694444 id:f:0.930556 cg:Z:4=1X2=1I38=1D5=1I5=1X13=

which aligned the read to the nodes 1,2,4 with an identity of 93%. See GAF format for more information about the output format. Alternatively try -a aln.gam for output compatible with vg.

The parameter -x vg uses a parameter preset for aligning reads to a variation graph. Other options are -x dbg for aligning to a de Bruijn graph.

File formats

GraphAligner's file formats are interoperable with vg's file formats. Graphs can be inputed either in .gfa format or .vg format. Reads are inputed as .fasta or .fastq, either gzipped or uncompressed. Alignments are outputed in GAF format or vg's alignment format, either as a binary .gam or JSON depending on the file name. Custom seeds can be inputed in .gam format.

Seed hits

GraphAligner has three built-in methods for finding seed hits: minimizers (default), maximal unique matches (MUMs) and maximal exact matches (MEMs). Only matches entirely within a node are found. Minimizers (default) are faster and MUM/MEMs can be more sensitive. MUM/MEM modes use MEMfinder to find matches between the read and nodes. Use the parameter --seeds-mum-count n to use the n longest MUMs as seeds (or -1 for all MUMs), and --seeds-mem-count n for the n longest MEMs (or -1 for all MEMs). Use --seeds-mxm-length n to only use matches at least n characters long. If you are aligning multiple files to the same graph, use --seeds-mxm-cache-prefix file_name_prefix to store the MUM/MEM index to disk for reuse instead of rebuilding it each time.

Alternatively you can use any method to find seed hits and then import the seeds in .gam format with the parameter -s seedfile.gam. The seeds must be passed as an alignment message, with path.mapping[0].position describing the position in the graph, name the name of the read and query_position the position in the forward strand of the read. Match length (path.mapping[0].edit[0].from_length) is only used to order the seeds, with longer matches tried before shorter matches.

Alternatively you can use the parameter --seeds-first-full-rows to use the dynamic programming alignment algorithm on the entire first row instead of using seeded alignment. This is very slow except on tiny graphs, and not recommended.

Extension

GraphAligner uses a bitvector banded DP alignment algorithm to extend the seed hits. The DP matrix is calculated inside a certain area (the band), which depends on the extension parameters. Note that "bandwidth" in graph alignment does NOT directly correspond to bandwidth in linear alignment. The bandwidth parameter describes the maximum allowed score difference between the minimum score in a row and a cell, with cells whose score is higher than that falling outside the band. Bandwidth higher than 35 is not recommended for complex graphs due to huge increases in runtime but might work for variation graphs.

The algorithm starts using the initial bandwidth. Should it detect that the alignment is incorrect, it will rewind and rerun with the ramp bandwidth parameter, aligning high-error parts of the read without slowing down alignment in low-error parts. The tangle effort parameter determines how much time GraphAligner spends inside complex cyclic subgraphs. If the size of the band grows beyond the tangle effort parameter, GraphAligner will use the current best alignment for the aligned prefix and move forward along the read. This might miss the optimal alignment.

Parameters

All parameters below are optional.

Seeding:

Extension: