Home

Awesome

WFA2-lib

1. INTRODUCTION

1.1 What is WFA?

The wavefront alignment (WFA) algorithm is an exact gap-affine algorithm that takes advantage of homologous regions between the sequences to accelerate the alignment process. Unlike to traditional dynamic programming algorithms that run in quadratic time, the WFA runs in time O(ns+s^2), proportional to the sequence length n and the alignment score s, using O(s^2) memory (or O(s) using the ultralow/BiWFA mode). Moreover, the WFA algorithm exhibits simple computational patterns that the modern compilers can automatically vectorize for different architectures without adapting the code. To intuitively illustrate why the WFA algorithm is so interesting, look at the following figure. The left panel shows the cells computed by a classical dynamic programming based algorithm (like Smith-Waterman or Needleman Wunsch). In contrast, the right panel shows the cells computed by the WFA algorithm to obtain the same result (i.e., the optimal alignment).

<p align = "center"> <img src = "img/wfa.vs.swg.png" width="750px"> </p>

1.2 What is WFA2-lib?

The WFA2 library implements the WFA algorithm for different distance metrics and alignment modes. It supports various distance functions: indel, edit, gap-linear, gap-affine, and dual-gap gap-affine distances. The library allows computing only the score or the complete alignment (i.e., CIGAR) (see Alignment Scope). Also, the WFA2 library supports computing end-to-end alignments (a.k.a. global-alignment) and ends-free alignments (including semi-global, glocal, and extension alignment) (see Alignment Span). In the case of long and noisy alignments, the library provides different low-memory modes that significantly reduce the memory usage of the naive WFA algorithm implementation. Beyond the exact-alignment modes, the WFA2 library implements heuristic modes that dramatically accelerate the alignment computation. Additionally, the library provides many other support functions to display and verify alignment results, control the overall memory usage, and more.

1.3 Getting started

Git clone and compile the library, tools, and examples (by default, use cmake for the library and benchmark build).

git clone https://github.com/smarco/WFA2-lib
cd WFA2-lib
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --verbose
ctest . --verbose

There are some flags that can be used. For instance:

cmake .. -DOPENMP=TRUE
cmake .. -DCMAKE_BUILD_TYPE=Release -DEXTRA_FLAGS="-ftree-vectorizer-verbose=5"

Alternatively, the simple Makefile build system can be used.

$> git clone https://github.com/smarco/WFA2-lib
$> cd WFA2-lib
$> make clean all

Also, it is possible to build WFA2-lib in a GNU Guix container, for more information see guix.scm.

1.4 Contents (where to go from here)

Section WFA2-lib features explores the most relevant options and features of the library. Then, the folder tools/ contains tools that can be used to execute and understand the WFA2 library capabilities. Additionally, the folder examples/ contains simple examples illustrating how to integrate the WFA2 code into any tool.

1.5 Important notes and clarifications

<a name="wfa2.programming"></a> 2. USING WFA2-LIB IN YOUR PROJECT

<a name="wfa2.programming.c"></a> 2.1 Simple C example

This simple example illustrates how to align two sequences using the WFA2 library. First, include the WFA2 alignment headers.

#include "wavefront/wavefront_align.h"

Next, create and configure the WFA alignment object. The following example uses the defaults configuration and sets custom gap_affine penalties. Note that mismatch, gap-opening, and gap-extension must be positive values.

// Configure alignment attributes
wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
attributes.distance_metric = gap_affine;
attributes.affine_penalties.mismatch = 4;
attributes.affine_penalties.gap_opening = 6;
attributes.affine_penalties.gap_extension = 2;
// Initialize Wavefront Aligner
wavefront_aligner_t* const wf_aligner = wavefront_aligner_new(&attributes);

Finally, call the wavefront_align function.

char* pattern = "TCTTTACTCGCGCGTTGGAGAAATACAATAGT";
char* text    = "TCTATACTGCGCGTTTGGAGAAATAAAATAGT";
wavefront_align(wf_aligner,pattern,strlen(pattern),text,strlen(text)); // Align

Afterwards, we can use the library to display the alignment result (e.g., the alignment score and CIGAR).

// Display CIGAR & score
cigar_print_pretty(stderr,pattern,strlen(pattern),text,strlen(text),
                   &wf_aligner->cigar,wf_aligner->mm_allocator);
fprintf(stderr,"Alignment Score %d\n",wf_aligner->cigar.score);

At the end of the program, it is polite to release the memory used.

wavefront_aligner_delete(wf_aligner); // Free

To compile and run this example, you need to link against the WFA library (-lwfa).

$> gcc -O3 wfa_example.c -o wfa_example -lwfa
$> ./wfa_example

IMPORTANT. Once an alignment object is created, it is strongly recommended to reuse it to compute multiple alignments. Creating and destroying the alignment object for every alignment computed can have a significant overhead. Reusing the alignment object allows repurposing internal data structures, minimising the cost of memory allocations, and avoiding multiple alignment setups and precomputations.

<a name="wfa2.programming.cpp"></a> 2.2 Simple C++ example

The WFA2 library can be used from C++ code using the C++ bindings. This example is similar to the previous one but uses C++ bindings. First, include the C++ bindings and remember to use the WFA namespace.

#include "bindings/cpp/WFAligner.hpp"
using namespace wfa;

Configure and create the WFA alignment object. In this case, gap-affine distance using custom penalties and the standard memory-usage algorithm (i.e., standard WFA algorithm).

// Create a WFAligner
WFAlignerGapAffine aligner(4,6,2,WFAligner::Alignment,WFAligner::MemoryHigh);

Align two sequences (in this case, given as strings).

string pattern = "TCTTTACTCGCGCGTTGGAGAAATACAATAGT";
string text    = "TCTATACTGCGCGTTTGGAGAAATAAAATAGT";
aligner.alignEnd2End(pattern,text); // Align

Display the result of the alignment.

// Display CIGAR & score
string cigar = aligner.getAlignmentCigar();
cout << "CIGAR: " << cigar  << endl;
cout << "Alignment score " << aligner.getAlignmentScore() << endl;

IMPORTANT. Once an alignment object is created, it is strongly recommended to reuse it to compute multiple alignments. Creating and destroying the alignment object for every alignment computed can have a significant overhead. Reusing the alignment object allows repurposing internal data structures, minimising the cost of memory allocations, and avoiding multiple alignment setups and precomputations.

<a name="wfa2.programming.rust"></a> 2.3 Rust bindings

Rust bindings can be generated automatically using bindgen, see bindings/rust/build.rs. An example of how to use them is here.

<a name="wfa2.features"></a> 3. WFA2-LIB FEATURES

<a name="wfa2.distances"></a> 3.1 Distance Metrics

The WFA2 library implements the wavefront algorithm for the most widely used distance metrics. The practical alignment time can change depending on the distance function, although the computational complexity always remains proportional to the alignment score or distance. The WFA2 library offers the following distance metrics or functions:

    PATTERN    A-GCTA-GTGTC--AATGGCTACT-T-T-TCAGGTCCT
               |  ||| |||||    |||||||| | | |||||||||
    TEXT       AA-CTAAGTGTCGG--TGGCTACTATATATCAGGTCCT
    ALIGNMENT  1M1I1D3M1I5M2I2D8M1I1M1I1M1I9M
    // Configuration
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.distance_metric = indel;
    PATTERN    AGCTA-GTGTCAATGGCTACT-T-T-TCAGGTCCT
               | ||| |||||  |||||||| | | |||||||||
    TEXT       AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
    ALIGNMENT  1M1X3M1I5M2X8M1I1M1I1M1I9M
    // Configuration
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.distance_metric = edit;
    PATTERN    A-GCTA-GTGTC--AATGGCTACT-T-T-TCAGGTCCT
               |  ||| |||||    |||||||| | | |||||||||
    TEXT       AA-CTAAGTGTCGG--TGGCTACTATATATCAGGTCCT
    ALIGNMENT  1M1I1D3M1I5M2I2D8M1I1M1I1M1I9M
    // Configuration
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.distance_metric = gap_linear;
    attributes.linear_penalties.mismatch = 6; // X > 0
    attributes.linear_penalties.indel = 2;    // I > 0
    PATTERN    AGCTA-GTGTCAATGGCTACT---TTTCAGGTCCT
               | ||| |||||  ||||||||   | |||||||||
    TEXT       AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
    ALIGNMENT  1M1X3M1I5M2X8M3I1M1X9M
    // Configuration
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.distance_metric = gap_affine;
    attributes.affine_penalties.mismatch = 6;      // X > 0
    attributes.affine_penalties.gap_opening = 4;   // O >= 0
    attributes.affine_penalties.gap_extension = 2; // E > 0
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.distance_metric = gap_affine_2p;
    attributes.affine2p_penalties.mismatch = 6;       // X > 0
    attributes.affine2p_penalties.gap_opening1 = 4;   // O1 >= 0
    attributes.affine2p_penalties.gap_extension1 = 2; // E1 > 0
    attributes.affine2p_penalties.gap_opening2 = 12;  // O2 >= 0
    attributes.affine2p_penalties.gap_extension2 = 1; // E2 > 0

<a name="wfa2.scope"></a> 3.2 Alignment Scope

Depending on the use case, it is often the case that an application is only required to compute the alignment score, not the complete alignment (i.e., CIGAR). As it happens with traditional dynamic programming algorithms, the WFA algorithm requires less memory (i.e., O(s)) to compute the alignment score. In turn, this results in slighter faster alignment executions. For this reason, the WFA2 library implements two different modes depending on the alignment scope: score-only and full-CIGAR alignment.

The ** score-only alignment ** mode computes only the alignment score. This mode utilizes only the front-wavefronts of the WFA algorithm to keep track of the optimal alignment score. As a result, it requires O(s) memory and, in practice, performs slighter faster than the standard full-CIGAR mode.

    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_scope = compute_score;

The ** full-CIGAR alignment ** computes the sequence of alignment operations (i.e., {'M','X','D','I'}) that transforms one sequence into the other (i.e., alignment CIGAR). The alignment score can be obtained as a by-product of the alignment process, evaluating the score of the alignment CIGAR. This mode requires O(s^2) memory (using the default memory mode, wavefront_memory_high) or less (using the low-memory modes).

    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_scope = compute_alignment;

<a name="wfa2.span"></a> 3.3 Alignment Span

The WFA2 library allows computing alignments with different spans or shapes. Although there is certain ambiguity and confusion in the terminology, we have tried to generalize the different options available to offer flexible parameters that can capture multiple alignment scenarios. During the development of the WFA we decided to adhere to the classical approximate string matching terminology where we align a pattern (a.k.a. query or sequence) against a text (a.k.a. target, database, or reference).

    PATTERN    AATTAATTTAAGTCTAGGCTACTTTCGGTACTTTGTTCTT
               ||||    ||||||||||||||||||||||||||   |||
    TEXT       AATT----TAAGTCTAGGCTACTTTCGGTACTTT---CTT
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_end2end;
    PATTERN    AATTAATTTAAGTCTAGGCTACTTTCGGTACTTTGTTCTT
                   |||||||||||||||||||||||||||||| ||
    TEXT       ----AATTTAAGTCTAGGCTACTTTCGGTACTTTCTT---
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = pattern_begin_free;
    attributes.alignment_form.pattern_end_free = pattern_end_free;
    attributes.alignment_form.text_begin_free = text_begin_free;
    attributes.alignment_form.text_end_free = text_end_free;
<details><summary>Glocal alignment (a.k.a. semi-global or fitting)</summary> <p>
    PATTERN    -------------AATTTAAGTCTAGGCTACTTTC---------------
                            ||||||||| ||||||||||||
    TEXT       ACGACTACTACGAAATTTAAGTATAGGCTACTTTCCGTACGTACGTACGT
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = 0;
    attributes.alignment_form.pattern_end_free = 0;
    attributes.alignment_form.text_begin_free = text_begin_free;
    attributes.alignment_form.text_end_free = text_end_free;
</p> </details> <details><summary>Extension alignment</summary> <p>
    // Right extension
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = 0;
    attributes.alignment_form.pattern_end_free = pattern_end_free;
    attributes.alignment_form.text_begin_free = 0;
    attributes.alignment_form.text_end_free = text_end_free;

    PATTERN    AATTTAAGTCTG-CTACTTTCACGCA-GCT----------
               ||||| |||||| ||||||||||| | | |
    TEXT       AATTTCAGTCTGGCTACTTTCACGTACGATGACAGACTCT
    // Left extension
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = pattern_begin_free;
    attributes.alignment_form.pattern_end_free = 0;
    attributes.alignment_form.text_begin_free = text_begin_free;
    attributes.alignment_form.text_end_free = 0;

    PATTERN    -------------AAACTTTCACGTACG-TGACAGTCTCT
                              ||||||||||||| |||||| ||||
    TEXT       AATTTCAGTCTGGCTACTTTCACGTACGATGACAGACTCT
</p> </details> <details><summary>Overlapped alignment</summary> <p>
    // Overlapped (Right-Left)
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = pattern_begin_free;
    attributes.alignment_form.pattern_end_free = 0;
    attributes.alignment_form.text_begin_free = 0;
    attributes.alignment_form.text_end_free = text_end_free;

    PATTERN    ACGCGTCTGACTGACTGACTAAACTTTCATGTAC-TGACA-----------------
                                   ||||||||| |||| |||||
    TEXT       --------------------AAACTTTCACGTACGTGACATATAGCGATCGATGACT
    // Overlapped (Left-Right)
    wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = 0;
    attributes.alignment_form.pattern_end_free = pattern_end_free;
    attributes.alignment_form.text_begin_free = text_begin_free;
    attributes.alignment_form.text_end_free = 0;

    PATTERN    ----------------------ACGCGTCTGACTGACTACGACTACGACTGACTAGCAT
                                     ||||||||| || ||
    TEXT       ACATGCATCGATCAGACTGACTACGCGTCTG-CTAAC----------------------
</p> </details>

<a name="wfa2.mem"></a> 3.4 Memory modes

The WFA2 library implements various memory modes: wavefront_memory_high, wavefront_memory_med, wavefront_memory_low, and , wavefront_memory_ultralow. These modes allow regulating the overall memory consumption at the expense of execution time. The standard WFA algorithm, which stores explicitly all wavefronts in memory, correspond to the mode wavefront_memory_high. The other methods progressively reduce memory usage at the expense of slightly larger alignment times. These memory modes can be used transparently with other alignment options and generate identical results. Note that this option does not affect the score-only alignment mode (it already uses a minimal memory footprint of O(s)).

  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.memory_mode = wavefront_memory_ultralow;

<a name="wfa2.heuristics"></a> 3.5 Heuristic modes

The WFA algorithm can be used combined with many heuristics to reduce the alignment time and memory used. As it happens to other alignment methods, heuristics can result in suboptimal solutions and loss of accuracy. Moreover, some heuristics may drop the alignment if the sequences exceed certain divergence thresholds (i.e., x-drop/z-drop). Due to the popularity and efficiency of these methods, the WFA2 library implements many of these heuristics. Note, it is not about how little DP-matrix you compute, but about how good/accurate the resulting alignments are.

WFA2's heuristics are classified into the following categories: 'wf-adaptive', 'drops', and 'bands'. It is possible to combine a maximum of one heuristic from each category (OR-ing the strategy values or using the API). In the case of using multiple heuristics, these will applied in cascade, starting with 'wf-adaptive', then 'drops', and finally 'bands'.

<p align="center"> <table> <tr> <td><p align="center">Full-WFA</p></td> </tr> <tr> <td><img src="img/heuristics.none.png" align="center" width="400px"></td> </tr> </table> </p>
  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_none;
  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_wfadaptive;
  attributes.heuristic.min_wavefront_length = 10;
  attributes.heuristic.max_distance_threshold = 50;
  attributes.heuristic.steps_between_cutoffs = 1;

        Graphical examples:

<p align="center"> <table> <tr> <td><p align="center">Adaptive-WF(10,50)</p></td> <td><p align="center">Adaptive-WF(10,50,10)</p></td> </tr> <tr> <td><img src="img/heuristics.wfadap.10.50.1.png" align="center" width="400px"></td> <td><img src="img/heuristics.wfadap.10.50.10.png" align="center" width="400px"></td> </tr> </table> </p>

        X-drop implements the classical X-drop heuristic. For each diagonal $k$, the X-drop heuristic compares the current score $sw_k$ with the maximum observed score so far $sw_{max}$. If the difference drops more than the $xdrop$ parameter (i.e., $sw_{max} - sw_k > xdrop$), the heuristic prunes the diagonal $k$ as it is unlikely to lead to the optimum alignment. If all the diagonals are pruned under this criteria, the alignment process is abandoned.

  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_xdrop;
  attributes.heuristic.xdrop = 100;
  attributes.heuristic.steps_between_cutoffs = 100;

        Z-drop implements the Z-drop heuristic (as described in Minimap2). This heuristic halts the alignment process if the score drops too fast in the diagonal direction. Let $sw_{max}$ be the maximum observed score so far, computed at cell $(i',j')$. Then, let $sw$ be the maximum score found in the last computed wavefront, computed at cell $(i,j)$. The Z-drop heuristic stops the alignment process if $sw_{max} - sw > zdrop + gap_e·|(i-i')-(j-j')|$, being $gap_e$ the gap-extension penalty and $zdrop$ a parameter of the heuristic.

  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_zdrop;
  attributes.heuristic.zdrop = 100;
  attributes.heuristic.steps_between_cutoffs = 100;

        Graphical examples:

<p align="center"> <table> <tr> <td><p align="center">None</p></td> <td><p align="center">X-drop(200,1)</p></td> <td><p align="center">Z-drop(200,1)</p></td> </tr> <tr> <td><img src="img/heuristics.drop.none.png" align="center" width="300px"></td> <td><img src="img/heuristics.xdrop.200.png" align="center" width="300px"></td> <td><img src="img/heuristics.zdrop.200.png" align="center" width="300px"></td> </tr> </table> </p>

        Static-band sets a fixed band in the diagonals preventing the wavefront from growing beyond those limits. It allows setting the minimum diagonal (i.e., min_k) and maximum diagonal (i.e., max_k).

  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_banded_static;
  attributes.heuristic.min_k = -10;
  attributes.heuristic.max_k = +10;

        Adaptive-band is similar to the static-band heuristic; however, it allows the band to move towards the diagonals closer to the end of the alignment. Unlike the static-band that is performed on each step, the adaptive-band heuristics allows configuring the number of steps between heuristic band cut-offs.

  wavefront_aligner_attr_t attributes = wavefront_aligner_attr_default;
  attributes.heuristic.strategy = wf_heuristic_banded_adaptive;
  attributes.heuristic.min_k = -10;
  attributes.heuristic.max_k = +10;
  attributes.heuristic.steps_between_cutoffs = 1;

        Graphical examples:

<p align="center"> <table> <tr> <td><p align="center">Banded(10,10)</p></td> <td><p align="center">Banded(10,150)</p></td> </tr> <tr> <td><img src="img/heuristics.band.10.10.png" align="center" width="400px"></td> <td><img src="img/heuristics.band.10.150.png" align="center" width="400px"></td> </tr> <tr> <td><p align="center">Adaptive-Band(10,10,1)</p></td> <td><p align="center">Adaptive-Band(50,50,1)</p></td> </tr> <tr> <td><img src="img/heuristics.aband.10.10.png" align="center" width="400px"></td> <td><img src="img/heuristics.aband.50.50.png" align="center" width="400px"></td> </tr> </table> </p>

<a name="wfa2.other.notes"></a> 3.6 Some technical notes

<a name="wfa2.complains"></a> 4. REPORTING BUGS AND FEATURE REQUEST

Feedback and bug reporting is highly appreciated. Please report any issue or suggestion on github or email to the main developer (santiagomsola@gmail.com). Don't hesitate to contact us if:

<a name="wfa2.licence"></a> 5. LICENSE

WFA2-lib is distributed under MIT licence.

<a name="wfa2.complains"></a> 6. AUTHORS

Santiago Marco-Sola (santiagomsola@gmail.com) is the main developer and the person you should address your complaints.

Andrea Guarracino and Erik Garrison have contributed to the design of new features and intensive testing of the library.

Pjotr Prins contributed the CMake build system, preventing of leaking variables in include headers and other tweaks.

Miquel Moretó has contributed with fruitful technical discussions and tireless efforts seeking funding, so we could keep working on this project.

<a name="wfa2.complains"></a> 7. ACKNOWLEDGEMENTS

<a name="wfa2.cite"></a> 8. CITATION

Santiago Marco-Sola, Juan Carlos Moure, Miquel Moreto, Antonio Espinosa. "Fast gap-affine pairwise alignment using the wavefront algorithm.". Bioinformatics, 2020.

Santiago Marco-Sola, Jordan M Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, Miquel Moreto. "Optimal gap-affine alignment in O(s) space". Bioinformatics, 2023.