Home

Awesome

GBWTGraph

GBWTGraph is a handle graph based on the GBWT. Its data model is based on the graph as an alignment of haplotypes. The gfa2gbwt tool can be used for converting between a subset of GFA1, the GBZ file format, and GBWTGraph.

See the wiki for further documentation.

There is also a partial Rust implementation of the GBWTGraph.

Overview

The GBWTGraph represents the graph induced by the haplotypes stored in a GBWT index. It uses the GBWT index for graph topology and stores the node sequences in plain form for fast extraction. Construction extracts the sequences from another graph implementing handlegraph::HandleGraph or from gbwtgraph::SequenceSource.

GBWTGraph supports the following libhandlegraph interfaces:

Compared to other handle graph implementations, sequence access is very fast, while graph navigation may be slower. There are also some additional operations:

Accessing and decompressing GBWT node records is somewhat slow. Algorithms that repeatedly access the edges in a small subgraph may create a CachedGBWT cache using get_cache() and pass it explicitly to the relevant queries. Alternatively, they can create a CachedGBWTGraph overlay graph that uses a cache automatically. Both types of caches store all accessed records, so a new cache should be created for each subgraph.

GBWTGraph also supports an experimental SegmentHandleGraph interface with GFA-like semantics. Each GFA segment with a string name maps to a range of node ids, and GFA links correspond to edges that connect the ends of segments. This interface is currently only available in graphs built using SequenceSource.

The package also includes:

Construction from GFA

The gfa2gbwt tool can be used for building GBWTGraph from GFA1, for extracting GFA from the graph, and for converting between plain and compressed representations of GBWTGraph. The tool interprets the GFA file in the following way:

In the plain representation, the GBWT index and the GBWTGraph are stored in separate .gbwt and .gg files. The compressed representation uses a single .gbz file, with the graph stored more space-efficiently than the in-memory representation.

Dependencies

These dependencies should be installed separately (the latest master should always work). Because libhandlegraph and SDSL are header-based libraries, having multiple versions of them in the same project may cause issues. Hence all submodules of the main project should use the same copies of these libraries.

All dependencies should be installed before compiling GBWTGraph. By default, libhandlegraph installs to system directories, while GBWT and SDSL install to the user's home directory. Dependencies not installed in system directories should use the same install prefix as SDSL.

Compiling GBWTGraph

GBWTGraph uses C++14 and OpenMP. At the moment, it compiles with g++ (version 6.1 or newer should be enough) on both Mac and Linux. Apple Clang will also work on Mac, but you must install libomp separately from Macports or Homebrew. Some algorithms are slower when compiled with Clang, because there is no multithreaded std::sort.

GBWTGraph is frequently tested in the following environments:

Like GBWT, GBWTGraph takes its compiler options from SDSL. For this purpose, you must set SDSL_DIR in the makefile to your SDSL main directory. The default value is ../sdsl-lite, which is usually appropriate. The makefile will read $SDSL_DIR/Make.helper to determine compilers and compiler options.

After that, make will compile the library, while install.sh will compile and install the headers and the library to your home directory. Another install directory can be specified with install.sh prefix.

Citing GBWTGraph

Jouni Sirén, Jean Monlong, Xian Chang, Adam M. Novak, Jordan M. Eizenga, Charles Markello, Jonas A. Sibbesen, Glenn Hickey, Pi-Chuan Chang, Andrew Carroll, Namrata Gupta, Stacey Gabriel, Thomas W. Blackwell, Aakrosh Ratan, Kent D. Taylor, Stephen S. Rich, Jerome I. Rotter, David Haussler, Erik Garrison, and Benedict Paten: Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374(6574):abg8871, 2021. DOI: 10.1126/science.abg8871

Citing GBZ File Format

Jouni Sirén and Benedict Paten: GBZ file format for pangenome graphs. Bioinformatics 38(22):5012-5018, 2022. DOI: 10.1093/bioinformatics/btac656