gfabase
gfabase is a command-line tool for indexed storage of Graphical Fragment Assembly (GFA1) data. It imports a .gfa file into a compressed .gfab file, from which it can later access subgraphs quickly (reading only the necessary parts), producing .gfa or .gfab. Beyond ID lookups, .gfab indexes the graph by mappings onto reference genome coordinates, facilitating navigation within de novo assemblies and pangenome reference graphs.
Effectively, .gfab is a new GFA-superset format with built-in compression and indexing. It is in fact a SQLite (+ Genomics Extension) database populated with a GFA1-like schema, which programmers can access directly, without requiring gfabase or even a low-level parser for .gfa/.gfab.
Quick start
Each Release includes prebuilt gfabase executables for Linux and macOS x86-64 hosts. The executable provides subcommands:
- gfabase load -o my.gfab [my.gfa]: create .gfab from a .gfa file (or pipe decompression through standard input)
- gfabase view my.gfab: dump back to .gfa (if standard output is a terminal, automatically pipes to less -S)
- gfabase sub my.gfab SEGMENT/PATH/RANGE... [--view]: query for a subgraph, producing either .gfa or .gfab
- gfabase add-mappings my.gfab mappings.paf: add index of reference genome mappings for GFA segments
The following quick example accesses a scaffold by its Path name in a metaSPAdes assembly of simulated metagenomic reads from Ye <em>et al.</em> (2019); it also uses zstd for decompression.
curl -L "https://github.com/mlin/gfabase/blob/main/test/data/atcc_staggered.assembly_graph_with_scaffolds.gfa.zst?raw=true" \
| zstd -dc \
| ./gfabase load -o atcc_staggered.metaspades.gfab
# extract a scaffold from the metagenome assembly (by GFA Path name)
./gfabase sub atcc_staggered.metaspades.gfab -o a_scaffold.gfab --path NODE_2_length_747618_cov_15.708553_3
# view GFA:
./gfabase view a_scaffold.gfab
# or in one command:
./gfabase sub atcc_staggered.metaspades.gfab --view --path NODE_2_length_747618_cov_15.708553_3
The following in-depth notebooks demonstrate human genome uses, also integrating with Bandage for visualization:
<img width="500" alt="index" src="https://user-images.githubusercontent.com/356550/105319466-fd571080-5b68-11eb-9422-a0b3b01c7056.png">
Segment mappings
Adding --range to gfabase sub means the other command-line arguments are linear sequence ranges (chr1:234-567) to be resolved to overlapping segments. This relies on mappings of each segment to its own linear coordinates, which gfabase load understands in two forms:
- The rGFA tags SN:Z and SO:i are present and the segment sequence length is known (from given sequence or LN:i)
- Segment tag rr:Z giving a browser-style range like rr:Z:chr1:2,345-6,789
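As a rough illustration of the first form, the sketch below derives a linear range from a segment's rGFA tags using awk. The helper name segment_ranges and the two GFA records are invented for this example, and it assumes SO:i is a zero-based offset (so the one-based range starts at SO+1); it is not gfabase's own code.

```shell
# Hypothetical helper (not part of gfabase): print "segment<TAB>SN:start-end"
# for each GFA S line carrying rGFA tags. Assumes SO:i is a zero-based offset.
segment_ranges() {
  awk -F'\t' '
    $1 == "S" {
      sn = ""; so = ""; len = ""
      if ($3 != "*") len = length($3)      # length from the given sequence
      for (i = 4; i <= NF; i++) {
        if ($i ~ /^SN:Z:/) sn = substr($i, 6)
        else if ($i ~ /^SO:i:/) so = substr($i, 6)
        else if ($i ~ /^LN:i:/ && len == "") len = substr($i, 6)  # fallback
      }
      # emit a range only when name, offset, and length are all known
      if (sn != "" && so != "" && len != "")
        printf "%s\t%s:%d-%d\n", $2, sn, so + 1, so + len
    }
  ' "$@"
}

# Made-up records: one with a sequence, one with only LN:i.
rgfa_dir=$(mktemp -d)
printf 'S\ts1\tACGTACGT\tSN:Z:chr1\tSO:i:100\n'   > "$rgfa_dir/example.gfa"
printf 'S\ts2\t*\tLN:i:50\tSN:Z:chr1\tSO:i:108\n' >> "$rgfa_dir/example.gfa"
segment_ranges "$rgfa_dir/example.gfa"
```

Segments lacking any of the three pieces of information are skipped, mirroring the requirement that the sequence length be known.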
Furthermore, gfabase add-mappings my.gfab mappings.paf adds mappings of segment sequences generated by minimap2 or a similar tool producing PAF format. The .gfab is updated in-place, so make a backup copy if needed.
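One possible way to prepare input for such a mapping step is to write the segment sequences to FASTA and align them to a reference. The sketch below does only the FASTA conversion, with awk; the helper name gfa_segments_to_fasta and the file names are placeholders, and the commented minimap2 invocation (including its preset) is an assumption to adapt rather than a documented gfabase workflow.

```shell
# Hypothetical helper: convert GFA S lines with sequences into FASTA records
# named by segment ID (segments whose sequence is "*" are skipped).
gfa_segments_to_fasta() {
  awk -F'\t' '$1 == "S" && $3 != "*" { printf ">%s\n%s\n", $2, $3 }' "$@"
}

# Made-up graph for demonstration: three segments, one without sequence.
paf_dir=$(mktemp -d)
printf 'S\t1\tACGT\nS\t2\t*\tLN:i:4\nS\t3\tGGCC\nL\t1\t+\t3\t+\t0M\n' \
  > "$paf_dir/my.gfa"
gfa_segments_to_fasta "$paf_dir/my.gfa" > "$paf_dir/segments.fa"

# Then, assuming minimap2 and a reference FASTA are available (not run here):
#   minimap2 -x asm20 reference.fa segments.fa > mappings.paf
#   cp my.gfab my.gfab.bak    # add-mappings updates the file in place
#   gfabase add-mappings my.gfab mappings.paf
```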
Connected subgraphs
Adding --connected to gfabase sub expands the subgraph to include the complete connected component(s) associated with the specified segments.
That may be overkill, if we're only interested in the segments' immediate neighborhood. In that case, instead set --cutpoints 1 to extract the associated biconnected component(s), stopping the subgraph expansion at cutpoints (segments that any end-to-end walk of the chromosome must traverse). Setting --cutpoints 2 or higher expands to more-distant cutpoints. The expansion can be modified to disregard cutpoint segments less than L nucleotides long by adding --cutpoints-nt L.
<sup>The --connected and --cutpoints expansions treat the segment graph as undirected. Therefore the extracted subgraphs include, but are not limited to, directed "superbubbles."</sup>
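To make the undirected expansion concrete, here is a minimal sketch of connected-component extraction over GFA link lines, written in awk. It is an illustration only, not gfabase's implementation (which works on the indexed .gfab); the helper name connected_component and the toy graph are invented for the example.

```shell
# Hypothetical helper (illustration only): print the IDs of all segments in
# the connected component containing the seed segment, treating GFA L (link)
# lines as undirected edges.
connected_component() {
  seed=$1; shift
  awk -F'\t' -v seed="$seed" '
    $1 == "L" {
      # record each link in both directions, ignoring orientation
      adj[$2] = adj[$2] " " $4
      adj[$4] = adj[$4] " " $2
    }
    END {
      # breadth-first search from the seed
      queue[1] = seed; head = 1; tail = 1; seen[seed] = 1
      while (head <= tail) {
        cur = queue[head++]
        print cur
        n = split(adj[cur], nbrs, " ")
        for (i = 1; i <= n; i++)
          if (!(nbrs[i] in seen)) { seen[nbrs[i]] = 1; queue[++tail] = nbrs[i] }
      }
    }
  ' "$@" | sort
}

# Made-up graph: a-b-c form one component, d-e another.
graph_dir=$(mktemp -d)
printf 'L\ta\t+\tb\t+\t0M\nL\tb\t+\tc\t-\t0M\nL\td\t+\te\t+\t0M\n' \
  > "$graph_dir/toy.gfa"
connected_component a "$graph_dir/toy.gfa"
```

Because link orientation is ignored, segment c is reached from a even though the b-to-c link enters c on its reverse strand. Stopping at cutpoints, as --cutpoints does, would need an additional articulation-point search that this sketch does not attempt.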
Web access
gfabase view and gfabase sub can read .gfab http/https URLs directly. The web server must support HTTP GET range requests, and the content must be immutable. This is mainly useful to query for a small subgraph, especially with --no-sequences. On the other hand, a series of queries expected to traverse a large fraction of the graph will be better-served by downloading the whole file upfront.
Here's an example invocation to inspect the subgraph surrounding the HLA locus in a Shasta ONT assembly, remotely accessing a .gfab served by GitHub. (See the above-linked notebooks for details about the flags given.)
./gfabase sub \
https://github.com/mlin/gfabase/releases/download/v0.5.0/shasta-HG002-Guppy-3.6.0-run4-UL.gfab \
--view --cutpoints 2 --no-sequences --guess-ranges --range \
chr6:29,700,000-29,950,000
To publish a .gfab on the web, it's helpful to first "defragment" the file using the genomicsqlite command-line tool made available by pip3 install genomicsqlite or conda install -c mlin genomicsqlite:
genomicsqlite my.gfab --compact --inner-page-KiB 64 --outer-page-KiB 2
...generating my.gfab.compact, a defragmented version that'll be more efficient to access. (mv my.gfab.compact my.gfab if so desired.)
Building from source
git clone https://github.com/mlin/gfabase.git
cd gfabase
./cargo build --release
Then find the executable target/release/gfabase.