Awesome

Cuttlefish

Cuttlefish is a fast, parallel, and very lightweight memory tool to construct the compacted de Bruijn graph from sequencing reads or reference sequences. It is highly scalable in terms of the size of the input data.

Overview
Dependencies
Installation
Usage
Output formats
- ''Colored'' output for Cuttlefish 1
Example usage
Larger k-mer sizes
Differences between Cuttlefish 1 & 2
Citations & Acknowledgement
Licenses

Overview

Cuttlefish is a program to produce the compacted de Bruijn graph from sequencing reads or reference sequences.

The papers describing the work are: Cuttlefish (original) and Cuttlefish 2.

Dependencies

Cuttlefish can be installed using Bioconda (check Installation). If installing from source, the following are required:

GCC or Clang compilers for C++17 and C11
CMake (version >= 3.14)
zlib
bzip2

These should already be available in your platform; and if not, then these can be easily installed from their sources. Besides, these should also be available via some package manager for your operating system:

Linux

sudo apt-get install build-essential cmake zlib1g-dev libbz2-dev

MacOS

brew install --with-toolchain llvm
brew install cmake zlib bzip2

Installation

From Bioconda:
```
conda install -c bioconda cuttlefish
```
The Conda package supports k values up-to 127. To use larger k values, please install Cuttlefish from the source.
From source:
```
git clone https://github.com/COMBINE-lab/cuttlefish.git
cd cuttlefish/
mkdir build && cd build/
cmake -DCMAKE_INSTALL_PREFIX=../ ..
make -j 8 install
cd ..
```
You may replace 8 in make -j 8 with the preferred count of threads to use in the installation process.

This installs Cuttlefish in a sub-directory named bin, inside the project root directory. To specify a different installation directory, its path may be passed as the value of -DCMAKE_INSTALL_PREFIX with the cmake command, i.e. you may use cmake -DCMAKE_INSTALL_PREFIX=<custom_path>/ .. . Then the installed Cuttlefish executable will be found in <custom_path>/bin/. Skipping -DCMAKE_INSTALL_PREFIX entirely will install Cuttlefish in /usr/local/bin/, for which sudo access might be required (i.e. sudo make -j 8 install).

This installation supports k values up-to 63. To ensure support for larger values, please compile the source with the slight modification described in Larger k-mer sizes.

Usage

cuttlefish build --help displays the following message (the default threads argument is machine-configuration specific):

Efficiently construct the compacted de Bruijn graph from sequencing reads or reference sequences
Usage:
  cuttlefish build [OPTION...]

 common options:
  -s, --seq arg            input files
  -l, --list arg           input file lists
  -d, --dir arg            input file directories
  -k, --kmer-len arg       k-mer length (default: 27)
  -t, --threads arg        number of threads to use (default: 22)
  -o, --output arg         output file
  -w, --work-dir arg       working directory (default: .)
  -m, --max-memory arg     soft maximum memory limit in GB (default: 3)
      --unrestrict-memory  do not impose memory usage restriction
  -h, --help               print usage

 cuttlefish_1 options:
  -f, --format arg  output format (0: FASTA, 1: GFA 1.0, 2: GFA 2.0, 3:
                    GFA-reduced)

 cuttlefish_2 options:
      --read        construct a compacted read de Bruijn graph (for FASTQ
                    input)
      --ref         construct a compacted reference de Bruijn graph (for
                    FASTA input)
  -c, --cutoff arg  frequency cutoff for (k + 1)-mers (default: refs: 1,
                    reads: 2)
      --path-cover  extract a maximal path cover of the de Bruijn graph

 debug options:
      --vertex-set arg  set of vertices, i.e. k-mers (KMC database) prefix
                        (default: "")
      --edge-set arg    set of edges, i.e. (k + 1)-mers (KMC database) prefix
                        (default: "")

 specialized options:
      --save-mph       save the minimal perfect hash (BBHash) over the vertex
                       set
      --save-buckets   save the DFA-states collection of the vertices
      --save-vertices  save the vertex set of the graph

It supports GNU style arguments, -- for long options, and - for short options. Long options opt taking a parameter can be written as --opt=parameter or as --opt parameter. Short options o taking a parameter is written as -o parameter.

The common arguments (for Cuttlefish 1 and 2) are set as following.

The input files can be passed in any of the following ways (and the options may be mixed together).
- -s <data files>
- -l <newline-separated list files of data files>
- -d <directories containing only the data files>
Multiple values for each option can be passed as --seq=s1,s2,..., --seq s1 --seq s2 ..., -s s1,s2 ..., or -s s1 -s s2 (similarly for list and dir).

In case of using sequencing reads as input, the files should be in the FASTQ format. For reference sequences, those should be in the FASTA format. The input files can also be gzipped.
The k-mer length k must be odd and within 127 (and 63 if installed from source; see Larger k-mer sizes to increase the k-mer size capacity beyond these). The default value is 27.
The number of threads t is set to a quarter of the number of concurrent threads supported, by default. The use of high-enough values is recommended.
Cuttlefish generates two output files:
- A FASTA / GFA1 / GFA2 file containing the maximal unitigs of the de Bruijn graph (with the extension .fa / .gfa1 / .gfa2). The GFA output formats are exclusive for Cuttlefish 1.
- A metadata file containing some structural characteristics of the de Bruijn graph and its compacted form (with the extension .json).
The working directory w is used for temporary files created by the process—it is not created by Cuttlefish, and must exist beforehand. The current directory is set as the default working directory.
A soft maximum memory-limit m (in GB) can be provided to trade-off the RAM usage for faster execution time; this will only be adhered to if the provided limit is at least the minimum required memory for Cuttlefish, determined internally.
Memory-usage restrictions can be lifted by using unrestrict-memory, trading off extra RAM usage for faster execution time.

Cuttlefish 1 specific arguments are set as following.

The output formats (f) are —
- 0: only the maximal unitig (non-branching path) fragments, in FASTA;
- 1: the maximal unitigs, their connectivities, and the input sequence tilings, in GFA 1.0;
- 2: the maximal unitigs, their connectivities, and the input sequence tilings, in GFA 2.0; and
- 3: the maximal unitigs and the input sequence tilings, in GFA-reduced (see I/O formats).

Cuttlefish 2 specific arguments are set as following.

read and ref are ''input type'' arguments, based on whether you are providing sequencing reads or reference sequences as input, respectively.
The frequency threshold c (of (k + 1)-mers) is set to 2 for read inputs, and 1 for reference inputs, by default.
path-cover is used to construct a maximal vertex-disjoint path cover of the de Bruijn graph, instead of its compacted variant.

Note

The edge- and / or the vertex-set generation step could produce a high number of temporary files in disk, up-to 2000. Failure to ensure the capability of opening this many files could produce error messages of the following form:

Error: Cannot open temporary file ./kmc_00000.bin

The concurrently open file-handle limit for the user running the process can be raised with the following command:

ulimit -n 2048

Output formats

Cuttlefish 2 output

The currently supported output format is

The set of the maximal unitigs (non-branching paths) of the de Bruijn graph, in FASTA

Other output formats are currently in the development roadmap.

Cuttlefish 1 output

The currently supported output formats are —

The set of the maximal unitigs (non-branching paths) of the de Bruijn graph, in FASTA
The compacted de Bruijn graph in the GFA 1.0 and the GFA 2.0 formats
The compacted de Bruijn graph in a ''reduced'' GFA format. It consists of two files, with the extensions: .cf_seg and .cf_seq:
- The .cf_seg file contains all the maximal unitig fragments of the graph (the segment outputs from GFA, i.e. the S-tagged entries), each one with a unique id. This file is a list of pairs <id segment>.
- The .cf_seq file contains the ''tiling'' of each input sequence, made by the maximal unitig fragments (the paths in GFA 1 / ordered groups in GFA 2, i.e. the P- / O-tagged entries). Each line of the file is of the format <id tiling>, where id is a unique identifier (name) of this sequence, and tiling is a space-separated list of the unitig ids, completely covering the sequence. Each unitig id also has a + or - sign following it, depending on whether the corresponding unitig is present in the canonical or the reverse-complemented form in this tiling order.
For the example reference file refs1.fa (provided in the data directory), the output files may look like the following.
<table> <tr><th>Reference file</th></tr> <td> >ref1
CGACATGTCTTAG
</td> </table> <table> <tr><th>Segments file</th><th>Sequence tilings file</th></tr> <tr> <td>
1 CGA
3 ATGTC
6 CTAAGA
</td> <td>
Reference:1_Sequence:ref1 1+ 3- 3+ 6-
</td></tr> </table>
The only GFA information missing explictly in this format is the links (GFA 1) / edges and gaps (GFA 2), i.e. the L- or the E- and the G-tagged entries. These can be readily inferred from the sequence-tilings. For example, a tiling <code><seq_id u0 u1 ... un></code> corresponds to the edge and gap multi-set <code>{(u0, u1), (u1 u2), ... , (un-1, un)}</code>. Whether a pair <code>(ui, ui+1)</code> is an edge or a gap can be inferred by checking the suffix and the prefix (of length k - 1) of the unitigs <code>ui</code> and <code>ui+1</code>, respectively (in their correct orientations, based on their following +/- signs). Note that, a gap is possible in a sequence-tiling only if the sequence contains characters outside of A, C, G, and T.

For moderate to large sized genomes, this output format is preferrable to the GFA ones as the GFA formats can be quite verbose for this particular scenario, while the reduced representation provides effitively the same information, while taking much less space. For example, for the 7-human genome dataset (experimented with in the manuscripts) and using k = 31, the compacted graph takes 112 GB in GFA2, but only 29.3 GB in this reduced format.

Orientation of the output

Cuttlefish works with the canonical representations of the k-mers, i.e. each k-mer and its reverse complement are treated as the same vertex in the original graph. The maximal unitig fragments (the ''segments'' in the GFA-terminology) are always output in their canonical forms—the orientations are guaranteed to be the same across identical executions.

''Colored'' output for Cuttlefish 1

In the GFA output formats for the compacted de Bruijn graph, the graph is represented as a list of the vertices (i.e. the maximal unitigs) and the adjacencies between them. The output also includes a path-tiling for each individual sequence in the input references, i.e. an ordered list of the maximal unitig ids that completely tile that sequence. Put differently, the GFA outputs describe a colored de Bruijn graph in the sense that the color information for each vertex (maximal unitig) is encoded in the P (GFA 1.0) or the O (GFA 2.0) entries (or the tilings in the .cf_seq file, in the reduced output).

Throughout the manuscript (Cuttlefish 1), when we mention the colored de Bruijn graph, we refer to a specific definition of colors. While this definition is intuitive and natural when constructing the compacted colored de Bruijn graph from a set of reference genomes, it is not the case that the Cuttlefish algorithm allows arbitrary coloring of the k-mers in the de Bruijn graph. Specifically, in the definition adopted herein, the color set of a unitig is the subset of input references <code>si1, si2, ..., sil</code> in which the unitig appears. This color information is implicitly encoded in the path entries of the output GFA files (the P entries in GFA 1.0 and the O entries in GFA 2.0). As a result, all unitigs produced by Cuttlefish are ''monochromatic'' under this coloring definition, as a change to the color set internally to a unitig would imply either a branch (which would terminate the unitig) or the start or end of some reference string and a sentinel k-mer (which would also terminate the unitig). If one were constructing the compacted colored de Bruijn graph from raw sequencing reads or from highly-fractured assemblies, then one may wish to adopt a different notion of color, wherein color sets may vary across an individual unitig.

Example usage

We use k = 3, and 4 CPU threads, with a working directory named temp in the following examples.

Using Cuttlefish 2

From FASTQ files

To construct the maximal unitigs of the example FASTQ file reads.fq (provided in the data directory) with frequency cutoff c = 1, the following may be used.

cuttlefish build -s reads.fq -k 3 -t 4 -o cdbg -w temp/ --read -c 1

From FASTA files

To construct the maximal unitigs of the example FASTA file refs1.fa (provided in the data directory), the following may be used.

cuttlefish build -s refs1.fa -k 3 -t 4 -o cdbg -w temp/ --ref

These executions will produce two output files each: cdbg.fa, containing the maximal unitigs of the graph; and cdbg.json, a metadata file with some structural characteristics of the graph.

Multiple seq-files, lists of seq-files, or directories of seq-files may also be passed, as described in Usage.

Using Cuttlefish 1

To output the compacted de Bruijn graph (in GFA 2.0) for the example FASTA files refs1.fa and refs2.fa (provided in the data directory), the following may be used:

cuttlefish build -s refs1.fa,refs2.fa -k 3 -t 4 -o cdbg.gfa2 -f 2 -w temp/

You may also provide lists or directories of reference files as input, as described in Usage.

Larger k-mer sizes

The default maximum k-mer size supported with the installation from source is 63. To set the maximum k-mer size capacity to some MAX_K, add -DINSTANCE_COUNT=<instance_count> with the cmake command—where <instance_count> is the number of k-values that are to be supported by Cuttlefish, and should be set to (MAX_K + 1) / 2. For example, to support a MAX_K of 127, use the following:

cmake -DINSTANCE_COUNT=64 ..

Cuttlefish supports only the odd k values within MAX_K due to theoretical reasons. Currently, MAX_K is supported upto 255. Please contact the authors if support for a larger MAX_K is required.

Note that, Cuttlefish uses only as many bytes as required (rounded up to multiples of 8) for a k-mer. Thus, increasing the maximum k-mer size capacity through setting large values for MAX_K does not affect the performance for smaller k-mer sizes.

Differences between Cuttlefish 1 & 2

Cuttlefish 1 is applicable only for assembled reference sequences. Whereas Cuttlefish 2 is applicable for both sequencing reads and reference sequences.
For reference sequences, Cuttlefish 1 supports outputting the compacted graph in the GFA formats, whereas Cuttlefish 2 does not support this yet.
Cuttlefish 2 can be used by passing either one of the following arguments to the cuttlefish build command: --read or --ref. Passing neither of these invokes Cuttlefish 1.

Citations & Acknowledgement

If you use Cuttlefish or Cuttlefish 2 in your work, please include the following citations, as appropriate:

Cuttlefish (original)

Jamshed Khan, Rob Patro, Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections, Bioinformatics, Volume 37, Issue Supplement_1, July 2021, Pages i177–i186, https://doi.org/10.1093/bioinformatics/btab309

Cuttlefish 2

Khan, J., Kokot, M., Deorowicz, S. et al. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biol 23, 190 (2022). https://doi.org/10.1186/s13059-022-02743-6

This work is supported by NIH R01 HG009937, and by NSF CCF-1750472, and CNS-1763680.

Licenses

The BBHash library is MIT licensed.
The Boost C++ Metaprogramming library is Boost Software licensed.
The compact_vector library is MIT licensed.
The cxxopts library is MIT licensed.
The fmt library is MIT licensed.
The KMC software is GNU GPL 3 licensed.
The kseq library is MIT licensed.
The spdlog library is MIT licensed.
The xxHash library is BSD licensed.
Cuttlefish itself is Revised BSD licensed.