Home

Awesome

Cuttlefish

Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge

install with bioconda

Cuttlefish is a fast, parallel, and very lightweight memory tool to construct the compacted de Bruijn graph from sequencing reads or reference sequences. It is highly scalable in terms of the size of the input data.

Table of contents

Overview

Cuttlefish is a program to produce the compacted de Bruijn graph from sequencing reads or reference sequences.

The papers describing the work are: Cuttlefish (original) and Cuttlefish 2.

Dependencies

Cuttlefish can be installed using Bioconda (check Installation). If installing from source, the following are required:

These should already be available in your platform; and if not, then these can be easily installed from their sources. Besides, these should also be available via some package manager for your operating system:

Installation

Usage

cuttlefish build --help displays the following message (the default threads argument is machine-configuration specific):

Efficiently construct the compacted de Bruijn graph from sequencing reads or reference sequences
Usage:
  cuttlefish build [OPTION...]

 common options:
  -s, --seq arg            input files
  -l, --list arg           input file lists
  -d, --dir arg            input file directories
  -k, --kmer-len arg       k-mer length (default: 27)
  -t, --threads arg        number of threads to use (default: 22)
  -o, --output arg         output file
  -w, --work-dir arg       working directory (default: .)
  -m, --max-memory arg     soft maximum memory limit in GB (default: 3)
      --unrestrict-memory  do not impose memory usage restriction
  -h, --help               print usage

 cuttlefish_1 options:
  -f, --format arg  output format (0: FASTA, 1: GFA 1.0, 2: GFA 2.0, 3:
                    GFA-reduced)

 cuttlefish_2 options:
      --read        construct a compacted read de Bruijn graph (for FASTQ
                    input)
      --ref         construct a compacted reference de Bruijn graph (for
                    FASTA input)
  -c, --cutoff arg  frequency cutoff for (k + 1)-mers (default: refs: 1,
                    reads: 2)
      --path-cover  extract a maximal path cover of the de Bruijn graph

 debug options:
      --vertex-set arg  set of vertices, i.e. k-mers (KMC database) prefix
                        (default: "")
      --edge-set arg    set of edges, i.e. (k + 1)-mers (KMC database) prefix
                        (default: "")

 specialized options:
      --save-mph       save the minimal perfect hash (BBHash) over the vertex
                       set
      --save-buckets   save the DFA-states collection of the vertices
      --save-vertices  save the vertex set of the graph

It supports GNU style arguments, -- for long options, and - for short options. Long options opt taking a parameter can be written as --opt=parameter or as --opt parameter. Short options o taking a parameter is written as -o parameter.

The common arguments (for Cuttlefish 1 and 2) are set as following.

Cuttlefish 1 specific arguments are set as following.

Cuttlefish 2 specific arguments are set as following.

Note

The edge- and / or the vertex-set generation step could produce a high number of temporary files in disk, up-to 2000. Failure to ensure the capability of opening this many files could produce error messages of the following form:

Error: Cannot open temporary file ./kmc_00000.bin

The concurrently open file-handle limit for the user running the process can be raised with the following command:

ulimit -n 2048

Output formats

Cuttlefish 2 output

The currently supported output format is

Other output formats are currently in the development roadmap.

Cuttlefish 1 output

The currently supported output formats are —

Orientation of the output

Cuttlefish works with the canonical representations of the k-mers, i.e. each k-mer and its reverse complement are treated as the same vertex in the original graph. The maximal unitig fragments (the ''segments'' in the GFA-terminology) are always output in their canonical forms—the orientations are guaranteed to be the same across identical executions.

''Colored'' output for Cuttlefish 1

In the GFA output formats for the compacted de Bruijn graph, the graph is represented as a list of the vertices (i.e. the maximal unitigs) and the adjacencies between them. The output also includes a path-tiling for each individual sequence in the input references, i.e. an ordered list of the maximal unitig ids that completely tile that sequence. Put differently, the GFA outputs describe a colored de Bruijn graph in the sense that the color information for each vertex (maximal unitig) is encoded in the P (GFA 1.0) or the O (GFA 2.0) entries (or the tilings in the .cf_seq file, in the reduced output).

Throughout the manuscript (Cuttlefish 1), when we mention the colored de Bruijn graph, we refer to a specific definition of colors. While this definition is intuitive and natural when constructing the compacted colored de Bruijn graph from a set of reference genomes, it is not the case that the Cuttlefish algorithm allows arbitrary coloring of the k-mers in the de Bruijn graph. Specifically, in the definition adopted herein, the color set of a unitig is the subset of input references <code>s<sub>i<sub>1</sub></sub>, s<sub>i<sub>2</sub></sub>, ..., s<sub>i<sub>l</sub></sub></code> in which the unitig appears. This color information is implicitly encoded in the path entries of the output GFA files (the P entries in GFA 1.0 and the O entries in GFA 2.0). As a result, all unitigs produced by Cuttlefish are ''monochromatic'' under this coloring definition, as a change to the color set internally to a unitig would imply either a branch (which would terminate the unitig) or the start or end of some reference string and a sentinel k-mer (which would also terminate the unitig). If one were constructing the compacted colored de Bruijn graph from raw sequencing reads or from highly-fractured assemblies, then one may wish to adopt a different notion of color, wherein color sets may vary across an individual unitig.

Example usage

We use k = 3, and 4 CPU threads, with a working directory named temp in the following examples.

Using Cuttlefish 2

To construct the maximal unitigs of the example FASTQ file reads.fq (provided in the data directory) with frequency cutoff c = 1, the following may be used.

cuttlefish build -s reads.fq -k 3 -t 4 -o cdbg -w temp/ --read -c 1

To construct the maximal unitigs of the example FASTA file refs1.fa (provided in the data directory), the following may be used.

cuttlefish build -s refs1.fa -k 3 -t 4 -o cdbg -w temp/ --ref

These executions will produce two output files each: cdbg.fa, containing the maximal unitigs of the graph; and cdbg.json, a metadata file with some structural characteristics of the graph.

Multiple seq-files, lists of seq-files, or directories of seq-files may also be passed, as described in Usage.

Using Cuttlefish 1

To output the compacted de Bruijn graph (in GFA 2.0) for the example FASTA files refs1.fa and refs2.fa (provided in the data directory), the following may be used:

cuttlefish build -s refs1.fa,refs2.fa -k 3 -t 4 -o cdbg.gfa2 -f 2 -w temp/

You may also provide lists or directories of reference files as input, as described in Usage.

Larger k-mer sizes

The default maximum k-mer size supported with the installation from source is 63. To set the maximum k-mer size capacity to some MAX_K, add -DINSTANCE_COUNT=<instance_count> with the cmake command—where <instance_count> is the number of k-values that are to be supported by Cuttlefish, and should be set to (MAX_K + 1) / 2. For example, to support a MAX_K of 127, use the following:

cmake -DINSTANCE_COUNT=64 ..

Cuttlefish supports only the odd k values within MAX_K due to theoretical reasons. Currently, MAX_K is supported upto 255. Please contact the authors if support for a larger MAX_K is required.

Note that, Cuttlefish uses only as many bytes as required (rounded up to multiples of 8) for a k-mer. Thus, increasing the maximum k-mer size capacity through setting large values for MAX_K does not affect the performance for smaller k-mer sizes.

Differences between Cuttlefish 1 & 2

Citations & Acknowledgement

If you use Cuttlefish or Cuttlefish 2 in your work, please include the following citations, as appropriate:

Cuttlefish (original)

Jamshed Khan, Rob Patro, Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections, Bioinformatics, Volume 37, Issue Supplement_1, July 2021, Pages i177–i186, https://doi.org/10.1093/bioinformatics/btab309

Cuttlefish 2

Khan, J., Kokot, M., Deorowicz, S. et al. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biol 23, 190 (2022). https://doi.org/10.1186/s13059-022-02743-6

This work is supported by NIH R01 HG009937, and by NSF CCF-1750472, and CNS-1763680.

Licenses