nanoq <a href='https://github.com/esteinig'><img src='docs/nanoq.png' align="right" height="180" /></a>

Ultra-fast quality control and summary reports for nanopore reads

Overview

v0.10.0

Purpose

Nanoq implements ultra-fast read filters and summary reports for high-throughput nanopore reads.

Citation

We would appreciate a citation if you are using nanoq for research. Please see here for some suggestions on how you could give back to the community if you are using nanoq for industry applications :pray:

Steinig and Coin (2022). Nanoq: ultra-fast quality control for nanopore reads. Journal of Open Source Software, 7(69), 2991, https://doi.org/10.21105/joss.02991

Performance

See the data in the benchmarks section.

Tests

Nanoq comes with high test coverage for your peace of mind.

cargo test

Install

Cargo

cargo install nanoq

Conda

conda install -c conda-forge -c bioconda nanoq

Binaries

Precompiled binaries for Linux and macOS are attached to the latest release.

VERSION=0.10.0
RELEASE=nanoq-${VERSION}-x86_64-unknown-linux-musl.tar.gz

wget https://github.com/esteinig/nanoq/releases/download/${VERSION}/${RELEASE}
tar xf "${RELEASE}"

nanoq-${VERSION}-x86_64-unknown-linux-musl/nanoq -h

Usage

Nanoq accepts a file (-i) or stream (stdin) of reads in fast{a,q}.{gz,bz2,xz} format and outputs reads to file (-o) or stream (stdout).

nanoq -i test.fq.gz -o reads.fq
cat test.fq.gz | nanoq > reads.fq

Read filters

Reads can be filtered by minimum read length (-l), maximum read length (-m), minimum average read quality (-q) or maximum average read quality (-w).

nanoq -i test.fq -l 1000 -m 10000 -q 10 -w 15 > reads.fq 

Read trimming

A fixed number of bases can be trimmed from the start (-S) or end (-E) of reads:

nanoq -i test.fq -S 100 -E 100 > reads.fq 

Read report

Read summaries are produced when using the stats flag (-s, report to stdout, no read output to stdout) or when specifying a report file (-r):

nanoq -i test.fq -s
nanoq -i test.fq -r report.txt > reads.fq

For report types and configuration see the output section.

Fast mode

:warning: When using fast mode (-f), read quality scores are not computed and quality fields are output as NaN.

Read qualities may be excluded from filters and statistics to speed up read iteration (-f).

nanoq -i test.fq.gz -f -s

Compression

Output compression is inferred from file extensions (gz, bz2, lzma).

nanoq -i test.fq -o reads.fq.gz

Output compression can be specified manually with -O and -c.

nanoq -i test.fq -O g -c 9 > reads.fq.gz

Online runs

Nanoq can be used to check on active sequencing runs and barcoded samples.

find /data/nanopore/run -name "*.fastq" -print0 | xargs -0 cat | nanoq -s
for i in {01..12}; do
  find /data/nanopore/run -name barcode${i}.fastq -print0 | xargs -0 cat | nanoq -s
done
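The barcode loop above relies on `{01..12}` expanding to zero-padded numbers. To my knowledge that padding needs bash 4 or newer (or zsh); older bash versions and plain POSIX sh do not produce it. A minimal portable sketch that builds the same file names with `printf` (the variable names are illustrative only):

```shell
# Build barcode01.fastq ... barcode12.fastq without brace expansion.
names=""
i=1
while [ "$i" -le 12 ]; do
  names="$names$(printf 'barcode%02d.fastq' "$i") "
  i=$((i + 1))
done

# Split the list into positional parameters to inspect it.
set -- $names
count=$#
first=$1
last=${12}
echo "$count names, first: $first, last: $last"
```

Each generated name can then be passed to `find ... -name` exactly as in the loop above.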

Parameters

nanoq 0.10.0

Filters and summary reports for nanopore reads

USAGE:
    nanoq [FLAGS] [OPTIONS]

FLAGS:
    -f, --fast       Ignore quality values if present
    -h, --help       Prints help information
    -H, --header     Header for summary output
    -j, --json       Summary report in JSON format
    -s, --stats      Summary report only [stdout]
    -V, --version    Prints version information
    -v, --verbose    Verbose output statistics [multiple, up to -vvv]

OPTIONS:
    -c, --compress-level <1-9>     Compression level to use if compressing output [default: 6]
    -i, --input <input>            Fast{a,q}.{gz,xz,bz}, stdin if not present
    -m, --max-len <INT>            Maximum read length filter (bp) [default: 0]
    -w, --max-qual <FLOAT>         Maximum average read quality filter (Q) [default: 0]
    -l, --min-len <INT>            Minimum read length filter (bp) [default: 0]
    -q, --min-qual <FLOAT>         Minimum average read quality filter (Q) [default: 0]
    -o, --output <output>          Output filepath, stdout if not present
    -O, --output-type <u|b|g|l>    u: uncompressed; b: Bzip2; g: Gzip; l: Lzma
    -r, --report <FILE>            Summary read statistics report output file
    -t, --top <INT>                Number of top reads in verbose summary [default: 5]
    -L, --read-lengths <FILE>      Output read lengths of surviving reads to file
    -Q, --read-qualities <FILE>    Output read qualities of surviving reads to file
    -S, --trim-start <INT>         Trim bases from the start of each read [default: 0]
    -E, --trim-end <INT>           Trim bases from the end of each read [default: 0]

Output

Read lengths and qualities

Files with read lengths (--read-lengths/-L) and qualities (--read-qualities/-Q) of the surviving reads can be output:

nanoq -i test.fq -Q rq.txt -L rl.txt > reads.fq

:warning: Length and quality outputs are meant for quick plotting of distributions. Because of dubious internal design decisions (my bad), outputs are ordered with an unstable sorting function, which means the order of identical values may change between outputs. Furthermore, output order does not correspond to read output order. This will change in the next release, as outlined in this issue.
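Since the order of these files carries no meaning, it is safest to sort them before use. As a rough sketch, the following computes an N50 from a read length file, assuming one length per line (an assumption about the layout, not a documented format). The input values are made up, and nanoq's own N50 may treat ties or rounding differently.

```shell
# Made-up read lengths, one per line (assumed layout of the -L output file).
lengths=$(mktemp)
printf '%s\n' 2 3 4 5 6 > "$lengths"

# Sort longest first, then walk down until half of all bases are covered.
n50=$(sort -rn "$lengths" | awk '
  { len[NR] = $1; total += $1 }
  END {
    for (i = 1; i <= NR; i++) {
      covered += len[i]
      if (covered >= total / 2) { print len[i]; exit }
    }
  }')
echo "N50: $n50"
rm -f "$lengths"
```

Sorting explicitly makes the result independent of the order in which nanoq wrote the values.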

Summary reports

Summary reports are output to file explicitly using --report/-r:

nanoq -i test.fq -r report.txt > reads.fq
nanoq -i test.fq -r report.txt -s

When using the --stats/-s flag read output is suppressed and summary is directed to stdout:

nanoq -i test.fq -s > report.txt

Report format is minimal by default:

100000 400398234 5154 44888 5 4003 3256 8.90 9.49
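The minimal report is a single whitespace-separated line, so it is easy to split in a shell script. The sketch below names the nine fields under the assumption that they follow the order of the extended summary (reads, bases, N50, longest, shortest, mean length, median length, mean quality, median quality). The field names are illustrative and not an official schema; the -H flag prints the actual header.

```shell
# Example report line copied from above. The field names are assumptions:
# they follow the order of the extended summary, not a documented schema.
report="100000 400398234 5154 44888 5 4003 3256 8.90 9.49"

# Split the line on whitespace into positional parameters, then name them.
set -- $report
reads=$1; bases=$2; n50=$3; longest=$4; shortest=$5
mean_length=$6; median_length=$7; mean_quality=$8; median_quality=$9

echo "reads=$reads bases=$bases n50=$n50 mean_quality=$mean_quality"

# Sanity check on this example: bases / reads, truncated to an integer.
mean_check=$(awk -v b="$bases" -v r="$reads" 'BEGIN { printf "%d", b / r }')
echo "bases/reads=$mean_check reported mean_length=$mean_length"
```

In this example the truncated ratio of bases to reads matches the reported mean read length, which supports the assumed column order but is not a documented guarantee.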

A machine readable header can be added using the -H flag:

nanoq -i test.fq -s -H

Extended summaries analogous to NanoStat can be obtained using multiple -v flags (up to -vvv), including the top (-t) read lengths and qualities:

nanoq -i test.fq -f -s -t 5 -vvv
Nanoq Read Summary
====================

Number of reads:      100000
Number of bases:      400398234
N50 read length:      5154
Longest read:         44888 
Shortest read:        5
Mean read length:     4003
Median read length:   3256 
Mean read quality:    NaN 
Median read quality:  NaN


Read length thresholds (bp)

> 200       99104             99.1%
> 500       96406             96.4%
> 1000      90837             90.8%
> 2000      73579             73.6%
> 5000      25515             25.5%
> 10000     4987              05.0%
> 30000     47                00.0%
> 50000     0                 00.0%
> 100000    0                 00.0%
> 1000000   0                 00.0%


Top ranking read lengths (bp)

1. 44888       
2. 40044       
3. 37441       
4. 36543       
5. 35630

JSON formatted extended output (equivalent to -vvv) can be output to --report (-r) or stdout (-s) using the --json/-j flag:

nanoq -i test.fq --json -f -r report.json > reads.fq
nanoq -i test.fq --json -f -s > report.json
{
  "reads": 100000,
  "bases": 400398234,
  "n50": 5154,
  "longest": 44888,
  "shortest": 5,
  "mean_length": 4003,
  "median_length": 3256,
  "mean_quality": null,
  "median_quality": null,
  "length_thresholds": {
    "200": 99104,
    "500": 96406,
    "1000": 90837,
    "2000": 73579,
    "5000": 25515,
    "10000": 4987,
    "30000": 47,
    "50000": 0,
    "100000": 0,
    "1000000": 0
  },
  "quality_thresholds": {
    "5": 0,
    "7": 0,
    "10": 0,
    "12": 0,
    "15": 0,
    "20": 0,
    "25": 0,
    "30": 0
  },
  "top_lengths": [
    44888, 40044, 37441, 36543, 35630
  ],
  "top_qualities": []
}

Note that in this example no read qualities are computed; quality thresholds are therefore all zero.

Benchmarks

Benchmarks evaluate the processing speed and memory consumption of a basic read length filter and of summary statistics on the even Zymo mock community (GridION), with comparisons to rust-bio-tools, seqtk fqchk, seqkit stats, NanoFilt, NanoStat and Filtlong. Time to completion and maximum memory consumption were measured using /usr/bin/time -f "%e %M"; speedup is relative to the slowest command in each set. Note that the summary statistics from rust-bio-tools and seqkit stats do not compute read quality scores and are therefore comparable to nanoq-fast.
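As a small worked example of the speedup definition, the sketch below divides the elapsed time of the slowest command by that of another command. The two values are the mean seconds reported for nanostat and nanoq-fast in the "stats + zymo.full.fq" results of this README; the variable names are illustrative.

```shell
# Mean elapsed seconds taken from the "stats + zymo.full.fq" results.
slowest_sec=1260      # nanostat, the slowest command in that set
nanoq_fast_sec=2.85   # nanoq-fast

# Speedup = time of slowest command / time of this command.
speedup=$(awk -v slow="$slowest_sec" -v fast="$nanoq_fast_sec" \
  'BEGIN { printf "%.1f", slow / fast }')
echo "nanoq-fast speedup: ${speedup} x"
```

The slowest command in a set therefore always has a speedup of 1.00 x.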

Tasks:

Tools:

Commands used for stats task:

Commands used for filter task:

Files:

Data preparation:

wget "https://nanopore.s3.climb.ac.uk/Zymo-GridION-EVEN-BB-SN.fq.gz"
zcat Zymo-GridION-EVEN-BB-SN.fq.gz > zymo.full.fq
head -400000 zymo.full.fq > zymo.fq && gzip -k zymo.fq
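The `head -400000` step yields the 100,000-read subset because a standard FASTQ record spans four lines (header, sequence, separator, qualities). A minimal sketch of that arithmetic, using a made-up two-record file and assuming sequences are not wrapped over multiple lines:

```shell
# Build a tiny two-record FASTQ in a temporary file (made-up reads).
tmp=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > "$tmp"

# Four lines per record, so records = lines / 4.
lines=$(wc -l < "$tmp")
records=$((lines / 4))
echo "$records records"

# The same arithmetic for the subset above: 400000 lines -> 100000 reads.
subset_reads=$((400000 / 4))
echo "$subset_reads reads"
rm -f "$tmp"
```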

Elapsed real time and maximum resident set size:

/usr/bin/time -f "%e %M"

Task and command execution:

Commands were run in replicates of 10 with a mounted benchmark data volume in the provided Docker container. An additional cold start iteration for each command was not considered in the final benchmarks.

for i in {1..11}; do
  for f in /data/*.fq; do 
    /usr/bin/time -f "%e %M" nanoq -f -s -i "$f" 2> benchmark
    tail -1 benchmark >> nanoq_stat_fq
  done
done
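The collected "%e %M" lines can then be averaged. The sketch below is one possible way to do this, not the script used for the published numbers. It uses made-up measurements, treats the first line as the cold start to discard (which only holds when a single input file is benchmarked), and converts the kilobytes reported by GNU time to megabytes by dividing by 1024 (an assumption about the unit convention used in the tables).

```shell
# Made-up "%e %M" lines standing in for a results file such as nanoq_stat_fq:
# elapsed seconds, then maximum resident set size in kilobytes.
results=$(mktemp)
printf '%s\n' '3.10 22700' '2.84 22710' '2.86 22720' '2.85 22715' > "$results"

# Skip line 1 (treated here as the cold start), then average the rest.
summary=$(awk 'NR > 1 { n++; sec += $1; kb += $2 }
               END { printf "%.2f %.2f", sec / n, kb / n / 1024 }' "$results")
echo "mean sec, mean MB: $summary"
rm -f "$results"
```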

Benchmark results

Nanoq benchmarks on 3.5 million reads of the Zymo mock community (10 replicates)

stats + zymo.full.fq

| command | mb (sd) | sec (sd) | reads / sec | speedup | quality scores |
| --- | --- | --- | --- | --- | --- |
| nanostat | 741.4 (0.09) | 1260. (13.9) | 2,770 | 01.00 x | true |
| seqtk-fqchk | 103.8 (0.04) | 125.9 (0.15) | 27,729 | 10.01 x | true |
| seqkit-stats | 18.68 (3.15) | 125.3 (0.91) | 27,861 | 10.05 x | false |
| nanoq | 35.83 (0.06) | 94.51 (0.43) | 36,938 | 13.34 x | true |
| rust-bio | 43.20 (0.08) | 06.54 (0.05) | 533,803 | 192.7 x | false |
| nanoq-fast | 22.18 (0.07) | 02.85 (0.02) | 1,224,939 | 442.1 x | false |

filter + zymo.full.fq

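| command | mb (sd) | sec (sd) | reads / sec | speedup |
| --- | --- | --- | --- | --- |
| nanofilt | 67.47 (0.13) | 1160. (20.2) | 3,009 | 01.00 x |
| filtlong | 1516. (5.98) | 420.6 (4.53) | 8,360 | 02.78 x |
| nanoq | 11.93 (0.06) | 94.93 (0.45) | 36,775 | 12.22 x |
| nanoq-fast | 08.05 (0.05) | 03.90 (0.30) | 895,148 | 297.5 x |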

Nanoq benchmarks on 100,000 reads of the Zymo mock community (10 replicates)

stats + zymo.fq

| command | mb (sd) | sec (sd) | reads / sec | speedup | quality scores |
| --- | --- | --- | --- | --- | --- |
| nanostat | 79.64 (0.14) | 36.22 (0.27) | 2,760 | 01.00 x | true |
| nanoq | 04.26 (0.09) | 02.69 (0.02) | 37,147 | 13.46 x | true |
| seqtk-fqchk | 53.01 (0.05) | 02.28 (0.06) | 43,859 | 15.89 x | true |
| seqkit-stats | 17.07 (3.03) | 00.13 (0.00) | 100,000 | 36.23 x | false |
| rust-bio | 16.61 (0.08) | 00.22 (0.00) | 100,000 | 36.23 x | false |
| nanoq-fast | 03.81 (0.05) | 00.08 (0.00) | 100,000 | 36.23 x | false |

stats + zymo.fq.gz

| command | mb (sd) | sec (sd) | reads / sec | speedup | quality scores |
| --- | --- | --- | --- | --- | --- |
| nanostat | 79.46 (0.22) | 40.98 (0.31) | 2,440 | 01.00 x | true |
| nanoq | 04.44 (0.09) | 05.74 (0.04) | 17,421 | 07.14 x | true |
| seqtk-fqchk | 53.11 (0.05) | 05.70 (0.08) | 17,543 | 07.18 x | true |
| rust-bio | 01.59 (0.06) | 05.06 (0.04) | 19,762 | 08.09 x | false |
| seqkit-stats | 20.54 (0.41) | 04.85 (0.02) | 20,619 | 08.45 x | false |
| nanoq-fast | 03.95 (0.07) | 03.15 (0.02) | 31,746 | 13.01 x | false |

filter + zymo.fq

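| command | mb (sd) | sec (sd) | reads / sec | speedup |
| --- | --- | --- | --- | --- |
| nanofilt | 66.29 (0.15) | 33.01 (0.24) | 3,029 | 01.00 x |
| filtlong | 274.5 (0.04) | 08.49 (0.01) | 11,778 | 03.89 x |
| nanoq | 03.61 (0.04) | 02.81 (0.28) | 35,587 | 11.75 x |
| nanoq-fast | 03.26 (0.06) | 00.12 (0.01) | 100,000 | 33.01 x |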

filter + zymo.fq.gz

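| command | mb (sd) | sec (sd) | reads / sec | speedup |
| --- | --- | --- | --- | --- |
| nanofilt | 01.57 (0.07) | 33.48 (0.35) | 2,986 | 01.00 x |
| filtlong | 274.2 (0.04) | 16.45 (0.09) | 6,079 | 02.04 x |
| nanoq | 03.68 (0.06) | 05.77 (0.04) | 17,331 | 05.80 x |
| nanoq-fast | 03.45 (0.07) | 03.20 (0.02) | 31,250 | 10.47 x |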

Dependencies

Nanoq uses needletail for read operations and niffler for output compression.

Etymology

Avoided a name collision with nanoqc by dropping the c, arriving at nanoq [nanɔq], which coincidentally means 'polar bear' in Native American (Eskimo-Aleut, Greenlandic). If you find nanoq useful for your work, consider a small donation to the Polar Bear Fund, RAVEN or Inuit Tapiriit Kanatami.

Contributions

We welcome any and all suggestions or pull requests. Please feel free to open an issue in the repository on GitHub.