Awesome
fq
fq filters, generates, subsamples, and validates FASTQ files.
Install
There are different methods to install fq.
Releases
Precompiled binaries are built for modern Linux distributions
(x86_64-unknown-linux-gnu
), macOS (x86_64-apple-darwin
), and Windows
(x86_64-pc-windows-msvc
). The Linux binaries require glibc 2.31+ (CentOS/RHEL
9+, Debian 11+, Ubuntu 20.04+, etc.).
Conda
fq is available via Bioconda.
$ conda install fq=0.12.0
Manual
Clone the repository and use Cargo to install fq.
$ git clone --depth 1 --branch v0.12.0 https://github.com/stjude-rust-labs/fq.git
$ cd fq
$ cargo install --locked --path .
Container image
Container images are managed by Bioconda and available through Quay.io, e.g., using Docker:
$ docker image pull quay.io/biocontainers/fq:<tag>
See the repository tags for the available tags.
Alternatively, build the development container image:
$ git clone --depth 1 --branch v0.12.0 https://github.com/stjude-rust-labs/fq.git
$ cd fq
$ docker image build --tag fq:0.12.0 .
Usage
fq provides subcommands for filtering, generating, subsampling, and validating FASTQ files.
filter
fq filter filters a given FASTQ file by a set of names or a sequence pattern. The result includes only the records that match the given options.
Usage
Filters a FASTQ file
Usage: fq filter [OPTIONS] --dsts <DSTS> [SRCS]...
Arguments:
[SRCS]... FASTQ sources
Options:
--names <NAMES>
Allowlist of record names
--sequence-pattern <SEQUENCE_PATTERN>
Keep records that have sequences that match the given regular expression
--dsts <DSTS>
Filtered FASTQ destinations
-h, --help
Print help
-V, --version
Print version
Examples
# Filters an input FASTQ using the given allowlist.
$ fq filter --names allowlist.txt --dsts /dev/stdout in.fastq
# Filters FASTQ files by matching a sequence pattern in the first input's
# records and applying the match to all inputs.
$ fq filter --sequence-pattern ^TC --dsts out.1.fq --dsts out.2.fq in.1.fq in.2.fq
generate
fq generate is a FASTQ file pair generator. It creates two reads, formatting names as described by Illumina.
While generate creates "valid" FASTQ reads, the content of the files are completely random. The sequences do not align to any genome.
Usage
Generates a random FASTQ file pair
Usage: fq generate [OPTIONS] <R1_DST> <R2_DST>
Arguments:
<R1_DST> Read 1 destination. Output will be gzipped if ends in `.gz`
<R2_DST> Read 2 destination. Output will be gzipped if ends in `.gz`
Options:
-s, --seed <SEED> Seed to use for the random number generator
-n, --record-count <RECORD_COUNT> Number of records to generate [default: 10000]
--read-length <READ_LENGTH> Number of bases in the sequence [default: 101]
-h, --help Print help
-V, --version Print version
Examples
# Generates the default number of records, written to uncompressed files.
$ fq generate /tmp/r1.fastq /tmp/r2.fastq
# Generates FASTQ paired reads with 32 records, written to gzipped outputs.
$ fq generate --record-count 32 /tmp/r1.fastq.gz /tmp/r2.fastq.gz
lint
fq lint is a FASTQ file pair validator.
Usage
Validates a FASTQ file pair
Usage: fq lint [OPTIONS] <R1_SRC> [R2_SRC]
Arguments:
<R1_SRC> Read 1 source. Accepts both raw and gzipped FASTQ inputs
[R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs
Options:
--lint-mode <LINT_MODE>
Panic on first error or log all errors [default: panic] [possible values: panic, log]
--single-read-validation-level <SINGLE_READ_VALIDATION_LEVEL>
Only use single read validators up to a given level [default: high] [possible values: low, medium, high]
--paired-read-validation-level <PAIRED_READ_VALIDATION_LEVEL>
Only use paired read validators up to a given level [default: high] [possible values: low, medium, high]
--disable-validator <DISABLE_VALIDATOR>
Disable validators by code. Use multiple times to disable more than one
-h, --help
Print help
-V, --version
Print version
Validators
validate includes a set of validators that run on single or paired records.
By default, records are validated with all rules, but validators can be
disabled using --disable-validator CODE
, where CODE
is one of validators
listed below.
Single
Code | Level | Name | Validation |
---|---|---|---|
S001 | low | PlusLine | Plus line starts with a "+". |
S002 | medium | Alphabet | All characters in sequence line are one of "ACGTN", case-insensitive. |
S003 | high | Name | Name line starts with an "@". |
S004 | low | Complete | All four record lines (name, sequence, plus line, and quality) are present. |
S005 | high | ConsistentSeqQual | Sequence and quality lengths are the same. |
S006 | medium | QualityString | All characters in quality line are between "!" and "~" (ordinal values). |
S007 | high | DuplicateName | All record names are unique. |
Paired
Code | Level | Name | Validation |
---|---|---|---|
P001 | medium | Names | Each paired read name is the same, excluding interleave. |
Examples
# Validate both reads using all validators. Exits cleanly (0) if no validation
# errors occur.
$ fq lint r1.fastq r2.fastq
# Log errors instead of quitting on first error.
$ fq lint --lint-mode log r1.fastq r2.fastq
# Disable validators S004 and S007.
$ fq lint --disable-validator S004 --disable-validator S007 r1.fastq r2.fastq
subsample
fq subsample outputs a subset of records from single or paired FASTQ files.
When using a probability (-p, --probability
), each file is read through once,
and a subset of records is selected based on that chance. Given the randomness
used when sampling a uniform distribution, the output record count will not be
exact but (statistically) close.
When using a record count (-n, --record-count
), the first input is read
twice, but it provides an exact number of records to be selected.
A seed (-s, --seed
) can be provided to influence the results, e.g.,
for a deterministic subset of records.
For paired input, the sampling is applied to each pair.
Usage
Outputs a subset of records
Usage: fq subsample [OPTIONS] --r1-dst <R1_DST> <--probability <PROBABILITY>|--record-count <RECORD_COUNT>> <R1_SRC> [R2_SRC]
Arguments:
<R1_SRC> Read 1 source. Accepts both raw and gzipped FASTQ inputs
[R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs
Options:
-p, --probability <PROBABILITY> The probability a record is kept, as a percentage (0.0, 1.0). Cannot be used with `record-count`
-n, --record-count <RECORD_COUNT> The exact number of records to keep. Cannot be used with `probability`
-s, --seed <SEED> Seed to use for the random number generator
--r1-dst <R1_DST> Read 1 destination. Output will be gzipped if ends in `.gz`
--r2-dst <R2_DST> Read 2 destination. Output will be gzipped if ends in `.gz`
-h, --help Print help
-V, --version Print version
Examples
# Sample ~50% of records from a single FASTQ file
$ fq subsample --probability 0.5 --r1-dst r1.50pct.fastq r1.fastq
# Sample ~50% of records from a single FASTQ file and seed the RNG
$ fq subsample --probability --seed 13 --r1-dst r1.50pct.fastq r1.fastq
# Sample ~25% of records from paired FASTQ files
$ fq subsample --probability 0.25 --r1-dst r1.25pct.fastq --r2-dst r2.25pct.fastq r1.fastq r2.fastq
# Sample ~10% of records from a gzipped FASTQ file and compress output
$ fq subsample --probability 0.1 --r1-dst r1.10pct.fastq.gz r1.fastq.gz
# Sample exactly 10000 records from a single FASTQ file
$ fq subsample --record-count 10000 -r1-dst r1.10k.fastq r1.fastq
Legal
Please see the disclaimer that applies to all crates and command line tools made available by St. Jude Rust Labs.