Fastaq

Manipulate FASTA and FASTQ files

License: GPL v3

Introduction

Fastaq is a Python 3 script for manipulating FASTA and FASTQ files (and files in other formats), plus an API for developers.

Installation

There are a number of ways to install Fastaq; details are provided below. If you encounter an issue when installing Fastaq, please contact your local system administrator. If you encounter a bug, please log it on the issues page or email us at path-help@sanger.ac.uk.

Pip install

Install from PyPI:

pip3 install pyfastaq

Or use pip to install the latest development version directly from this repo:

pip3 install git+https://github.com/sanger-pathogens/Fastaq.git

From source

If you want to edit the codebase, clone this repo and install in editable mode.

# Clone and install from this repository:
git clone https://github.com/sanger-pathogens/Fastaq.git && cd Fastaq && pip install -e ".[tests]"

Running the tests

The tests can be run from the top-level directory:

pytest tests

Runtime dependencies

These must be available in your path at run time:

Usage

The installation will put a single script called fastaq in your path. The usage is:

fastaq <command> [options]

Key points:

Examples

Reverse complement all sequences in a file:

fastaq reverse_complement in.fastq out.fastq

Reverse complement all sequences in a gzipped file, then translate each sequence:

fastaq reverse_complement in.fastq.gz - | fastaq translate - out.fasta
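
To make the second example more concrete, here is a minimal Python sketch of what reverse complementing and then translating a single nucleotide sequence means. It is an illustration only and is not Fastaq's implementation: the function names reverse_complement and translate are hypothetical, translation is assumed to use the standard genetic code in frame 0, characters other than A, C, G, T are left unchanged by the complement step, and unrecognised codons are written as X.

```python
# Illustrative sketch only; this is not the code that fastaq runs.

# Complement table for unambiguous bases; other characters are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
_CODON_TABLE = {
    b1 + b2 + b3: _AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(_BASES)
    for j, b2 in enumerate(_BASES)
    for k, b3 in enumerate(_BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq):
    """Translate a nucleotide string in frame 0; unknown codons become X."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(
        _CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


if __name__ == "__main__":
    rc = reverse_complement("GGCCAT")
    print(rc, translate(rc))  # prints "ATGGCC MA"
```

The two functions are kept separate to mirror the way the two fastaq commands are chained with a pipe.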

Available commands

Command                 Description
acgtn_only              Replace every non acgtnACGTN with an N
add_indels              Deletes or inserts bases at given position(s)
caf_to_fastq            Converts a CAF file to FASTQ format
capillary_to_pairs      Converts file of capillary reads to paired and unpaired files
chunker                 Splits sequences into equal sized chunks
count_sequences         Counts the sequences in input file
deinterleave            Splits interleaved paired file into two separate files
enumerate_names         Renames sequences in a file, calling them 1,2,3... etc
expand_nucleotides      Makes every combination of degenerate nucleotides
fasta_to_fastq          Convert FASTA and .qual to FASTQ
filter                  Filter sequences to get a subset of them
get_ids                 Get the ID of each sequence
get_seq_flanking_gaps   Gets the sequences flanking gaps
interleave              Interleaves two files, output is alternating between fwd/rev reads
make_random_contigs     Make contigs of random sequence
merge                   Converts multi sequence file to a single sequence
replace_bases           Replaces all occurrences of one letter with another
reverse_complement      Reverse complement all sequences
scaffolds_to_contigs    Creates a file of contigs from a file of scaffolds
search_for_seq          Find all exact matches to a string (and its reverse complement)
sequence_trim           Trim exact matches to a given string off the start of every sequence
sort_by_name            Sorts sequences in lexicographical (name) order
sort_by_size            Sorts sequences in length order
split_by_base_count     Split multi sequence file into separate files
strip_illumina_suffix   Strips /1 or /2 off the end of every read name
to_fake_qual            Make fake quality scores file
to_fasta                Converts a variety of input formats to nicely formatted FASTA format
to_mira_xml             Create an xml file from a file of reads, for use with Mira assembler
to_orfs_gff             Writes a GFF file of open reading frames
to_perfect_reads        Make perfect paired reads from reference
to_random_subset        Make a random sample of sequences (and optionally mates as well)
to_tiling_bam           Make a BAM file of reads uniformly spread across the input reference
to_unique_by_id         Remove duplicate sequences, based on their names. Keep longest seqs
translate               Translate all sequences in input nucleotide sequences
trim_Ns_at_end          Trims all Ns at the start/end of all sequences
trim_contigs            Trims a set number of bases off the end of every contig
trim_ends               Trim fixed number of bases off the start and/or end of every sequence
version                 Print version number and exit
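
As a rough illustration of what two of the simpler descriptions in the table amount to, the sketch below applies the acgtn_only and strip_illumina_suffix descriptions to plain Python strings. The helper functions are hypothetical and are not taken from Fastaq's source; they only mirror the one-line descriptions above, and the real commands operate on whole files.

```python
import re


def acgtn_only(seq):
    """Replace every character that is not one of acgtnACGTN with an N."""
    return re.sub(r"[^acgtnACGTN]", "N", seq)


def strip_illumina_suffix(name):
    """Remove a trailing /1 or /2 from a read name, if present."""
    return re.sub(r"/[12]$", "", name)


if __name__ == "__main__":
    print(acgtn_only("ACGTRYacgtn-N"))
    print(strip_illumina_suffix("read1/1"))
```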

For developers

Here is a template for counting the sequences in a FASTA or FASTQ file:

from pyfastaq import sequences

infile = "in.fastq"  # placeholder path to a FASTA or FASTQ file
seq_reader = sequences.file_reader(infile)
count = 0
for seq in seq_reader:
    count += 1
print(count)

There are plenty of further examples in tasks.py. The input file type, and whether the file is gzipped, is detected automatically. See help(sequences) for the methods already defined in the classes Fasta and Fastq.
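
The same iteration pattern extends to other per-sequence summaries. The sketch below is hedged: it assumes only that each record yielded by the reader has id and seq attributes, and it uses a hypothetical Record namedtuple as a stand-in so that it runs without a sequence file or pyfastaq installed. The function name length_summary is also an illustration and is not part of pyfastaq.

```python
from collections import namedtuple

# Hypothetical stand-in for the records a sequence reader yields.
Record = namedtuple("Record", ["id", "seq"])


def length_summary(records):
    """Return (number of sequences, total bases, id of the longest sequence)."""
    count = 0
    total = 0
    longest_id = None
    longest_len = -1
    for rec in records:
        count += 1
        total += len(rec.seq)
        # Keep the first record seen at the maximum length.
        if len(rec.seq) > longest_len:
            longest_len = len(rec.seq)
            longest_id = rec.id
    return count, total, longest_id


demo = [Record("a", "ACGT"), Record("b", "ACGTACGT"), Record("c", "AC")]
print(length_summary(demo))  # prints (3, 14, 'b')
```

With pyfastaq installed, passing sequences.file_reader(infile) in place of demo should give the same kind of summary for a real file, provided the assumption about the id and seq attributes holds.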

License

Fastaq is free software, licensed under GPLv3.

Feedback/Issues

Please report any issues to the issues page or email path-help@sanger.ac.uk.