assembly-scan reads an assembly in FASTA format and outputs summary statistics in TSV or JSON format.
assembly-scan
I wanted a quick method to output simple summary statistics of an input assembly in TSV or JSON format. There are alternatives, including assemblathon-stats.pl and assembly-stats, but they didn't output what I wanted. Thus assembly-scan was born.
Installation
Bioconda
assembly-scan is available on Bioconda.
conda create -n assembly-scan -c conda-forge -c bioconda assembly-scan
From Source
While I will always recommend using the Bioconda installation, the only dependency assembly-scan has is Python >=3.7. So, if you already have that, you can use the script directly.
git clone git@github.com:rpetit3/assembly-scan.git
cd assembly-scan
python3 bin/assembly-scan YOUR_ASSEMBLY.fasta
From there you can decide to add it to your PATH or not. But, again, I recommend just going the Bioconda route.
Usage
assembly-scan requires an assembly, gzip compressed or uncompressed, as input.
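
As an illustration of how a script can accept both kinds of input, here is a minimal sketch that picks the file opener from the `.gz` extension and collects contig lengths. The function names `open_assembly` and `contig_lengths` are hypothetical and are not taken from the assembly-scan source; the actual implementation may differ.

```python
import gzip


def open_assembly(path):
    """Open a FASTA file as text, using gzip when the name ends in .gz.

    Hypothetical helper: it only looks at the extension, not the file contents.
    """
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def contig_lengths(path):
    """Return the length of every contig in the FASTA file, in file order."""
    lengths = []
    current = None  # length of the contig being read; None before the first header
    with open_assembly(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous contig, if there was one.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths
```

An extension check is cheap and predictable, but it would misread a gzip file saved without the `.gz` suffix.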
Usage
usage: assembly-scan [-h] [--json] [--transpose] [--prefix PREFIX] [--version] ASSEMBLY
Generate statistics for a given assembly.
positional arguments:
ASSEMBLY FASTA file to read (gzip or uncompressed)
options:
-h, --help show this help message and exit
--json Print output in a JSON format
--transpose Print output in a transposed tab-delimited format
--prefix PREFIX ID to use for output (Default: basename of assembly)
--version show program's version number and exit
Example Usage
Many FASTA files are available in the test directory. These include an uncompressed complete phiX174 genome and a compressed Staphylococcus aureus assembly. This script reads the input and outputs summary statistics in tab-delimited format to STDOUT.
Uncompressed
By default assembly-scan outputs the results in tab-delimited format. For this example the --transpose option is used, because the transposed output is easier to read in the README.
assembly-scan test/phiX174.fna --transpose
test/phiX174.fna sample phiX174.fna
test/phiX174.fna total_contig 1
test/phiX174.fna total_contig_length 5386
test/phiX174.fna max_contig_length 5386
test/phiX174.fna mean_contig_length 5386
test/phiX174.fna median_contig_length 5386
test/phiX174.fna min_contig_length 5386
test/phiX174.fna n50_contig_length 5386
test/phiX174.fna l50_contig_count 1
test/phiX174.fna num_contig_non_acgtn 0
test/phiX174.fna contig_percent_a 23.97
test/phiX174.fna contig_percent_c 21.48
test/phiX174.fna contig_percent_g 23.28
test/phiX174.fna contig_percent_t 31.27
test/phiX174.fna contig_percent_n 0.00
test/phiX174.fna contig_non_acgtn 0.00
test/phiX174.fna contigs_greater_1m 0
test/phiX174.fna contigs_greater_100k 0
test/phiX174.fna contigs_greater_10k 0
test/phiX174.fna contigs_greater_1k 1
test/phiX174.fna percent_contigs_greater_1m 0.00
test/phiX174.fna percent_contigs_greater_100k 0.00
test/phiX174.fna percent_contigs_greater_10k 0.00
test/phiX174.fna percent_contigs_greater_1k 100.00
gzip Compressed
assembly-scan includes a simple check (.gz extension) for gzip compressed assemblies. This example also demonstrates the output of the --json option.
assembly-scan test/saureus.fasta.gz --json
{
"sample": "saureus.fasta.gz",
"total_contig": 139,
"total_contig_length": 2761520,
"max_contig_length": 269921,
"mean_contig_length": 19867,
"median_contig_length": 163,
"min_contig_length": 56,
"n50_contig_length": 86756,
"l50_contig_count": 9,
"num_contig_non_acgtn": 0,
"contig_percent_a": "33.74",
"contig_percent_c": "16.50",
"contig_percent_g": "16.21",
"contig_percent_t": "33.54",
"contig_percent_n": "0.00",
"contig_non_acgtn": "0.00",
"contigs_greater_1m": 0,
"contigs_greater_100k": 7,
"contigs_greater_10k": 37,
"contigs_greater_1k": 49,
"percent_contigs_greater_1m": "0.00",
"percent_contigs_greater_100k": "5.04",
"percent_contigs_greater_10k": "26.62",
"percent_contigs_greater_1k": "35.25"
}
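
In the JSON above, counts and lengths are numbers while the percentage fields are quoted strings. The following sketch shows one way to load such output in Python and convert the quoted values to floats. The `load_stats` helper is hypothetical, and the embedded JSON is a shortened copy of the example output rather than a complete record.

```python
import json

# Shortened copy of the example output above (only a few keys kept).
raw = """
{
    "sample": "saureus.fasta.gz",
    "total_contig": 139,
    "total_contig_length": 2761520,
    "contig_percent_c": "16.50",
    "contig_percent_g": "16.21",
    "contig_non_acgtn": "0.00"
}
"""


def load_stats(text):
    """Parse the JSON and turn quoted numeric fields into floats.

    Assumption: every string value except "sample" holds a number.
    """
    stats = {}
    for key, value in json.loads(text).items():
        if key != "sample" and isinstance(value, str):
            value = float(value)
        stats[key] = value
    return stats


stats = load_stats(raw)
gc_percent = stats["contig_percent_g"] + stats["contig_percent_c"]
print(f"{stats['sample']}: {stats['total_contig']} contigs, GC {gc_percent:.2f}%")
```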
Output Columns
Column | Description |
---|---|
sample | Either assembly file basename, or value of --prefix |
total_contig | Total number of contigs in the assembly |
total_contig_length | Sum of all contig lengths |
max_contig_length | Length of the longest contig |
mean_contig_length | Average length of all contigs |
median_contig_length | Median length of all contigs |
min_contig_length | Length of the smallest contig |
n50_contig_length | N50 length of the contigs |
l50_contig_count | L50: the smallest number of contigs whose combined length is at least half the total assembly length |
num_contig_non_acgtn | Number of contigs containing characters other than A, C, G, T, or N |
contig_percent_a | Percent of A nucleotides in contigs |
contig_percent_c | Percent of C nucleotides in contigs |
contig_percent_g | Percent of G nucleotides in contigs |
contig_percent_t | Percent of T nucleotides in contigs |
contig_percent_n | Percent of N nucleotides in contigs |
contig_non_acgtn | Percent of nucleotides in contigs that are not A, C, G, T, or N |
contigs_greater_1m | Number of contigs greater than 1,000,000 bp |
contigs_greater_100k | Number of contigs greater than 100,000 bp |
contigs_greater_10k | Number of contigs greater than 10,000 bp |
contigs_greater_1k | Number of contigs greater than 1,000 bp |
percent_contigs_greater_1m | Percent of contigs greater than 1,000,000 bp |
percent_contigs_greater_100k | Percent of contigs greater than 100,000 bp |
percent_contigs_greater_10k | Percent of contigs greater than 10,000 bp |
percent_contigs_greater_1k | Percent of contigs greater than 1,000 bp |
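
To make the N50 and L50 columns concrete, here is a minimal sketch using the common definition: sort contigs from longest to shortest and accumulate lengths until at least half of the total is reached. The function name `n50_l50` is hypothetical, and this is not necessarily how assembly-scan computes the values internally.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total, taken from
    longest to shortest, first reaches half of the assembly length.
    L50 is the number of contigs needed to get there.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= half:
            return length, count


# A single-contig assembly such as the phiX174 example: N50 is that contig's
# length and L50 is 1, matching the example output above.
print(n50_l50([5386]))
```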
Naming
Originally this was named assembly-stats, but a quick Google search (which, once again, I didn't do beforehand; I really should do better!) turned up another assembly-stats from Sanger Pathogens. So I decided to rename it to assembly-scan, similar to my fastq-scan tool, since this process is similar to the Scan ability found in some video games, movies, TV shows, etc. In other words, it 'scans' an assembly and provides the user with otherwise hidden information about the assembly.