INNUca.py
INNUca - Reads Control and Assembly
INNUENDO quality control of reads, de novo assembly and contigs quality assessment, and possible contamination detection
https://github.com/B-UMMI/INNUca
Requirements
- Illumina paired-end reads (paired end information: sampleName_R1_001 / sampleName_R2_001 OR sampleName_1 / sampleName_2) (gzip compressed: .fastq.gz or .fq.gz)
- Expected species name
- Expected genome size in Mb
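The naming convention above can be checked before launching a run. The following is a minimal sketch, not INNUca's own code: the function name `pair_fastq_files` and the regular expression are illustrative assumptions built only from the suffixes listed in the requirements.

```python
import re
from collections import defaultdict

# Accepts sampleName_R1_001 / sampleName_R2_001 or sampleName_1 / sampleName_2,
# compressed as .fastq.gz or .fq.gz. The pattern is an assumption for
# illustration, not the one used by INNUca.
PAIR_PATTERN = re.compile(
    r"^(?P<sample>.+?)_(?:R(?P<long>[12])_001|(?P<short>[12]))\.(?:fastq|fq)\.gz$"
)


def pair_fastq_files(filenames):
    """Group file names into {sample: (forward_file, reverse_file)}."""
    found = defaultdict(dict)
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match is None:
            continue  # not a recognised paired-end file name
        mate = match.group("long") or match.group("short")
        found[match.group("sample")][mate] = name
    # Keep only samples for which both mates were found
    return {
        sample: (mates["1"], mates["2"])
        for sample, mates in found.items()
        if "1" in mates and "2" in mates
    }
```

Samples with only one mate, and files that do not follow either convention, are silently left out of the result in this sketch.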
Dependencies
Mandatory
- Java JDK
- Kraken >= v0.10.6 with Kraken DB (whenever Kraken module should run)
- mlst >= v2.4 (whenever mlst module should run) (it is recommended to use a mlst version with updated databases)
- ReMatCh >= v4.0.1 (whenever true coverage module should run)
- Plotly Python package (whenever insert_size module with --insertSizeDist should run)
- gzip >= v1.6 (normally found in Linux OS)
Optional
(executables are provided, but the user's own executables can be used with the --doNotUseProvidedSoftware option)
- Bowtie2 >= v2.2.9
- Samtools = v1.3.1
- FastQC = v0.11.5
- Trimmomatic >= v0.36 (make sure the .jar file is executable and it is in your PATH)
- Pear = v0.9.10
- SPAdes >= v3.9.0
- Pilon >= v1.18
Installation
Standalone usage
git clone https://github.com/B-UMMI/INNUca.git
cd INNUca
# Temporarily add INNUca to the PATH
export PATH="$(pwd -P):$PATH"
# Permanently add INNUca to the PATH
echo "export PATH=\"$(pwd -P):\$PATH\"" >> ~/.profile
Docker usage
Check here
Usage
usage: INNUca.py [-h] [--version] -s "Streptococcus agalactiae" -g 2.1
(-i /path/to/input/directory/ | -f /path/to/input/file_1.fq.gz /path/to/input/file_2.fq.gz)
[-o /output/directory/] [-j N] [--keepBAM] [--noLog]
[--noGitInfo] [--json] [--jarMaxMemory 10]
[--doNotUseProvidedSoftware]
[--runKraken] [--skipEstimatedCoverage] [--skipTrueCoverage]
[--skipFastQC] [--skipTrimmomatic] [--runPear] [--skipSPAdes]
[--skipAssemblyMapping] [--skipPilon] [--skipMLST]
[--adapters adaptersFile.fasta | --doNotSearchAdapters]
[--krakenDB minikraken_20171013_4GB] [--krakenQuick]
[--krakenProceed] [--krakenIgnoreQC] [--krakenMemory]
[--krakenMinCov 1.5] [--krakenMaxUnclass 1.5]
[--krakenMinQual N]
[--estimatedMinimumCoverage N]
[--trueConfigFile species.config] [--trueCoverageBowtieAlgo="--very-sensitive-local"]
[--trueCoverageProceed] [--trueCoverageIgnoreQC]
[--fastQCkeepFiles] [--fastQCproceed]
[--trimKeepFiles] [--doNotTrimCrops] [--trimCrop N]
[--trimHeadCrop N] [--trimSlidingWindow window:meanQuality]
[--trimLeading N] [--trimTrailing N] [--trimMinLength N]
[--spadesVersion 3.13.0] [--spadesNotUseCareful] [--spadesNotUseIsolate]
[--spadesMinContigsLength N] [--spadesMaxMemory N]
[--spadesMinCoverageAssembly N] [--spadesMinKmerCovContigs N]
[--spadesKmers 55 77 [55 77 ...] | --spadesDefaultKmers]
[--assemblyMinCoverageContigs N] [--maxNumberContigs N]
[--saveExcludedContigs] [--keepIntermediateAssemblies]
[--keepSPAdesScaffolds]
[--pilonKeepFiles]
[--mlstIgnoreQC]
[--pearKeepFiles] [--pearMinOverlap N]
INNUca - Reads Control and Assembly
optional arguments:
-h, --help show this help message and exit
--version Version information
Required options:
-s "Streptococcus agalactiae", --speciesExpected "Streptococcus agalactiae"
Expected species name (default: None)
-g 2.1, --genomeSizeExpectedMb 2.1
Expected genome size in Mb (default: None)
Required INPUT options (one of the following):
-i /path/to/input/directory/, --inputDirectory /path/to/input/directory/
Path to directory containing the fastq files. Can be
organized in separate directories by sample or all
together (default: None)
-f /path/to/input/file_1.fq.gz /path/to/input/file_2.fq.gz, --fastq /path/to/input/file_1.fq.gz /path/to/input/file_2.fq.gz
Path to paired-end FASTQ files (default: None)
General options:
-o /output/directory/, --outdir /output/directory/
Path for output directory (default: .)
-j N, --threads N Number of threads (default: 1)
--keepBAM Keep the last BAM file produced (with mapped and
unmapped reads) (default: False)
--noLog Do not create a log file (default: False)
--noGitInfo Do not retrieve GitHub repository information
(default: False)
--json Tells INNUca to save the results also in json format
(default: False)
--jarMaxMemory 10 Sets the maximum RAM usage in Gb for jar files
(Trimmomatic and Pilon). Can also be auto or off. When
auto is set, 1 Gb per thread will be used up to the
free available memory. (default: off)
--doNotUseProvidedSoftware
Tells the software to not use FastQC, Trimmomatic,
SPAdes, Bowtie2, Samtools and Pilon that are provided
with INNUca.py (default: False)
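The "auto" rule described for --jarMaxMemory (a fixed amount of RAM per thread, capped at the free memory) can be sketched as below. This is an illustration under stated assumptions, not INNUca's implementation: the function name `auto_memory_gb` is hypothetical, and the free memory is passed in by the caller rather than queried from the system.

```python
def auto_memory_gb(threads, gb_per_thread, free_gb):
    """Return threads * gb_per_thread, capped at the free memory (all in Gb)."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return min(threads * gb_per_thread, free_gb)


# 8 threads at 1 Gb per thread on a machine with 16 Gb free
print(auto_memory_gb(threads=8, gb_per_thread=1, free_gb=16))
# Same request on a machine with only 4 Gb free: the cap applies
print(auto_memory_gb(threads=8, gb_per_thread=1, free_gb=4))
```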
Running modules options:
--runKraken Sets INNUca to run Kraken (default: False)
--skipEstimatedCoverage
Tells the programme to not estimate coverage depth
based on number of sequenced nucleotides and expected
genome size (default: False)
--skipTrueCoverage Tells the programme to not run trueCoverage_ReMatCh
analysis (default: False)
--skipFastQC Tells the programme to not run FastQC analysis
(default: False)
--skipTrimmomatic Tells the programme to not run Trimmomatic (default:
False)
--runPear Tells the programme to run Pear (default: False)
--skipSPAdes Tells the programme to not run SPAdes and consequently
all the modules that are assembly based (Assembly
Mapping check, Pilon correction, MLST analysis and
Kraken on the assembly) (default: False)
--skipAssemblyMapping
Tells the programme to not run Assembly Mapping check
(default: False)
--skipPilon Tells the programme to not run Pilon correction
(default: False)
--skipMLST Tells the programme to not run MLST analysis (default:
False)
--runInsertSize Runs the insert_size module at the end (default: False)
Adapters options (one of the following):
Control how adapters are handled by INNUca. If none of these options is provided, INNUca
will use the Nextera XT and PE TruSeq files found at INNUca/src/Trimmomatic-0.36/adapters/
--adapters adaptersFile.fasta
Fasta file containing adapters sequences to be used in
FastQC and Trimmomatic (default: None)
--doNotSearchAdapters
Tells INNUca.py to not search for adapters and clip
them during Trimmomatic step (default: False)
Kraken options:
--krakenDB minikraken_20171013_4GB
Name of Kraken DB found in path, or complete path to
the directory containing the Kraken DB files (for
example /path/to/directory/minikraken_20171013_4GB)
(default: None)
--krakenQuick Set Kraken to do a quick operation and only use the
first hits (default: False)
--krakenProceed Do not stop INNUca.py if sample fails Kraken (default:
False)
--krakenIgnoreQC Ignore Kraken QA/QC in sample quality assessment.
Useful when analysing data from possible new species
or higher taxonomic levels (higher than species)
(default: False)
--krakenMemory Set Kraken to load the DB into the memory before run
(default: False)
--krakenMinCov 1.5 Minimum percentage of fragments covered to consider
the taxon. If nothing is specified, one hundredth of
the percentage of fragments covered by the taxon
(species, or genus if no species are available for a
given genus) with the highest percentage of fragments
covered (excluding the unclassified category) will be
used. (default: None)
--krakenMaxUnclass 1.5
Maximum percentage of unclassified fragments allowed.
If nothing is specified, one tenth of 100 minus the
percentage of fragments covered by the taxon (species,
or genus if no species are available for a given
genus) with the highest percentage of fragments
covered (excluding the unclassified category) will be
used. (default: None)
--krakenMinQual N Sets the minimum base quality to be used in
classification (default: 10)
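The default thresholds described for --krakenMinCov and --krakenMaxUnclass can be worked through with a small example. This sketch follows one reading of the help text (both defaults derive from the percentage of fragments assigned to the best-covered taxon); the function name `default_kraken_thresholds` is hypothetical and the code is not taken from INNUca.

```python
def default_kraken_thresholds(top_taxon_percentage):
    """Derive default thresholds from the best-covered taxon.

    top_taxon_percentage: percentage of fragments covered by the taxon
    with the highest coverage, excluding the unclassified category.
    Returns (min_cov, max_unclass), both as percentages.
    """
    if not 0 <= top_taxon_percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    min_cov = top_taxon_percentage / 100.0  # one hundredth of the top taxon
    max_unclass = (100.0 - top_taxon_percentage) / 10.0  # one tenth of the rest
    return min_cov, max_unclass


# With 90% of fragments assigned to the expected species:
print(default_kraken_thresholds(90.0))
```

Under this reading, a sample dominated by one taxon gets a strict limit on unclassified fragments, while a more mixed sample gets a looser one.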
Estimated Coverage options:
This module estimates the depth of coverage by dividing the number of
sequenced nucleotides (raw or processed reads) by the expected genome size
(in bp)
--estimatedMinimumCoverage N
Minimum estimated coverage to continue INNUca pipeline
(default: 15)
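The estimate described above is a single division. The sketch below assumes the genome size is given in Mb (as with -g) and that a sample continues when its coverage is at least the minimum; the function names are hypothetical and the exact comparison used by INNUca is not stated in this document.

```python
def estimated_coverage(sequenced_nucleotides, genome_size_mb):
    """Depth of coverage = sequenced nucleotides / expected genome size in bp."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return sequenced_nucleotides / (genome_size_mb * 1_000_000)


def passes_minimum_coverage(coverage, minimum=15):
    """Assumed rule: continue when coverage is at least the minimum."""
    return coverage >= minimum


# 63 million sequenced nucleotides for a 2.1 Mb genome
coverage = estimated_coverage(63_000_000, 2.1)
print(round(coverage, 1), passes_minimum_coverage(coverage))
```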
trueCoverage_ReMatCh options:
This module calculates an improved estimation of the true bacterial
chromosome coverage via read mapping against reference gene sequences
distributed throughout the genome. This approach alleviates coverage
estimation bias introduced by mobile genetic elements and other similar
occurrences. Moreover, this module can also detect multiple strains or
species contamination by searching for heterozygous positions. INNUca
provides target sequences for some species together with the desired
settings and the QA/QC decision rules (in
INNUca/modules/trueCoverage_rematch/). If the expected species matches any
of the species provided files, trueCoverage_ReMatCh module will run.
--trueConfigFile species.config
File with trueCoverage_ReMatCh settings. Some species
specific config files can be found in
INNUca/modules/trueCoverage_rematch/ folder. Use those
files as example files. For species with config files
in INNUca/modules/trueCoverage_rematch/ folder (not
pre-release versions, marked with "pre."),
trueCoverage_ReMatCh will run by default, unless
--skipTrueCoverage is specified. Do not use together
with --skipTrueCoverage option (default: None)
--trueCoverageBowtieAlgo="--very-sensitive-local"
Bowtie2 alignment mode to be used via ReMatCh to map the reads and
determine the true coverage. It can be an end-to-end alignment
(unclipped alignment) or a local alignment (soft-clipped
alignment). You can also choose between fast or sensitive
alignments. Please check the Bowtie2 manual for extra information:
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml .
This option should be provided between quotes and starting
with an empty space (like --trueCoverageBowtieAlgo " --very-fast") or
using an equal sign (like --trueCoverageBowtieAlgo="--very-fast")
(default: "--very-sensitive-local")
--trueCoverageProceed Do not stop INNUca.py if sample fails
trueCoverage_ReMatCh (default: False)
--trueCoverageIgnoreQC
Ignore trueCoverage_ReMatCh QA/QC in sample quality
assessment. Useful when analysing data from species
with unknown behaviour (default: False)
FastQC options:
--fastQCkeepFiles Tells INNUca.py to not remove the output of FastQC
(default: False)
--fastQCproceed Do not stop INNUca.py if sample fails FastQC (default:
False)
Trimmomatic options:
--trimVersion 0.38
Tells INNUca.py which Trimmomatic version to use (available
options: 0.36, 0.38) (default: 0.38)
--trimKeepFiles Tells INNUca.py to not remove the output of
Trimmomatic (default: False)
--doNotTrimCrops Tells INNUca.py to not cut the beginning and end of
reads during Trimmomatic step (unless specified with
--trimCrop or --trimHeadCrop, INNUca.py will search
for nucleotide content bias at both ends and will cut
at those positions) (default: False)
--trimCrop N Cut the specified number of bases from the end of the
reads (counted from the maximum read length). By default, the number of bases
to cut is calculated using sample FastQC results and
based on the G/C content. The first position of the
second half of the reads with GC bias (80 < GC
percentage > 120) followed by at least two other
biased positions, are marked to be trimmed. (default:
None)
--trimHeadCrop N Trimmomatic: cut the specified number of bases from
the start of the reads. By default, the number of
bases to cut is calculated using sample FastQC results
and based on the G/C content. The first position of
the first half of the reads with GC bias (80 < GC
percentage > 120) followed by at least two other
unbiased positions, are marked to be trimmed.
(default: None)
--trimSlidingWindow window:meanQuality
Trimmomatic: perform a sliding window trimming,
cutting once the average quality within the window
falls below a threshold (default: 5:20)
--trimLeading N Trimmomatic: cut bases off the start of a read, if
below a threshold quality (default: 3)
--trimTrailing N Trimmomatic: cut bases off the end of a read, if below
a threshold quality (default: 3)
--trimMinLength N Trimmomatic: drop the read if it is below a specified
length (default: 55)
SPAdes options:
--spadesVersion 3.14.0
Tells INNUca.py which SPAdes version to use (available
options: 3.11.1, 3.13.0, 3.14.0) (default: 3.14.0)
--spadesNotUseCareful
Tells SPAdes to perform the assembly without the --careful option.
When the SPAdes --isolate option is allowed to be used (for SPAdes >= v3.14.0
and when the INNUca --spadesNotUseIsolate option is not used) and the
estimated depth of coverage is >= 100x, the SPAdes --careful option is not used
anyway. (default: False)
--spadesNotUseIsolate
Tells SPAdes to not use the --isolate option (only possible for SPAdes >= v3.14.0).
The SPAdes --isolate option is used when the estimated depth of coverage
is >= 100x (unless the INNUca --spadesNotUseIsolate is used) and automatically
turns on the INNUca --spadesNotUseCareful option, so the SPAdes --careful
option is not used.
According to SPAdes, the --isolate option is highly recommended for
high-coverage isolate and multi-cell data (improves the assembly quality and
running time). (default: False)
--spadesMinContigsLength N
Filter SPAdes contigs for length greater than or equal to
this value (default: maximum reads size or 200 bp)
(default: None)
--spadesMaxMemory N The maximum amount of RAM Gb for SPAdes to use
(default: 2 Gb per thread will be used up to the free
available memory) (default: None)
--spadesMinCoverageAssembly N
The minimum number of reads to consider an edge in the
de Bruijn graph during the assembly. Can also be auto
or off (default: 2)
--spadesMinKmerCovContigs N
Minimum contigs k-mer coverage. After assembly only
keep contigs with reported k-mer coverage equal to or
above this value (default: 2)
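The interaction between --spadesNotUseCareful, --spadesNotUseIsolate, the SPAdes version and the estimated coverage can be summarised as a small decision function. This is a sketch of the rules as described in the help text above, assuming --isolate is available from SPAdes v3.14.0 onwards; the function name `spades_mode_flags` is hypothetical and this is not INNUca's own code.

```python
def spades_mode_flags(spades_version, estimated_coverage,
                      not_use_isolate=False, not_use_careful=False):
    """Return the SPAdes mode flags implied by the described rules."""
    version = tuple(int(part) for part in spades_version.split("."))
    # --isolate needs SPAdes >= 3.14.0, coverage >= 100x and no opt-out
    use_isolate = (
        version >= (3, 14, 0)
        and estimated_coverage >= 100
        and not not_use_isolate
    )
    # --careful is dropped when --isolate is used or when the user opts out
    use_careful = not use_isolate and not not_use_careful
    flags = []
    if use_isolate:
        flags.append("--isolate")
    if use_careful:
        flags.append("--careful")
    return flags


print(spades_mode_flags("3.14.0", 120))  # high coverage, recent SPAdes
print(spades_mode_flags("3.13.0", 120))  # same coverage, older SPAdes
```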
SPAdes k-mers options (one of the following):
--spadesKmers 55 77 [55 77 ...]
Manually sets SPAdes k-mers lengths (all values must
be odd, lower than 128) (default values: reads length
>= 175 [55, 77, 99, 113, 127]; reads length < 175 [21,
33, 55, 67, 77]) (default: None)
--spadesDefaultKmers Tells INNUca to use SPAdes default k-mers (default:
False)
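The k-mer rules above (default sets chosen by read length, manual values odd and lower than 128) can be sketched as follows. The function names are hypothetical, and which read length INNUca compares against 175 (for example the maximum read length) is an assumption left to the caller.

```python
def default_kmers(read_length):
    """Default k-mer set, chosen by read length as listed in the help text."""
    if read_length >= 175:
        return [55, 77, 99, 113, 127]
    return [21, 33, 55, 67, 77]


def validate_kmers(kmers):
    """Check manually supplied k-mers: all must be odd and lower than 128."""
    invalid = [k for k in kmers if k % 2 == 0 or k >= 128 or k < 1]
    if invalid:
        raise ValueError(f"invalid k-mer values: {invalid}")
    return sorted(set(kmers))


print(default_kmers(250))
print(validate_kmers([77, 55]))
```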
Assembly Mapping options:
--assemblyMinCoverageContigs N
Minimum contigs average coverage. After mapping reads
back to the contigs, only keep contigs with at least
this average coverage (default: 1/3 of the assembly
mean coverage or 10x) (default: None)
Assembly options:
--maxNumberContigs N Maximum number of contigs per 1.5 Mb of expected
genome size (default: 100)
--saveExcludedContigs
Tells INNUca.py to save excluded contigs (default:
False)
--keepIntermediateAssemblies
Tells INNUca to keep all the intermediate assemblies
(default: False)
--keepSPAdesScaffolds
Tells INNUca to keep SPAdes scaffolds (default: False)
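Because --maxNumberContigs is expressed per 1.5 Mb, the effective limit scales with the expected genome size. The sketch below shows that arithmetic; the function name `max_allowed_contigs` is hypothetical, and any rounding INNUca applies to the result is not described in this document, so the value is returned unrounded.

```python
def max_allowed_contigs(genome_size_mb, max_per_1_5_mb=100):
    """Scale the per-1.5 Mb contig limit to the expected genome size."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return genome_size_mb / 1.5 * max_per_1_5_mb


# A 3 Mb genome with the default limit of 100 contigs per 1.5 Mb
print(max_allowed_contigs(3.0))
```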
Pilon options:
--pilonVersion 1.18
Tells INNUca.py which Pilon version to use (available
options: 1.18, 1.23) (default: 1.23)
--pilonKeepFiles Tells INNUca.py to not remove the output of Pilon
(default: False)
MLST options:
--mlstIgnoreQC Ignore MLST QA/QC in sample quality assessment. Useful
when analysing data from possible new species or
higher taxonomic levels (higher than species)
(default: False)
insert_size options:
This module determines the sequencing insert size by mapping the reads used
in the assembly back to the produced assembly itself.
--insertSizeDist Produces a distribution plot of the insert sizes (requires
Plotly) (default: False)
Pear options:
--pearKeepFiles Tells INNUca.py to not remove the output of Pear
(default: False)
--pearMinOverlap N Minimum nucleotide overlap between read pairs for Pear
to assemble them into a single read (default: 2/3 of
maximum reads length determined using FastQC, or
Trimmomatic minimum reads length if it runs, or 33
nts) (default: None)
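When INNUca is driven from a script, the command line can be assembled as a list of arguments. The sketch below only builds and prints the command, using flags listed in the help above (-s, -g, -f, -o, -j, --json); the helper name `build_innuca_command` and all paths are hypothetical placeholders.

```python
import shlex


def build_innuca_command(species, genome_size_mb, fastq_pair,
                         outdir=".", threads=1, save_json=False):
    """Assemble an INNUca.py command line as a list of arguments."""
    forward, reverse = fastq_pair
    command = [
        "INNUca.py",
        "-s", species,
        "-g", str(genome_size_mb),
        "-f", forward, reverse,
        "-o", outdir,
        "-j", str(threads),
    ]
    if save_json:
        command.append("--json")
    return command


# Placeholder paths; replace with real files before running anything
command = build_innuca_command(
    "Streptococcus agalactiae",
    2.1,
    ("/path/to/input/file_1.fq.gz", "/path/to/input/file_2.fq.gz"),
    outdir="/output/directory/",
    threads=4,
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once INNUca.py is on the PATH; passing a list rather than a single string keeps the quoted species name as one argument.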
INNUca's poster presented at Bioinformatics Open Days 2017, Braga, Portugal (February 23-24)
Combine INNUca reports
In order to combine INNUca reports (Estimated Coverage, True Coverage, Pear, SPAdes, Assembly Mapping, Pilon, MLST), use combine_reports.py, found in the INNUca modules folder.
usage: python combine_reports.py [-h] [--version] -i
/path/to/INNUca/output/directory/
[-o /path/to/output/directory/]
Combine INNUca reports (Estimated Coverage, True Coverage, Pear, SPAdes, Assembly
Mapping, Pilon, MLST)
optional arguments:
-h, --help show this help message and exit
--version Version information
Required options:
-i /path/to/INNUca/output/directory/, --innucaOut /path/to/INNUca/output/directory/
Path to INNUca output directory (default: None)
Facultative options:
-o /path/to/output/directory/, --outdir /path/to/output/directory/
Path to where to store the outputs (default: ['.'])
Combine trueCoverage_ReMatCh module reports
In order to manually combine INNUca trueCoverage_ReMatCh module reports with respect to gene information, use combine_trueCoverage_reports.py, found in the INNUca modules/trueCoverage_rematch folder.
usage: python combine_trueCoverage_reports.py [-h] [--version] -i
/path/to/INNUca/output/directory/
[-o /path/to/output/directory/]
[--minimum_gene_coverage 80]
Combine trueCoverage_ReMatCh module reports in respect to gene information.
optional arguments:
-h, --help show this help message and exit
--version Version information
Required options:
-i /path/to/INNUca/output/directory/, --innucaOut /path/to/INNUca/output/directory/
Path to INNUca output directory (default: None)
Facultative options:
-o /path/to/output/directory/, --outdir /path/to/output/directory/
Path to where to store the outputs (default: .)
--minimum_gene_coverage 80
Minimum percentage of sequence length (with a minimum
of read depth to consider a position to be present) to
determine whether a gene is present. (default: 80)
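The presence rule behind --minimum_gene_coverage can be illustrated with per-position read depths. This is a sketch under assumptions, not the script's own code: the function name `gene_is_present` is hypothetical, and the `minimum_depth` parameter stands in for the "minimum of read depth to consider a position to be present", whose value is not given in this document.

```python
def gene_is_present(position_depths, minimum_depth, minimum_gene_coverage=80):
    """Decide whether a gene is present from its per-position read depths.

    A position counts as present when its depth is at least minimum_depth;
    the gene is present when the percentage of present positions reaches
    minimum_gene_coverage.
    """
    if not position_depths:
        return False
    present = sum(1 for depth in position_depths if depth >= minimum_depth)
    percentage = 100.0 * present / len(position_depths)
    return percentage >= minimum_gene_coverage


# 8 of 10 positions covered: exactly 80% of the sequence length
print(gene_is_present([10] * 8 + [0] * 2, minimum_depth=1))
```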
Citation
MP Machado, J Halkilahti, A Jaakkonen, DN Silva, I Mendes, Y Nalbantoglu, V Borges, M Ramirez, M Rossi, JA Carriço. INNUca GitHub https://github.com/B-UMMI/INNUca
Contact
Miguel Machado mpmachado@medicina.ulisboa.pt