plasmidID <img align="left" src="https://github.com/BU-ISCIII/plasmidID/blob/develop/img/plasmidID_logo.png" alt="Logo" width="100">
<br> <br>
- Introduction
- Requirements
- Installation
- Quick usage
- Usage
- Output
- Annotation file
- Illustrated pipeline
- Docker
Introduction
PlasmidID is a mapping-based, assembly-assisted plasmid identification tool that analyzes sequencing data and provides a graphical solution for plasmid identification.
PlasmidID is a computational pipeline implemented in BASH that maps Illumina reads against plasmid database sequences. The k-mer filtered, most covered sequences are clustered by identity to avoid redundancy, and the longest ones are used as scaffolds for plasmid reconstruction. Reads are assembled and annotated by both automatic and specific annotation. All information generated from the mapping, assembly, annotation and local alignment analyses is gathered and represented in a circular image, which allows the user to determine the plasmid composition of any bacterial sample.
Requirements
Software
- Python >=3.6
- Trimmomatic v0.33 (Optional)
- Spades v3.8 (Optional)
- Perl v5.26.0
- NCBI BLAST+ v2.2.3
- Bedtools v2.25
- Bowtie 2 v2.2.4
- SAMtools v1.2
- prokka v1.12
- cd-hit v4.6.6 (no longer needed since v1.6)
- circos v0.69.3
- mash v2.2
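Before running the pipeline, it can help to confirm that the tools listed above are reachable from `$PATH`. The following is a minimal sketch, not part of plasmidID: `check_deps` is a hypothetical helper, and the executable names you pass to it (for example `bowtie2`, `samtools`, `blastn`) are assumptions about how each tool is installed on your system.

```shell
# Report which of the given executables are missing from $PATH.
# Returns 0 when every tool is found and 1 otherwise.
# check_deps is a hypothetical helper, not shipped with plasmidID.
check_deps() {
    deps_missing=0
    for deps_tool in "$@"; do
        if ! command -v "$deps_tool" >/dev/null 2>&1; then
            echo "missing: $deps_tool" >&2
            deps_missing=1
        fi
    done
    return "$deps_missing"
}
```

It could be called as `check_deps python3 perl blastn bedtools bowtie2 samtools prokka circos mash`, adjusting the names to match your installation.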
Plasmid database
Since v1.5.1, the plasmid database can be downloaded with the following command:
download_plasmid_database.py -o FOLDER
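If the download is repeated over time, the folder may hold several database files. The sketch below picks the newest one. It assumes the files follow the `YYYY-MM-DD_plasmids.fasta` naming used in the Quick usage examples, so that alphabetical order matches date order; `latest_plasmid_db` is a hypothetical helper name.

```shell
# Print the most recent database FASTA found in folder $1.
# Assumes files are named YYYY-MM-DD_plasmids.fasta, so the sorted glob
# expansion lists them from oldest to newest.
latest_plasmid_db() {
    db_latest=""
    for db_file in "$1"/*_plasmids.fasta; do
        [ -e "$db_file" ] || continue   # skip the unexpanded pattern
        db_latest=$db_file              # last match is the newest date
    done
    [ -n "$db_latest" ] && printf '%s\n' "$db_latest"
}
```

The function returns a non-zero status when no matching file exists, so the result can be checked before it is passed to `-d`.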
Installation
Install from source
Install all dependencies and add them to $PATH
git clone https://github.com/BU-ISCIII/plasmidID.git
Add plasmidID and ./bin to $PATH
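One way to do this for the current shell session is sketched below. It assumes the repository was cloned into the current directory; `prepend_path` is a hypothetical helper that avoids adding the same directory twice.

```shell
# Prepend a directory to PATH unless it is already listed.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Assumption: the clone lives in ./plasmidID
prepend_path "$PWD/plasmidID"
prepend_path "$PWD/plasmidID/bin"
```

To make the change permanent, the same lines can be placed in your shell start-up file with absolute paths.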
Install using conda
This option is recommended.
Install Anaconda3
conda install -c conda-forge -c bioconda plasmidid
Wait for the environment to solve
Ignore warnings/errors
Use Docker
Example: Clone the repo:
git clone git@github.com:BU-ISCIII/plasmidID.git
cd plasmidID
Run it with the test data using docker:
Note that the input files MUST be in your present working directory or in a folder inside it. For example, if I run this command in /home/smonzon, the folder with my files would be /home/smonzon/test.
docker run -v $PWD:$PWD -w $PWD buisciii/plasmidid plasmidID \
-1 test/KPN_TEST_R1.fastq.gz \
-2 test/KPN_TEST_R2.fastq.gz \
-d test/plasmids_TEST_database.fasta \
-c test/contigs_KPN_TEST.fasta \
--no-trim \
-s KPN
Quick usage
Illumina paired-end
plasmidID \
-1 SAMPLE_R1.fastq.gz \
-2 SAMPLE_R2.fastq.gz \
-d YYYY-MM-DD_plasmids.fasta \
-c SAMPLE_assembled_contigs.fasta \
--no-trim \
-s SAMPLE
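For several paired-end samples, the command above can be generated per sample. The sketch below is a dry run that only prints commands. It assumes a hypothetical `reads/` folder and a `<SAMPLE>_R1.fastq.gz` / `<SAMPLE>_R2.fastq.gz` naming scheme; the helper names are illustrative. Since `-c` is omitted, plasmidID would assemble the reads itself.

```shell
# Derive the sample name from an R1 file name of the form
# <SAMPLE>_R1.fastq.gz (assumed naming scheme).
sample_from_r1() {
    r1_base=${1##*/}                        # strip the directory part
    printf '%s\n' "${r1_base%_R1.fastq.gz}"
}

# Print the plasmidID command for one R1 file ($1) and one database ($2).
print_plasmidid_cmd() {
    cmd_r1=$1
    cmd_db=$2
    cmd_sample=$(sample_from_r1 "$cmd_r1")
    cmd_r2=${cmd_r1%_R1.fastq.gz}_R2.fastq.gz
    printf 'plasmidID -1 %s -2 %s -d %s -s %s --no-trim\n' \
        "$cmd_r1" "$cmd_r2" "$cmd_db" "$cmd_sample"
}

# Dry run: one command per R1 file found in reads/ (assumed folder).
for r1 in reads/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue
    print_plasmidid_cmd "$r1" YYYY-MM-DD_plasmids.fasta
done
```

Printing the commands first makes it easy to review them before piping the output to `sh`.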
SMRT sequencing (only contigs)
plasmidID \
-d YYYY-MM-DD_plasmids.fasta \
-c SAMPLE_contigs.fasta \
-s SAMPLE
Annotate any fasta you want
plasmidID \
-d YYYY-MM-DD_plasmids.fasta \
-c SAMPLE_assembled_contigs.fasta \
-a annotation_file \
-s SAMPLE
More info about annotation file
If there are several samples in the same GROUP folder
summary_report_pid.py -i NO_GROUP/
Usage
usage : plasmidID <-1 R1> <-2 R2> <-d database(fasta)> <-s sample_name> [-g group_name] [options]
Mandatory input data:
-1 | --R1 <filename> reads corresponding to paired-end R1 (mandatory)
-2 | --R2 <filename> reads corresponding to paired-end R2 (mandatory)
-d | --database <filename> database to map and reconstruct (mandatory)
-s | --sample <string> sample name (mandatory), less than 37 characters
Optional input data:
-g | --group <string> group name (optional). If unset, samples will be gathered in NO_GROUP group
-c | --contigs <filename> file with contigs. If supplied, plasmidID will not assemble reads
-a | --annotate <filename> file with configuration file for specific annotation
-o <output_dir> output directory, by default is the current directory
Pipeline options:
--explore Relaxes default parameters to find less reliable relationships within data supplied and database
--only-reconstruct Database supplied will not be filtered and all sequences will be used as scaffold
This option does not require R1 and R2, instead a contig file can be supplied
-w Undo winner takes it all algorithm when clustering by kmer - QUICKER MODE
Trimming:
--trimmomatic-directory Indicate directory holding trimmomatic .jar executable
--no-trim Reads supplied will not be quality trimmed
Coverage and Clustering:
-C | --coverage-cutoff <int> minimum coverage percentage to select a plasmid as scaffold (0-100), default 80
-S | --coverage-summary <int> minimum coverage percentage to include plasmids in summary image (0-100), default 90
-f | --cluster <int> kmer identity to cluster plasmids into the same representative sequence (0 means identical) (0-1), default 0.5
-k | --kmer <int> identity to filter plasmids from the database with kmer approach (0-1), default 0.95
Contig local alignment
-i | --alignment-identity <int> minimum identity percentage aligned for a contig to annotate, default 90
-l | --alignment-percentage <int> minimum length percentage aligned for a contig to annotate, default 20
-L | --length-total <int> minimum alignment length to filter blast analysis
--extend-annotation <int> look for annotation over regions with no homology found (base pairs), default 500bp
Draw images:
--config-directory <dir> directory holding config files, default config_files/
--config-file-individual <file-name> file name of the individual file used to reconstruct
Additional options:
-M | --memory <int> max memory allowed to use
-T | --threads <int> number of threads
-v | --version version
-h | --help display usage message
example: ./plasmidID.sh -1 ecoli_R1.fastq.gz -2 ecoli_R2.fastq.gz -d database.fasta -s ECO_553 -g ENTERO
./plasmidID.sh -1 ecoli_R1.fastq.gz -2 ecoli_R2.fastq.gz -d PacBio_sample.fasta -c scaffolds.fasta -C 60 -s ECO_60 -g ENTERO --no-trim
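Two of the constraints in the usage text are easy to check before launching a run: the sample name must be shorter than 37 characters, and `-C` and `-S` expect a value between 0 and 100. The sketch below shows one way to pre-validate them in a wrapper script. Both function names are hypothetical and are not part of plasmidID.

```shell
# Sample names must be non-empty and shorter than 37 characters (see -s).
valid_sample_name() {
    [ -n "$1" ] && [ "${#1}" -lt 37 ]
}

# -C and -S expect an integer percentage between 0 and 100.
valid_percentage() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$1" -ge 0 ] && [ "$1" -le 100 ]
}
```

For example, `valid_sample_name ECO_553 && valid_percentage 60` succeeds, while a value such as `101` is rejected.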
Examples
Under construction
Output
Since v1.6, the most relevant output is located in the GROUP/SAMPLE folder:
- SAMPLE_final_results.html(.tab)
- id: Name of the accession number of reference
- length: length of the reference sequence
- species: species of the reference sequence
- description: rest of reference fasta header
- contig_name: number of the contigs that align at least the minimum required for the complete contig track
- SAMPLE:
- Image of the reconstructed plasmid (click to open in new tab)
- MAPPING % (percentage): percentage of the reference covered by reads
- X for contig mode (gray colour)
- Indicative colouring (the closer to 100% the better)
- ALIGN FR (fraction_covered): total length of contigs aligned (complete) / reference sequence length
- Indicative colouring (the closer to 1 the better)
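The tabular version of the report can also be filtered from the command line. The sketch below assumes that `SAMPLE_final_results.tab` is tab-separated, has a header row, and uses the column names given above (`id`, `percentage`, `fraction_covered`). Check the header of your own file first, since that layout is an assumption here. `filter_results` is a hypothetical helper.

```shell
# Print the id of every reference whose value in column $2 is >= $3.
# $1: tab-separated table with a header row (layout is an assumption).
filter_results() {
    awk -F '\t' -v col="$2" -v min="$3" '
        NR == 1 {
            # Map each header name to its column number.
            for (i = 1; i <= NF; i++) idx[$i] = i
            if (!(col in idx) || !("id" in idx)) exit 1
            next
        }
        # Non-numeric cells such as "X" (contig mode) evaluate to 0.
        $(idx[col]) + 0 >= min + 0 { print $(idx["id"]) }
    ' "$1"
}
```

For instance, `filter_results GROUP/SAMPLE/SAMPLE_final_results.tab fraction_covered 0.8` would list the references whose aligned contigs span at least 80% of the reference length.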
Annotation file
Under construction
Illustrated pipeline
This image summarizes the PlasmidID pipeline, including the most important steps. For further details, including:
- Results interpretation
- and more, please visit: PLASMIDID WIKI