Awesome
digIS
Table of content
<!---toc start--> <!---toc end-->Overview
digIS is a command-line tool developed in Python. It utilizes several external tools such as BLAST, HMMER, and Biopython library. As an input, digIS accepts contigs in FASTA format. Optionally, the user can provide a GenBank annotation file for a given input sequence(s). This annotation is later used to improve the classification of identified IS elements.
The digIS search pipeline operates in the following steps:
- The whole input nucleic acid sequence is translated into amino acid sequences (all six frames).
- The translated sequences are searched using manually curated pHMMs.
- Found hits, referred to as seeds, are filtered by domain bit score and e-value. Those that overlap or follow one another within a certain distance are merged.
- Each seed is matched against the database of known IS elements (ISfinder) and its genomic positions are extended according to the best hit.
- Extended seeds are filtered by noise cutoff score and length. Duplicates, corresponding to the same IS element, are removed.
- Remaining extended seeds are classified based on sequence similarity and GenBank annotation (if available) to assess their quality.
- Finally, the classified outputs are reported in the CSV and GFF3 format.
Requirements
- HMMER 3.3 or higher, download the latest version from http://hmmer.org/download.html
- ncbi-blast+ v 2.6.1 or higher, download the latest version from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
- Python3
- Biopython 1.73 or higher
Installation
Install dependencies using package manager (for Ubuntu)
sudo apt-get update
sudo apt-get install hmmer
sudo apt-get install ncbi-blast+
# install python3 and pip3
sudo apt-get install python3.7
sudo apt install python3-pip
# install Biopython, bcbio-gff, and numpy packages
pip3 install biopython
pip3 install bcbio-gff
pip3 install numpy
Download digIS version from github repository
# download the latest version
git clone https://github.com/janka2012/digIS.git
# or download specific release from https://github.com/janka2012/digIS/releases
wget https://github.com/janka2012/digIS/archive/v1.0.tar.gz
tar -xvzf v1.0.tar.gz
# go to digIS directory
cd digIS-v1.0
Usage
Mode with GenBank annotation
export PYTHONPATH=$PYTHONPATH:/path/to/digis/
python3 digIS_search.py -i data/test_data/NC_002608.fasta -g data/test_data/NC_002608.gb -o digis_genbank
Mode without GenBank annotation
export PYTHONPATH=$PYTHONPATH:/path/to/digis/
python3 digIS_search.py -i data/test_data/NC_002608.fasta -o digis_without_genbank
Run digIS in docker container
Install docker
# update software repositories
sudo apt-get update
# uninstall older versions of docker
sudo apt-get remove docker docker-engine docker.io
# install docker
sudo apt install docker.io
# start and automate docker
sudo systemctl start docker
sudo systemctl enable docker
# check docker version (optional)
docker --version
Pull the docker image from Dockerhub
docker pull janka2012/digis
# List created docker images. You should see image with name digis listed.
docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
janka2012/digis latest 1f09fc937ee1 14 minutes ago 765MB
Run digIS using digis_docker_wrapper.sh
Instead of typing overwhelmingly long docker commands we are providing digis_docker_wrapper.sh
script which allows you to use digIS docker image in really convinient way. This script takes same arguments as standard digIS.py
script.
sh digis_docker_wrapper.sh -i data/test_data/NC_002608.fasta -g data/test_data/NC_002608.gb -o digis_genbank
Understanding Outputs
digIS output directory structure
digIS stores all results in output directory you specify by using -o
option. The default output directory name is set to digIS_output
. The output directory has following structure:
hmmer
: results from hmmer and phmmer search, separate file for each sequence in input (multi)fasta filelogs
:pep
: translated protein sequences of the input genomic sequencesresults
: digIS results for all sequences in input (multi)fasta file in CSV and GFF3 format together with summarization statistics.input_filename.csv
: results in CSV formatinput_filename.gff
: results in GFF formatinput_filename.sum
: sumarization statistics
digIS output files in results
subfolder
For a given input (multi)fasta file digIS generates three files with results: CSV file, GFF3 file and file with summary statistics.
CSV output
id
: unique record idlevel
: Extension level, possible values: IS, ORF, domain.qid
: name of the query profileqstart
: start position of the hit in the query profileqend
: end position of the hit in the query profilesid
: name of the subject sequencesstart
: start position of the hit in subject sequencesend
: end position of the hit in subject sequencestrand
: the strand on which the hit was foundacc
: taken from HMMER. A measure of how reliable the overall alignment is (from 0 to 1, with 1.00 indicating a completely reliable alignment according to the model).score
: taken from HMMER. TODOevalue
: taken from HMMER. TODOORF_sim
: TODOIS_sim
: TODOGenBank_class
: classification based on GenBank annotation. Possible values: is_related, no, other annotation. If GenBank annotation is not provided, this classification is not available and this field is empty.
Example CSV output
id;level;qid;qstart;qend;sid;sstart;send;strand;acc;score;evalue;ORF_sim;IS_sim;GenBank_class
NC_002608.1_000_is;is;IS200_IS605;1;113;NC_002608.1;181742;183592;-;0.98;105.7;4.9e-34;0.8923076923076924;0.8803465078505684;is_related
NC_002608.1_001_is;is;IS200_IS605;1;113;NC_002608.1;154295;156130;-;0.98;117.5;1.1e-37;0.9922480620155039;1.0;is_related
GFF output
The GFF3 output file has the same content as the CSV output file, but in GFF3 format. Detailed description of GFF3 format can be found here.
Example GFF3 output record
NC_002608.1 digIS transposable_element 309812 311213 1.0 - . id=NC_002608.1_13_is;level=is;qid=IS4Sa_ISH8_IS231_IS4;qstart=1;qend=226;class_sim_orf=strong;class_sim_is=strong;class_sim_all=strong;class_genebank=None;class_level=None
Summary statistics
This file is in tab-delimited format and contains six coloumns:
seqid
: sequence IDfamily
: IS family namenIS
: number of IS elements per individual familybps
: overall number of base pairs of input sequence occupied by IS elementsdnaLen
: length of the input sequence%dna
: overall percentage of input sequence occupied by IS elements
Getting FASTA file using GFF file
The GFF is a standard format for storing of genome features. This file can be used as an input for other tools to process or visualize appropriate genomic features.
For instance, FASTA sequences of IS elements (their catalytic domains) can be obtained using Bedtools and command getfasta
as follows:
bedtools getfasta -fi <input.fasta> -bed <input.gff> -fo <output.fasta>
where input.fasta represents FASTA file used for searching, input.gff is the digIS output GFF file and output.fasta is required output file.
Getting Flank Regions using GFF file
bedtools flank -i <input.gff> -g <genome.size> -l <left flank size> -r <right flank size>
where genome.size is a text file containing information about chromosomes and their sizes in form: chromosome_name<TAB>chromosome_size
. For more information about genome.size file format please see Bedtools documentation.
As the output a new GFF file with positions of flank regions is generated. Then, the appropriate FASTA file with flank sequences can be obtained using bedtools getfasta
command described above.