Awesome
DraftPolisher
DraftPolisher is a tool for fast polishing of draft sequences.
Description
DraftPolisher can produce an improved consensus sequence for a draft genome assembly (or any draft sequence assembly) by using a reference genome and a reference database of sequences (by which the frequences of a defined k-mer, representative of each mismatch, will be evaluated) both in FASTA format.
Prerequisites
DraftPolisher is designed for the polishing of draft sequences. Two different releases of the tool are present: DraftPolisher.py and DraftPolisher_cov.py. The "cov" release (i.e. coverage release) takes into account also the coverage of the sequences from the reference database used for the polishing. The "cov" release was designed based on the SPAdes assembling output file format, where the coverage value is reported in the last part of the sequences IDs, so if using this version of the tool take care of the reference sequences file format (see SPAdes output format for more details).Three files are required by the tool: the query sequence to polish, a reference sequence, a reference database of sequences and the size of the k-mer to use to perform the polishing (we suggest k-mer size = 8). In "Test" folder data for testing the tool are available.
Installation
This tool uses MUSCLE Sequence alignment tool to perform the alignment thus the installation is required (We used MUSCLE v3.8.1551 to test the tool). The last version can be installed as follows:
conda install muscle
Or you can download it from this site and install it manually: http://drive5.com/muscle/downloads.htm
No other installation steps are required. You need just to run the python script of the program. The tool works well with Python 3.6.0 (https://www.python.org/downloads/release/python-360/) but others Python3 versions should work.
The following python packages are required:
- os
- re
- subprocess
- sys
- itertools
- pandas
- numpy
- Bio
- argparse
For an easier installation we suggest to use conda like this:
1- create a conda environment:
conda create --name envname
2- activate the environment:
conda activate envname
3- install the following packadges:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install pandas
conda install -c anaconda numpy
conda install -c conda-forge biopython
conda install -c anaconda argparse python=3.6
Parameters
-
Mandatory
Name | Example value | Description |
---|---|---|
-q <br> --query | query.fa | draft sequence in Fasta format |
-s <br> --subject | subject.fa | reference genome in Fasta format |
-f <br> --sequences_database | reference_contigs.fa | a Fasta file containing all the reads or sequences to use as reference database for the polishing of the draft sequence |
-k <br> --kmer_size | 8 | k-mer size to use for the polishing |
-
Flags
Flags are special parameters without value.
Name | Description |
---|---|
-h | Display help |
Usage
python DraftPolisher.py -q query.fa -s subject.fa -f reads.fa -k 8
python DraftPolisher_cov.py -q query.fa -s subject.fa -f reads.fa -k 8
Main Output
Type | Description |
---|---|
consensus.fa | the polished sequence in Fasta format |
Other outputs
Several files are generated during the process and they can be used to evaluate the error rate, error correction rate and in general to have different checkpoints of the process. At the end of the process will be asked if you want to discard these process files or not.
Contributions
Name | Description | |
---|---|---|
Rosario Nicola Brancaccio | rosariobrancaccio@yahoo.it | Developer to contact for support |
Massimo Tommasino | tommasinom@iarc.fr | |
Tarik Gheit | gheitt@iartc.fr |
Versioning
Versions: 1.0
Authors
Rosario Nicola Brancaccio - (https://github.com/BioRB/)
License
GNU General Public License, version 3
Acknowledgments
I would like to acknowledge Vincenzo Minissale and Sergey Senkin for their help.