Home

Awesome

Build Status License: GPL v3 DOI:10.1093/bioinformatics/btu153 Don't judge me

Prokka: rapid prokaryotic genome annotation

Introduction

Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

Installation

Bioconda

If you use Conda you can use the Bioconda channel:

conda install -c conda-forge -c bioconda -c defaults prokka

Brew

If you are using the MacOS Brew or LinuxBrew packaging system:

brew install brewsci/bio/prokka

Docker

Maintained by https://hub.docker.com/u/staphb


docker pull staphb/prokka:latest
docker run staphb/prokka:latest prokka -h

Singularity

singularity build prokka.sif docker://staphb/prokka:latest
singularity exec prokka.sif prokka -h

Ubuntu/Debian/Mint

sudo apt-get install libdatetime-perl libxml-simple-perl libdigest-md5-perl git default-jre bioperl
sudo cpan Bio::Perl
git clone https://github.com/tseemann/prokka.git $HOME/prokka
$HOME/prokka/bin/prokka --setupdb

Centos/Fedora/RHEL

sudo yum install git perl-Time-Piece perl-XML-Simple perl-Digest-MD5 perl-App-cpanminus git java perl-CPAN perl-Module-Build
sudo cpanm Bio::Perl
git clone https://github.com/tseemann/prokka.git $HOME/prokka
$HOME/prokka/bin/prokka --setupdb

MacOS

sudo cpan Time::Piece XML::Simple Digest::MD5 Bio::Perl
git clone https://github.com/tseemann/prokka.git $HOME/prokka
$HOME/prokka/bin/prokka --setupdb

Test

Invoking Prokka

Beginner

# Vanilla (but with free toppings)
% prokka contigs.fa

# Look for a folder called PROKKA_yyyymmdd (today's date) and look at stats
% cat PROKKA_yyyymmdd/*.txt

Moderate

# Choose the names of the output files
% prokka --outdir mydir --prefix mygenome contigs.fa

# Visualize it in Artemis
% art mydir/mygenome.gff

Specialist

# Have curated genomes I want to use to annotate from
% prokka --proteins MG1655.gbk --outdir mutant --prefix K12_mut contigs.fa

# Look at tabular features
% less -S mutant/K12_mut.tsv

Expert

# It's not just for bacteria, people
% prokka --kingdom Archaea --outdir mydir --genus Pyrococcus --locustag PYCC

# Search for your favourite gene
% exonerate --bestn 1 zetatoxin.fasta mydir/PYCC_06072012.faa | less

Wizard

# Watch and learn
% prokka --outdir mydir --locustag EHEC --proteins NewToxins.faa --evalue 0.001 --gram neg --addgenes contigs.fa

# Check to see if anything went really wrong
% less mydir/EHEC_06072012.err

# Add final details using Sequin
% sequin mydir/EHEC_0607201.sqn

NCBI Genbank submitter

# Register your BioProject (e.g. PRJNA123456) and your locus_tag prefix (e.g. EHEC) first!
% prokka --compliant --centre UoN --outdir PRJNA123456 --locustag EHEC --prefix EHEC-Chr1 contigs.fa

# Check to see if anything went really wrong
% less PRJNA123456/EHEC-Chr1.err

# Add final details using Sequin
% sequin PRJNA123456/EHEC-Chr1.sqn

European Nucleotide Archive (ENA) submitter

# Register your BioProject (e.g. PRJEB12345) and your locus_tag (e.g. EHEC) prefix first!
% prokka --compliant --centre UoN --outdir PRJEB12345 --locustag EHEC --prefix EHEC-Chr1 contigs.fa

# Check to see if anything went really wrong
% less PRJNA123456/EHEC-Chr1.err

# Install and run Sanger Pathogen group's Prokka GFF3 to EMBL converter
# available from https://github.com/sanger-pathogens/gff3toembl
# Find the closest NCBI taxonomy id (e.g. 562 for Escherichia coli)
% gff3_to_embl -i "Submitter, A." \
    -m "Escherichia coli EHEC annotated using Prokka." \
    -g linear -c PROK -n 11 -f PRJEB12345/EHEC-Chr1.embl \
    "Escherichia coli" 562 PRJEB12345 "Escherichia coli strain EHEC" PRJEB12345/EHEC-Chr1.gff

# Download and run the latest EMBL validator prior to submitting the EMBL flat file
# from http://central.maven.org/maven2/uk/ac/ebi/ena/sequence/embl-api-validator/
# which at the time of writing is v1.1.129
% curl -L -O http://central.maven.org/maven2/uk/ac/ebi/ena/sequence/embl-api-validator/1.1.129/embl-api-validator-1.1.129.jar
% java -jar embl-api-validator-1.1.129.jar -r PRJEB12345/EHEC-Chr1.embl

# Compress the file ready to upload to ENA, and calculate MD5 checksum
% gzip PRJEB12345/EHEC-Chr1.embl
% md5sum PRJEB12345/EHEC-Chr1.embl.gz

Crazy Person

# No stinking Perl script is going to control me
% prokka \
        --outdir $HOME/genomes/Ec_POO247 --force \
        --prefix Ec_POO247 --addgenes --locustag ECPOOp \
        --increment 10 --gffver 2 --centre CDC  --compliant \
        --genus Escherichia --species coli --strain POO247 --plasmid pECPOO247 \
        --kingdom Bacteria --gcode 11 --usegenus \
        --proteins /opt/prokka/db/trusted/Ecocyc-17.6 \
        --evalue 1e-9 --rfam \
        plasmid-closed.fna

Output Files

ExtensionDescription
.gffThis is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.gbkThis is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
.fnaNucleotide FASTA file of the input contig sequences.
.faaProtein FASTA file of the translated CDS sequences.
.ffnNucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
.sqnAn ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
.fsaNucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.tblFeature Table file, used by "tbl2asn" to create the .sqn file.
.errUnacceptable annotations - the NCBI discrepancy report.
.logContains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled.
.txtStatistics relating to the annotated features found.
.tsvTab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product

Command line options

General:
  --help            This help
  --version         Print version and exit
  --citation        Print citation for referencing Prokka
  --quiet           No screen output (default OFF)
  --debug           Debug mode: keep all temporary files (default OFF)
Setup:
  --listdb          List all configured databases
  --setupdb         Index all installed databases
  --cleandb         Remove all database indices
  --depends         List all software dependencies
Outputs:
  --outdir [X]      Output folder [auto] (default '')
  --force           Force overwriting existing output folder (default OFF)
  --prefix [X]      Filename output prefix [auto] (default '')
  --addgenes        Add 'gene' features for each 'CDS' feature (default OFF)
  --locustag [X]    Locus tag prefix (default 'PROKKA')
  --increment [N]   Locus tag counter increment (default '1')
  --gffver [N]      GFF version (default '3')
  --compliant       Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
  --centre [X]      Sequencing centre ID. (default '')
Organism details:
  --genus [X]       Genus name (default 'Genus')
  --species [X]     Species name (default 'species')
  --strain [X]      Strain name (default 'strain')
  --plasmid [X]     Plasmid name or identifier (default '')
Annotations:
  --kingdom [X]     Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
  --gcode [N]       Genetic code / Translation table (set if --kingdom is set) (default '0')
  --prodigaltf [X]  Prodigal training file (default '')
  --gram [X]        Gram: -/neg +/pos (default '')
  --usegenus        Use genus-specific BLAST databases (needs --genus) (default OFF)
  --proteins [X]    Fasta file of trusted proteins to first annotate from (default '')
  --hmms [X]        Trusted HMM to first annotate from (default '')
  --metagenome      Improve gene predictions for highly fragmented genomes (default OFF)
  --rawproduct      Do not clean up /product annotation (default OFF)
Computation:
  --fast            Fast mode - skip CDS /product searching (default OFF)
  --cpus [N]        Number of CPUs to use [0=all] (default '8')
  --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
  --evalue [n.n]    Similarity e-value cut-off (default '1e-06')
  --rfam            Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
  --norrna          Don't run rRNA search (default OFF)
  --notrna          Don't run tRNA search (default OFF)
  --rnammer         Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)

Option: --proteins

The --proteins option is recommended when you have good quality reference genomes and want to ensure gene naming is consistent. Some species use specific terminology which will be often lost if you rely on the default Swiss-Prot database included with Prokka.

If you have Genbank or Protein FASTA file(s) that you want to annotate genes from as the first priority, use the --proteins myfile.gbk. Please make sure it has a recognisable file extension like .gb or .gbk or auto-detect will fail. The use of Genbank is recommended over FASTA, because it will provide /gene and /EC_number annotations that a typical .faa file will not provide, unless you have specially formatted it for Prokka.

Option: --prodigaltf

Instead of letting prodigal train its gene model on the contigs you provide, you can pre-train it on some good closed reference genomes first using the prodigal -t option. Once you've done that, provide prokka the training file using the --prodgialtf option.

Option: --rawproduct

Prokka annotates proteins by using sequence similarity to other proteins in its database, or the databases the user provides via --proteins. By default, Prokka tries to "cleans" the /product names to ensure they are compliant with Genbank/ENA conventions. Some of the main things it does is:

Full details can be found in the cleanup_product() function in the prokka script. If you feel your annotations are being ruined, try using the --rawproduct option, and please file an issue if you find an example of where it is "behaving badly" and I will fix it.

Databases

The Core (BLAST+) Databases

Prokka uses a variety of databases when trying to assign function to the predicted CDS features. It takes a hierarchical approach to make it fast.
A small, core set of well characterized proteins are first searched using BLAST+. This combination of small database and fast search typically completes about 70% of the workload. Then a series of slower but more sensitive HMM databases are searched using HMMER3.

The three core databases, applied in order, are:

  1. ISfinder: Only the tranposase (protein) sequences; the whole transposon is not annotated.

  2. NCBI Bacterial Antimicrobial Resistance Reference Gene Database: Antimicrobial resistance genes curated by NCBI.

  3. UniProtKB (SwissProt): For each --kingdom we include curated proteins with evidence that (i) from Bacteria (or Archaea or Viruses); (ii) not be "Fragment" entries; and (iii) have an evidence level ("PE") of 2 or lower, which corresponds to experimental mRNA or proteomics evidence.

Making a Core Databases

If you want to modify these core databases, the included script prokka-uniprot_to_fasta_db, along with the official uniprot_sprot.dat, can be used to generate a new database to put in /opt/prokka/db/kingdom/. If you add new ones, the command prokka --listdb will show you whether it has been detected properly.

The Genus Databases

:warning: This is no longer recommended. Please use --proteins instead.

If you enable --usegenus and also provide a Genus via --genus then it will first use a BLAST database which is Genus specific. Prokka comes with a set of databases for the most common Bacterial genera; type prokka --listdb to see what they are.

Adding a Genus Databases

If you have a set of Genbank files and want to create a new Genus database, Prokka comes with a tool called prokka-genbank_to_fasta_db to help. For example, if you had four annotated "Coccus" genomes, you could do the following:

% prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
% cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
% rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
% makeblastdb -dbtype prot -in Coccus
% mv Coccus.p* /path/to/prokka/db/genus/

The HMM Databases

Prokka comes with a bunch of HMM libraries for HMMER3. They are mostly Bacteria-specific. They are searched after the core and genus databases. You can add more simply by putting them in /opt/prokka/db/hmm. Type prokka --listdb to confirm they are recognised.

FASTA database format

Prokka understands two annotation tag formats, a plain one and a detailed one.

The plain one is a standard FASTA-like line with the ID after the > sign, and the protein /product after the ID (the "description" part of the line):

>SeqID product

The detailed one consists of a special encoded three-part description line. The parts are the /EC_number, the /gene code, then the /product - and they are separated by a special "~~~" sequence:

>SeqID EC_number~~~gene~~~product~~~COG

Here are some examples. Note that not all parts need to be present, but the "~~~" should still be there:

>YP_492693.1 2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase~~~COG1234
MNEKNIKHSQNFITSKHNIDKIMTNIRLNEHDNIFEIGSGKGHFTLELVQRCNFVTAIEI
DHKLCKTTENKLVDHDNFQVLNKDILQFKFPKNQSYKIFGNIPYNISTDIIRKIVF*
>YP_492697.1 ~~~traB~~~transfer complex protein TraB~~~
MIKKFSLTTVYVAFLSIVLSNITLGAENPGPKIEQGLQQVQTFLTGLIVAVGICAGVWIV
LKKLPGIDDPMVKNEMFRGVGMVLAGVAVGAALVWLVPWVYNLFQ*
>YP_492694.1 ~~~~~~transposase~~~
MNYFRYKQFNKDVITVAVGYYLRYALSYRDISEILRGRGVNVHHSTVYRWVQEYAPILYQ
QSINTAKNTLKGIECIYALYKKNRRSLQIYGFSPCHEISIMLAS*

The same description lines apply to HMM models, except the "NAME" and "DESC" fields are used:

NAME  PRK00001
ACC   PRK00001
DESC  2.1.1.48~~~ermC~~~rRNA adenine N-6-methyltransferase~~~COG1234
LENG  284

FAQ

cd /path/to/prokka/db/hmm
mkdir new
for D in *.hmm ; do hmmconvert $D > new/$D ; done
cd new
for D in *.hmm ; do hmmpress $D ; done
mv * ..
rmdir new
sed '/^##FASTA/Q' prokka.gff > nosequence.gff

Bugs

Submit problems or requests to the Issue Tracker.

Changes

Citation

Seemann T.
Prokka: rapid prokaryotic genome annotation
Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063

Dependencies

Mandatory

Recommended

Optional

Licence

GPL v3

Author