
PLEASE NOTE - THIS FORK IS PROVIDED AS-IS

This fork is provided as is. I'm not the maintainer, just a bioinformatician who wanted to use PMGA (PubMLST Genome Annotator v2.0) on the command line. The folks at the CDC, in particular Nadav Topaz, deserve all the kudos and credit for creating PMGA!

Because I only wanted to run this on the command line, please keep that in mind if any issues arise. I might be able to assist with technical issues, but I cannot help interpret the results (e.g. serotype/serogroup). For help with the biological significance of the results, it would probably be best to reach out to the Bacterial Meningitis Genomic Analysis Platform (BMGAP) team.

I'm sure they would be willing to help!

To review the original docs I recommend you visit: PMGA (PubMLST Genome Annotator v2.0)

PMGA - PubMLST Genome Annotator

PMGA is a tool for serotyping/serogrouping all Neisseria species and Haemophilus influenzae. PMGA requires you to build a set of BLAST databases using data that is available from PubMLST. You can then query your uncompressed FASTA files against the BLAST databases. If you do not provide a species using the --species option, Mash will be used to determine the species. Once complete, you will get output files (discussed below).

Installation

Currently, pmga is available on my personal Conda channel, and it is expected to become available on Bioconda soon.

mamba create -n pmga -c pmga -c conda-forge -c bioconda pmga

Building BLAST Databases

Prior to running pmga, you will need to build the BLAST databases with pmga-build. During this process, pmga-build will download each locus for all schemes associated with Neisseria species and Haemophilus influenzae. Building the databases can take as long as 90 minutes, but this is a one-time step. After pmga-build has completed, by default all the outputs will be written to ./pubmlst_dbs_all.

`pmga-build` Usage

pmga-build --help
usage: pmga-build [-h] [--outdir STR] [--force] [--silent] [--debug] [--version]

pmga-build - Script for creating local BlastDBs from PubMLST alleles

options:
  -h, --help    show this help message and exit
  --outdir STR  Directory to save BLAST databases to (Default: ./pubmlst_dbs_all)
  --force       Overwrite existing directories.
  --silent      Only critical errors will be printed.
  --debug       Print debug related text.
  --version     show program's version number and exit
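
Because the build is slow, it can help to skip it when the databases are already in place. The following is a minimal sketch, not part of pmga: the helper name `build_dbs_if_missing` is hypothetical, and it assumes that an existing directory means a finished build. Only `--outdir` and the default path are taken from the help text above.

```shell
# Build the PubMLST BLAST databases only if the target directory is absent.
# build_dbs_if_missing is a hypothetical helper name, not part of pmga.
build_dbs_if_missing() {
    db_dir="${1:-./pubmlst_dbs_all}"
    if [ -d "$db_dir" ]; then
        # Assumption: an existing directory holds a complete build
        echo "Found existing databases in $db_dir, skipping pmga-build"
    else
        pmga-build --outdir "$db_dir"
    fi
}
```

Calling `build_dbs_if_missing` with no argument uses the default location; pass a path to build elsewhere. If a previous build was interrupted, running `pmga-build --force` directly is probably the safer route.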

Running `pmga`

After you have successfully built all the necessary BLAST databases, you are ready to start serotyping/serogrouping your Neisseria and H. influenzae samples! pmga can be executed on an uncompressed FASTA file and will output a number of files.
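
Since pmga expects uncompressed FASTA, assemblies stored as `.gz` need to be decompressed first. A minimal sketch of one way to handle both cases follows. The helper name `run_pmga_on` is hypothetical, and writing the decompressed copy next to the input is an assumption about your directory layout; `--blastdir` is taken from the pmga help text.

```shell
# Run pmga on a FASTA file that may be gzip-compressed.
# run_pmga_on is a hypothetical helper; --blastdir comes from `pmga --help`.
run_pmga_on() {
    input="$1"
    blastdir="${2:-./pubmlst_dbs_all}"
    case "$input" in
        *.gz)
            # pmga expects uncompressed FASTA, so decompress to a sibling file
            fasta="${input%.gz}"
            gzip -dc "$input" > "$fasta"
            ;;
        *)
            fasta="$input"
            ;;
    esac
    pmga --blastdir "$blastdir" "$fasta"
}
```

For example, `run_pmga_on sample.fasta.gz` would write `sample.fasta` and then run pmga against the default database directory.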

`pmga` Usage

pmga --help
usage: pmga [-h] [--prefix STR] [--blastdir STR] [--species STR] [-t INT] [-o STR] [--force] [--verbose]
            [--silent] [--version]
            FASTA

pmga - Serotyping, serotyping and MLST of all Neisseria species and Haemophilus influenzae

positional arguments:
  FASTA                 Input FASTA file to analyze

options:
  -h, --help            show this help message and exit
  --prefix STR          Prefix for outputs (Default: Use basename of input FASTA file)
  --blastdir STR        Directory containing BLAST DBs built by pmga-build (Default: ./pubmlst_dbs_all

Additional Options:
  --species STR         Use this as the input species (Default: use Mash distance). Available Choices:
                        neisseria, hinfluenzae
  -t INT, --threads INT
                        Number of cores to use (default=1)
  -o STR, --outdir STR  Directory to output results to (Default: ./pmga)
  --force               Force overwrite existing output file
  --verbose             Print debug related text.
  --silent              Only critical errors will be printed.
  --version             show program's version number and exit
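
To process many assemblies, the options above can be combined in a loop. This is a sketch under stated assumptions: the helper name `pmga_batch`, the `*.fasta` naming, and the one-folder-per-sample layout are all hypothetical, and it has not been verified here whether pmga creates a missing `--outdir` itself.

```shell
# Run pmga over every *.fasta file in a directory, one output folder per sample.
# pmga_batch and the directory layout are hypothetical, not part of pmga.
pmga_batch() {
    fasta_dir="$1"
    results_dir="$2"
    threads="${3:-1}"
    mkdir -p "$results_dir"
    for fasta in "$fasta_dir"/*.fasta; do
        [ -e "$fasta" ] || continue      # no matches: the glob stays literal
        prefix=$(basename "$fasta" .fasta)
        pmga --prefix "$prefix" --threads "$threads" \
             --outdir "$results_dir/$prefix" "$fasta"
    done
}
```

For example, `pmga_batch ./assemblies ./results 4` would run each sample with four threads. Adding `--species` to the pmga call skips the Mash step when every input is known to be the same species.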

`pmga` Output Files

Below are the expected outputs from each pmga run. The files will be output to the value set by --outdir, which defaults to ./pmga. Each of the output files will use the value of --prefix for filenames. By default the prefix is set to the basename of the input file.

<OUTDIR>
├── <PREFIX>-allele-matrix.txt
├── <PREFIX>-final-blast-results.json.gz
├── <PREFIX>-loci-counts.txt
├── <PREFIX>-raw-blast-results.json.gz
├── <PREFIX>.gff.gz
└── <PREFIX>.txt

Extension	Description
`*-allele-matrix.txt`	A tab-delimited file with the allele ID for each locus with a hit
`*-blast-final-results.json.gz`	Filtered BLAST results in JSON format
`*-blast-raw-results.json.gz`	Unfiltered BLAST results in JSON format
`*-loci-counts.txt`	A tab-delimited file with the number of hits per locus
`*.gff.gz`	A GFF3 file annotated with the BLAST results
`*.txt`	A tab-delimited file with the final predicted serotype/serogroup for the sample
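
The compressed outputs can be inspected without unpacking them to disk. As a small sketch, the hypothetical helper below counts feature lines in a gzipped GFF3 file. It relies only on the GFF3 format itself (comment lines start with `#`, feature lines have nine tab-separated columns, and an optional `##FASTA` section may follow); whether pmga's GFF includes such a section is not assumed either way.

```shell
# Count annotated features in a gzipped GFF3 file.
# count_gff_features is a hypothetical helper name, not part of pmga.
count_gff_features() {
    gzip -dc "$1" | awk -F'\t' '
        /^##FASTA/ { exit }            # stop before any embedded sequence
        !/^#/ && NF == 9 { n++ }       # GFF3 feature lines have 9 columns
        END { print n + 0 }'
}
```

For example, `count_gff_features pmga/sample.gff.gz` prints a single number, where `pmga/sample.gff.gz` stands in for `<OUTDIR>/<PREFIX>.gff.gz`.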

Example Serotype/Serogroup Output

sample	species	prediction	genes_present	notes
GCF_003355215	neisseria_serogroup	B	csb,cssA,cssB,cssC,ctrA,ctrB,ctrC,ctrD,ctrE,ctrF,tex	B backbone: All essential capsule genes intact and present

Above is an example of the output predictions, which contain five columns:

Column	Description
sample	The input sample name
species	The species of the sample, either `neisseria_serogroup` or `hinfluenzae_serotype`
prediction	The predicted serotype or serogroup
genes_present	A list of genes present in the sample
notes	Any notes associated with the prediction
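
When several samples have been run, their prediction tables can be stacked into one file. The sketch below assumes each argument is a `<PREFIX>.txt` file whose first line is the header shown above; the helper name `merge_predictions` is hypothetical.

```shell
# Combine per-sample prediction tables into one TSV with a single header.
# merge_predictions is a hypothetical helper name, not part of pmga.
merge_predictions() {
    [ "$#" -gt 0 ] || return 0         # nothing to merge
    # Keep the first file's header, drop the header line of every later file
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}
```

For example, `merge_predictions run1/A.txt run2/B.txt | cut -f1,3` lists only the sample and prediction columns. Pass the prediction files explicitly rather than `*.txt`, since that glob would also match the `-allele-matrix.txt` and `-loci-counts.txt` files.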

Citations

If you make use of this tool, please cite the following:

Bacterial Meningitis Genomic Analysis Platform (BMGAP)
An analysis pipeline, ExpressJS API, and ReactJS webapp for the analysis and characterization of bacterial meningitis samples
Buono SA, Kelly RJ, Topaz N, Retchless AC, Silva H, Chen A, Ramos E, Doho G, Khan AN, Okomo-Adhiambo MA, Hu F, Marasini D, Wang X. Web-Based Genome Analysis of Bacterial Meningitis Pathogens for Public Health Applications Using the Bacterial Meningitis Genomic Analysis Platform (BMGAP). Front Genet. 2020 Nov 26;11:601870.
BLAST
Basic Local Alignment Search Tool
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Mash
Fast genome and metagenome distance estimation using MinHash
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132 (2016).
Pigz
A parallel implementation of gzip for modern multi-processor, multi-core machines.
Adler, M. pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines. Jet Propulsion Laboratory (2015).
PubMLST.org
A database housing MLST schemes for many bacterial species.
Jolley KA, Bray JE, Maiden MCJ. Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res 3, 124 (2018).