Home

Awesome

README for emm-typing-tool

The emm gene typing tool assigns emm type and subtype by querying the CDC M-type specific database (ftp://ftp.cdc.gov/pub/infectious_diseases/biotech/tsemm/). Genomic reads are mapped to the latest version using bowtie2 (version 2.1.0; following options used: --fr --no-unal --minins 300 --maxins 1100 -k 99999 -D 20 -R 3 -N 0 -L 20 -I S,1,0.50) 16. At this stage the validated emm genes (emm1-125) are separated from the non-validated (emm125+, STC and STG) and both sets are analysed in parallel. In each set, alleles with 100% coverage (minimum depth of 5 reads per bp) over the length of their sequences and >90% identity were selected and the allele with the highest percentage identity was reported. Following this selection, a decision algorithm (see decision_algorithm.png) is implemented to determine (a) whether the validated or not validated emm type will be reported, (b) whether emm type or subtype will be reported and (c) whether further investigation is necessary (i.e. contamination, new type or presence of two emm subtypes).

Table of content


Dependencies


emm-typing-tool is written with Python 2.7.5 and requires the following packages installed before running:

cd <path-to>/samtools-0.1.19
make

Before running emm-typing-tool


Prior to running the tool for the first time the user needs to create a reference directory. The reference directory should contain an indexed reference.seq file and a multi-fasta file containing the trimmed EMM variant sequences (180 bps) available in the CDC database.

(a) reference.seq: Download reference sequence (AE004092.2; Streptococcus pyogenes M1 GAS)in fasta format and index the reference sequence by using makeblastdb command usually available in /usr/local/blast+/bin/

<path-to>/makeblastdb -in reference.seq -dbtype nucl  -out reference

(b) EMM variant sequence: fasta file saved as emm_tsdna.fas Within the EMM_data directory run the following command to download the trimmed fasta sequences of the emm alleles from the CDC database:

wget ftp://ftp.cdc.gov/pub/infectious_diseases/biotech/tsemm/trimmed.tfa

Then copy the edit_allele_file.sh into the EMM_data directory and run:

sh edit_allele_file.sh

This will rename the trimmed.tfa file to emm_tsdna_fas and make some edits to the content so that it can be parsed by the emmm typing tool.

NOTES:

Running emm-typing-tool


usage: emm_typing.py [-h] [-input_directory INPUT_DIRECTORY]
                     [--fastq_1 FASTQ_1] [--fastq_2 FASTQ_2]
                     [--profile_file_directory PROFILE_FILE_DIRECTORY]
                     [--output_directory OUTPUT_DIRECTORY] [--bowtie BOWTIE]
                     [--samtools SAMTOOLS] [--log_directory LOG_DIRECTORY]
                     [--verbose]

optional arguments:
  -h, --help            show this help message and exit
  -input_directory INPUT_DIRECTORY, -i INPUT_DIRECTORY
                        please provide an input directory
  --fastq_1 FASTQ_1, -1 FASTQ_1
                        Fastq file pair 1
  --fastq_2 FASTQ_2, -2 FASTQ_2
                        Fastq file pair 2
  --profile_file_directory PROFILE_FILE_DIRECTORY, -m PROFILE_FILE_DIRECTORY
                        emm variants directory
  --output_directory OUTPUT_DIRECTORY, -o OUTPUT_DIRECTORY
                        please provide an output directory
  --bowtie BOWTIE, -b BOWTIE
                        please provide the path for bowtie2
  --samtools SAMTOOLS, -sam SAMTOOLS
                        please provide the path for samtools
  --log_directory LOG_DIRECTORY, -log LOG_DIRECTORY
                        please provide the path for log directory
  --verbose, -v         if selected a summary.yml file is generated in the tmp
                        directory

There are 2 pathways by which the script can accessed by:

  1. Specify EMM_data dir and glob fastq files from input-dir

    python emm_typing.py -m profile_file_directory -i input_directory -o output_directory

  2. Specify the fastq_1, fastq_2 files, EMM_data dir and output_directory

python emm_typing.py -m profile_file_directory -1 fastq_1 -2 fastq_2 -o output_directory

Input directory

Either an input directory with two fastq files (paired reads) or the paths to the two fastq files should be provided (options -i for input directory and -1 sample1.R1.fastq.gz -2 sample1.R2.fastq.gz for fastq files). Paired reads should be indicated as sampleID.R*.fastq*

Output directory

If no output dir is provided, an output dir is created in the input directory, called emm_typing. However, if the -1 -2 option is used then an output directory must be provided.

emm-typing-tool Output


Output directory should contain EMM_log.txt, id.results.xml a logs directory with stderr and stdout files (if not redirected using -log option) and a tmp directory.

(a) EMM_log.txt: this file is useful for troubleshooting unresolved emm type assignemnt. It contains:

(b) Result XML file: the XML file extracts the following values from the EMM log file:

For both EMM_validated and EMM_NonValidated alleles these metrics are provided:

An example result.xml file is provided (see Example.result.xml).

(c) tmp directory: the tmp directory contains files useful for troubleshooting:

Troubleshooting:


Formats for the Final_EMM_type:

Errors while setting up

Traceback (most recent call last):
  File "/Users/georgiakapatai/repositories/emm-typing-tool/modules/utility_functions.py", line 54, in try_and_except
    return function(*parameters, **named_parameters)
  File "/Users/georgiakapatai/repositories/emm-typing-tool/modules/EMM_determiner_functions.py", line 131, in prep_SRST
    process = subprocess.Popen(['seqret',seq,'-firstonly','-auto','-out',output_directory+ '/' + bait], stderr=subprocess.PIPE, stdout=subprocess.PIPE)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

SOLUTION: The path for emboss needs to be defined in the PATH environmental variable:

PATH="$PATH:/usr/local/emboss/bin"
export PATH

If similar error is raised for the lines where subprocess is calling bowtie2 and samtools then either provide the paths with the -b and -sam options respectively or add the paths to the PATH variable as described above.

Contact Information


Georgia Kapatai

Email: Georgia.Kapatai@phe.gov.uk

Twitter: @gkapatai

Licence Agreement


This software is covered by GNU General Public License, version 3 (GPL-3.0).