Kraken Tools

KrakenTools is a suite of scripts to be used alongside the Kraken, KrakenUniq, Kraken 2, or Bracken programs. These scripts are designed to help Kraken users with downstream analysis of Kraken results.

For news and updates, refer to the GitHub page: https://github.com/jenniferlu717/KrakenTools/

Citation

KrakenTools was published on September 28, 2022, as part of a protocol paper for using the Kraken software suite. Please cite the following when using any KrakenTools script:

Lu J, Rincon N, Wood D E, Breitwieser F P, Pockrandt C, Langmead B, Salzberg S L, Steinegger M. Metagenome analysis using the Kraken software suite. Nature Protocols (2022). doi: 10.1038/s41596-022-00738-y. https://www.nature.com/articles/s41596-022-00738-y

Please also cite the relevant paper for usage of KrakenTools with any of the listed programs.

  1. Kraken 1
  2. Kraken 2
  3. KrakenUniq
  4. Bracken

For issues with any of the above programs, please open a GitHub issue on the respective program's GitHub page. This GitHub repository is dedicated only to the scripts provided here.


Scripts included in KrakenTools

  1. extract_kraken_reads.py
  2. combine_kreports.py
  3. kreport2krona.py
  4. kreport2mpa.py
  5. combine_mpa.py
  6. filter_bracken_out.py
  7. fix_unmapped.py
  8. make_ktaxonomy.py
  9. make_kreport.py
  10. alpha_diversity.py (see Diversity/README.md)
  11. beta_diversity.py (see Diversity/README.md)

Dependencies

Some of the scripts in this package require Biopython (https://biopython.org/). The remaining scripts need only a working Python installation.

Running Scripts:

No installation required. All scripts are run on the command line as described.

Users can make scripts executable by running

chmod +x myscript.py
./myscript.py -h 

extract_kraken_reads.py

This program extracts reads classified at any user-specified taxonomy IDs. Users must specify the Kraken output file, the sequence file(s), and at least one taxonomy ID. Additional options are specified below. As of April 19, 2021, this script is compatible with KrakenUniq/Kraken2Uniq reports.

1. extract_kraken_reads.py usage/options

python extract_kraken_reads.py

Optional:

2. extract_kraken_reads.py input files

Input sequence files must be either FASTQ or FASTA files, and may be gzipped or uncompressed. The program automatically detects whether a file is gzipped and whether it is FASTQ or FASTA formatted, based on the first character in the file (">" for FASTA, "@" for FASTQ).
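The detection described above could be sketched roughly as follows. This is an illustration only, not the script's actual code: the helper name `sniff_sequence_file` is hypothetical, and recognising gzip by its two magic bytes is an assumption about how such detection might be done.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def sniff_sequence_file(path):
    """Return (is_gzipped, fmt), where fmt is 'fasta' or 'fastq'.

    Hypothetical helper for illustration; not part of KrakenTools.
    """
    # Assumption: gzip is detected from the magic bytes, not the file extension.
    with open(path, "rb") as raw:
        is_gzipped = raw.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzipped else open
    # Read the first character of the (decompressed) text.
    with opener(path, "rt") as handle:
        first_char = handle.read(1)
    if first_char == ">":
        return is_gzipped, "fasta"
    if first_char == "@":
        return is_gzipped, "fastq"
    raise ValueError(f"{path}: expected '>' or '@', found {first_char!r}")
```

Calling `sniff_sequence_file("reads.fq.gz")` on a gzipped FASTQ file would return `(True, "fastq")` under these assumptions.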

3. extract_kraken_reads.py paired input/output

Users that ran Kraken using paired reads should input both read files into extract_kraken_reads.py as follows:

extract_kraken_reads.py -k myfile.kraken -s1 read1.fq -s2 reads2.fq

Given paired reads, the script requires users to provide two output file names to contain extracted reads:

extract_kraken_reads.py -k myfile.kraken -s1 read1.fq -s2 reads2.fq -o extracted1.fq -o2 extracted2.fq

The delimiter (--delimiter or -d) option has been removed.

extract_kraken_reads.py -k myfile.kraken ... -o reads_S1.fa -o2 reads_s2.fa

4. extract_kraken_reads.py --exclude flag

By default, reads classified at the specified taxonomy IDs (and at any taxids selected using --include-parents/--include-children) will be extracted. Specifying --exclude instead extracts the reads NOT classified at any of the specified taxonomy IDs.

For example:

  1. extract_kraken_reads.py -k myfile.kraken ... --taxid 9606 --exclude ==> extract all reads NOT classified as Human (taxid 9606).
  2. extract_kraken_reads.py -k myfile.kraken ... --taxid 2 --exclude --include-children ==> extract all reads NOT classified as Bacteria (taxid 2) or any classification in the Bacteria subtree.
  3. extract_kraken_reads.py -k myfile.kraken ... --taxid 9606 --exclude --include-parents ==> extract all reads NOT classified as Human or any classification in the direct ancestry of Human (e.g. will exclude reads classified at the Primate, Chordata, or Eukaryota levels).
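As a rough illustration of the selection logic, the sketch below picks read IDs from Kraken output lines with and without exclusion. It assumes the standard Kraken 2 output layout (tab-separated: C/U flag, read ID, taxonomy ID, length, LCA mapping); the function name `select_read_ids` is hypothetical and the real script may differ in detail.

```python
def select_read_ids(kraken_lines, taxids, exclude=False):
    """Collect read IDs whose assigned taxid is (or, with exclude, is not) in taxids.

    Hypothetical helper for illustration; not part of KrakenTools.
    """
    wanted = {str(t) for t in taxids}
    selected = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        read_id, taxid = fields[1], fields[2]
        # Keep the read when membership in `wanted` disagrees with `exclude`.
        if (taxid in wanted) != exclude:
            selected.add(read_id)
    return selected


# Illustrative Kraken output lines (made-up reads).
lines = [
    "C\tread1\t9606\t150\t9606:116",
    "C\tread2\t562\t150\t562:116",
    "U\tread3\t0\t150\t0:116",
]
print(sorted(select_read_ids(lines, [9606])))                # ['read1']
print(sorted(select_read_ids(lines, [9606], exclude=True)))  # ['read2', 'read3']
```

Note that in this sketch unclassified reads (taxid 0) also count as "not classified at 9606" and are therefore kept when excluding.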

5. extract_kraken_reads.py --include-parents/--include-children flags

By default, only reads classified exactly at the specified taxonomy IDs will be extracted. The options --include-children and --include-parents can be used to extract reads classified within the same lineage as a specified taxonomy ID. For example, given a Kraken report containing the following:

    [%]     [reads] [lreads][lvl]   [tid]       [name]
    100     1000    0       R       1           root
    100     1000    0       R1      131567        cellular organisms
    100     1000    50      D       2               Bacteria
    95      950     0       P       1224              Proteobacteria
    95      950     0       C       1236                Gammaproteobacteria
    95      950     0       O       91347                 Enterobacterales
    95      950     0       F       543                     Enterobacteriaceae
    95      950     0       G       561                       Escherichia
    95      950     850     S       562                         Escherichia coli
    5       50      50      S1      498388                        Escherichia coli C
    5       50      50      S1      316401                        Escherichia coli ETEC

  1. extract_kraken_reads.py [options] -t 562 ==> 850 reads classified as E. coli will be extracted
  2. extract_kraken_reads.py [options] -t 562 --include-parents ==> 900 reads classified as E. coli or Bacteria will be extracted
  3. extract_kraken_reads.py [options] -t 562 --include-children ==> 950 reads classified as E. coli, E. coli C, or E. coli ETEC will be extracted
  4. extract_kraken_reads.py [options] -t 498388 ==> 50 reads classified as E. coli C will be extracted
  5. extract_kraken_reads.py [options] -t 498388 --include-parents ==> 950 reads classified as E. coli C, E. coli, or Bacteria will be extracted
  6. extract_kraken_reads.py [options] -t 1 --include-children ==> All classified reads will be extracted
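The read counts in the six cases above can be reproduced with a small sketch. The parent links and per-node read counts are transcribed from the example report; the function names `expand_taxids` and `count_reads` are hypothetical and only illustrate the lineage logic, not the script's implementation.

```python
# taxid -> (parent taxid, reads assigned directly to that taxid),
# transcribed from the example report above.
REPORT = {
    1: (None, 0),
    131567: (1, 0),
    2: (131567, 50),
    1224: (2, 0),
    1236: (1224, 0),
    91347: (1236, 0),
    543: (91347, 0),
    561: (543, 0),
    562: (561, 850),
    498388: (562, 50),
    316401: (562, 50),
}


def expand_taxids(taxid, include_parents=False, include_children=False):
    """Return the set of taxids whose reads would be extracted."""
    selected = {taxid}
    if include_parents:
        # Walk up the tree to the root.
        parent = REPORT[taxid][0]
        while parent is not None:
            selected.add(parent)
            parent = REPORT[parent][0]
    if include_children:
        # Walk down the tree, collecting every descendant of `taxid`.
        frontier = [taxid]
        while frontier:
            current = frontier.pop()
            for node, (parent, _) in REPORT.items():
                if parent == current:
                    selected.add(node)
                    frontier.append(node)
    return selected


def count_reads(taxids):
    """Sum the reads assigned directly to each selected taxid."""
    return sum(REPORT[t][1] for t in taxids)


print(count_reads(expand_taxids(562)))                            # 850
print(count_reads(expand_taxids(562, include_parents=True)))      # 900
print(count_reads(expand_taxids(562, include_children=True)))     # 950
print(count_reads(expand_taxids(498388)))                         # 50
print(count_reads(expand_taxids(498388, include_parents=True)))   # 950
print(count_reads(expand_taxids(1, include_children=True)))       # 1000
```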

combine_kreports.py

This script combines multiple Kraken reports into a combined report file.

1. combine_kreports.py usage/options

python combine_kreports.py

Optional:

2. combine_kreports.py output

Percentage is only reported for the summed read counts, not for each individual sample.

The output file therefore contains the following tab-delimited columns:


kreport2krona.py

This program takes a Kraken report file and prints out a Krona-compatible TEXT file.

1. kreport2krona.py usage/options

python kreport2krona.py

Optional:

2. kreport2krona.py example usage

kraken2 --db KRAKEN2DB --threads THREADNUM --report MYSAMPLE.KREPORT \
    --paired SAMPLE_1.FASTA SAMPLE_2.FASTA > MYSAMPLE.KRAKEN2
python kreport2krona.py -r MYSAMPLE.KREPORT -o MYSAMPLE.krona 
ktImportText MYSAMPLE.krona -o MYSAMPLE.krona.html

Krona information: see https://github.com/marbl/Krona.

3. kreport2krona.py example output

--no-intermediate-ranks

    6298        Unclassified
    8           k__Bacteria
    4           k__Bacteria     p__Proteobacteria
    6           k__Bacteria     p_Proteobacteria    c__Gammaproteobacteria
    ...

--intermediate-ranks

    6298        Unclassified
    79          x__root
    0           x__root     x__cellular_organisms
    8           x__root     x__cellular_organisms   k__Bacteria
    4           x__root     x__cellular_organisms   k__Bacteria     p__Proteobacteria
    6           x__root     x__cellular_organisms   k__Bacteria     p__Proteobacteria   c__Gammaproteobacteria
    ...

kreport2mpa.py

This program takes a Kraken report file and prints out an mpa (MetaPhlAn)-style TEXT file.

1. kreport2mpa.py usage/options

python kreport2mpa.py

Optional:

2. kreport2mpa.py example usage

kraken2 --db KRAKEN2DB --threads THREADNUM --report MYSAMPLE.KREPORT \
    --paired SAMPLE_1.FASTA SAMPLE_2.FASTA > MYSAMPLE.KRAKEN2
python kreport2mpa.py -r MYSAMPLE.KREPORT -o MYSAMPLE.MPA.TXT 

3. kreport2mpa.py example output

The output will contain one tab character between the classification and the read count.

--no-intermediate-ranks/--read-count

    #Classification                                           SAMPLE.KREPORT
    k__Bacteria                                               36569
    k__Bacteria|p__Proteobacteria                             21001
    k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria      11648
    ... 

--intermediate-ranks/--read-count

    #Classification                                           SAMPLE.KREPORT
    x__cellular_organisms                                     38462
    x__cellular_organisms|k__Bacteria                         36569 
    x__cellular_organisms|k__Bacteria|p__Proteobacteria       21001
    ... 
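The conversion from report lines to mpa-style lineages (without intermediate ranks) might look roughly like the sketch below. It rests on several assumptions: the report has six tab-separated columns, the name column is indented two spaces per tree level, the D rank is written as k__ to match the k__Bacteria lines above, and the remaining prefixes follow the usual MetaPhlAn convention. The function name `kreport_to_mpa` and the sample percentages and node-level counts are illustrative, not taken from the script.

```python
# Rank codes in the report mapped to mpa prefixes (assumed mapping).
MPA_PREFIX = {"D": "k", "P": "p", "C": "c", "O": "o", "F": "f", "G": "g", "S": "s"}


def kreport_to_mpa(report_lines):
    """Yield (lineage, clade_reads) for standard ranks only.

    Hypothetical helper for illustration; not part of KrakenTools.
    """
    stack = []  # (depth, label or None) for each node on the current path
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads, rank, name = int(fields[1]), fields[3], fields[5]
        stripped = name.lstrip(" ")
        depth = (len(name) - len(stripped)) // 2  # assumes 2 spaces per level
        # Drop path entries that are not ancestors of this node.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        label = None
        if rank in MPA_PREFIX:
            label = f"{MPA_PREFIX[rank]}__{stripped.replace(' ', '_')}"
        stack.append((depth, label))
        if label is not None:
            lineage = "|".join(lbl for _, lbl in stack if lbl is not None)
            yield lineage, clade_reads


report = [
    "100.00\t38462\t0\tR\t1\troot",
    "100.00\t38462\t0\tR1\t131567\t  cellular organisms",
    "95.08\t36569\t10\tD\t2\t    Bacteria",
    "54.60\t21001\t5\tP\t1224\t      Proteobacteria",
    "30.28\t11648\t3\tC\t1236\t        Gammaproteobacteria",
]
for lineage, reads in kreport_to_mpa(report):
    print(f"{lineage}\t{reads}")
```

Intermediate ranks such as R1 are kept on the path so that depth tracking stays correct, but they contribute no label to the lineage string.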

combine_mpa.py

This program combines multiple outputs from kreport2mpa.py. Files to be combined must have been generated using the same kreport2mpa.py options.

Important:

  1. Input files to combine_mpa.py cannot be a mix of intermediate/no intermediate rank outputs.
  2. Input files should be generated using the same Kraken database.
  3. Input files cannot be a mix of read counts/percentage kreport2mpa.py outputs. combine_mpa.py will not test the input files prior to combining.

If no header is in a given sample file, the program will number the files "Sample #1", "Sample #2", etc.

1. combine_mpa.py usage/options

python combine_mpa.py

2. combine_mpa.py example output

    #Classification                                           Sample #1    Sample #2
    k__Bacteria                                               36569         20034
    k__Bacteria|p__Proteobacteria                             21001         18023
    k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria      11648         15000
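A minimal sketch of the merge step is shown below. It assumes each sample has already been read into a dictionary of classification to value, and that a classification missing from a sample is reported as 0; both the function name `combine_mpa_tables` and the zero-fill behaviour are assumptions made for illustration.

```python
def combine_mpa_tables(samples):
    """Merge {sample_name: {classification: value}} into tab-delimited lines.

    Hypothetical helper for illustration; not part of KrakenTools.
    """
    names = list(samples)
    # Union of all classifications seen in any sample, in sorted order.
    classifications = sorted({c for table in samples.values() for c in table})
    rows = [["#Classification", *names]]
    for classification in classifications:
        # Assumption: a classification absent from a sample is filled with 0.
        values = [str(samples[n].get(classification, 0)) for n in names]
        rows.append([classification, *values])
    return ["\t".join(row) for row in rows]


combined = combine_mpa_tables({
    "Sample #1": {"k__Bacteria": 36569, "k__Bacteria|p__Proteobacteria": 21001},
    "Sample #2": {"k__Bacteria": 20034, "k__Bacteria|p__Proteobacteria": 18023},
})
print("\n".join(combined))
```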

filter_bracken_out.py

This program takes the output file of a Bracken report and filters the desired taxonomy IDs.

1. filter_bracken_out.py usage/options

python filter_bracken_out.py

Users should specify taxonomy IDs with either --include or --exclude. If both are specified, no taxonomy ID should appear in both lists, and only the taxonomy IDs to include will be evaluated.

When specifying the --include flag, only lines for the included taxonomy IDs will be extracted to the filtered output file. The percentages in the filtered file will be re-calculated so the total percentage in the output file will sum to 100%.

When specifying the --exclude flag alone, all lines in the Bracken file will be preserved EXCEPT for the lines matching taxonomy IDs provided.

2. filter_bracken_out.py example usage

This program can be useful for isolating a subset of species to better understand the distribution of those particular species in the sample.

For example:

In other cases, users may want to focus on the distribution of all species that are NOT the host species in a given sample. This program can then recalculate percentage distributions for species when excluding reads for the host.

For example, given this output:

    name                     tax_id      tax_lvl     kraken....  added...   new.... fraction...
    Homo sapiens             9606        S           ...         ....       999000  0.999000
    Streptococcus pyogenes   1314        S           ...         ....       10      0.000010
    Streptococcus agalactiae 1311        S           ...         ....       5       0.000005
    Streptococcus pneumoniae 1313        S           ...         ....       3       0.000003
    Bordetella pertussis     520         S           ...         ....       20      0.000020
    ...

Users may not be interested in the 999,000 reads that are host DNA, but would rather know the percentage of non-host reads for each of the non-host species. Using python filter_bracken_out.py [options] --exclude 9606 gives better resolution of the non-host species: each fraction of reads is recalculated out of 1,000 instead of 1,000,000 reads in the above example. The output would then be:

    name                     tax_id      tax_lvl     kraken....  added...   new.... fraction...
    Streptococcus pyogenes   1314        S           ...         ....       10      0.01000
    Streptococcus agalactiae 1311        S           ...         ....       5       0.00500
    Streptococcus pneumoniae 1313        S           ...         ....       3       0.00300
    Bordetella pertussis     520         S           ...         ....       20      0.02000
    ...
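The recalculation can be sketched as follows, using the read counts from the input table. The "(other species)" row with 962 reads is a hypothetical stand-in for the rows elided by "..." so that the non-host total comes to 1,000; the function name `filter_bracken_rows` is likewise an assumption and not the script's own code.

```python
def filter_bracken_rows(rows, exclude=()):
    """Drop excluded taxids and recompute each fraction from the reads kept.

    rows: list of (name, taxid, estimated_reads).
    Hypothetical helper for illustration; not part of KrakenTools.
    """
    excluded = set(exclude)
    kept = [row for row in rows if row[1] not in excluded]
    total = sum(reads for _, _, reads in kept)
    if total == 0:
        return []
    return [(name, taxid, reads, reads / total) for name, taxid, reads in kept]


rows = [
    ("Homo sapiens", 9606, 999000),
    ("Streptococcus pyogenes", 1314, 10),
    ("Streptococcus agalactiae", 1311, 5),
    ("Streptococcus pneumoniae", 1313, 3),
    ("Bordetella pertussis", 520, 20),
    ("(other species)", 0, 962),  # hypothetical placeholder for elided rows
]
for name, taxid, reads, fraction in filter_bracken_rows(rows, exclude=[9606]):
    print(f"{name}\t{taxid}\t{reads}\t{fraction:.5f}")
```

With the host removed, 10 reads out of 1,000 gives a fraction of 0.01000 and 20 reads gives 0.02000.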

fix_unmapped.py

When building a Kraken database, an "unmapped.txt" file may be generated if a taxonomy for a given sequence is not found. This script can search through any accession2taxid files provided and the unmapped.txt file and generate a seqid2taxid.map file to be appended to the one already generated.

1. fix_unmapped.py usage/options

python fix_unmapped.py

Optional:

2. fix_unmapped.py example usage

rm *.k2d 
mv seqid2taxid.map seqid2taxid_1.map
python fix_unmapped.py -i unmapped.txt --accession2taxid taxonomy/*accession2taxid -o seqid2taxid_temp.map 
cat seqid2taxid_1.map seqid2taxid_temp.map > seqid2taxid.map
kraken2-build --build --db . --threads 4
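The lookup that produces the extra seqid2taxid entries might be sketched as below. The sketch assumes that unmapped.txt lists one sequence ID per line, that accession2taxid files use the NCBI layout (a header line, then tab-separated accession, accession.version, taxid, gi), and that the output uses one "seqid<TAB>taxid" pair per line. The function name `map_unmapped` and the accessions in the example are hypothetical.

```python
def map_unmapped(unmapped_ids, accession2taxid_lines):
    """Return 'seqid<TAB>taxid' lines for unmapped IDs found in accession2taxid.

    Hypothetical helper for illustration; not part of KrakenTools.
    """
    wanted = {u.strip() for u in unmapped_ids if u.strip()}
    found = {}
    for line in accession2taxid_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] == "accession":
            continue  # skip the header and malformed lines
        accession, versioned, taxid = fields[0], fields[1], fields[2]
        # Assumption: an ID may appear with or without its version suffix.
        for key in (versioned, accession):
            if key in wanted and key not in found:
                found[key] = taxid
    return [f"{seqid}\t{taxid}" for seqid, taxid in sorted(found.items())]


# Made-up accessions, for illustration only.
unmapped = ["ABC00001.1", "XYZ99999.1"]
accession2taxid = [
    "accession\taccession.version\ttaxid\tgi",
    "ABC00001\tABC00001.1\t562\t0",
    "DEF00002\tDEF00002.1\t9606\t0",
]
print(map_unmapped(unmapped, accession2taxid))  # ['ABC00001.1\t562']
```

IDs that appear in no accession2taxid file (XYZ99999.1 here) simply remain unmapped.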

make_ktaxonomy.py

This program generates a single text taxonomy file, containing all of the required taxonomy information, for any Kraken 1, Kraken 2, or KrakenUniq database. The file is intended for use by future KrakenTools scripts.

Important: The output of this program does not replace any Kraken database file (do not replace your taxo.k2d or .db files).

1. make_ktaxonomy.py usage/options

python make_ktaxonomy.py

The program will inform users if a taxonomy ID is listed in the seqid2taxid.map file but not in either the nodes.dmp or the names.dmp files.

2. make_ktaxonomy.py output file format

The output file is similar to the nodes.dmp/names.dmp file format, but not identical. The following columns are separated by a tab-vertical line-tab sequence (i.e. \t|\t).

  1. taxonomy ID
  2. parent taxonomy ID
  3. rank type (R = root, D = domain/superkingdom, P = phylum, etc.)
  4. level number (distance from root)
  5. name

For ranks outside of the traditional taxonomy ranks (R, D, P, C, O, F, G, S), the rank type will be assigned based on the closest parent, with a number to specify distance from that parent. For example, strains will be labeled with S1, while ranks between Genus and Species will be labeled with G1, G2, etc.

Currently, the name for each node is the one designated as the scientific name in the names.dmp file or, if none is designated, the first name listed for that node. Scientific names are preferred over all others.
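The rank-code rule and the column layout described above can be illustrated with a short sketch. It assumes NCBI rank names as input (for example "superkingdom" or "genus"), that a line carries exactly the five columns with no trailing delimiter, and that the level number in the sample line is illustrative. The names `rank_code` and `parse_ktaxonomy_line` are hypothetical and do not come from the script.

```python
# NCBI rank names mapped to the single-letter codes listed above.
STANDARD_RANKS = {
    "superkingdom": "D", "domain": "D", "phylum": "P", "class": "C",
    "order": "O", "family": "F", "genus": "G", "species": "S",
}


def rank_code(rank_name, parent_code):
    """Return a node's rank code, given its NCBI rank name and its parent's code."""
    if rank_name in STANDARD_RANKS:
        return STANDARD_RANKS[rank_name]
    # Non-standard rank: reuse the parent's letter and bump the distance suffix.
    base, suffix = parent_code[0], parent_code[1:]
    return f"{base}{int(suffix or 0) + 1}"


def parse_ktaxonomy_line(line):
    """Split one output line on the tab-vertical line-tab delimiter."""
    taxid, parent, code, level, name = line.rstrip("\n").split("\t|\t")
    return {
        "taxid": int(taxid),
        "parent": int(parent),
        "rank": code,
        "level": int(level),
        "name": name,
    }


print(rank_code("strain", "S"))          # S1
print(rank_code("subgenus", "G"))        # G1
print(rank_code("species group", "G1"))  # G2
print(parse_ktaxonomy_line("562\t|\t561\t|\tS\t|\t8\t|\tEscherichia coli"))
```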

3. make_ktaxonomy.py required input

  1. taxonomy/nodes.dmp
  2. taxonomy/names.dmp
  3. seqid2taxid.map

4. KrakenTools scripts requiring make_ktaxonomy.py output:

  1. make_kreport.py

make_kreport.py

This program will generate a kraken-style report file from the kraken output file. Currently, this only generates reports for Kraken 1 or Kraken 2. This program does not currently work for KrakenUniq output files (to be completed in a future project).

This program requires that users first generate the taxonomy file created by make_ktaxonomy.py.

1. make_kreport.py usage/options

python make_kreport.py

Optional:

2. make_kreport.py example

Given a Kraken 2 database KRAKENDB/ and sample file EXAMPLE_READS.fq, the following commands can be used to generate a Kraken report file with this script.

python make_ktaxonomy.py --nodes KRAKENDB/taxonomy/nodes.dmp --names KRAKENDB/taxonomy/names.dmp --seqid2taxid KRAKENDB/seqid2taxid.map -o KRAKENDB/mydb_taxonomy.txt 
kraken2 --db KRAKENDB --threads 4 EXAMPLE_READS.fq > EXAMPLE.kraken2 
python make_kreport.py -i EXAMPLE.kraken2 -t KRAKENDB/mydb_taxonomy.txt -o EXAMPLE.kreport2 

3. make_kreport.py --use-read-len option

By default, the output Kraken report will list read counts for each taxonomy ID. However, if all read lengths are not the same, users can add the --use-read-len option, which will result in reporting summed read lengths for each taxon.
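The difference between the two modes can be sketched as a per-taxon tally over Kraken output lines. The sketch assumes the standard Kraken 2 output layout (taxonomy ID in the third column, read length in the fourth) and that paired reads report their lengths as "len1|len2", which are summed here; how the script itself treats paired lengths is not stated in this document. The function name `tally_by_taxid` is hypothetical.

```python
from collections import Counter


def tally_by_taxid(kraken_lines, use_read_len=False):
    """Count reads per taxid, or sum read lengths when use_read_len is set.

    Hypothetical helper for illustration; not part of KrakenTools.
    """
    totals = Counter()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        taxid = fields[2]
        if use_read_len:
            # Assumption: paired-read lengths "len1|len2" are added together.
            totals[taxid] += sum(int(n) for n in fields[3].split("|"))
        else:
            totals[taxid] += 1
    return totals


# Illustrative Kraken output lines (made-up reads).
lines = [
    "C\tread1\t562\t100\t562:66",
    "C\tread2\t562\t250\t562:216",
    "C\tread3\t9606\t150\t9606:116",
]
print(dict(tally_by_taxid(lines)))                     # {'562': 2, '9606': 1}
print(dict(tally_by_taxid(lines, use_read_len=True)))  # {'562': 350, '9606': 150}
```

With reads of unequal length, the two long and short E. coli reads count as 2 reads by default but as 350 bases with the read-length tally.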

4. make_kreport.py output format

The output format for make_kreport.py is identical to the format generated by kraken-report or the --report switch with kraken2. The output file contains 6 tab-delimited columns as follows:

  1. Percentage of total reads
  2. Reads classified within sub-tree
  3. Reads classified at this specific node (reads cannot be more specifically classified)
  4. Level type (R = root, K = kingdom, P = phylum, etc.)
  5. Taxonomy ID
  6. Name (preceded by spaces to indicate distance from root)

Author Information

Jennifer Lu
jennifer.lu717@gmail.com
jlu26@jhmi.edu

Page Updated: 2020/05/10