geNomad

geNomad: Identification of mobile genetic elements

Features

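geNomad's primary goal is to identify viruses and plasmids in sequencing data (isolates, metagenomes, and metatranscriptomes). It also provides additional features that can help your analysis, such as taxonomic assignment of viruses, identification of proviruses integrated into host genomes, and functional annotation of genes.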

Documentation

For installation instructions, information about how geNomad works, and a detailed explanation of how to execute it, please check the full documentation: https://portal.nersc.gov/genomad/

Web app

geNomad is available as a web app in the NMDC EDGE platform. There you can upload your sequence data, visualize the results in your browser, and download the data to your computer.

Citing geNomad

If you use geNomad in your work, please consider citing its manuscript:

Identification of mobile genetic elements with geNomad

Camargo, A. P., Roux, S., Schulz, F., Babinski, M., Xu, Y., Hu, B., Chain, P. S. G., Nayfach, S., & Kyrpides, N. C. Nature Biotechnology (2023). DOI: 10.1038/s41587-023-01953-y.

Quick start

We recommend reading the documentation before you start using geNomad. If you are in a rush, however, you can follow this quick step-by-step example.

Installation

First, you need to install geNomad. There are a few ways to do that, but two convenient options are Pixi and Mamba. Both will handle the installation of all dependencies for you.

Pixi allows you to install geNomad as a globally available command for easy execution.

pixi global install -c conda-forge -c bioconda genomad

With Mamba, you will create an environment for geNomad and activate it before being able to use it.

# Create an environment for geNomad
mamba create -n genomad -c conda-forge -c bioconda genomad
# Activate the geNomad environment
mamba activate genomad

Another option is to use geNomad through Docker.

# Pull the image
docker pull antoniopcamargo/genomad
# Run the image
docker run --rm -ti -v "$(pwd):/app" antoniopcamargo/genomad

Downloading the database

geNomad depends on a database that contains the profiles of the markers that are used to classify sequences, their taxonomic information, their functional annotation, etc. So, you should first download the database to your current directory:

genomad download-database .

The database will be contained within the genomad_db directory.

If you prefer, you can also download the database from Zenodo and extract it manually.

Executing geNomad

Now you are ready to go! geNomad works by executing a series of modules sequentially (you can find more information about this in the pipeline documentation), but we provide a convenient end-to-end command that will execute the entire pipeline for you in one go.

In this example, we will use a Klebsiella pneumoniae genome (GCF_009025895.1) as input. You can use any FASTA file containing nucleotide sequences as input. geNomad will work for isolate genomes, metagenomes, and metatranscriptomes.

The command to execute geNomad is structured like this:

genomad end-to-end [OPTIONS] INPUT OUTPUT DATABASE

So, to run the full geNomad pipeline (end-to-end command), taking a nucleotide FASTA file (GCF_009025895.1.fna.gz) and the database (genomad_db) as input, we will execute the following command:

genomad end-to-end --cleanup --splits 8 GCF_009025895.1.fna.gz genomad_output genomad_db

The results will be written inside the genomad_output directory.
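If you launch geNomad from a script, the command structure above can be assembled programmatically. The following is a minimal sketch, not part of geNomad itself: the helper name build_end_to_end_command and its parameters are hypothetical, and it only uses the --cleanup and --splits options that appear in the command above.

```python
import shlex


def build_end_to_end_command(input_fasta, output_dir, database_dir,
                             splits=None, cleanup=False):
    """Return the argv list for `genomad end-to-end [OPTIONS] INPUT OUTPUT DATABASE`.

    Hypothetical helper for illustration; it does not run geNomad.
    """
    argv = ["genomad", "end-to-end"]
    if cleanup:
        argv.append("--cleanup")
    if splits is not None:
        argv += ["--splits", str(splits)]
    # Positional arguments come last, in the order INPUT OUTPUT DATABASE.
    argv += [str(input_fasta), str(output_dir), str(database_dir)]
    return argv


argv = build_end_to_end_command(
    "GCF_009025895.1.fna.gz", "genomad_output", "genomad_db",
    splits=8, cleanup=True,
)
# Prints the same command line that is shown above.
print(shlex.join(argv))
```

In an environment where geNomad is installed, the resulting list could be passed to subprocess.run(argv, check=True) to execute the pipeline.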

Three important details about the command above:

Note: By default, geNomad applies a series of post-classification filters to remove likely false positives. For example, sequences are required to have a plasmid or virus score of at least 0.7 and sequences shorter than 2,500 bp are required to encode at least one hallmark gene. If you want to disable the post-classification filters, add the --relaxed flag to your command. On the other hand, if you want to be very conservative with your classification, you may use the --conservative flag. This will make the post-classification filters more aggressive, preventing sequences without strong support from being classified as plasmid or virus. You can check out the default, relaxed, and conservative post-classification filters here.
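To make the two example criteria concrete, here is a small sketch of how they combine. It encodes only the two thresholds quoted in the note (score of at least 0.7, and at least one hallmark gene for sequences shorter than 2,500 bp). geNomad's actual default filters include further criteria that are not reproduced here, and the function name is hypothetical.

```python
MIN_SCORE = 0.7        # minimum plasmid or virus score, from the note above
SHORT_SEQ_BP = 2500    # sequences shorter than this need a hallmark gene


def passes_example_filters(score, length, n_hallmarks):
    """Apply only the two example criteria named in the note (illustrative)."""
    if score < MIN_SCORE:
        return False
    if length < SHORT_SEQ_BP and n_hallmarks < 1:
        return False
    return True


# A long, high-scoring sequence passes.
print(passes_example_filters(score=0.9776, length=49101, n_hallmarks=14))
# A short sequence without hallmark genes is rejected even with a high score.
print(passes_example_filters(score=0.95, length=2000, n_hallmarks=0))
```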

Understanding the outputs

In this example, the results of geNomad's analysis will be written to the genomad_output directory, which will look like this:

genomad_output
├── GCF_009025895.1_aggregated_classification
├── GCF_009025895.1_aggregated_classification.log
├── GCF_009025895.1_annotate
├── GCF_009025895.1_annotate.log
├── GCF_009025895.1_find_proviruses
├── GCF_009025895.1_find_proviruses.log
├── GCF_009025895.1_marker_classification
├── GCF_009025895.1_marker_classification.log
├── GCF_009025895.1_nn_classification
├── GCF_009025895.1_nn_classification.log
├── GCF_009025895.1_summary
╰── GCF_009025895.1_summary.log

As mentioned above, geNomad works by executing several modules sequentially. Each one of these will produce a log file (<prefix>_<module>.log) and a subdirectory (<prefix>_<module>).
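The <prefix>_<module> naming pattern makes it easy to pair each log file with its subdirectory. The sketch below works on a plain list of entry names (for example, the result of os.listdir("genomad_output")). The function name group_by_module is hypothetical, and it assumes that every entry ending in .log is a log file and every other entry is a module subdirectory.

```python
def group_by_module(entry_names, prefix):
    """Map each module name to the log file and subdirectory found for it."""
    modules = {}
    for name in entry_names:
        if not name.startswith(prefix + "_"):
            continue  # entry was not produced for this input prefix
        rest = name[len(prefix) + 1:]
        if rest.endswith(".log"):
            modules.setdefault(rest[:-len(".log")], {})["log"] = name
        else:
            modules.setdefault(rest, {})["dir"] = name
    return modules


entries = [
    "GCF_009025895.1_annotate",
    "GCF_009025895.1_annotate.log",
    "GCF_009025895.1_summary",
    "GCF_009025895.1_summary.log",
]
modules = group_by_module(entries, "GCF_009025895.1")
for module, paths in sorted(modules.items()):
    print(module, paths["dir"], paths["log"])
```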

For this example, we will only look at the files within GCF_009025895.1_summary. The <prefix>_summary directory contains files that summarize the results that were generated across the pipeline. If you just want a list of the plasmids and viruses identified in your input, this is what you are looking for.

genomad_output
╰── GCF_009025895.1_summary
    ├── GCF_009025895.1_plasmid.fna
    ├── GCF_009025895.1_plasmid_genes.tsv
    ├── GCF_009025895.1_plasmid_proteins.faa
    ├── GCF_009025895.1_plasmid_summary.tsv
    ├── GCF_009025895.1_summary.json
    ├── GCF_009025895.1_virus.fna
    ├── GCF_009025895.1_virus_genes.tsv
    ├── GCF_009025895.1_virus_proteins.faa
    ╰── GCF_009025895.1_virus_summary.tsv

First, let's look at GCF_009025895.1_virus_summary.tsv:

seq_name                                 length   topology              coordinates       n_genes   genetic_code   virus_score   fdr   n_hallmarks   marker_enrichment   taxonomy
--------------------------------------   ------   -------------------   ---------------   -------   ------------   -----------   ---   -----------   -----------------   -----------------------------------------------------------------
NZ_CP045015.1|provirus_2885510_2934610   49101    Provirus              2885510-2934610   69        11             0.9776        NA    14            76.0892             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
NZ_CP045015.1|provirus_3855947_3906705   50759    Provirus              3855947-3906705   79        11             0.9774        NA    16            75.1552             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
NZ_CP045018.1                            51887    No terminal repeats   NA                57        11             0.9774        NA    14            67.7749             Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
…

This tabular file lists all the viruses that geNomad found in your input and gives you some convenient information about them. Here's what each column contains:

In our example, geNomad identified several proviruses integrated into the K. pneumoniae genome and one extrachromosomal phage. Since they all have high scores and marker enrichment, we can be confident that these are indeed viruses. They were all predicted to use genetic code 11 and were assigned to the Caudoviricetes class, which contains all the tailed bacteriophages. In the taxonomy field for these viruses, after Caudoviricetes, there are two consecutive semicolons because geNomad could only assign them to the class level, leaving the order and family ranks empty.
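The taxonomy field can be split on semicolons to recover the assigned ranks. The following sketch assumes the lineage always has seven fields in the order shown in the example; the rank labels are inferred from the text (class, order, family) and from standard virus taxonomy, and "root" is just a label chosen here for the leading "Viruses" field.

```python
# Assumed rank order for the seven semicolon-separated fields.
RANKS = ["root", "realm", "kingdom", "phylum", "class", "order", "family"]


def parse_lineage(taxonomy):
    """Return a dict of rank -> name, skipping ranks left empty."""
    names = taxonomy.split(";")
    return {rank: name for rank, name in zip(RANKS, names) if name}


def lowest_assigned_rank(taxonomy):
    """Return (rank, name) for the most specific non-empty rank, or None."""
    lineage = parse_lineage(taxonomy)
    if not lineage:
        return None
    rank = list(lineage)[-1]  # dicts preserve insertion order
    return rank, lineage[rank]


taxonomy = "Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;"
print(lowest_assigned_rank(taxonomy))  # ('class', 'Caudoviricetes')
```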

Another important file is GCF_009025895.1_virus_genes.tsv. During its execution, geNomad annotates the genes encoded by the input sequences using a database of chromosome, plasmid, and virus-specific markers. The <prefix>_virus_genes.tsv file summarizes the annotation of the genes encoded by the identified viruses.

gene              start   end     length   strand   gc_content   genetic_code   rbs_motif     marker              evalue       bitscore   uscg   plasmid_hallmark   virus_hallmark   taxid   taxname          annotation_conjscan   annotation_amr   annotation_accessions              annotation_description
---------------   -----   -----   ------   ------   ----------   ------------   -----------   -----------------   ----------   --------   ----   ----------------   --------------   -----   --------------   -------------------   --------------   --------------------------------   --------------------------------------------------------------------------------------
NZ_CP045018.1_1   1       399     399      1        0.536        11             None          GENOMAD.108715.VP   2.536e-32    123        0      0                  1                2561    Caudoviricetes   NA                    NA               PF05100;COG4672;TIGR01600          Phage minor tail protein L
NZ_CP045018.1_2   401     1111    711      1        0.568        11             AGGAG         GENOMAD.168265.VP   9.279e-47    170        0      0                  0                2561    Caudoviricetes   NA                    NA               PF14464;COG1310;K21140;TIGR02256   Proteasome lid subunit RPN8/RPN11, contains Jab1/MPN domain metalloenzyme (JAMM) motif
NZ_CP045018.1_3   1143    1493    351      1        0.382        11             AGGAG         GENOMAD.147875.VV   1.495e-14    71         0      0                  0                2561    Caudoviricetes   NA                    NA               COG5633;TIGR03066                  NA
NZ_CP045018.1_4   1509    2120    612      1        0.477        11             GGA/GAG/AGG   GENOMAD.143103.VP   1.958e-50    179        0      0                  1                2561    Caudoviricetes   NA                    NA               PF06805;COG4723;TIGR01687          Phage-related protein, tail component
NZ_CP045018.1_5   2183    13516   11334    1        0.566        11             None          GENOMAD.159864.VP   1.225e-268   923        0      0                  0                2561    Caudoviricetes   NA                    NA               PF12421;PF09327                    Fibronectin type III protein
NZ_CP045018.1_6   13585   15084   1500     1        0.550        11             AGGAG         GENOMAD.195756.VP   2.017e-14    79         0      0                  0                2561    Caudoviricetes   NA                    NA               NA                                 NA
NZ_CP045018.1_7   15163   16128   966      -1       0.469        11             GGAGG         NA                  NA           NA         0      0                  0                1       NA               NA                    NA               NA                                 NA
…

The columns in this file are:

In the example above we can see the information of the first seven genes encoded by NZ_CP045018.1. The last entry didn't match any geNomad marker. The first six were all assigned to protein families, some of which are typical of tailed bacteriophages (such as the minor tail protein), reassuring us that these are indeed Caudoviricetes.
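A quick way to get the same overview from a script is to count which genes matched a marker and which are flagged as virus hallmarks. This is a sketch under two assumptions: that the file on disk is tab-separated with a header row matching the column names shown above, and that missing values are written as NA. The sample data is a trimmed copy of four of the rows above, keeping only four columns.

```python
import csv
import io

# Trimmed copy of rows from the table above (four columns only).
SAMPLE = "\n".join("\t".join(row) for row in [
    ["gene", "marker", "virus_hallmark", "annotation_description"],
    ["NZ_CP045018.1_1", "GENOMAD.108715.VP", "1", "Phage minor tail protein L"],
    ["NZ_CP045018.1_4", "GENOMAD.143103.VP", "1", "Phage-related protein, tail component"],
    ["NZ_CP045018.1_6", "GENOMAD.195756.VP", "0", "NA"],
    ["NZ_CP045018.1_7", "NA", "0", "NA"],
])


def summarize_genes(handle):
    """Count genes, marker matches, and virus hallmarks in a genes table."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    return {
        "n_genes": len(rows),
        "with_marker": [r["gene"] for r in rows if r["marker"] != "NA"],
        "virus_hallmarks": [r["gene"] for r in rows if r["virus_hallmark"] == "1"],
    }


summary = summarize_genes(io.StringIO(SAMPLE))
print(summary["n_genes"], len(summary["with_marker"]), summary["virus_hallmarks"])
```

For a real run, the same function could be given an open file handle, for example open("genomad_output/GCF_009025895.1_summary/GCF_009025895.1_virus_genes.tsv", newline="").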

One important detail here is that the primary purpose of geNomad's markers is classification. They were designed to be specific to chromosomes, plasmids, or viruses, enabling the distinction of sequences belonging to these classes. Therefore, you should not expect that every single viral gene will be annotated with a geNomad marker. If you want to annotate the genes within your sequences as thoroughly as possible, you should use databases such as Pfam or COG.

The other two virus-related files within the summary directory are GCF_009025895.1_virus.fna and GCF_009025895.1_virus_proteins.faa. These are FASTA files of the identified virus sequences and their proteins, respectively. Proviruses are automatically excised from the host sequence.
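Because excised proviruses are named after their host sequence and coordinates (as in the seq_name column of the virus summary), the name alone is enough to map a provirus back to its host. The sketch below assumes the naming pattern seen in the example, <host>|provirus_<start>_<end>, with 1-based inclusive coordinates. That assumption is consistent with the lengths reported in the summary table; the function name is hypothetical.

```python
import re

# Pattern inferred from names such as "NZ_CP045015.1|provirus_2885510_2934610".
PROVIRUS_NAME = re.compile(r"^(?P<host>.+)\|provirus_(?P<start>\d+)_(?P<end>\d+)$")


def parse_provirus_name(seq_name):
    """Return host, start, end, and length for a provirus name, else None."""
    match = PROVIRUS_NAME.match(seq_name)
    if match is None:
        return None  # not a provirus, e.g. an extrachromosomal phage
    start, end = int(match["start"]), int(match["end"])
    return {
        "host": match["host"],
        "start": start,
        "end": end,
        "length": end - start + 1,  # inclusive coordinates
    }


info = parse_provirus_name("NZ_CP045015.1|provirus_2885510_2934610")
print(info["host"], info["length"])  # NZ_CP045015.1 49101
```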

Moving on to plasmids, the data related to their identification can be found in the <prefix>_plasmid_summary.tsv, <prefix>_plasmid_genes.tsv, <prefix>_plasmid.fna, and <prefix>_plasmid_proteins.faa files. These are very similar to their virus counterparts. Compared with the virus summary, <prefix>_plasmid_summary.tsv (shown below) has no coordinates or taxonomy columns, reports a plasmid_score instead of a virus_score, and adds the conjugation_genes and amr_genes columns:

seq_name        length   topology              n_genes   genetic_code   plasmid_score   fdr   n_hallmarks   marker_enrichment   conjugation_genes                                                                                       amr_genes
-------------   ------   -------------------   -------   ------------   -------------   ---   -----------   -----------------   -----------------------------------------------------------------------------------------------------   -----------------------------------
NZ_CP045020.1   28729    No terminal repeats   36        11             0.9955          NA    7             25.8098             F_traE                                                                                                  NA
NZ_CP045022.1   50635    No terminal repeats   61        11             0.9947          NA    9             46.4657             T_virB1;T_virB3;virb4;T_virB5;T_virB6;T_virB8;T_virB9                                                   NA
NZ_CP045019.1   44850    No terminal repeats   52        11             0.9945          NA    3             28.7110             F_traE                                                                                                  NA
NZ_CP045016.1   82240    No terminal repeats   110       11             0.9939          NA    11            33.4021             T_virB8;T_virB9;F_traF;F_traH;F_traG;T_virB1                                                            NF000225;NF000270;NF012171;NF000052
NZ_CP045017.1   61331    No terminal repeats   76        11             0.9934          NA    16            36.2817             I_trbB;I_trbA;MOBP1;I_traI;I_traK;I_traL;I_traN;I_traO;I_traP;I_traQ;I_traR;traU;I_traW;I_traY;F_traE   NA
NZ_CP045021.1   5251     No terminal repeats   7         11             0.9910          NA    1             1.4225              NA                                                                                                      NA
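The conjugation_genes and amr_genes columns hold semicolon-separated lists, so a short script can flag plasmids that carry both. This sketch assumes the file on disk is tab-separated with the header shown above and that NA means "none found". The sample data copies three of the rows above with only four columns, and the function names are hypothetical.

```python
import csv
import io

# Trimmed copy of three rows from the table above (four columns only).
SAMPLE = "\n".join("\t".join(row) for row in [
    ["seq_name", "plasmid_score", "conjugation_genes", "amr_genes"],
    ["NZ_CP045020.1", "0.9955", "F_traE", "NA"],
    ["NZ_CP045016.1", "0.9939", "T_virB8;T_virB9;F_traF;F_traH;F_traG;T_virB1",
     "NF000225;NF000270;NF012171;NF000052"],
    ["NZ_CP045021.1", "0.9910", "NA", "NA"],
])


def split_list(value):
    """Turn a semicolon-separated field into a list; NA becomes an empty list."""
    return [] if value == "NA" else value.split(";")


def load_plasmids(handle):
    """Read a plasmid summary table and expand its list-valued columns."""
    plasmids = []
    for row in csv.DictReader(handle, delimiter="\t"):
        plasmids.append({
            "seq_name": row["seq_name"],
            "plasmid_score": float(row["plasmid_score"]),
            "conjugation_genes": split_list(row["conjugation_genes"]),
            "amr_genes": split_list(row["amr_genes"]),
        })
    return plasmids


plasmids = load_plasmids(io.StringIO(SAMPLE))
both = [p["seq_name"] for p in plasmids if p["conjugation_genes"] and p["amr_genes"]]
print(both)  # ['NZ_CP045016.1']
```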