phold - Phage Annotation using Protein Structures

<p align="center"> <img src="img/phold_logo.png" alt="phold Logo" height=250> </p>

phold is a sensitive annotation tool for bacteriophage genomes and metagenomes using protein structural homology.

phold uses the ProstT5 protein language model to rapidly translate protein amino acid sequences to the 3Di token alphabet used by Foldseek. Foldseek is then used to search these against a database of over 1 million phage protein structures mostly predicted using Colabfold.

<p align="center"> <img src="img/phold_workflow.png" alt="phold workflow" height=300> </p>

Alternatively, if you have pre-computed protein structures for your phage(s), you can supply them instead of using ProstT5 by passing the --structures and --structure_dir parameters to phold compare.
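As a rough sketch of that workflow, the command below combines the two parameters named above with the -i, -o and -t options used elsewhere in this README. The structure directory and output directory names are hypothetical placeholders, and the sketch assumes no further flags are needed; check phold compare --help for your installed version.

```shell
# Hypothetical paths: adjust to your own data (they must not contain spaces here).
INPUT="tests/test_data/NC_043029.gbk"   # GenBank input for the phage
STRUCTURE_DIR="my_structures"           # assumed name: directory of pre-computed structures
OUTDIR="test_output_phold_structures"   # assumed name: output directory

# --structures switches phold compare to pre-computed structures;
# --structure_dir points at the directory that holds them.
CMD="phold compare -i $INPUT --structures --structure_dir $STRUCTURE_DIR -o $OUTDIR -t 8"
echo "$CMD"

# Only execute when phold is installed and the structure directory exists.
if command -v phold >/dev/null 2>&1 && [ -d "$STRUCTURE_DIR" ]; then
    $CMD
fi
```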

Benchmarking is ongoing, but in the results so far phold strongly outperforms Pharokka, particularly for less characterised phages such as those from metagenomic datasets.

The plot below shows the percentage of annotated coding sequences (CDS) for 179 metagenomic phage genomes assembled with phables. phold v0.2.0, run both with default settings (using ProstT5) and with pre-computed protein structures (predicted with Colabfold), was compared against Pharokka v1.7.0.

<p align="center"> <img src="img/phables_bench.jpeg" alt="phables benchmarking" height=200> </p>

If you have already annotated your phage(s) with Pharokka, phold takes the Genbank output of Pharokka as an input option, so you can easily update the annotation with more functional predictions!

Tutorial

Check out the phold tutorial at https://phold.readthedocs.io/en/latest/tutorial/.

Google Colab Notebooks

If you don't want to install phold locally, you can run it without any code using one of the following Google Colab notebooks:

Documentation

Check out the full documentation at https://phold.readthedocs.io.

Installation

For more details (particularly if you are using a non-NVIDIA GPU), check out the installation documentation.

The best way to install phold is using mamba, as this will install Foldseek (the only non-Python dependency) along with the Python dependencies.

To install phold using mamba:

mamba create -n pholdENV -c conda-forge -c bioconda phold 

To use phold with a GPU, a GPU-compatible version of PyTorch must be installed. By default, conda/mamba will install a CPU-only version.

If you have an NVIDIA GPU, please try:

mamba create -n pholdENV -c conda-forge -c bioconda phold pytorch=*=cuda*

If you have a Mac running an Apple Silicon chip (M1/M2/M3), phold should be able to use the GPU. Please try:

mamba create -n pholdENV python==3.11  
conda activate pholdENV
mamba install pytorch::pytorch torchvision torchaudio -c pytorch 
mamba install -c conda-forge -c bioconda phold 

If you are having trouble with pytorch see this link for more instructions. If you have an older version of CUDA installed, then you might find this link useful.
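To see which backend the installed PyTorch build can actually use, a quick check along these lines may help. It is only a sketch: it assumes python3 on your PATH is the one inside the activated pholdENV environment, and it relies on the standard torch.cuda.is_available() and torch.backends.mps.is_available() calls rather than anything phold-specific.

```shell
# Report which accelerator (if any) the installed PyTorch build can see.
if python3 -c "import torch" >/dev/null 2>&1; then
    GPU_STATUS=$(python3 -c "
import torch
# torch.backends.mps is absent in older PyTorch releases, so look it up defensively
mps = getattr(torch.backends, 'mps', None)
if torch.cuda.is_available():
    print('cuda')
elif mps is not None and mps.is_available():
    print('mps')
else:
    print('cpu-only')
")
else
    GPU_STATUS="torch-not-installed"
fi
echo "PyTorch backend: $GPU_STATUS"
```

A result of cpu-only suggests that the CPU-only PyTorch build was installed and that one of the GPU-specific install commands above is worth retrying.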

Once phold is installed, to download and install the database run:

phold install

Quick Start

phold run performs phold predict then phold compare in one command:

phold run -i tests/test_data/NC_043029.gbk -o test_output_phold -t 8

Alternatively, the two steps can be run separately:

  1. Predict the 3Di sequences with ProstT5 using phold predict. This is massively accelerated if a GPU is available.

phold predict -i tests/test_data/NC_043029.gbk -o test_predictions

  2. Compare the 3Di sequences to the phold structure database with Foldseek using phold compare. This does not utilise a GPU.

phold compare -i tests/test_data/NC_043029.gbk --predictions_dir test_predictions -o test_output_phold -t 8
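If no GPU is available, the --cpu flag listed in the Usage section can be added to phold run. The following is a minimal sketch under that assumption; the output directory name is a hypothetical placeholder, and -f is included only so that a rerun overwrites the previous output.

```shell
# Hypothetical output directory; the input is the test genome used above.
INPUT="tests/test_data/NC_043029.gbk"
OUTDIR="test_output_phold_cpu"
THREADS=8

# --cpu forces CPU-only execution; -f overwrites OUTDIR if it already exists.
CMD="phold run -i $INPUT -o $OUTDIR -t $THREADS --cpu -f"
echo "$CMD"

# Only execute when phold is installed and the input file is present.
if command -v phold >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    $CMD
fi
```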

Output

Usage

Usage: phold [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  citation          Print the citation(s) for this tool
  compare           Runs Foldseek vs phold db
  createdb          Creates foldseek DB from AA FASTA and 3Di FASTA input...
  install           Installs ProstT5 model and phold database
  plot              Creates Phold Circular Genome Plots
  predict           Uses ProstT5 to predict 3Di tokens - GPU recommended
  proteins-compare  Runs Foldseek vs phold db on proteins input
  proteins-predict  Runs ProstT5 on a multiFASTA input - GPU recommended
  remote            Uses Foldseek API to run ProstT5 then Foldseek locally
  run               phold predict then compare all in one - GPU recommended

Usage: phold run [OPTIONS]

  phold predict then compare all in one - GPU recommended

Options:
  -h, --help                     Show this message and exit.
  -V, --version                  Show the version and exit.
  -i, --input PATH               Path to input file in Genbank format or
                                 nucleotide FASTA format  [required]
  -o, --output PATH              Output directory   [default: output_phold]
  -t, --threads INTEGER          Number of threads  [default: 1]
  -p, --prefix TEXT              Prefix for output files  [default: phold]
  -d, --database TEXT            Specific path to installed phold database
  -f, --force                    Force overwrites the output directory
  --batch_size INTEGER           batch size for ProstT5. 1 is usually fastest.
                                 [default: 1]
  --cpu                          Use cpus only.
  --omit_probs                   Do not output 3Di probabilities from ProstT5
  --finetune                     Use finetuned ProstT5 model (PhrostT5).
                                 Experimental and not recommended for now
  --finetune_path TEXT           Path to finetuned model weights
  --save_per_residue_embeddings  Save the ProstT5 embeddings per residue in a
                                 h5 file
  --save_per_protein_embeddings  Save the ProstT5 embeddings as means per
                                 protein in a h5 file
  -e, --evalue FLOAT             Evalue threshold for Foldseek  [default:
                                 1e-3]
  -s, --sensitivity FLOAT        Sensitivity parameter for foldseek  [default:
                                 9.5]
  --keep_tmp_files               Keep temporary intermediate files,
                                 particularly the large foldseek_results.tsv
                                 of all Foldseek hits
  --card_vfdb_evalue FLOAT       Stricter Evalue threshold for Foldseek CARD
                                 and VFDB hits  [default: 1e-10]
  --separate                     Output separate GenBank files for each contig
  --max_seqs INTEGER             Maximum results per query sequence allowed to
                                 pass the prefilter. You may want to reduce
                                 this to save disk space for enormous datasets
                                 [default: 10000]
  --only_representatives         Foldseek search only against the cluster
                                 representatives (i.e. turn off --cluster-
                                 search 1 Foldseek parameter)
  --ultra_sensitive              Runs phold with maximum sensitivity by
                                 skipping Foldseek prefilter. Not recommended
                                 for large datasets.
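As an illustration of how several of these options combine, the sketch below runs phold with a custom prefix, a stricter E-value, a smaller prefilter limit, one GenBank file per contig, and the intermediate Foldseek results retained. All flags come from the help text above; the specific values and the output directory name are arbitrary examples, not recommendations.

```shell
# Hypothetical output directory and example values; tune them for your own dataset.
INPUT="tests/test_data/NC_043029.gbk"
OUTDIR="test_output_phold_custom"

# -p sets the output file prefix, -e tightens the Foldseek E-value threshold,
# --max_seqs lowers the prefilter limit to save disk space,
# --separate writes one GenBank file per contig,
# --keep_tmp_files retains foldseek_results.tsv for inspection.
CMD="phold run -i $INPUT -o $OUTDIR -t 8 -p NC_043029 -e 1e-5 --max_seqs 1000 --separate --keep_tmp_files"
echo "$CMD"

# Only execute when phold is installed and the input file is present.
if command -v phold >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    $CMD
fi
```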

Plotting

phold plot creates Circos plots with pyCirclize for all of your phage(s). For example:

phold plot -i tests/test_data/NC_043029_phold_output.gbk  -o NC_043029_phold_plots -t '${Stenotrophomonas}$ Phage SMA6'  
<p align="center"> <img src="img/NC_043029.png" alt="NC_043029" height=600> </p>

Citation

phold is a work in progress and a preprint will be coming soon. If you use it, please cite the GitHub repository https://github.com/gbouras13/phold for now.

Please be sure to cite the following core dependencies and PHROGs database:

Please also consider citing these supplementary databases where relevant: