In-silico Protein Design Pipeline

This repository contains the in-silico protein design and evaluation pipeline that we used to assess Genie 2. We maintain it separately from the Genie 2 repository to make it easier to assess different structure-based protein diffusion models. The pipeline consists of:

Set up

The setup assumes an environment with a CUDA-compatible PyTorch installation and Python <= 3.9. For example, on our own machine, we create and initialize the environment by running:

python3.9 -m venv insilico_pipeline_venv
source insilico_pipeline_venv/bin/activate
module load cuda11.8
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 --index-url https://download.pytorch.org/whl/cu118

The setup process consists of three parts:

Additional notes

When setting up the environment for ESMFold, we install OpenFold v1.0.1 to ensure compatibility. One known issue with this OpenFold installation is an incompatibility with deepspeed: running the pipeline raises AttributeError: module 'deepspeed.utils' has no attribute 'is_initialized'. This can be fixed by replacing all occurrences of deepspeed.utils.is_initialized() with deepspeed.comm.comm.is_initialized().
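The replacement described above can be scripted. The following is a minimal sketch, not part of this repository: the function name patch_deepspeed_calls is hypothetical, and it assumes the affected calls appear verbatim in .py files under the directory you pass in (for example, the installed openfold package directory).

```python
from pathlib import Path

OLD_CALL = "deepspeed.utils.is_initialized()"
NEW_CALL = "deepspeed.comm.comm.is_initialized()"


def patch_deepspeed_calls(package_dir):
    """Rewrite OLD_CALL to NEW_CALL in every .py file under package_dir.

    Returns the list of files that were modified (hypothetical helper).
    """
    patched = []
    for path in sorted(Path(package_dir).rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        if OLD_CALL in text:
            # Only rewrite files that actually contain the old call.
            path.write_text(text.replace(OLD_CALL, NEW_CALL), encoding="utf-8")
            patched.append(path)
    return patched
```

Calling the function a second time on the same directory returns an empty list, so it is safe to re-run after reinstalling.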

Pipelines

Our design package consists of three separate pipelines:

Standard pipeline (pipeline/standard)

Evaluate a set of generated structures by running

python pipeline/standard/evaluate.py --version [VERSION] --rootdir [ROOTDIR]

The standard pipeline currently supports structures from unconditional generation (set version to unconditional) and from motif scaffolding (set version to scaffold). In both modes, the root directory must contain a folder named pdbs holding the PDB files of the generated structures to be evaluated. For motif scaffolding, the root directory must also contain a folder named motif_pdbs holding the PDB file of the corresponding motif structure for each generated structure, with the same filename and with residue indices aligned.

For motif scaffolding, multiple problems can be evaluated at once: the root directory may instead contain a set of subdirectories, each with its own pdbs and motif_pdbs folders as described above. When evaluating multiple motif scaffolding problems, tasks can be distributed across multiple GPUs by adding the flags --num_devices [NUM_GPUS] --num_processes [NUM_GPUS].
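Before launching an evaluation, it can help to confirm that a directory follows the layout described above. The sketch below is illustrative only: check_problem_dir is a hypothetical helper, not part of the pipeline, and it assumes structure files use the .pdb extension. It does not check residue index alignment.

```python
from pathlib import Path


def check_problem_dir(problem_dir, version):
    """Return a list of layout problems found in one root (or per-problem) directory."""
    problem_dir = Path(problem_dir)
    pdbs = problem_dir / "pdbs"
    if not pdbs.is_dir():
        return [f"{problem_dir}: missing 'pdbs' folder"]

    issues = []
    # Assumption: generated structures are stored as *.pdb files.
    generated = {p.name for p in pdbs.glob("*.pdb")}
    if not generated:
        issues.append(f"{pdbs}: no PDB files found")

    if version == "scaffold":
        motifs = problem_dir / "motif_pdbs"
        if not motifs.is_dir():
            issues.append(f"{problem_dir}: missing 'motif_pdbs' folder")
        else:
            # Every generated structure needs a motif file with the same filename.
            motif_names = {p.name for p in motifs.glob("*.pdb")}
            for name in sorted(generated - motif_names):
                issues.append(f"{motifs}: no motif file named {name}")
    return issues
```

An empty list means the folder names and filenames match what the standard pipeline expects for the given version.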

Evaluation results are stored in the root directory, which contains:

Note that for secondary structure evaluations, we use the P-SEA algorithm, which predicts secondary structures from Cα atoms only.

Diversity pipeline (pipeline/diversity)

Assume that a set of generated structures has already been assessed by the standard pipeline above. Evaluate the tertiary diversity of this set by running

python pipeline/diversity/evaluate.py --rootdir [ROOTDIR] --num_cpus [NUM_CPUS]

The default value of num_cpus is 1, which can be slow. We recommend setting it to the number of physical cores or to the number of processes you want to run in parallel.

Results are stored by updating info.csv in the root directory to include

| Column | Description |
| --- | --- |
| single_cluster_idx | Index of the cluster that the generated structure belongs to (hierarchically clustered via single linkage) |
| complete_cluster_idx | Index of the cluster that the generated structure belongs to (hierarchically clustered via complete linkage) |
| average_cluster_idx | Index of the cluster that the generated structure belongs to (hierarchically clustered via average linkage) |

Note that for hierarchical clustering, we use TMalign to compute pairwise TM scores among all generated structures and a TM score threshold of 0.6 in the clustering process.
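To illustrate how a TM score threshold turns pairwise scores into cluster indices, here is a minimal pure-Python sketch of the single-linkage case. It is not the pipeline's implementation. It assumes a precomputed symmetric TM score matrix and assumes that a pair with TM score at or above the threshold is linked; cutting a single-linkage dendrogram at a fixed distance is equivalent to taking connected components of that link graph.

```python
def single_linkage_clusters(tm, threshold=0.6):
    """Assign a cluster index to each structure.

    tm is a symmetric matrix (list of lists) of pairwise TM scores.
    Two structures share a cluster when a chain of pairs with
    TM score >= threshold connects them.
    """
    n = len(tm)
    parent = list(range(n))  # union-find forest, one node per structure

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if tm[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    # Relabel roots as consecutive indices in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


scores = [
    [1.0, 0.7, 0.2],
    [0.7, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
print(single_linkage_clusters(scores))  # prints [0, 0, 1]
```

Complete and average linkage cannot be reduced to connected components in this way; they require a full agglomerative procedure, such as the one provided by scipy.cluster.hierarchy.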

Novelty pipeline (pipeline/novelty)

Assume that a set of generated structures has already been assessed by the standard pipeline above. Evaluate the novelty of this set by running

python pipeline/novelty/evaluate.py --rootdir [ROOTDIR] --dataset [DATASET] --datadir [DATADIR] --num_cpus [NUM_CPUS]

where DATASET is the name of the reference dataset and DATADIR is the directory containing the reference dataset (with each reference structure stored in PDB format). Results are stored by updating info.csv in the root directory to include

| Column | Description |
| --- | --- |
| max_[DATASET]_name | Name of structure in the dataset that is most similar to the generated structure |
| max_[DATASET]_tm | TM score between the generated structure and the most similar structure in the dataset |
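The two novelty columns can be read back with the standard csv module. The sketch below is illustrative: read_novelty and fraction_below are hypothetical helpers, the dataset name passed in must match whatever was given as DATASET, and the TM score cutoff is a value you choose yourself rather than one defined by the pipeline.

```python
import csv


def read_novelty(info_csv, dataset):
    """Return (closest reference name, TM score) for each row of info.csv."""
    name_col = f"max_{dataset}_name"
    tm_col = f"max_{dataset}_tm"
    with open(info_csv, newline="") as handle:
        return [(row[name_col], float(row[tm_col])) for row in csv.DictReader(handle)]


def fraction_below(hits, tm_cutoff):
    """Fraction of structures whose closest reference has a TM score below tm_cutoff."""
    if not hits:
        return 0.0
    return sum(tm < tm_cutoff for _, tm in hits) / len(hits)
```

For example, read_novelty("info.csv", "pdb") would read the columns max_pdb_name and max_pdb_tm, assuming the novelty pipeline was run with a dataset named pdb.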

Examples

In the examples directory, we provide three examples (together with their corresponding outputs) to demonstrate the inputs and outputs of our evaluation pipeline. Examples include:

Profiling

Unconditional generation

Assume that the standard (designability) and diversity pipelines have been run. To show the evaluation metrics on the set of generated structures, run

python scripts/analysis/profile_unconditional.py --rootdir [ROOTDIR]

This reports designability, diversity and F1 score on the set of generated structures. It also reports PDB novelty and/or AFDB novelty, provided that the corresponding novelty pipeline has been run. Details on these evaluation metrics can be found in the Genie 2 paper.
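As a rough illustration of how the combined score relates to the other two, the sketch below assumes that F1 is the harmonic mean of designability and diversity, both expressed as fractions between 0 and 1. That is the usual F1 construction and an assumption here; the Genie 2 paper and profile_unconditional.py remain the authoritative definitions.

```python
def f1_score(designability, diversity):
    """Harmonic mean of two fractions in [0, 1] (assumed definition of F1)."""
    if designability + diversity == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * designability * diversity / (designability + diversity)
```

Because the harmonic mean is dominated by the smaller value, a model cannot score well by being highly designable but not diverse, or the reverse.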

Motif scaffolding

Assume that the standard (designability) and diversity pipelines have been run. To show the evaluation metrics on the set of generated structures, run

python scripts/analysis/profile_scaffold.py --rootdir [ROOTDIR]

This reports the number of solved motif scaffolding problems and the total number of unique clusters, aggregated across all problems. Details on these evaluation metrics are found in the Genie 2 paper. Here, we assume that the root directory contains a set of subdirectories, where each subdirectory starts with a prefix of motif= and contains inputs and outputs for a motif scaffolding problem (check out examples/scaffold_single and examples/scaffold_multi for detailed examples).
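The directory convention above can be checked with a few lines of Python. This is a minimal sketch rather than code from the repository: find_problem_dirs is a hypothetical helper that lists the subdirectories of the root directory whose names start with motif=.

```python
from pathlib import Path


def find_problem_dirs(rootdir, prefix="motif="):
    """Return the sorted subdirectories of rootdir whose names start with prefix."""
    return sorted(
        p for p in Path(rootdir).iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    )
```

Files and directories without the prefix are ignored, so the root directory can also hold outputs such as summary files without affecting the result.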