
Protein Workshop


[Figure: Overview of the Protein Workshop]


This repository provides the code for the protein structure representation learning benchmark detailed in the paper Evaluating Representation Learning on the Protein Structure Universe (ICLR 2024).

In the benchmark, we implement numerous featurisation schemes, datasets for self-supervised pre-training and downstream evaluation, pre-training tasks, and auxiliary tasks.

The benchmark can be used as a working template for a protein representation learning research project, as a library of drop-in components for your own projects, or as a CLI tool for quickly running protein representation learning evaluation and pre-training configurations.

Processed datasets and pre-trained weights are made available. Downloading datasets ahead of time is not required: on first run, all datasets will be downloaded and processed from their respective sources.

Configuration files to run the experiments described in the manuscript are provided in the proteinworkshop/config/sweeps/ directory.


Installation

Below, we outline how to set up a virtual environment for proteinworkshop. Note that these installation instructions currently target Linux-like systems with NVIDIA CUDA support; Windows and macOS are not officially supported.

From PyPI

proteinworkshop is available for install from PyPI. This enables training specific configurations via the CLI, or using individual components from the benchmark (such as datasets, featurisers, or transforms) as drop-ins in other projects. Make sure to install PyTorch (specifically version 2.1.2 or newer) using its official pip installation instructions, with CUDA support as desired.

# install `proteinworkshop` from PyPI
pip install proteinworkshop

# install PyTorch Geometric using the (now-installed) CLI
workshop install pyg

# set a custom data directory for file downloads; otherwise, all data will be downloaded to `site-packages`
export DATA_PATH="where/you/want/data/" # e.g., `export DATA_PATH="proteinworkshop/data"`

However, for full exploration we recommend cloning the repository and building from source.

Building from source

With a local virtual environment activated (e.g., one created with conda create -n proteinworkshop python=3.10):

  1. Clone and install the project

    git clone https://github.com/a-r-j/ProteinWorkshop
    cd ProteinWorkshop
    pip install -e .
    
  2. Install PyTorch (specifically version 2.1.2 or newer) using its official pip installation instructions, with CUDA support as desired

    # e.g., to install PyTorch with CUDA 11.8 support on Linux:
    pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118
    
  3. Then use the newly-installed proteinworkshop CLI to install PyTorch Geometric

    workshop install pyg
    
  4. Configure paths in .env (optional; values set here override the default paths). See .env.example for an example, and the minimal sketch after this list.

  5. Download PDB data:

    python proteinworkshop/scripts/download_pdb_mmtf.py
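
For step 4, the only environment variable shown on this page is DATA_PATH (see the PyPI instructions above), so a minimal .env might look like the following sketch; consult .env.example for the full set of supported variables:

# .env — minimal sketch; see .env.example for all supported variables
DATA_PATH="where/you/want/data/"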
    

Tutorials

We provide a five-part series of Jupyter notebook tutorials with examples of how to use and extend proteinworkshop, as outlined below.

  1. Training a new model
  2. Customizing an existing dataset
  3. Adding a new dataset
  4. Adding a new model
  5. Adding a new task

Quickstart

Downloading datasets

Datasets can either be built from the source structures or downloaded from Zenodo. A dataset will be built from source the first time it is used in a run (or by calling the setup() method of the corresponding datamodule; see the Python sketch after the commands below). We provide a CLI tool for downloading datasets:

workshop download <DATASET_NAME>
workshop download pdb
workshop download cath
workshop download afdb_rep_v4
# etc..
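
Equivalently, a dataset can be built from source in Python by calling its datamodule's setup() method (a minimal sketch, reusing the CATHDataModule arguments shown in the functional-usage section below):

from proteinworkshop.datasets.cath import CATHDataModule

# instantiate the datamodule, then download and process the data
datamodule = CATHDataModule(path="data/cath/", pdb_dir="data/pdb/", format="mmtf", batch_size=32)
datamodule.download()
datamodule.setup()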

If you wish to build datasets from source, we recommend downloading the entire PDB first (in MMTF format, c. 24 Gb) to reuse shared PDB data as much as possible:

workshop download pdb
# or
python proteinworkshop/scripts/download_pdb_mmtf.py

Training a model

Launching an experiment minimally requires specification of a dataset, structural encoder, and task (devices can be specified with trainer=cpu/gpu):

workshop train dataset=cath encoder=egnn task=inverse_folding trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding trainer=cpu # or trainer=gpu

This command uses the default configurations in configs/train.yaml, which can be overridden by equivalently named options. For instance, you can use a different input featurisation with the features option, or set the display name of your experiment on wandb with the name option:

workshop train dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu # or trainer=gpu

Finetuning a model

Finetuning a model additionally requires specification of a checkpoint.

workshop finetune dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/finetune.py dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu # or trainer=gpu

Running a sweep/experiment

We can make use of the Hydra wandb sweeper plugin to configure experiments as sweeps, allowing searches over hyperparameters, architectures, pre-training/auxiliary tasks, and datasets.

See proteinworkshop/config/sweeps/ for examples.

  1. Create the sweep with Weights and Biases

wandb sweep proteinworkshop/config/sweeps/my_new_sweep_config.yaml

  2. Launch job workers

With wandb:

wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 8

Or an example SLURM submission script:

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --array=0-32

source ~/.bashrc
source $(conda info --base)/envs/proteinworkshop/bin/activate

wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 1

Reproduce the sweeps performed in the manuscript:

# reproduce the baseline tasks sweep (i.e., those performed without pre-training each model)
wandb sweep proteinworkshop/config/sweeps/baseline_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2awtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2bwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2cwtt7oy --count 8

# reproduce the model pre-training sweep
wandb sweep proteinworkshop/config/sweeps/pre_train.yaml
wandb agent mywandbgroup/proteinworkshop/2dwtt7oy --count 8

# reproduce the pre-trained tasks sweep (i.e., those performed after pre-training each model)
wandb sweep proteinworkshop/config/sweeps/pt_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2ewtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2fwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2gwtt7oy --count 8

Embedding a dataset

We provide a utility in proteinworkshop/embed.py for embedding a dataset using a pre-trained model. To run it:

python proteinworkshop/embed.py ckpt_path=PATH/TO/CHECKPOINT collection_name=COLLECTION_NAME

See the embed section of proteinworkshop/config/embed.yaml for additional parameters.

Visualising pre-trained model embeddings for a given dataset

We provide a utility in proteinworkshop/visualise.py for visualising the UMAP embeddings of a pre-trained model for a given dataset. To run it:

python proteinworkshop/visualise.py ckpt_path=PATH/TO/CHECKPOINT plot_filepath=VISUALISATION/FILEPATH.png

See the visualise section of proteinworkshop/config/visualise.yaml for additional parameters.

Performing attribution of a pre-trained model

We provide a utility in proteinworkshop/explain.py for performing attribution of a pre-trained model using integrated gradients.

This will write PDB files for all structures in the dataset of a supervised task, with residue-level attributions stored in the b_factor column. To visualise the attributions, we recommend using the Protein Viewer VSCode extension and changing the 3D representation to colour by Uncertainty/Disorder.

To run the attribution:

python proteinworkshop/explain.py ckpt_path=PATH/TO/CHECKPOINT output_dir=ATTRIBUTION/DIRECTORY

See the explain section of proteinworkshop/config/explain.yaml for additional parameters.
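
If you prefer to inspect attributions programmatically rather than in a viewer, the b_factor column of the written PDB files can be read with biopandas (a sketch; biopandas is a Graphein dependency, and the filename here is hypothetical):

from biopandas.pdb import PandasPdb

# load one attributed structure and extract the per-atom attribution scores
ppdb = PandasPdb().read_pdb("ATTRIBUTION/DIRECTORY/example.pdb")
attributions = ppdb.df["ATOM"]["b_factor"]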

Verifying a config

python proteinworkshop/validate_config.py dataset=cath features=full_atom task=inverse_folding

Using proteinworkshop modules functionally

One may use the modules (e.g., datasets, models, featurisers, and utilities) of proteinworkshop functionally by importing them directly. When the package is installed from PyPI, this makes building on top of proteinworkshop's assets straightforward and convenient.

For example, to use any datamodule available in proteinworkshop:

from proteinworkshop.datasets.cath import CATHDataModule

datamodule = CATHDataModule(path="data/cath/", pdb_dir="data/pdb/", format="mmtf", batch_size=32)
datamodule.download()

train_dl = datamodule.train_dataloader()

To use any model or featuriser available in proteinworkshop:

from proteinworkshop.models.graph_encoders.dimenetpp import DimeNetPPModel
from proteinworkshop.features.factory import ProteinFeaturiser
from proteinworkshop.datasets.utils import create_example_batch

model = DimeNetPPModel(hidden_channels=64, num_layers=3)
ca_featuriser = ProteinFeaturiser(
    representation="CA",
    scalar_node_features=["amino_acid_one_hot"],
    vector_node_features=[],
    edge_types=["knn_16"],
    scalar_edge_features=["edge_distance"],
    vector_edge_features=[],
)

example_batch = create_example_batch()
batch = ca_featuriser(example_batch)

model_outputs = model(batch)  # forward pass on the featurised batch

Read the docs for a full list of modules available in proteinworkshop.

Models

Invariant Graph Encoders

| Name | Source | Protein Specific |
| --- | --- | --- |
| GearNet | Zhang et al. | ✓ |
| DimeNet++ | Gasteiger et al. | ✗ |
| SchNet | Schütt et al. | ✗ |
| CDConv | Fan et al. | ✓ |

Equivariant Graph Encoders

(Vector-type)

| Name | Source | Protein Specific |
| --- | --- | --- |
| GCPNet | Morehead et al. | ✓ |
| GVP-GNN | Jing et al. | ✓ |
| EGNN | Satorras et al. | ✗ |

(Tensor-type)

| Name | Source | Protein Specific |
| --- | --- | --- |
| Tensor Field Network | Corso et al. | ✗ |
| Multi-ACE | Batatia et al. | ✗ |

Sequence-based Encoders

| Name | Source | Protein Specific |
| --- | --- | --- |
| ESM2 | Lin et al. | ✓ |

Datasets

To download a (processed) dataset from Zenodo, you can run

workshop download <DATASET_NAME>

where <DATASET_NAME> is given in the first column of the tables below.

Otherwise, simply starting a training run will download and process the data from source.

Structure-based Pre-training Corpora

Pre-training corpora (with the exception of pdb, cath, and astral) are provided in FoldComp database format. This format is highly compressed, resulting in very small disk space requirements despite the large number of structures. pdb is provided as a collection of MMTF files, which are significantly smaller than conventional .pdb or .cif files.
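
For reference, a FoldComp database can be read directly with the foldcomp Python bindings (a minimal sketch, assuming the afdb_rep_v4 database files are already present in the working directory; proteinworkshop's datamodules normally handle this for you):

import foldcomp

# iterate over (name, PDB-format string) pairs in a FoldComp database
with foldcomp.open("afdb_rep_v4") as db:
    for name, pdb_string in db:
        print(name)
        break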

| Name | Description | Source | Size | Disk Size | License |
| --- | --- | --- | --- | --- | --- |
| astral | SCOPe domain structures | SCOPe/ASTRAL | | 1 - 2.2 Gb | Publicly available |
| afdb_rep_v4 | Representative structures identified from the AlphaFold database by FoldSeek structural clustering | Barrio-Hernandez et al. | 2.27M chains | 9.6 Gb | GPL-3.0 |
| afdb_rep_dark_v4 | Dark proteome structures identified by structural clustering of the AlphaFold database | Barrio-Hernandez et al. | ~800k | 2.2 Gb | GPL-3.0 |
| afdb_swissprot_v4 | AlphaFold2 predictions for SwissProt/UniProtKB | Kim et al. | 542k chains | 2.9 Gb | GPL-3.0 |
| afdb_uniprot_v4 | AlphaFold2 predictions for UniProt | Kim et al. | 214M chains | 1 Tb | GPL-3.0 / CC-BY 4.0 |
| cath | CATH 4.2 40% split by CATH topologies | Ingraham et al. | ~18k chains | 4.3 Gb | CC-BY 4.0 |
| esmatlas | ESMAtlas predictions (full) | Kim et al. | | 1 Tb | GPL-3.0 / CC-BY 4.0 |
| esmatlas_v2023_02 | ESMAtlas predictions (v2023_02 release) | Kim et al. | | 137 Gb | GPL-3.0 / CC-BY 4.0 |
| highquality_clust30 | ESMAtlas high-quality predictions | Kim et al. | 37M chains | 114 Gb | GPL-3.0 / CC-BY 4.0 |
| igfold_paired_oas | IGFold predictions for Paired OAS | Ruffolo et al. | 104,994 paired Ab chains | | CC-BY 4.0 |
| igfold_jaffe | IGFold predictions for Jaffe2022 data | Ruffolo et al. | 1,340,180 paired Ab chains | | CC-BY 4.0 |
| pdb | Experimental structures deposited in the RCSB Protein Data Bank | wwPDB consortium | ~800k chains | 23 Gb | CC0 1.0 |
<details>
<summary>Additionally, we provide several species-specific compilations (mostly reference species)</summary>

| Name | Description | Source |
| --- | --- | --- |
| a_thaliana | Arabidopsis thaliana (thale cress) proteome | AlphaFold2 |
| c_albicans | Candida albicans (a fungus) proteome | AlphaFold2 |
| c_elegans | Caenorhabditis elegans (roundworm) proteome | AlphaFold2 |
| d_discoideum | Dictyostelium discoideum (slime mold) proteome | AlphaFold2 |
| d_melanogaster | Drosophila melanogaster (fruit fly) proteome | AlphaFold2 |
| d_rerio | Danio rerio (zebrafish) proteome | AlphaFold2 |
| e_coli | Escherichia coli (a bacterium) proteome | AlphaFold2 |
| g_max | Glycine max (soy bean) proteome | AlphaFold2 |
| h_sapiens | Homo sapiens (human) proteome | AlphaFold2 |
| m_jannaschii | Methanocaldococcus jannaschii (an archaeon) proteome | AlphaFold2 |
| m_musculus | Mus musculus (mouse) proteome | AlphaFold2 |
| o_sativa | Oryza sativa (rice) proteome | AlphaFold2 |
| r_norvegicus | Rattus norvegicus (brown rat) proteome | AlphaFold2 |
| s_cerevisiae | Saccharomyces cerevisiae (brewer's yeast) proteome | AlphaFold2 |
| s_pombe | Schizosaccharomyces pombe (a fungus) proteome | AlphaFold2 |
| z_mays | Zea mays (corn) proteome | AlphaFold2 |

</details>

Supervised Datasets

| Name | Description | Source | License |
| --- | --- | --- | --- |
| antibody_developability | Antibody developability prediction | Chen et al. | CC-BY 3.0 |
| atom3d_msp | Mutation stability prediction | Townshend et al. | MIT |
| atom3d_ppi | Protein-protein interaction prediction | Townshend et al. | MIT |
| atom3d_psr | Protein structure ranking | Townshend et al. | MIT |
| atom3d_res | Residue identity prediction | Townshend et al. | MIT |
| ccpdb_ligands | Ligand binding residue prediction | Agrawal et al. | Publicly Available |
| ccpdb_metal | Metal ion binding residue prediction | Agrawal et al. | Publicly Available |
| ccpdb_nucleic | Nucleic acid binding residue prediction | Agrawal et al. | Publicly Available |
| ccpdb_nucleotides | Nucleotide binding residue prediction | Agrawal et al. | Publicly Available |
| deep_sea_proteins | Gene Ontology prediction (Biological Process) | Sieg et al. | Public domain |
| go-bp | Gene Ontology prediction (Biological Process) | Gligorijevic et al. | CC-BY 4.0 |
| go-cc | Gene Ontology prediction (Cellular Component) | Gligorijevic et al. | CC-BY 4.0 |
| go-mf | Gene Ontology prediction (Molecular Function) | Gligorijevic et al. | CC-BY 4.0 |
| ec_reaction | Enzyme Commission (EC) number prediction | Hermosilla et al. | MIT |
| fold_fold | Fold prediction, split at the fold level | Hou et al. | CC-BY 4.0 |
| fold_family | Fold prediction, split at the family level | Hou et al. | CC-BY 4.0 |
| fold_superfamily | Fold prediction, split at the superfamily level | Hou et al. | CC-BY 4.0 |
| masif_site | Protein-protein interaction site prediction | Gainza et al. | Apache 2.0 |
| metal_3d | Zinc binding site prediction | Duerr et al. | MIT |
| ptm | Post-translational modification site prediction | Yan et al. | CC-BY 4.0 |

Tasks

Self-Supervised Tasks

| Name | Description | Source |
| --- | --- | --- |
| inverse_folding | Predict amino acid sequence given structure | |
| residue_prediction | Masked residue type prediction | |
| distance_prediction | Masked edge distance prediction | Zhang et al. |
| angle_prediction | Masked triplet angle prediction | Zhang et al. |
| dihedral_angle_prediction | Masked quadruplet dihedral prediction | Zhang et al. |
| multiview_contrast | Contrastive learning with multiple crops and InfoNCE loss | Zhang et al. |
| structural_denoising | Denoising of atomic coordinates with SE(3) decoders | |

Generic Supervised Tasks

Generic supervised tasks can be applied broadly across datasets. The labels are directly extracted from the PDB structures.

These are likely to be most frequently used with the pdb dataset class, which wraps the PDB dataset curator from Graphein.

| Name | Description | Requires |
| --- | --- | --- |
| binding_site_prediction | Predict ligand binding residues | HETATM ligands (for training) |
| ppi_site_prediction | Predict protein binding residues | graph_y attribute in data objects specifying the desired chain to select interactions for (for training) |

Featurisation Schemes

Part of the goal of the proteinworkshop benchmark is to investigate how increasing the granularity of structural detail affects performance. To achieve this, we provide several featurisation schemes for protein structures.

Invariant Node Features

N.B. All angular features are provided in [sin, cos]-transformed form, e.g. $\textrm{dihedrals} = [\sin(\phi), \cos(\phi), \sin(\psi), \cos(\psi), \sin(\omega), \cos(\omega)]$, hence their dimensionality will be double the number of angles.
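
Concretely, the transform amounts to the following (a minimal PyTorch sketch; the angles here are random stand-ins):

import torch

# hypothetical (phi, psi, omega) backbone dihedrals for 10 residues, in radians
angles = torch.randn(10, 3)
# the [sin, cos] transform doubles the feature dimension: (10, 3) -> (10, 6)
dihedral_features = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)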

| Name | Description | Dimensionality |
| --- | --- | --- |
| residue_type | One-hot encoding of amino acid type | 21 |
| positional_encoding | Transformer-like positional encoding of sequence position | 16 |
| alpha | Virtual torsion angle defined by four $C_\alpha$ atoms of residues $I_{-1}, I, I_{+1}, I_{+2}$ | 2 |
| kappa | Virtual bond angle (bend angle) defined by the three $C_\alpha$ atoms of residues $I_{-2}, I, I_{+2}$ | 2 |
| dihedrals | Backbone dihedral angles $(\phi, \psi, \omega)$ | 6 |
| sidechain_torsions | Sidechain torsion angles $(\chi_{1-4})$ | 8 |

Equivariant Node Features

| Name | Description | Dimensionality |
| --- | --- | --- |
| orientation | Forward and backward node orientation vectors (unit-normalized) | 2 |

Edge Construction

We predominantly support two types of edges: $k$-NN and $\epsilon$ edges.

Edge types can be specified as follows:

python proteinworkshop/train.py ... features.edge_types=[knn_16, knn_32, eps_16]

where the suffix after knn or eps specifies $k$ (the number of neighbours) or $\epsilon$ (the distance threshold in Ångströms).
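
The same edge specification can also be passed programmatically (a sketch reusing the ProteinFeaturiser arguments shown earlier; only edge_types differs):

from proteinworkshop.features.factory import ProteinFeaturiser

featuriser = ProteinFeaturiser(
    representation="CA",
    scalar_node_features=["amino_acid_one_hot"],
    vector_node_features=[],
    edge_types=["knn_16", "knn_32", "eps_16"],  # 16-NN, 32-NN, and 16 Å radius edges
    scalar_edge_features=["edge_distance"],
    vector_edge_features=[],
)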

Invariant Edge Features

| Name | Description | Dimensionality |
| --- | --- | --- |
| edge_distance | Euclidean distance between source and target nodes | 1 |
| node_features | Concatenated scalar node features of the source and target nodes | Number of scalar node features $\times 2$ |
| edge_type | Type annotation for each edge | 1 |
| sequence_distance | Sequence-based distance between source and target nodes | 1 |
| pos_emb | Structured Transformer-inspired positional embedding of $i - j$ for source node $i$ and target node $j$ | 16 |
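
For intuition, pos_emb follows the sinusoidal scheme of the Structured Transformer (an illustrative sketch only; the exact implementation lives in proteinworkshop's feature code):

import torch

def pos_emb(offset: torch.Tensor, num_embeddings: int = 16) -> torch.Tensor:
    # geometric progression of frequencies, as in Transformer positional encodings
    freqs = torch.exp(
        torch.arange(0, num_embeddings, 2, dtype=torch.float32)
        * -(torch.log(torch.tensor(10000.0)) / num_embeddings)
    )
    angles = offset.unsqueeze(-1).float() * freqs  # (num_edges, num_embeddings / 2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# e.g., embed the sequence offsets i - j of five edges
embedding = pos_emb(torch.tensor([-2, -1, 0, 1, 2]))  # shape: (5, 16)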

Equivariant Edge Features

| Name | Description | Dimensionality |
| --- | --- | --- |
| edge_vectors | Edge directional vectors (unit-normalized) | 1 |

For Developers

Dependency Management

We use poetry to manage the project's underlying dependencies and to push updates to the project's PyPI package. To make changes to the project's dependencies, follow the instructions below to (1) install poetry on your local machine, (2) customize the dependencies, and (3) (de)activate the project's virtual environment using poetry:

  1. Install poetry for platform-agnostic dependency management using its installation instructions

    After installing poetry, to avoid potential keyring errors, disable its keyring usage by adding PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring to your shell's startup configuration and restarting your shell environment (e.g., echo 'export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring' >> ~/.bashrc && source ~/.bashrc for a Bash shell environment and likewise for other shell environments).

  2. Install, add, or upgrade project dependencies

      poetry install  # install the latest project dependencies
      # or
      poetry add XYZ  # add dependency `XYZ` to the project
      # or
      poetry show  # list all dependencies currently installed
      # or
      poetry lock  # refresh the lock file after changing dependencies
    
  3. Activate the newly-created virtual environment following poetry's usage documentation

      # activate the environment on a `posix`-like (e.g., macOS or Linux) system
      source $(poetry env info --path)/bin/activate
    
      # activate the environment on a `Windows`-like system
      & ((poetry env info --path) + "\Scripts\activate.ps1")
    
      # if desired, deactivate the environment
      deactivate
    

Code Formatting

To match the code style used in the proteinworkshop repository, please format your commits with the following commands before opening a pull request:

# assuming you are located in the `ProteinWorkshop` top-level directory
isort .
autoflake -r --in-place --remove-unused-variables --remove-all-unused-imports --ignore-init-module-imports .
black --config=pyproject.toml .

Documentation

To build a local version of the project's Sphinx documentation web pages:

# assuming you are located in the `ProteinWorkshop` top-level directory
pip install -r docs/.docs.requirements # one-time only
rm -rf docs/build/ && sphinx-build docs/source/ docs/build/ # NOTE: errors can safely be ignored

Citing ProteinWorkshop

Please consider citing proteinworkshop if it proves useful in your work.

@inproceedings{jamasb2024evaluating,
  title={Evaluating Representation Learning on the Protein Structure Universe},
  author={Arian R. Jamasb and Alex Morehead and Chaitanya K. Joshi and Zuobai Zhang and Kieran Didi and Simon V. Mathis and Charles Harris and Jian Tang and Jianlin Cheng and Pietro Lio and Tom L. Blundell},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}