<div align="center">

Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling

<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a> <a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white"></a> <a href="https://hydra.cc/"><img alt="Config: Hydra" src="https://img.shields.io/badge/Config-Hydra-89b8cd"></a> <a href="https://github.com/ashleve/lightning-hydra-template"><img alt="Template" src="https://img.shields.io/badge/-Lightning--Hydra--Template-017F2F?style=flat&logo=github&labelColor=gray"></a>

</div>

📜 Description

Str2Str is a score-based framework (which means it can accommodate any diffusion/flow-matching architecture) for zero-shot protein conformation sampling. The idea behind it is straightforward: given an input protein structure, which can be obtained from your favorite folding module such as AlphaFold2 or ESMFold, Str2Str perturbs the structure's geometry in the corresponding conformation space and then applies a sequence of denoising steps as annealing.

<p align="center"> <img src="assets/inference.png" width="450"/> </p>

This simple yet effective sampling pipeline draws abundant inspiration from traditional methods for sampling conformations; unlike those methods, it is guided by a neural score network trained on the PDB database.

<p align="center"> <img src="assets/illus.png" width="450"/> </p>
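In pseudo-Python, the perturb-then-denoise loop can be sketched as follows. Everything here is illustrative: `toy_score` stands in for the trained score network, and the noise schedule and step sizes are invented for the sketch rather than taken from the paper.

```python
import numpy as np

def toy_score(x, x0, sigma):
    # Hypothetical stand-in for the learned score network: for a Gaussian
    # centered at the input structure x0, the score is -(x - x0) / sigma^2.
    return -(x - x0) / sigma**2

def perturb_then_denoise(x0, n_steps=25, sigma_max=1.0, seed=0):
    """Toy Str2Str-style round trip on an (N, 3) coordinate array:
    diffuse the structure forward, then anneal back with Langevin steps."""
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(sigma_max / n_steps, sigma_max, n_steps)
    # Forward perturbation: inject noise up to the largest noise level.
    x = x0 + sigmas[-1] * rng.standard_normal(x0.shape)
    # Reverse annealing: score-guided Langevin dynamics over decreasing noise.
    for sigma in sigmas[::-1]:
        step = 0.1 * sigma**2
        x = (x + step * toy_score(x, x0, sigma)
             + np.sqrt(2 * step) * rng.standard_normal(x.shape))
    return x

sample = perturb_then_denoise(np.zeros((8, 3)))
print(sample.shape)  # (8, 3)
```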

🚀 Installation

```bash
git clone https://github.com/lujiarui/Str2Str.git
cd Str2Str

# Create conda environment.
conda env create -f environment.yml
conda activate myenv

# Support import as a package.
pip install -e .
```

🧪 Download and process PDB data

```bash
# Edit this accordingly.
DOWNLOAD_DIR=/path/to/download/directory
PROCESSED_DIR=/path/to/processing/directory

# Download.
bash scripts/pdb/download_pdb_mmcif.sh $DOWNLOAD_DIR

# Preprocess.
python scripts/pdb/preprocess.py --mmcif_dir $DOWNLOAD_DIR/pdb_mmcif/mmcif_files --output_dir $PROCESSED_DIR
```

The processed directory will look like:

```
/path/to/processing/directory
├── 00
│   └── 200l.pkl
├── 01
│   ├── 101m.pkl
│   └── 201l.pkl
├── 02
│   ├── 102l.pkl
│   └── 102m.pkl
├── 03
│   ├── 103l.pkl
│   └── 103m.pkl
└── <a metadata file>.csv
```
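Each per-target `.pkl` can be inspected with Python's `pickle` module. The record layout below is an assumption for illustration (keys like `name`, `seq`, and an AlphaFold-style atom37 `atom_positions` array); check `preprocess.py` for the actual schema.

```python
import os
import pickle
import tempfile

import numpy as np

# Illustrative only: the keys and shapes here are assumptions, not the
# guaranteed output of preprocess.py.
record = {
    "name": "101m",
    "seq": "MVLSEGEWQLVLHVWAKVE",
    "atom_positions": np.zeros((19, 37, 3)),  # (residues, atom37, xyz), assumed
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "101m.pkl")
    with open(path, "wb") as f:
        pickle.dump(record, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded["name"], loaded["atom_positions"].shape)  # 101m (19, 37, 3)
```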

(Optional) To tailor the training set, a subset of proteins can be passed to preprocess.py, such as the culled list from PISCES. To enable this, simply download the target list and add the flag --pisces:

```bash
PISCES_FILE=/path/to/pisces/file

python scripts/pdb/preprocess.py --mmcif_dir $DOWNLOAD_DIR/pdb_mmcif/mmcif_files --output_dir $PROCESSED_DIR --pisces $PISCES_FILE
```
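A PISCES culled list is a whitespace-separated table whose first column combines the 4-character PDB ID with a chain identifier. A minimal reader might look like this (the header and rows below are a hypothetical excerpt, and the actual preprocessing script may parse the file differently):

```python
from io import StringIO

# Hypothetical excerpt of a PISCES culled list.
sample = """\
PDBchain length Exptl. resolution R-factor FreeRvalue
101MA 154 XRAY 2.07 0.16 0.21
12ASA 330 XRAY 2.2 0.16 0.23
"""

def read_pisces_ids(handle):
    """Collect lowercase 4-character PDB IDs from a PISCES-style list."""
    ids = set()
    for i, line in enumerate(handle):
        if i == 0 or not line.strip():
            continue  # skip the header row and blank lines
        token = line.split()[0]
        ids.add(token[:4].lower())  # drop the trailing chain letter
    return ids

print(sorted(read_pisces_ids(StringIO(sample))))  # ['101m', '12as']
```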

🔨 How to Run

First things first: the paths to the train/test data should be configured correctly. Simply rename .env.example -> .env and specify the top three environment variables.

Training

```bash
# Training w/ default configs.
python src/train.py

# w/ single GPU.
python src/train.py trainer=gpu

# w/ DDP on 2 GPUs.
python src/train.py trainer=ddp trainer.devices=2 data.batch_size=4

# w/ the chosen experiment configuration.
python src/train.py experiment=example trainer.devices=2 data.batch_size=4

# Limit training epochs (100) or time (12 hrs).
python src/train.py +trainer.max_epochs=100
python src/train.py +trainer.max_time="00:12:00:00"
```

Inference

To perform inference, a trained model checkpoint is required and can be specified using ckpt_path=/path/to/checkpoint. A pretrained PyTorch checkpoint can be accessed from Google Drive. Download it and place it under data/ckpt.

```bash
# Inference (sampling & evaluation on metrics).
python src/eval.py task_name=inference

# Inference (if you want to load a checkpoint from elsewhere).
python src/eval.py task_name=inference ckpt_path=/path/to/some/checkpoint

# Sampling only (skip evaluation).
python src/eval.py task_name=inference target_dir=null

# Evaluation only (skip sampling).
python src/eval.py task_name=eval_only pred_dir=/path/to/some/pdb/directory
```

To run evaluation, you may need to specify the REFERENCE_DATA variable in the .env file. This path (pointing to a directory) should contain the reference trajectories/ensembles, one .pdb file per target, with the same directory structure as TEST_DATA. In the paper, the long MD trajectories from the referenced study are used as the reference.
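As a quick sanity check on a generated ensemble, a generic diversity score such as mean pairwise RMSD can be computed independently of the evaluation pipeline. This is only a sketch (no rotational alignment is performed, and it is not claimed to be one of the paper's reported metrics):

```python
import numpy as np

def rmsd(a, b):
    # Plain coordinate RMSD after centering; a full evaluation would
    # typically apply Kabsch superposition before measuring.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return float(np.sqrt(((a - b) ** 2).sum(axis=-1).mean()))

def pairwise_diversity(ensemble):
    """Mean pairwise RMSD across an (M, N, 3) stack of conformations."""
    m = len(ensemble)
    vals = [rmsd(ensemble[i], ensemble[j])
            for i in range(m) for j in range(i + 1, m)]
    return sum(vals) / len(vals)

rng = np.random.default_rng(0)
ens = rng.standard_normal((4, 10, 3))  # 4 toy "conformations" of 10 atoms
print(round(pairwise_diversity(ens), 3))
```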

๐Ÿช Tools

We also provide several useful scripts based on existing tools, which can be helpful when working with Str2Str.

To run MD simulation using OpenMM:

```bash
python scripts/simulate.py --pdb_dir /path/to/some/pdb/directory --output_dir /path/to/output/directory
```

To fold (sequence -> structure) using ESMFold:

```bash
python scripts/fold.py -i /path/to/some/fasta/or/pdb/directory -o /path/to/output/directory
```

To pack (backbone atoms -> full heavy atoms) using FASPR:

```bash
python scripts/pack.py -i /path/to/some/pdb/directory -o /path/to/output/directory --n_cpu 4
```

📧 Contact

๐Ÿ“ Cite Str2Str

If you find this codebase inspiring in your research, please cite the following paper. 🥰

```bibtex
@inproceedings{lu2024strstr,
    title={Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling},
    author={Lu, Jiarui and Zhong, Bozitao and Zhang, Zuobai and Tang, Jian},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024}
}
```