# ByProt

<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a> <a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white"></a> <a href="https://hydra.cc/"><img alt="Config: Hydra" src="https://img.shields.io/badge/Config-Hydra-89b8cd"></a> <a href="https://github.com/ashleve/lightning-hydra-template"><img alt="Template" src="https://img.shields.io/badge/-Lightning--Hydra--Template-017F2F?style=flat&logo=github&labelColor=gray"></a>


ByProt is a versatile toolkit designed for generative learning in protein research. It currently focuses primarily on structure-based sequence design (a.k.a. fixed-backbone design, `fixedbb`), offering the following key features:

- **LM-Design** (*Structure-informed Language Models Are Protein Designers*, ICML 2023; see the citation below)

We are continuously expanding ByProt's capabilities to encompass a broader range of tasks and features. Stay tuned for updates as we strive to provide an even more comprehensive toolkit for protein research.

## Installation

```bash
# clone project
git clone --recursive https://url/to/this/repo/ByProt.git
cd ByProt

# create conda virtual environment
env_name=ByProt

conda create -n ${env_name} python=3.7 pip
conda activate ${env_name}

# automatically install everything else
bash install.sh
```
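
Once the script finishes, a quick sanity check helps confirm the environment is usable. This is a minimal sketch that assumes `install.sh` installs `byprot` as an importable package:

```bash
# check that byprot imports and that PyTorch can see a GPU
python -c "import byprot, torch; print(torch.cuda.is_available())"
```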

## Structure-based protein sequence design (inverse folding)

### Pretrained model weights (Zenodo)

| Model | Training data | Checkpoint |
| --- | --- | --- |
| `protein_mpnn_cmlm` | CATH 4.2 | link |
| `lm_design_esm1b_650m` | CATH 4.2 | link |
| `lm_design_esm2_650m` | CATH 4.2 | link |
| `lm_design_esm2_650m` | multichain | link |

### Data

**Download the preprocessed CATH datasets**

```bash
bash scripts/download_cath.sh
```

Then check `configs/datamodule/cath_4.*.yaml` and set `data_dir` to the path of the downloaded CATH data.
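
Since ByProt is configured with Hydra, you can alternatively override the path from the command line when launching a run later on. This is a sketch that assumes the yaml field is exposed as `datamodule.data_dir`:

```bash
# override the dataset location at launch time instead of editing the yaml
# (datamodule.data_dir is assumed to mirror the data_dir field in cath_4.2.yaml)
python ./train.py experiment=fixedbb/protein_mpnn_cmlm \
    datamodule=cath_4.2 datamodule.data_dir=/path/to/cath_4.2
```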

**Download PDB complex data (multichain)**

This dataset contains curated protein (multichain) complexes from the Protein Data Bank (PDB). It was compiled for *Robust deep learning-based protein sequence design using ProteinMPNN*; see the ProteinMPNN GitHub page for more details.

```bash
bash scripts/download_multichain.sh
```

Then check `configs/datamodule/multichain.yaml` and set `data_dir` to the path of the downloaded multichain data.

Everything is now in place, so we can start training a model.


### Training

In the following sections, we use the CATH 4.2 dataset as a running example. You can likewise build your models on the multichain dataset to accommodate protein complexes.

#### Example 1: Non-autoregressive (NAR) ProteinMPNN baseline

Train NAR ProteinMPNN with conditional masked language modeling (CMLM):

```bash
export CUDA_VISIBLE_DEVICES=0
# or use multi-gpu training when you want:
# export CUDA_VISIBLE_DEVICES=0,1

exp=fixedbb/protein_mpnn_cmlm
dataset=cath_4.2
name=fixedbb/${dataset}/protein_mpnn_cmlm

python ./train.py \
    experiment=${exp} datamodule=${dataset} name=${name} \
    logger=tensorboard trainer=ddp_fp16
```

Some flags for training:

| Argument | Usage |
| --- | --- |
| `experiment` | experiment config; see the `ByProt/configs/experiment/` folder |
| `datamodule` | dataset config; see the `ByProt/configs/datamodule/` folder |
| `name` | experiment name, which determines the directory your experiment is saved to, e.g., `/root/research/projects/ByProt/run/logs/${name}` |
| `logger` | which ML experiment logger to use, e.g., `tensorboard` |
| `train.force_restart` | set to `true` to force retraining of the experiment under `${name}`; otherwise training resumes from the last checkpoint |
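
For instance, combining the flags above, a two-GPU run that restarts a previous experiment from scratch looks like this:

```bash
# force a fresh run of an existing experiment on two GPUs
export CUDA_VISIBLE_DEVICES=0,1
python ./train.py \
    experiment=fixedbb/protein_mpnn_cmlm datamodule=cath_4.2 \
    name=fixedbb/cath_4.2/protein_mpnn_cmlm \
    logger=tensorboard trainer=ddp_fp16 \
    train.force_restart=true
```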

#### Example 2: LM-Design

Train <span style="font-variant:small-caps;">LM-Design</span> on top of ESM-1b 650M.

Training takes approximately 6 hours on a single A100 GPU.

```bash
exp=fixedbb/lm_design_esm1b_650m
dataset=cath_4.2
name=fixedbb/${dataset}/lm_design_esm1b_650m

python ./train.py \
    experiment=${exp} datamodule=${dataset} name=${name} \
    logger=tensorboard trainer=ddp_fp16
```

To build <span style="font-variant:small-caps;">LM-Design</span> upon the ESM-2 series, use `exp=fixedbb/lm_design_esm2*`; please check `ByProt/configs/experiment/fixedbb` for the available configs, e.g., the sketch below.
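
For example, to train the 650M ESM-2 variant (the config name below is inferred from the checkpoint table above):

```bash
# same recipe as Example 2, swapping in the ESM-2 650M experiment config
exp=fixedbb/lm_design_esm2_650m
dataset=cath_4.2
name=fixedbb/${dataset}/lm_design_esm2_650m

python ./train.py \
    experiment=${exp} datamodule=${dataset} name=${name} \
    logger=tensorboard trainer=ddp_fp16
```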

### Evaluation/inference on valid/test datasets

```bash
dataset=cath_4.2
# name=fixedbb/${dataset}/protein_mpnn_cmlm
name=fixedbb/${dataset}/lm_design_esm1b_650m
exp_path=/root/research/projects/ByProt/run/logs/${name}

python ./test.py \
    experiment_path=${exp_path} \
    data_split=test ckpt_path=best.ckpt mode=predict \
    task.generator.max_iter=5
```

Some flags for generation:

| Argument | Usage |
| --- | --- |
| `experiment_path` | folder that saves the experiment (`.hydra`, checkpoints, tensorboard logs, etc.) |
| `data_split` | `valid` or `test` dataset |
| `mode` | `predict` for generating sequences and calculating amino acid sequence recovery; `test` for evaluating NLL and PPL |
| `task.generator` | arguments for the sequence generator/sampler |
| `- max_iter=<int>` | maximum number of decoding iterations (default: 5 for LM-Design, 1 for ProteinMPNN-CMLM) |
| `- strategy=[denoise, mask_predict]` | decoding strategy (default: `denoise` for LM-Design, `mask_predict` for ProteinMPNN-CMLM) |
| `- temperature=<float>` | sampling temperature; set to 0 for deterministic sampling (default: 0) |
| `- eval_sc=<bool>` | additionally evaluate the scTM score using ESMFold (default: `false`) |
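
Putting several of these together, a run that samples with a non-zero temperature and additionally computes scTM could look like the sketch below; the dotted `task.generator.*` paths are assumed to mirror the `task.generator.max_iter` flag used above:

```bash
# sample with temperature and evaluate designability via scTM
python ./test.py \
    experiment_path=${exp_path} \
    data_split=test ckpt_path=best.ckpt mode=predict \
    task.generator.max_iter=5 \
    task.generator.strategy=denoise \
    task.generator.temperature=0.1 \
    task.generator.eval_sc=true
```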

### Designing sequences from a pdb file using a trained model in a notebook

#### Example 1: ProteinMPNN-CMLM

```python
from byprot.utils.config import compose_config as Cfg
from byprot.tasks.fixedbb.designer import Designer

# 1. instantiate designer
exp_path = "/root/research/projects/ByProt/run/logs/fixedbb/cath_4.2/protein_mpnn_cmlm"
cfg = Cfg(
    cuda=True,
    generator=Cfg(
        max_iter=1,
        strategy='mask_predict',
        temperature=0,
        eval_sc=False,
    )
)
designer = Designer(experiment_path=exp_path, cfg=cfg)

# 2. load structure from pdb file
pdb_path = "/root/research/projects/ByProt/data/3uat_variants/3uat_GK.pdb"
designer.set_structure(pdb_path)

# 3. generate sequence from the given structure
designer.generate()

# 4. calculate evaluation metrics
designer.calculate_metrics()
## prediction: SSYNPPILLLGPFAEELEEELVEENPERAGRPVPFTTEPPSPDETEGETYLYISSLEEAEELIESNRFLEAGEENNELVGISLEAIRSVARAGKLAILDTGGEAVEKLEEANIEPIVIFLVPKSVEDVRRVFPDLTEEEAEELTSEDEELLEEFKELLDAVVSGSTLEEVLEEIREVIEEASS
## recovery: 0.37158469945355194
```

#### Example 2: <span style="font-variant:small-caps;">LM-Design</span>

```python
from byprot.utils.config import compose_config as Cfg
from byprot.tasks.fixedbb.designer import Designer

# 1. instantiate designer
exp_path = "/root/research/projects/ByProt/run/logs/fixedbb/cath_4.2/lm_design_esm2_650m"
cfg = Cfg(
    cuda=True,
    generator=Cfg(
        max_iter=5,
        strategy='denoise',
        temperature=0,
        eval_sc=False,
    )
)
designer = Designer(experiment_path=exp_path, cfg=cfg)

# 2. load structure from pdb file
pdb_path = "/root/research/projects/ByProt/data/3uat_variants/3uat_GK.pdb"
designer.set_structure(pdb_path)

# 3. generate sequence from the given structure
designer.generate()
# you can override generator arguments by passing generator_args, e.g.,
designer.generate(
    generator_args={
        'max_iter': 5,
        'temperature': 0.1,
    }
)

# 4. calculate evaluation metrics
designer.calculate_metrics()
## prediction: LNYTRPVIILGPFKDRMNDDLLSEMPDKFGSCVPHTTRPKREYEIDGRDYHFVSSREEMEKDIQNHEFIEAGEYNDNLYGTSIESVREVAMEGKHCILDVSGNAIQRLIKADLYPIAIFIRPRSVENVREMNKRLTEEQAKEIFERAQELEEEFMKYFTAIVEGDTFEEIYNQVKSIIEEESG
## recovery: 0.7595628415300546
```

#### Example 3: Inpainting

For some use cases, you may want to inpaint only certain segments of interest while the rest of the protein stays the same (e.g., designing antibody CDRs). Here is a simple example using the `inpaint` interface:

```python
pdb_path = "/root/research/projects/ByProt/data/pdb_samples/5izu_proc.pdb"
designer.set_structure(pdb_path)

# segments to redesign, given as paired start/end residue indices:
# here residues 1-10 and 50-100
start_ids = [1, 50]
end_ids = [10, 100]

# sample 10 inpainted designs for the specified segments
for i in range(10):
    out, ori_seg, designed_seg = designer.inpaint(
        start_ids=start_ids, end_ids=end_ids,
        generator_args={'temperature': 1.0}
    )
    print(designed_seg)
print('Original Segments:')
print(ori_seg)
```

The output looks like:

```
loading backbone structure from /root/research/projects/ByProt/data/pdb_samples/5izu_proc.pdb.
[['MVKSLFRHRT'], ['DEPIEEFTPTPAFPALQRLSSVDVEGVAWRAGLRTGDFLLEVNGVNVVKVG']]
[['MTKALFRHQT'], ['ETPIEEFTPTPAFPALQHLSSVDVEGAAYRAGLRTGDFLIEVNGVNVVKVG']]
[['STESLFRHAT'], ['ETPIEEFTPTPAFPALQHLSSVDVEGVAWRAGLRTGDFLIEVNGINVVKVG']]
[['ATARMFRHLT'], ['ETPIEEFTPTPAFPALQYLSSVDVEGVAWRAGLKTGDFLIEVNGVNVVKVG']]
[['ARKAKFRRYT'], ['ETPIEEFTPTPAFPALQVLSSVDVEGVAWRAGMRTGDFLLEVNGVNVVKVG']]
[['ADARLFREYT'], ['ETPIEEFTPTPAFPALQHLSAVDVEGVAWRAGLLTGDFLIEVNGVNVVKVG']]
[['ALRALFKHST'], ['DTPIEEFTPTPAFPALQYMSSVEVEGVAWRAGLRTGDFLIEVNGVNVVKVG']]
[['MLKMLFRHYT'], ['ETPIEEFTPTPAFPALQYLSSVDIDGMAWRAGLRTGDFLIEVNGDNVVKVG']]
[['ADKALFRHHT'], ['STPIEEFTPTPAFPALQYLESVDVDGVAYRAGLCTGDFLIEVNGVNVVKVG']]
[['AAAAAFRHST'], ['KTPIEEFTPTPAFPALQYLSRVEVDGMAWRAGLRTGDFLLEVNGVNVVRVG']]
Original Segments:
[['RTKRLFRHYT'], ['ETPIEEFTPTPAFPALQYLESVDVEGVAWRAGLRTGDFLIEVNGVNVVKVG']]
```

## Acknowledgements

ByProt draws inspiration from, and leverages or modifies implementations of, several open-source projects, including ProteinMPNN, ESM, and lightning-hydra-template. We express our sincere appreciation to the authors of these repositories for their invaluable contributions to the development of ByProt.

## Citation

```bibtex
@inproceedings{zheng2023lm_design,
    title={Structure-informed Language Models Are Protein Designers},
    author={Zheng, Zaixiang and Deng, Yifan and Xue, Dongyu and Zhou, Yi and YE, Fei and Gu, Quanquan},
    booktitle={International Conference on Machine Learning},
    year={2023}
}
```