ProteinDT: A Text-guided Protein Design Framework

Authors: Shengchao Liu, Yanjing Li, Zhuoxinran Li, Anthony Gitter, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Arvind Ramanathan, Chaowei Xiao<sup>*</sup>, Jian Tang<sup>*</sup>, Hongyu Guo<sup>*</sup>, Anima Anandkumar<sup>*</sup>

<sup>*</sup> jointly supervised

[Project Page] [ArXiv] [Datasets on HuggingFace] [Checkpoints on HuggingFace]

<p align="center"> <img src="figures/pipeline.png" /> </p> <p align="left"> <img src="figures/final.gif" width="100%" /> </p>

1 Environment

conda create -n ProteinDT python=3.7
conda activate ProteinDT

conda install -y numpy networkx scikit-learn

pip install torch==1.10.*

pip install transformers
pip install lxml

# for TAPE
pip install lmdb
pip install seqeval

# for baseline ChatGPT
pip install openai

# for baseline Galactica
pip install accelerate

# for visualization
pip install matplotlib

# for binding editing
pip install h5py
pip install torch_geometric==2.0 torch_scatter torch_sparse torch_cluster
pip install biopython

# for ESM folding
pip install "fair-esm[esmfold]"
pip install dm-tree omegaconf ml-collections einops
pip install "fair-esm[esmfold]==2.0.0" --no-deps # Override deepspeed==0.5
pip install 'dllogger @ git+https://github.com/NVIDIA/dllogger.git'
pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307'

conda install -c conda-forge -yq mdtraj

# for ProteinDT
pip install .
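
As a quick sanity check, you can verify that the key imports work. This is a minimal sketch; the importable package name ProteinDT is assumed to be what pip install . provides (check setup.py if the import fails).

# verify that the main dependencies and the ProteinDT package import cleanly
python -c "import torch; print('torch', torch.__version__)"
python -c "import transformers; print('transformers', transformers.__version__)"
python -c "import torch_geometric, esm"  # PyG and fair-esm
python -c "import ProteinDT"             # package name assumed from pip install .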

2 Pretraining Datasets (SwissProtCLAP) Preparation

Please check the folder preprocess/SwissProtCLAP for instructions on constructing SwissProtCLAP from UniProt.

We also provide a copy of SwissProtCLAP at this HuggingFace link, or you can download it with the following script:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="chao1224/ProteinDT", repo_type="dataset", cache_dir='./')

Then move the data under the ./data folder. The data structure is:

./data/
└── SwissProtCLAP
    ├── protein_sequence.txt
    └── text_sequence.txt
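
If you used the snapshot_download snippet above, you can copy the two files into place and run a quick consistency check. This is a sketch: the downloaded snapshot path below is a placeholder (it depends on your HuggingFace cache layout), and the line-count comparison assumes the two files are line-aligned protein/text pairs.

mkdir -p ./data/SwissProtCLAP
# placeholder path: locate the SwissProtCLAP folder inside the download cache first
cp /path/to/downloaded/SwissProtCLAP/protein_sequence.txt ./data/SwissProtCLAP/
cp /path/to/downloaded/SwissProtCLAP/text_sequence.txt ./data/SwissProtCLAP/
# if the two files are line-aligned protein/text pairs (assumption), the line counts should match
wc -l ./data/SwissProtCLAP/protein_sequence.txt ./data/SwissProtCLAP/text_sequence.txt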

3 Pretraining

Go to the folder examples and run the pretraining in 5 steps. The logic of these 5 steps is summarized below:

<p align="center"> <img src="figures/pretraining_roadmap.png" width="75%" /> </p>

The pretrained checkpoints can be found at this HuggingFace link. Before getting started, first define the output home folder, e.g., export OUTPUT_DIR=../output/ProteinDT/hyper_01.
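
The same setup as a short snippet (hyper_01 is just an example run name; creating the folder up front with mkdir is an optional convenience):

cd examples
export OUTPUT_DIR=../output/ProteinDT/hyper_01  # example run name
mkdir -p "$OUTPUT_DIR"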

4 Downstream Tasks

We include three types of downstream tasks, introduced below. You can find the scripts for the first two downstream tasks under the folder scripts.

4.1 Text-to-Protein Generation

First let's go to the folder examples/downstream_Text2Protein.

Then we sample text sequences for text-to-protein generation:

python step_01_text_retrieval.py

We also provide the sampled text data in step_01_text_retrieval.txt. You can replace it with the text sequences you want to use.
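
For example, to use your own prompts you can overwrite this file. This is a sketch: it assumes one free-text protein description per line (please check the format of the file produced by step_01_text_retrieval.py), and the two descriptions below are purely hypothetical.

# hypothetical prompts; format assumption: one text description per line
cat > step_01_text_retrieval.txt << 'EOF'
The protein is a membrane transporter and remains stable at high temperatures.
The protein binds ATP and catalyzes the phosphorylation of serine residues.
EOF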

Now we can do the text-to-protein generation, e.g., using T5 as the decoder:

export OUTPUT_DIR=../../output/ProteinDT/hyper_01

python step_02_inference_ProteinDT.py \
--decoder_distribution=T5Decoder --score_network_type=T5Base \
--num_workers=0 --hidden_dim=16 --batch_size=8 \
--pretrained_folder="$OUTPUT_DIR" \
--step_04_folder="$OUTPUT_DIR"/step_04_T5 \
--num_repeat=16 --use_facilitator --AR_generation_mode=01 \
--output_text_file_path="$OUTPUT_DIR"/step_04_T5/downstream_Text2Protein/step_02_inference.txt
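
Once generation finishes, the generated sequences can be inspected directly from the output file specified above:

head "$OUTPUT_DIR"/step_04_T5/downstream_Text2Protein/step_02_inference.txt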

4.2 Zero-shot Text-guided Protein Editing

First let's go to the folder examples/downstream_Editing.

The dataset preparation can be found at examples/downstream_Editing/README.md. You can also find the prepared data at this HuggingFace link. We include three types of editing tasks: stability, structure, and peptide binding. In terms of methods, we have two types: latent optimization and latent interpolation. The demo scripts are explained below.

4.2.1 Latent Optimization

4.2.2 Latent Interpolation

Note that for latent interpolation, we have three models: an auto-regressive model (T5) and two denoising diffusion models (RNN- and BERT-based). We provide demo scripts using T5.

4.3 Protein Property Prediction

First, please download the TAPE data following the instructions here. We also provide it at this HuggingFace link.

Under examples, the script is downstream_TAPE.py. We follow exactly the same hyper-parameters as OntoProtein.

python downstream_TAPE.py \
--task_name=ss3 \
--seed=3 \
--learning_rate=3e-5 \
--num_train_epochs=5 \
--per_device_train_batch_size=2 \
--gradient_accumulation_steps=8 \
--warmup_ratio=0.08 \
--pretrained_model=ProteinDT \
--pretrained_folder="$OUTPUT_DIR" \
--output_dir="$OUTPUT_DIR"/downstream_TAPE
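
To cover the other TAPE tasks, the same command can be wrapped in a loop. This is a sketch: the task names besides ss3 are assumed to follow the usual TAPE naming (please verify against downstream_TAPE.py), and the hyper-parameters shown are the ss3 values, which should be replaced per task with the corresponding OntoProtein settings.

# task names other than ss3 are assumptions; hyper-parameters below are the ss3 values
for TASK in ss3 ss8 contact remote_homology fluorescence stability; do
    python downstream_TAPE.py \
        --task_name="$TASK" \
        --seed=3 \
        --learning_rate=3e-5 \
        --num_train_epochs=5 \
        --per_device_train_batch_size=2 \
        --gradient_accumulation_steps=8 \
        --warmup_ratio=0.08 \
        --pretrained_model=ProteinDT \
        --pretrained_folder="$OUTPUT_DIR" \
        --output_dir="$OUTPUT_DIR"/downstream_TAPE/"$TASK"
done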

Cite Us

Feel free to cite this work if you find it useful!

@article{liu2023text,
    title={A Text-guided Protein Design Framework},
    author={Shengchao Liu and Yanjing Li and Zhuoxinran Li and Anthony Gitter and Yutao Zhu and Jiarui Lu and Zhao Xu and Weili Nie and Arvind Ramanathan and Chaowei Xiao and Jian Tang and Hongyu Guo and Anima Anandkumar},
    journal={arXiv preprint arXiv:2302.04611},
    year={2023}
}