SaProt: Protein Language Modeling with Structure-aware Vocabulary (AA+3Di)

<a href="https://www.biorxiv.org/content/10.1101/2023.10.01.560349v3"><img src="https://img.shields.io/badge/Paper-bioRxiv-green" style="max-width: 100%;"></a> <a href="https://huggingface.co/westlake-repl/SaProt_650M_AF2"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-red?label=Model" style="max-width: 100%;"></a> <a href="https://portal.valencelabs.com/blogs/post/saprot-protein-language-modeling-with-structure-aware-vocabulary-uyLPrUZqyDF60Yr" alt="blog"><img src="https://img.shields.io/badge/Blog-Portal-violet" /></a> <a href="https://zhuanlan.zhihu.com/p/664754366" alt="zhihu"><img src="https://img.shields.io/badge/Zhihu-知乎-blue" /></a>

This repository is the official implementation of SaProt: Protein Language Modeling with Structure-aware Vocabulary.

If you have any questions about the paper or the code, feel free to raise an issue! SaProt should outperform ESM-2 on most tasks under fair evaluation settings.

The laboratory is hiring research assistants, interns, doctoral students, and postdoctoral researchers. Please contact the corresponding author for details.




Overview

We propose a structure-aware vocabulary for protein language modeling. The vocabulary is constructed by encoding protein structures into discrete 3Di tokens using Foldseek. We combine the residue tokens and the structure tokens to form a structure-aware sequence. Through large-scale pre-training, our model, SaProt, learns the relationship between structure and sequence. For more details, please refer to our paper: https://www.biorxiv.org/content/10.1101/2023.10.01.560349v2.
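To illustrate the vocabulary, here is a minimal sketch of how a residue sequence and its Foldseek 3Di sequence interleave into structure-aware tokens (the helper below is hypothetical, for illustration only — it is not part of this repository):

```python
def to_sa_seq(aa_seq: str, di_seq: str) -> str:
    """Interleave amino-acid tokens (uppercase) with Foldseek 3Di tokens
    (lowercase) to form a structure-aware sequence.

    Each position contributes one combined token, e.g. 'M' + 'd' -> 'Md',
    so a protein of length L yields L structure-aware tokens.
    """
    assert len(aa_seq) == len(di_seq), "both tracks must align per residue"
    return "".join(a + d.lower() for a, d in zip(aa_seq, di_seq))


print(to_sa_seq("MEV", "DVP"))  # -> "MdEvVp"
```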

Environment installation

Create a virtual environment

conda create -n SaProt python=3.10
conda activate SaProt

Install packages

bash environment.sh  

Prepare the SaProt model

We provide two ways to use SaProt: through the Hugging Face `transformers` interface, or through the same interface as the esm GitHub repository. Users can choose either one.

Model checkpoints

| Name | Size | Dataset |
|------|------|---------|
| SaProt_35M_AF2 | 35M parameters | 40M AF2 structures |
| SaProt_650M_PDB | 650M parameters | 40M AF2 structures (phase 1) + 60K PDB structures (phase 2) |
| SaProt_650M_AF2 | 650M parameters | 40M AF2 structures |

New experimental results

Some experimental results are listed below. For more details, please refer to our paper.

35M Model

| Model | ClinVar (AUC) | ProteinGym (Spearman's ρ) | Thermostability (Spearman's ρ) | HumanPPI (Acc%) | Metal Ion Binding (Acc%) | EC (Fmax) | GO-MF (Fmax) | GO-BP (Fmax) | GO-CC (Fmax) | DeepLoc-Subcellular (Acc%) | DeepLoc-Binary (Acc%) |
|-------|------|------|------|------|------|------|------|------|------|------|------|
| ESM-2 (35M) | 0.722 | 0.339 | 0.669 | 80.79 | 73.08 | 0.825 | 0.616 | 0.416 | 0.404 | 76.58 | 91.60 |
| SaProt-Seq (35M) | 0.738 | 0.337 | 0.672 | 80.56 | 73.23 | 0.821 | 0.608 | 0.413 | 0.403 | 76.67 | 91.16 |
| SaProt (35M) | 0.794 | 0.392 | 0.692 | 81.11 | 74.29 | 0.847 | 0.642 | 0.431 | 0.418 | 78.09 | 91.97 |

650M Model

| Model | ClinVar (AUC) | ProteinGym (Spearman's ρ) | Thermostability (Spearman's ρ) | HumanPPI (Acc%) | Metal Ion Binding (Acc%) | EC (Fmax) | GO-MF (Fmax) | GO-BP (Fmax) | GO-CC (Fmax) | DeepLoc-Subcellular (Acc%) | DeepLoc-Binary (Acc%) |
|-------|------|------|------|------|------|------|------|------|------|------|------|
| ESM-2 (650M) | 0.862 | 0.475 | 0.680 | 76.67 | 71.56 | 0.868 | 0.670 | 0.473 | 0.470 | 82.09 | 91.96 |
| SaProt (650M) | 0.909 | 0.478 | 0.724 | 86.41 | 75.75 | 0.882 | 0.682 | 0.486 | 0.479 | 85.57 | 93.55 |

AlphaFold2 vs. ESMFold

We compare SaProt using structures predicted by AlphaFold2 versus ESMFold, as shown below:

| Model | ClinVar (AUC) | ProteinGym (Spearman's ρ) | Thermostability (Spearman's ρ) | HumanPPI (Acc%) | Metal Ion Binding (Acc%) | EC (Fmax) | GO-MF (Fmax) | GO-BP (Fmax) | GO-CC (Fmax) | DeepLoc-Subcellular (Acc%) | DeepLoc-Binary (Acc%) |
|-------|------|------|------|------|------|------|------|------|------|------|------|
| SaProt (ESMFold) | 0.896 | 0.455 | 0.717 | 85.78 | 74.10 | 0.871 | 0.678 | 0.480 | 0.474 | 82.82 | 93.19 |
| SaProt (AF2) | 0.909 | 0.478 | 0.724 | 86.41 | 75.75 | 0.882 | 0.682 | 0.486 | 0.479 | 85.57 | 93.55 |

ProteinGym benchmark

SaProt achieved first place on the ProteinGym benchmark! The checkpoint was trained in Sep. 2023.

![ProteinGym benchmark](figures/proteingym_benchmark.jpg)

![ProteinGym official leaderboard](figures/proteingymofficial.png)

Load SaProt

Hugging Face model

The following code shows how to load the model using the Hugging Face interface. Note that masking low-pLDDT regions of AF2 structures is beneficial; see below.

from transformers import EsmTokenizer, EsmForMaskedLM

model_path = "/your/path/to/SaProt_650M_AF2" # Note this is the directory path of SaProt, not the ".pt" file
tokenizer = EsmTokenizer.from_pretrained(model_path)
model = EsmForMaskedLM.from_pretrained(model_path)

#################### Example ####################
device = "cuda"
model.to(device)

seq = "M#EvVpQpL#VyQdYaKv"  # "#" marks low-confidence regions (pLDDT < 70)
tokens = tokenizer.tokenize(seq)
print(tokens)

inputs = tokenizer(seq, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

outputs = model(**inputs)
print(outputs.logits.shape)

"""
['M#', 'Ev', 'Vp', 'Qp', 'L#', 'Vy', 'Qd', 'Ya', 'Kv']
torch.Size([1, 11, 446])
"""

Load SaProt using esm repository

Users can also load SaProt via the esm implementation. The checkpoint is stored in the same Hugging Face folder, named SaProt_650M_AF2.pt. We provide a function to load the model.

from utils.esm_loader import load_esm_saprot

model_path = "/your/path/to/SaProt_650M_AF2.pt"
model, alphabet = load_esm_saprot(model_path)

Convert protein structure into structure-aware sequence

We provide a function to convert a protein structure into a structure-aware sequence. The function calls the foldseek binary to encode the structure. You can download the binary file from here and place it in the bin folder. The following code shows how to use it.

from utils.foldseek_util import get_struc_seq
pdb_path = "example/8ac8.cif"

# Extract the "A" chain from the pdb file and encode it into a struc_seq.
# pLDDT is used to mask low-confidence regions if "plddt_mask" is True. Set it to
# True when using AF2 structures for best performance.
parsed_seqs = get_struc_seq("bin/foldseek", pdb_path, ["A"], plddt_mask=False)["A"]
seq, foldseek_seq, combined_seq = parsed_seqs

print(f"seq: {seq}")
print(f"foldseek_seq: {foldseek_seq}")
print(f"combined_seq: {combined_seq}")
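For AF2 structures, low-confidence residues (pLDDT < 70) are replaced with `#` on the amino-acid track. A minimal sketch of that masking step (a hypothetical helper for illustration; the repo applies the equivalent logic inside `get_struc_seq` when `plddt_mask=True`):

```python
def mask_low_plddt(aa_seq: str, plddts: list[float], threshold: float = 70.0) -> str:
    """Replace residues whose per-residue pLDDT is below the threshold with '#'.

    aa_seq and plddts must be aligned: one confidence value per residue.
    """
    assert len(aa_seq) == len(plddts)
    return "".join("#" if p < threshold else a for a, p in zip(aa_seq, plddts))


print(mask_low_plddt("MEVQ", [90.0, 65.2, 88.1, 40.0]))  # -> "M#V#"
```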

Predict mutational effect

We provide a function to predict the mutational effect of a protein sequence. The example below shows how to predict the mutational effect at a specific position. If using an AF2 structure, we strongly recommend adding the pLDDT mask (see below).

from model.saprot.saprot_foldseek_mutation_model import SaprotFoldseekMutationModel


config = {
    "foldseek_path": None,
    "config_path": "/your/path/to/SaProt_650M_AF2", # Note this is the directory path of SaProt, not the ".pt" file
    "load_pretrained": True,
}
model = SaprotFoldseekMutationModel(**config)
tokenizer = model.tokenizer

device = "cuda"
model.eval()
model.to(device)

seq = "M#EvVpQpL#VyQdYaKv"  # "#" marks low-confidence regions (pLDDT < 70)

# Predict the effect of mutating the 3rd amino acid to A
mut_info = "V3A"
mut_value = model.predict_mut(seq, mut_info)
print(mut_value)

# Predict mutational effect of combinatorial mutations, e.g. mutating the 3rd amino acid to A and the 4th amino acid to M
mut_info = "V3A:Q4M"
mut_value = model.predict_mut(seq, mut_info)
print(mut_value)

# Predict all effects of mutations at 3rd position
mut_pos = 3
mut_dict = model.predict_pos_mut(seq, mut_pos)
print(mut_dict)

# Predict probabilities of all amino acids at 3rd position
mut_pos = 3
mut_dict = model.predict_pos_prob(seq, mut_pos)
print(mut_dict)
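The `mut_info` string follows the usual `<wild-type><position><mutant>` convention, with `:` joining combinatorial mutations. A small parser sketch, written here only to illustrate the format (it is hypothetical and not part of the repo's API):

```python
import re


def parse_mut_info(mut_info: str) -> list[tuple[str, int, str]]:
    """Parse 'V3A' or 'V3A:Q4M' into (wild-type, 1-based position, mutant) tuples."""
    muts = []
    for single in mut_info.split(":"):
        m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", single)
        if m is None:
            raise ValueError(f"Malformed mutation: {single!r}")
        muts.append((m.group(1), int(m.group(2)), m.group(3)))
    return muts


print(parse_mut_info("V3A:Q4M"))  # -> [('V', 3, 'A'), ('Q', 4, 'M')]
```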

Get protein embeddings

If you want to generate protein embeddings, you can refer to the following code. The embeddings are the average of the last layer's hidden states.

from model.saprot.base import SaprotBaseModel
from transformers import EsmTokenizer


config = {
    "task": "base",
    "config_path": "/your/path/to/SaProt_650M_AF2", # Note this is the directory path of SaProt, not the ".pt" file
    "load_pretrained": True,
}

model = SaprotBaseModel(**config)
tokenizer = EsmTokenizer.from_pretrained(config["config_path"])

device = "cuda"
model.to(device)

seq = "M#EvVpQpL#VyQdYaKv"  # "#" marks low-confidence regions (pLDDT < 70)
tokens = tokenizer.tokenize(seq)
print(tokens)

inputs = tokenizer(seq, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

embeddings = model.get_hidden_states(inputs, reduction="mean")
print(embeddings[0].shape)
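"Average of the hidden states" here means mean-pooling the per-token vectors into one fixed-size embedding, which is what `reduction="mean"` requests. A minimal pure-Python sketch of that reduction (hypothetical, independent of the repo code):

```python
def mean_pool(hidden_states: list[list[float]]) -> list[float]:
    """Average a sequence of per-token hidden vectors (n_tokens x dim)
    into a single dim-sized embedding."""
    n_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in hidden_states) / n_tokens for d in range(dim)]


# Three tokens with hidden dimension 2
print(mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # -> [3.0, 4.0]
```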

Perform protein inverse folding

We provide a function to perform protein inverse folding. Please see the example below.

from model.saprot.saprot_if_model import SaProtIFModel

# Load model
config = {
    # Please download the weights from https://huggingface.co/westlake-repl/SaProt_650M_AF2_inverse_folding
    "config_path": "/your/path/to/SaProt_650M_AF2_inverse_folding",
    "load_pretrained": True,
}

device = "cuda"
model = SaProtIFModel(**config)
model = model.to(device)

aa_seq = "##########" # All masked amino acids will be predicted. You could also partially mask the amino acids.
struc_seq = "dddddddddd"

# Predict amino acids given the structure sequence
pred_aa_seq = model.predict(aa_seq, struc_seq)
print(pred_aa_seq)
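With partial masking, known residues stay fixed and the model only fills in the `#` positions. A small sketch of constructing such a partially masked input (a hypothetical helper for illustration):

```python
def partial_mask(aa_seq: str, keep_positions: set[int]) -> str:
    """Mask every residue with '#' except those at the given 0-based positions."""
    return "".join(a if i in keep_positions else "#" for i, a in enumerate(aa_seq))


# Keep the 1st and 5th residues, mask the rest
print(partial_mask("MEVQPLVYQD", {0, 4}))  # -> "M###P#####"
```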

Prepare dataset

Pre-training dataset

We provide the dataset for pre-training SaProt. The dataset can be downloaded from here.

Downstream tasks

We provide datasets that are used in the paper. Datasets can be downloaded from here.

Once downloaded, the datasets need to be decompressed and placed in the LMDB folder for supervised fine-tuning.

Fine-tune SaProt

We provide a script to fine-tune SaProt on the datasets. The following code shows how to fine-tune SaProt on specific downstream tasks. Before running it, please make sure the datasets are placed in the LMDB folder and the Hugging Face version of the SaProt 650M model is placed in the weights/PLMs folder. Note that the default training settings differ from those in the paper because users face different hardware constraints. We recommend adjusting the yaml file to fit your setup (e.g. batch_size, devices, and accumulate_grad_batches).

# Fine-tune SaProt on the Thermostability task
python scripts/training.py -c config/Thermostability/saprot.yaml

# Fine-tune ESM-2 on the Thermostability task
python scripts/training.py -c config/Thermostability/esm2.yaml

Record the training process (optional)

If you want to record the training process with wandb, modify the config file: set Trainer.logger = True and paste your wandb API key into the config key setting.os_environ.WANDB_API_KEY.

Evaluate zero-shot performance

We provide a script to evaluate the zero-shot performance of models (foldseek binary file is required to be placed in the bin folder):

# Evaluate the zero-shot performance of SaProt on the ProteinGym benchmark
python scripts/mutation_zeroshot.py -c config/ProteinGym/saprot.yaml

# Evaluate the zero-shot performance of ESM-2 on the ProteinGym benchmark
python scripts/mutation_zeroshot.py -c config/ProteinGym/esm2.yaml

The results will be saved in the output/ProteinGym folder.

For ClinVar benchmark, you can use the following script to calculate the AUC metric:

# Evaluate the zero-shot performance of SaProt on the ClinVar benchmark
python scripts/mutation_zeroshot.py -c config/ClinVar/saprot.yaml
python scripts/compute_clinvar_auc.py -c config/ClinVar/saprot.yaml

Citation

If you find this repository useful, please cite our paper:

@article{su2023saprot,
  title={SaProt: Protein Language Modeling with Structure-aware Vocabulary},
  author={Su, Jin and Han, Chenchen and Zhou, Yuyang and Shan, Junjie and Zhou, Xibin and Yuan, Fajie},
  journal={bioRxiv},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}