MULAN: Multimodal Protein Language Model for Sequence and Structure Encoding
This is an official implementation of the paper MULAN: Multimodal Protein Language Model for Sequence and Structure Encoding.
Overview
MULAN is a MULtimodal encoder PLM for both sequence and ANgle-based structure processing. MULAN builds on a pre-trained base PLM and adds the Structure Adapter, a module we introduce to incorporate knowledge about the 3D protein structure. Our model uses the ESM-2 architecture and initializes the base PLM from ESM-2 or SaProt checkpoints; however, MULAN can be built on top of other PLMs.
MULAN offers a cheap improvement in the quality of protein representations over the PLMs used to initialize it, which we illustrate on various downstream tasks.
In addition, MULAN embeddings demonstrate structural awareness.
Installation
To install the repository, run the following commands (the `pip install` step is executed inside the cloned MULAN folder):
```bash
git clone https://github.com/DFrolova/MULAN.git
cd MULAN
pip install -e .
```
You should also download the foldseek binary file and put it into the `mulan/bin` folder, following the instructions provided in the SaProt repo.
Currently, they provide it here.
Do not forget to make the binary executable (`chmod +x bin/foldseek`).
If you do not need foldseek sequences (i.e., you use only MULAN based on ESM-2), you can pass `extract_foldseek_in_tokenizer=False` when initializing the `ProteinDataset`.
In that case, you do not need to download the foldseek binary file at all.
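If you do use foldseek, a quick sanity check (a minimal sketch; the path follows the instructions above) confirms that the binary is in place and executable:
```python
import os

# Hypothetical sanity check: verify the foldseek binary described above
# is present and executable before running the tokenizer.
foldseek_path = "mulan/bin/foldseek"  # location assumed from the instructions above
assert os.path.isfile(foldseek_path), f"foldseek binary not found at {foldseek_path}"
assert os.access(foldseek_path, os.X_OK), "run `chmod +x bin/foldseek` to make it executable"
print("foldseek binary is ready")
```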
Quick start
The code for MULAN inference and embedding extraction is presented in `example.ipynb`.
We also detail the required steps below.
- Collect protein structures.
Put the protein structures (either `.pdb` or `.cif` files) you need to encode into a folder; the full path to this folder is `protein_data_path`. If you need to extract a specific chain from an experimental structure, specify the chain name in the file name in the format `{protein_id}_{chain}.pdb`. Otherwise, the first chain is extracted (a filename check is sketched after the preprocessing step below).
For experimental structures, you first need to preprocess them with `pdbfixer` to restore missing atoms, remove water from the PDB file, etc.
Install `pdbfixer` in a separate environment used for structure processing:
```bash
conda create -n pdb_processing python=3.7
conda activate pdb_processing
conda install pdbfixer=1.8.1 -c conda-forge
conda install biopython=1.78
```
Then run the script, providing the correct paths for `initial_structure_path` and `preprocessed_structure_path`:
```bash
python scripts/preprocess_experimental_structures.py -i <initial_structure_path> -o <preprocessed_structure_path>
```
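As referenced above, a minimal sketch (the directory path is a placeholder) for checking which structure files follow the `{protein_id}_{chain}` naming convention:
```python
import re
from pathlib import Path

# Hypothetical helper: warn about structure files that lack the
# {protein_id}_{chain} suffix described above; for those, the first chain is used.
protein_data_path = Path("path/to/structures")  # placeholder path
pattern = re.compile(r"^.+_[A-Za-z0-9]+\.(pdb|cif)$")

for structure_file in sorted(protein_data_path.glob("*")):
    if structure_file.suffix not in {".pdb", ".cif"}:
        continue
    if not pattern.match(structure_file.name):
        print(f"{structure_file.name}: no chain suffix found, the first chain will be extracted")
```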
- Load the dataset and preprocess the data.
Also prepare the dataloader. By default, the dataset does not aggregate proteins into batches (1 protein = 1 batch).
However, you can pass `use_sorted_batching=True` to the dataset (and still pass `batch_size=1` to the dataloader!) to aggregate proteins of similar lengths into one batch (at most `batch_limit` tokens per batch) for faster inference. The code below assumes that each batch contains one protein.
```python
import numpy as np
import torch
from torch.utils.data import DataLoader

from mulan.dataset import ProteinDataset, data_collate_fn_dynamic

protein_data_path = <> # specify the path to the folder with pdb files you need to pass to the model
saved_dataset_path = <> # specify the path where to save the preprocessed dataset
use_foldseek_sequences = False # True if use SaProt initialization for MULAN. Else False
is_experimental_structure = False # flag for either AlphaFold structures or experimental ones

dataset = ProteinDataset(
    protein_data_path=protein_data_path,
    saved_dataset_path=saved_dataset_path,
    use_foldseek_sequences=use_foldseek_sequences,
    # batch_limit=5000, # default number of tokens per batch (you can increase it for a faster inference with `use_sorted_batching=True`)
    # use_sorted_batching=True,
    # extract_foldseek_in_tokenizer=False,
    is_experimental_structure=is_experimental_structure,
)
```
Note that initializing the dataset first preprocesses and tokenizes all provided protein structures.
The results are stored in the `saved_dataset_path` folder, so if you reuse this dataset later, the preprocessing step is skipped as long as the data remains in `saved_dataset_path`.
During preprocessing, foldseek sequences are also extracted, because SaProt-initialized MULAN uses foldseek.
If you use only ESM-2, you can pass `extract_foldseek_in_tokenizer=False` to the `ProteinDataset` class.
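For example, a minimal sketch of an ESM-2-only configuration that skips foldseek extraction (it reuses the variables defined above):
```python
# ESM-2-only setup: skip foldseek extraction, so the foldseek binary is not required.
dataset = ProteinDataset(
    protein_data_path=protein_data_path,
    saved_dataset_path=saved_dataset_path,
    use_foldseek_sequences=False,
    extract_foldseek_in_tokenizer=False,
    is_experimental_structure=is_experimental_structure,
)
```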
- Load the model. Load the pre-trained MULAN-small model from Hugging Face.
```python
from mulan.model import StructEsmForMaskedLM

model = StructEsmForMaskedLM.from_pretrained(
    'DFrolova/MULAN-small',
    device_map="auto",
)
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model.to(device)
```
- Prepare the dataloader
```python
from mulan.model_utils import auto_detect_base_tokenizer

esm_tokenizer = auto_detect_base_tokenizer(model.config, use_foldseek_sequences)

# Initialize dataloader
def data_collator(x):
    if use_foldseek_sequences:
        one_letter_aas = esm_tokenizer.all_tokens[5:]
    else:
        one_letter_aas = dataset.tokenizer.one_letter_aas

    return data_collate_fn_dynamic(x,
        esm_tokenizer=esm_tokenizer,
        nan_value=np.deg2rad(dataset.tokenizer.nan_fill_value),
        mask_inputs=False,
        all_amino_acids=one_letter_aas,
        use_foldseek_sequences=use_foldseek_sequences)

dataloader = DataLoader(dataset, batch_size=1, shuffle=False, collate_fn=data_collator)
```
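As an optional sanity check (a sketch; the keys match those used in the inference loop below), you can inspect the first collated batch:
```python
# Inspect one collated batch; keys follow the inference loop below.
batch = next(iter(dataloader))
print(batch['input_ids'].shape, batch['attention_mask'].shape, len(batch['struct_inputs']))
```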
- Run inference to extract residue-level embeddings from the last layer of MULAN. Each batch contains one protein sequence.
```python
with torch.no_grad():
    for i, (batch, batch_names) in enumerate(zip(dataloader, dataset.protein_names)):
        struct_inputs = [struct_input.to(device) for struct_input in batch['struct_inputs']]

        # extract embeddings for each batch (1 sequence per batch)
        embeddings = model(
            input_ids=batch['input_ids'].to(device),
            attention_mask=batch['attention_mask'].to(device),
            struct_inputs=struct_inputs,
            output_hidden_states=True
        )['hidden_states'][-1]
        embeddings = embeddings[0][1:-1] # residue-level embeddings for the case of 1 protein per batch

        # If you want to get protein-level embeddings, you should perform average pooling:
        protein_embedding = embeddings.mean(dim=0)
        print(batch_names, embeddings.shape, protein_embedding.shape)
```
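If you enable `use_sorted_batching=True`, a batch may contain several proteins of similar length, so the slicing above no longer applies directly. Below is a minimal sketch of per-protein pooling in that case; it assumes ESM-style BOS/EOS special tokens (consistent with the `[1:-1]` slicing above), and `pool_per_protein` is a hypothetical helper, not part of the MULAN API:
```python
import torch

# Hypothetical helper: protein-level mean pooling when a batch holds several
# proteins (use_sorted_batching=True). Assumes ESM-style BOS/EOS special tokens,
# consistent with the [1:-1] slicing used above for single-protein batches.
def pool_per_protein(hidden_states, attention_mask):
    protein_embeddings = []
    for emb, mask in zip(hidden_states, attention_mask):
        length = int(mask.sum())      # number of real tokens, incl. BOS/EOS
        residues = emb[1:length - 1]  # drop BOS/EOS and padding
        protein_embeddings.append(residues.mean(dim=0))
    return torch.stack(protein_embeddings)
```
The result has one row per protein in the batch; it can be stored together with the corresponding protein names, e.g. via `torch.save`.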
Pre-trained models
The pre-trained model weights for MULAN-small are released in DFrolova/MULAN-small. Medium-sized MULAN models will be added within several months.
Training dataset and downstream datasets
Training dataset
The raw training data (AF-17M) can be downloaded from the ProstT5 repository.
The angle information should be extracted from these structures using the script detailed below.
Also, all angles should be preprocessed using the `Tokenizer` class for faster use during training.
The small dataset (AF-0.5M) that was used for MULAN-small training is released in DFrolova/MULAN_datasets. You should manually download AF05_pretraining.zip.
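As an alternative to a manual download, here is a sketch using the `huggingface_hub` client (the repository and file name are those mentioned above; unpacking the archive is still up to you):
```python
from huggingface_hub import hf_hub_download

# Download the AF-0.5M pretraining archive from the dataset repository.
archive_path = hf_hub_download(
    repo_id="DFrolova/MULAN_datasets",
    filename="AF05_pretraining.zip",
    repo_type="dataset",
)
print(archive_path)  # unpack this archive before training
```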
Downstream datasets
All downstream datasets that are used in our paper are publicly available. The raw downstream dataset structures can be downloaded from the AlphaFold database by their corresponding UniProt IDs.
The preprocessed downstream datasets are available for download at DFrolova/MULAN_datasets.
You should manually download each downstream task archive.
Each dataset folder contains an `id2label.json` file with targets for each dataset split and a `dataset` folder with the preprocessed dataset, ready to pass to the `ProteinDataset` class.
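For example, a minimal sketch for inspecting a downloaded task folder (the folder path is a placeholder, and the exact layout of `id2label.json` is an assumption based on the description above):
```python
import json
from pathlib import Path

# Placeholder path to one downloaded downstream task folder.
task_folder = Path("path/to/downstream_task")

# id2label.json holds the targets for each dataset split (exact layout assumed).
with open(task_folder / "id2label.json") as f:
    id2label = json.load(f)
print(list(id2label.keys())[:5])

# The `dataset` subfolder holds the preprocessed data for ProteinDataset.
print(sorted(p.name for p in (task_folder / "dataset").iterdir())[:5])
```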
Inference
To run inference on all downstream datasets, run
```bash
CUDA_VISIBLE_DEVICES=0 python mulan/eval_adapter.py -c configs/eval/eval_mulan_small.yaml
```
This script saves protein-level embeddings for all downstream tasks into the folder you specify in the config file.
The same procedure should be done for the other models (MULAN-ESM2 or MULAN-SaProt): select the correct config for both training and inference.
Also, set the correct model name for the `trained_adapter_name` argument in the config file; it should match the path of the trained model that contains the checkpoint.
Downstream task evaluation
Finally, to run predictions for all types of downstream tasks, run
```bash
python scripts/downstream.py -c configs/downstream.yaml
```
This script requires the pre-computed protein embeddings saved for all downstream tasks. In the config, specify the correct path to the folder with downstream datasets, the downstream tasks you want to evaluate, and the model names that were used to produce the embeddings (the names of the folders with embeddings for the downstream tasks).
Training
The configs below require correct data paths to be set inside the config files.
To run training for MULAN-small, run
```bash
CUDA_VISIBLE_DEVICES=0 python mulan/train_adapter.py -n mulan_small -c configs/mulan_small.yaml
```
The same should be done for the other models, using the corresponding config files. The name you pass during training is used to create the output folder with training checkpoints.
Citation
If you use MULAN in your work, please cite our paper:
```bibtex
@article{frolova2024mulan,
  title={MULAN: Multimodal Protein Language Model for Sequence and Structure Encoding},
  author={Frolova, Daria and Pak, Marina and Litvin, Anna and Sharov, Ilya and Ivankov, Dmitry and Oseledets, Ivan},
  journal={bioRxiv},
  pages={2024--05},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```