MSAGPT

<table> <tr> <td> <h2>MSAGPT</h2> <p>📖 Paper: <a href="https://arxiv.org/abs/2406.05347">MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training</a></p> <p><b>MSAGPT</b> is a protein language model (PLM) with 3 billion parameters. It is released in three versions, MSAGPT, MSAGPT-Sft, and MSAGPT-Dpo, <b>supporting zero-shot and few-shot MSA generation</b>.</p> <p><b>MSAGPT achieves state-of-the-art structure prediction performance in natural MSA-scarce scenarios</b>.</p> </td> </tr> </table>

Overall Framework

<p align="center"> <img src="resources/overall_frame.png" alt="Overall framework of MSAGPT" style="display: block; margin: auto; width: 90%;"> </p>

Visualized Cases

Visualization of improved structure prediction compared with natural MSA. <font color=orange>Yellow</font>: Ground truth; <font color=purple>Purple</font>: Predictions based on MSA generated by MSAGPT; <font color=cyan>Cyan</font>: Predictions based on natural MSA.

<p align="center"> <img src="resources/app_case.png" alt="Visualized structure prediction cases" style="display: block; margin: auto; width: 90%;"> </p>

Get Started:

Option 1: Deploy MSAGPT yourself

We support a GUI for model inference.

First, install the dependencies.

# CUDA >= 11.8
pip install -r requirements.txt

Model List

Download the weights you need manually, then unzip them and place them in the checkpoints folder.

| Model | Type | Seq Length | Download |
|---|---|---|---|
| MSAGPT | Base | 16K | 🤗 Huggingface 🔨 SwissArmyTransformer |
| MSAGPT-SFT | Sft | 16K | 🤗 Huggingface 🔨 SwissArmyTransformer |
| MSAGPT-DPO | Rlhf | 16K | 🤗 Huggingface 🔨 SwissArmyTransformer |

Situation 1.1 CLI (SAT version)

Run the CLI demo via:

# Online Chat
bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT-DPO --input-source chat --stream_chat --max-gen-length 1024

The program runs as an interactive session in the command line. Enter the protein sequence for which you want to generate virtual MSAs, optionally followed by a few MSAs as a prompt joined by "<M>", and press Enter. For example, in "PEGKQGDPGIPGEPGPPGPPGPQGARGPPG<M>VTVEFVNSCLIGDMGVDGPPGQQGQPGPPG", "PEGKQGDPGIPGEPGPPGPPGPQGARGPPG" is the main sequence and "VTVEFVNSCLIGDMGVDGPPGQQGQPGPPG" is the MSA prompt. Enter stop to exit the program. The chat CLI looks like:

<p align="center"> <img src="resources/demo.gif" alt="Chat CLI demo" style="display: block; margin: auto; width: 90%;"> </p>
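The input format described above (a main sequence, optionally followed by MSA prompts joined by "<M>") can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper names `build_prompt` and `split_prompt` are hypothetical, and the only thing taken from the text is the "<M>" separator.

```python
MSA_SEP = "<M>"  # separator between the main sequence and MSA prompts


def build_prompt(main_sequence, msa_prompts=()):
    """Join the main sequence and optional MSA prompts with "<M>".

    Hypothetical helper for illustration; it only concatenates strings.
    """
    parts = [main_sequence.strip()] + [m.strip() for m in msa_prompts]
    if not all(parts):
        raise ValueError("sequences must be non-empty")
    return MSA_SEP.join(parts)


def split_prompt(prompt):
    """Inverse of build_prompt: return (main_sequence, list_of_msa_prompts)."""
    main, *msas = [p.strip() for p in prompt.split(MSA_SEP)]
    return main, msas


# Few-shot prompt: main sequence plus one MSA prompt (example from the text).
prompt = build_prompt(
    "PEGKQGDPGIPGEPGPPGPPGPQGARGPPG",
    ["VTVEFVNSCLIGDMGVDGPPGQQGQPGPPG"],
)
print(prompt)

# Zero-shot prompt: the main sequence alone, with no "<M>" separator.
print(build_prompt("PEGKQGDPGIPGEPGPPGPPGPQGARGPPG"))
```

The resulting string is what you would type or paste at the chat prompt; passing an empty list of MSA prompts yields the zero-shot form.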

You can also run offline generation by setting --input-source <your input file> and --output-path <your output path>. An example input file is provided: msa_input.

# Offline Generation
bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT-DPO --input-source <your input file> --output-path <your output path> --max-gen-length 1024
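If you want to launch offline generation from a Python script (for example, to loop over several input files), the command above can be built as an argument list. This is only a sketch: `offline_command` is a hypothetical helper, `./output` is a placeholder path, and the flags are exactly those shown in the command above, nothing more.

```python
import shlex


def offline_command(input_file, output_path,
                    checkpoint="./checkpoints/MSAGPT-DPO",
                    max_gen_length=1024):
    """Build the offline-generation command as an argument list.

    Hypothetical helper; it only reuses the flags shown in this README.
    """
    return [
        "bash", "scripts/cli_sat.sh",
        "--from_pretrained", checkpoint,
        "--input-source", input_file,
        "--output-path", output_path,
        "--max-gen-length", str(max_gen_length),
    ]


# "./output" is a placeholder; msa_input is the example file named above.
cmd = offline_command("msa_input", "./output")
print(shlex.join(cmd))

# To actually run it from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell-quoting problems when paths contain spaces.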

Situation 1.2 CLI (Huggingface version)

(TODO)

Situation 1.3 Web Demo

(TODO)

Option 2: Finetuning MSAGPT

(TODO)

Hardware requirement

Natural MSA-scarce benchmark

Please find the 199 cases along with their retrieved MSAs in the natural-msa-scarce-cases.txt file. Each line is structured as follows:

<PDB-id> <Primary Sequence> <M> <MSA1> <M> ... <MSAn>
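The line layout above can be read with a few lines of Python. The sketch below is illustrative rather than an official loader: `BenchmarkCase`, `parse_case`, and `load_cases` are hypothetical names, the sample ID and sequences are made up, and it assumes the PDB id is separated from the rest by whitespace while spaces around "<M>" are optional.

```python
from typing import Iterator, List, NamedTuple


class BenchmarkCase(NamedTuple):
    pdb_id: str
    primary_sequence: str
    msas: List[str]


def parse_case(line: str) -> BenchmarkCase:
    """Parse one line: <PDB-id> <Primary Sequence> <M> <MSA1> <M> ... <MSAn>."""
    fields = line.strip().split(None, 1)  # split off the PDB id
    if len(fields) != 2:
        raise ValueError("expected '<PDB-id> <sequence> ...'")
    pdb_id, rest = fields
    parts = [p.strip() for p in rest.split("<M>")]
    if not parts[0]:
        raise ValueError("missing primary sequence")
    # Everything after the first "<M>" is a retrieved MSA; skip empty chunks.
    return BenchmarkCase(pdb_id, parts[0], [p for p in parts[1:] if p])


def load_cases(path: str) -> Iterator[BenchmarkCase]:
    """Yield one BenchmarkCase per non-empty line of the benchmark file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_case(line)


# Made-up line for illustration only (not taken from the benchmark file).
case = parse_case("1ABC ACDEFGHIK <M> ACDEFGHLK <M> ACD-FGHIK")
print(case.pdb_id, case.primary_sequence, len(case.msas))
```

With the file in place, `list(load_cases("natural-msa-scarce-cases.txt"))` would collect all cases into memory.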

Explanation of the Data Structure

License

The code in this repository is open source under the Apache-2.0 license.

If you find our work helpful, please consider citing our paper:

@article{chen2024msagpt,
  title={MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training},
  author={Chen, Bo and Bei, Zhilei and Cheng, Xingyi and Li, Pan and Tang, Jie and Song, Le},
  journal={arXiv preprint arXiv:2406.05347},
  year={2024}
}