Awesome
MSAGPT
<table> <tr> <td> <h2>MSAGPT</h2> <p>📖 Paper: <a href="https://arxiv.org/abs/2406.05347">MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training</a></p> <p><b>MSAGPT</b> is a powerful protein language model (PLM). MSAGPT has 3 billion parameters with three versions of the model, MSAGPT, MSAGPT-Sft, and MSAGPT-Dpo, <b>supporting zero-shot and few-shot MSA generation</b>.</p> <p><b>MSAGPT achieves state-of-the-art structural prediction performance on natural MSA-scarce scenarios</b>.</p> </td> </tr> </table>Overall Framework
<p align="center"> <img src="resources/overall_frame.png" alt="描述文字" style="display: block; margin: auto; width: 90%;"> </p>Visualized Cases
Visualization of improved structure prediction compared with nature MSA. <font color=orange>Yellow</font>: Ground truth; <font color=purple>Purple</font>: Predictions based on MSA generated by MSAGPT; <font color=cyan>Cyan</font>: Predictions from MSA generated by natural MSA.
<p align="center"> <img src="resources/app_case.png" alt="描述文字" style="display: block; margin: auto; width: 90%;"> </p>Get Started:
Option 1:Deploy MSAGPT by yourself
We support GUI for model inference.
First, we need to install the dependencies.
# CUDA >= 11.8
pip install -r requirements.txt
Model List
You can choose to manually download the necessary weights. Then UNZIP it and put it into the checkpoints folder.
Model | Type | Seq Length | Download |
---|---|---|---|
MSAGPT | Base | 16K | 🤗 Huggingface 🔨 SwissArmyTransformer |
MSAGPT-SFT | Sft | 16K | 🤗 Huggingface 🔨 SwissArmyTransformer |
MSAGPT-DPO | Rlhf | 16K | 🤗 Huggingface 🔨 SwissArmyTransformer |
Situation 1.1 CLI (SAT version)
Run CLI demo via:
# Online Chat
bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT-DPO --input-source chat --stream_chat --max-gen-length 1024
The program will automatically interact in the command line. You can generate replies entering the protein sequence you need to generate virtual MSAs (or add a few MSAs as a prompt, connected by "<M>"), for example: "PEGKQGDPGIPGEPGPPGPPGPQGARGPPG<M>VTVEFVNSCLIGDMGVDGPPGQQGQPGPPG", where "PEGKQGDPGIPGEPGPPGPPGPQGARGPPG" is the main sequence, and "VTVEFVNSCLIGDMGVDGPPGQQGQPGPPG" are MSA prompts, and pressing enter. Enter stop
to stop the program. The chat CLI looks like:
You can also enable the offline generation by set the --input-source <your input file> and --output-path <your output path>. We set an input file example: msa_input.
# Offline Generation
bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT-DPO --input-source <your input file> --output-path <your output path> --max-gen-length 1024
Situation 1.2 CLI (Huggingface version)
(TODO)
Situation 1.3 Web Demo
(TODO)
Option 2:Finetuning MSAGPT
(TODO)
Hardware requirement
-
Model Inference: For BF16: 1 * A100(80G)
-
Finetuning:
For BF16: 4 * A100(80G) [Recommend].
Natural MSA-scarce benchmark
Please find the 199 cases along with their retrieved MSAs in the natural-msa-scarce-cases.txt file. Each line is structured as follows:
<PDB-id> <Primary Sequence> <M> <MSA1> <M> ... <MSAn>
Explanation of the Data Structure
<PDB-id>
: The unique identifier for the protein structure.<Primary Sequence>
: The amino acid sequence of the protein.<M>
: A delimiter separating different sections.<MSA1> ... <MSAn>
: The multiple sequence alignments retrieved for each case.
License
The code in this repository is open source under the Apache-2.0 license.
If you find our work helpful, please consider citing the our paper
@article{chen2024msagpt,
title={MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training},
author={Chen, Bo and Bei, Zhilei and Cheng, Xingyi and Li, Pan and Tang, Jie and Song, Le},
journal={arXiv preprint arXiv:2406.05347},
year={2024}
}