
<h2 align="center"> <a href="https://arxiv.org/abs/2402.16445">ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing</a></h2> <h5 align="center">


<!-- [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/Lyu6PosHao/ProLLaMA/blob/main/LICENSE) --> <br> </h5>

📣 News

🗝️ Abstract

Large Language Models (LLMs), including GPT-x and LLaMA2, have achieved remarkable performance on multiple Natural Language Processing (NLP) tasks. Under the premise that protein sequences constitute the protein language, Protein Large Language Models (ProLLMs) trained on protein corpora excel at de novo protein sequence generation. However, unlike LLMs in NLP, no existing ProLLM is capable of handling multiple tasks in the Protein Language Processing (PLP) field. We introduce a training framework to transform any general LLM into a ProLLM capable of handling multiple PLP tasks. Specifically, our framework utilizes low-rank adaptation and employs a two-stage training approach, and it is distinguished by its universality, low overhead, and scalability. Through training under this framework, we propose the ProLLaMA model, the first known ProLLM to handle multiple PLP tasks simultaneously. Experiments show that ProLLaMA achieves state-of-the-art results in the unconditional protein sequence generation task. In the controllable protein sequence generation task, ProLLaMA can design novel proteins with desired functionalities. In the protein property prediction task, ProLLaMA achieves nearly 100% accuracy across many categories. The latter two tasks are beyond the reach of other ProLLMs.

<p align="center"><img src="img/intro.png" title="" height="500"></p> <details open><summary> I also have other AI for Science projects that may interest you. </summary><p> <!-- may -->

TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation <br> Zongying Lin, Li Hao, Liuzhenghao Lv, Bin Lin, Junwu Zhang, Calvin Yu-Chian Chen, Li Yuan, Yonghong Tian<br> github
arXiv <br>

</p></details>

💡Highlights

- Powerful model
- General training framework
- Excellent performance

😮Main Results

🚀Pipeline

The training framework we propose is as follows:

<p align="center"><img src="img/train_framework_v3.png" title=""></p>

🛠️Quick Inference

Since ProLLaMA's architecture is identical to LLaMA2's, you can run inference with ProLLaMA in the same way as with LLaMA2.

Follow the steps below to use our ProLLaMA for inference.

1. Install Requirements

```bash
git clone https://github.com/Lyu6PosHao/ProLLaMA.git
cd ProLLaMA
pip install -r requirements.txt
```

2. Download Model

Download the model from Hugging Face.

3. Usage

Just as with LLaMA2, three ways are provided here.

(1) Use our interactive inference script:

```bash
CUDA_VISIBLE_DEVICES=0 python ./scripts/infer.py --model "GreatCaptainNemo/ProLLaMA" --interactive
#You can replace the model path with your local path
#Make sure you use only one GPU for inference
#Use "python ./scripts/infer.py -h" for more details
```
(2) Load the model with Hugging Face Transformers directly:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

device = torch.device('cuda:0')

# You can replace the model path with your local path
tokenizer = AutoTokenizer.from_pretrained("GreatCaptainNemo/ProLLaMA", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GreatCaptainNemo/ProLLaMA", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
generation_config = GenerationConfig(temperature=0.2, top_k=40, top_p=0.9, do_sample=True, num_beams=1, repetition_penalty=1.2, max_new_tokens=400)
model.eval()

print("####Enter 'exit' to exit.")
with torch.no_grad():
    while True:
        user = str(input("Input:"))
        if user.strip() == "exit":
            break
        inputs = tokenizer(user, return_tensors="pt").to(device)
        generate_ids = model.generate(inputs.input_ids, generation_config=generation_config)
        response = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        print("Output:", response)
```
(3) Use LLaMA-Factory:

```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
python ./src/cli_demo.py \
      --model_name_or_path /path_to_your_model \
      --template llama2
```

4. Input Format

The instructions you input to the model should follow this format:

```
[Generate by superfamily] Superfamily=<xxx>
or
[Determine superfamily] Seq=<yyy>
```

Here are some examples of the input:

```
[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily>
#You can also specify the first few amino acids of the protein sequence:
[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily> Seq=<MKRVL
[Determine superfamily] Seq=<MAPGGMPREFPSFVRTLPEADLGYPALRGWVLQGERGCVLYWEAVTEVALPEHCHAECWGVVVDGRMELMVDGYTRVYTRGDLYVVPPQARHRARVFPGFRGVEHLSDPDLLPVRKR>
```
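If you script against the model rather than use the interactive prompt, the instruction strings above can be built and the model's output parsed with small helpers. The functions below are a sketch, not part of the ProLLaMA repo; the regex assumes generated sequences use the one-letter amino-acid alphabet.

```python
import re

def make_generation_prompt(superfamily, seq_prefix=None):
    # '[Generate by superfamily]' instruction; an optional seq_prefix fixes
    # the first few amino acids and is deliberately left unterminated,
    # matching the examples above.
    prompt = f"[Generate by superfamily] Superfamily=<{superfamily}>"
    if seq_prefix:
        prompt += f" Seq=<{seq_prefix}"
    return prompt

def make_prediction_prompt(sequence):
    # '[Determine superfamily]' instruction for a complete sequence.
    return f"[Determine superfamily] Seq=<{sequence}>"

def extract_sequence(response):
    # Pull the first complete Seq=<...> span out of a model response, if any.
    m = re.search(r"Seq=<([A-Z]+)>", response)
    return m.group(1) if m else None
```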

See this for the full list of available superfamilies.

🛠️Quick Train

Stage 1

  1. Prepare the dataset: put your dataset under ./scripts/pretraining_dataset. Your dataset should be one or several txt files, where each line is one protein sequence in the format "Seq=<xxx>". We provide ./scripts/pretraining_dataset/example.txt as an example.
  2. Run ./scripts/run_pt.sh
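Protein datasets often come as FASTA, so the one-sequence-per-line "Seq=<xxx>" format expected under ./scripts/pretraining_dataset can be produced with a short converter. The script below is a sketch based on the format described in step 1; it is not a script shipped with the repo.

```python
def fasta_to_pretraining_lines(fasta_text):
    # Convert FASTA text into the "Seq=<xxx>" line format used for
    # Stage 1 continual pre-training (one protein sequence per line).
    out, seq = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):  # a header line starts a new record
            if seq:
                out.append(f"Seq=<{''.join(seq)}>")
                seq = []
        else:
            seq.append(line)
    if seq:  # flush the last record
        out.append(f"Seq=<{''.join(seq)}>")
    return out
```

Writing `"\n".join(fasta_to_pretraining_lines(...))` to a txt file under ./scripts/pretraining_dataset then yields a dataset in the expected layout.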

Stage 2

  1. Prepare the dataset: download our instruction_dataset from Hugging Face and put the train split under ./scripts/instruction_tuning_dataset. We provide ./scripts/instruction_tuning_dataset/example.json as an example.
  2. Run ./scripts/run_it.sh
  3. If you want to fine-tune ProLLaMA on your own dataset instead of our instruction_dataset, process your data into a format similar to our instruction_dataset (or example.json).
  4. It may be better to fine-tune ProLLaMA_Stage_1 instead of ProLLaMA if your dataset is relatively small and not relevant to superfamily tasks.
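For a custom dataset (step 3 above), the instruction records can be generated programmatically. The schema below, with Alpaca-style "instruction"/"output" keys, is an assumption for illustration only; check ./scripts/instruction_tuning_dataset/example.json for the authoritative field names.

```python
import json

def make_generation_record(superfamily, sequence):
    # Pair a '[Generate by superfamily]' instruction with its target sequence.
    # NOTE: the field names here are assumed, not taken from example.json.
    return {
        "instruction": f"[Generate by superfamily] Superfamily=<{superfamily}>",
        "output": f"Seq=<{sequence}>",
    }

def make_prediction_record(sequence, superfamily):
    # Pair a '[Determine superfamily]' instruction with its target label.
    return {
        "instruction": f"[Determine superfamily] Seq=<{sequence}>",
        "output": f"Superfamily=<{superfamily}>",
    }

records = [
    make_generation_record("Ankyrin repeat-containing domain superfamily", "MKRVL"),
    make_prediction_record("MKRVL", "Ankyrin repeat-containing domain superfamily"),
]
print(json.dumps(records, indent=2))
```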

✒️Others

ProLLaMA of Stage 1

ProLLaMA_Stage_1 refers to the model obtained by continual pre-training LLaMA2 on the UniRef50 dataset, as shown in the pipeline. Model Weights

You can use ProLLaMA_Stage_1 in the same way as ProLLaMA. For example:

```bash
CUDA_VISIBLE_DEVICES=0 python ./scripts/infer.py --model "GreatCaptainNemo/ProLLaMA_Stage_1" --interactive
#You can replace the model path with your local path
#Make sure you use only one GPU for inference
#Use "python ./scripts/infer.py -h" for more details
```

However, ProLLaMA_Stage_1's input format is slightly different from ProLLaMA's, since the former is trained only on pure protein sequences, without natural language instructions.

The input format:

```
Seq=<
#You can also specify the first few amino acids of the protein sequence:
Seq=<MAPGGMPRE
```

You can perform instruction tuning on ProLLaMA_Stage_1 (or ProLLaMA) with your custom datasets to make the model capable of the PLP tasks you are interested in.

We plan to build a more powerful ProLLaMA_Stage_1.

✏️Citation

If you find our repo helpful, please consider citing us.

```bibtex
@article{lv2024prollama,
  title={ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing},
  author={Lv, Liuzhenghao and Lin, Zongying and Li, Hao and Liu, Yuyang and Cui, Jiaxi and Chen, Calvin Yu-Chian and Yuan, Li and Tian, Yonghong},
  journal={arXiv preprint arXiv:2402.16445},
  year={2024}
}
```