
<h1 align="center"> 🧪 Mol-Instructions </h1> <h3 align="center"> An open, large-scale biomolecular instruction dataset for large language models. </h3> <p align="center"> 📃 <a href="https://arxiv.org/abs/2306.08018" target="_blank">Paper</a> • ⏬ <a href="https://huggingface.co/datasets/zjunlp/Mol-Instructions" target="_blank">Dataset</a><br> </p>


<div align=center><img src="fig/abs.png" width="100%" height="100%" /></div>


<h2 id="1">1. Overview</h2> <h3 id="1-1"> 📊 1.1 Data Stats</h3> <div align=center><img src="fig/stat.png" width="90%" height="90%" /></div>

Mol-Instructions comprises three cardinal components: molecule-oriented instructions, protein-oriented instructions, and biomolecule text instructions.

<h3 id="1-2"> 🛠️ 1.2 Data Construction</h3> <div align=center><img src="fig/construction.png" width="100%" height="100%" /></div> <h3 id="1-3"> 🤗 1.3 Data Release</h3>

We release the dataset on Hugging Face at zjunlp/Mol-Instructions.
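As a sketch, the dataset can be pulled with the Hugging Face `datasets` library and its Alpaca-style records (fields `instruction`, `input`, `output`) formatted into prompts. The configuration name passed to `load_dataset` below is an assumption; verify the exact spelling against the dataset card on the Hub.

```python
def load_molecule_subset():
    """Download the molecule-oriented subset (requires network access).

    The configuration name is an assumption -- check the dataset card
    on Hugging Face for the exact spelling.
    """
    from datasets import load_dataset  # pip install datasets
    return load_dataset("zjunlp/Mol-Instructions",
                        "Molecule-oriented Instructions")


def to_prompt(record):
    """Format an Alpaca-style record into a single prompt string."""
    prompt = f"Instruction: {record['instruction']}\n"
    if record.get("input"):
        prompt += f"Input: {record['input']}\n"
    return prompt + "Response:"
```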

<h2 id="2">2. Tasks</h2> <h3 id="2-1"> 🔬 2.1 Molecule-oriented</h3> <details> <summary><b>Molecule description generation</b></summary> </details> <details> <summary><b>Description-guided molecule design</b></summary> </details> <details> <summary><b>Forward reaction prediction</b></summary> </details> <details> <summary><b>Retrosynthesis</b></summary> </details> <details> <summary><b>Reagent prediction</b></summary> </details> <details> <summary><b>Property prediction</b></summary> </details> <h3 id="2-2"> 🧬 2.2 Protein-oriented</h3> <details> <summary><b>Protein design</b></summary>
  1. The presence of Mg(2+) is necessary for the protein to function in the desired environment.
  2. The AMP, (6S)-NADPHX binding site should be located in a region of the protein that is accessible to the ligand.
  3. The designed protein should have ATP binding, NADPHX epimerase activity, metal ion binding, ADP-dependent NAD(P)H-hydrate dehydratase activity to facilitate nicotinamide nucleotide metabolic process.
  4. For general function, the protein should catalyze the epimerization of the S- and R-forms of NAD(P)HX, a damaged form of NAD(P)H that results from enzymatic or heat-dependent hydration.
MSNELVLSREQVRRVDQRAIEAYGVPGIVLMENAGRGAAEIIRAACPSAQRVLIACGPGNNGGDGFVIARHLANAGWMVELLLACPADRITGDAQGNHEIIRRMNLPCAVMADARDLEAANDRFATADVIVDALLGTGASGPPREPIASLIRAINEAHRRVSAQPAPSVFAVDIPSGLDCDTGEAANPTVRADHTITFVARKIGFRNPAARDLLGRVHVVDIGAPRAAIQDALTGKSG
</details> <details> <summary><b>Catalytic activity prediction</b></summary> </details> <details> <summary><b>Protein function prediction</b></summary> </details> <details> <summary><b>Functional description generation</b></summary> </details> <details> <summary><b>Domain/Motif prediction</b></summary> </details> <h3 id="2-3"> 🥼 2.3 Biomolecule text</h3> <details> <summary><b>Chemical entity recognition</b></summary> </details> <details> <summary><b>Chemical-disease interaction extraction</b></summary> </details> <details> <summary><b>Chemical-protein interaction extraction</b></summary> </details> <details> <summary><b>Multiple-choice question</b></summary> </details> <details> <summary><b>True or False question</b></summary> </details> <details> <summary><b>Open question</b></summary> </details> <h2 id="3">3. Demo</h2> <div align=left><img src="fig/logo.png" width="22%" height="22%" /></div> <h3 id="3-1"> 🤗 3.1 Model Weight Release</h3>

We release the model weights on Hugging Face at zjunlp/llama-molinst-molecule-7b and zjunlp/llama-molinst-biotext-7b, along with a model fine-tuned on protein-oriented instructions (see the usage guide below).

<h3 id="3-2"> 📝 3.2 Model Usage Guide</h3>

We have provided a web-based demo built on Gradio. To use it, first clone this repository:

```shell
git clone https://github.com/zjunlp/Mol-Instruction
cd Mol-Instruction/demo
```

Step 1, install Gradio by running `pip install gradio`.

Step 2, specify the parameters in the `generate.sh` file:

```shell
CUDA_VISIBLE_DEVICES=0 python generate.py \
    --CLI False \
    --protein False \
    --load_8bit \
    --base_model $BASE_MODEL_PATH \
    --share_gradio True \
    --lora_weights $FINETUNED_MODEL_PATH
```
For models fine-tuned on molecule-oriented and biomolecular text instructions, set `$FINETUNED_MODEL_PATH` to `zjunlp/llama-molinst-molecule-7b` or `zjunlp/llama-molinst-biotext-7b`, respectively.
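For example, one might export the two variables before running the script. The base-model path below is a hypothetical placeholder; substitute your own local LLaMA checkpoint:

```shell
# Hypothetical values -- substitute your own base-model checkpoint.
export BASE_MODEL_PATH="path/to/llama-7b"                       # local base LLaMA weights (placeholder)
export FINETUNED_MODEL_PATH="zjunlp/llama-molinst-molecule-7b"  # LoRA weights from the Hub
```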

For the model fine-tuned on protein-oriented instructions, you need to perform additional steps as described in this folder.

Step 3, run the `generate.sh` file in the repository:

```shell
sh generate.sh
```

We offer two interaction modes: web-based interaction, which provides greater flexibility, and command-line interaction.

1. Use the following command for web-based interaction:

```shell
python generate.py
```

The program starts a web server and prints its address; open the address in a browser to use the demo.

2. Use the following command for command-line interaction:

```shell
python generate.py --CLI True
```

Its disadvantage is that decoding parameters cannot be changed dynamically.

<p align="center"> <img alt="Demo" src=fig/gradio_interface_gif.gif style="width: 700px; height: 340px;"/> </p> <h3 id="3-3"> 💡 3.3 Quantitative Experiments</h3>

To investigate whether Mol-Instructions can enhance LLMs' understanding of biomolecules, we conduct the following quantitative experiments. For detailed experimental settings and analysis, please refer to our paper; to reproduce the experiments, please refer to our evaluation code.

🧪 Molecular generation tasks

**Description-guided molecule design**

| Model | Exact↑ | BLEU↑ | Levenshtein↓ | RDK FTS↑ | MACCS FTS↑ | Morgan FTS↑ | Validity↑ |
|---|---|---|---|---|---|---|---|
| Alpaca | 0.000 | 0.004 | 51.088 | 0.006 | 0.029 | 0.000 | 0.002 |
| Baize | 0.000 | 0.006 | 53.796 | 0.000 | 0.000 | 0.000 | 0.002 |
| ChatGLM | 0.000 | 0.004 | 53.157 | 0.005 | 0.000 | 0.000 | 0.005 |
| LLaMA | 0.000 | 0.003 | 59.864 | 0.005 | 0.000 | 0.000 | 0.003 |
| Vicuna | 0.000 | 0.006 | 60.356 | 0.006 | 0.001 | 0.000 | 0.001 |
| Galactica | 0.000 | 0.192 | 44.152 | 0.135 | 0.238 | 0.088 | 0.992 |
| Text+Chem T5 | 0.097 | 0.508 | 41.819 | 0.352 | 0.474 | 0.353 | 0.721 |
| MolT5 | 0.112 | 0.546 | 38.276 | 0.400 | 0.538 | 0.295 | 0.773 |
| Ours (LLaMA2-chat) | 0.002 | 0.345 | 41.367 | 0.231 | 0.412 | 0.147 | 1.000 |
| Ours (LLaMA3-Instruct) | 0.025 | 0.521 | 38.742 | 0.358 | 0.520 | 0.221 | 1.000 |
**Forward reaction prediction**

| Model | Exact↑ | BLEU↑ | Levenshtein↓ | RDK FTS↑ | MACCS FTS↑ | Morgan FTS↑ | Validity↑ |
|---|---|---|---|---|---|---|---|
| Alpaca | 0.000 | 0.065 | 41.989 | 0.004 | 0.024 | 0.008 | 0.138 |
| Baize | 0.000 | 0.044 | 41.500 | 0.004 | 0.025 | 0.009 | 0.097 |
| ChatGLM | 0.000 | 0.183 | 40.008 | 0.050 | 0.100 | 0.044 | 0.108 |
| LLaMA | 0.000 | 0.020 | 42.002 | 0.001 | 0.002 | 0.001 | 0.039 |
| Vicuna | 0.000 | 0.057 | 41.690 | 0.007 | 0.016 | 0.006 | 0.059 |
| Galactica | 0.000 | 0.468 | 35.021 | 0.156 | 0.257 | 0.097 | 0.946 |
| Text+Chem T5 | 0.239 | 0.782 | 20.413 | 0.705 | 0.789 | 0.652 | 0.762 |
| Ours (LLaMA2-chat) | 0.045 | 0.654 | 27.262 | 0.313 | 0.509 | 0.262 | 1.000 |
| Ours (LLaMA3-Instruct) | 0.503 | 0.883 | 13.410 | 0.756 | 0.863 | 0.708 | 1.000 |
**Retrosynthesis**

| Model | Exact↑ | BLEU↑ | Levenshtein↓ | RDK FTS↑ | MACCS FTS↑ | Morgan FTS↑ | Validity↑ |
|---|---|---|---|---|---|---|---|
| Alpaca | 0.000 | 0.063 | 46.915 | 0.005 | 0.023 | 0.007 | 0.160 |
| Baize | 0.000 | 0.095 | 44.714 | 0.025 | 0.050 | 0.023 | 0.112 |
| ChatGLM | 0.000 | 0.117 | 48.365 | 0.056 | 0.075 | 0.043 | 0.046 |
| LLaMA | 0.000 | 0.036 | 46.844 | 0.018 | 0.029 | 0.017 | 0.010 |
| Vicuna | 0.000 | 0.057 | 46.877 | 0.025 | 0.030 | 0.021 | 0.017 |
| Galactica | 0.000 | 0.452 | 34.940 | 0.167 | 0.274 | 0.134 | 0.984 |
| Text+Chem T5 | 0.141 | 0.765 | 24.043 | 0.685 | 0.765 | 0.585 | 0.698 |
| Ours (LLaMA2-chat) | 0.009 | 0.705 | 31.227 | 0.283 | 0.487 | 0.230 | 1.000 |
| Ours (LLaMA3-Instruct) | 0.333 | 0.842 | 17.642 | 0.704 | 0.815 | 0.646 | 1.000 |
**Reagent prediction**

| Model | Exact↑ | BLEU↑ | Levenshtein↓ | RDK FTS↑ | MACCS FTS↑ | Morgan FTS↑ | Validity↑ |
|---|---|---|---|---|---|---|---|
| Alpaca | 0.000 | 0.026 | 29.037 | 0.029 | 0.016 | 0.001 | 0.186 |
| Baize | 0.000 | 0.051 | 30.628 | 0.022 | 0.018 | 0.004 | 0.099 |
| ChatGLM | 0.000 | 0.019 | 29.169 | 0.017 | 0.006 | 0.002 | 0.074 |
| LLaMA | 0.000 | 0.003 | 28.040 | 0.037 | 0.001 | 0.001 | 0.001 |
| Vicuna | 0.000 | 0.010 | 27.948 | 0.038 | 0.002 | 0.001 | 0.007 |
| Galactica | 0.000 | 0.141 | 30.760 | 0.036 | 0.127 | 0.051 | 0.995 |
| Text+Chem T5 | 0.000 | 0.225 | 49.323 | 0.039 | 0.186 | 0.052 | 0.313 |
| Ours (LLaMA2-chat) | 0.044 | 0.224 | 23.167 | 0.237 | 0.364 | 0.213 | 1.000 |
| Ours (LLaMA3-Instruct) | 0.101 | 0.648 | 18.326 | 0.412 | 0.521 | 0.375 | 1.000 |
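The Levenshtein column above reports the character-level edit distance between a generated SMILES string and its reference. As an illustration (not necessarily the exact implementation in our evaluation code), the standard dynamic-programming version is:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions that turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```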

🔍 Molecular property prediction task & Molecule and protein understanding tasks

**Property prediction**

| Model | MAE↓ |
|---|---|
| Alpaca | 322.109 |
| Baize | 261.343 |
| ChatGLM | - |
| LLaMA | 5.553 |
| Vicuna | 860.051 |
| Galactica | 0.568 |
| Ours (LLaMA2-chat) | 0.013 |
| Ours (LLaMA3-Instruct) | 15.059 |
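MAE here is the mean absolute error between the numeric property values a model predicts and the ground-truth values. A minimal sketch:

```python
def mean_absolute_error(preds, refs):
    """Mean absolute error over paired predicted and reference values."""
    if len(preds) != len(refs) or not preds:
        raise ValueError("preds and refs must be non-empty and equal-length")
    return sum(abs(p - r) for p, r in zip(preds, refs)) / len(preds)
```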
**Molecule description generation**

| Model | BLEU-2↑ | BLEU-4↑ | ROUGE-1↑ | ROUGE-2↑ | ROUGE-L↑ | METEOR↑ |
|---|---|---|---|---|---|---|
| Alpaca | 0.068 | 0.014 | 0.178 | 0.041 | 0.136 | 0.107 |
| Baize | 0.064 | 0.015 | 0.189 | 0.053 | 0.148 | 0.106 |
| ChatGLM | 0.055 | 0.011 | 0.163 | 0.036 | 0.121 | 0.105 |
| LLaMA | 0.059 | 0.014 | 0.164 | 0.066 | 0.148 | 0.184 |
| Vicuna | 0.052 | 0.011 | 0.151 | 0.055 | 0.130 | 0.168 |
| Galactica | 0.024 | 0.008 | 0.074 | 0.015 | 0.063 | 0.065 |
| Text+Chem T5 | 0.062 | 0.036 | 0.126 | 0.075 | 0.119 | 0.139 |
| MolT5 | 0.002 | 0.001 | 0.036 | 0.001 | 0.034 | 0.033 |
| Ours (LLaMA2-chat) | 0.217 | 0.143 | 0.337 | 0.196 | 0.291 | 0.254 |
| Ours (LLaMA3-Instruct) | 0.419 | 0.361 | 0.719 | 0.646 | 0.709 | 0.637 |
**Protein understanding tasks** (metric: ROUGE-L↑)

| Model | Protein Function | Functional Description | Catalytic Activity | Domain/Motif |
|---|---|---|---|---|
| Alpaca | 0.20 | 0.10 | 0.23 | 0.12 |
| Baize | 0.20 | 0.15 | 0.22 | 0.13 |
| ChatGLM | 0.15 | 0.14 | 0.13 | 0.10 |
| LLaMA | 0.12 | 0.12 | 0.13 | 0.09 |
| Vicuna | 0.15 | 0.14 | 0.16 | 0.12 |
| Galactica | 0.07 | 0.08 | 0.08 | 0.06 |
| Ours (LLaMA) | 0.43 | 0.44 | 0.52 | 0.46 |
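ROUGE-L scores a candidate against a reference by their longest common subsequence of tokens. As an illustration (not necessarily the exact configuration used in our evaluation code; the beta weighting below is an assumption), an LCS-based F-score looks like this:

```python
def lcs_length(x, y):
    """Longest common subsequence length of two token sequences."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-score over whitespace tokens; beta > 1 weights recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```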

🧫 Bioinformatic NLP tasks

| Model | True or False (Acc↑) | Multi-choice (Acc↑) | Chemical Entity Recognition (F1↑) | Chemical-disease Interaction Extraction (F1↑) | Chemical-protein Interaction Extraction (F1↑) |
|---|---|---|---|---|---|
| Alpaca | 0.330 | 0.286 | 0.213 | 0.037 | 0.002 |
| Baize | 0.480 | 0.237 | 0.009 | 0.004 | 0.004 |
| ChatGLM | 0.180 | 0.223 | 0.150 | 0.020 | 0.003 |
| LLaMA | 0.270 | 0.297 | 0.000 | 0.050 | 0.003 |
| Vicuna | 0.120 | 0.290 | 0.024 | 0.084 | 0.013 |
| Galactica | 0.420 | 0.312 | 0.166 | 0.026 | 0.001 |
| PMC_LLaMA | 0.510 | 0.625 | 0.003 | 0.000 | 0.000 |
| Ours (LLaMA2-chat) | 0.550 | 0.649 | 0.753 | 0.399 | 0.224 |
| Ours (LLaMA3-Instruct) | 0.600 | 0.961 | 0.694 | 0.355 | 0.177 |
**Open question**

| Model | BLEU↑ | ROUGE-1↑ | BERTScore↑ |
|---|---|---|---|
| Alpaca | 0.003 | 0.088 | 0.824 |
| Baize | 0.005 | 0.100 | 0.811 |
| ChatGLM | 0.003 | 0.090 | 0.795 |
| LLaMA | 0.003 | 0.100 | 0.814 |
| Vicuna | 0.004 | 0.097 | 0.814 |
| Galactica | 0.000 | 0.039 | 0.794 |
| PMC_LLaMA | 0.007 | 0.788 | 0.625 |
| Ours (LLaMA2-chat) | 0.024 | 0.221 | 0.837 |
| Ours (LLaMA3-Instruct) | 0.010 | 0.198 | 0.846 |
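The F1 scores for the extraction tasks compare the set of items a model extracts (entity mentions or interaction pairs) against the gold set. A minimal sketch of such a set-level F1, offered as an illustration rather than our exact scorer:

```python
def extraction_f1(predicted, gold):
    """F1 between sets of predicted and gold items,
    e.g. entity mentions or chemical-disease interaction pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives: items in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```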
<h3 id="3-4"> 💡 3.4 FAQ</h3> <h2 id="4">4. Notices</h2> <h3 id="4-1"> 🚨 4.1. Usage and License</h3>

Please note that all data and model weights of Mol-Instructions are licensed exclusively for research purposes. The accompanying dataset is licensed under CC BY-NC-SA 4.0, which permits solely non-commercial usage.

We emphatically urge all users to adhere to the highest ethical standards when using our dataset, including maintaining fairness, transparency, and responsibility in their research. Any usage of the dataset that may lead to harm or pose a detriment to society is strictly forbidden.

In terms of dataset maintenance, we pledge our commitment to provide necessary upkeep. This will ensure the continued relevance and usability of the dataset in light of evolving research landscapes. This commitment encompasses regular updates, error checks, and amendments in accordance with field advancements and user feedback.

<h3 id="4-2"> ⚠️ 4.2. Limitations</h3>

The current state of the model, obtained via instruction tuning, is a preliminary demonstration. Its capacity to handle real-world, production-grade tasks remains limited. Moreover, there is a vast reservoir of rich instruction data that remains to be collected and exploited.

<h2 id="5">5. About</h2> <h3 id="5-1"> 📚 5.1 References</h3>

If you use our repository, please cite the following related paper:

```bibtex
@inproceedings{fang2023mol,
  author       = {Yin Fang and
                  Xiaozhuan Liang and
                  Ningyu Zhang and
                  Kangwei Liu and
                  Rui Huang and
                  Zhuo Chen and
                  Xiaohui Fan and
                  Huajun Chen},
  title        = {Mol-Instructions: {A} Large-Scale Biomolecular Instruction Dataset
                  for Large Language Models},
  booktitle    = {{ICLR}},
  publisher    = {OpenReview.net},
  year         = {2024},
  url          = {https://openreview.net/pdf?id=Tlsdsb6l9n}
}
```
<h3 id="5-2"> 🫱🏻‍🫲 5.2 Acknowledgements</h3>

We appreciate LLaMA, Huggingface Transformers Llama, Alpaca, Alpaca-LoRA, Chatbot Service and many other related works for their open-source contributions. The logo of the model is automatically generated by Wenxin Yige.
