InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery (COLING 2025)

Code for our paper InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery

[Project Page] [Paper]

Overview

<p align="center"> <a> <img src="assets/static/teaser.png" width="100%"> </a> </p>

The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialized models, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.

Architecture

The diagram presented below provides an overview of the architectural design of the InstructMol model, along with its two-stage training paradigm. The example molecule in the figure is Terephthalaldehyde (CID 12173).

<p align="center"> <a> <img src="assets/static/overview.png" width="80%"> </a> </p>

Release

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that comply with the license agreements of LLaMA, Vicuna, LLaVA, Mol-Instructions, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset should not be used outside of research purposes.

Contents

Install

Installation mostly follows the LLaVA installation instructions.

  1. Clone this repository and navigate to project folder

  2. Install Package

conda create -n instructmol python=3.10 -y
conda activate instructmol
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

# Install graph-related packages. We use torch 1.12 with CUDA 11.6; adjust to match your setup.
pip install -r requirements.txt
  3. Install additional packages for training cases
pip install ninja
pip install flash-attn --no-build-isolation

Weights

Component Weights Download

Create a folder named checkpoints in the root directory of this project.

mkdir checkpoints
cd checkpoints

Download the following weights and put them in the checkpoints folder.

# Under the checkpoints folder
# get the weights for the vicuna model (https://huggingface.co/lmsys/vicuna-7b-v1.3)
ln -s YOUR_PATH_TO_vicuna_v1_3_7b vicuna-v1-3-7b
# get the weights for MoleculeSTM model
mkdir MoleculeSTM
wget https://huggingface.co/chao1224/MoleculeSTM/resolve/main/demo/demo_checkpoints_Graph/molecule_model.pth -P MoleculeSTM
# download the weights for scibert_scivocab_uncased model (https://huggingface.co/allenai/scibert_scivocab_uncased)
ln -s YOUR_PATH_TO_scibert_scivocab_uncased scibert_scivocab_uncased
cd .. # back to the root directory
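
Before moving on, it can help to confirm that the folder layout matches the steps above. The following is a minimal sketch, not part of the repository: `check_component_weights` is a hypothetical helper name, and the three relative paths are simply the ones created by the commands in this section.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with InstructMol): report which of the
# component weights described above are present in a checkpoints folder.
check_component_weights() {
    local root="${1:-checkpoints}"
    local missing=0
    local path
    # Relative paths taken from the download steps in this section.
    for path in \
        "vicuna-v1-3-7b" \
        "MoleculeSTM/molecule_model.pth" \
        "scibert_scivocab_uncased"; do
        # -e follows symlinks, so a dangling `ln -s` target counts as missing.
        if [ -e "$root/$path" ]; then
            echo "ok       $root/$path"
        else
            echo "MISSING  $root/$path"
            missing=1
        fi
    done
    # Exit status is 0 only when every expected entry exists.
    return "$missing"
}
```

Run `check_component_weights checkpoints` from the project root; a non-zero exit status means at least one entry still needs to be downloaded or linked.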

InstructMol Weights

Dataset

CLI Inference

Chat with InstructMol from the command line, without the Gradio interface.

#!/bin/bash
# NOTE: Insert the path of the model here (e.g., checkpoints/Graph-LLaVA/llava-moleculestm-vicuna-v1-3-7b-pretrain).
MODEL_PATH=""
python -m llava.serve.cli_graph \
    --model-path "$MODEL_PATH" \
    --model-base checkpoints/vicuna-v1-3-7b \
    --graph-checkpoint-path checkpoints/graphmvp.pth
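
Because `MODEL_PATH` is empty in the script above, launching it unchanged will fail only once the model starts loading. A small guard can catch that earlier. This is a sketch under assumptions: `preflight_cli` is a hypothetical function name, and it only checks that the three paths passed to the CLI exist; it does not validate their contents.

```shell
#!/bin/bash
# Hypothetical guard (not part of the repository): fail fast when MODEL_PATH
# is unset or any path passed to the CLI does not exist.
preflight_cli() {
    local model_path="$1" model_base="$2" graph_ckpt="$3"
    if [ -z "$model_path" ]; then
        echo "MODEL_PATH is empty; set it to a trained checkpoint directory" >&2
        return 1
    fi
    local p
    for p in "$model_path" "$model_base" "$graph_ckpt"; do
        if [ ! -e "$p" ]; then
            echo "not found: $p" >&2
            return 1
        fi
    done
    return 0
}

# Example usage with the paths from the script above:
# preflight_cli "$MODEL_PATH" checkpoints/vicuna-v1-3-7b checkpoints/graphmvp.pth \
#     && python -m llava.serve.cli_graph \
#         --model-path "$MODEL_PATH" \
#         --model-base checkpoints/vicuna-v1-3-7b \
#         --graph-checkpoint-path checkpoints/graphmvp.pth
```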

Train

InstructMol training consists of two stages:

Stage 1: Alignment Pretraining

See pretrain.sh for an example of how to run the pretraining stage.

Stage 2: Task-specific Instruction Tuning

You can train all tasks jointly with finetune_all.sh, or train each task separately (e.g., the molecule description generation task).

Evaluation

See Evaluation.md for detailed instructions on how to evaluate the model.

Citation

If you find InstructMol useful for your research and applications, please cite using this BibTeX:

@misc{cao2023instructmol,
      title={InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery}, 
      author={He Cao and Zijing Liu and Xingyu Lu and Yuan Yao and Yu Li},
      year={2023},
      eprint={2311.16208},
      archivePrefix={arXiv},
      primaryClass={q-bio.BM}
}

Acknowledgement