<h1 align="center"> ⚗️ MolGen </h1>
<h3 align="center"> Domain-Agnostic Molecular Generation with Chemical Feedback </h3>

<p align="center">
  📃 <a href="https://arxiv.org/abs/2301.11259" target="_blank">Paper</a> • 🤗 <a href="https://huggingface.co/zjunlp/MolGen-large" target="_blank">Model</a> • 🔬 <a href="https://huggingface.co/spaces/zjunlp/MolGen" target="_blank">Space</a>
  <br>
</p>

<div align=center><img src="molgen.png" width="100%" height="100%" /></div>

## 🔔 News
- **2024-2** We've released **ChatCell**, a new paradigm that leverages natural language to make single-cell analysis more accessible and intuitive. Please visit our homepage and GitHub page for more information.
- **2024-1** Our paper *Domain-Agnostic Molecular Generation with Chemical Feedback* is accepted by ICLR 2024.
- **2024-1** Our paper *Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models* is accepted by ICLR 2024.
- **2023-10** We open-source **MolGen-7b**, which now supports de novo molecule generation!
- **2023-6** We open-source **KnowLM**, a knowledgeable LLM framework with pre-training and instruction fine-tuning code (supports multi-machine, multi-GPU setups).
- **2023-6** We release **Mol-Instructions**, a large-scale biomolecule instruction dataset for large language models.
- **2023-5** Our paper proposing Knowledge graph-enhanced molecular contrAstive learning with fuNctional prOmpt (**KANO**), which exploits fundamental domain knowledge in both pre-training and fine-tuning, is published in *Nature Machine Intelligence*.
- **2023-4** We provide an NLP-for-science paper list at https://github.com/zjunlp/NLP4Science_Papers.
- **2023-3** We release our pre-trained and fine-tuned models on 🤗 Hugging Face: **MolGen-large** and **MolGen-large-opt**.
- **2023-2** We provide a demo on 🤗 Hugging Face at **Space**.
## 📕 Requirements
To run the code, you can configure dependencies by restoring our environment:

```bash
conda env create -f environment.yaml
```

and then:

```bash
conda activate my_env
```
## 📚 Resource Download
You can download the pre-trained and fine-tuned models via Hugging Face: MolGen-large and MolGen-large-opt.

You can also download the models from Google Drive: https://drive.google.com/drive/folders/1Eelk_RX1I26qLa9c4SZq6Tv-AAbDXgrW?usp=sharing

Moreover, the dataset used for downstream tasks can be found here.
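
For quick experimentation outside the training scripts, the Hugging Face checkpoint can be loaded with the standard `transformers` auto classes. A minimal sketch, assuming MolGen-large exposes the usual BART-style seq2seq interface:

```python
# Minimal sketch: load the pre-trained checkpoint from Hugging Face.
# Assumes MolGen-large exposes a standard BART-style seq2seq interface.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen-large")
model = AutoModelForSeq2SeqLM.from_pretrained("zjunlp/MolGen-large")
```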
The expected structure of files is:
```
moldata
├── checkpoint
│   ├── molgen.pkl            # pre-trained model
│   ├── syn_qed_model.pkl     # fine-tuned model for QED optimization on synthetic data
│   ├── syn_plogp_model.pkl   # fine-tuned model for p-logP optimization on synthetic data
│   ├── np_qed_model.pkl      # fine-tuned model for QED optimization on natural product data
│   └── np_plogp_model.pkl    # fine-tuned model for p-logP optimization on natural product data
├── finetune
│   ├── np_test.csv           # natural product test data
│   ├── np_train.csv          # natural product train data
│   ├── plogp_test.csv        # synthetic test data for p-logP optimization
│   ├── qed_test.csv          # synthetic test data for QED optimization
│   └── zinc250k.csv          # synthetic train data
├── generate                  # generated molecules
├── output                    # molecule candidates
└── vocab_list
    └── zinc.npy              # SELFIES alphabet
```
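
If you want to inspect the SELFIES alphabet, `zinc.npy` is a plain NumPy file; a minimal sketch (`allow_pickle=True` is an assumption, needed only if the array stores Python string objects):

```python
import numpy as np

# Sketch: peek at the SELFIES alphabet stored in vocab_list.
alphabet = np.load("moldata/vocab_list/zinc.npy", allow_pickle=True)
print(len(alphabet), list(alphabet[:10]))
```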
## 🚀 How to run
- **Fine-tune**

  - First, preprocess the fine-tuning dataset by generating candidate molecules with our pre-trained model. The preprocessed data will be stored in the folder `output`. (A toy sketch of the generate-and-score idea behind this step follows this list.)

    ```bash
    cd MolGen
    bash preprocess.sh
    ```

  - Then apply the self-feedback paradigm. The fine-tuned model will be stored in the folder `checkpoint`.

    ```bash
    bash finetune.sh
    ```

- **Generate**

  To generate molecules, run the script below. Specify `checkpoint_path` to choose between the pre-trained and the fine-tuned model.

  ```bash
  cd MolGen
  bash generate.sh
  ```
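
To make the loop concrete outside the shell scripts, here is a minimal end-to-end sketch: generate SELFIES candidates with the Hugging Face checkpoint and rank them by QED, the kind of property scoring that drives the chemical feedback step. The generation arguments are illustrative defaults, not the settings used in the paper; the real pipeline lives in `preprocess.sh`, `finetune.sh`, and `generate.sh`.

```python
# Illustrative sketch only: generate SELFIES candidates and rank them by QED.
import selfies as sf
from rdkit import Chem
from rdkit.Chem import QED
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen-large")
model = AutoModelForSeq2SeqLM.from_pretrained("zjunlp/MolGen-large")

# MolGen operates on SELFIES, so encode a SMILES prompt first.
inputs = tokenizer(sf.encoder("c1ccccc1"), return_tensors="pt")  # benzene
outputs = model.generate(
    **inputs, num_beams=5, num_return_sequences=5, max_length=64
)

scored = []
for seq in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    try:
        smiles = sf.decoder(seq.replace(" ", ""))  # strip tokenizer spacing
    except sf.DecoderError:
        continue  # skip candidates that are not valid SELFIES
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:
        scored.append((QED.qed(mol), smiles))

# Higher-QED candidates would serve as preferred targets in chemical feedback.
for qed, smiles in sorted(scored, reverse=True):
    print(f"{qed:.3f}  {smiles}")
```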
## 🥽 Experiments
We conduct experiments on well-known benchmarks to confirm MolGen's optimization capabilities, encompassing penalized logP, QED, and molecular docking properties. For detailed experimental settings and analysis, please refer to our paper.
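
As a pointer for reproducing the metrics: QED comes directly from RDKit, while penalized logP is conventionally defined as logP minus a synthetic-accessibility score and a long-ring penalty. A sketch of the readily available terms (the SA term needs RDKit's contrib `sascorer` module, which is omitted here):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def ring_penalty(mol):
    # Standard penalized-logP convention: penalize rings larger than 6 atoms.
    sizes = [len(ring) for ring in mol.GetRingInfo().AtomRings()]
    return max(max(sizes) - 6, 0) if sizes else 0

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print("QED:", QED.qed(mol))
# Penalized logP = logP - SA score - ring penalty; only the logP and
# ring-penalty terms are computed in this sketch.
print("logP - ring penalty:", Descriptors.MolLogP(mol) - ring_penalty(mol))
```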
- **MolGen captures real-world molecular distributions**

- **MolGen mitigates molecular hallucinations**

  **Targeted molecule discovery**

  <img width="480" alt="image" src="https://github.com/zjunlp/MolGen/assets/61076726/51533e08-e465-44c8-9e78-858775b59b4f">
  <img width="595" alt="image" src="https://github.com/zjunlp/MolGen/assets/61076726/6f17a630-88e4-46f6-9cb1-9c3637a264fc">
  <img width="376" alt="image" src="https://github.com/zjunlp/MolGen/assets/61076726/4b934314-5f23-4046-a771-60cdfe9b572d">

  **Constrained molecular optimization**

  <img width="350" alt="image" src="https://github.com/zjunlp/MolGen/assets/61076726/bca038cc-637a-41fd-9b53-48ac67c4f182">

## Citation
If you use or extend our work, please cite the paper as follows:
```bibtex
@inproceedings{fang2023domain,
  author    = {Yin Fang and
               Ningyu Zhang and
               Zhuo Chen and
               Xiaohui Fan and
               Huajun Chen},
  title     = {Domain-Agnostic Molecular Generation with Chemical Feedback},
  booktitle = {{ICLR}},
  publisher = {OpenReview.net},
  year      = {2024},
  url       = {https://openreview.net/pdf?id=9rPyHyjfwP}
}
```