Text-to-graph molecule generation
A PyTorch implementation of MoMu- and MoFlow-based zero-shot text-to-graph molecule generation, as described in "Natural Language-informed Understanding of Molecule Graphs".
License & disclaimer
This code may be used for research purposes only; the package is strictly for non-commercial academic use.
Acknowledgments
We adapted the PyTorch implementation of MoFlow, which is publicly available at https://github.com/calvin-zcx/moflow. Please also check the license and usage terms there if you want to make use of this code.
Install
- Operating system: tested on Linux 4.18.0-80.7.1.el8_0.x86_64 with a single NVIDIA Titan RTX GPU (CUDA 11.2), and also on Linux 4.15.0-189-generic with a single NVIDIA TITAN V GPU (CUDA 10.1.243).
- Please refer to https://github.com/calvin-zcx/moflow for the requirements. We use the same packages as follows:
conda create --name TGgeneration python pandas matplotlib (conda 4.6.7, python 3.8.5, pandas 1.1.2, matplotlib 3.3.2)
conda activate TGgeneration
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch (pytorch 1.6.0, torchvision 0.7.0)
conda install rdkit (rdkit 2020.03.6)
conda install orderedset (orderedset 2.0.3)
conda install tabulate (tabulate 0.8.7)
conda install networkx (networkx 2.5)
conda install scipy (scipy 1.5.0)
conda install seaborn (seaborn 0.11.0)
pip install cairosvg (cairosvg 2.4.2)
pip install tqdm (tqdm 4.50.0)
- Our implementation additionally requires the following packages (torch-geometric, transformers, spacy):
pip install torch_scatter-2.0.6-cp38-cp38-linux_x86_64.whl
pip install torch_sparse-0.6.9-cp38-cp38-linux_x86_64.whl
pip install torch_cluster-1.5.9-cp38-cp38-linux_x86_64.whl
pip install torch_spline_conv-1.2.1-cp38-cp38-linux_x86_64.whl
pip install torch-geometric
pip install transformers
pip install spacy
(The .whl files can be downloaded from https://pytorch-geometric.com/whl/torch-1.6.0%2Bcu101.html. For other CUDA versions, please select the matching wheels from https://pytorch-geometric.com/whl/.)
Installing all the packages takes about half an hour.
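After installation, a quick sanity check can confirm that the key packages resolve in the current environment. The helper below is a hypothetical snippet, not part of the repository; it only reports which packages are importable and does not verify versions.

```python
import importlib.util

def check_deps(names):
    """Map each package name to True if it can be found in this environment."""
    return {n: importlib.util.find_spec(n) is not None for n in names}

# Package names mirror the install steps above.
status = check_deps(["torch", "torch_geometric", "transformers", "spacy", "rdkit"])
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Any package reported MISSING should be reinstalled following the steps above before proceeding.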
Prepare pre-trained models
Download the MoFlow model trained on the zinc250k dataset from
https://drive.google.com/drive/folders/1runxQnF3K_VzzJeWQZUH8VRazAGjZFNF
Put the folder "zinc250k_512t2cnn_256gnn_512-64lin_10flow_19fold_convlu2_38af-1-1mask" in the folder ./MoleculeGeneration/results.
Download the pre-trained graph and text encoders of MoMu
Put the pretrained files "littlegin=graphclinit_bert=scibert_epoch=299-step=18300.ckpt" (for MoMu-S) and "littlegin=graphclinit_bert=kvplm_epoch=299-step=18300.ckpt" (for MoMu-K) in the folder ./MoleculeGeneration. (Download from https://pan.baidu.com/s/1jvMP_ysQGTMd_2sTLUD45A, password: 1234.) Pretrained model when BERT is initialized with the KV-PLM checkpoint:
checkpoints/littlegin=graphclinit_bert=kvplm_epoch=299-step=18300.ckpt
Pretrained model when BERT is initialized with the SciBERT checkpoint:
checkpoints/littlegin=graphclinit_bert=scibert_epoch=299-step=18300.ckpt
Download the pre-trained BERT model
Download the folder "bert_pretrained" from https://huggingface.co/allenai/scibert_scivocab_uncased and put it in the folder ./MoleculeGeneration.
Testing & Usage
Generating molecules with the query texts used in the paper:
Default: MoMu-S. To use MoMu-K instead, uncomment line 683 and comment out line 682 in Graph_generate.py.
cd MoleculeGeneration
python Graph_generate.py --model_dir results/zinc250k_512t2cnn_256gnn_512-64lin_10flow_19fold_convlu2_38af-1-1mask --snapshot model_snapshot_epoch_200 --gpu 0 --data_name zinc250k --hyperparams-path moflow-params.json --temperature 0.85 --batch-size 1 --n_experiments 5 --save_fig true --correct_validity true
Generating molecules with custom query texts
Put the custom text descriptions in the list at lines 816-825 of Graph_generate.py.
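For illustration, the edited list might look like the sketch below. The variable name and the example texts are assumptions for this sketch, not the actual contents of Graph_generate.py; substitute your own descriptions.

```python
# Hypothetical stand-in for the description list edited in Graph_generate.py
# (lines 816-825); the variable name and texts are illustrative only.
text_descriptions = [
    "The molecule is soluble in water and contains a benzene ring.",
    "This molecule has low toxicity and high drug-likeness.",
]

# Each entry should be a non-empty natural-language string.
assert all(isinstance(t, str) and t.strip() for t in text_descriptions)
```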
Results
The 60 generated molecule graphs (the number of generated molecules can be specified at lines 834-835 of Graph_generate.py) for the {id}-th text description are saved in the subfolder "generated/sci/text_{id}/" of the folder "MoleculeGeneration". The corresponding SMILES strings and the negative similarities between the text and each molecule graph are also output. For example, for the 0-th input text description, the output has the following form:
0
['O[IH]CI(O)CC[IH]OI=CF', ... , 'CC(CCCOO)O[IH]OI=[IH](C)CC[IH]O']
[-2.299729347229004, ... , -2.235506772994995]
It takes about half an hour to generate 60 molecule graphs given an input text description.
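To pick out the best candidates from such an output, the SMILES strings can be paired with their scores and sorted. This is a hedged post-processing sketch, assuming that a lower (more negative) negative-similarity value indicates higher text-molecule similarity; the data below is abbreviated from the example output above.

```python
# Abbreviated example output: generated SMILES and their negative
# text-graph similarities, in generation order.
smiles = ["O[IH]CI(O)CC[IH]OI=CF", "CC(CCCOO)O[IH]OI=[IH](C)CC[IH]O"]
neg_sim = [-2.299729347229004, -2.235506772994995]

# Sort ascending: assuming more negative = more similar, the best match
# for the query text comes first.
ranked = sorted(zip(smiles, neg_sim), key=lambda pair: pair[1])
best_smiles, best_score = ranked[0]
print(best_smiles, best_score)
```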
Citation
Please cite the following paper if you use this code:
@article{su2022natural,
title={Natural Language-informed Understanding of Molecule Graphs},
author={Su, Bing and Du, Dazhao and Yang, Zhao and Zhou, Yujie and Li, Jiangmeng and Rao, Anyi and Sun, Hao and Lu, Zhiwu and Wen, Ji-Rong},
year={2022}
}