Fragment Tree Transformer-based VAE (FRATTVAE)
This repository contains training and generation code for FRATTVAE. FRATTVAE can handle a large number of diverse compounds, ranging from drug-like molecules to natural products. In addition, the latent space constructed by FRATTVAE is useful for molecular generation and optimization. FRATTVAE is described in this paper.
Requirements
- Python==3.10.8
- SciPy==1.11.3
- PyTorch>=1.12.1 (we have only tested 1.12.1 and 2.0.1)
- DGL>=1.1.1
- RDKit==2023.2.1
- molvs (https://github.com/mcs07/MolVS)
- moses (https://github.com/molecularsets/moses)
- guacamol (https://github.com/BenevolentAI/guacamol)
To install these packages, follow the respective instructions.
Data availability
- ZINC250K: https://github.com/kamikaze0923/jtvae/tree/master/data/zinc
- MOSES: https://github.com/molecularsets/moses/tree/master/data
- GuacaMol: https://figshare.com/projects/GuacaMol/56639
- Polymer: https://github.com/wengong-jin/hgraph2graph/tree/master/data/polymers
- NaturalProducts: https://bioinf-applied.charite.de/supernatural_3
You can also download several standardized datasets here (ZINC250K, MOSES, GuacaMol, Polymer, NaturalProducts).
Quick Start
Essential packages can be installed via pip, but the CUDA version is up to you (default: 11.3). Please execute enviroment.sh in your Python virtual environment.
sh enviroment.sh
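If you prefer not to use enviroment.sh, the packages can also be installed manually. The following is a rough sketch only; the pip package names and CUDA 11.3 wheel sources are assumptions, and enviroment.sh in this repository remains the authoritative reference.
# manual install sketch (package names and CUDA wheels are assumptions)
pip install scipy==1.11.3
pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
pip install dgl -f https://data.dgl.ai/wheels/cu113/repo.html
pip install rdkit molvs guacamol
pip install molsets   # the MOSES benchmarking suite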
If you use Docker, you can use the Dockerfile to build your environment.
docker build . --network=host -t <IMAGE NAME>
docker run -itd --runtime=nvidia --shm-size 32g --name <CONTAINER NAME> <IMAGE ID>
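Once the container is running, you can attach to it and execute the scripts described below, for example:
docker exec -it <CONTAINER NAME> /bin/bash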
0. Preparation
0.0. Standardize SMILES
To canonicalize and sanitize SMILES, run exec_standardize.sh once. Your data must be in CSV format and have a column named 'SMILES'. If there is no column called 'test' in your data, it will be split into train/valid/test sets (0: train, 1: test, -1: valid). The standardized data is saved as *_standardized.csv.
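For reference, an input file might look like the following (the molecules shown and the optional 'test' column are illustrative):
SMILES,test
CCO,0
c1ccccc1,0
CC(=O)Oc1ccccc1C(=O)O,1
CC(C)Cc1ccc(cc1)C(C)C(=O)O,-1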
exec_standardize.sh:
#!/bin/bash
export PYTHONPATH="/yourdirectory/scripts:$PYTHONPATH"
export DGLBACKEND="pytorch"
source yourenviroment
data_path="/yourdirectory/data/example.csv"
python utils/standardize_smiles.py $data_path --n_jobs 24 >> prepare.log
- --n_jobs: number of CPU workers.
Please change 'yourdirectory' and 'yourenviroment' to the correct paths.
0.1. Setting Hyperparameters and Directory to save results
exec_prepare.sh:
python preparation.py "/yourdirectory/data/example_standardized.csv" \
--seed 0 \
--maxLength 32 \
--maxDegree 16 \
--minSize 1 \
--epoch 1000 \
--batch_size 2048 \
--lr 0.0001 \
--kl_w 0.0005 \
--l_w 2.0 >> prepare.log
After execution, a results directory (savedir) named dataname_{taskname}_{MMDD} is created in /yourdirectory/results/, and params.yml, which lists the hyperparameters, is created in /savedir/input_data/.
Please adjust the hyperparameters (batch_size, maxLength, and so on) to match your dataset and GPU capacity. For molecules with molecular weights greater than 500, a maxLength of 32 or 64 is recommended.
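For example, if GPU memory is limited, you can either re-run preparation.py with a smaller --batch_size or edit the generated params.yml directly. The following sketch assumes the hyperparameter is stored under a 'batch_size:' key in params.yml:
# halve the batch size in the generated hyperparameter file (key name assumed)
sed -i 's/^batch_size: 2048/batch_size: 1024/' /savedir/input_data/params.yml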
1. Procedure of Training and Generation
Please refer to exec_vae.sh.
1.1. Preprocessing
path="/yourdirectory/results/examples_standardized_struct_{MMDD}"
ymlFile=$path'/input_data/params.yml'
python preprocessing.py ${ymlFile} --n_jobs 24 >> $path'/preprocess.log'
- ymlFile: the path of params.yml.
- --n_jobs: number of CPU workers.
After execution, fragments.csv and dataset.pkl are created in /savedir/input_data/.
1.2. Training
python train.py ${ymlFile} --gpus 0 1 2 3 --n_jobs 24 --load_epoch $load_epoch --valid > $path'/train'$load_epoch'.log'
- --gpus: GPU IDs. If multiple GPUs are given, they are used for DDP training.
- --n_jobs: number of CPU workers.
- --load_epoch: load the model trained for $load_epoch epochs; use this to resume training from any epoch.
- --valid: whether or not to run validation.
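For example, single-GPU training from scratch might look like the following (this assumes --load_epoch can be omitted when not resuming):
python train.py ${ymlFile} --gpus 0 --n_jobs 24 --valid > $path'/train.log'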
After execution, the model checkpoint is saved as model_best.pth in /savedir/model/.
1.3. Reconstruction and Generation
Calculate the reconstruction accuracy and the MOSES+GuacaMol metrics.
python test.py ${ymlFile} --gpu 0 --k 10000 --N 5 --n_jobs 24 > $path'/test.log'
- --gpu: GPU ID. Multiple GPUs are not supported.
- --k: number of molecules generated.
- --N: number of generation iterations.
- --n_jobs: number of CPU workers.
- --gen: set this flag if you only want generation.
After execution, the results of reconstruction and generation are saved in /savedir/test/ and /savedir/generate/, respectively.
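For example, to skip the reconstruction evaluation and only generate molecules (assuming the --gen flag behaves as described above):
python test.py ${ymlFile} --gpu 0 --k 10000 --N 5 --n_jobs 24 --gen > $path'/generate.log'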
Conditional VAE
You can also train FRATTVAE with conditions (logP, QED, SA, ...). The condition values must be included in the data file.
Conditions are selected as arguments of preparation.py (see exec_prepare.sh).
If the condition is a continuous value, enter the condition key and the value '1' (e.g., MW:1).
If the condition is a discrete value, enter the condition key and the number of categories (e.g., key:5 for a condition with five categories).
python preparation.py $data_path \
--seed 0 \
--maxLength 32 \
--maxDegree 16 \
--minSize 1 \
--epoch 1000 \
--batch_size 1024 \
--condition MW:1 \
--condition logP:1 \
--lr 0.0001 \
--kl_w 0.0005 \
--l_w 2.0 >> prepare.log
You can execute conditional training and generation using cvae/exec_cvae.sh.
Pretrained Model
You can generate molecules using pretrained models.
Download the result directories containing trained models here and unzip the downloaded files.
Next, please replace yourdirectory with your own directory and adjust batch_size in input_data/params.yml to match your GPU capacity.
Example: ChEMBL + DrugBank
#!/bin/bash
export PYTHONPATH="/yourdirectory/scripts:$PYTHONPATH"
export DGLBACKEND="pytorch"
source yourenviroment
path="/yourdirectory/results/ChEMBL_DB_standardized_struct"
ymlFile=$path'/input_data/params.yml'
nohup python3 generation.py ${ymlFile} --N 5 --k 10000 --gpu 0 --n_jobs 24 > $path'/generate.log' &
License
This software is released under a custom license.
Academic use of this software is free and does not require any permission. We encourage academic users to cite our research paper (if applicable).
For commercial use, please contact us for permission.
If you have any questions or problems, please let us know. MAIL: inukai10@dna.bio.keio.ac.jp
Citation
@article{inukai2024tree,
title={Leveraging Tree-Transformer VAE with fragment tokenization for high-performance large chemical model generation},
author={Inukai, Tensei and Yamato, Aoi and Akiyama, Manato and Sakakibara, Yasubumi},
year={2024}
}