Molecular generative model via retrosynthetically prepared chemical building block assembly

Advanced Science [Paper] [arXiv]

Official GitHub repository for Molecular generative model via retrosynthetically prepared chemical building block assembly by Seonghwan Seo*, Jaechang Lim, Woo Youn Kim. (Advanced Science)

This repository is an improved version (BBARv2) of jaechang-hits/BBAR-pytorch, which contains the code and model weights to reproduce the results in the paper. You can find the updated architectures at architecture/.

If you have any problems or need help with the code, please open an issue or contact shwan0106@kaist.ac.kr.

<img src="images/TOC.png" width=600>

Citation

@article{seo2023bbar,
  title = {Molecular Generative Model via Retrosynthetically Prepared Chemical Building Block Assembly},
  author = {Seo, Seonghwan and Lim, Jaechang and Kim, Woo Youn},
  journal = {Advanced Science},
  volume = {10},
  number = {8},
  pages = {2206674},
  doi = {10.1002/advs.202206674},
  url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/advs.202206674},
}

Installation

Install the project with pip, passing a --find-links argument so that the torch-geometric packages matching your PyTorch build can be found.

pip install -e . --find-links https://data.pyg.org/whl/torch-2.3.1+cu121.html # CUDA
pip install -e . --find-links https://data.pyg.org/whl/torch-2.3.1+cpu.html # CPU-only
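The two commands above differ only in the platform tag at the end of the wheel index URL. As a minimal sketch, assuming the URL pattern seen in those two commands holds for other PyTorch versions and platform tags, the index URL could be built like this (the function name `pyg_wheel_index` is hypothetical and not part of this repository):

```python
def pyg_wheel_index(torch_version: str, compute_platform: str) -> str:
    """Build a wheel index URL following the pattern used in the install commands.

    torch_version: e.g. "2.3.1" (a local suffix such as "+cu121" is dropped).
    compute_platform: e.g. "cu121" or "cpu".
    """
    # torch.__version__ may look like "2.3.1+cu121"; keep only the public version.
    base_version = torch_version.split("+", 1)[0]
    return f"https://data.pyg.org/whl/torch-{base_version}+{compute_platform}.html"


if __name__ == "__main__":
    print(pyg_wheel_index("2.3.1", "cu121"))
    print(pyg_wheel_index("2.3.1+cpu", "cpu"))
```

Pass the printed URL to `--find-links`. Whether a wheel index actually exists for a given version and platform combination is not checked here.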

Data

Dataset Structure

Initially, the data/ directory has the following structure. Unzip the archives you need with tar -xzvf.

├── data/
    ├── ZINC.tar.gz         (Constructed from https://github.com/wengong-jin/icml18-jtnn)
    ├── 3CL_ZINC.tar.gz     (Smina calculation result. (ligands: ZINC15, receptor: 7L13))
    └── LIT-PCBA.tar.gz     (ZINC20 UniDock calculation result against 15 LIT-PCBA targets)
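If you prefer to unpack the archives from Python rather than with tar, the standard-library `tarfile` module does the same job. This is a minimal sketch; the helper name `extract_archive` is hypothetical, and the assumption that each archive unpacks into its own subdirectory of data/ has not been verified against the archives themselves.

```python
import tarfile
from pathlib import Path
from typing import List


def extract_archive(archive: Path, dest: Path) -> List[str]:
    """Extract a .tar.gz archive into dest and return the names of its members."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


if __name__ == "__main__":
    archive = Path("data/ZINC.tar.gz")
    # Only run when the archive is actually present (i.e. from the repository root).
    if archive.exists():
        members = extract_archive(archive, Path("data"))
        print(f"extracted {len(members)} entries")
```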

Prepare Your Own Dataset

For your own dataset, you need to prepare data.csv as follows.

Preprocess

You need to preprocess the dataset. Go to the root directory and run ./script/preprocess.py.

cd <ROOT-DIR>
python ./script/preprocess.py \
  --data_dir ./data/<DATA-DIR> \
  --cpus <N-CPUS> \
  --split_ratio 0.9  # train:val split ratio.

After the preprocessing step, the data/ directory has the following structure.

├── data/
    ├── <DATA-DIR>/
        ├── data.csv
        ├── valid_data.csv  new!
        ├── data.pkl        new!
        ├── library.csv     new!
        └── split.csv       new!
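A quick way to confirm that preprocessing finished is to check for the files listed in the tree above. The sketch below only tests that the files exist; it does not validate their contents, and the helper name `missing_preprocessed_files` is hypothetical.

```python
from pathlib import Path
from typing import List

# File names taken from the directory tree shown after preprocessing.
EXPECTED_FILES = ("data.csv", "valid_data.csv", "data.pkl", "library.csv", "split.csv")


def missing_preprocessed_files(data_dir: Path) -> List[str]:
    """Return the expected file names that are not present in data_dir."""
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_preprocessed_files(Path("data/ZINC"))
    print("missing:", ", ".join(missing) if missing else "none")
```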

Model Training

Model training takes less than <u>12 hours</u> for 200k steps on 1 GPU (RTX 2080) and 4 CPUs (Intel Xeon Gold 6234).
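To turn that figure into a rough planning number: 200k steps in under 12 hours implies a throughput of at least about 4.6 steps per second on the stated hardware. The sketch below works through the arithmetic and, assuming the time scales linearly with the step count (an assumption, not a measured result), estimates an upper bound for a different number of steps.

```python
def min_steps_per_second(steps: int, hours: float) -> float:
    """Lower bound on throughput if `steps` finish within `hours`."""
    return steps / (hours * 3600)


def hours_for_steps(steps: int, steps_per_second: float) -> float:
    """Estimated wall-clock hours for `steps`, assuming constant throughput."""
    return steps / steps_per_second / 3600


if __name__ == "__main__":
    rate = min_steps_per_second(200_000, 12)
    print(f"{rate:.2f} steps/s")  # 4.63 steps/s
    # Estimate for a 100k-step run under the linear-scaling assumption.
    print(f"{hours_for_steps(100_000, rate):.1f} hours")  # 6.0 hours
```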

Training

cd <ROOT-DIR>
python ./script/train.py -h

Training Script Format Example

Our training script reads the model config file ./config/model.yaml. You can change the model size by modifying it or by creating a new config file. Run the script with the -h flag to see the other arguments.

# Defaults: --exp_dir ./result/, --max_step 100000 (the paper used 200k),
#           --data_dir ./data/ZINC/, --model_config ./config/model.yaml
python ./script/train.py \
    --name <exp-name> \
    --exp_dir <exp-dir-name> \
    --property <property1> <property2> ... \
    --max_step 100000 \
    --data_dir <DATA-DIR> \
    --model_config <model-config-path>

Example running script

python ./script/train.py \
    --name 'logp-tpsa' \
    --exp_dir ./result/ZINC/ \
    --data_dir ./data/ZINC/ \
    --property logp tpsa

python ./script/train.py \
    --name '3cl_affinity' \
    --exp_dir ./result/3cl_affinity/ \
    --data_dir ./data/3CL_ZINC/ \
    --property affinity

python ./script/train.py \
    --name 'litpcba-ADRB2' \
    --exp_dir ./result/LIT-PCBA/ \
    --data_dir ./data/LIT-PCBA/ \
    --property ADRB2 QED

Generation

The model generates 20 to 30 molecules per second on 1 CPU (Intel Xeon E5-2667 v4).

Download Pretrained Models.

# Download Weights of pretrained models. (mw, logp, tpsa, qed, 3cl-affinity)
# Path: ./test/pretrained_model/
cd <ROOT-DIR>
sh ./download-weights.sh

Generation

cd <ROOT-DIR>
python ./script/sample_denovo.py -h
python ./script/sample_scaffold.py -h

Example running script.

# Output directory path
mkdir -p ./result_sample

# De novo generation.
python ./script/sample_denovo.py \
    -g ./test/generation_config/logp.yaml \
    -n 100 \
    --logp 6 \
    -o ./result_sample/logp-6-denovo.smi \
    --seed 0

# Scaffold-based generation. => use `-s` or `--scaffold`
python ./script/sample_scaffold.py \
    -g ./test/generation_config/logp.yaml \
    -s "c1ccccc1" \
    -n 100 \
    --logp 2 \
    -o ./result_sample/logp-2-scaffold.smi

# Scaffold-based generation. (From File) => use `-S` or `--scaffold_path`
python ./script/sample_scaffold.py \
    --generator_config ./test/generation_config/mw.yaml \
    --scaffold_path ./test/start_scaffolds.smi \
    --num_samples 100 \
    --mw 300 \
    -o ./result_sample/mw-300-scaffold.smi \
    --seed 0 -q
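Once sampling has finished, the output files can be inspected with a few lines of Python. This is a minimal sketch that assumes each non-empty line of the .smi file starts with a SMILES string, which is the common convention but has not been checked against the scripts' actual output. The helper names `read_smiles` and `uniqueness` are hypothetical. Uniqueness is computed on the raw strings, so two different SMILES spellings of the same molecule count as distinct.

```python
from pathlib import Path
from typing import List


def read_smiles(path: Path) -> List[str]:
    """Read the first whitespace-separated token of every non-empty line."""
    smiles = []
    for line in path.read_text().splitlines():
        fields = line.split()
        if fields:
            smiles.append(fields[0])
    return smiles


def uniqueness(smiles: List[str]) -> float:
    """Fraction of distinct strings among the samples (0.0 for an empty list)."""
    return len(set(smiles)) / len(smiles) if smiles else 0.0


if __name__ == "__main__":
    output = Path("result_sample/logp-6-denovo.smi")
    # Only run when the sample file from the example above is present.
    if output.exists():
        samples = read_smiles(output)
        print(f"{len(samples)} molecules, uniqueness {uniqueness(samples):.2f}")
```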

Generator config (Yaml)