BBAR: Building Block based AutoRegressive generative model for molecular graph generation.

Official GitHub repository for Molecular generative model via retrosynthetically prepared chemical building block assembly by Seonghwan Seo, Jaechang Lim, and Woo Youn Kim.

An improved version (BBARv2) is available at https://github.com/SeonghwanSeo/BBAR.git.

Table of Contents

Environment

Data

Model Training

Generation

Data Structure

Data Directory Structure

Move to the data/ directory. Initially, the structure of data/ is as follows.

├── data/
    ├── data_preprocess.sh
    ├── preprocessing/
    ├── start_scaffold/
    ├── ZINC/
    │   ├── smiles/
    │   ├── all.txt 		(source data)
    │   ├── get_metadata.py
    │   ├── library.csv
    │   ├── library_map.csv
    │   ├── train.txt 	(https://github.com/wengong-jin/icml18-jtnn/tree/master/data/zinc/train.txt)
    │   ├── valid.txt 	(https://github.com/wengong-jin/icml18-jtnn/tree/master/data/zinc/valid.txt)
    │   └── test.txt 		(https://github.com/wengong-jin/icml18-jtnn/tree/master/data/zinc/test.txt)
    ├── generalization/
    └── 7l13_docking/ 	(Smina docking results; ligands: ZINC, receptor: 7L13)
        ├── smiles/
        ├── library.csv	(Same as ZINC/library.csv)
        ├── library_map.csv
        └── property.db	(metadata)

Prepare your own dataset

To use the provided preprocessing scripts, format your dataset as follows.

MolID,SMILES,Property1,Property2,...
id1,c1ccccc1,10.25,32.21,...
id2,C1CCCC1,35.1,251.2,...
...

After constructing your dataset in this format, create a directory under data/ and save the dataset there as property.db.
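
For illustration, here is a minimal Python sketch that builds such a property.db with RDKit. The property columns logp and tpsa mirror the training example later in this README; the directory name my_dataset and the input molecule list are hypothetical.

import csv
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles_list = ['c1ccccc1', 'C1CCCC1']  # replace with your own molecules

with open('data/my_dataset/property.db', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['MolID', 'SMILES', 'logp', 'tpsa'])
    for i, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparsable SMILES
        writer.writerow([f'id{i}', smi, Descriptors.MolLogP(mol), Descriptors.TPSA(mol)])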

Preprocessing

Preprocessing (ZINC)

First, create the metadata. Go to data/ZINC and run python get_metadata.py. For the 7L13 docking dataset, the metadata is already provided in this repository.

cd data/ZINC
python get_metadata.py
# property.db (metadata) is created from all.txt

Then run the data_preprocess.sh script:

cd ../
./data_preprocess.sh ./ZINC/ <cpus>
# Create train.csv, val.csv, test.csv from ./ZINC/smiles/ and ./ZINC/library.csv

# For 7l13_docking
./data_preprocess.sh ./7l13_docking/ <cpus>
# Create train.csv, val.csv, test.csv from ./7l13_docking/smiles/ and ./7l13_docking/library.csv

After the preprocessing step, the structure of data/ is as follows. The training/sampling code depends on this directory structure, so renaming the files is not recommended.

├── data/
    ├── ZINC/
    │   ├── ...
    │   ├── property.db
    │   ├── train.csv
    │   ├── train_weight.npy
    │   ├── val.csv
    │   └── test.csv
    ├── ...
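
As a quick sanity check, train_weight.npy can be inspected with NumPy. This sketch assumes it holds one sampling weight per row of train.csv, which is an assumption, not something stated by this README:

import numpy as np

# Assumption: one weight per training example in train.csv.
weights = np.load('data/ZINC/train_weight.npy')
print(weights.shape, weights.min(), weights.max())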

Preprocessing (Own Data)

To use your own data, follow the procedure below. First, put property.db in data/<NEW-DIR>.

Two scripts are provided: one for splitting the dataset (the source file is property.db) and one for building the library. Run them, then run data_preprocess.sh; the resulting directory structure is shown below, followed by an illustrative sketch of the split step.

python preprocessing/split_data.py <NEW-DIR> --train_ratio <train-ratio> --val_ratio <val-ratio>
python preprocessing/get_library.py <NEW-DIR> --cpus <cpus>
./data_preprocess.sh <NEW-DIR> <cpus>
├── data/
    ├── <NEW-DIR>/
    │   ├── property.db (Source File)
    │   ├── smiles/
    │   │   ├── train_smiles.csv
    │   │   ├── val_smiles.csv
    │   │   └── test_smiles.csv
    │   ├── library.csv
    │   ├── library_map.csv
    │   ├── train.csv
    │   ├── train_weight.npy
    │   ├── val.csv
    │   └── test.csv
    ├── ...
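
For reference, an illustrative Python sketch of the split step. This is not the actual preprocessing/split_data.py implementation; the ratios, the directory name my_dataset, and the one-(MolID, SMILES)-pair-per-row layout of the *_smiles.csv files are assumptions.

import csv
import os
import random

train_ratio, val_ratio = 0.8, 0.1  # assumed values for --train_ratio / --val_ratio

with open('data/my_dataset/property.db') as f:  # hypothetical <NEW-DIR>
    rows = list(csv.DictReader(f))

random.seed(0)
random.shuffle(rows)
n_train = int(train_ratio * len(rows))
n_val = int(val_ratio * len(rows))
splits = {
    'train': rows[:n_train],
    'val': rows[n_train:n_train + n_val],
    'test': rows[n_train + n_val:],
}

os.makedirs('data/my_dataset/smiles', exist_ok=True)
for name, split in splits.items():
    # Assumption: one (MolID, SMILES) pair per row; the real layout is
    # defined by preprocessing/split_data.py.
    with open(f'data/my_dataset/smiles/{name}_smiles.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        for row in split:
            writer.writerow([row['MolID'], row['SMILES']])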

Model Training

python train.py -h

Move to the root directory. The training script reads config files from ./config/; you can control training by modifying or creating config files.

Training Script Format Example

python train.py \
    name <exp-name> \
    exp_dir <exp-dir> \
    property <property1> <property2> ... \
    trainer_config <train-config-path> \
    model_config <model-config-path> \
    data_config <data-config-path>

Example script:

python train.py \
    name 'logp-tpsa' \
    exp_dir 'result/ZINC' \
    property logp tpsa \
    data_config './config/data/zinc.yaml'

Yaml File Example

data_dir: ./data/ZINC
property_path: ${data_dir}/property.db
library_path: ${data_dir}/library.csv
train_data_path: ${data_dir}/train.csv
train_weight_path: ${data_dir}/train_weight.npy
val_data_path: ${data_dir}/val.csv
train_max_atoms: 40
val_max_atoms: 40
# Training Environment
gpus: 1
num_workers: 4

# Hyperparameter for Model Training
lr: 0.001
train_batch_size: 128
val_batch_size: 256

# Hyperparameter for Negative Sampling
num_negative_samples: 10
alpha: 0.75

# unit: step (batch)
max_step: 500000
log_interval: 5000
val_interval: 10000
save_interval: 10000
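
The ${data_dir} entries use OmegaConf-style interpolation. Assuming the configs load with OmegaConf (an assumption, not confirmed by this README), they can be inspected as follows:

from omegaconf import OmegaConf

cfg = OmegaConf.load('./config/data/zinc.yaml')
print(OmegaConf.to_yaml(cfg, resolve=True))  # resolves ${data_dir} references
print(cfg.train_data_path)                   # -> ./data/ZINC/train.csv after resolution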

Generation

python sample.py -h

Example scripts:

# Non-scaffold-based generation.
python sample.py \
    --generator_config './config/generator/logp_tpsa.yaml' \
    --o './result_sample/logp\=4-tpsa\=60.smi' \
    --num_samples 100 \
    --logp 4 --tpsa 60 	# generator-config-specific parameters (not listed by python sample.py -h)

# Scaffold-based generation. (Single Scaffold)
python sample.py \
    --generator_config './config/generator/no_condition.yaml' \
    --scaffold 'Cc1ccccc1' \
    --o './result_sample/no_condition.smi' \
    --num_samples 100
    
# Scaffold-based generation. (Multi Scaffold)
python sample.py \
    --generator_config './config/generator/logp.yaml' \
    --scaffold './data/start_scaffold/start100.smi' \
    --o './result_sample/logp\=6.smi' \
    --num_samples 100 \
    --logp 6
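
After generation, the output .smi files can be sanity-checked with RDKit. A minimal sketch, assuming one SMILES per line (the path matches the first example above):

from rdkit import Chem

with open('./result_sample/logp=4-tpsa=60.smi') as f:
    smiles = [line.split()[0] for line in f if line.strip()]

valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print(f'{len(valid)}/{len(smiles)} valid SMILES')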

Yaml File Example

model_path: './result/ZINC/logp-tpsa/checkpoint/best.tar'
library_path: './data/ZINC/library.csv'

# Below is the path of the library built-in model file (model parameters plus SMILES and latent vectors for the fragments in the library).
# During generation, the model vectorizes the fragments in the library.
# You can skip this step by saving both together: the model parameters and the library information.
# This is called a `library built-in model`.
# If the path below is not `null`, the generator saves or loads the library built-in model.
# If the built-in model file already exists, the two parameters above (`model_path`, `library_path`) are not needed.
library_builtin_model_path: './builtin_model/zinc_logp-tpsa' # (optional)

# Required
n_library_sample: 2000
alpha: 0.75
max_iteration: 10
idx_masking: True
compose_force: False