
Algebraic Graph-assisted Bidirectional Transformers (AGBT)

Implementation of the paper "Algebraic Graph-assisted Bidirectional Transformers for Molecular Property Prediction" by Dong Chen, Kaifu Gao, Duc Duy Nguyen, Xin Chen, Yi Jiang, Guo-Wei Wei, and Feng Pan.


Figure: the AGBT model framework.


New interface (update: 2023-08)

New feature: if you only want to extract deep-network-based molecular features, the following process does not require compiling the code base.

Python dependencies (higher versions should be fine):

Once the Python dependencies are satisfied, there is no need to compile the entire code base.

# Pre-trained model
wget -P ./examples/models/ https://weilab.math.msu.edu/AGBT_Source/checkpoint_pretrained.pt

# generate the feature
# '--feature_type': 'bos' uses the beginning-of-sequence symbol's embedding; 'avg' uses the average of all symbol embeddings.
python "./agbt_pro/generate_bt_fps_new.py" --model_name_or_path "./examples/models/" --checkpoint_file "checkpoint_pretrained.pt" --smi_file "./examples/data/example_train_canonical.smi" --save_feature_path "./examples/BT_FPs/examples_bt_train_features.npy" --feature_type bos

The new interface is also available here.



Requirements

OS Requirements

Python Dependencies

Installation Guide

Install from Github

git clone https://github.com/ChenDdon/AGBTcode.git
cd AGBTcode/agbt_pro
mkdir -p agbt_pro/fairseq/data
python setup.py build_ext --inplace
mv ./agbt_pro/fairseq/data/* ./fairseq/data/

Installation should take about 60 seconds.
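A quick way to confirm the build succeeded is to import the bundled fairseq package (a minimal check, run from inside AGBTcode/agbt_pro after building):

import fairseq
from fairseq.models.roberta import RobertaModel  # the architecture behind --arch roberta_base

print("fairseq imported:", fairseq.__version__)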

Downloading Pre-trained Models

The pre-trained model is publicly available:

# Pre-trained model
wget -P ./examples/models/ https://weilab.math.msu.edu/AGBT_Source/checkpoint_pretrained.pt
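Fairseq checkpoints are ordinary torch pickles, so a quick integrity check is possible (a sketch; exact key names may vary by fairseq version):

import torch

# Verify the download: the checkpoint should contain at least a 'model' state dict.
ckpt = torch.load("./examples/models/checkpoint_pretrained.pt", map_location="cpu")
print(sorted(ckpt.keys()))  # typically includes 'args' and 'model'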

Pre-training settings

The pre-training dataset used in this work is ChEMBL26, which is available at chembl.gitbook.io/chembl-interface-documentation/downloads.

The ChEMBL26 dataset contains 1,936,342 samples. In this work, we divided it into a training set (1,926,342 samples) and a validation set (10,000 samples).
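To reproduce a split of the same sizes, a minimal sketch follows ("chembl26.smi" is a hypothetical combined file with one SMILES string per line; the paper's exact split and seed are not published):

import random

# Split a SMILES-per-line file into the train/valid sizes quoted above.
with open("chembl26.smi") as f:               # hypothetical combined file
    smiles = [line.strip() for line in f if line.strip()]
random.seed(0)                                 # arbitrary seed, not the paper's
random.shuffle(smiles)
valid, train = smiles[:10000], smiles[10000:]
with open("chembl26_valid.smi", "w") as f:
    f.write("\n".join(valid) + "\n")
with open("chembl26_train.smi", "w") as f:
    f.write("\n".join(train) + "\n")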

# Suppose the file names of the pre-training data are chembl26_train.smi and chembl26_valid.smi
# First pre-processing
python "./agbt_pro/preprocess.py" --only-source --trainpref "chembl26_train.smi" --validpref "chembl26_valid.smi" --destdir "./examples/data/chembl26/" --trainoutf "train" --validoutf "valid"  --workers 20 --file-format smiles

# Pre-training command
python "./agbt_pro/train.py" "./examples/data/chembl26/" --train-subset "train" --valid-subset "valid" --save-dir "./examples/models/" --task masked_lm --arch roberta_base --encoder-attention-heads 8 --encoder-embed-dim 512 --encoder-ffn-embed-dim 1024 --encoder-layers 8 --dropout 0.1 --attention-dropout 0.1 --criterion masked_lm --sample-break-mode complete --tokens-per-sample 256 --skip-invalid-size-inputs-valid-test --optimizer adam --adam-betas '(0.9,0.999)' --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0001 --weight-decay 0.1 --warmup-updates 5000 --total-num-update 1000000 --max-update 1000000 --save-interval 100 --save-interval-updates 100000 --log-format simple --log-interval 2000 --max-sentences 64 --update-freq 2 --ddp-backend no_c10d --fp16 --reset-optimizer --reset-dataloader --reset-meters

# the pre-trained model will be saved as ./examples/models/checkpoint_best.pt
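Conceptually, the masked_lm task hides a fraction of the SMILES symbols and trains the transformer to recover them from bidirectional context. A toy sketch of the masking step (fairseq's real implementation also uses random-replacement and keep-unchanged cases; this is simplified):

import random

def mask_tokens(tokens, mask_symbol="<mask>", p=0.15):
    """Randomly hide a fraction p of symbols; hidden positions become targets."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < p:
            masked.append(mask_symbol)
            targets.append(tok)      # the model must predict this symbol
        else:
            masked.append(tok)
            targets.append(None)     # position not scored
    return masked, targets

print(mask_tokens(list("CC(=O)Oc1ccccc1C(=O)O")))  # aspirin SMILES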

Reproduction instructions

wget https://weilab.math.msu.edu/AGBT_Source/AGBT_FPs.zip
wget https://weilab.math.msu.edu/AGBT_Source/downstream_nn_models.tar.gz
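The archives can be unpacked with the Python standard library (target directory names are our choice; file names match the wget commands above):

import tarfile
import zipfile

# Unpack the reproduction artifacts downloaded above.
zipfile.ZipFile("AGBT_FPs.zip").extractall("AGBT_FPs")
with tarfile.open("downstream_nn_models.tar.gz", "r:gz") as tar:
    tar.extractall("downstream_nn_models")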

Customize task-specific AGBT-FPs

For users who want to build a new task-specific model from a set of molecules with corresponding properties, we provide scripts for generating AG-FPs, BT-FPs, and AGBT-FPs, respectively. By default, we use a supervised learning-based strategy to fine-tune the pre-trained model. The example molecules include MOL2 files and the corresponding SMILES strings. The following steps must be performed on a platform that supports GPU computing.

# Generate Bidirectional Transformer-based Fingerprints (BT-FPs)

# step 1, download the pre-trained model
wget -P ./examples/models/ https://weilab.math.msu.edu/AGBT_Source/checkpoint_pretrained.pt

# step 2, pre-process input data (Binarize the input data to speed up the training.)
mkdir "./examples/data/input0"
python "./agbt_pro/preprocess.py" --only-source --trainpref "./examples/data/example_train_canonical.smi" --validpref "./examples/data/example_valid_canonical.smi" --destdir "./examples/data/input0/" --trainoutf "train" --validoutf "valid"  --workers 20 --file-format smiles --srcdict "./examples/data/input0/dict.txt"

# step 3, fine-tuning the pre-trained model
mkdir "./examples/data/label"
cp "./examples/data/example_train.label" "./examples/data/label/train.label"
cp "./examples/data/example_valid.label" "./examples/data/label/valid.label"
python "./agbt_pro/train.py" "./examples/data/" --save-dir "./examples/models/" --train-subset train --valid-subset valid --restore-file "./examples/models/checkpoint_pretrained.pt" --task sentence_prediction --num-classes 1 --regression-target --init-token 0 --best-checkpoint-metric loss --arch roberta_base --bpe smi --encoder-attention-heads 8 --encoder-embed-dim 512 --encoder-ffn-embed-dim 1024 --encoder-layers 8 --dropout 0.1 --attention-dropout 0.1  --criterion sentence_prediction --max-positions 256 --truncate-sequence --skip-invalid-size-inputs-valid-test --optimizer adam --adam-betas '(0.9,0.999)' --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0001 --warmup-updates 500 --total-num-update 5000 --weight-decay 0.1 --max-update 5000 --log-format simple --reset-optimizer --reset-dataloader --reset-meters --no-epoch-checkpoints --no-last-checkpoints --no-save-optimizer-state --find-unused-parameters --log-interval 50 --max-sentences 64 --update-freq 2 --required-batch-size-multiple 1 --ddp-backend no_c10d --fp16 --max-epoch 5000

# step 4, generate BT-FPs
mkdir "./examples/BT_FPs/"
python "./agbt_pro/generate_bt_fps.py" --model_name_or_path "./examples/models/" --checkpoint_file "checkpoint_best.pt" --data_name_or_path  "./examples/data/" --dict_file "./examples/data/dict.txt" --target_file "./examples/data/example_train_canonical.smi" --save_feature_path "./examples/BT_FPs/examples_bt_train_features.npy"
python "./agbt_pro/generate_bt_fps.py" --model_name_or_path "./examples/models/" --checkpoint_file "checkpoint_best.pt" --data_name_or_path  "./examples/data/" --dict_file "./examples/data/dict.txt" --target_file "./examples/data/example_valid_canonical.smi" --save_feature_path "./examples/BT_FPs/examples_bt_valid_features.npy"
## Generate Algebraic Graph-based Fingerprints (AG-FPs)
mkdir "./examples/AG_FPs/"

# step 1. Laplacian, Lorentz
python "./ag_pro/AG_main.py" --dataset_prefix 'example_train' --dataset_path './examples/data/example_train_x_mol2' --dataset_id_path './examples/data/example_train.id' --save_feature_path_prefix './examples/AG_FPs' --matrix_type 'Lap' --kernal_type 'Lorentz' --kernal_tau 0.5 --kernal_parameter 10.0
python "./ag_pro/AG_main.py" --dataset_prefix 'example_valid' --dataset_path './examples/data/example_valid_x_mol2' --dataset_id_path './examples/data/example_valid.id' --save_feature_path_prefix './examples/AG_FPs' --matrix_type 'Lap' --kernal_type 'Lorentz' --kernal_tau 0.5 --kernal_parameter 10.0
# step 2. Laplacian, Exponential
python "./ag_pro/AG_main.py" --dataset_prefix 'example_train' --dataset_path './examples/data/example_train_x_mol2' --dataset_id_path './examples/data/example_train.id' --save_feature_path_prefix './examples/AG_FPs' --matrix_type 'Lap' --kernal_type 'Exponential' --kernal_tau 0.5 --kernal_parameter 20.0
python "./ag_pro/AG_main.py" --dataset_prefix 'example_valid' --dataset_path './examples/data/example_valid_x_mol2' --dataset_id_path './examples/data/example_valid.id' --save_feature_path_prefix './examples/AG_FPs' --matrix_type 'Lap' --kernal_type 'Exponential' --kernal_tau 0.5 --kernal_parameter 20.0

Note: the "kernal_type", "kernal_tau", and "kernal_parameter" options can be tuned to optimize performance on a specific task.
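For reference, the two kernels named by "kernal_type" are the generalized Lorentz and exponential kernels from the paper. A sketch follows; the flag mapping ("kernal_parameter" as the power kappa, "kernal_tau" as the scale tau, with eta = tau * (r_i + r_j) for van der Waals radii r_i, r_j) is our reading of the flags, not documented behavior:

import numpy as np

def lorentz_kernel(r, eta, kappa):
    """Generalized Lorentz kernel: 1 / (1 + (r/eta)**kappa)."""
    return 1.0 / (1.0 + (r / eta) ** kappa)

def exponential_kernel(r, eta, kappa):
    """Generalized exponential kernel: exp(-(r/eta)**kappa)."""
    return np.exp(-((r / eta) ** kappa))

# Example: weight a 3.5 Angstrom atomic distance.
tau, kappa = 0.5, 10.0
eta = tau * (1.7 + 1.2)   # hypothetical C-H pair of vdW radii, in Angstroms
print(lorentz_kernel(3.5, eta, kappa), exponential_kernel(3.5, eta, kappa))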

## Generate algebraic graph-assisted bidirectional transformer-based Fingerprints (AGBT-FPs)
mkdir "./examples/AGBT-FPs/"
python "./agbt_pro/feature_analysis.py" --train_x_f1 "./examples/AG_FPs/example_train_Lap_Lorentz_10.0_tau_0.5.npy" --train_x_f2 "./examples/AG_FPs/example_train_Lap_Exponential_20.0_tau_0.5.npy" --train_x_f3 "./examples/BT_FPs/examples_bt_train_features.npy" --train_y "./examples/data/example_train_y.npy" --test_x_f1 "./examples/AG_FPs/example_valid_Lap_Lorentz_10.0_tau_0.5.npy" --test_x_f2 "./examples/AG_FPs/example_valid_Lap_Exponential_20.0_tau_0.5.npy" --test_x_f3 "./examples/BT_FPs/examples_bt_valid_features.npy" --test_y "./examples/data/logp_FDA_label.npyexample_valid_y.npy" --features_norm --save_folder_path "./examples/AGBT-FPs/" --n_estimators 10000 --n_workers -1 --max_depth 7 --min_samples_split 3 --random_seed 1234 --n_select_features 512

For the example data, the entire process takes less than 40 minutes.
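Judging from its flags, feature_analysis.py fuses the AG-FPs and BT-FPs and keeps the features that a gradient-boosting model ranks highest. A hedged sketch of that idea with scikit-learn (not the script itself):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Fuse the three fingerprint blocks and keep the top-ranked features,
# mirroring the script's flags (--n_estimators, --max_depth,
# --min_samples_split, --random_seed, --n_select_features).
ag1 = np.load("./examples/AG_FPs/example_train_Lap_Lorentz_10.0_tau_0.5.npy")
ag2 = np.load("./examples/AG_FPs/example_train_Lap_Exponential_20.0_tau_0.5.npy")
bt = np.load("./examples/BT_FPs/examples_bt_train_features.npy")
y = np.load("./examples/data/example_train_y.npy").ravel()

X = np.concatenate([ag1, ag2, bt], axis=1)
gbr = GradientBoostingRegressor(n_estimators=10000, max_depth=7,
                                min_samples_split=3, random_state=1234)
gbr.fit(X, y)
keep = np.argsort(gbr.feature_importances_)[::-1][:512]
agbt_fps = X[:, keep]   # task-specific AGBT-FPs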


Note

(Update: 2021-11) For those interested in pre-trained models (BT-FPs), we provide three recently updated pre-trained models, based on the ChEMBL27 (1.9 million), PubChem (over 0.1 billion), and ZINC (over 0.6 billion) datasets. The source code and models are publicly available at https://github.com/WeilabMSU/PretrainModels

(Update: 2022-06) The data used in this work has been migrated. Users can download the datasets at https://weilab.math.msu.edu/DataLibrary/3D/.

License

All code released in this study is under the MIT License.