3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction

License: MIT

This repository is the official implementation of 3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction (ICLR 2023). [PDF]

<p align="center"> <img src="assets/overview.png" /> </p>

Installation

Dependency

The code has been tested in the following environment:

| Package | Version |
| --- | --- |
| Python | 3.8 |
| PyTorch | 1.13.1 |
| CUDA | 11.6 |
| PyTorch Geometric | 2.2.0 |
| RDKit | 2022.03.2 |

Install via Conda and Pip

conda create -n targetdiff python=3.8
conda activate targetdiff
conda install pytorch pytorch-cuda=11.6 -c pytorch -c nvidia
conda install pyg -c pyg
conda install rdkit openbabel tensorboard pyyaml easydict python-lmdb -c conda-forge

# For Vina Docking
pip install meeko==0.1.dev3 scipy pdb2pqr vina==1.2.2 
python -m pip install git+https://github.com/Valdes-Tresanco-MS/AutoDockTools_py3

The code should also work with PyTorch >= 1.9.0 and PyG >= 2.0, so you can change the package versions to suit your setup.
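To check an existing environment against the minimum versions stated above, the comparison can be sketched as follows. This is a minimal sketch and not part of the repository: `parse_version` and `meets_minimum` are hypothetical helper names, and the installed version strings are illustrative (in practice they could be read from `torch.__version__` and `torch_geometric.__version__`).

```python
import re


def parse_version(version):
    """Turn '1.13.1+cu116' into (1, 13, 1); local build suffixes are ignored."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`, compared numerically."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad to equal length so that (2, 0) compares equal to (2, 0, 0).
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Minimums taken from the compatibility note above.
MINIMUMS = {"torch": "1.9.0", "torch_geometric": "2.0"}

if __name__ == "__main__":
    # Illustrative values only; replace with the versions actually installed.
    installed = {"torch": "1.13.1+cu116", "torch_geometric": "2.2.0"}
    for name, minimum in MINIMUMS.items():
        status = "ok" if meets_minimum(installed[name], minimum) else "too old"
        print(f"{name} {installed[name]} (needs >= {minimum}): {status}")
```

Comparing parsed integer tuples instead of raw strings avoids the usual pitfall where "1.13.1" sorts before "1.9.0" lexicographically.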

(Alternatively) Install via Mamba

Install Mamba

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh  # accept all terms and install to the default location
rm Mambaforge-$(uname)-$(uname -m).sh  # (optionally) remove installer after using it
source ~/.bashrc  # alternatively, one can restart their shell session to achieve the same result

Create Mamba environment

mamba env create -f environment.yaml
conda activate targetdiff  # note: one still needs to use `conda` to (de)activate environments

Target-Aware Molecule Generation

Data

The data used for training / evaluating the model are organized in the data Google Drive folder.

To train the model from scratch, you need to download the preprocessed lmdb file and the split file from that folder.

To evaluate the model on the test set, you need to download and unzip the test_set.zip. It includes the original PDB files that will be used in Vina Docking.

To process the dataset from scratch, download CrossDocked2020 v1.1, save it into data/CrossDocked2020, and run the scripts in scripts/data_preparation.

Training

Training from scratch

python scripts/train_diffusion.py configs/training.yml

Trained model checkpoint

https://drive.google.com/drive/folders/1-ftaIrTXjWFhw3-0Twkrs5m0yX6CNarz?usp=share_link

Sampling

Sampling for pockets in the test set

python scripts/sample_diffusion.py configs/sampling.yml --data_id {i}  # Replace {i} with the index of the data; i should be between 0 and 99 for the test set.

You can also speed up sampling with multiple GPUs, e.g.:

CUDA_VISIBLE_DEVICES=0 bash scripts/batch_sample_diffusion.sh configs/sampling.yml outputs 4 0 0
CUDA_VISIBLE_DEVICES=1 bash scripts/batch_sample_diffusion.sh configs/sampling.yml outputs 4 1 0
CUDA_VISIBLE_DEVICES=2 bash scripts/batch_sample_diffusion.sh configs/sampling.yml outputs 4 2 0
CUDA_VISIBLE_DEVICES=3 bash scripts/batch_sample_diffusion.sh configs/sampling.yml outputs 4 3 0
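The four commands above differ only in the second-to-last argument. One plausible reading, which is an assumption here and should be checked against scripts/batch_sample_diffusion.sh, is that the last three arguments are the total number of workers, the index of this worker, and the first data index to process, with pocket `i` assigned to worker `i % num_workers`. Under that assumption, the split of the 100 test pockets can be sketched as follows (`pockets_for_worker` is a hypothetical helper, not a function from the repository):

```python
def pockets_for_worker(num_workers, worker_index, start_index=0, total=100):
    """Data ids handled by one worker under a round-robin (modulo) assignment.

    This mirrors an assumed scheme; it is not confirmed from the shell script.
    """
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    return [i for i in range(start_index, total) if i % num_workers == worker_index]


if __name__ == "__main__":
    for worker in range(4):
        ids = pockets_for_worker(4, worker)
        print(f"worker {worker}: {len(ids)} pockets, first few {ids[:3]}")
```

If a run is interrupted, raising the start index would, under the same assumption, skip the pockets that were already sampled.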

Sampling from pdb file

To sample from a protein pocket (a 10 Å region around the reference ligand):

python scripts/sample_for_pocket.py configs/sampling.yml --pdb_path examples/1h36_A_rec_1h36_r88_lig_tt_docked_0_pocket10.pdb
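For intuition, a pocket of this kind can be selected by keeping every protein residue that has at least one atom within the cutoff distance of any ligand atom. The sketch below uses plain Python tuples rather than a PDB parser. `select_pocket_residues` and the toy coordinates are hypothetical, and the repository's own pocket extraction may use a different criterion (for example, residue centers instead of individual atoms).

```python
import math


def select_pocket_residues(protein_atoms, ligand_coords, cutoff=10.0):
    """Return sorted residue ids with any atom within `cutoff` angstroms of the ligand.

    protein_atoms: iterable of (residue_id, (x, y, z))
    ligand_coords: iterable of (x, y, z)
    """
    ligand_coords = list(ligand_coords)
    selected = set()
    for residue_id, coord in protein_atoms:
        if residue_id in selected:
            continue  # residue already kept, no need to test its other atoms
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_coords):
            selected.add(residue_id)
    return sorted(selected)


if __name__ == "__main__":
    # Toy data: residues 1 and 2 are near the ligand, residue 3 is far away.
    protein = [
        (1, (0.0, 0.0, 0.0)),
        (1, (1.5, 0.0, 0.0)),
        (2, (9.0, 0.0, 0.0)),
        (3, (25.0, 0.0, 0.0)),
    ]
    ligand = [(2.0, 0.0, 0.0), (3.0, 1.0, 0.0)]
    print(select_pocket_residues(protein, ligand))
```

Selecting whole residues rather than individual atoms keeps each retained residue chemically complete, at the cost of a slightly irregular pocket boundary.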

Evaluation

Evaluation from sampling results

python scripts/evaluate_diffusion.py {OUTPUT_DIR} --docking_mode vina_score --protein_root data/test_set

The docking mode can be chosen from {qvina, vina_score, vina_dock, none}.

Note: Preparing the pdbqt and pqr files takes some time the first time you run the evaluation code with the vina_score or vina_dock docking mode.

Evaluation from meta files

We provide the sampling results (already docked) of our model and of the CVAE, AR, and Pocket2Mol baselines here.

| Metafile Name | Original Paper |
| --- | --- |
| crossdocked_test_vina_docked.pt | - |
| cvae_vina_docked.pt | liGAN |
| ar_vina_docked.pt | AR |
| pocket2mol_vina_docked.pt | Pocket2Mol |
| targetdiff_vina_docked.pt | TargetDiff |

You can directly evaluate from the meta file, e.g.:

python scripts/evaluate_from_meta.py sampling_results/targetdiff_vina_docked.pt --result_path eval_targetdiff

The results reported in the paper can be reproduced quickly with notebooks/summary.ipynb.


Binding Affinity Prediction

Data

Taking PDBBind v2016 as an example, first unzip the data:

mkdir -p data/pdbbind_v2016 && tar -xzvf data/pdbbind_v2016_refined.tar.gz -C data/pdbbind_v2016

Then, you can extract 10 Å pockets and split the dataset using the following commands:

# extract pockets
python scripts/property_prediction/extract_pockets.py --source data/pdbbind_v2016 --subset refined --refined_index_pkl data/pdbbind_v2016/pocket_10_refined/index.pkl

# split dataset
python scripts/property_prediction/pdbbind_split.py --index_path data/pdbbind_v2016/pocket_10_refined/index.pkl  --save_path data/pdbbind_v2016/pocket_10_refined/split.pt

Training

One can train the binding affinity prediction model with:

python scripts/property_prediction/train_prop.py configs/prop/pdbbind_general_egnn.yml

It is also possible to enhance the model with extra features extracted from the unsupervised generative model. You need to first export the hidden states with:

python scripts/likelihood_est_diffusion_pdbbind.py

This command dumps various meta information. You then need to specify which feature to use in the training config (such as configs/prop/pdbbind_general_egnn.yml) of the downstream supervised prediction model.

Trained model checkpoint

NOTE: For the supervised learning setting, the training results on PDBBind v2020 were accidentally lost, so for now we can only provide the model checkpoint trained on PDBBind v2016 in the preliminary experiments. It can nevertheless already make accurate predictions for practical use. We will retrain the models on PDBBind v2020 and provide the trained checkpoints soon.

https://drive.google.com/drive/folders/1-ftaIrTXjWFhw3-0Twkrs5m0yX6CNarz?usp=share_link

Evaluation

python scripts/property_prediction/eval_prop.py --ckpt_path pretrained_models/egnn_pdbbind_v2016.pt

Expected results:

| RMSE | MAE | R^2 | Pearson | Spearman |
| --- | --- | --- | --- | --- |
| 1.316 | 1.031 | 0.633 | 0.797 | 0.782 |

Inference

To predict the binding affinity of a complex, one needs to prepare the PDB file and the SDF/MOL2 file first (Important: for the supervised learning model trained on PDBBind v2016, both the protein and the ligand need to have hydrogen atoms). The binding affinity can then be predicted with scripts/property_prediction/inference.py. For example:

python scripts/property_prediction/inference.py \
  --ckpt_path pretrained_models/egnn_pdbbind_v2016.pt \
  --protein_path examples/3ug2_protein.pdb \
  --ligand_path examples/3ug2_ligand.sdf \
  --kind Kd

Expected prediction: Kd=5.23 nM. Ground truth: Kd=5.6 nM.
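Affinity labels in PDBBind are commonly expressed on a negative log10 molar scale (pKd = -log10 Kd), so a small gap on that scale corresponds to a multiplicative gap in concentration. The conversion below is a sketch for comparing the two values above, reading them as nanomolar concentrations. The helper names are hypothetical, and whether the inference script converts in exactly this way is an assumption.

```python
import math


def kd_nm_to_pkd(kd_nm):
    """Convert a dissociation constant in nanomolar to pKd = -log10(Kd in molar)."""
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_nm * 1e-9)


def pkd_to_kd_nm(pkd):
    """Inverse conversion: pKd back to a dissociation constant in nanomolar."""
    return 10.0 ** (-pkd) * 1e9


if __name__ == "__main__":
    predicted_nm, measured_nm = 5.23, 5.6  # values quoted above
    predicted_pkd = kd_nm_to_pkd(predicted_nm)
    measured_pkd = kd_nm_to_pkd(measured_nm)
    print(f"predicted pKd: {predicted_pkd:.2f}")
    print(f"measured pKd:  {measured_pkd:.2f}")
    print(f"gap in log units: {predicted_pkd - measured_pkd:.2f}")
```

Working on the log scale makes errors comparable across ligands whose affinities span several orders of magnitude, which is why metrics such as RMSE are usually reported in these units.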

Citation

@inproceedings{guan3d,
  title={3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction},
  author={Guan, Jiaqi and Qian, Wesley Wei and Peng, Xingang and Su, Yufeng and Peng, Jian and Ma, Jianzhu},
  booktitle={International Conference on Learning Representations},
  year={2023}
}