Awesome

3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction

This repository is the official implementation of 3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction (ICLR 2023). [PDF]

Installation

Dependency

The code has been tested in the following environment:

Package	Version
Python	3.8
PyTorch	1.13.1
CUDA	11.6
PyTorch Geometric	2.2.0
RDKit	2022.03.2

Install via Conda and Pip

conda create -n targetdiff python=3.8
conda activate targetdiff
conda install pytorch pytorch-cuda=11.6 -c pytorch -c nvidia
conda install pyg -c pyg
conda install rdkit openbabel tensorboard pyyaml easydict python-lmdb -c conda-forge

# For Vina Docking
pip install meeko==0.1.dev3 scipy pdb2pqr vina==1.2.2 
python -m pip install git+https://github.com/Valdes-Tresanco-MS/AutoDockTools_py3

The code should work with PyTorch >= 1.9.0 and PyG >= 2.0. You can change the package version according to your need.

(Alternatively) Install via Mamba

Install Mamba

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh  # accept all terms and install to the default location
rm Mambaforge-$(uname)-$(uname -m).sh  # (optionally) remove installer after using it
source ~/.bashrc  # alternatively, one can restart their shell session to achieve the same result

Create Mamba environment

mamba env create -f environment.yaml
conda activate targetdiff  # note: one still needs to use `conda` to (de)activate environments

Target-Aware Molecule Generation

Data

The data used for training / evaluating the model are organized in the data Google Drive folder.

To train the model from scratch, you need to download the preprocessed lmdb file and split file:

crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb
crossdocked_pocket10_pose_split.pt

To evaluate the model on the test set, you need to download and unzip the test_set.zip. It includes the original PDB files that will be used in Vina Docking.

If you want to process the dataset from scratch, you need to download CrossDocked2020 v1.1 from here, save it into data/CrossDocked2020, and run the scripts in scripts/data_preparation:

clean_crossdocked.py will filter the original dataset and keep the ones with RMSD < 1A. It will generate a index.pkl file and create a new directory containing the original filtered data (corresponds to crossdocked_v1.1_rmsd1.0.tar.gz in the drive). You don't need these files if you have downloaded .lmdb file.
```
python scripts/data_preparation/clean_crossdocked.py --source data/CrossDocked2020 --dest data/crossdocked_v1.1_rmsd1.0 --rmsd_thr 1.0
```

extract_pockets.py will clip the original protein file to a 10A region around the binding molecule. E.g.

python scripts/data_preparation/extract_pockets.py --source data/crossdocked_v1.1_rmsd1.0 --dest data/crossdocked_v1.1_rmsd1.0_pocket10

split_pl_dataset.py will split the training and test set. We use the same split split_by_name.pt as AR and Pocket2Mol, which can also be downloaded in the Google Drive - data folder.

python scripts/data_preparation/split_pl_dataset.py --path data/crossdocked_v1.1_rmsd1.0_pocket10 --dest data/crossdocked_pocket10_pose_split.pt --fixed_split data/split_by_name.pt

Training

Training from scratch

python scripts/train_diffusion.py configs/training.yml

Trained model checkpoint

https://drive.google.com/drive/folders/1-ftaIrTXjWFhw3-0Twkrs5m0yX6CNarz?usp=share_link

Sampling

Sampling for pockets in the testset

python scripts/sample_diffusion.py configs/sampling.yml --data_id {i} # Replace {i} with the index of the data. i should be between 0 and 99 for the testset.

You can also speed up sampling with multiple GPUs, e.g.:

CUDA_VISIBLE_DEVICES=0 bash scripts/batch_sample_diffusion.sh configs/sampling.yml outputs 4 0 0
CUDA_VISIBLE_DEVICES=1 bash scripts/batch_sample_diffusion.sh configs/sampling.yml outputs 4 1 0
CUDA_VISIBLE_DEVICES=2 bash scripts/batch_sample_diffusion.sh configs/sampling.yml outputs 4 2 0
CUDA_VISIBLE_DEVICES=3 bash scripts/batch_sample_diffusion.sh configs/sampling.yml outputs 4 3 0

Sampling from pdb file

To sample from a protein pocket (a 10A region around the reference ligand):

python scripts/sample_for_pocket.py configs/sampling.yml --pdb_path examples/1h36_A_rec_1h36_r88_lig_tt_docked_0_pocket10.pdb

Evaluation

Evaluation from sampling results

python scripts/evaluate_diffusion.py {OUTPUT_DIR} --docking_mode vina_score --protein_root data/test_set

The docking mode can be chosen from {qvina, vina_score, vina_dock, none}

Note: It will take some time to prepare pqdqt and pqr files when you run the evaluation code with vina_score/vina_dock docking mode for the first time.

Evaluation from meta files

We provide the sampling results (also docked) of our model and CVAE, AR, Pocket2Mol baselines here.

Metafile Name	Original Paper
crossdocked_test_vina_docked.pt	-
cvae_vina_docked.pt	liGAN
ar_vina_docked.pt	AR
pocket2mol_vina_docked.pt	Pocket2Mol
targetdiff_vina_docked.pt	TargetDiff

You can directly evaluate from the meta file, e.g.:

python scripts/evaluate_from_meta.py sampling_results/targetdiff_vina_docked.pt --result_path eval_targetdiff

One can reproduce the results reported in the paper quickly with notebooks/summary.ipynb

Binding Affinity Prediction

Data

In the unsupervised learning setting, we still use the CrossDocked2020 dataset and find the data with experimentally measured binding affinity (saved in affinity_info.pkl) for further analysis.
In the supervised learning setting, we use the PDBBind dataset, which can be downloaded from: http://www.pdbbind.org.cn. The downloaded refined / general set should be saved in data/pdbbind_v{YEAR} directory.

Take the PDBBind v2016 for example, you need to first unzip the data:

mkdir -p data/pdbbind_v2016 && tar -xzvf data/pdbbind_v2016_refined.tar.gz -C data/pdbbind_v2016

Then, you can extract 10A pockets and split the dataset using the following commands:

# extract pockets
python scripts/property_prediction/extract_pockets.py --source data/pdbbind_v2016 --subset refined --refined_index_pkl data/pdbbind_v2016/pocket_10_refined/index.pkl

# split dataset
python scripts/property_prediction/pdbbind_split.py --index_path data/pdbbind_v2016/pocket_10_refined/index.pkl  --save_path data/pdbbind_v2016/pocket_10_refined/split.pt

Training

One can train the binding affinity prediction model with:

python scripts/property_prediction/train_prop.py configs/prop/pdbbind_general_egnn.yml

It is also possible to enhance the model with extra features extracted from the unsupervised generative model. You need to first export the hidden states with:

python scripts/likelihood_est_diffusion_pdbbind.py

This command will dump various meta information and you need to specify the feature you want to use in the training config (like configs/prop/pdbbind_general_egnn.yml) of the following supervised prediction model.

Trained model checkpoint

NOTE: For the supervised learning setting, since the training results on PDBBind v2020 are lost by accident, we can only provide the model checkpoint trained on PDBBind v2016 in the preliminary experiments for now. However, it can already make accurate prediction for the practical use. We will retrain the models on PDBBind v2020 and provide the trained checkpoints as soon.

https://drive.google.com/drive/folders/1-ftaIrTXjWFhw3-0Twkrs5m0yX6CNarz?usp=share_link

Evaluation

For the unsupervised learning evaluation, please check notebooks/analyze_affinity.ipynb
For the supervised learning evaluation, one can use the following command to evaluate on the tes set:

python scripts/property_prediction/eval_prop.py --ckpt_path pretrained_models/egnn_pdbbind_v2016.pt

Expected results:

RMSE	MAE	R^2	Pearson	Spearman
1.316	1.031	0.633	0.797	0.782

Inference

To predict the binding affinity of a complex, one need to prepare the PDB file and SDF/MOL2 file first (Important: for the supervised learning model trained on PDBBind v2016, both protein and ligand need to have hydrogen atoms). Then, the binding affinity can be predicted with scripts/property_prediction/inference.py. For example,

python scripts/property_prediction/inference.py \
  --ckpt_path pretrained_models/egnn_pdbbind_v2016.pt \
  --protein_path examples/3ug2_protein.pdb \
  --ligand_path examples/3ug2_ligand.sdf \
  --kind Kd

Expected prediction: Kd=5.23 nm. Ground-truth: Kd=5.6 nm

Citation

@inproceedings{guan3d,
  title={3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction},
  author={Guan, Jiaqi and Qian, Wesley Wei and Peng, Xingang and Su, Yufeng and Peng, Jian and Ma, Jianzhu},
  booktitle={International Conference on Learning Representations},
  year={2023}
}