PIDiff: Physics Informed Diffusion Model for Protein Pocket Specific 3D Molecular Generation
<img src="https://github.com/hello-maker/PIDiff/blob/master/assets/main.jpg">
Requirements
We include key dependencies below. Our detailed environment setup is available in environment.yml.
The code has been tested in the following environment:
Package | Version |
---|---|
Python | 3.8 |
PyTorch | 1.13.1 |
CUDA | 11.6 |
PyTorch Geometric | 2.2.0 |
RDKit | 2022.03.2 |
Install via Conda
conda create -n PIDiff python=3.8
conda activate PIDiff
conda install pytorch pytorch-cuda=11.6 -c pytorch -c nvidia
conda install pyg -c pyg
conda install rdkit openbabel tensorboard pyyaml easydict python-lmdb -c conda-forge
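After installation, it can help to confirm that the installed versions match the tested environment in the table above. The following is a minimal sketch, not part of the repository. The distribution names (`torch`, `torch-geometric`, `rdkit`) are assumptions, and Conda packages do not always register metadata under these names, so a "not found" result does not necessarily mean the package is missing.

```python
from importlib import metadata

# Versions copied from the table above. The distribution names are
# assumptions; Conda builds may register different names or none at all.
EXPECTED = {
    "torch": "1.13.1",
    "torch-geometric": "2.2.0",
    "rdkit": "2022.03.2",
}


def parse_version(version):
    """Return the leading numeric components, e.g. '1.13.1+cu116' -> (1, 13, 1)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def version_matches(installed, expected):
    """True if the installed version starts with the expected components."""
    want = parse_version(expected)
    return parse_version(installed)[: len(want)] == want


def check_environment(expected=EXPECTED):
    """Map each distribution to (installed version or None, matches expected)."""
    report = {}
    for dist, want in expected.items():
        try:
            have = metadata.version(dist)
        except metadata.PackageNotFoundError:
            report[dist] = (None, False)
        else:
            report[dist] = (have, version_matches(have, want))
    return report


if __name__ == "__main__":
    for dist, (have, ok) in check_environment().items():
        print(f"{dist}: installed={have} expected={EXPECTED[dist]} ok={ok}")
```

The comparison uses integer components so that `2022.03.2` and `2022.3.2` are treated as the same release.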
Data
The data used for training and evaluation is provided in a folder named Data, available through the submission site or the Google Drive folder.
Data
|__Training Data
| | # Raw complex structures of protein-ligand available from the CrossDocked2020 dataset. Proteins are specified in .pdb format, and Ligands in .sdf format.
| |__crossdocked_v1.1_rmsd1.0.tar.gz
| |
| | # Processed data that can be used for model training, obtainable through the execution of the ./Anonymous/datasets/pl_pair_dataset.py file
| |__crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb
| |
| | # Index storage files for each sample, used for splitting the train set and test set, or for other preprocessing purposes.
| |__index.pkl
|
|__Split
| | # Names and index numbers of samples used directly for training and validation.
| |___crossdocked_pocket10_pose_split.pt
| |
| | # Raw file for creating the crossdocked_pocket10_pose_split.pt file. It is split through pdb id.
| |___split_by_name.pt
|
|__Test Data
| |...
|
To train the model from scratch, you need the preprocessed lmdb file and split file:
crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb
crossdocked_pocket10_pose_split.pt
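Before starting a training run, a quick check that both files are in place can save a failed launch. This is a minimal sketch rather than repository code. The helper name `missing_training_files` is hypothetical, and the directory that holds the files is passed in by the caller because the exact location expected by the training config is not stated here.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED = [
    "crossdocked_v1.1_rmsd1.0_pocket10_processed_final.lmdb",
    "crossdocked_pocket10_pose_split.pt",
]


def missing_training_files(data_dir, required=REQUIRED):
    """Return the required file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in required if not (root / name).exists()]


if __name__ == "__main__":
    # "data" is an assumed location; adjust it to match your training config.
    missing = missing_training_files("data")
    print("All training files found." if not missing else f"Missing: {missing}")
```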
To evaluate the model on the test set, you need to unzip test_set.zip in the Data folder. It includes the original PDB files that will be used in Vina Docking.
If you want to process the dataset from scratch, you need to download CrossDocked2020 v1.1, save it into data/CrossDocked2020, and run the scripts in scripts/data_preparation:
- clean_crossdocked.py will filter the original dataset and keep the complexes with RMSD < 1 Å. It will generate an index.pkl file and create a new directory containing the original filtered data (corresponds to crossdocked_v1.1_rmsd1.0.tar.gz in the drive). You don't need these files if you have downloaded the .lmdb file.
python scripts/data_preparation/clean_crossdocked.py --source data/CrossDocked2020 --dest data/crossdocked_v1.1_rmsd1.0 --rmsd_thr 1.0
- extract_pockets.py will clip the original protein file to a 10 Å region around the binding molecule. E.g.
python scripts/data_preparation/extract_pockets.py --source data/crossdocked_v1.1_rmsd1.0 --dest data/crossdocked_v1.1_rmsd1.0_pocket10
- split_pl_dataset.py will split the data into training and test sets. We use the same split, split_by_name.pt, as AR and Pocket2Mol, which can also be downloaded from the data folder in the Google Drive.
python scripts/data_preparation/split_pl_dataset.py --path data/crossdocked_v1.1_rmsd1.0_pocket10 --dest data/crossdocked_pocket10_pose_split.pt --fixed_split data/split_by_name.pt
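The three preparation steps above form a chain: each script reads the directory written by the previous one. A minimal driver sketch is shown below. It only reuses the commands and paths listed in this section. The `run_steps` and `flag_value` helpers are hypothetical and not part of the repository, and the default is a dry run that prints the commands without executing them.

```python
import shlex
import subprocess

PREP = "scripts/data_preparation"

# Commands copied from this section, in the order they must run.
STEPS = [
    ["python", f"{PREP}/clean_crossdocked.py",
     "--source", "data/CrossDocked2020",
     "--dest", "data/crossdocked_v1.1_rmsd1.0",
     "--rmsd_thr", "1.0"],
    ["python", f"{PREP}/extract_pockets.py",
     "--source", "data/crossdocked_v1.1_rmsd1.0",
     "--dest", "data/crossdocked_v1.1_rmsd1.0_pocket10"],
    ["python", f"{PREP}/split_pl_dataset.py",
     "--path", "data/crossdocked_v1.1_rmsd1.0_pocket10",
     "--dest", "data/crossdocked_pocket10_pose_split.pt",
     "--fixed_split", "data/split_by_name.pt"],
]


def flag_value(cmd, flag):
    """Return the argument that follows a flag in a command list."""
    return cmd[cmd.index(flag) + 1]


def run_steps(steps=STEPS, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    lines = []
    for cmd in steps:
        line = shlex.join(cmd)
        print(line)
        lines.append(line)
        if not dry_run:
            # check=True stops the chain as soon as one step fails.
            subprocess.run(cmd, check=True)
    return lines


if __name__ == "__main__":
    run_steps(dry_run=True)
```

Calling `run_steps(dry_run=False)` from the repository root would execute the steps in order, stopping at the first failure.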
Training
Training from scratch
python scripts/train_diffusion.py configs/training.yml
Sampling
Sampling for pockets in the testset
python scripts/sample_diffusion.py configs/sampling.yml --data_id {i}
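Since `--data_id` selects a single pocket, sampling the whole test set means one invocation per index. The sketch below builds those command lines. It rests on two assumptions that this README does not confirm: that indices start at 0, and that the number of test pockets is supplied by the caller. The helper names are hypothetical.

```python
def sampling_command(data_id, config="configs/sampling.yml"):
    """Build the sampling command for one test-set pocket."""
    if data_id < 0:
        raise ValueError("data_id must be non-negative")
    return ["python", "scripts/sample_diffusion.py", config,
            "--data_id", str(data_id)]


def sampling_commands(num_pockets, config="configs/sampling.yml"):
    """Build one command per pocket index, assuming zero-based indices."""
    return [sampling_command(i, config) for i in range(num_pockets)]


if __name__ == "__main__":
    # 3 is a placeholder count for illustration only.
    for cmd in sampling_commands(3):
        print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` or written to a job script for a cluster scheduler.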
Evaluation
Evaluation from sampling results
python scripts/evaluate_diffusion.py {OUTPUT_DIR} --docking_mode vina_score --protein_root data/test_set
The docking mode can be chosen from {qvina, vina_score, vina_dock, none}
Note: It will take some time to prepare pdbqt and pqr files when you run the evaluation code with the vina_score or vina_dock docking mode for the first time.
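When scripting evaluation over several output directories, validating the docking mode up front avoids a late failure. The following is a small sketch under the assumption that the command line shown above is the complete interface. The `evaluation_command` helper is hypothetical and not part of the repository.

```python
# Docking modes listed in this section.
DOCKING_MODES = ("qvina", "vina_score", "vina_dock", "none")


def evaluation_command(output_dir, docking_mode="vina_score",
                       protein_root="data/test_set"):
    """Build the evaluation command, rejecting unknown docking modes."""
    if docking_mode not in DOCKING_MODES:
        raise ValueError(
            f"docking_mode must be one of {DOCKING_MODES}, got {docking_mode!r}"
        )
    return ["python", "scripts/evaluate_diffusion.py", output_dir,
            "--docking_mode", docking_mode,
            "--protein_root", protein_root]


if __name__ == "__main__":
    # "outputs" is a placeholder for your sampling output directory.
    print(" ".join(evaluation_command("outputs", "vina_dock")))
```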
Real-world Validation
If you want to generate molecules for a new protein not in the test set, you should run ./scripts/real_world/Iinference.ipynb.
Remember that you need to prepare two input files: the ligand's .sdf file, which is used to define the protein pocket, and the .pdb file containing the structural information of the protein.
Typically, the same preparation is also necessary for performing MD simulation.
Result
The main results for the proposed model are presented in the table below. For a more comprehensive overview of the results obtained with our model, please refer to the Report.
Evaluation of Generated Molecule
Model | VinaScore | VinaMin | VinaDock | HighAffinity | VinaScore<sub>SA</sub> | SR |
---|---|---|---|---|---|---|
AR | -5.75 | -6.18 | -6.75 | 0.379 | -5.59 | 74.7% |
LiGAN | - | - | -6.33 | 0.21 | - | 68.4% |
GraphBP | - | - | -4.80 | 0.14 | - | 57.1% |
Pocket2Mol | -5.15 | -6.42 | -7.15 | 0.48 | -5.12 | 88.7% |
DiffSBDD | 52.78 | 16.45 | -6.65 | 0.452 | -51.53 | 83.0% |
DrugGPS | 28.18 | 6.33 | -3.74 | 0.12 | -27.32 | 48.1% |
TargetDiff | -5.47 | -6.64 | -7.80 | 0.57 | -5.31 | 91.9% |
ResGen | 13.79 | -1.53 | -4.90 | 0.23 | -13.73 | 40.7% |
PIDiff | -6.58 | -7.52 | -8.10 | 0.64 | -6.03 | 100% |
Testset | -6.36 | -6.71 | -7.45 | - | -6.28 | - |