MolSnapper: Conditioning Diffusion for Structure Based Drug Design

MolSnapper is a tool that conditions a diffusion model to generate 3D drug-like molecules.

This repository is built on the MolDiff code and conditions the pretrained MolDiff model.

More information can be found in our paper.

Installation

Dependency

The code has been tested in the following environment:

Package	Version
Python	3.9.18
PyTorch	2.0.1
CUDA	11.7
PyTorch Geometric	2.3.1
RDKit	2022.03.5
Biopython	1.83
PyTorch Scatter	2.1.1
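
To compare an existing environment against the tested versions above, a small check along the following lines may help. This is a minimal sketch, not part of the repository: the distribution names (for example "torch" for PyTorch and "torch-geometric" for PyTorch Geometric) are assumptions about how each package is published, and a package installed through conda may not always expose metadata under that name.

```python
from importlib import metadata

# Versions from the table above. The distribution names on the left are
# assumptions about how each package is published (e.g. PyTorch as "torch").
TESTED_VERSIONS = {
    "torch": "2.0.1",
    "torch-geometric": "2.3.1",
    "rdkit": "2022.03.5",
    "biopython": "1.83",
    "torch-scatter": "2.1.1",
}


def normalize(version):
    """Drop a local build tag ("+cu117") and compare numeric parts as integers."""
    public = version.split("+")[0]
    return tuple(int(p) if p.isdigit() else p for p in public.split("."))


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_environment(expected=TESTED_VERSIONS):
    """Map each package name to (tested version, installed version, matches)."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        matches = found is not None and normalize(found) == normalize(wanted)
        report[name] = (wanted, found, matches)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in compare_environment().items():
        status = "ok" if ok else "differs"
        print(f"{name:16s} tested={wanted:10s} installed={found} [{status}]")
```

A version that differs from the table is not necessarily a problem; the table only records what the code was tested with.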

Install via conda yaml file (cuda 11.3)

conda env create -f env.yml
conda activate MolSnapper

Install manually

conda create -n MolSnapper python=3.9 # optional, create a new environment
conda activate MolSnapper

conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
conda install pyg -c pyg
conda install -c pyg pytorch-scatter

# Install other tools
conda install -c conda-forge rdkit
conda install pyyaml easydict python-lmdb -c conda-forge
conda install -c oddt oddt

Dataset

CrossDocked

Download the processed test set from the DecompDiff repository: https://github.com/bytedance/DecompDiff
Please download the following files:

test_set.zip
test_index.pkl

Save them in <test_directory> and process the test data using:

python scripts/prepare_data_cd.py --pairs-paths <test_directory>/test_index.pkl --root-dir <test_directory>  --out-mol-sdf <data_dir>/test_mol.sdf --out-pockets-pkl <data_dir>/test_pockets.pkl --out-table <data_dir>/test_table.csv

For example:

python scripts/prepare_data_cd.py --pairs-paths ./../crossdocked/test_index.pkl --root-dir ./../crossdocked/test_set  --out-mol-sdf ./../crossdocked/test_mol.sdf --out-pockets-pkl ./../crossdocked/test_pockets.pkl --out-table ./../crossdocked/test_table.csv

Processed data

The processed CrossDocked test set can be found in the data directory:

data
├── crossdocked
│   ├── test_mol.sdf
│   ├── test_pockets.pkl
│   └── test_table.csv
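
Before sampling, it can be worth confirming that the three processed files are where the layout above says they should be. The helper below is a minimal sketch under that assumption; `missing_outputs` is a hypothetical name and is not part of the repository.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = ("test_mol.sdf", "test_pockets.pkl", "test_table.csv")


def missing_outputs(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("data/crossdocked")
    if missing:
        print("missing files:", ", ".join(missing))
    else:
        print("all processed files are present")
```

The same check applies to any other processed dataset directory that uses these three file names.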

Binding MOAD

Download and split the dataset as described by the authors of DiffSBDD: https://github.com/arneschneuing/DiffSBDD/tree/main
Save the test set in <test_directory>.

After removing water molecules, process the test directory using:

python scripts/prepare_moad.py --test_path <test_directory>  --out-mol-sdf <data_dir>/test_mol.sdf --out-pockets-pkl <data_dir>/test_pockets.pkl --out-table <data_dir>/test_table.csv

Processed data

The processed Binding MOAD data can be found here:

data
├── MOAD
│   ├── test_mol.sdf
│   ├── test_pockets.pkl
│   └── test_table.csv

Raw complex

If you have raw complexes, remove the hydrogens and separate the pockets from the ligands using:

python scripts/clean_and_split.py --in-dir <data_directory>  --proteins-dir <pockets_directory> --ligands-dir <ligands_directory>

Then process a given pocket using:

python scripts/prepare_single_complex.py --root_dir  <data_directory>  --ligand_filename <ligand_filename>.sdf  --protein_filename <protein_filename>.pdb --out_pockets_path <output_path>.pkl

For example:

python scripts/prepare_single_complex.py --root_dir  <data_directory>  --ligand_filename ligand.sdf --protein_filename data/protein.pdb --out_pockets_path ./data/protein.pkl

Processed complex

An example of a processed complex (PDB ID: 1h00) can be found here:

data
├── example_1h00
│   ├── ref_points.sdf
│   ├── processed_pocket_1h00.pkl
│   └── ligand.sdf

Sample

MolDiff provides pretrained models. First download the pretrained model weights from here and put them in the ./ckpt folder. MolSnapper uses the following model weight files:

MolDiff.pt: the pretrained complete MolDiff model.
bond_predictor.pt: the pretrained bond predictor that is used for bond guidance during sampling.

Sample molecules for a given pocket

After setting the correct model weight paths in the config file, you can run the following command to sample molecules:

python scripts/sample_single_pocket.py --outdir <output_directory> --config <path_to_config_file> --device <device_id> --batch_size <batch_size> --pocket_path <pocket_path>.pkl --sdf_path <sdf_path>.sdf --use_pharma <use_pharma> --num_pharma_atoms <num_pharma_atoms> --clash_rate <clash_rate>

The parameters are:

outdir: the root directory to save the sampled molecules.
config: the path to the config file.
device: the device to run the sampling.
batch_size: the batch size for sampling. If set to 0 (default), it will use the batch size specified in the config file.
pocket_path: the path to the pocket file (pkl).
mol_size: the size of the generated molecule.
sdf_path: the path to the SDF file that represents either ligands or reference points with atom positions and types.
use_pharma: a boolean parameter indicating whether to extract pharmacophore points from the SDF file or use the SDF as reference points.
pharma_th: the minimum percentage of satisfied pharmacophore points required for a generated molecule to be considered valid during the sampling process.
clash_rate: controls the strength of clash avoidance during the molecule sampling process.
distance_th: the distance threshold for determining whether a pharmacophore is satisfied.
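
To make the interplay between pharma_th and distance_th concrete, the sketch below shows one plausible reading of the two descriptions above. It is an illustration only, not the repository's implementation: it assumes a pharmacophore point counts as satisfied when any generated atom lies within distance_th of it, it ignores atom and feature types, and it treats pharma_th as a fraction between 0 and 1. The function names and the coordinates are hypothetical.

```python
import math


def satisfied_fraction(pharma_points, atom_positions, distance_th):
    """Fraction of pharmacophore points with at least one atom within distance_th.

    Assumption: proximity alone decides satisfaction; types are ignored.
    """
    if not pharma_points:
        return 1.0  # nothing to satisfy
    hits = sum(
        1
        for point in pharma_points
        if any(math.dist(point, atom) <= distance_th for atom in atom_positions)
    )
    return hits / len(pharma_points)


def passes_pharma_filter(pharma_points, atom_positions, distance_th, pharma_th):
    """True if the satisfied fraction reaches pharma_th (taken as a 0-1 fraction)."""
    return satisfied_fraction(pharma_points, atom_positions, distance_th) >= pharma_th


if __name__ == "__main__":
    # Hypothetical coordinates: the first point has an atom 0.5 away,
    # the second point has no atom within 1.0.
    points = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
    atoms = [(0.5, 0.0, 0.0), (10.0, 0.0, 0.0)]
    print(satisfied_fraction(points, atoms, distance_th=1.0))  # 0.5
    print(passes_pharma_filter(points, atoms, distance_th=1.0, pharma_th=0.75))  # False
```

Under this reading, raising distance_th makes each point easier to satisfy, while raising pharma_th demands that more points be satisfied before a molecule is accepted.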

An example command is:

python scripts/sample_single_pocket.py --outdir ./outputs --config ./configs/sample/sample_MolDiff.yml --batch_size 32 --pocket_path ./data/example_1h00/processed_pocket_1h00.pkl --sdf_path ./data/example_1h00/ref_points.sdf --use_pharma False --clash_rate 0.1

After sampling, the outdir folder will contain two directories: one with the metadata and one with the SDF files of the sampled molecules.
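
When several pockets need to be sampled one after another, the single-pocket command can be assembled from Python. The following is a minimal sketch that uses only the flags shown in the example command; `build_sample_command` and `sample_all` are hypothetical helpers, and the script is assumed to be launched from the repository root.

```python
import subprocess
import sys


def build_sample_command(pocket_path, sdf_path, outdir="./outputs",
                         config="./configs/sample/sample_MolDiff.yml",
                         batch_size=32, use_pharma=False, clash_rate=0.1):
    """Assemble the argument list for scripts/sample_single_pocket.py."""
    return [
        sys.executable, "scripts/sample_single_pocket.py",
        "--outdir", str(outdir),
        "--config", str(config),
        "--batch_size", str(batch_size),
        "--pocket_path", str(pocket_path),
        "--sdf_path", str(sdf_path),
        "--use_pharma", str(use_pharma),
        "--clash_rate", str(clash_rate),
    ]


def sample_all(jobs, **options):
    """Run one sampling job per (pocket_path, sdf_path) pair, stopping on failure."""
    for pocket_path, sdf_path in jobs:
        command = build_sample_command(pocket_path, sdf_path, **options)
        subprocess.run(command, check=True)


if __name__ == "__main__":
    jobs = [
        ("./data/example_1h00/processed_pocket_1h00.pkl",
         "./data/example_1h00/ref_points.sdf"),
    ]
    sample_all(jobs)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.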

Sample molecules for all pockets in the test set

To sample molecules for every pocket in the test set, use:

python scripts/sample.py --outdir <output_directory> --config <path_to_config_file> --device <device_id> --batch_size <batch_size> --pocket_dir <data_directory> --num_pharma_atoms <num_pharma_atoms> --clash_rate <clash_rate>

An example command is:

python scripts/sample.py --outdir ./outputs --config ./configs/sample/sample_MolDiff.yml --batch_size 32 --pocket_dir ./data/crossdocked  --num_pharma_atoms 20 --clash_rate 0.1

Evaluate

Filter the generated molecules using PoseBusters.

To evaluate the basic molecular properties of the generated molecules, their 3D similarity to the reference ligand, and their hydrogen bonds (computed with ODDT), run the following command:

python scripts/evaluate.py  <gen_root> --protein_path <protein_path>.pdb --reflig_path <reflig_path> --save_path <save_path>

The parameters are:

gen_root: the directory of the sampled molecules.
protein_path: the path to the protein (PDB format).
reflig_path: the path to the reference ligand used to evaluate similarity (default is None).
save_path: the directory in which to save the evaluation results.

For example:

python scripts/evaluate.py  ./outputs/my_run --protein_path ./data/example_1h00/pocket/1h00_protein.pdb --reflig_path ./data/example_1h00/ligand.sdf --save_path ./outputs/my_run/eval