Structure-based Drug Design Benchmark: Do 3D Methods Really Dominate?



This repository hosts an open-source benchmark for structure-based drug design (SBDD), intended to make the evaluation of algorithmic advances in molecular optimization transparent and reproducible. It supports 16 SBDD algorithms on 7 tasks.

Installation

There are two environments: Test Env and TDC Env. Test Env is used to run 3DSBDD, Pocket2mol, PocketFlow, ResGen, and Autogrow4. TDC Env is used to run the remaining models and to evaluate the molecules generated by all models.

conda env create -f environment_TestEnv.yml
conda activate TestEnv2

16 Methods

Based on their ML methodologies, all the methods are categorized into:

Time is the approximate average wall-clock time for a single run in our benchmark. It does not include the time for pretraining or data preprocessing. We have already processed the data and pretrained the models; both are available in the repository.

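| Model | Dimension | Generated number | Requires GPU |
| --- | --- | --- | --- |
| 3DSBDD | 3D | 771 | yes |
| AutoGrow4 | 2D | 1233 | yes |
| Pocket2mol | 3D | 928 | yes |
| PocketFlow | 3D | 1000 | yes |
| ResGen | 3D | 631 | yes |
| DST | 2D | 1001 | no |
| Graph GA | 2D | 643 | no |
| MIMOSA | 2D | 1001 | yes |
| MolDQN | 2D | 501 | yes |
| Pasithea | 1D | 914 | yes |
| REINVENT | 1D | 100 | yes |
| SCREENING | - | 1000 | no |
| SELFIES-VAE-BO | 1D | 200 | yes |
| SMILES-GA | 1D | 584 | no |
| SMILES-LSTM-HC | 1D | 501 | no |
| SMILES-VAE-BO | 1D | 200 | yes |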

PDB information

All PDB files can be downloaded from the RCSB Protein Data Bank. The binding sites are as follows:

| PDB | Center (x, y, z) | Bounding box size |
| --- | --- | --- |
| 1iep | 15.6138918, 53.38013513, 15.454837 | 15 |
| 3eml | -9.06363, -7.1446, 55.86259999 | 15 |
| 3ny8 | 2.2488, 4.68495, 51.39820000000001 | 15 (23 for Pocket2mol) |
| 4rlu | -0.73599, 22.75547, -31.23689 | 15 |
| 4unn | 5.684346153, 18.1917, -7.3715 | 15 |
| 5mo4 | -44.901, 20.490354, 8.48335 | 15 |
| 7l11 | -21.81481, -4.21606, -27.98378 | 15 (23 for Pocket2mol) |
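
For scripting, the binding-site table can be kept as a small lookup. The sketch below is only an illustration and is not code from this repository: the names `BINDING_SITES`, `LARGER_BOX_FOR_POCKET2MOL`, and `box_size_for` are hypothetical, and the model name is assumed to be spelled as in this README.

```python
# Binding-site centers copied from the table in this README.
# Hypothetical helper for illustration; not part of the repository.
BINDING_SITES = {
    "1iep": (15.6138918, 53.38013513, 15.454837),
    "3eml": (-9.06363, -7.1446, 55.86259999),
    "3ny8": (2.2488, 4.68495, 51.39820000000001),
    "4rlu": (-0.73599, 22.75547, -31.23689),
    "4unn": (5.684346153, 18.1917, -7.3715),
    "5mo4": (-44.901, 20.490354, 8.48335),
    "7l11": (-21.81481, -4.21606, -27.98378),
}

DEFAULT_BOX_SIZE = 15
POCKET2MOL_BOX_SIZE = 23
# Targets where the table lists a box size of 23 for Pocket2mol.
LARGER_BOX_FOR_POCKET2MOL = {"3ny8", "7l11"}


def box_size_for(pdb_id, model):
    """Return the bounding box size listed in the table for a target and model."""
    if model.lower() == "pocket2mol" and pdb_id in LARGER_BOX_FOR_POCKET2MOL:
        return POCKET2MOL_BOX_SIZE
    return DEFAULT_BOX_SIZE


if __name__ == "__main__":
    for pdb_id, center in BINDING_SITES.items():
        print(pdb_id, center, box_size_for(pdb_id, "Pocket2mol"))
```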

Sampling and evaluating

For 3DSBDD and Pocket2mol, we use this command to generate:

python sample_for_pdb.py --pdb_path [your pdb] --center=[centers] --bbox_size [box size] --outdir [your outdir]

You also need to set num_samples in sample_for_pdb.yml.
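
When sampling for several targets, the command above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `build_sample_command` is a hypothetical helper, the file names are placeholders, and the format of the `--center` value (comma-separated x,y,z) is an assumption that should be checked against `sample_for_pdb.py`.

```python
import shlex


def build_sample_command(pdb_path, center, bbox_size, outdir):
    """Build the argument list for sample_for_pdb.py (hypothetical helper)."""
    # ASSUMPTION: --center accepts comma-separated coordinates; verify in the script.
    center_arg = ",".join(str(c) for c in center)
    return [
        "python", "sample_for_pdb.py",
        "--pdb_path", pdb_path,
        f"--center={center_arg}",
        "--bbox_size", str(bbox_size),
        "--outdir", outdir,
    ]


# Placeholder paths; the center is the 1iep entry from the binding-site table.
cmd = build_sample_command(
    "1iep.pdb", (15.6138918, 53.38013513, 15.454837), 15, "outputs/1iep"
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it. Attaching the value with `=` is likely intended to keep centers that begin with a minus sign from being read as an option flag.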

For PocketFlow, we use this command to generate:

python main_generate.py -pkt [your pdb] --ckpt ckpt/ZINC-pretrained-255000.pt -n 1000 -d cuda:0 --root_path [your outdir] --name [pdb name] -at 1.0 -bt 1.0 --max_atom_num 35 -ft 0.5 -cm True --with_print True

For ResGen, we first convert the PDB file to an SDF file and then use this command to generate:

python gen.py --pdb_file [your pdb] --sdf_file [correspond sdf] --outdir [your outdir]

For Autogrow4, we recommend following their tutorial before running the generation command:

python RunAutogrow.py \
    --filename_of_receptor [your pdb] \
    --center_x [center x] --center_y  [center y] --center_z [center z] \
    --size_x [box size] --size_y [box size] --size_z [box size] \
    --source_compound_file /autogrow4/autogrow/source_compounds/naphthalene_smiles.smi \
    --root_output_folder /PATH_TO/output_directory/ \
    --number_of_mutants_first_generation 50 \
    --number_of_crossovers_first_generation 50 \
    --number_of_mutants 50 \
    --number_of_crossovers 50 \
    --top_mols_to_seed_next_generation 50 \
    --number_elitism_advance_from_previous_gen 50 \
    --number_elitism_advance_from_previous_gen_first_generation 10 \
    --diversity_mols_to_seed_first_generation 10 \
    --diversity_seed_depreciation_per_gen 10 \
    --num_generations 5 \
    --mgltools_directory /PATH_TO/mgltools_x86_64Linux2_1.5.6/ \
    --number_of_processors -1 \
    --scoring_choice VINA \
    --LipinskiLenientFilter \
    --start_a_new_run \
    --rxn_library ClickChem \
    --selector_choice Rank_Selector \
    --dock_choice VinaDocking \
    --max_variants_per_compound 5 \
    --redock_elite_from_previous_gen False \
    --generate_plot True \
    --reduce_files_sizes True \
    --use_docked_source_compounds True \
    >  /PATH_TO/OUTPUT/text_file.txt 2>  /PATH_TO/OUTPUT/text_errormessage_file.txt
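
The receptor box flags in the command above take one value per axis, whereas the PDB table gives a single center triple and one box size. A minimal sketch of the mapping follows; `autogrow_box_args` is a hypothetical helper, not part of AutoGrow4 or this repository, and it only uses the flag names shown in the command.

```python
def autogrow_box_args(center, box_size):
    """Expand a (x, y, z) center and a cubic box size into per-axis flags."""
    x, y, z = center
    size = str(box_size)
    return [
        "--center_x", str(x),
        "--center_y", str(y),
        "--center_z", str(z),
        "--size_x", size,
        "--size_y", size,
        "--size_z", size,
    ]


# Example with the 3eml entry from the binding-site table.
args = autogrow_box_args((-9.06363, -7.1446, 55.86259999), 15)
print(" ".join(args))
```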

The models above only produce molecules. To evaluate these molecules with docking and heuristic oracles, use the following command:

python evaluation.py --smiles_path [your path] --pdb [your pdb] --model [model name]

For the remaining models, which are run under PMO, we use the following command to generate. Note that it should be run in the TDC environment:

oracle_array=('1iep_docking' '3eml_docking' '3ny8_docking' '4rlu_docking' '4unn_docking' '5mo4_docking' '7l11_docking')

for oracle in "${oracle_array[@]}"
do
    python -u run.py [model name] --task production --n_runs 1 --max_oracle_calls 1000 --oracles "${oracle}"
done
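
The same loop can be driven from Python, which may be convenient when several models are queued. This is a sketch rather than a script from the repository: `build_run_commands` is a hypothetical helper, and `"graph_ga"` is only a placeholder for whatever model name `run.py` expects.

```python
PDB_IDS = ["1iep", "3eml", "3ny8", "4rlu", "4unn", "5mo4", "7l11"]


def build_run_commands(model_name, max_oracle_calls=1000, n_runs=1):
    """Build one run.py argument list per docking oracle (hypothetical helper)."""
    commands = []
    for pdb_id in PDB_IDS:
        commands.append([
            "python", "-u", "run.py", model_name,
            "--task", "production",
            "--n_runs", str(n_runs),
            "--max_oracle_calls", str(max_oracle_calls),
            "--oracles", f"{pdb_id}_docking",
        ])
    return commands


# Placeholder model name; substitute the name accepted by run.py.
commands = build_run_commands("graph_ga")
for command in commands:
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` from within the TDC environment.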

After generation, you can use mol_opt_process to convert the generated YAML file to a CSV file and evaluate the heuristic oracles.

To compute statistics of the docking or property scores, use the following command:

python results_compare.py --eval_folder_path [your generated result] --pdb_list [your pdb list] --file_type [docking or property] --output_folder [your outdir]
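
As an illustration of the kind of summary such statistics involve, the sketch below computes a few values from a list of docking scores. It is not the implementation of `results_compare.py`: `summarize_scores` is a hypothetical function, and it assumes Vina-style docking scores where lower (more negative) is better. For property scores where higher is better, the ordering would need to be reversed.

```python
from statistics import mean, stdev


def summarize_scores(scores, top_k=10):
    """Summarize docking scores, assuming lower values are better."""
    ordered = sorted(scores)  # ascending, so the best (lowest) score comes first
    top = ordered[:top_k]
    return {
        "count": len(ordered),
        "mean": mean(ordered),
        "std": stdev(ordered) if len(ordered) > 1 else 0.0,
        "best": ordered[0],
        "top_k_mean": mean(top),
    }


# Toy scores for illustration only; these are not benchmark results.
summary = summarize_scores([-7.0, -9.0, -8.0, -6.0], top_k=2)
print(summary)
```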