Awesome
We Should at Least Be Able to Design Molecules That Dock Well
To learn how to evaluate your model, see Getting Started notebook.
Paper: https://arxiv.org/abs/2006.16955.
News:
- (12.2020) We have released v1.0 of the benchmark. Main changes include changing the scoring function to a more accurate one (Vinardo), adding stronger baselines (REINVENT and sampling molecules from ZINC), and measuring and ensuring sufficient novelty and diversity of generated molecules.
Results
Listed below are benchmark results from the paper for docking score optimization (the lower, the better). Each cell reports the mean score for the generated compounds and their internal diversity in parenthesis. For each protein we sampled a set of molecules from ZINC subset of protein's training set size. As a baseline, we also report results for the top 10% molecules from the training set and ZINC. Please see our paper for more details.
5HT1B | 5HT2B | ACM2 | CYP2D6 | |
---|---|---|---|---|
CVAE | -4.647 (0.907) | -4.188 (0.913) | -4.836 (0.905) | - |
GVAE | -4.955 (0.901) | -4.641 (0.887) | -5.422 (0.898) | -7.672 (0.714) |
REINVENT | -9.774 (0.506) | -8.657 (0.455) | -9.775 (0.467) | -8.759 (0.626) |
Train (10%) | -10.837 (0.749) | -9.769 (0.831) | -8.976 (0.812) | -9.256 (0.869) |
ZINC (10%) | -9.894 (0.862) | -9.228 (0.851) | -8.282 (0.860) | -8.787 (0.853) |
Environment
The best way is to use conda environment.
Create new environment and run docking_benchmark/install_conda_env.sh
script.
Data
In order to run experiments or train models additional data is required.
Download this zip, unpack it and set the DOCKING_BENCHMARK_DATA
environment variable to this directory.
Experiments
Single component optimization
Run the docking_baselines/scripts/generate_molecules.py
script. Run it with -h
flag for info about arguments.
Details about some of the arguments:
protein
- protein that ligand will be docked to; possible choices:5ht1b
,5ht2b
,acm2
random_samples
- Gauss samples that will be docked (see paper)mode
-minimize
ormaximize
- whether the model should minimize or maximize the componentdataset
- dataset used to fine-tune the model; available datasets for a given protein are listed inDOCKING_BENCHMARK_DATA/proteins_data/protein/metadata.json
file; the dataset defines the component to be optimized by properly settingscore_column
inmetadata
file