# Introduction
This repository contains the code needed to reproduce the experiments reported in https://www.biorxiv.org/content/10.1101/2022.07.15.500218v1.
The work builds on previously published work from https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0235-x.
We provide example notebooks in `./notebooks` for creating the input files necessary to reproduce our results from Double-Model RIOP (DrIOP).
For all other RIOP experiments, please refer to https://github.com/m-mokaya/RIOP.
# Usage
- Templates for inputs are provided in the `reinvent/data/examples/templates` folder. More examples will follow.
- There are templates for 6 running modes. Each running mode can be executed with `python input.py some_running_mode.json` after activating the environment.
- Templates have to be edited before use. For a standard run, only the file and folder paths need to be modified. Most running modes produce logs that can be monitored with `tensorboard`; see below.
- The logging folder is defined by setting a valid path in the `"logging_path"` field of the JSON file. This is required for all running modes.
- Running modes:
  - Sampling: `sampling.json` can be used to start sampling. It requires a generative model as input and produces a file containing SMILES. We provide a generative model, `reinvent/data/augmented.prior`. Alternatively, focused Agents generated by transfer learning or reinforcement learning can be sampled as well.
  - Transfer Learning (TL): `transfer_learning.json` is the relevant template. It can be used to focus the general prior towards a narrow chemical space by training on a representative sample of SMILES provided by the user. It requires as input a list of SMILES (example format in `reinvent/data/smiles.smi`) and the generative model `reinvent/data/augmented.prior`. The result is a set of generative Agent checkpoints produced after each epoch of training, plus a final focused Agent. Inspect the `tensorboard` logs to estimate which Agent has the level of focusing you prefer.
  - Reinforcement Learning (RL): Use `reinforcement_learning.json` as a template. The input requires paths for both the Agent and Prior generative models (in the `"reinforcement_learning"` section of the JSON file). Both can be the model we provide, `reinvent/data/augmented.prior`, or the user can supply a focused Agent generated by TL. The output is a focused generative model and a `scaffold_memory.csv` file containing the best-scoring SMILES found during the RL run. The output folder is defined by setting a value for `"resultdir"`. The scoring function object `"scoring_function"` can be either `"name": "custom_product"` or `"name": "custom_sum"`. The scoring function has a list of parameters, `"parameters": []`, which may contain any number of component objects. The current template example offers 5 components: a QED score, Matching Substructure (MS), Custom Alerts (CA), and 2 Predictive Property (PP) components. The PP components require setting either a classification (`reinvent/data/drd2.pkl`) or regression (`reinvent/data/Aurora_model.pkl`) model path. A sketch of the overall JSON shape follows this list.
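To give a feel for the overall shape of these JSON files, here is a minimal sketch that assembles a hypothetical RL configuration in Python and writes it to disk. Only the field names quoted above come from this README; the remaining keys and the nesting are assumptions, so treat the shipped templates in `reinvent/data/examples/templates` as the authoritative schema. Sampling and TL configurations follow the same pattern with their own sections.

```python
# Minimal sketch of a reinforcement-learning configuration file.
# Keys taken from this README: "logging_path", "resultdir", "scoring_function",
# "name", "parameters", and the "reinforcement_learning" section. Everything
# else ("run_type", "agent", "prior", "component_type", "weight", the nesting)
# is an assumption -- consult the shipped templates for the exact schema.
import json

config = {
    "run_type": "reinforcement_learning",  # assumed mode selector
    "logging": {
        "logging_path": "/path/to/logs",   # required for all running modes
    },
    "parameters": {
        "reinforcement_learning": {
            "agent": "reinvent/data/augmented.prior",  # or a TL-focused Agent
            "prior": "reinvent/data/augmented.prior",
            "resultdir": "/path/to/results",  # will receive scaffold_memory.csv
        },
        "scoring_function": {
            "name": "custom_product",  # or "custom_sum"
            "parameters": [
                # component objects go here; these key names are assumptions
                {"component_type": "qed_score", "weight": 1},
                {"component_type": "custom_alerts", "weight": 1},
            ],
        },
    },
}

with open("my_reinforcement_learning.json", "w") as f:
    json.dump(config, f, indent=4)

# Then, with the environment activated:
#   python input.py my_reinforcement_learning.json
```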
# Available components
The scoring function is built up from components, which together define the "compass" the Agents use to navigate chemical space and suggest chemical compounds. The following components are currently available:
- PREDICTIVE PROPERTY: Descriptor-based models to predict, e.g., activity against a given target or solubility. Uses `scikit-learn` models and works with both classification and regression models.
- TANIMOTO SIMILARITY: Requires a user-defined set of SMILES and returns the highest similarity score to the provided set.
- JACCARD DISTANCE: Requires a user-defined set of SMILES and returns the lowest distance score to the provided set.
- MATCHING SUBSTRUCTURE: A penalty component to bias towards generating certain (sub)structures. Requires a user-defined set of SMARTS patterns. Returns 1 if there is a substructure match and 0.5 otherwise.
- CUSTOM ALERTS: A penalty component to avoid generating certain (sub)structures. Requires a user-defined set of SMARTS patterns indicating unwanted moieties. Returns 0 if there is a match and 1 otherwise.
- QED SCORE: Uses the QED implementation in `RDKit`.
- MOLECULAR WEIGHT: Physico-chemical property calculated by `RDKit`.
- TPSA: Physico-chemical property calculated by `RDKit`.
- ROTATABLE BONDS: Physico-chemical property calculated by `RDKit`.
- NUMBER OF HYDROGEN BOND DONORS: Physico-chemical property calculated by `RDKit`.
- NUMBER OF RINGS: Physico-chemical property calculated by `RDKit`.
- SELECTIVITY: Use this component to optimize activity against one target while reducing activity against another, i.e., to increase a compound's selectivity. Uses two `scikit-learn` models, one predicting the target activity and the other providing an off-target prediction; works with both classification and regression models. The score reflects a user-defined activity gap between the target and off-target predictions. A hedged sketch of such a component entry follows this list.
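As a concrete illustration, the sketch below shows what a SELECTIVITY component entry inside the scoring function's `"parameters"` list might look like. Every key name here is an assumption made for illustration; the shipped templates define the real schema.

```python
# Hypothetical SELECTIVITY component entry for the scoring function's
# "parameters" list. All key names below are assumptions for illustration;
# check the templates in reinvent/data/examples/templates for the real ones.
selectivity_component = {
    "component_type": "selectivity",            # assumed identifier
    "weight": 1,                                # assumed relative weight
    "model_path": "/path/to/target_model.pkl",  # scikit-learn target-activity model
    "offtarget_model_path": "/path/to/offtarget_model.pkl",  # off-target model
    "activity_gap": 1.0,                        # user-defined gap described above
}
```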
# To use `tensorboard` for logging

- To launch `tensorboard`, you need a graphical environment. Run:
  `tensorboard --logdir "path to your log output directory" --port=8008`
  This will give you an address to copy into a browser to access the graphical summaries from `tensorboard`.
- Further command-line parameters can be used to change the number of scalars, histograms, images, distributions and graphs shown, e.g.:
  `--samples_per_plugin=scalar=700,images=20`
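Putting the two together, a complete launch command could look like this (the log directory is a placeholder path):

```bash
# Launch tensorboard on port 8008 with a custom number of logged samples;
# the --logdir value is a placeholder to replace with your logging_path.
tensorboard --logdir "/path/to/logs" --port=8008 --samples_per_plugin=scalar=700,images=20
```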
# Installation
- Install Anaconda / Miniconda.
- Clone the repository.
- Open a terminal, go to the repository and create the environment:
  `conda env create -f reinvent_shared.yml`
- (Optional) To set environment variables (currently not needed), for example a license, run the following on the command line:

  ```bash
  cd $CONDA_PREFIX
  mkdir -p ./etc/conda/activate.d
  mkdir -p ./etc/conda/deactivate.d
  touch ./etc/conda/activate.d/env_vars.sh
  touch ./etc/conda/deactivate.d/env_vars.sh
  ```

  then edit `./etc/conda/activate.d/env_vars.sh` as follows:

  ```bash
  #!/bin/sh
  export SOME_LICENSE='/path/to/your/license/file'
  ```

  and finally, edit `./etc/conda/deactivate.d/env_vars.sh`:

  ```bash
  #!/bin/sh
  unset SOME_LICENSE
  ```

- Activate the environment:
  `conda activate reinvent_shared.v2.1`
- (Optional) In the project directory, in `./configs/`, create the file `config.json` by copying over `example.config.json` and editing it as required. In the current version this is only relevant for the unit tests.
- Use the tool.
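For convenience, the steps above condense to the following shell session; `<repository-url>`, `<repository-folder>` and the template choice are placeholders to adapt:

```bash
# Condensed install-and-run sequence; <repository-url> and
# <repository-folder> are placeholders for this repository.
git clone <repository-url>
cd <repository-folder>
conda env create -f reinvent_shared.yml
conda activate reinvent_shared.v2.1
# Edit a template's file and folder paths first, then run it:
python input.py reinvent/data/examples/templates/sampling.json
```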