# Introduction
This repository contains the code needed to reproduce the experiments reported in https://www.biorxiv.org/content/10.1101/2022.07.15.500218v1.
The work builds on previously published work from https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0235-x.
We provide example notebooks in `./notebooks` for creating the input files necessary to reproduce our results from Double-Model RIOP (DrIOP).
For all other RIOP experiments, please refer to https://github.com/m-mokaya/RIOP.
# Usage
- Templates for inputs are provided in the `reinvent/data/examples/templates` folder. More examples will follow.
- There are templates for 6 running modes. Each running mode can be executed with `python input.py some_running_mode.json` after activating the environment.
- Templates have to be edited before use. For a standard run, only the file and folder paths need to be modified. Most running modes produce logs that can be monitored with `tensorboard`; see below.
- The logging folder is defined by setting a valid path in the `"logging_path"` field of the JSON file. This is required for all running modes.
- Running modes:
  - Sampling: `sampling.json` can be used to start sampling. It requires a generative model as input and produces a file containing SMILES. We provide a generative model, `reinvent/data/augmented.prior`. Alternatively, focused Agents generated by transfer learning or reinforcement learning can be sampled as well.
  - Transfer Learning (TL): `transfer_learning.json` is the relevant template. It can be used to focus the general prior towards a narrow chemical space by training on a representative sample of SMILES provided by the user. It requires as input a list of SMILES (example format in `reinvent/data/smiles.smi`) and the generative model `reinvent/data/augmented.prior`. The result is a set of generative Agent checkpoints produced after each epoch of training, plus a final focused Agent. Inspect the `tensorboard` logs to estimate which Agent has the level of focusing you prefer.
  - Reinforcement Learning (RL): Use `reinforcement_learning.json` as a template. The input requires paths for both the Agent and Prior generative models (in the `"reinforcement_learning"` section of the JSON file). Both can be the model we provide, `reinvent/data/augmented.prior`, or the user can supply a focused Agent generated by TL. The output is a focused generative model and a `scaffold_memory.csv` file containing the best-scoring SMILES found during the RL run. The output folder is defined by setting a value for `"resultdir"`. The scoring function object `"scoring_function"` can be either `"name": "custom_product"` or `"name": "custom_sum"`. The scoring function has a list of parameters, `"parameters": []`, which may contain any number of component objects. The current template example offers 5 components: a QED score, Matching Substructure (MS), Custom Alerts (CA), and 2 Predictive Property (PP) components. The PP components require setting either a classification (`reinvent/data/drd2.pkl`) or regression (`reinvent/data/Aurora_model.pkl`) model path. A sketch of the overall JSON shape follows this list.
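To give a feel for the overall shape of these JSON files, here is a minimal sketch that assembles a hypothetical RL configuration in Python and writes it to disk. Only the field names quoted above come from this README; the remaining keys and the nesting are assumptions, so treat the shipped templates in `reinvent/data/examples/templates` as the authoritative schema. Sampling and TL configurations follow the same pattern with their own sections.

```python
# Minimal sketch of a reinforcement-learning configuration file.
# Keys taken from this README: "logging_path", "resultdir", "scoring_function",
# "name", "parameters", and the "reinforcement_learning" section. Everything
# else ("run_type", "agent", "prior", "component_type", "weight", the nesting)
# is an assumption -- consult the shipped templates for the exact schema.
import json

config = {
    "run_type": "reinforcement_learning",  # assumed mode selector
    "logging": {
        "logging_path": "/path/to/logs",   # required for all running modes
    },
    "parameters": {
        "reinforcement_learning": {
            "agent": "reinvent/data/augmented.prior",  # or a TL-focused Agent
            "prior": "reinvent/data/augmented.prior",
            "resultdir": "/path/to/results",  # will receive scaffold_memory.csv
        },
        "scoring_function": {
            "name": "custom_product",  # or "custom_sum"
            "parameters": [
                # component objects go here; these key names are assumptions
                {"component_type": "qed_score", "weight": 1},
                {"component_type": "custom_alerts", "weight": 1},
            ],
        },
    },
}

with open("my_reinforcement_learning.json", "w") as f:
    json.dump(config, f, indent=4)

# Then, with the environment activated:
#   python input.py my_reinforcement_learning.json
```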
# Available components
The scoring function is built up from components, which together define the "compass" the Agents use to navigate chemical space and suggest chemical compounds. The following components are currently available:
- PREDICTIVE PROPERTY: Descriptor-based models to predict, e.g., activity against a given target or solubility. Uses `scikit-learn` models and works with both classification and regression models.
- TANIMOTO SIMILARITY: Requires a user-defined set of SMILES and returns the highest similarity score to the provided set.
- JACCARD DISTANCE: Requires a user-defined set of SMILES and returns the lowest distance score to the provided set.
- MATCHING SUBSTRUCTURE: A penalty component to bias towards generating certain (sub)structures. Requires a user-defined set of SMARTS patterns. Returns 1 if there is a substructure match and 0.5 otherwise.
- CUSTOM ALERTS: A penalty component to avoid generating certain (sub)structures. Requires a user-defined set of SMARTS patterns indicating unwanted moieties. Returns 0 if there is a match and 1 otherwise.
- QED SCORE: Uses the QED implementation in `RDKit`.
- MOLECULAR WEIGHT: Physico-chemical property calculated by `RDKit`.
- TPSA: Physico-chemical property calculated by `RDKit`.
- ROTATABLE BONDS: Physico-chemical property calculated by `RDKit`.
- NUMBER OF HYDROGEN BOND DONORS: Physico-chemical property calculated by `RDKit`.
- NUMBER OF RINGS: Physico-chemical property calculated by `RDKit`.
- SELECTIVITY: Use this component to optimize activity against one target while reducing activity against another, i.e., to increase a compound's selectivity. Uses two `scikit-learn` models, one predicting the target activity and the other providing an off-target prediction; works with both classification and regression models. The score reflects a user-defined activity gap between the target and off-target predictions. A hedged sketch of such a component entry follows this list.
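As a concrete illustration, the sketch below shows what a SELECTIVITY component entry inside the scoring function's `"parameters"` list might look like. Every key name here is an assumption made for illustration; the shipped templates define the real schema.

```python
# Hypothetical SELECTIVITY component entry for the scoring function's
# "parameters" list. All key names below are assumptions for illustration;
# check the templates in reinvent/data/examples/templates for the real ones.
selectivity_component = {
    "component_type": "selectivity",            # assumed identifier
    "weight": 1,                                # assumed relative weight
    "model_path": "/path/to/target_model.pkl",  # scikit-learn target-activity model
    "offtarget_model_path": "/path/to/offtarget_model.pkl",  # off-target model
    "activity_gap": 1.0,                        # user-defined gap described above
}
```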
# To use `tensorboard` for logging

- To launch `tensorboard`, you need a graphical environment. Run:
  `tensorboard --logdir "path to your log output directory" --port=8008`
  This will give you an address to copy into a browser to access the graphical summaries from `tensorboard`.
- Further command-line parameters can be used to change the number of scalars, histograms, images, distributions and graphs shown, e.g.:
  `--samples_per_plugin=scalar=700,images=20`
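Putting the two together, a complete launch command could look like this (the log directory is a placeholder path):

```bash
# Launch tensorboard on port 8008 with a custom number of logged samples;
# the --logdir value is a placeholder to replace with your logging_path.
tensorboard --logdir "/path/to/logs" --port=8008 --samples_per_plugin=scalar=700,images=20
```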
# Installation
- Install Anaconda / Miniconda.
- Clone the repository.
- Open a terminal, go to the repository and create the environment:
  `conda env create -f reinvent_shared.yml`
- (Optional) To set environment variables (currently not needed), for example a license, run the following on the command line:

  ```bash
  cd $CONDA_PREFIX
  mkdir -p ./etc/conda/activate.d
  mkdir -p ./etc/conda/deactivate.d
  touch ./etc/conda/activate.d/env_vars.sh
  touch ./etc/conda/deactivate.d/env_vars.sh
  ```

  then edit `./etc/conda/activate.d/env_vars.sh` as follows:

  ```bash
  #!/bin/sh
  export SOME_LICENSE='/path/to/your/license/file'
  ```

  and finally, edit `./etc/conda/deactivate.d/env_vars.sh`:

  ```bash
  #!/bin/sh
  unset SOME_LICENSE
  ```

- Activate the environment:
  `conda activate reinvent_shared.v2.1`
- (Optional) In the project directory, in `./configs/`, create the file `config.json` by copying over `example.config.json` and editing it as required. In the current version this is only relevant for the unit tests.
- Use the tool.
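For convenience, the steps above condense to the following shell session; `<repository-url>`, `<repository-folder>` and the template choice are placeholders to adapt:

```bash
# Condensed install-and-run sequence; <repository-url> and
# <repository-folder> are placeholders for this repository.
git clone <repository-url>
cd <repository-folder>
conda env create -f reinvent_shared.yml
conda activate reinvent_shared.v2.1
# Edit a template's file and folder paths first, then run it:
python input.py reinvent/data/examples/templates/sampling.json
```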