
Bidirectional Molecule Generation with Recurrent Neural Networks

This is the supporting code for: Grisoni F., Moret M., Lingwood R., Schneider G., "Bidirectional Molecule Generation with Recurrent Neural Networks", Journal of Chemical Information and Modeling (2020), DOI: 10.1021/acs.jcim.9b00943, available at https://pubs.acs.org/doi/10.1021/acs.jcim.9b00943.

You can use this repository for the generation of SMILES with bidirectional recurrent neural networks (RNNs). In addition to the methods' code, several pre-trained models for each approach are included.

The following methods are implemented:

  * ForwardRNN
  * FBRNN (forward-backward RNN)
  * BIMODAL (bidirectional molecule design by alternate learning)
  * NADE (neural autoregressive distribution estimation)

See the publication for details on each approach.

Table of Contents

  1. Prerequisites
  2. Using the Code
    1. Sampling from a pre-trained model
    2. Training a model on your data
    3. Fine-tuning a model on your data
  3. Authors
  4. License
  5. How to cite
  6. Manuscript preprint

Prerequisites<a name="Prerequisites"></a>

This repository can be cloned with the following command:

git clone https://github.com/ETHmodlab/BIMODAL

To install the necessary packages to run the code, we recommend using conda. Once conda is installed, you can create the dedicated virtual environment:

cd path/to/repository/
conda env create -f brnn.yml

To activate the dedicated environment:

conda activate brnn

Your code should now be ready to use!
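To verify the setup, you can run a quick sanity check inside the activated environment (a minimal sketch; it assumes brnn.yml installs PyTorch, which this code base is built on):

# minimal sanity check: the import should succeed inside the brnn environment
import torch
print(torch.__version__)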

Using the code <a name="Using_the_code"></a>

Sampling from a pre-trained model <a name="Sample"></a>

In this repository, we provide you with 22 pre-trained models you can use for sampling (stored in evaluation/). These models were trained for 10 epochs on a set of 271,914 bioactive molecules from ChEMBL22 (K<sub>d</sub>/K<sub>i</sub>/IC<sub>50</sub>/EC<sub>50</sub> < 1 μM).

To sample SMILES, you can create a new file in model/ and use the Sampler class. For example, to sample from the pre-trained BIMODAL model with 512 units:

from sample import Sampler

experiment_name = 'BIMODAL_fixed_512'
s = Sampler(experiment_name)
# sample 100 SMILES at temperature 0.7 from fold 1, epoch 9 of the pre-trained model
s.sample(N=100, stor_dir='../evaluation', T=0.7, fold=[1], epoch=[9], valid=True, novel=True, unique=True, write_csv=True)

Parameters:

  * N: number of SMILES to sample
  * stor_dir: directory where the sampled SMILES are stored
  * T: sampling temperature
  * fold: list of folds of the model to use for sampling
  * epoch: list of training epochs from which to load the model
  * valid: if True, only valid SMILES are kept
  * novel: if True, only SMILES not contained in the training data are kept
  * unique: if True, duplicate SMILES are removed
  * write_csv: if True, the sampled SMILES are written to a .csv file

Notes:

  * experiment_name has to match the name of one of the pre-trained models stored in evaluation/.
  * fold and epoch have to refer to folds and epochs for which trained weights exist (the provided models were trained for 10 epochs).
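For intuition on the T parameter: the temperature rescales the network's output distribution before each token is drawn. A minimal NumPy sketch of the idea (illustrative only, not the repository's implementation):

import numpy as np

def sample_token(logits, T=0.7):
    # dividing the logits by T before the softmax sharpens the
    # distribution for T < 1 and flattens it for T > 1
    z = logits / T
    p = np.exp(z - np.max(z))  # numerically stable softmax
    p /= p.sum()
    return np.random.choice(len(p), p=p)

next_token = sample_token(np.array([2.0, 1.0, 0.5]))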

Training a model on your data

Alternatively, if you want to pre-train a model on your own data, you will need to execute three steps: (i) data preprocessing, (ii) training, and (iii) evaluation. Please be aware that you will need access to a GPU to pre-train your own model, as this is a computationally intensive step.

Preprocessing

Data can be processed by using preprocessing/main_preprocessor.py:

# run from within the preprocessing/ directory (paths are relative to it)
from main_preprocessor import preprocess_data

preprocess_data(filename_in='../data/chembl_smiles', model_type='BIMODAL', starting_point='fixed', augmentation=1)

Parameters:

  * filename_in: name of the input data file (given without extension), located in data/
  * model_type: model for which the data is preprocessed (ForwardRNN, FBRNN, BIMODAL or NADE)
  * starting_point: type of starting point for generation (fixed or random)
  * augmentation: level of data augmentation (number of SMILES representations generated per molecule)

Notes:

  * The processed data is stored in data/; the resulting file name and the padded string length (molecular_size) are needed for the training parameter file.
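For context on the augmentation parameter: SMILES augmentation enumerates alternative, equally valid SMILES strings of the same molecule. A minimal sketch of the idea with RDKit (illustrative only, not the repository's preprocessing code):

from rdkit import Chem

def augment(smiles, n=5):
    # generate n random but equivalent SMILES strings for one molecule
    mol = Chem.MolFromSmiles(smiles)
    return [Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)]

print(augment('c1ccccc1O'))  # e.g. ['Oc1ccccc1', 'c1ccc(O)cc1', ...]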

Training

Training requires a parameter file (.ini) with a given set of parameters. You can find examples for all models in experiments/, and further details about the parameters below:

| Section | Parameter | Description | Comments |
| --- | --- | --- | --- |
| Model | model | Model type | ForwardRNN, FBRNN, BIMODAL, NADE |
| Model | hidden_units | Number of hidden units | Suggested value: 256 for ForwardRNN, FBRNN and NADE; 128 for BIMODAL |
| Model | generation | To be defined only for NADE (other models defined through preprocessing) | fixed, random |
| Data | data | Name of data file | Has to be located in data/ |
| Data | encoding_size | Number of different SMILES tokens | 55 |
| Data | molecular_size | Length of string with padding | See preprocessing |
| Data | missing_token | To be added in the parameter file only for NADE | M |
| Training | epochs | Number of epochs | Suggested value: 10 |
| Training | learning_rate | Learning rate | Suggested value: 0.001 |
| Training | n_folds | Folds in cross-validation | More than 1 for cross_validation, 1 to use only one fold of the data for validation (see below) |
| Training | batch_size | Batch size | Suggested value: 128 |
| Evaluation | samples | Number of generated SMILES after each epoch | |
| Evaluation | temp | Sampling temperature | Suggested value: 0.7 |
| Evaluation | starting_token | Starting token for sampling | G for all models except NADE, which requires a sequence consisting of missing values (see publication) |
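For illustration, a parameter file for a BIMODAL run could look as follows (a sketch assembled from the table above: the values are the suggested defaults or placeholders, and the exact section and key spelling should be taken from the templates in experiments/):

[Model]
model = BIMODAL
hidden_units = 128

[Data]
data = chembl_smiles
encoding_size = 55
; placeholder: use the padded length produced by preprocessing
molecular_size = 150

[Training]
epochs = 10
learning_rate = 0.001
; more than 1 for cross_validation, 1 for a single run
n_folds = 5
batch_size = 128

[Evaluation]
; placeholder: number of SMILES to generate after each epoch
samples = 100
temp = 0.7
starting_token = G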

Note: the value of n_folds has to be consistent with the training option chosen below (cross-validation or single run).

Options for training:

To train with n-fold cross-validation (n_folds > 1 in the parameter file):

from trainer import Trainer

t = Trainer(experiment_name = 'BIMODAL_fixed_512')
t.cross_validation(stor_dir = '../evaluation/', restart = False)

To train with a single fold of the data held out for validation (n_folds = 1):

from trainer import Trainer

t = Trainer(experiment_name = 'BIMODAL_fixed_512')
t.single_run(stor_dir = '../evaluation/', restart = False)

Parameters:

  * experiment_name: name of the parameter file (without the .ini extension) in experiments/
  * stor_dir: directory where the trained models, losses and sampled SMILES are stored
  * restart: set to True to resume a previously interrupted run from the stored results

Evaluation

You can evaluate the outputs of your experiment with evaluation/main_evaluator.py, which provides the following options:

from evaluation import Evaluator

stor_dir = '../evaluation/'
e = Evaluator(experiment_name = 'BIMODAL_fixed_512')
# Plot training and validation loss within one figure
e.eval_training_validation(stor_dir=stor_dir)
# Plot percentage of novel, valid and unique SMILES
e.eval_molecule(stor_dir=stor_dir)

Parameters:

  * experiment_name: name of the experiment to evaluate (has to match the name used for training)
  * stor_dir: directory where the training results were stored and where the plots are saved

Note: evaluation relies on the losses and SMILES sampled after each epoch during training, so it can only be run once training has produced results in stor_dir.
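Independently of the Evaluator plots, you can spot-check the validity of sampled SMILES with RDKit (a minimal sketch, assuming RDKit is available in your environment):

from rdkit import Chem

smiles = ['CCO', 'c1ccccc1', 'C1CC']  # replace with your sampled strings
valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print('{}/{} valid'.format(len(valid), len(smiles)))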

Fine-tuning a model<a name="Finetuning"></a>

Fine-tuning requires a pre-trained model and a parameter file (.ini). Examples of the parameter files (BIMODAL and ForwardRNN) are provided in experiments/.

You can start the fine-tuning procedure with model/main_fine_tuner.py.

| Section | Parameter | Description | Comments |
| --- | --- | --- | --- |
| Model | model | Model type | ForwardRNN, FBRNN, BIMODAL, NADE |
| Model | hidden_units | Number of hidden units | Suggested value: 256 for ForwardRNN, FBRNN and NADE; 128 for BIMODAL |
| Model | generation | Only for NADE (other models defined through preprocessing) | fixed, random |
| Data | data | Name of data file | Has to be located in data/ |
| Data | encoding_size | Number of different SMILES tokens | 55 |
| Data | molecular_size | Length of string with padding | See preprocessing |
| Data | missing_token | To be added in the parameter file only for NADE | M |
| Training | epochs | Number of epochs | Suggested value: 10 |
| Training | learning_rate | Learning rate | Suggested value: 0.001 |
| Training | batch_size | Batch size | Suggested value: 128 |
| Evaluation | samples | Number of generated SMILES after each epoch | |
| Evaluation | temp | Sampling temperature | Suggested value: 0.7 |
| Evaluation | starting_token | Starting token for sampling | G for all models except NADE, which requires a sequence consisting of missing values (see publication) |
| Fine-Tuning | start_model | Name of pre-trained model to be used for fine-tuning | |

To fine-tune a model, you can run:

from fine_tuner import FineTuner

t = FineTuner(experiment_name = 'BIMODAL_random_512_FineTuning_template')
t.fine_tuning(stor_dir='../evaluation/', restart=False)

Parameters:

  * experiment_name: name of the fine-tuning parameter file (without the .ini extension) in experiments/
  * stor_dir: directory where the fine-tuned models, losses and sampled SMILES are stored
  * restart: set to True to resume a previously interrupted run

Note: the architecture defined in the parameter file (model type and hidden_units) has to match the pre-trained model given in start_model.
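Once fine-tuning has finished, you can sample from the tuned model with the same Sampler class described above (a sketch; the fold and epoch values are assumptions and have to match your fine-tuning run):

from sample import Sampler

s = Sampler('BIMODAL_random_512_FineTuning_template')
s.sample(N=100, stor_dir='../evaluation', T=0.7, fold=[1], epoch=[9], valid=True, novel=True, unique=True, write_csv=True)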

Authors<a name="Authors"></a>

Francesca Grisoni, Michael Moret, Robin Lingwood, Gisbert Schneider. See also the list of contributors who participated in this project.

License<a name="License"></a>

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This code is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

How to Cite <a name="cite"></a>

If you use this code (or parts thereof), please cite it as:

@article{grisoni2020,
  title = {Bidirectional Molecule Generation with Recurrent Neural Networks},
  author = {Grisoni, Francesca and Moret, Michael and Lingwood, Robin and Schneider, Gisbert},
  journal = {Journal of Chemical Information and Modeling},
  volume = {60},
  number = {3},
  pages = {1175--1183},
  year = {2020},
  doi = {10.1021/acs.jcim.9b00943},
  url = {https://pubs.acs.org/doi/10.1021/acs.jcim.9b00943},
  publisher = {ACS Publications}
}

Manuscript Preprint<a name="Preprint"></a>

A preprint (not peer-reviewed) of the original manuscript is available as a PDF in this repository (preprint folder). This document is the unedited version of a Submitted Work that was subsequently accepted for publication in the Journal of Chemical Information and Modeling, copyright © American Chemical Society, after peer review. To access the final edited and published work, see https://pubs.acs.org/doi/10.1021/acs.jcim.9b00943.