De novo molecule design with chemical language models
In this repository, you will find a hands-on tutorial to generate focused libraries using RNN-based chemical language models.
This code accompanies the protocol chapter: Grisoni F., Schneider G. (2022) De Novo Molecular Design with Chemical Language Models. In: Heifetz A. (eds) Artificial Intelligence in Drug Design. Methods in Molecular Biology, vol 2390. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1787-8_9
The code for the following two methods is provided:
- Bidirectional Molecule Design by Alternate Learning (BIMODAL), designed for SMILES generation – see Grisoni et al. 2020.
- Forward RNN, i.e., "classical" unidirectional RNN for SMILES generation.
In addition to the method code, several pre-trained models are included.
Note! This repository contains the code for the hands-on chapter and is intended for teaching purposes only. To use the most up-to-date versions of the methods, have a look at the following repositories:
- https://github.com/ETHmodlab/BIMODAL for BIMODAL.
- https://github.com/ETHmodlab/virtual_libraries for unidirectional RNNs.
Happy coding!
Getting started <a name="Prerequisites"></a>
This repository can be cloned with the following command:
git clone https://github.com/ETHmodlab/de_novo_design_RNN
To install the necessary packages to run the code, we recommend using conda. Once conda is installed, you can create the virtual environment as follows:
cd path/to/repository/
conda env create -f environment.yml
To activate the dedicated environment:
conda activate de_novo
Your code is now ready to use!
Using the code <a name="Using_the_code"></a>
Provided Jupyter notebook <a name="notebook"></a>
In this repository, you can find a Jupyter notebook that will help you get started with the code. We recommend having a look at the notebook first.
To use the provided notebook, move to the “example” folder and launch the Jupyter Notebook application, as follows:
cd example
jupyter notebook
A webpage will open, showing the content of the “example” folder. Double-clicking the file “de_novo_design_pipeline.ipynb” opens the notebook. Each line of the provided code can be executed to visualize and reproduce the results of this tutorial. Below, you will also find additional details on tuning more advanced settings.
Sampling from a pre-trained model <a name="Sample"></a>
In this repository, we provide 22 pre-trained models that you can use for sampling (stored in evaluation/). These models were trained for 10 epochs on a set of 271,914 bioactive molecules from ChEMBL22 (K<sub>d/I</sub>/IC<sub>50</sub>/EC<sub>50</sub> < 1 μM).
To sample SMILES, you can create a new file in model/ and use the Sampler class. For example, to sample from the pre-trained BIMODAL model with 512 units:
from sample import Sampler
experiment_name = 'BIMODAL_fixed_512'
s = Sampler(experiment_name)
s.sample(N=100, stor_dir='../evaluation', T=0.7, fold=[1], epoch=[9], valid=True, novel=True, unique=True, write_csv=True)
Parameters:
- experiment_name (str): name of the experiment with pre-trained model you want to sample from (you can find pre-trained models in evaluation/)
- stor_dir (str): directory where the models are stored. The sampled SMILES will also be saved there (if write_csv=True)
- N (int): number of SMILES to sample
- T (float): sampling temperature
- fold (list of int): fold(s) to use for sampling
- epoch (list of int): epoch(s) to use for sampling
- valid (bool): if set to True, only valid SMILES are accepted (increases the sampling time)
- novel (bool): if set to True, only novel SMILES are accepted (increases the sampling time)
- unique (bool): if set to True, only unique SMILES are accepted (increases the sampling time)
- write_csv (bool): if set to True, a .csv file of the generated SMILES is exported to the specified directory.
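The effect of the sampling temperature T can be illustrated with a small, self-contained sketch. It is not the code used by the Sampler class; it only shows the general idea of temperature-scaled sampling, in which the model's scores are divided by T before being converted to probabilities. The token scores below are made-up values, and the function names are hypothetical.

```python
import math
import random


def temperature_softmax(logits, T):
    """Convert raw scores into probabilities after dividing them by T."""
    if T <= 0:
        raise ValueError("T must be positive")
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, tokens, T, rng):
    """Draw one token according to the temperature-scaled probabilities."""
    probs = temperature_softmax(logits, T)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical scores for three SMILES tokens (illustrative values only).
tokens = ["C", "N", "O"]
logits = [2.0, 1.0, 0.5]

p_low = temperature_softmax(logits, T=0.2)   # sharply peaked on the top token
p_mid = temperature_softmax(logits, T=0.7)   # the value used in the example above
p_high = temperature_softmax(logits, T=2.0)  # closer to uniform

rng = random.Random(0)
drawn = [sample_token(logits, tokens, 0.7, rng) for _ in range(5)]
```

In this sketch, lower values of T concentrate the probability on the highest-scoring token, while higher values flatten the distribution and make rarer tokens more likely.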
Notes:
- For the provided pre-trained models, only fold=[1] and epoch=[9] are available.
- The list of available models and their description are provided in evaluation/model_names.md
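The valid, novel, and unique options described above can be thought of as filters applied to the sampled strings. The following is a minimal sketch of that idea, not the implementation in this repository. The function names are hypothetical, and the validity check is a deliberately simple stand-in (balanced parentheses); a real pipeline would use a cheminformatics toolkit instead. String comparison also assumes that sampled and training SMILES are written in the same canonical form.

```python
def balanced_parentheses(smi):
    """Toy stand-in for a validity check; NOT a chemistry-aware test."""
    depth = 0
    for ch in smi:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(smi) > 0


def filter_smiles(sampled, training_set, is_valid,
                  valid=True, novel=True, unique=True):
    """Keep only the strings that pass the requested filters, in order."""
    training = set(training_set)
    seen = set()
    kept = []
    for smi in sampled:
        if valid and not is_valid(smi):
            continue  # rejected: fails the validity check
        if novel and smi in training:
            continue  # rejected: already present in the training data
        if unique and smi in seen:
            continue  # rejected: duplicate of an earlier sample
        seen.add(smi)
        kept.append(smi)
    return kept


# Made-up samples: one duplicate, one unbalanced string, one training molecule.
sampled = ["CCO", "CCO", "CC(=O)O", "CC(C", "c1ccccc1"]
training_set = ["c1ccccc1"]

kept = filter_smiles(sampled, training_set, balanced_parentheses)
kept_with_duplicates = filter_smiles(sampled, training_set,
                                     balanced_parentheses, unique=False)
```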
Fine-tuning a model<a name="Finetuning"></a>
Fine-tuning requires a pre-trained model and a parameter file (.ini). Examples of the parameter files (BIMODAL and ForwardRNN) are provided in experiments/.
The fine-tuning set needs to be pre-processed, see next section.
You can start the fine-tuning procedure with model/main_fine_tuner.py
Section | Parameter | Description | Comments |
---|---|---|---|
Model | model | Type | ForwardRNN, BIMODAL |
Model | hidden_units | Number of hidden units | Suggested value: 256 for ForwardRNN; 128 for BIMODAL |
Data | data | Name of data file | Has to be located in data/ |
Data | encoding_size | Number of different SMILES tokens | 55 |
Data | molecular_size | Length of string with padding | See preprocessing |
Training | epochs | Number of epochs | Suggested value: 10 |
Training | learning_rate | Learning rate | Suggested value: 0.001 |
Training | batch_size | Batch size | Suggested value: 128 |
Evaluation | samples | Number of generated SMILES after each epoch | |
Evaluation | temp | Sampling temperature | Suggested value: 0.7 |
Evaluation | starting_token | Starting token for sampling | G |
Fine-Tuning | start_model | Name of pre-trained model to be used for fine-tuning | |
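To make the layout of such a parameter file more concrete, here is a minimal sketch that parses an .ini-style text with Python's standard configparser module. The section and key names follow the table above, but their exact spelling and capitalization in the files under experiments/ is an assumption, and the values for data, molecular_size, samples, and start_model are placeholders rather than recommended settings.

```python
import configparser

# Hypothetical parameter file. Names mirror the table above; values marked
# "placeholder" are illustrative only and must be adapted to your data.
EXAMPLE_INI = """
[Model]
model = BIMODAL
hidden_units = 128

[Data]
# placeholder file name (the file has to be located in data/)
data = my_fine_tuning_set
encoding_size = 55
# placeholder; the real value comes from preprocessing
molecular_size = 100

[Training]
epochs = 10
learning_rate = 0.001
batch_size = 128

[Evaluation]
# placeholder number of SMILES generated after each epoch
samples = 100
temp = 0.7
starting_token = G

[Fine-Tuning]
# placeholder name of the pre-trained model
start_model = name_of_pretrained_model
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)

# Typed accessors convert the stored strings to numbers.
model_type = config.get("Model", "model")
hidden_units = config.getint("Model", "hidden_units")
learning_rate = config.getfloat("Training", "learning_rate")
batch_size = config.getint("Training", "batch_size")
temp = config.getfloat("Evaluation", "temp")
starting_token = config.get("Evaluation", "starting_token")
```

Comparing a parsed file against the table in this way is a quick check that no parameter is missing or misspelled before a fine-tuning run is started.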
To fine-tune a model, you can run:
t = FineTuner(experiment_name = 'BIMODAL_random_512_FineTuning_template')
t.fine_tuning(stor_dir='../evaluation/', restart=False)
Parameters:
- experiment_name: Name of the parameter file (.ini)
- stor_dir: Directory where outputs can be found
- restart: If True, automatic restart from saved models (e.g. to be used if your training was interrupted before completion)
Note:
- The batch size should not exceed the number of SMILES that you have in your fine-tuning file (taking into account the data augmentation).
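The constraint on the batch size can be checked before training with a few lines of code. This is a minimal sketch with hypothetical function names. It assumes that an augmentation value of k yields k SMILES strings per molecule, so that the number of available strings is the number of molecules multiplied by k; the actual behavior of the preprocessing code may differ.

```python
def effective_set_size(n_smiles, augmentation=1):
    """Number of training strings, assuming `augmentation` strings per molecule."""
    if n_smiles < 1 or augmentation < 1:
        raise ValueError("n_smiles and augmentation must be at least 1")
    return n_smiles * augmentation


def largest_usable_batch_size(requested, n_smiles, augmentation=1):
    """Cap the requested batch size at the number of available strings."""
    return min(requested, effective_set_size(n_smiles, augmentation))


# Hypothetical fine-tuning set of 30 molecules and a requested batch size of 128.
with_augmentation = largest_usable_batch_size(128, n_smiles=30, augmentation=5)
without_augmentation = largest_usable_batch_size(128, n_smiles=30, augmentation=1)
```

Under this assumption, 30 molecules with 5-fold augmentation give 150 strings, so a batch size of 128 fits; without augmentation, the batch size would have to be reduced to 30 or less.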
Preprocessing <a name="preprocessing"></a>
Data can be processed by using preprocessing/main_preprocessor.py:
from main_preprocessor import preprocess_data
preprocess_data(filename_in='../data/chembl_smiles', model_type='BIMODAL', starting_point='fixed', augmentation=1)
Parameters:
- filename_in (str): name of the file containing the SMILES strings (.csv or .tar.xz)
- model_type (str): name of the chosen generative method
- starting_point (str): starting point type ('fixed' or 'random')
- augmentation (int): augmentation folds [Default = 1]
Notes:
- In preprocessing/main_preprocessor.py you will find info regarding advanced options for pre-processing (e.g., stereochemistry, canonicalization, etc.)
- Please note that the pre-processed data have to be stored in data/.
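The molecular_size parameter ("length of string with padding") can be illustrated with a simple padding sketch. This is not the repository's preprocessing code. It assumes a plain left-to-right layout with the starting token G at the beginning, and it introduces an end token E and a padding token A purely for illustration. It also treats every character as one token, which is a simplification (for example, "Cl" would count as two tokens here). The layout used for BIMODAL with a 'fixed' or 'random' starting point may differ.

```python
def pad_smiles(smiles, molecular_size, start="G", end="E", pad="A"):
    """Frame a SMILES string with start/end tokens and pad it to a fixed length.

    The end and pad tokens are assumptions made for this sketch.
    """
    framed = start + smiles + end
    if len(framed) > molecular_size:
        raise ValueError("molecular_size is too small for this SMILES string")
    return framed + pad * (molecular_size - len(framed))


# Made-up input set; the fixed length is the longest string plus two tokens.
smiles_set = ["CCO", "c1ccccc1"]
molecular_size = max(len(s) for s in smiles_set) + 2

padded = [pad_smiles(s, molecular_size) for s in smiles_set]
```

In this sketch, every padded string has the same length, which is the value that would be entered as molecular_size in the parameter file.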
Advanced functions <a name="advanced"></a>
If you want to personalize the pre-training or use advanced settings, please refer to the following repo: https://github.com/ETHmodlab/BIMODAL
Authors<a name="Authors"></a>
Authors of the provided code (as in this repo)
- Robin Lingwood (https://github.com/robinlingwood)
- Francesca Grisoni (https://github.com/grisoniFr)
- Michael Moret (https://github.com/michael1788)
Author of this tutorial
- Francesca Grisoni (https://github.com/grisoniFr)
See also the list of contributors who participated in this project.
License<a name="License"></a>
<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This code is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.
How to Cite <a name="cite"></a>
If you use this code (or parts thereof), please cite it as:
@article{grisoni2020,
title = {Bidirectional Molecule Generation with Recurrent Neural Networks},
author = {Grisoni, Francesca and Moret, Michael and Lingwood, Robin and Schneider, Gisbert},
journal = {Journal of Chemical Information and Modeling},
volume = {60},
number = {3},
pages = {1175--1183},
year = {2020},
doi = {10.1021/acs.jcim.9b00943},
url = {https://pubs.acs.org/doi/10.1021/acs.jcim.9b00943},
publisher = {ACS Publications}
}
@incollection{grisoni2021,
author = {Grisoni, Francesca and Schneider, Gisbert},
title = {De novo Molecule Design with Chemical Language Models},
booktitle = {Artificial Intelligence in Drug Design},
publisher = {Springer},
year = 2021,
volume = {2390},
series = {Methods in Molecular Biology},
pages = {207--232},
address = {New York, NY},
}