De novo molecule design with chemical language models

In this repository, you will find a hands-on tutorial for generating focused libraries with RNN-based chemical language models. The code accompanies the protocol chapter: Grisoni F., Schneider G. (2022) De Novo Molecular Design with Chemical Language Models. In: Heifetz A. (eds) Artificial Intelligence in Drug Design. Methods in Molecular Biology, vol 2390. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1787-8_9

The code for the following two methods is provided:

Note: This repository contains the code for the hands-on chapter and is intended for teaching purposes only. To use the most up-to-date versions of the methods, have a look at the following repositories:

Happy coding!

Table of Contents

  1. Getting started
  2. Using the Code
    1. Provided Jupyter notebook
    2. Sampling from a pre-trained model
    3. Fine-tuning a model on your data
    4. Data pre-processing
  3. Advanced functions
  4. Authors
  5. License
  6. How to cite

Getting started <a name="Prerequisites"></a>

This repository can be cloned with the following command:

git clone https://github.com/ETHmodlab/de_novo_design_RNN

To install the necessary packages to run the code, we recommend using conda. Once conda is installed, you can create the virtual environment as follows:

cd path/to/repository/
conda env create -f environment.yml

To activate the dedicated environment:

conda activate de_novo

Your code is now ready to use!
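
As a quick sanity check after activating the environment, a short script along the following lines can confirm that the key packages are importable. This is only a sketch: the package list below is an assumption (the authoritative list is in environment.yml), and `missing_packages` is a hypothetical helper, not part of this repository.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list; check environment.yml for the authoritative one.
required = ["torch", "numpy"]

missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If any package is reported as missing, re-creating the conda environment from environment.yml is usually the simplest fix.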

Using the code <a name="Using_the_code"></a>

Provided Jupyter notebook <a name="notebook"></a>

In this repository, you can find a Jupyter notebook that will help you get started with using the code. We recommend having a look at the notebook first.

To use the provided notebook, move to the “example” folder and launch the Jupyter Notebook application, as follows:

cd example
jupyter notebook

A webpage will open, showing the content of the “example” folder. Double-clicking the file “de_novo_design_pipeline.ipynb” opens the notebook. Each cell of the provided code can be executed to visualize and reproduce the results of this tutorial. Below, you will also find additional details on more advanced settings.

Sampling from a pre-trained model <a name="Sample"></a>

In this repository, we provide you with 22 pre-trained models you can use for sampling (stored in evaluation/). These models were trained on a set of 271,914 bioactive molecules from ChEMBL22 (K<sub>d/I</sub>/IC<sub>50</sub>/EC<sub>50</sub> <1μM), for 10 epochs.

To sample SMILES, you can create a new file in model/ and use the Sampler class. For example, to sample from the pre-trained BIMODAL model with 512 units:

from sample import Sampler
experiment_name = 'BIMODAL_fixed_512'
s = Sampler(experiment_name)
s.sample(N=100, stor_dir='../evaluation', T=0.7, fold=[1], epoch=[9], valid=True, novel=True, unique=True, write_csv=True)

Parameters:

Notes:
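
To illustrate what the sampling temperature `T` controls, here is a minimal, framework-free sketch of temperature-scaled sampling from a vector of token scores. It is not taken from the `Sampler` class, whose internals may differ; `softmax_with_temperature` and `sample_token` are hypothetical helper names, and the logits are made-up numbers.

```python
import math
import random


def softmax_with_temperature(logits, T=0.7):
    """Convert raw scores into probabilities after dividing them by T."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, T=0.7, rng=random):
    """Draw one token index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, T)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up scores for a three-token vocabulary.
logits = [2.0, 1.0, 0.1]
for T in (0.2, 0.7, 1.5):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```

Lower temperatures concentrate probability on the highest-scoring token (more conservative sampling), while higher temperatures flatten the distribution and yield more diverse, but potentially less valid, SMILES.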

Fine-tuning a model <a name="Finetuning"></a>

Fine-tuning requires a pre-trained model and a parameter file (.ini). Examples of the parameter files (BIMODAL and ForwardRNN) are provided in experiments/.

The fine-tuning set needs to be pre-processed first (see the Preprocessing section below).

You can start the fine-tuning procedure with model/main_fine_tuner.py.

| Section | Parameter | Description | Comments |
| --- | --- | --- | --- |
| Model | model | Type | ForwardRNN, BIMODAL |
| | hidden_units | Number of hidden units | Suggested value: 256 for ForwardRNN; 128 for BIMODAL |
| Data | data | Name of data file | Has to be located in data/ |
| | encoding_size | Number of different SMILES tokens | 55 |
| | molecular_size | Length of string with padding | See preprocessing |
| Training | epochs | Number of epochs | Suggested value: 10 |
| | learning_rate | Learning rate | Suggested value: 0.001 |
| | batch_size | Batch size | Suggested value: 128 |
| Evaluation | samples | Number of generated SMILES after each epoch | |
| | temp | Sampling temperature | Suggested value: 0.7 |
| | starting_token | Starting token for sampling | G |
| Fine-Tuning | start_model | Name of pre-trained model to be used for fine-tuning | |
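
The parameters listed above could be laid out in a .ini file roughly as sketched below, and read back with Python's standard `configparser`. This is an illustration only: the file names, the `samples` and `molecular_size` values, and the `start_model` path are placeholders, and the exact section and key spelling should be checked against the templates in experiments/.

```python
import configparser

# Illustrative parameter file; values marked as placeholders are assumptions.
example_ini = """
[Model]
model = BIMODAL
hidden_units = 128

[Data]
# placeholder file name; the file has to be located in data/
data = my_fine_tuning_set
encoding_size = 55
# placeholder; use the padded length produced by preprocessing
molecular_size = 100

[Training]
epochs = 10
learning_rate = 0.001
batch_size = 128

[Evaluation]
# placeholder number of SMILES generated after each epoch
samples = 100
temp = 0.7
starting_token = G

[Fine-Tuning]
# placeholder path to the pre-trained model
start_model = path/to/pretrained_model
"""

config = configparser.ConfigParser()
config.read_string(example_ini)

# Typed accessors convert the stored strings to numbers.
hidden_units = config.getint("Model", "hidden_units")
learning_rate = config.getfloat("Training", "learning_rate")
print(config["Model"]["model"], hidden_units, learning_rate)
```

Reading the file this way before launching a run is a cheap way to catch typos in section or parameter names.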

To fine-tune a model, you can run:

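# Assumed module name; adjust to the file in model/ that defines FineTuner.
from fine_tuner import FineTuner

# The experiment name must match a parameter file (.ini) in experiments/.
t = FineTuner(experiment_name='BIMODAL_random_512_FineTuning_template')
t.fine_tuning(stor_dir='../evaluation/', restart=False)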

Parameters:

Note:

Preprocessing <a name="preprocessing"></a>

Data can be processed by using preprocessing/main_preprocessor.py:

from main_preprocessor import preprocess_data
preprocess_data(filename_in='../data/chembl_smiles', model_type='BIMODAL', starting_point='fixed', augmentation=1)

Parameters:

Notes:
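
To give an intuition for what "length of string with padding" (molecular_size) means, the sketch below wraps a SMILES string in start and end tokens and pads it to a fixed length. It is a simplified illustration, not the repository's preprocessing code: only the starting token `G` is documented here, while the end token `E`, the padding token `A`, right-side padding, and the helper name `pad_smiles` are assumptions. It also treats every character as one token, which ignores multi-character tokens such as `Cl` or `Br`.

```python
def pad_smiles(smiles, molecular_size, start="G", end="E", pad="A"):
    """Wrap a SMILES string in start/end tokens and right-pad it to a fixed length."""
    wrapped = start + smiles + end
    if len(wrapped) > molecular_size:
        raise ValueError(
            f"'{smiles}' needs {len(wrapped)} positions, "
            f"but molecular_size is {molecular_size}"
        )
    return wrapped + pad * (molecular_size - len(wrapped))


padded = pad_smiles("CCO", molecular_size=10)
print(padded)  # GCCOEAAAAA
```

With a fixed length, every molecule in the fine-tuning set can be encoded as an array of the same shape, which is why the padded length has to match the molecular_size entry of the parameter file.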

Advanced functions <a name="advanced"></a>

If you want to customize the pre-training or use advanced settings, please refer to the following repository: https://github.com/ETHmodlab/BIMODAL

Authors<a name="Authors"></a>

Authors of the provided code (as in this repo)

Author of this tutorial

See also the list of contributors who participated in this project.

License<a name="License"></a>

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This code is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

How to Cite <a name="cite"></a>

If you use this code (or parts thereof), please cite it as:

@article{grisoni2020,
  title         = {Bidirectional Molecule Generation with Recurrent Neural Networks},
  author        = {Grisoni, Francesca and Moret, Michael and Lingwood, Robin and Schneider, Gisbert},
  journal       = {Journal of Chemical Information and Modeling},
  volume        = {60},
  number        = {3},
  pages         = {1175--1183},
  year          = {2020},
  doi           = {10.1021/acs.jcim.9b00943},
  url           = {https://pubs.acs.org/doi/10.1021/acs.jcim.9b00943},
  publisher     = {ACS Publications}
}
@incollection{grisoni2021,
  author       = {Grisoni, Francesca and Schneider, Gisbert},
  title        = {De novo Molecule Design with Chemical Language Models},
  booktitle    = {Artificial Intelligence in Drug Design},
  publisher    = {Springer},
  year         = 2021,
  volume       = {2390},
  series       = {Methods in Molecular Biology},
  pages        = {207--232},
  address      = {New York, NY}
}