Note: migration to the newer version of OpenNMT is in progress. See the "reborn" branch.

Molecular Transformer for Reagents Prediction

This is the code for the paper Reagent Prediction with a Molecular Transformer Improves Reaction Data Quality.
The repository is effectively a fork of the Molecular Transformer.

Code

Old files:

onmt is a directory with the OpenNMT code.
preprocess.py, train.py, translate.py, score_predictions.py are files used in the Molecular Transformer code.

New files:

The src folder contains the preprocessing code for reagent prediction.
prepare_data.py is the main script that preprocesses USPTO for reagent prediction with the Molecular Transformer.
environment.yml is the conda environment specification.

Data

The entire USPTO data used to assemble the training set for reagent prediction can be downloaded here.

The training set for reagent prediction was obtained from it using the prepare_data.py script. It does not overlap with the USPTO MIT test set. The tokenized data can be downloaded here.

The tokenized data for product prediction is stored here. For the description of these data, please refer to the README of the original Molecular Transformer.

The data for product prediction with altered reagents can be downloaded here.

Workflow

  1. Create a conda environment from the specification file:

    conda env create -f environment.yml
    conda activate reagents_pred
    

    Install PyTorch separately:

    wget https://download.pytorch.org/whl/cu92/torch-0.4.1-cp36-cp36m-linux_x86_64.whl
    pip install torch-0.4.1-cp36-cp36m-linux_x86_64.whl
    rm torch-0.4.1-cp36-cp36m-linux_x86_64.whl
    pip install torchtext==0.3.1
    
  2. Download the datasets and put them in the data/tokenized directory.

  3. Train a reagent prediction model. First, preprocess the data for an OpenNMT model:

        python3 preprocess.py -train_src data/tokenized/${DATASET_NAME}/src-train.txt \
                              -train_tgt data/tokenized/${DATASET_NAME}/tgt-train.txt \
                              -valid_src data/tokenized/${DATASET_NAME}/src-val.txt \
                              -valid_tgt data/tokenized/${DATASET_NAME}/tgt-val.txt \
                              -save_data data/tokenized/${DATASET_NAME}/${DATASET_NAME} \
                              -src_seq_length 1000 -tgt_seq_length 1000 \
                              -src_vocab_size 1000 -tgt_vocab_size 1000 -share_vocab
    

    Then, train the model:

        python3 train.py -data data/tokenized/${DATASET_NAME}/${DATASET_NAME} \
                         -save_model experiments/checkpoints/${DATASET_NAME}/${DATASET_NAME}_model \
                         -seed 42 -gpu_ranks 0 -save_checkpoint_steps 10000 -keep_checkpoint 20 \
                         -train_steps 500000 -param_init 0  -param_init_glorot -max_generator_batches 32 \
                         -batch_size 4096 -batch_type tokens -normalization tokens -max_grad_norm 0  -accum_count 4 \
                         -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam -warmup_steps 8000  \
                         -learning_rate 2 -label_smoothing 0.0 -report_every 10 \
                         -layers 4 -rnn_size 256 -word_vec_size 256 -encoder_type transformer -decoder_type transformer \
                         -dropout 0.1 -position_encoding -share_embeddings \
                         -global_attention general -global_attention_function softmax -self_attn_type scaled-dot \
                         -heads 8 -transformer_ff 2048 -tensorboard
    
  4. Train a baseline product prediction model:
    Train a Molecular Transformer on, say, the MIT_separated data. To do so, run the preprocess.py and train.py scripts
    as shown above, but with DATASET_NAME=MIT_separated.

  5. Use a trained reagent model to improve reagents in a dataset for product prediction.
    The script reagent_substitution.py uses a reagent prediction model to replace the reagents in the data that serve as
    input to a product prediction model. To replace the reagents in, say, MIT_separated, run the script as follows:

        python3 reagent_substitution.py --data_dir data/tokenized/MIT_separated \
                                        --reagent_model <MODEL_NAME> \
                                        --reagent_model_vocab <MODEL_SRC_VOCAB> \
                                        --beam_size 5 --gpu 0
    

    MODEL_NAME may be stored in experiments/checkpoints/. MODEL_SRC_VOCAB (a .json file) may be stored in data/vocabs/.
    Alternatively, download the final data here.

  6. Train product prediction models on datasets cleaned by a reagent prediction model as in step 5.

The trained reagent and product models are stored here as .pt files.
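
Before running preprocess.py in step 3, it can help to confirm that the files from step 2 are in place. The following is a minimal sketch and not part of the repository: the helper name check_dataset is hypothetical, and it assumes only the file names that appear in the preprocess.py command above. It checks that each source file exists and has as many lines as its target file.

```python
from pathlib import Path

# Splits referenced by the preprocess.py command in step 3.
SPLITS = ("train", "val")


def count_lines(path: Path) -> int:
    """Count lines without loading the whole file into memory."""
    with path.open("r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_dataset(root: str, dataset_name: str) -> dict:
    """Return {split: number_of_examples}; raise if files are missing or misaligned.

    Hypothetical helper for illustration, not a script from this repository.
    """
    data_dir = Path(root) / dataset_name
    sizes = {}
    for split in SPLITS:
        src = data_dir / f"src-{split}.txt"
        tgt = data_dir / f"tgt-{split}.txt"
        for path in (src, tgt):
            if not path.is_file():
                raise FileNotFoundError(f"Missing file: {path}")
        n_src, n_tgt = count_lines(src), count_lines(tgt)
        if n_src != n_tgt:
            raise ValueError(
                f"{split}: {n_src} source lines vs {n_tgt} target lines"
            )
        sizes[split] = n_src
    return sizes
```

For example, `check_dataset("data/tokenized", "MIT_separated")` would return the number of training and validation examples, or raise an error that names the missing or misaligned file.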

Inference

To make predictions for reactions supplied in a .txt file as SMILES, use the following script:

```bash
    python3 translate.py -model <PATH TO THE MODEL WEIGHTS> \
                -src <PATH TO THE TEST REACTIONS WITHOUT REAGENTS> \
                -output <PATH TO THE .TXT FILE WHERE THE PREDICTIONS WILL BE STORED> \
                -batch_size 64 -replace_unk -max_length 200 -fast -beam_size 5 -n_best 5 -gpu <GPU ID>
```
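
With `-n_best 5`, the output file is expected to contain five predictions per input reaction. The sketch below regroups such a file, assuming the candidates for each input are written on consecutive lines (best first), which is how OpenNMT normally writes n-best output. The function names group_n_best and detokenize are hypothetical and not part of this repository.

```python
from typing import Iterable, List


def group_n_best(lines: Iterable[str], n_best: int = 5) -> List[List[str]]:
    """Split a flat sequence of predictions into one list of candidates per input.

    Assumes the n_best candidates for each input occupy consecutive lines.
    """
    if n_best < 1:
        raise ValueError("n_best must be a positive integer")
    cleaned = [line.rstrip("\n") for line in lines]
    if len(cleaned) % n_best != 0:
        raise ValueError(
            f"{len(cleaned)} predictions cannot be split into groups of {n_best}"
        )
    return [cleaned[i:i + n_best] for i in range(0, len(cleaned), n_best)]


def detokenize(tokenized_smiles: str) -> str:
    """Remove the spaces between tokens to recover a plain SMILES string."""
    return "".join(tokenized_smiles.split())
```

Reading the output file with `open(path).readlines()` and passing the result to `group_n_best` gives one list of candidates per reaction; `detokenize` can then be applied to each candidate.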

The supplied reactions should be tokenized with tokens separated by spaces, like in files produced by prepare_data.py.
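
As an illustration of the expected input format, the sketch below splits a SMILES string into space-separated tokens. The regular expression is the one commonly used with the Molecular Transformer; whether prepare_data.py applies exactly this pattern is an assumption, and tokenize_smiles is a hypothetical name.

```python
import re

# Atom-level SMILES tokenization pattern commonly used with the
# Molecular Transformer (assumed here, not taken from prepare_data.py).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> str:
    """Return the SMILES string with its tokens separated by single spaces."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Guard against characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES: {smiles!r}")
    return " ".join(tokens)
```

For instance, `tokenize_smiles("CC(=O)Cl")` yields `C C ( = O ) Cl`, with the two-letter atom `Cl` kept as a single token.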

Citation

Paper:

@article{andronov_voinarovska_andronova_wand_clevert_schmidhuber_2022,
author     ="Andronov, Mikhail and Voinarovska, Varvara and Andronova, Natalia and Wand, Michael and Clevert, Djork-Arné and Schmidhuber, Jürgen",
title      ="Reagent prediction with a molecular transformer improves reaction data quality",
journal    ="Chem. Sci.",
year       ="2023",
volume     ="14",
issue      ="12",
pages      ="3235-3246",
publisher  ="The Royal Society of Chemistry",
doi        ="10.1039/D2SC06798F",
url        ="http://dx.doi.org/10.1039/D2SC06798F"
}

The underlying framework:

@inproceedings{opennmt,
  author    = {Guillaume Klein and
               Yoon Kim and
               Yuntian Deng and
               Jean Senellart and
               Alexander M. Rush},
  title     = {Open{NMT}: Open-Source Toolkit for Neural Machine Translation},
  booktitle = {Proc. ACL},
  year      = {2017},
  url       = {https://doi.org/10.18653/v1/P17-4012},
  doi       = {10.18653/v1/P17-4012}
}