PROTAC-RL

Source code for the Nature Machine Intelligence paper "Accelerated rational PROTAC design via deep learning and molecular simulations".

Protac-RL

PROTAC-RL is a novel deep reinforcement learning-driven generative model for the rational design of PROTACs in a low-resource setting.

Install requirements

Python = 3.6.10

rdkit = 2019.09.2.0, torch = 1.8.0+cu111

Python packages listed in environment.yml

To install all the Python packages, create a new conda environment:

conda env create -f environment.yml
conda activate PROTAC-RL

Pre-processing

The tokenized datasets can be found in the data/ folder.

The PROTAC and ZINC datasets are drawn from PROTAC-DB and ZINC (molecular weight > 500), respectively. Each dataset comes in two SMILES forms: canonical SMILES and random SMILES (used for data augmentation).

We use a shared vocabulary. The vocab_size and seq_length values are chosen to cover the entire datasets.
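As a rough illustration of what "covering the entire datasets" means, the sketch below collects every distinct token and the longest sequence across two corpora. It assumes each line is a SMILES string already split into space-separated tokens; the function name and the toy lines are hypothetical and not part of this repository.

```python
from typing import Iterable, Set, Tuple


def vocab_and_max_len(corpora: Iterable[Iterable[str]]) -> Tuple[Set[str], int]:
    """Return the set of all tokens and the longest token count across corpora.

    Assumption: every line is a SMILES string pre-split into
    space-separated tokens.
    """
    vocab: Set[str] = set()
    max_len = 0
    for corpus in corpora:
        for line in corpus:
            tokens = line.split()
            vocab.update(tokens)
            max_len = max(max_len, len(tokens))
    return vocab, max_len


# Toy stand-ins for tokenized ZINC and PROTAC lines (not real data).
zinc_lines = ["C C ( = O ) O", "c 1 c c c c c 1"]
protac_lines = ["C C N ( C ) C", "C [C@H] ( N ) C ( = O ) O"]

vocab, max_len = vocab_and_max_len([zinc_lines, protac_lines])
print(len(vocab), max_len)
```

Any vocab_size at least as large as the number of distinct tokens, and any seq_length at least as large as the longest sequence, would avoid truncating either dataset.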

To keep the same vocabulary for pre-training and fine-tuning, remember to move .vocab.pt from the ZINC dataset to the PROTAC dataset after the PROTAC dataset has been pre-processed.
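The vocabulary swap can also be scripted. The helper below is a minimal sketch: the function name is hypothetical, and the exact file names of the two .vocab.pt files are left to the caller because they depend on how preprocess.sh was configured. It copies rather than moves the file, so the ZINC folder stays usable.

```python
import shutil
from pathlib import Path


def share_vocab(zinc_vocab: str, protac_vocab: str) -> Path:
    """Overwrite the PROTAC vocabulary file with the ZINC one.

    Both paths are supplied by the caller; this sketch makes no
    assumption about the actual file names in the data/ folder.
    """
    src, dst = Path(zinc_vocab), Path(protac_vocab)
    if not src.is_file():
        raise FileNotFoundError(f"ZINC vocabulary not found: {src}")
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # replaces dst if it already exists
    return dst


# Usage (placeholder paths, adjust to your own pre-processing output):
# share_vocab("path/to/zinc.vocab.pt", "path/to/protac.vocab.pt")
```

Run it only after the PROTAC dataset has been pre-processed, otherwise pre-processing would regenerate the PROTAC vocabulary and undo the swap.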

NOTICE

GitHub does not track the empty log folder from our original code. To avoid a FileNotFoundError, create a new log folder before running the scripts.
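If you prefer to create the folder from Python rather than with mkdir, a minimal sketch follows. The helper name is hypothetical, and the default location ("log" relative to the current working directory) is an assumption; check the scripts for the path they actually write to.

```python
from pathlib import Path


def ensure_log_dir(path: str = "log") -> Path:
    """Create the log folder if it is missing and return its path."""
    log_dir = Path(path)
    # exist_ok=True makes the call safe to repeat on later runs.
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir


# Usage (assumed location): ensure_log_dir("log")
```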

Pre-training

Pre-training on the ZINC dataset can be started by running the training.sh script.

Fine-tuning

After pre-training, the fine-tuning script fine-tune-training.sh can be run on the PROTAC dataset.

RL (beam search & multinomial sampling)

To train the RL model, use the train_case.sh script. Set train_type to B for beam search or to M for multinomial sampling. The options are documented in the comments inside train_case.sh.

In most cases, multinomial sampling performs better because it can explore a larger chemical space.

For the expected input format of a case, refer to case/dBET6/.
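The difference between the two decoding strategies can be seen on a toy example. The sketch below is not the PROTAC-RL implementation: TOY_MODEL is a made-up next-token table that conditions only on the previous token, and all names are hypothetical. Beam search deterministically keeps the highest-scoring prefixes, while multinomial sampling draws each token at random from the predicted distribution and therefore reaches more distinct sequences.

```python
import math
import random
from typing import Dict, List, Tuple

# Hypothetical toy "model": next-token probabilities given the last token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"C": 0.6, "N": 0.3, "O": 0.1},
    "C": {"C": 0.5, "O": 0.3, "</s>": 0.2},
    "N": {"C": 0.7, "</s>": 0.3},
    "O": {"C": 0.4, "</s>": 0.6},
}


def beam_search(beam_size: int, max_len: int = 4) -> List[Tuple[List[str], float]]:
    """Keep the beam_size prefixes with the highest log-probability per step."""
    beams: List[Tuple[List[str], float]] = [(["<s>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished, carry over
                continue
            for tok, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams


def multinomial_sample(rng: random.Random, max_len: int = 4) -> List[str]:
    """Draw one sequence, sampling each token from the model distribution."""
    tokens = ["<s>"]
    for _ in range(max_len):
        if tokens[-1] == "</s>":
            break
        dist = TOY_MODEL[tokens[-1]]
        tokens.append(rng.choices(list(dist), weights=list(dist.values()), k=1)[0])
    return tokens


rng = random.Random(0)  # fixed seed so the run is repeatable
beam_out = {" ".join(tokens) for tokens, _ in beam_search(beam_size=2)}
sample_out = {" ".join(multinomial_sample(rng)) for _ in range(200)}
print(len(beam_out), "distinct beam outputs;", len(sample_out), "distinct sampled outputs")
```

With a beam of size 2, beam search can return at most two sequences per input, whereas repeated sampling keeps finding new ones, which is the exploration behaviour referred to above.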

Generation (beam search & multinomial sampling)

Beam search generation can be started by running the testing_beam_search_*.sh script. Several similar generation scripts are provided; the usage of each is described in the comments at the top of the file.

Multinomial sampling generation can be started by running the testing_msearch_*.sh script. Again, several similar generation scripts are provided, and the usage of each is described in the comments at the top of the file.

Example

The following example shows step by step how to train and use PROTAC-RL:

pre-processing for ZINC (with the script parameters set to ZINC)

bash preprocess.sh

then change the script parameters to PROTAC and run it again

bash preprocess.sh

pre-training with ZINC data

bash training.sh

before fine-tuning, move .vocab.pt from the ZINC dataset folder to the PROTAC dataset folder, replacing the original one

fine-tuning with PROTAC data

bash fine-tuning.sh

RL training on the dBET6 case, with the scoring function set to PK

bash train_case.sh

generate from the RL-trained model

bash testing_msearch_case.sh

then find the generated output and the log file in the dBET6 case folder

Reference

Please cite the following paper if you use this code in your work.

@article{zheng2022accelerated,
  title={Accelerated rational PROTAC design via deep learning and molecular simulations},
  author={Zheng, Shuangjia and Tan, Youhai and Wang, Zhenyu and Li, Chengtao and Zhang, Zhiqing and Sang, Xu and Chen, Hongming and Yang, Yuedong},
  journal={Nature Machine Intelligence},
  pages={1--10},
  year={2022},
  publisher={Nature Publishing Group}
}

Contact

@Shuangjia