
X-MOL : large-scale pre-training for molecular understanding and diverse molecular analysis

<br>

Introduction of X-MOL

In silico modelling and analysis of small molecules substantially accelerates drug development. Representing and understanding molecules is the fundamental step for all in silico molecular analysis tasks, which have traditionally been investigated individually and separately. In this study, we present X-MOL, which applies large-scale pre-training on 1.1 billion molecules for molecular understanding and representation, followed by carefully designed fine-tuning for diverse downstream molecular analysis tasks, including molecular property prediction, chemical reaction analysis, drug-drug interaction prediction, de novo molecule generation and molecule optimization. X-MOL achieves state-of-the-art results on all of these tasks with good model interpretability. Collectively, by taking advantage of super-large-scale pre-training data and super-computing power, our study practically demonstrates the utility of the idea of "mass makes miracles" in molecular representation learning and downstream in silico molecular analysis, indicating the great potential of combining large-scale unlabelled data with carefully designed pre-training and fine-tuning strategies to unify existing molecular analysis tasks and substantially enhance the performance of each task. <br> X-MOL adopts a well-designed pre-training strategy to learn and understand the SMILES representation efficiently. Specifically, X-MOL uses a generative model during pre-training: the model is trained to generate a valid, equivalent SMILES representation from an input SMILES representation of the same molecule. This generative training strategy results in a pre-trained model with a good understanding of the SMILES representation that can reliably generate correct SMILES for a given molecule. X-MOL is a super-large-scale pre-training model based on the Transformer, composed of 12 encoder-decoder layers, 768-dimensional hidden units and 12 attention heads. <br> The generative pre-training strategy is implemented with an encoder-decoder architecture, but unlike traditional encoder-decoder architectures such as those used in neural machine translation (NMT), the encoder and decoder in X-MOL share the same layers. The input random SMILES and the output random SMILES are fed into the model simultaneously, and the output random SMILES is entirely masked. In addition, only unidirectional attention is allowed within the output random SMILES, so each character in the output can attend only to itself and to the previously generated characters. In this way, the shared-layer encoder-decoder architecture unifies the semantic comprehension of the encoder and decoder, and it also significantly reduces the number of parameters compared with traditional encoder-decoder architectures. <br>
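The attention pattern described above can be illustrated with a small, self-contained sketch (a conceptual illustration only, not code from the X-MOL repository): the input SMILES segment attends bidirectionally, while the masked output SMILES segment sees the full input plus a causal view of itself.

```python
import numpy as np

def xmol_style_attention_mask(n_in, n_out):
    """Conceptual sketch of the attention pattern described above:
    input SMILES tokens attend to each other bidirectionally, while
    each output SMILES token attends to the whole input plus itself
    and the previously generated output tokens."""
    n = n_in + n_out
    mask = np.zeros((n, n), dtype=np.int8)  # mask[i, j] = 1 -> token i may attend to token j
    mask[:n_in, :n_in] = 1                  # input segment: full bidirectional attention
    mask[n_in:, :n_in] = 1                  # output tokens see the entire input segment
    mask[n_in:, n_in:] = np.tril(np.ones((n_out, n_out), dtype=np.int8))  # causal within output
    return mask

print(xmol_style_attention_mask(4, 3))
```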

Workflow of X-MOL

Pre-training : tremendous data, a large-scale Transformer and powerful computing power are used to build X-MOL. <br> Fine-tuning : the pre-trained X-MOL is then fine-tuned for molecular property prediction, drug-drug interaction prediction, chemical reaction prediction, molecule generation and molecule optimization. <br> (Workflow figure of X-MOL.) <br>

Environment

We provide the pre-trained X-MOL, the fine-tuning scripts and the required environment. <br> Environment : <br> Fine-tuning X-MOL for prediction tasks and for generation tasks are two independent parts; the environment (including python and nccl) should be downloaded and decompressed into both folders. <br> <br> The provided files : <br> - Pre-trained X-MOL : https://1drv.ms/u/s!BIa_gVKaCDngi2S994lMsp-Y3TWK?e=l5hbxi <br> - Environment-python : https://1drv.ms/u/s!Aoa_gVKaCDngi2U1ip8w2HxjIt4-?e=koGl4c <br> - Environment-nccl : https://1drv.ms/u/s!Aoa_gVKaCDngi2J7pOh7WdKR-pMa?e=GVlYbd <br> Requirements : <br> - Python 3.7 (the model-training environment provided above is Python 2, but data preprocessing and model evaluation require a Python 3 environment) <br> - RDKit (2019.09.1.0) <br>
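A quick sanity check of the Python 3 preprocessing/evaluation environment can look like the following (a hypothetical helper script, not part of the repository):

```python
import sys

import rdkit
from rdkit import Chem

# Data preprocessing and model evaluation expect a Python 3.7 environment (see above).
assert sys.version_info >= (3, 7), "use a Python 3.7 environment for preprocessing/evaluation"

# Confirm the RDKit installation with a minimal SMILES round-trip.
print("RDKit version:", rdkit.__version__)
mol = Chem.MolFromSmiles("CCO")          # ethanol
print("Canonical SMILES:", Chem.MolToSmiles(mol))
```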

Fine-tuning to prediction tasks

  1. Modify the configuration file : <br> conf_pre/ft_conf.sh <br> The terms that need to be modified are highlighted, like : <br> ### attention, this term need to be modified <br> vocab_path="./package/molecule_dict_zinc250k" <br> ### attention, this term need to be modified <br> CONFIG_PATH="./package/ernie_zinc250k_config.json" <br> <br>
  2. Fine-tuning to classification/regression : <br> Modify the main() in run_classifier.py <br>
    1. For classification : task_type = 'cls' <br>
    2. For regression : task_type = 'reg' <br> <br>
  3. Fine-tuning to single-input/multiple-input : <br> Modify the main() in run_classifier.py <br>
    1. For single-input : multi_input = False <br>
    2. For multiple-input : multi_input = True <br> Modify the main() in finetune_launch.py : <br> extend_sent = True <br> Modify "type_vocab_size" in the model config <br> (these settings are summarized in the sketch after this list) <br> <br>
  4. For the molecular property prediction task : <br>
    1. Repeat training : <br> Modify finetune_launch.py, in the if __name__ == "__main__": block : <br> while fine_tune_rep < the_number_of_repeating_times: <br>
    2. Random/scaffold split : <br>
      • Modify finetune_launch.py, in the if __name__ == "__main__": block : <br> Keep the subprocess.call("python3 pt_scaffold_split.py", shell=True) <br>
      • Modify pt_scaffold_split.py, in the if __name__ == "__main__": block : <br> sep_file_ex('path_to_training_data_folder', split_func='scaffold', amp=False, ampn=(0,0,0)) <br> (see the sketch after this list) <br> <br>
  5. If the vocab list needs to be extended : <br> Modify the main() in finetune_launch.py : <br> extend_vocab = True <br> <br>
  6. Run : <br> sh train_ft.sh <br> sh train_lrtemb.sh (knowledge embedding) <br>
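The per-task settings from steps 2-4 are summarized in the sketch below. It is illustrative only: the names task_type, multi_input, extend_sent, fine_tune_rep and sep_file_ex are the ones mentioned above, while the surrounding structure and the example values are stand-ins for the corresponding code in run_classifier.py, finetune_launch.py and pt_scaffold_split.py.

```python
# Illustrative summary of the settings described in steps 2-4; only the
# names mentioned above come from the README, the rest is a stand-in.
import subprocess

# --- run_classifier.py, inside main() ---
task_type = 'cls'      # 'cls' for classification, 'reg' for regression
multi_input = False    # True when the task takes multiple input molecules

# --- finetune_launch.py, inside main() ---
extend_sent = True     # required for multiple-input tasks
                       # (also adjust "type_vocab_size" in the model config)

# --- finetune_launch.py, inside the `if __name__ == "__main__":` block ---
the_number_of_repeating_times = 3   # hypothetical value: how many repeated runs
fine_tune_rep = 0
while fine_tune_rep < the_number_of_repeating_times:
    # regenerate the random/scaffold split before each repetition
    subprocess.call("python3 pt_scaffold_split.py", shell=True)
    # ... launch one fine-tuning run here ...
    fine_tune_rep += 1

# --- pt_scaffold_split.py, inside the `if __name__ == "__main__":` block ---
# sep_file_ex('path_to_training_data_folder', split_func='scaffold',
#             amp=False, ampn=(0, 0, 0))
```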

Fine-tuning to generation tasks

  1. Modify the configuration file : <br> ft_conf <br> The terms that need to be modified are highlighted, like : <br> ### attention, this term need to be modified <br> vocab_path="./package/molecule_dict_zinc250k" <br> ### attention, this term need to be modified <br> CONFIG_PATH="./package/ernie_zinc250k_config.json" <br> <br>
  2. If the vocab list needs to be extended : <br> Modify the main() in finetune_launch_local.py: <br> extend_vocab = True <br> extend_fc = True <br> <br>
  3. Run : <br> sh train_ft.sh (DL&GD generation tasks) <br> sh train_opt.sh (optimization tasks) <br>

Change the number of GPUs used in the training process

For both types of tasks : <br> Modify finetune_launch.py (finetune_launch_local.py for generation tasks) <br> Set valid values for the two arguments in the multip_g argparse section (see the sketch below) : <br> 1. nproc_per_node : the number of GPUs <br> 2. selected_gpus : the GPU ids <br>
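For reference, the two arguments typically look like the following in an argparse setup (a generic sketch; the actual definitions in finetune_launch.py / finetune_launch_local.py may differ in detail):

```python
import argparse

# Generic sketch of the two GPU-related arguments described above.
parser = argparse.ArgumentParser()
multip_g = parser.add_argument_group("multip_g")   # multi-process / multi-GPU settings
multip_g.add_argument("--nproc_per_node", type=int, default=8,
                      help="number of GPUs (one worker process per GPU)")
multip_g.add_argument("--selected_gpus", type=str, default="0,1,2,3,4,5,6,7",
                      help="comma-separated GPU ids to use")

# e.g. 4 GPUs: parser.parse_args(["--nproc_per_node", "4", "--selected_gpus", "0,1,2,3"])
args = parser.parse_args([])
print(args.nproc_per_node, args.selected_gpus)
```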

Extend the vocab list

Rules for extending the vocabulary list : <br> 1. The extension must be based on X-MOL_dict, i.e. the vocabulary list used in pre-training. <br> 2. The extended vocab entries must be placed after the original ones (the indices of the new entries start from 122). <br> 3. Do not forget to turn on extend_vocab in finetune_launch.py/finetune_launch_local.py. <br> 4. Do not forget to modify "vocab_size" in the model config. <br> 5. Once the vocabulary list is extended, the pre-trained model will be changed, so make sure you have a good backup of X-MOL. <br>
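As an illustration of rules 1-4, a hypothetical extension could be scripted as below. The file names molecule_dict_zinc250k and ernie_zinc250k_config.json are taken from the configuration example above; the new tokens and the assumed one-token-per-line dict format are only examples.

```python
import json

# Hypothetical example of extending the vocabulary according to rules 1-4.
VOCAB_PATH = "./package/molecule_dict_zinc250k"
CONFIG_PATH = "./package/ernie_zinc250k_config.json"
NEW_TOKENS = ["[Se]", "[te]"]   # example tokens not present in the original dict

# 1/2. Append the new tokens after the original ones, so their indices
#      continue from the end of the pre-training vocabulary (>= 122).
#      (Assumes one token per line; adjust if the dict stores explicit indices.)
with open(VOCAB_PATH, "a") as f:
    for tok in NEW_TOKENS:
        f.write(tok + "\n")

# 4. Enlarge "vocab_size" in the model config accordingly.
with open(CONFIG_PATH) as f:
    cfg = json.load(f)
cfg["vocab_size"] += len(NEW_TOKENS)
with open(CONFIG_PATH, "w") as f:
    json.dump(cfg, f, indent=2)

# 3. Remember to also set extend_vocab = True in
#    finetune_launch.py / finetune_launch_local.py.
```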

Fine-tuning output

Paths of the log files and the saved models : <br> 1. Log files are saved in ./log/ : one launching log and n running logs (n = the number of GPUs). <br> 2. The saved models are stored in ./checkpoints/ (the parameter SAVE_STEPS=1000 in ft_conf.sh indicates that the model is stored every 1000 steps during training). <br>

Warm start and cold start

Warm start : <br> Fine-tune the model on the basis of the pre-trained X-MOL <br> Set the parameter init_model in ft_conf / ft_conf.sh to init_model="path/to/decompressed/X-MOL" <br> Cold start : <br> Train the model from scratch <br> Set the parameter init_model in ft_conf / ft_conf.sh to init_model="" <br>

Contact

1810538@tongji.edu.cn or qiliu@tongji.edu.cn