Deep generative model of constructing chemical latent space for large molecular structures with 3D complexity

Overview

The structural diversity of chemical libraries, which are systematic collections of compounds that have the potential to bind to biomolecules, can be represented by a chemical latent space.
In this study, we developed a new deep-learning method, called NP-VAE, based on a variational autoencoder for handling natural compounds. NP-VAE enables the generation of a chemical latent space that projects large and complex compound structures, including chirality.

Requirements

Using npvae_env.yml in the env folder, you can build an Anaconda environment identical to the one used in this research.

conda env create -f npvae_env.yml
conda activate npvae_env

Compound Datasets

The datasets used in this study are in the smiles_data folder.
evaluation_*.txt is the evaluation dataset, split into training, test, and validation sets, and drugbank_smiles.txt is the processed DrugBank dataset.
The project dataset used in this study is an original compound library collected from various laboratories through the Ministry of Education, Culture, Sports, Science, and Technology-designated project, “Frontiers of Chemical Communication”, in which this research participated.
Two representative collections of compound structures within the project dataset, namely collection_A.txt (provided by Kakeya Lab.) and collection_B.txt (provided by Uesugi Lab.), are available.
However, most other compound structures in the project dataset are unpublished, and restrictions apply to the availability of these data, which were used under license for the current study and therefore are not publicly available.
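
As a rough illustration of working with these files, the sketch below loads a SMILES text file into a list. It assumes one SMILES string per line, optionally followed by whitespace-separated extra columns; that layout is an assumption, not something stated here, and `load_smiles` is a hypothetical helper that is not part of this repository.

```python
from pathlib import Path


def load_smiles(path):
    """Return the SMILES strings in a text file, skipping blank lines.

    Assumption: one SMILES per line; anything after the first
    whitespace on a line is ignored.
    """
    smiles = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        fields = line.split()
        if fields:  # skip empty or whitespace-only lines
            smiles.append(fields[0])
    return smiles
```

For example, `load_smiles("./smiles_data/drugbank_smiles.txt")` would return the DrugBank entries as a list of strings, provided the file follows the assumed layout.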

Pretrained Parameters

The saved parameters from our training on the DrugBank&Project dataset can be downloaded from the link published in the pre-trained folder. After downloading, please unzip them and place them in that folder.
These parameters were obtained by including the non-public project datasets in the training. If you want to use them in the following steps, please specify this path as LOAD_PATH.
no_property_model.iter-100 contains the parameters obtained by training on the DrugBank&Project dataset with only the structural information of the compounds.
In contrast, nplikeness_model.iter-100 contains the parameters obtained by adding the NP-likeness score as functional information along with the structural information; we recommend using this one unless you have a particular reason not to.

Program Usage

Please select and run the following Python files according to your purpose.

Some of the main parameters to be set and command examples are shown below.


1. Train the model

The trained parameters are published, but if you want to train the model on your own dataset, run preprocessing.py and train.py.

1.1. Preprocessing

python preprocessing.py --smiles_path ./smiles_data/hoge.txt --save_path ./save_data

1.2. Training

(Skip this process if you use the pre-trained parameters.)
Training uses multiple GPUs for acceleration. Please make sure your GPUs are available.

python train.py --smiles_path ./smiles_data/hoge.txt --prepared_path ./save_data --save_path ./param_data
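
Steps 1.1 and 1.2 can also be chained from a small driver script. The following is a minimal sketch, assuming it is run from the repository root and that the two scripts accept exactly the arguments shown in the commands above; `build_training_commands` and `run_all` are hypothetical helper names, not part of this repository. By default it only prints the commands.

```python
import shlex
import subprocess
import sys


def build_training_commands(smiles_path, prepared_path, param_path):
    """Return the argv lists for preprocessing.py and train.py, in order."""
    preprocess = [
        sys.executable, "preprocessing.py",
        "--smiles_path", smiles_path,
        "--save_path", prepared_path,
    ]
    train = [
        sys.executable, "train.py",
        "--smiles_path", smiles_path,
        "--prepared_path", prepared_path,
        "--save_path", param_path,
    ]
    return [preprocess, train]


def run_all(commands, dry_run=True):
    """Print each command; also execute it unless dry_run is True."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the sequence if a step exits non-zero.
            subprocess.run(argv, check=True)


commands = build_training_commands(
    "./smiles_data/hoge.txt", "./save_data", "./param_data"
)
run_all(commands, dry_run=True)  # only prints; nothing is executed
```

Calling `run_all(commands, dry_run=False)` would run preprocessing first and start training only if preprocessing succeeds.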

2. Calculate latent variables

You can calculate latent variables corresponding to your input compounds based on learned parameters.

python calculate_z.py --smiles_path ./smiles_data/hoge.txt --prepared_path ./save_data --load_path ./param_data --save_path ./output_data

If you want to obtain latent variables for your own compounds using the published parameters, without training, please run preprocessing.py first to complete the preprocessing. (See procedure 1.1 above.)
Then, instead of running calculate_z.py, rename the downloaded parameter file to model.iter-100 and run calculate_z_simple.py.

python calculate_z_simple.py --smiles_path ./smiles_data/hoge.txt --prepared_path ./save_data --load_path ./pre-trained --save_path ./output_data
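
The file-name change required before running calculate_z_simple.py can be scripted. The sketch below is one possible way to do it, assuming the downloaded parameters are a single file (not a directory) already placed in the pre-trained folder; `stage_pretrained` is a hypothetical helper, not part of this repository. It copies the file instead of renaming it, so the original stays available under its own name.

```python
import shutil
from pathlib import Path


def stage_pretrained(param_dir, downloaded_name="nplikeness_model.iter-100"):
    """Copy a downloaded parameter file to model.iter-100 in the same folder.

    Assumption: the parameters are stored as one regular file.
    """
    param_dir = Path(param_dir)
    source = param_dir / downloaded_name
    if not source.is_file():
        raise FileNotFoundError(f"parameter file not found: {source}")
    target = param_dir / "model.iter-100"
    shutil.copyfile(source, target)  # overwrites an existing model.iter-100
    return target
```

For example, `stage_pretrained("./pre-trained")` prepares the recommended NP-likeness parameters, and `stage_pretrained("./pre-trained", "no_property_model.iter-100")` selects the structure-only parameters.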

3. Evaluate reconstruction accuracy

(Skip this process if you use the pre-trained parameters.)
This step is optional, but it helps verify how well the model fits after training.

python evaluate.py --smiles_path ./smiles_data/hoge.txt --saved_path ./output_data

4. Visualize latent variables

The acquired latent variables are reduced in dimension by t-SNE, and the visualization is saved as a PNG file under the SAVED_PATH you specified.
If you want to highlight the distribution of specific compounds in color, please prepare a separate txt file listing them in SMILES format.
If you only want to color-code the points by functional information value, no separate txt file is needed. In that case, please set the -color flag.

python visualize.py --smiles_path ./smiles_data/hoge.txt --saved_path ./output_data -check_path ./smiles_data/target_smiles.txt -color

5. Generate new compound structures

You can generate new compound structures from the region of latent space around a compound you specify. An SDF file is saved under the saved_path you specified.
If you use published parameters without training, please set prepared_path to pre-trained.

python generate.py --smiles_path ./smiles_data/hoge.txt --prepared_path ./save_data --load_path ./param_data --saved_path ./output_data -target 'c1ccccc1'

License

This software is released under a custom license.

Academic use of this software is free and does not require any permission. We encourage academic users to cite our research paper (if applicable).

For commercial use, please contact the author for permission at [toshiki-ochiai@dna.bio.keio.ac.jp].

By using this software, you acknowledge and agree to the terms of use.