Reinforced Regressional and Conditional GAN: RRCGAN
Submitted to Digital Discovery.
RRCGAN is a deep generative model that combines a Generative Adversarial Network (GAN) with a Regressor to generate molecules with targeted properties. It is written entirely in Python. A GPU is practically necessary; without one, running the code takes a very long time.
Overview
RRCGAN is a generative GAN model designed to generate small molecules with targeted properties. It is a deep learning model built with Keras from TensorFlow and can be easily installed on personal computers. A GPU is recommended to speed up each training epoch. The packages used in RRCGAN can be installed on all major platforms (e.g. BSD, GNU/Linux, OS X, Windows).
System Requirements
Hardware requirements
RRCGAN requires only a standard computer with a GPU and enough RAM. The memory available on the GPU is also critical, because the large matrices used during training must fit in it. A GPU with 12 GB of memory is suggested.
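To get a feel for why GPU memory matters, the size of a dense one-hot encoded training set can be estimated with simple arithmetic. The sketch below is only illustrative: the padded SMILES length (40), the character-set size (27), and the float32 storage are assumptions made for this example, not values taken from the RRCGAN code.

```python
def one_hot_dataset_gib(n_molecules, max_len, charset_size, bytes_per_value=4):
    """Size in GiB of a dense (n_molecules, max_len, charset_size) array."""
    total_bytes = n_molecules * max_len * charset_size * bytes_per_value
    return total_bytes / 1024 ** 3


# Hypothetical sizes: 130K molecules, SMILES padded to 40 characters,
# a 27-symbol character set, stored as float32 (4 bytes per value).
estimate = one_hot_dataset_gib(130_000, 40, 27)
print(f"one-hot inputs alone: about {estimate:.2f} GiB")
```

This counts the input tensor only. Model weights, optimizer state, and intermediate activations come on top of it, which is why headroom well beyond the raw data size is useful.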
Software requirements
Python Dependencies
RRCGAN mainly depends on the Python scientific stack, Keras from TensorFlow, and the chemistry tools Chainer Chemistry and RDKit.
numpy
scipy
scikit-learn
pandas
seaborn
tensorflow
matplotlib
chainer chemistry
RDKit
Installation Guide:
The main challenge in running the model is setting up TensorFlow with GPU support. A specific combination of TensorFlow version and NVIDIA drivers must be installed for it to work. The required packages and the conda environment we used are listed in environment.yml. Installing all the packages and getting TensorFlow running on the GPU may take 30-60 minutes.
The following NVIDIA software is required for GPU support.
• NVIDIA GPU drivers version 450.80.02 or higher.
• CUDA® Toolkit 11.8.
• cuDNN SDK 8.6.0.
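The "or higher" requirement on the driver can be checked by comparing dotted version strings numerically. The following is a minimal sketch; the helper names `parse_version` and `meets_minimum` are introduced here for illustration and are not part of the repository.

```python
from itertools import zip_longest


def parse_version(text):
    """Turn a dotted version string such as '450.80.02' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`, comparing field by field."""
    pairs = zip_longest(parse_version(installed), parse_version(minimum), fillvalue=0)
    for have, need in pairs:
        if have != need:
            return have > need
    return True  # all fields equal


MIN_DRIVER = "450.80.02"  # minimum driver version from the list above
print(meets_minimum("450.80.02", MIN_DRIVER))
print(meets_minimum("440.33.01", MIN_DRIVER))
```

The installed driver version can usually be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to `meets_minimum` as a string.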
These components must be installed in the order listed, so the NVIDIA GPU driver comes first. More information can be found on the TensorFlow website (https://www.tensorflow.org/install/pip).
We also encourage readers to visit (https://stackoverflow.com/questions/50622525/which-tensorflow-and-cuda-version-combinations-are-compatible) for more details on which TensorFlow and CUDA versions work together.
We primarily used the Lewis Cluster at the University of Missouri-Columbia to run the code. The code was also tested with TensorFlow on the GPU of a PC with the following configuration: GPU: NVIDIA RTX 2080 Super, CUDA version: 10.1, cuDNN: 7.6, TensorFlow: 2.11.0
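After installation, it can save time to confirm that TensorFlow actually sees the GPU before starting a long training run. The sketch below uses `tf.config.list_physical_devices` and `tf.sysconfig.get_build_info`; the function name `gpu_report` is made up for this example, and the CUDA and cuDNN entries may be missing on CPU-only builds.

```python
def gpu_report():
    """Summarize the TensorFlow version, its CUDA/cuDNN build info, and visible GPUs."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return {"tensorflow": None, "cuda": None, "cudnn": None, "gpus": []}

    build = tf.sysconfig.get_build_info()  # dict describing how TF was built
    gpus = tf.config.list_physical_devices("GPU")
    return {
        "tensorflow": tf.__version__,
        "cuda": build.get("cuda_version"),    # None on CPU-only builds
        "cudnn": build.get("cudnn_version"),  # None on CPU-only builds
        "gpus": [device.name for device in gpus],
    }


report = gpu_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

An empty `gpus` list with TensorFlow installed usually points to a mismatch between the driver, CUDA, cuDNN, and TensorFlow versions.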
Files Guidance
Main models
directory: ./model_regular:
contains the code for the Regular model.
- "preprocessing.ipynb" contains the code for reading the '.csv' files from the PubChemQC library.
- "embedding_version_0_3.ipynb" contains the encoder and decoder models.
- "Main_model_V01.ipynb" contains the code for the Regressor and GAN models. After the autoencoder (AE) has been trained with "embedding_version_0_3.ipynb", the whole architecture should be trained here. The file also has a section for generating samples after training.
directory: ./model_transfer:
contains the code for the first iteration of the Transferred model.
directory: ./model_transfer2:
contains the code for the second iteration of the Transferred model.
Training data:
directory: ./data/trainingsets/train_regular_pubqc130K:
contains the 130K molecules downloaded from the PubChemQC library.
directory: ./data/trainingsets/highgap_pubqc:
contains the molecules from the PubChemQC library with gap values >= 10 eV.
directory: ./data/trainingsets/highgap_outliergen_trans1:
contains the molecules with DFT-calculated gap values >= 10 eV, found as outliers among the samples generated by the Regular model.
directory: ./data/trainingsets/highgap_outliergen_trans2:
contains the molecules with DFT-calculated gap values >= 10.2 eV generated by iteration 1 of the Transfer model.
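The high-gap sets above are defined by a threshold on the gap value. A selection of that kind could look like the sketch below. The column names `smiles` and `gap_ev` are assumptions, and the three rows are placeholder values invented for the example, not PubChemQC or DFT results.

```python
import csv
import io


def select_high_gap(rows, threshold_ev=10.0, gap_column="gap_ev"):
    """Keep the rows whose gap value (in eV) is at or above the threshold."""
    return [row for row in rows if float(row[gap_column]) >= threshold_ev]


# Placeholder table standing in for a real '.csv' file (values are made up).
sample_csv = "smiles,gap_ev\nC,10.9\nCCO,8.1\nO,10.0\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))

high_gap = select_high_gap(rows)                        # >= 10 eV
higher_gap = select_high_gap(rows, threshold_ev=10.2)   # >= 10.2 eV
print([row["smiles"] for row in high_gap])
print([row["smiles"] for row in higher_gap])
```

Raising `threshold_ev` from 10.0 to 10.2 mirrors the stricter cut used for the second transfer set.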
Samples of generated SMILES
directory: ./experiments
contains generated samples for the regular and transferred models.
directory: ./experiments/regular/*
contains generated samples that were used for figures such as latent space, distribution of structural features, and distribution of regular vs. two iterations of transferred models.
Shared functions
directory: ./utils
contains general functions that can be imported from all the directories.
Demo
directory: ./Demo
contains all the code needed to train a regular RRCGAN model. It can be trained quickly on a batch of 20K training samples to demonstrate the training process. All the required training samples, with their SMILES and one-hot encoded matrices, are saved in a '.pickle' file inside the ./Demo/training_20K/ directory.
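For readers unfamiliar with the data format, the sketch below shows one common way to one-hot encode SMILES strings character by character and store them with pickle. It is an illustration under stated assumptions, not the layout of the Demo file: the helper names, the space padding, and the dictionary keys are hypothetical, and plain lists are used only to keep the example dependency-free (a NumPy array would be the more natural container).

```python
import pickle


def build_charset(smiles_list, pad=" "):
    """Map every character seen in the SMILES (plus the pad symbol) to an index."""
    chars = sorted(set("".join(smiles_list)) | {pad})
    return {ch: index for index, ch in enumerate(chars)}


def one_hot(smiles, char_to_index, max_len, pad=" "):
    """Encode one SMILES string as a max_len x charset_size matrix of 0/1."""
    if len(smiles) > max_len:
        raise ValueError(f"{smiles!r} is longer than max_len={max_len}")
    matrix = [[0] * len(char_to_index) for _ in range(max_len)]
    for position, ch in enumerate(smiles.ljust(max_len, pad)):
        matrix[position][char_to_index[ch]] = 1
    return matrix


smiles_list = ["CCO", "C=O", "N"]  # toy inputs, not taken from the training set
charset = build_charset(smiles_list)
encoded = [one_hot(s, charset, max_len=5) for s in smiles_list]

# Round trip through pickle, in memory here rather than on disk.
blob = pickle.dumps({"smiles": smiles_list, "one_hot": encoded, "charset": charset})
restored = pickle.loads(blob)
print(len(restored["one_hot"]), "molecules,", len(charset), "symbols")
```

Character-level encoding treats two-letter atoms such as Cl as two symbols, which is a simplification a real tokenizer may handle differently.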
License
This project is covered under the GNU General Public License v3.0. The full license text is in the License file.