Code for the paper "A Deep Generative Model for Fragment-Based Molecule Generation" (AISTATS 2020)
Links: Paper - AISTATS 2020 proceedings
Installation
Run:
source scripts/install.sh
This will take care of installing all required dependencies.
If you have trouble during the installation, try running each line of the scripts/install.sh
file separately (one by one) in your shell.
The only prerequisite is a recent version of the Conda package manager, which ships with the Anaconda Python distribution.
After that, you are all set up.
Preprocessing
First, you need to download the data and do some preprocessing. To do this, run:
python manage.py preprocess --dataset <DATASET_NAME>
where <DATASET_NAME> must be ZINC or PCBA. At the moment, we support only these two.
Use python manage.py preprocess --help to see other useful options for preprocessing.
The preprocess command downloads the necessary files into the DATA folder and preprocesses them as described in the paper.
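If you prefer to drive preprocessing from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name preprocess_command and the constant SUPPORTED_DATASETS are hypothetical, and the sketch assumes that dataset names are case-sensitive and that the script is run from the repository root.

```python
import sys

# The two dataset names this README lists as supported.
SUPPORTED_DATASETS = ("ZINC", "PCBA")


def preprocess_command(dataset):
    """Return the argv list for `python manage.py preprocess --dataset <dataset>`."""
    # Assumption: names are matched exactly as written in the README.
    if dataset not in SUPPORTED_DATASETS:
        raise ValueError(
            f"unsupported dataset {dataset!r}; expected one of {SUPPORTED_DATASETS}"
        )
    # sys.executable keeps the call inside the currently active Conda environment.
    return [sys.executable, "manage.py", "preprocess", "--dataset", dataset]


# Usage (from the repository root, where manage.py lives):
#     import subprocess
#     for name in SUPPORTED_DATASETS:
#         subprocess.run(preprocess_command(name), check=True)
```

Rejecting unknown names before launching the command avoids starting a download for a dataset that is not supported.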
Training
After preprocessing, you can train the model by running:
python manage.py train --dataset <DATASET_NAME>
where <DATASET_NAME> is defined as described above.
If you wish to train using a GPU, add the --use_gpu option.
Check out python manage.py train --help to see all the other hyperparameters you can change.
Training the model will create a RUNS folder with the following structure:
RUNS
└── <date>@<time>-<hostname>-<dataset>
    ├── ckpt
    │   ├── best_loss.pt
    │   ├── best_valid.pt
    │   └── last.pt
    ├── config
    │   ├── config.pkl
    │   ├── emb_<embedding_dim>.dat
    │   ├── params.json
    │   └── vocab.pkl
    ├── results
    │   ├── performance
    │   │   ├── loss.csv
    │   │   └── scores.csv
    │   └── samples
    └── tb
        └── events.out.tfevents.<tensorboard_id>.<hostname>
The <date>@<time>-<hostname>-<dataset> folder is a snapshot of your experiment and contains all the data collected during training.
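Because each training run gets its own timestamped folder, it can be handy to locate the newest one and check which checkpoints it contains. The sketch below relies only on the layout shown above; the helper names latest_run and existing_checkpoints are hypothetical and do not exist in the repository, and "newest" is assumed to mean the most recently modified directory.

```python
from pathlib import Path

# Checkpoint file names taken from the directory tree above.
CHECKPOINT_NAMES = ("best_loss.pt", "best_valid.pt", "last.pt")


def latest_run(runs_dir="RUNS"):
    """Return the most recently modified run directory, or None if there is none."""
    root = Path(runs_dir)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    # Assumption: modification time is a good enough proxy for "latest run".
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def existing_checkpoints(run_dir):
    """Map checkpoint file name to its path for checkpoints present in run_dir/ckpt."""
    ckpt_dir = Path(run_dir) / "ckpt"
    return {
        name: ckpt_dir / name
        for name in CHECKPOINT_NAMES
        if (ckpt_dir / name).is_file()
    }


# Usage:
#     run = latest_run()
#     if run is not None:
#         print(run, sorted(existing_checkpoints(run)))
```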
You can monitor training progress with tensorboardX. During training, run:
tensorboard --logdir RUNS
and open localhost:6006 in your favorite browser.
Sampling
After the model is trained, you can sample from it using
python manage.py sample --run <RUN_PATH>
where <RUN_PATH> is the path to the run directory of the experiment, which will be something like RUNS/<date>@<time>-<hostname>-<dataset> (<date>, <time>, <hostname>, and <dataset> are placeholders for the actual values).
Check out python manage.py sample --help to see all the sampling options.
You will find your samples in the results/samples folder inside your experiment run directory.
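To check what a sampling run produced, you can list the contents of that folder from Python. This is a minimal sketch that assumes only the results/samples location described above; the helper name sample_files is hypothetical, and nothing is assumed about the names or format of the sample files themselves.

```python
from pathlib import Path


def sample_files(run_path):
    """Return the files in <run_path>/results/samples, sorted by name."""
    samples_dir = Path(run_path) / "results" / "samples"
    if not samples_dir.is_dir():
        # No sampling has been done for this run yet (or the path is wrong).
        return []
    return sorted(p for p in samples_dir.iterdir() if p.is_file())


# Usage (replace the argument with your own run directory):
#     for path in sample_files("RUNS/<date>@<time>-<hostname>-<dataset>"):
#         print(path.name, path.stat().st_size, "bytes")
```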
Postprocessing
After sampling from the model, you may wish to run some common postprocessing operations, such as calculating statistics on the samples or aggregating multiple sample files and the test data into one large file for plotting.
To do so, run:
python manage.py postprocess --run <RUN_PATH>
where <RUN_PATH> is obtained as described above.
Check out python manage.py postprocess --help to see all available options.
Plotting
To produce figures similar to those in the paper from your own samples, run:
python manage.py plot --run <RUN_PATH>
where <RUN_PATH> is defined as described above.
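Since sampling, postprocessing, and plotting all take the same --run <RUN_PATH> argument, the three steps can be chained from a short script. The following is a rough sketch under that assumption: the names RUN_STAGES, stage_commands, and run_stages are hypothetical, and the script is assumed to run from the repository root with no extra options passed to any stage.

```python
import subprocess
import sys

# Subcommands documented in this README that take a --run <RUN_PATH> argument.
RUN_STAGES = ("sample", "postprocess", "plot")


def stage_commands(run_path, stages=RUN_STAGES):
    """Build one argv list per stage, in the order given."""
    for stage in stages:
        if stage not in RUN_STAGES:
            raise ValueError(f"unknown stage {stage!r}")
    return [
        [sys.executable, "manage.py", stage, "--run", str(run_path)]
        for stage in stages
    ]


def run_stages(run_path, stages=RUN_STAGES):
    """Run each stage in turn; check=True stops at the first failing stage."""
    for argv in stage_commands(run_path, stages):
        subprocess.run(argv, check=True)


# Usage (replace the argument with your own run directory):
#     run_stages("RUNS/<date>@<time>-<hostname>-<dataset>")
```

Stopping at the first failure means plotting is never attempted on a run whose postprocessing did not complete.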
Samples
The 20k SMILES samples used for the analysis in the paper are available in the SAMPLES folder.