Awesome
SyntheMol: Generative AI for Drug Discovery
SyntheMol is a generative AI method for designing structurally novel and diverse drug candidates with predicted bioactivity that are easy to synthesize.
SyntheMol consists of a Monte Carlo tree search (MCTS) that explores a combinatorial chemical space consisting of molecular building blocks and chemical reactions. The MCTS is guided by a bioactivity prediction AI model, such as a graph neural network or a random forest. Currently, SyntheMol is designed to use 137,656 building blocks and 13 chemical reactions from the Enamine REAL Space, which can produce over 30 billion molecules. However, SyntheMol can be easily adapted to use any set of building blocks and reactions.
SyntheMol is described in the following paper, where we applied SyntheMol to design novel antibiotic candidates for the Gram-negative bacterium Acinetobacter baumannii.
Swanson, K., Liu, G., Catacutan, D. B., Arnold, A., Zou, J., Stokes, J. M. Generative AI for designing and validating easily synthesizable and structurally novel antibiotics. Nature Machine Intelligence, 2024.
Full details for reproducing the results in the paper are provided in the docs directory.
Table of contents
- Installation
- Combinatorial chemical space
- Bioactivity prediction model
- Generate molecules
- Filter generated molecules
Installation
SyntheMol can be installed in < 3 minutes on any operating system using pip (optionally within a conda environment). SyntheMol can be run on a standard laptop (e.g., 16 GB memory and 8-16 CPUs), although a GPU is useful for faster training and prediction of the underlying bioactivity prediction model (Chemprop).
Optionally, create a conda environment.
conda create -y -n synthemol python=3.10
conda activate synthemol
Install SyntheMol via pip.
pip install synthemol
Alternatively, clone the repo and install SyntheMol locally.
git clone https://github.com/swansonk14/SyntheMol.git
cd SyntheMol
pip install -e .
If there are version issues with the required packages, create a conda environment with specific working versions of the packages as follows.
pip install -r requirements.txt
pip install -e .
Note: If you get the issue ImportError: libXrender.so.1: cannot open shared object file: No such file or directory
, run conda install -c conda-forge xorg-libxrender
.
Combinatorial chemical space
SyntheMol is currently designed to use 139,493 building blocks (137,656 unique molecules) and 13 chemical reactions from the Enamine REAL Space, which can produce over 30 billion molecules (30,330,025,259). However, an alternate combinatorial chemical space can optionally be used by replacing the building blocks and chemical reactions as follows.
Building blocks: Replace data/building_blocks.csv
with a custom file containing the building blocks. The file should be a CSV file with a header row and two columns: smiles
and ID
. The smiles
column should contain the SMILES string for each building block, and the ID
column should contain a unique ID for each building block.
Chemical reactions: In SyntheMol/reactions/custom.py
, set CUSTOM_REACTIONS
to a list of Reaction
objects similar to the REAL_REACTIONS
list in SyntheMol/reactions/real.py
. If CUSTOM_REACTIONS
is defined (i.e., not None
), then it will automatically be used instead of REAL_REACTIONS
.
Bioactivity prediction model
SyntheMol requires a bioactivity prediction model to guide its generative process. SyntheMol is designed to use one of three types of models:
- Chemprop: a message passing neural network from https://github.com/chemprop/chemprop
- Chemprop-RDKit: Chemprop augmented with 200 RDKit molecular features
- Random forest: a scikit-learn random forest model trained on 200 RDKit molecular features
Train model
All three model types can be trained using Chemprop, which is installed along with SyntheMol. All three model types can be trained on either regression or binary classification bioactivities. Full details are provided in the Chemprop README. Below is an example for training a Chemprop model on a binary classification task. By default, training is done on a GPU (if available).
Data file
# data/data.csv
smiles,activity
Br.CC(Cc1ccc(O)cc1)NCC(O)c1cc(O)cc(O)c1,0
CC[Hg]Sc1ccccc1C(=O)[O-].[Na+],1
O=C(O)CCc1ccc(NCc2cccc(Oc3ccccc3)c2)cc1,0
...
Train Chemprop
chemprop_train \
--data_path data/data.csv \
--dataset_type classification \
--save_dir models/chemprop
Pre-compute building block scores
After training, use the model to pre-compute scores of building blocks to accelerate the SyntheMol generation process. Below is an example using the trained Chemprop model. By default, prediction is done on a GPU (if available).
chemprop_predict \
--test_path "$(python -c 'import synthemol; print(str(synthemol.constants.BUILDING_BLOCKS_PATH))')" \
--preds_path models/chemprop/building_blocks.csv \
--checkpoint_dir models/chemprop
Generate molecules
SyntheMol uses the bioactivity prediction model within a Monte Carlo tree search to generate molecules. Below is an example for generating molecules with a trained Chemprop model using 20,000 MCTS rollouts. SyntheMol uses CPUs only (no GPUs).
synthemol \
--model_path models/chemprop \
--model_type chemprop \
--save_dir generations/chemprop \
--building_blocks_path models/chemprop/building_blocks.csv \
--building_blocks_score_column activity \
--n_rollout 20000
Note: The building_blocks_score_column
must match the column name in the building blocks file that contains the building block scores. When using chemprop_train
and chemprop_predict
, the column name will be the same as the column that contains target activity/property values in the training data file (e.g., activity
).
Filter generated molecules
Optionally, the generated molecules can be filtered for structural novelty, predicted bioactivity, and structural diversity. These filtering steps use chemfunc, which is installed along with SyntheMol. Below is an example for filtering the generated molecules.
Novelty
Filter for novelty by comparing the generated molecules to a set of active molecules (hits) from the training set or literature and removing similar generated molecules.
Hits file
# data/hits.csv
smiles,activity
CC[Hg]Sc1ccccc1C(=O)[O-].[Na+],1
O=C(NNc1ccccc1)c1ccncc1,1
...
Compute Tversky similarity between generated molecules and hits
chemfunc nearest_neighbor \
--data_path generations/chemprop/molecules.csv \
--reference_data_path data/hits.csv \
--reference_name hits \
--metric tversky
Filter by similarity, only keeping molecules with a nearest neighbor similarity to hits of at most 0.5
chemfunc filter_molecules \
--data_path generations/chemprop/molecules.csv \
--save_path generations/chemprop/molecules_novel.csv \
--filter_column hits_tversky_nearest_neighbor_similarity \
--max_value 0.5
Bioactivity
Filter for predicted bioactivity by keeping the molecules with the top 20% highest predicted bioactivity.
chemfunc filter_molecules \
--data_path generations/chemprop/molecules_novel.csv \
--save_path generations/chemprop/molecules_novel_bioactive.csv \
--filter_column score \
--top_proportion 0.2
Diversity
Filter for diversity by clustering molecules based on their Morgan fingerprint and only keeping the top scoring molecule from each cluster.
Cluster molecules into 50 clusters
chemfunc cluster_molecules \
--data_path generations/chemprop/molecules_novel_bioactive.csv \
--num_clusters 50
Select the top scoring molecule from each cluster
chemfunc select_from_clusters \
--data_path generations/chemprop/molecules_novel_bioactive.csv \
--save_path generations/chemprop/molecules_novel_bioactive_diverse.csv \
--value_column score