Product-of-Experts Chemical Language Models

Source code for the paper "Navigating Ultra-Large Virtual Chemical Spaces with Product-of-Experts Chemical Language Models"

Installation

Install Poetry and run:

poetry install

Or use the provided Dockerfile:

docker build -t poeclm -f docker/Dockerfile .
docker run --ipc=host --gpus all --rm -it \
    -e UID=$(id -u) \
    -e GID=$(id -g) \
    -v $(pwd):/home/user/workspace \
    -w /home/user/workspace \
    poeclm:latest \
    /bin/bash

Reproducing the datasets

Enumerated library used for pre-training

Enumerate a library of compounds:

poetry run python scripts/enumerate_library.py \
    -b data/chemical_space/enaminebbUS.smi.gz \
    -r data/chemical_space/hartenfeller.csv \
    -o data/chemical_space/enum_16M.smi \
    -n 16_000_000 \
    --prop_filters Ro5 Veber \
    --rd_filters Inpharmatica PAINS \
    --threshold 0.65
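
For a quick sanity check of the enumerated library, a minimal sketch such as the following parses a few entries with RDKit (it assumes the first whitespace-separated column of enum_16M.smi is the SMILES string):

from itertools import islice

from rdkit import Chem

# Inspect the first few enumerated compounds (assumes SMILES in the first column).
with open("data/chemical_space/enum_16M.smi") as f:
    for line in islice(f, 5):
        smiles = line.split()[0]
        mol = Chem.MolFromSmiles(smiles)
        status = "ok" if mol is not None else "failed to parse"
        print(f"{smiles}: {status}")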

Prepare a dataset for pre-training:

poetry run python scripts/prepare_dataset.py \
    -i data/chemical_space/enum_16M.smi \
    -o data/datasets/enum_16M \
    -f 0.96 0.02 0.02 \
    --vocab_file data/vocab.txt \
    --filter_expr 'n_heavy_atoms <= 50'
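
To confirm the 0.96/0.02/0.02 split, the prepared dataset can be inspected with pandas; train.parquet is used by the baseline below, while the valid.parquet and test.parquet names are assumptions:

import pandas as pd

# Report the size of each split (the valid/test file names are assumptions;
# train.parquet is the file read by the baseline further down).
for split in ["train", "valid", "test"]:
    df = pd.read_parquet(f"data/datasets/enum_16M/{split}.parquet", engine="pyarrow")
    print(split, len(df))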

DOCKSTRING dataset used for fine-tuning

Download the DOCKSTRING dataset:

poetry run python scripts/download_dockstring.py \
    -o data/dockstring/dataset.csv
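
To get a feel for the downloaded data, a small pandas sketch can list the columns and, if a DRD2 docking-score column is present (an assumption based on the filter expressions used for fine-tuning below), report how many compounds pass the -11.1 cutoff:

import pandas as pd

df = pd.read_csv("data/dockstring/dataset.csv")
print(df.columns.tolist())

# If a DRD2 docking-score column exists (assumed from the fine-tuning filters),
# report the fraction of compounds at or below the -11.1 threshold.
if "DRD2" in df.columns:
    print((df["DRD2"] <= -11.1).mean())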

Prepare a dataset for fine-tuning:

poetry run python scripts/prepare_dataset.py \
    -i data/dockstring/dataset.csv \
    -o data/datasets/dockstring \
    -f 0.8 0.1 0.1 \
    --vocab_file data/vocab.txt \
    --filter_expr 'n_heavy_atoms <= 50'

Reproducing the models

Prior models

Pre-train a 6M parameter model on the enumerated library:

poetry run python pretrain.py \
    "model.n_embd=288" \
    "model.n_layer=6" \
    "model.n_head=6" \
    "output_dir=outputs/pretraining/enum_16M/6M"

Pre-train a 25M parameter model on the enumerated library:

poetry run python pretrain.py \
    "model.n_embd=512" \
    "model.n_layer=8" \
    "model.n_head=8" \
    "output_dir=outputs/pretraining/enum_16M/25M"

Pre-train an 85M parameter model on the enumerated library:

poetry run python pretrain.py \
    "model.n_embd=768" \
    "model.n_layer=12" \
    "model.n_head=12" \
    "output_dir=outputs/pretraining/enum_16M/85M"

Expert and anti-expert models

DRD2

Fine-tune the 85M parameter model into DRD2 expert and anti-expert models (docking score threshold of -11.1):

poetry run python finetune.py \
    "checkpoint_file=outputs/pretraining/enum_16M/85M/best.ckpt" \
    "dataset_dir=data/datasets/dockstring" \
    "filter_expr='DRD2 <= -11.1'" \
    "batch_size=32" \
    "output_dir=outputs/finetuning/DRD2+/85M"
poetry run python finetune.py \
    "checkpoint_file=outputs/pretraining/enum_16M/85M/best.ckpt" \
    "dataset_dir=data/datasets/dockstring" \
    "filter_expr='DRD2 > -11.1'" \
    "batch_size=512" \
    "output_dir=outputs/finetuning/DRD2-/85M"

BBB

Fine-tune the 85M parameter model into BBB-permeability expert and anti-expert models:

poetry run python finetune.py \
    "checkpoint_file=outputs/pretraining/enum_16M/85M/best.ckpt" \
    "dataset_dir=data/datasets/dockstring" \
    "filter_expr='0.5159 * clogp - 0.0277 * tpsa - 0.3462 > 0.0'" \
    "batch_size=256" \
    "output_dir=outputs/finetuning/BBB+/85M"
poetry run python finetune.py \
    "checkpoint_file=outputs/pretraining/enum_16M/85M/best.ckpt" \
    "dataset_dir=data/datasets/dockstring" \
    "filter_expr='0.5159 * clogp - 0.0277 * tpsa - 0.3462 <= 0.0'" \
    "batch_size=512" \
    "output_dir=outputs/finetuning/BBB-/85M"

QED

Fine-tune the 85M parameter model into QED expert and anti-expert models (threshold of 0.6):

poetry run python finetune.py \
    "checkpoint_file=outputs/pretraining/enum_16M/85M/best.ckpt" \
    "dataset_dir=data/datasets/dockstring" \
    "filter_expr='qed > 0.6'" \
    "batch_size=256" \
    "output_dir=outputs/finetuning/QED+/85M"
poetry run python finetune.py \
    "checkpoint_file=outputs/pretraining/enum_16M/85M/best.ckpt" \
    "dataset_dir=data/datasets/dockstring" \
    "filter_expr='qed <= 0.6'" \
    "batch_size=256" \
    "output_dir=outputs/finetuning/QED-/85M"

Reproducing the results

Compound generation

Baseline

Generate compounds with the random baseline:

from pathlib import Path

import pandas as pd

# Draw a 2**15-compound random sample from the pre-training set as the baseline.
df = pd.read_parquet("data/datasets/enum_16M/train.parquet", engine="pyarrow")
df = df.sample(n=2**15, random_state=42, ignore_index=True)

out = Path("outputs/generation/Baseline/Random")
out.mkdir(parents=True, exist_ok=True)

# Store the full sample alongside the generated outputs.
df.to_parquet(out / "samples.parquet", engine="pyarrow", index=False, row_group_size=2**20)

# Write the first 10K compounds to unique.smi for the downstream evaluations.
unique_10K = df.head(10_000)
unique_10K[["standard_smiles", "id"]].to_csv(
    out / "unique.smi",
    sep=" ",
    header=False,
    index=False,
)

Prior

Generate compounds with the 6M parameter model:

poetry run python generate.py \
    "models=[{checkpoint_file: outputs/pretraining/enum_16M/6M/best.ckpt, weight: 1.0}]" \
    "output_dir=outputs/generation/Prior/6M"

Generate compounds with the 25M parameter model:

poetry run python generate.py \
    "models=[{checkpoint_file: outputs/pretraining/enum_16M/25M/best.ckpt, weight: 1.0}]" \
    "output_dir=outputs/generation/Prior/25M"

Generate compounds with the 85M parameter model:

poetry run python generate.py \
    "models=[{checkpoint_file: outputs/pretraining/enum_16M/85M/best.ckpt, weight: 1.0}]" \
    "output_dir=outputs/generation/Prior/85M"

Expert (DRD2)

Generate compounds with the expert model (DRD2):

poetry run python generate.py \
    "models=[{checkpoint_file: outputs/finetuning/DRD2+/85M/best.ckpt, weight: 1.0}]" \
    "output_dir=outputs/generation/Expert/DRD2+"

PoE

Generate compounds with the PoE model using different configurations:

poetry run python generate.py \
    "models=[
        {checkpoint_file: outputs/pretraining/enum_16M/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/DRD2+/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/DRD2-/85M/best.ckpt, weight: -1.0},
    ]" \
    "output_dir=outputs/generation/PoE/DRD2=1.0"
poetry run python generate.py \
    "models=[
        {checkpoint_file: outputs/pretraining/enum_16M/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/DRD2+/85M/best.ckpt, weight: 1.5},
        {checkpoint_file: outputs/finetuning/DRD2-/85M/best.ckpt, weight: -1.5},
    ]" \
    "output_dir=outputs/generation/PoE/DRD2=1.5"
poetry run python generate.py \
    "models=[
        {checkpoint_file: outputs/pretraining/enum_16M/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/DRD2+/85M/best.ckpt, weight: 2.0},
        {checkpoint_file: outputs/finetuning/DRD2-/85M/best.ckpt, weight: -2.0},
    ]" \
    "output_dir=outputs/generation/PoE/DRD2=2.0"
poetry run python generate.py \
    "models=[
        {checkpoint_file: outputs/pretraining/enum_16M/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/DRD2+/85M/best.ckpt, weight: 2.5},
        {checkpoint_file: outputs/finetuning/DRD2-/85M/best.ckpt, weight: -2.5},
    ]" \
    "output_dir=outputs/generation/PoE/DRD2=2.5"
poetry run python generate.py \
    "models=[
        {checkpoint_file: outputs/pretraining/enum_16M/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/DRD2+/85M/best.ckpt, weight: 2.0},
        {checkpoint_file: outputs/finetuning/DRD2-/85M/best.ckpt, weight: -2.0},
        {checkpoint_file: outputs/finetuning/BBB+/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/BBB-/85M/best.ckpt, weight: -1.0},
    ]" \
    "output_dir=outputs/generation/PoE/DRD2=2.0_BBB=1.0"
poetry run python generate.py \
    "models=[
        {checkpoint_file: outputs/pretraining/enum_16M/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/DRD2+/85M/best.ckpt, weight: 2.0},
        {checkpoint_file: outputs/finetuning/DRD2-/85M/best.ckpt, weight: -2.0},
        {checkpoint_file: outputs/finetuning/QED+/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/QED-/85M/best.ckpt, weight: -1.0},
    ]" \
    "output_dir=outputs/generation/PoE/DRD2=2.0_QED=1.0"
poetry run python generate.py \
    "models=[
        {checkpoint_file: outputs/pretraining/enum_16M/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/DRD2+/85M/best.ckpt, weight: 2.0},
        {checkpoint_file: outputs/finetuning/DRD2-/85M/best.ckpt, weight: -2.0},
        {checkpoint_file: outputs/finetuning/BBB+/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/BBB-/85M/best.ckpt, weight: -1.0},
        {checkpoint_file: outputs/finetuning/QED+/85M/best.ckpt, weight: 1.0},
        {checkpoint_file: outputs/finetuning/QED-/85M/best.ckpt, weight: -1.0},
    ]" \
    "output_dir=outputs/generation/PoE/DRD2=2.0_BBB=1.0_QED=1.0"

Evaluation

Similarity search with SpaceLight

Download SpaceLight from the AMD Software Server (requires a license) and extract it to the third_party directory.

Generate a topological fragment space for similarity search with SpaceLight:

poetry run python scripts/generate_spacelight_config.py \
    -b data/chemical_space/enaminebbUS.smi.gz \
    -r data/chemical_space/hartenfeller.csv \
    -o data/chemical_space/JSon
./third_party/SpaceLight_1.2.2/SpaceLight generate \
    -i data/chemical_space/JSon \
    -f data/chemical_space/fragspace.tfsdb

Run similarity search with SpaceLight:

for path in $(find outputs/generation -name unique.smi | sort); do
    ./third_party/SpaceLight_1.2.2/SpaceLight search \
        -f data/chemical_space/fragspace.tfsdb \
        -i $path \
        -o ${path%.smi}_SpaceLight.csv
done
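
The result files can be loaded with pandas for downstream analysis; the exact columns depend on the SpaceLight version, so the sketch below only prints them (the path is one example output):

import pandas as pd

# Peek at one SpaceLight result file; columns depend on the SpaceLight version.
df = pd.read_csv("outputs/generation/Prior/85M/unique_SpaceLight.csv")
print(df.columns.tolist())
print(df.head())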

Molecular docking with DOCKSTRING

Run molecular docking with DOCKSTRING (requires openbabel):

for path in $(find outputs/generation -name unique.smi | sort); do
    poetry run python scripts/run_docking.py \
        -t DRD2 \
        -i $path \
        -o ${path%.smi}_DRD2.sdf
done
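
The docked poses can be inspected with RDKit; which properties are attached to each record depends on scripts/run_docking.py, so the sketch below simply lists them for one example output:

from rdkit import Chem

# List the properties stored on the first docked pose; the property names depend
# on what scripts/run_docking.py writes to the SDF.
supplier = Chem.SDMolSupplier("outputs/generation/Prior/85M/unique_DRD2.sdf")
mol = next((m for m in supplier if m is not None), None)
if mol is not None:
    print(list(mol.GetPropNames()))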

Acknowledgements

The code is based on:

The building blocks are from: https://zinc20.docking.org/catalogs/enaminebbUS/

The chemical reactions are from: https://doi.org/10.1021/ci200379p/

The DOCKSTRING dataset is from: https://doi.org/10.1021/acs.jcim.1c01334/