🏎️🏎️🏎️🏎️

LaMBO: Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders

Abstract

Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug and antibody sequence design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of multi-objective acquisition functions in the latent space of the autoencoder. These acquisition functions allow LaMBO to balance the explore-exploit tradeoff over multiple design rounds, and to balance objective tradeoffs by optimizing sequences at many different points on the Pareto frontier. We evaluate LaMBO on a small-molecule task based on the ZINC dataset and introduce a new large-molecule task targeting fluorescent proteins. In our experiments LaMBO outperforms genetic optimizers and does not require a large pretraining corpus, demonstrating that BayesOpt is practical and effective for biological sequence design.

Key Results

BayesOpt can be used to maximize the simulated folding stability (-dG) and solvent-accessible surface area (SASA) of red-spectrum fluorescent proteins. Higher is better for both objectives. The starting proteins are shown as colored circles, with corresponding optimized offspring shown as crosses. Stability correlates with protein function (e.g. how long the protein can fluoresce) while SASA is a proxy for fluorescent intensity.

On all three tasks (described in Section 5.1 of the paper), LaMBO outperforms genetic algorithm baselines, specifically NSGA-2 and a model-based genetic optimizer with the same surrogate architecture (MTGP + NEHVI + GA). Performance is quantified by the hypervolume bounded by the optimized Pareto frontier. The midpoint, lower, and upper bounds of each curve depict the 50%, 20%, and 80% quantiles, estimated from 10 trials. See Section 5.2 in the paper for more discussion.
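To make the metric concrete, here is a minimal sketch of a Pareto frontier and the hypervolume it bounds for two maximized objectives. It is not the code used to produce the curves. The function names, the example points, and the reference point are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, float]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` everywhere and strictly better somewhere (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points: Iterable[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded below by `ref` (both objectives maximized)."""
    # Keep only points that strictly improve on the reference point in both objectives.
    kept = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective downward; each point adds a rectangle
    # only for the slice of the second objective that is not already covered.
    kept.sort(key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in kept:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Hypothetical objective pairs, e.g. (stability, SASA); not real data.
candidates = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (0.5, 0.5)]
front = pareto_front(candidates)
print(front)                                  # [(1.0, 5.0), (2.0, 4.0), (3.0, 1.0)]
print(hypervolume_2d(candidates, (0.0, 0.0)))  # 10.0
```

A larger hypervolume means the frontier has moved further from the reference point, so a single number summarizes progress on both objectives at once.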

UPDATE 04/20/2024

An open-source contribution identified subtle bugs that substantially hurt the performance of all methods on some tasks. The fix has been merged, so the current master commit now produces better results than originally reported. To reproduce the original curves in the paper, check out the following commit:

git checkout 431b052

Installation

FoldX

FoldX is available under a free academic license. After creating an account you will be emailed a link to download the FoldX executable and supporting assets. Copy the contents of the downloaded archive to ~/foldx. You may also need to rename the FoldX executable (e.g. mv -v ~/foldx/foldx_20221231 ~/foldx/foldx).
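A quick way to confirm the executable is where the instructions above put it is a small check like the following. This is a sketch only: `find_foldx` is a hypothetical helper, not part of this package, and it assumes the executable should end up at ~/foldx/foldx as the rename example suggests.

```python
import os
from pathlib import Path
from typing import Optional


def find_foldx(foldx_dir: Optional[Path] = None) -> Path:
    """Return the path to the `foldx` executable, or raise with a hint about what is missing."""
    # Assumed default location, taken from the install instructions above.
    foldx_dir = foldx_dir if foldx_dir is not None else Path.home() / "foldx"
    exe = foldx_dir / "foldx"
    if not exe.is_file():
        # List similarly named files, e.g. a versioned binary that still needs renaming.
        found = sorted(p.name for p in foldx_dir.glob("foldx*")) if foldx_dir.is_dir() else []
        hint = f" Found {found}; one of these may need renaming." if found else ""
        raise FileNotFoundError(f"{exe} does not exist.{hint}")
    if not os.access(exe, os.X_OK):
        raise PermissionError(f"{exe} is not executable; try `chmod +x {exe}`.")
    return exe


# Usage: print(find_foldx()) checks the default ~/foldx location.
```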

RDKit

RDKit is easiest to install if you're using Conda as your package manager (shown below).

TDC

TDC is required to run the DRD3 docking task. See the linked README for installation instructions.

git clone https://github.com/samuelstanton/lambo && cd lambo
conda create --name lambo-env python=3.8 -y && conda activate lambo-env
conda install -c conda-forge rdkit -y
conda install -c conda-forge pytdc pdbfixer openbabel -y
pip install -r requirements.txt --upgrade
pip install -e .

Reproducing the figures

This project uses Weights & Biases for logging. The experimental data used to produce the plots in our paper is available here.

See ./notebooks/plot_pareto_front for a demonstration of how to reproduce Figure 1.

See ./notebooks/plot_hypervolume for a demonstration of how to reproduce Figures 3 and 4.

Running the code

See ./notebooks/rfp_preprocessing.ipynb for a demonstration of how to download PDB files from the RCSB Protein Data Bank and prepare them for use with FoldX.

See ./notebooks/foldx_demo.ipynb for a demonstration of how to use our Python bindings for FoldX, given a starting sequence with known structure.

This project uses Hydra for configuration when running from the command line.

We recommend running NSGA-2 first to test your installation:

python scripts/black_box_opt.py optimizer=mf_genetic optimizer/algorithm=nsga2 task=regex tokenizer=protein

For the model-based genetic baseline, run:

python scripts/black_box_opt.py optimizer=mb_genetic optimizer/algorithm=soga optimizer.encoder_obj=mll task=regex tokenizer=protein surrogate=multi_task_exact_gp acquisition=nehvi

For the full LaMBO algorithm, run:

python scripts/black_box_opt.py optimizer=lambo optimizer.encoder_obj=mlm task=regex tokenizer=protein surrogate=multi_task_exact_gp acquisition=nehvi

To evaluate on the multi-objective RFP (large-molecule) or ZINC (small-molecule) tasks, use task=proxy_rfp tokenizer=protein and task=chem tokenizer=selfies, respectively.

To evaluate on the single-objective ZINC task used in papers like Tripp et al. (2020), run:

python scripts/black_box_opt.py optimizer=lambo optimizer.encoder_obj=lanmt task=chem_lsbo tokenizer=selfies surrogate=single_task_svgp acquisition=ei encoder=lanmt_cnn surrogate.holdout_ratio=0.1 surrogate.bs=256 surrogate.eval_bs=256 optimizer.resampling_weight=0.5 optimizer.window_size=8

Below we list the significant configuration options. See the config files in ./hydra_config for all configurable parameters. Any config field can be overridden from the command line, but not every combination of options is supported.
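When launching many runs from a script, it can help to assemble the overrides programmatically. The sketch below builds the argument list from a dictionary. `build_command` is a hypothetical helper, not part of this repository, and the override keys and values are copied from the commands in this README.

```python
import shlex
from typing import Dict, List


def build_command(overrides: Dict[str, object], script: str = "scripts/black_box_opt.py") -> List[str]:
    """Turn a dict of Hydra overrides into an argument list suitable for subprocess.run."""
    return ["python", script] + [f"{key}={value}" for key, value in overrides.items()]


# Same overrides as the full LaMBO command on the regex task.
lambo_overrides = {
    "optimizer": "lambo",
    "optimizer.encoder_obj": "mlm",
    "task": "regex",
    "tokenizer": "protein",
    "surrogate": "multi_task_exact_gp",
    "acquisition": "nehvi",
}
lambo_regex = build_command(lambo_overrides)
print(shlex.join(lambo_regex))

# Switching to the RFP task only changes the task override; the tokenizer stays `protein`.
lambo_rfp = build_command({**lambo_overrides, "task": "proxy_rfp"})
print(shlex.join(lambo_rfp))
```

Passing the resulting list to `subprocess.run(lambo_regex, check=True)` from the repository root would start the run. The helper does not validate that a combination of options is supported.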

Acquisition options

nehvi (default, multi-objective)
ehvi (multi-objective)
ei (single-objective)
greedy (single and multi-objective)
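For orientation, assuming `ei` stands for the standard expected improvement acquisition, its closed form under a Gaussian posterior can be sketched as below. This is a textbook formula for a single candidate, not the implementation used in this repository, and the function name and arguments are illustrative.

```python
import math


def expected_improvement(mean: float, std: float, best: float) -> float:
    """Closed-form EI for maximization when the prediction is N(mean, std**2)."""
    if std <= 0.0:
        # With no predictive uncertainty, EI reduces to the plain improvement.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density at z
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF at z
    return (mean - best) * cdf + std * pdf


# Two hypothetical candidates with the same predicted mean but different uncertainty.
print(expected_improvement(mean=0.0, std=1.0, best=0.0))
print(expected_improvement(mean=0.0, std=2.0, best=0.0))
```

The second candidate scores higher even though its mean is the same, which is how the acquisition rewards exploration. By contrast, `greedy` presumably ranks candidates by the predicted mean alone.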

Encoder options

mlm_cnn (default, substitutions only)
mlm_transformer (substitutions only)
lanmt_cnn (substitutions, insertions, deletions)
lanmt_transformer (substitutions, insertions, deletions)

Optimizer options

lambo (default)
mb_genetic (Genetic baseline with model-based compound screening)
mf_genetic (Model-free genetic baseline)

Algorithm options

soga (default, single-objective)
nsga2 (multi-objective)

Surrogate options

multi_task_exact_gp (default, DKL MTGP regression)
single_task_svgp (DKL SVGP regression)
single_task_exact_gp (DKL GP regression)
string_kernel_exact_gp (not recommended, SSK GP regression)
deep_ensemble (MLE regression)

Task options

regex (default, maximize counts of 3 bigrams)
regex_easy (maximize counts of 2 tokens)
chem (ZINC small molecules, maximize LogP and QED)
chem_lsbo (ZINC small molecules, maximize penalized LogP)
tdc_docking (ZINC small molecules, minimize DRD3 docking affinity and synthetic accessibility)
proxy_rfp (FPBase large molecules, maximize stability and SASA)
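To give a feel for the `regex` task, here is a toy objective that counts bigram occurrences in a sequence. It is a sketch under assumptions: the bigrams below are placeholders rather than the ones the task uses, and treating each count as its own objective is an assumption about how the task is scored.

```python
import re
from typing import Sequence, Tuple


def bigram_counts(seq: str, bigrams: Sequence[str]) -> Tuple[int, ...]:
    """Count non-overlapping occurrences of each bigram in `seq`."""
    return tuple(len(re.findall(re.escape(b), seq)) for b in bigrams)


# Placeholder bigrams chosen for illustration only.
TOY_BIGRAMS = ("AV", "VC", "CA")

print(bigram_counts("AVCAVCA", TOY_BIGRAMS))  # (2, 2, 2)
print(bigram_counts("AAAAAAA", TOY_BIGRAMS))  # (0, 0, 0)
```

Because the objective is cheap and needs no external tools, it makes a convenient smoke test before moving to the FoldX-based or docking tasks.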

Tokenizer options

protein (default, amino acid vocab for large molecules)
selfies (ZINC-derived SELFIES vocab for small molecules)
smiles (not recommended, ZINC-derived SMILES vocab for small molecules)

Tests

pytest tests

This project currently has very limited test coverage.

Citation

If you use any part of this code for your own work, please cite

@article{stanton2022accelerating,
  title={Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders},
  author={Stanton, Samuel and Maddox, Wesley and Gruver, Nate and Maffettone, Phillip and Delaney, Emily and Greenside, Peyton and Wilson, Andrew Gordon},
  journal={arXiv preprint arXiv:2203.12742},
  year={2022}
}