Diffusion Posterior Sampling via SMC for Zero-Shot Scaffolding of Protein Motifs

This repository provides an interface for solving motif scaffolding problems with an unconditional diffusion model as a prior. It defines inverse problems for a variety of tasks and solves them by sampling the posterior via sequential Monte Carlo. This permits conditional sampling without any additional training of the chosen unconditional model.

The following are the supported tasks, samplers, and likelihood formalisations for conditioning on a motif.

Motif Scaffolding Tasks

(Diffusion) Posterior Sampling Methods

Motif Scaffolding Likelihood Formalisations

Other unconditional models can be supported by creating an adapter for them in the model directory, registering them and their parameters, and adding a config in config/model. Currently, Genie-SCOPe-128 and Genie-SCOPe-256 are available. The conditional samplers assume the frame representation of the protein, so some extra engineering may be required for models that use a different representation.
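The adapter-and-registration pattern described above can be sketched roughly as follows. This is only an illustration of the general idea: the names DiffusionModel, MODEL_REGISTRY, register_model, build_model, and MyModelAdapter are hypothetical and are not taken from this repository, whose actual interfaces live in model/diffusion.py and utils/registry.py.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type


class DiffusionModel(ABC):
    """Hypothetical stand-in for the abstract class in model/diffusion.py."""

    @abstractmethod
    def max_length(self) -> int:
        """Longest protein (in residues) the model can generate."""


# Hypothetical registry mapping a config name to an adapter class.
MODEL_REGISTRY: Dict[str, Type[DiffusionModel]] = {}


def register_model(name: str) -> Callable[[Type[DiffusionModel]], Type[DiffusionModel]]:
    """Class decorator that records an adapter class under `name`."""

    def decorator(cls: Type[DiffusionModel]) -> Type[DiffusionModel]:
        if name in MODEL_REGISTRY:
            raise ValueError(f"model '{name}' is already registered")
        MODEL_REGISTRY[name] = cls
        return cls

    return decorator


@register_model("my-model")
class MyModelAdapter(DiffusionModel):
    """Hypothetical adapter wrapping some other unconditional model."""

    def __init__(self, n_max_residues: int = 128) -> None:
        self.n_max_residues = n_max_residues

    def max_length(self) -> int:
        return self.n_max_residues


def build_model(name: str, **params) -> DiffusionModel:
    """Look up an adapter by name and construct it with config parameters."""
    return MODEL_REGISTRY[name](**params)


model = build_model("my-model", n_max_residues=256)
print(type(model).__name__, model.max_length())  # MyModelAdapter 256
```

In a setup like this, the name passed to the decorator would correspond to the value given to the model option on the command line, and the keyword arguments would come from the matching file in config/model.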

Installation

To match our setup, use Python 3.9 with CUDA 11.8 or above. First, install the requirements with pip.

pip install -r requirements.txt

Then, initialise the submodules if they are not already set up.

git submodule update --init

Finally, and optionally, evaluation uses the insilico design pipeline from AQLaboratory. Run their bash scripts for installing TMScore, ProteinMPNN, and ESMFold to set up the self-consistency pipeline.

Structure

.
├── conditional/                    # Posterior samplers
│   ├── __init__.py                     # Sampler registration
│   ├── components/                     # Reusable components
│   │   ├── observation_generator.py        # for generating y sequence
│   │   └── particle_filter.py              # for filtering
│   ├── bpf.py                          # Bootstrap particle filter
│   ├── fpssmc.py                       # Filtering posterior sampling
│   ├── smcdiff.py                      # SMCDiff
│   ├── tds.py                          # Twisted diffusion sampler
│   └── wrapper.py                      # Abstract class for samplers
├── config/                         # Configs
│   ├── config.yaml                     # Main config file
│   ├── experiments/                    # Experiment config groups
│   │   └── ...
│   └── model/                          # Model config groups
│       └── ...
├── data/                           # Motif scaffolding data   
│   ├── motif_problems/                 # RFDiffusion benchmark
│   │   └── ...
│   ├── multi_motif_problems/           # Genie2 benchmark
│   │   └── ...
│   └── symmetric_motif_problems/       # RFDiffusion trimeric covid binder
│       └── ...
├── experiments/                    # Experiments
│   ├── __init__.py                     # Experiment registration
│   └── experiments.py                  # Experiment definitions
├── model/                          # Diffusion model
│   ├── __init__.py                     # Model registration
│   ├── diffusion.py                    # Abstract class for diffusion models
│   └── genie.py                        # Genie adapter
├── multirun/                       # Output of multirun/sweeping experiments
│   └── ...
├── outputs/                        # Output of experiments
│   └── ...
├── protein/                        # Protein-related functions
│   └── frames.py                       # Abstract class for frames
├── scripts/                        # Scripts for config generation, etc.
│   └── ...
├── submodules/                     # Git submodules
│   └── ...
├── utils/                          # Utility functions
│   ├── path.py                         # for resolving paths
│   ├── pdb.py                          # for working with PDBs
│   ├── registry.py                     # for handling registrations
│   ├── resampling.py                   # for low-variance resampling
│   └── symmetry.py                     # for dealing with symmetry
└── main.py                         # Main entry point
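The tree lists utils/resampling.py as providing low-variance resampling for the particle filters. As a rough illustration of what such a step usually looks like, the sketch below implements systematic resampling with NumPy. It is an assumption that the repository uses this particular scheme, and the function name low_variance_resample is hypothetical.

```python
import numpy as np


def low_variance_resample(weights, rng):
    """Return K ancestor indices drawn by systematic resampling.

    A single uniform offset is shared by K evenly spaced positions in [0, 1),
    so particle i is selected either floor(K * w_i) or ceil(K * w_i) times.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise in case the weights do not sum to one
    k = w.size
    positions = (rng.random() + np.arange(k)) / k
    cumulative = np.cumsum(w)
    cumulative[-1] = 1.0  # guard against floating-point round-off
    # side="right" ensures particles with zero weight are never selected
    return np.searchsorted(cumulative, positions, side="right")


rng = np.random.default_rng(0)
indices = low_variance_resample([0.1, 0.2, 0.3, 0.4], rng)
print(indices)
```

Compared with multinomial resampling, this scheme introduces less extra randomness per step, which is why it is a common default in particle filters.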

Usage

The project uses the Hydra framework for handling different experimental setups. Configuration files and groups are defined under the config folder.

To get started, the following command will show the available options for config groups, e.g. an experiment type, as well as the default parameters set.

python3 main.py --help

Supported Experiments

Experiments are selected with the option experiment={experiment_name}. Check their arguments and defaults in config/experiments.

Experiment                        Description
sample_unconditional              Sample unconditional samples from the diffusion model. Total length must be specified.
sample_given_motif                Sample conditioned on a motif being present in the samples. Motif config files have specifications like in RFDiffusion.
sample_given_multiple_motifs      Sample conditioned on multiple motifs being present in the samples. Motif config files have specifications like in Genie2.
sample_given_symmetry             Sample conditioned on the samples following a point symmetry.
sample_given_motif_and_symmetry   Sample conditioned on a motif being present in the samples and on the samples following a point symmetry. Motif specification is for a single monomer.
evaluate_samples                  Evaluate motif scaffolding results using the insilico design pipeline.

Examples

Sample 16 proteins with 96 residues each using unconditional model Genie-SCOPe-128 (default if unspecified) on GPU device #1.

python3 main.py experiment=sample_unconditional \
    experiment.n_samples=16 \
    experiment.sample_length=96 \
    model=genie-scope-128 \
    model.device=cuda:1

Scaffold motif problem 3IXT using TDS with masking likelihood, twist scale=2.0, and K=8 particles.

python3 main.py experiment=sample_given_motif \
    experiment/motif=3IXT \
    experiment/conditional_method=tds-mask \
    experiment.conditional_method.twist_scale=2.0 \
    experiment.n_samples=8

Produce 16 scaffolds for motif problem 1PRW using TDS with distance likelihood and K=8 particles.

python3 main.py experiment=sample_given_motif \
    experiment/motif=1PRW \
    experiment/conditional_method=tds-distance \
    experiment.conditional_method.n_batches=16 \
    experiment.n_samples=128
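The examples above switch between a masking likelihood (tds-mask) and a distance likelihood (tds-distance). As a rough intuition only, and not the repository's implementation, the sketch below contrasts a residual computed on motif coordinates at fixed positions with a residual computed on pairwise distances between motif residues. The function names and the toy data are hypothetical. The point it illustrates is that a distance-based residual does not change when the whole sample is rotated, whereas a coordinate-based one does.

```python
import numpy as np


def mask_residual(sample, motif, motif_idx):
    """Squared error between the motif and the sample coordinates at motif_idx."""
    return float(np.sum((sample[motif_idx] - motif) ** 2))


def pairwise_distances(x):
    """All-against-all Euclidean distances for an (n, 3) coordinate array."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def distance_residual(sample, motif, motif_idx):
    """Squared error between pairwise distance matrices of sample and motif."""
    delta = pairwise_distances(sample[motif_idx]) - pairwise_distances(motif)
    return float(np.sum(delta ** 2))


rng = np.random.default_rng(0)
motif = rng.normal(size=(4, 3))        # toy motif with 4 residues
sample = rng.normal(size=(10, 3))      # toy backbone with 10 residues
motif_idx = np.array([2, 3, 4, 5])
sample[motif_idx] = motif              # place the motif exactly in the sample

# Rotate the whole sample by 90 degrees about the z axis.
rot = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
rotated = sample @ rot.T

print(mask_residual(sample, motif, motif_idx))       # zero: motif is in place
print(mask_residual(rotated, motif, motif_idx))      # positive: pose changed
print(distance_residual(rotated, motif, motif_idx))  # about zero: shape kept
```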

Scaffold motif problem 5TPN, allowing the motif to be placed anywhere, using TDS with masking likelihood and K=8 particles.

python3 main.py experiment=sample_given_motif \
    experiment/motif=5TPN \
    experiment.fixed_motif=False \
    experiment/conditional_method=tds-mask \
    experiment.n_samples=8

Scaffold multi-motif problem 1PRW_two using TDS with frame-based distance likelihood and K=8 particles.

python3 main.py experiment=sample_given_multiple_motifs \
    experiment/multi_motif=1PRW_two \
    experiment/conditional_method=tds-frame-distance \
    experiment.n_samples=8 \
    model=genie-scope-256

Sample a monomer with 250 residues and C-5 internal symmetry using FPS-SMC with K=16 particles.

python3 main.py experiment=sample_given_symmetry \
    model=genie-scope-256 \
    experiment.sample_length=250 \
    experiment.symmetry=C-5 \
    experiment/conditional_method=fpssmc \
    experiment.n_samples=16
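To make the symmetry constraint concrete, the sketch below builds a toy set of 250 coordinates with C-5 symmetry by rotating one 50-residue unit in steps of 72 degrees. It assumes the symmetry axis is the z axis and that the 250 residues split into five equal repeats; both are assumptions made for illustration, and cyclic_copies is a hypothetical helper, not a function from utils/symmetry.py.

```python
import numpy as np


def rotation_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def cyclic_copies(unit, n_fold):
    """Stack n_fold rotated copies of `unit`, an (n, 3) coordinate array."""
    copies = [unit @ rotation_z(2.0 * np.pi * k / n_fold).T for k in range(n_fold)]
    return np.concatenate(copies, axis=0)


rng = np.random.default_rng(0)
# Toy 50-residue unit, shifted away from the symmetry axis.
unit = rng.normal(size=(50, 3)) + np.array([10.0, 0.0, 0.0])
full = cyclic_copies(unit, 5)
print(full.shape)  # (250, 3)

# Rotating by 72 degrees maps each repeat onto the next one.
rotated = full @ rotation_z(2.0 * np.pi / 5).T
print(np.allclose(rotated, np.roll(full, -50, axis=0)))  # True
```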

Evaluate samples from unconditional, single-motif scaffolding, or multi-motif scaffolding experiments using the insilico design pipeline with CUDA visible devices #0, #2, and #3.

python3 main.py experiment=evaluate_samples \
    experiment.path_to_experiment=<path_to_hydra_output_folder> \
    'experiment.gpu_devices=[0,2,3]'

Each conditional method and model has its own default hyperparameters, which can be overridden on the command line. Check their config files for more information. Custom motifs can also be scaffolded by creating a config file in config/experiment/motif that follows the specification of the existing configs in that directory. Multiple motifs work the same way, except that their configs are stored in config/experiment/multi_motif.