Awesome
Diffusion Posterior Sampling via SMC for Zero-Shot Scaffolding of Protein Motifs
This repository provides an interface for solving motif scaffolding problems with an unconditional diffusion model as a prior. It defines inverse problems for a variety of tasks and solves them by sampling the posterior via sequential Monte Carlo. This permits conditional sampling without additional training to the chosen unconditional model.
The following are the supported tasks, samplers, and likelihood formalisations for conditioning on a motif.
Motif Scaffolding Tasks
- (Single-) motif scaffolding
- Multi-motif scaffolding
- Symmetric motif scaffolding
- Inpainting/scaffolding with degrees of freedom
(Diffusion) Posterior Sampling Methods
- Bootstrap Particle Filter
- SMCDiff (Trippe et al., 2023)
- Filtering Posterior Sampling (Dou & Song, 2024)
- Twisted Diffusion Sampler (Wu et al., 2023)
- Monte Carlo Guided Diffusion (Cardoso et al., 2023)
Motif Scaffolding Likelihood Formalisations
- Masking (as done by previous methods)
- Condition on a partial view of the backbone with a fixed orientation.
- Distance
- Condition on the pairwise residue distances (but is reflection-invariant).
- Frame-based distance
- Condition on the pairwise residue distances and pair-wise residue rotation matrix deviations from the rigid body frame representation.
Additionally, other unconditional models can be supported by creating an adapter for them in model
, registering them and their parameters, and adding a config in config/model
. Currently, we have Genie-SCOPe-128 and Genie-SCOPe-256 available. The conditional samplers assume the frame representation of the protein, so some extra engineering may be required for other models.
Installation
To match our setup, use Python 3.9 with CUDA version 11.8 and above. First, pip install the requirements.
pip install -r requirements.txt
Then, initialise the submodules if not already setup.
git submodule update --init
Finally, but optionally, we use the insilico design pipeline from AQLaboratory for evaluation. Run their bash scripts for installing TMScore, ProteinMPNN, and ESMFold to set up the self-consistency pipeline.
Structure
.
├── conditional/ # Posterior samplers
│ ├── __init__.py # Sampler registration
│ ├── components/ # Reusable components
│ │ ├── observation_generator.py # for generating y sequence
│ │ └── particle_filter.py # for filtering
│ ├── bpf.py # Bootstrap particle filter
│ ├── fpssmc.py # Filtering posterior sampling
│ ├── smcdiff.py # SMCDiff
│ ├── tds.py # Twisted diffusion sampler
│ └── wrapper.py # Abstract class for samplers
├── config/ # Configs
│ ├── config.yaml # Main config file
│ ├── experiments/ # Experiment config groups
│ │ └── ...
│ └── model/ # Model config groups
│ └── ...
├── data/ # Motif scaffolding data
│ ├── motif_problems/ # RFDiffusion benchmark
│ │ └── ...
│ ├── multi_motif_problems/ # Genie2 benchmark
│ │ └── ...
│ └── symmetric_motif_problems/ # RFDiffusion trimeric covid binder
│ └── ...
├── experiments/ # Experiments
│ ├── __init__.py # Experiment registration
│ └── experiments.py # Experiment definitions
├── model/ # Diffusion model
│ ├── __init__.py # Model registration
│ ├── diffusion.py # Abstract class for diffusion models
│ └── genie.py # Genie adapter
├── multirun/ # Output of multirun/sweeping experiments
│ └── ...
├── outputs/ # Output of experiments
│ └── ...
├── protein/ # Protein-related functions
│ └── frames.py # Abstract class for frames
├── scripts/ # Scripts for config generation, etc.
│ └── ...
├── submodules/ # Git submodules
│ └── ...
├── utils/ # Utility functions
│ ├── path.py # for resolving paths
│ ├── pdb.py # for working with PDBs
│ ├── registry.py # for handling registrations
│ ├── resampling.py # for low-variance resampling
│ └── symmetry.py # for dealing with symmetry
└── main.py # Main entry point
Usage
The project uses the Hydra framework for handling different experimental setups. Configuration files and groups are defined under the config
folder.
To get started, the following command will show the available options for config groups, e.g. an experiment type, as well as the default parameters set.
python3 main.py --help
Supported Experiments
Configured through the option experiment={experiment_name}
. Check their arguments and defaults in config/experiments
.
Experiment | Description |
---|---|
sample_unconditional | Sample unconditional samples from the diffusion model. Total length must be specified. |
sample_given_motif | Sample conditioned on a motif being present in the samples. Motif config files have specifications like in RFDiffusion. |
sample_given_multiple_motifs | Sample conditioned on multiple motifs being present on the samples. Motif config files have specifications like in Genie2. |
sample_given_symmetry | Sample conditioned on the samples following a point symmetry. |
sample_given_motif_and_symmetry | Sample conditioned on a motif being present in the samples and them following a point symmetry. Motif specification is for a single monomer. |
evaluate_samples | Evaluate motif scaffolding results using insilico design pipeline. |
Examples
Sample 16 proteins with 96 residues each using unconditional model Genie-SCOPe-128 (default if unspecified) on GPU device #1.
python3 main.py experiment=sample_unconditional \
experiment.n_samples=16 \
experiment.sample_length=96 \
model=genie-scope-128 \
model.device=cuda:1
Scaffold motif problem 3IXT using TDS with masking likelihood, twist scale=2.0, and K=8 particles.
python3 main.py experiment=sample_given_motif \
experiment/motif=3IXT \
experiment/conditional_method=tds-mask \
experiment.conditional_method.twist_scale=2.0 \
experiment.n_samples=8
Produce 16 scaffolds for motif problem 1PRW using TDS with distance likelihood and K=8 particles.
python3 main.py experiment=sample_given_motif \
experiment/motif=1PRW \
experiment/conditional_method=tds-distance \
experiment.conditional_method.n_batches=16
experiment.n_samples=128
Scaffold motif problem 5TPN, allowing the motif to be placed anywhere, using TDS with masking likelihood and K=8 particles.
python3 main.py experiment=sample_given_motif \
experiment/motif=5TPN \
experiment.fixed_motif=False \
experiment/conditional_method=tds-mask \
experiment.n_samples=8
Scaffold multi-motif problem 1PRW_two using TDS with frame-based distance likelihood and K=8 particles.
python3 main.py experiment=sample_given_multiple_motifs \
experiment/multi_motif=1PRW_two \
experiment/conditional_method=tds-frame-distance \
experiment.n_samples=8 \
model=genie-scope-256
Sample a monomer with 250 residues and C-5 internal symmetry using FPS-SMC with K=16 particles.
python3 main.py experiment=sample_given_symmetry \
model=genie-scope-256 \
experiment.sample_length=250 \
experiment.symmetry=C-5 \
experiment/conditional_method=fpssmc \
experiment.n_samples=16
Evaluate samples from unconditional, single-motif scaffolding, or multi-motif scaffolding experiments using the insilico design pipeline with CUDA visible devices #0, #2, and #3
python3 main.py experiment=evaluate_samples \
experiment.path_to_experiment=<path_to_hydra_output_folder> \
experiment.gpu_devices=\[0, 2, 3\]
Each of the conditional methods and models have their own default hyperparameters which can be overwritten in the command-line. Check out their config files for more info. Custom motifs can also be scaffolded by creating a config file in configs/experiment/motif
following the specification of configs in that directory. The case is similar with multiple motifs, except they are stored in configs/experiment/multi_motif
.