<div align="center">MMDiff: Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion (NeurIPS MLSB 2023)
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a> <a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white"></a> <a href="https://hydra.cc/"><img alt="Config: Hydra" src="https://img.shields.io/badge/Config-Hydra-89b8cd"></a>
</div>

## Description
Official PyTorch implementation of *Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion*.
<div align="center"> </div> <details open><summary><b>Table of contents</b></summary>- Creating a Virtual Environment
- Installing RoseTTAFold2NA
- Preparing Datasets
- Training New Models
- Sampling and Evaluation with Trained Models
- Acknowledgements
- Citations
## How to install <a name="virtual-environment-creation"></a>
### Install Mamba

```bash
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh  # accept all terms and install to the default location
rm Mambaforge-$(uname)-$(uname -m).sh  # (optionally) remove the installer after using it
source ~/.bashrc  # alternatively, one can restart their shell session to achieve the same result
```
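To confirm Mamba is now available in your shell:

```bash
mamba --version
```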
### Install dependencies

```bash
# clone project
git clone https://github.com/Profluent-Internships/MMDiff
cd MMDiff

# create conda environment
mamba env create -f environment.yaml
conda activate MMDiff  # note: one still needs to use `conda` to (de)activate environments

# install local project as package
pip3 install -e .
```
### Download pre-trained checkpoints

**Note**: Make sure to be located in the project's root directory beforehand (e.g., `~/MMDiff/`).

```bash
# fetch and extract the model checkpoints directory
mkdir -p checkpoints  # ensure the target directory exists before extracting into it
wget https://zenodo.org/record/8247932/files/MMDiff_Release_Checkpoints.tar.gz
tar -xzf MMDiff_Release_Checkpoints.tar.gz -C checkpoints
rm MMDiff_Release_Checkpoints.tar.gz
```
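To verify that the checkpoints were extracted correctly:

```bash
ls -lh checkpoints/
```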
### Install US-align and qTMclust to cluster generated structures

```bash
cd $MY_PROGRAMS_DIR  # download US-align to your choice of directory (e.g., `~/Programs/`)
git clone https://github.com/pylelab/USalign.git && cd USalign/ && git checkout 97325d3aad852f8a4407649f25e697bbaa17e186
g++ -static -O3 -ffast-math -lm -o USalign USalign.cpp
g++ -static -O3 -ffast-math -lm -o qTMclust qTMclust.cpp
```
**Note**: Make sure to update the `usalign_exec_path` and `qtmclust_exec_path` values in e.g., `configs/paths/default.yaml` to reflect where you have placed the US-align and qTMclust executables on your machine.
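For reference, the updated entries might look like the following (the paths below are examples; point them at wherever you built the executables):

```yaml
usalign_exec_path: /home/<user>/Programs/USalign/USalign
qtmclust_exec_path: /home/<user>/Programs/USalign/qTMclust
```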
## How to install RoseTTAFold2NA to score generated sequences <a name="rf2na-installation"></a>
```bash
cd forks/RoseTTAFold2NA/

# create a conda environment for RoseTTAFold2NA
conda deactivate
mamba env create -f RF2na-linux.yml --prefix RF2NA/
conda activate RF2NA/

# install the SE(3)-Transformer locally to run RoseTTAFold2NA
cd SE3Transformer/
pip install --no-cache-dir -r requirements.txt
python setup.py install
cd ..

# download pre-trained weights locally
cd network
wget https://files.ipd.uw.edu/dimaio/RF2NA_apr23.tgz
tar xvfz RF2NA_apr23.tgz
du -sh weights/  # note: this directory should contain a 1.1GB weights file
cd ..

# download sequence and structure databases
# note: downloading these databases is unnecessary if one will only be using RF2NA's single-sequence mode

# UniRef30 [46G]
wget http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz
mkdir -p UniRef30_2020_06
tar xfz UniRef30_2020_06_hhsuite.tar.gz -C ./UniRef30_2020_06

# BFD [272G]
wget https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
mkdir -p bfd
tar xfz bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz -C ./bfd

# structure templates (including *_a3m.ffdata, *_a3m.ffindex)
wget https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2021Mar03.tar.gz
tar xfz pdb100_2021Mar03.tar.gz

# RNA databases
mkdir -p RNA
cd RNA

# Rfam [300M]
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.full_region.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
gunzip Rfam.cm.gz
cmpress Rfam.cm

# RNAcentral [12G]
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/rfam/rfam_annotations.tsv.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/id_mapping.tsv.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/sequences/rnacentral_species_specific_ids.fasta.gz
../input_prep/reprocess_rnac.pl id_mapping.tsv.gz rfam_annotations.tsv.gz  # ~8 minutes
gunzip -c rnacentral_species_specific_ids.fasta.gz | makeblastdb -in - -dbtype nucl -parse_seqids -out rnacentral.fasta -title "RNACentral"

# nt [151G]
update_blastdb.pl --decompress nt

cd ../../../  # return to the root directory of the project (i.e., up from forks/RoseTTAFold2NA/RNA/)
```
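Optionally, one can sanity-check the RoseTTAFold2NA installation by folding a small complex in single-sequence mode. The exact arguments are defined by the fork's run script, so treat the invocation below as a sketch: the FASTA filenames are placeholders, and the `R:` prefix marks an RNA chain, mirroring the chain-type prefixes described in the sampling outputs section.

```bash
# illustrative only -- consult forks/RoseTTAFold2NA/run_RF2NA.sh for the authoritative usage
cd forks/RoseTTAFold2NA/
./run_RF2NA.sh example_prediction/ protein_chain.fa R:rna_chain.fa
cd ../../
```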
## How to prepare input data <a name="dataset-preparation"></a>
### How to download protein input data <a name="protein-dataset-preparation"></a>
To get the protein portion of the training dataset, first download the PDB and then preprocess it using our provided scripts. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. Navigate down to "Download Protocols" and follow the download instructions depending on your location. Note that our scripts assume you have downloaded the PDB in mmCIF format.
**WARNING**: Downloading the PDB can take up to 1TB of space.
After downloading, you should have a directory formatted like this: https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/

```
00/
01/
02/
..
zz/
```
In this directory, unzip all the files:

```bash
find . -name "*.gz" -exec gzip -d {} +
```
Then run the following with `<mmcif_dir>` replaced with the location of the files downloaded from the PDB:

```bash
python src/data/components/pdb/process_pdb_mmcif_files.py --mmcif_dir <mmcif_dir> --write_dir <processed_dir> --num_processes 50 --skip_existing --verbose
```
See the script for more options. Each mmCIF file will be written as a pickle file that we read and process in the data loading pipeline. A `protein_metadata.csv` will be saved that contains the pickle path of each protein example, as well as additional information about each example for faster filtering.
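To take a quick look at which columns `protein_metadata.csv` exposes for filtering (a convenience one-liner; the exact column set is determined by the processing script):

```bash
head -n 1 <processed_dir>/protein_metadata.csv | tr ',' '\n'
```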
### How to enable PDB clustering <a name="protein-dataset-clustering"></a>
To use clustered training data, download the clusters at 30% sequence identity from the RCSB. This download link also works at the time of writing:

https://cdn.rcsb.org/resources/sequence/clusters/clusters-by-entity-30.txt

Place this file in `data/processed_pdb` or anywhere else in your file system, then update your config to point to the clustered data:

```yaml
data:
  cluster_path: data/processed_pdb/clusters-by-entity-30.txt
```
To use clustered data, set `sample_mode` to either `cluster_time_batch` or `cluster_length_batch`. See the next section for details.
### Setting the batching mode <a name="dataset-batching"></a>
```yaml
experiment:
  # Use one of the following:
  # Each batch contains multiple time steps of the same protein.
  sample_mode: time_batch
  # Each batch contains multiple proteins of the same length.
  sample_mode: length_batch
  # Each batch contains multiple time steps of a protein from a cluster.
  sample_mode: cluster_time_batch
  # Each batch contains multiple clusters of the same length.
  sample_mode: cluster_length_batch
```
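Since configuration is managed with Hydra, the batching mode can also be switched directly from the command line. Assuming the config structure shown above, an override might look like:

```bash
python src/train.py experiment.sample_mode=cluster_time_batch
```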
### How to download nucleic acid input data <a name="na-dataset-downloading"></a>
```bash
# download complexes from the PDB for training
python notebooks/creating_protein_na_datasets_from_the_pdb.py
```
### How to analyze input data <a name="na-dataset-analysis"></a>
```bash
# download Pfam databases and related metadata
pfam_db_dir=~/Databases/Pfam
mkdir -p $pfam_db_dir
curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz | gunzip > $pfam_db_dir/Pfam-A.hmm
curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz | gunzip > $pfam_db_dir/Pfam-A.hmm.dat
curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/active_site.dat.gz | gunzip > $pfam_db_dir/active_site.dat
cd $pfam_db_dir
hmmpress Pfam-A.hmm

# execute the analysis script to attach Pfam annotations to existing dataset metadata
python notebooks/analyze_dataset_characteristics.py
```
### How to process nucleic acid input data <a name="na-dataset-preparation"></a>
First, group each of the downloaded PDB files by their two middle alphanumeric characters (e.g., `1abc.pdb` is moved into a subdirectory named `ab/`), with `<pdb_dir>` replaced with the location of the downloaded PDB files:

```bash
cd <pdb_dir> && find . -maxdepth 1 -type f -name "*.pdb" -exec sh -c 'dir=$(basename {} | awk -F"[._]" "{print substr(\$1, 2, 2)}"); mkdir -p "$dir"; mv {} "$dir"' \;
```

Then, run the following with `<pdb_dir>` replaced with the location of the grouped PDB files:

```bash
python src/data/components/pdb/process_pdb_na_files.py --pdb_dir <pdb_dir> --write_dir <processed_dir> --num_processes 50 --skip_existing --verbose
```
See the script for more options. Each PDB file will be written as a pickle file that we read and process in the data loading pipeline. A `na_metadata.csv` will be saved that contains the pickle path of each nucleic acid example, as well as additional information about each example for faster filtering.
**Note**: The `process_pdb_na_files.py` script must be run after first running the `process_pdb_mmcif_files.py` script, to ensure that the nucleic acid complexes curated by the `creating_protein_na_datasets_from_the_pdb.py` script overwrite those collected by `process_pdb_mmcif_files.py`. This de-duplicates approximately 2,500 PDB complexes that are assembled by both `process_pdb_mmcif_files.py` (first) and `creating_protein_na_datasets_from_the_pdb.py` (second, and preferred, since this collection includes nucleic acid molecules).
### How to combine protein and nucleic acid metadata for dataloading <a name="protein-na-dataset-combination"></a>
Lastly, when building the dataset from scratch, we need to combine the `protein_metadata.csv` and `na_metadata.csv` files by joining the two on their `pdb_name` columns and dropping duplicate rows contained in `protein_metadata.csv` (i.e., preferring to keep entries in `na_metadata.csv` over those in `protein_metadata.csv`). One can do so as follows:

```bash
python src/data/components/pdb/join_pdb_metadata.py na_metadata_csv_path=<processed_dir>/na_metadata.csv protein_metadata_csv_path=<processed_dir>/protein_metadata.csv metadata_output_csv_path=<processed_dir>/metadata.csv
```
After the script above is finished, the resulting `metadata.csv` file should also be placed in `metadata/` as `PDB_NA_Dataset.csv` to finalize the information required for the model's dataloading pipeline, as shown below:

```bash
cp <processed_dir>/metadata.csv metadata/PDB_NA_Dataset.csv
```
## How to train <a name="model-training"></a>
Train model with default configuration:

```bash
# train on CPU
python src/train.py trainer=cpu

# train on GPU
python src/train.py trainer=gpu
```
Train model with chosen experiment configuration from `configs/experiment/`. For example:

```bash
python src/train.py experiment=pdb_prot_na_gen_se3
```
**Note**: You can override any parameter from the command line like this:

```bash
python src/train.py trainer.max_epochs=20 data.batch_size=64
```
## How to sample <a name="model-sampling-and-eval"></a>
`sample.py` is the inference script, which can be run as follows:

```bash
python src/sample.py
```
For example, to reproduce the sampling experiments in our manuscript, one can run the inference script as follows for protein-only, nucleic acid-only, and protein-nucleic acid sequence-structure diffusion, respectively:

```bash
# protein-only sampling and evaluation
python src/sample.py inference.name='protein_only_sequence_structure_se3_discrete_diffusion_stratified_eval_${now:%Y-%m-%d}_${now:%H-%M-%S}' inference.seed=123 inference.run_statified_eval=true inference.filter_eval_split=true inference.run_self_consistency_eval=true inference.run_diversity_eval=true inference.run_novelty_eval=true inference.output_dir=\'./inference_eval_outputs/\' inference.samples.min_length=10 inference.samples.max_length=120 inference.samples.num_length_steps=10 inference.samples.min_num_chains=1 inference.samples.max_num_chains=4 paths.usalign_exec_path=$HOME/Programs/USalign/USalign paths.qtmclust_exec_path=$HOME/Programs/USalign/qTMclust data.data_cfg.filtering.mmcif_allowed_oligomer=[notype] data.data_cfg.filtering.allowed_molecule_types=[protein] ckpt_path=checkpoints/protein_na_sequence_structure_g42jpyug_rotations_epoch_286.ckpt

# nucleic acid-only sampling and evaluation
python src/sample.py inference.name='na_only_sequence_structure_se3_discrete_diffusion_stratified_eval_${now:%Y-%m-%d}_${now:%H-%M-%S}' inference.seed=456 inference.run_statified_eval=true inference.filter_eval_split=true inference.run_self_consistency_eval=true inference.run_diversity_eval=true inference.run_novelty_eval=true inference.output_dir=\'./inference_eval_outputs/\' inference.samples.min_length=10 inference.samples.max_length=120 inference.samples.num_length_steps=10 inference.samples.min_num_chains=1 inference.samples.max_num_chains=4 paths.usalign_exec_path=$HOME/Programs/USalign/USalign paths.qtmclust_exec_path=$HOME/Programs/USalign/qTMclust data.data_cfg.filtering.mmcif_allowed_oligomer=[notype] data.data_cfg.filtering.allowed_molecule_types=[na] inference.measure_auxiliary_na_metrics=true ckpt_path=checkpoints/protein_na_sequence_structure_g42jpyug_rotations_epoch_286.ckpt

# protein-nucleic acid sampling and evaluation
python src/sample.py inference.name='protein_na_sequence_structure_se3_discrete_diffusion_stratified_eval_${now:%Y-%m-%d}_${now:%H-%M-%S}' inference.seed=789 inference.run_statified_eval=true inference.filter_eval_split=true inference.run_self_consistency_eval=true inference.run_diversity_eval=true inference.run_novelty_eval=true inference.output_dir=\'./inference_eval_outputs/\' inference.samples.min_length=10 inference.samples.max_length=120 inference.samples.num_length_steps=10 inference.samples.min_num_chains=1 inference.samples.max_num_chains=4 paths.usalign_exec_path=$HOME/Programs/USalign/USalign paths.qtmclust_exec_path=$HOME/Programs/USalign/qTMclust data.data_cfg.filtering.mmcif_allowed_oligomer=[notype] data.data_cfg.filtering.allowed_molecule_types=[protein,na] ckpt_path=checkpoints/protein_na_sequence_structure_g42jpyug_rotations_epoch_286.ckpt
```
The config for inference is in `configs/sample.yaml`; see the config for the different inference options. By default, inference will use the model weights `checkpoints/protein_na_sequence_structure_g42jpyug_rotations_epoch_286.ckpt` for protein-nucleic acid joint generation of sequence and structure. Simply change the `ckpt_path` to use your custom weights (e.g., for structure-only generation) as desired:

```yaml
ckpt_path: <path>
```

**Note**: Evaluation with RoseTTAFold2NA for assessing designability does not allow colons (`:`) in the names of any FASTA file inputs. Keep this in mind when choosing your value for `inference.name` or `inference.output_dir`.
Samples will be saved to `inference.output_dir` as specified in `sample.yaml`. By default, it is set to `./inference_outputs/`. Sample outputs will be saved as follows:
```
inference_outputs
└── 10D_07M_2023Y_20h_46m_13s                 # date and time of inference
    └── length_100                            # sampled length
        ├── sample_0                          # sample ID for this length
        │   ├── self_consistency              # self-consistency results
        │   │   ├── rf2na                     # RoseTTAFold2NA prediction using generated chain sequences
        │   │   │   ├── sample_1.pdb
        │   │   ├── {protein_,na_,}sample_1.pdb
        │   │   ├── sc_results.csv            # summary metrics CSV
        │   │   └── seqs
        │   │       ├── {P:,S:,R:,PR:,}sample_0_chain_a.fa     # generated sequence for each generated chain
        │   │       ├── {P:,S:,R:,PR:,}sample_0_chain_b.fa     # note: `P:` -> protein; `S:` -> single-stranded DNA; `R:` -> RNA; and `PR:` -> protein-RNA
        │   │       ├── {P:,S:,R:,PR:,}sample_0_chain_a.fasta  # note: MMDiff sequence when `inference.generate_protein_sequences_using_pmpnn` is `true` for protein-only generation
        │   │       ├── {P:,S:,R:,PR:,}sample_0_chain_b.fasta
        │   │       └── ...
        │   ├── {protein_,na_,}bb_traj_1.pdb  # `x_{t-1}` diffusion trajectory
        │   ├── {protein_,na_,}sample_1.pdb   # final sample
        │   └── x0_traj_1.pdb                 # `x_0` model prediction trajectory
        └── sample_1                          # next sample
```
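To gather all final samples (excluding the self-consistency predictions) for downstream analysis, a convenience one-liner like the following may be handy, assuming the default output layout above:

```bash
find ./inference_outputs -name "*sample_*.pdb" -not -path "*self_consistency*"
```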
## Acknowledgements <a name="acknowledgements"></a>

`MMDiff` builds upon the source code and data of several other open-source projects. We thank all their contributors and maintainers!
## Citing this work <a name="citations"></a>

If you use the code or data associated with this package or otherwise find such work useful, please cite:

```bibtex
@inproceedings{morehead2023towards,
  title={Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion},
  author={Morehead, Alex and Bhatnagar, Aadyot and Ruffolo, Jeffrey A. and Madani, Ali},
  booktitle={NeurIPS Machine Learning in Structural Biology Workshop},
  year={2023}
}
```