EnzymeFlow: Generating Reaction-specific Enzyme Catalytic Pockets through Flow-Matching and Co-Evolutionary Dynamics
See my newest work, GENZyme, on pocket design and pocket inpainting for full enzyme design.
The EnzymeFlow paper is available on arXiv.
Requirements
python>=3.11
CUDA=12.1
torch==2.4.1 (>=2.0.0)
torch_geometric==2.4.0
pip install mdtraj==1.10.0 (install this first: it also installs numpy and scipy, and installing it later might raise dependency issues)
pip install pytorch-warmup==0.1.1
pip install POT==0.9.4
pip install rdkit==2023.9.5
pip install biopython==1.84
pip install tmtools==0.2.0
pip install geomstats==2.7.0
pip install dm-tree==0.1.8
pip install ml_collections==0.1.1
pip install OpenMM
pip install einx
pip install einops
conda install conda-forge::pdbfixer
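After installing, it can help to confirm that the pinned versions are the ones actually present in the environment. The following is a minimal sketch, not part of the repository: the helper name check_pins is hypothetical, and the table only lists the exact pins from the list above (torch is left out on the assumption that its reported version may carry a CUDA build suffix such as +cu121).

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the requirement list above.
PINNED = {
    "mdtraj": "1.10.0",
    "pytorch-warmup": "0.1.1",
    "POT": "0.9.4",
    "rdkit": "2023.9.5",
    "biopython": "1.84",
    "tmtools": "0.2.0",
    "geomstats": "2.7.0",
    "dm-tree": "0.1.8",
    "ml_collections": "0.1.1",
}


def check_pins(pins, get_version=version):
    """Return {package: problem} for every pin that is missing or mismatched."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if found != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    for name, message in check_pins(PINNED).items():
        print(f"{name}: {message}")
```

An empty result means every listed pin matches; anything printed points at a package to reinstall.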
Model Training
- Please refer to the Data Preparation section below to see how we prepare the training data.
- configs.py contains all training configurations and hyperparameters.
- Train the model using train_ddp.py for parallel training on multiple GPUs (we trained with 4 A40 GPUs).
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train_ddp.py
- Training loads a pre-trained model. You may also train from scratch by setting ckpt_from_pretrain=False and pretrain_ckpt_path=None in configs.py.
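The two training-from-scratch settings interact: when ckpt_from_pretrain is False, no checkpoint path should be needed. The sketch below illustrates that logic under stated assumptions. TrainConfig, from_scratch, resolve_init_checkpoint, and the placeholder path are all hypothetical stand-ins; the real configs.py may be organised differently (for example with ml_collections).

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical stand-in for the two relevant fields in configs.py."""
    ckpt_from_pretrain: bool = True
    pretrain_ckpt_path: Optional[str] = "path/to/pretrained.ckpt"  # placeholder


def from_scratch(cfg: TrainConfig) -> TrainConfig:
    """Return a copy of cfg configured to train without pre-trained weights."""
    return replace(cfg, ckpt_from_pretrain=False, pretrain_ckpt_path=None)


def resolve_init_checkpoint(cfg: TrainConfig) -> Optional[str]:
    """Return the checkpoint to load, or None when training from scratch."""
    if not cfg.ckpt_from_pretrain:
        return None
    if cfg.pretrain_ckpt_path is None:
        raise ValueError("ckpt_from_pretrain=True requires pretrain_ckpt_path")
    return cfg.pretrain_ckpt_path
```

Failing early when the flag is on but the path is missing avoids starting a multi-GPU job that silently trains from random weights.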
Model Weights
A mini-EnzymeFlow checkpoint is available on Google Drive. Once you download it, put it in the ./checkpoint folder.
Model Inference
An EnzymeFlow inference demo is provided as a Jupyter notebook.
An inference demo for unseen reactions is also provided as a Jupyter notebook; you only need to generate a ligand.mol2 file.
Baseline Experiments
1. RFDiff-AA
For RFDiffAA and LigandMPNN, please refer to RFDiffAA-official and LigandMPNN-official. For each enzyme-reaction pair in evaluation data, we use RFDiffAA with default params to generate 100 catalytic pockets (with 32 residues) for each unique substrate. Then we use LigandMPNN to perform sequence prediction (inverse folding) on the generated catalytic pockets post-hoc.
We provide some RFDiffAA-generated samples in the ./data/rfdiffaa_generated folder at link.
We provide LigandMPNN-predicted sequences for RFDiffAA-generated pockets at file.
We provide CLEAN-predicted EC-Class for LigandMPNN-predicted pocket sequences at file.
2. Enzyme Commission Classification
Baselines such as RFDiffAA do not generate an EC-class for the designed catalytic pockets. We use CLEAN to infer the EC-class from the sequence representations of these pockets. For CLEAN, please refer to CLEAN-official or CLEAN-webserver. We use CLEAN with the greedy max-separation approach for EC-class inference.
3. ESM3
For ESM3, please refer to ESM3-official. For each sequence representation of a generated catalytic pocket, we use ESM3 to recover the full enzyme sequence (that is, we extend the 32 pocket residues into a protein sequence of 200 residues). We can perform enzyme retrieval on both (1) pocket sequences and (2) full enzyme sequences. ESM3 prompting is at link.
4. Pocket-specific Enzyme CLIP
For ranking-based retrieval evaluation, please refer to RectZyme-paper. We train a pocket-specific enzyme CLIP model with enzyme pocket features computed by the latest ESM3 and reaction features computed by MAT-2D. The training data are the 60%-homology set (~50,000 positive samples); the evaluation data are the unique, non-repeated samples. As in ClipZyme, training negatives are training samples that are not annotated to catalyze a specific reaction; evaluation does not use negative data.
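To make the idea of ranking-based retrieval concrete, here is a generic sketch in plain Python. It is not the repository's evaluation code: the function names are hypothetical, cosine similarity is an assumed scoring function, and top-k accuracy and mean reciprocal rank are common retrieval metrics chosen for illustration rather than the metrics reported in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_of_true_enzyme(reaction_vec, enzyme_vecs, true_index):
    """1-based rank of the annotated enzyme among all candidates."""
    scores = [cosine(reaction_vec, enzyme) for enzyme in enzyme_vecs]
    true_score = scores[true_index]
    # Rank is one plus the number of candidates that score strictly higher.
    return 1 + sum(score > true_score for score in scores)


def top_k_accuracy(ranks, k):
    """Fraction of queries whose true enzyme is ranked within the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)
```

In this setup each reaction embedding is a query and every pocket (or full enzyme) embedding is a candidate, so no explicit negative samples are needed at evaluation time.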
Data Preparation
1. Enzyme Pocket, Substrate Molecule, Product Molecule Rawdata
$~~~~$ (a) The molecule_structures folder in ./data contains all substrate and product molecules; it can be downloaded at link.
$~~~~$ (b) The pocket_fixed_residues/pdb_10A folder in ./data contains all enzyme pockets; it can be downloaded at link.
$~~~~$ (c) We provide rawdata-40%homology and metadata-40%homology (40% homology) in the ./data folder. More raw data (50%, 60%, 80%, 90% homology) can be downloaded at link.
2. Co-evolution and MSA
$~~~~$ (a) rxn_to_smiles_msa.pkl in ./data contains the reaction MSAs.
$~~~~$ (b) uid_to_protein_msa.pkl in ./data contains the enzyme MSAs; it can be downloaded at link.
$~~~~$ (c) vocab.txt in ./data is the co-evolution vocabulary.
Once the raw data (enzyme pockets, molecules, and co-evolution files) are stored in the right folders, we process them into metadata.
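Before processing, a quick check that the raw inputs listed in steps 1 and 2 are in place can save a failed run. This is a minimal sketch, not part of the repository; the helper name missing_raw_inputs is hypothetical, and the paths are the ones named above, taken relative to ./data.

```python
from pathlib import Path

# Raw inputs named in steps 1 and 2, relative to the data folder.
EXPECTED_RAW_INPUTS = [
    "molecule_structures",
    "pocket_fixed_residues/pdb_10A",
    "rxn_to_smiles_msa.pkl",
    "uid_to_protein_msa.pkl",
    "vocab.txt",
]


def missing_raw_inputs(data_dir="./data"):
    """Return the expected raw inputs that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_RAW_INPUTS if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_raw_inputs():
        print(f"missing: ./data/{rel}")
```

The check only tests for existence; it does not validate the contents of the folders or pickle files.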
3. Process rawdata into metadata by running process_data.py.
$~~~~$ (a) Remember to set --rawdata_file_name, e.g., python process_data.py --rawdata_file_name rawdata_cutoff-0.4.csv. Warning: metadata.csv contains absolute paths, so you might need to change them to your own paths.
4. Processed Metadata.
$~~~~$ (a) Processed metadata will be saved into the ./data/processed folder, including:
$~~~~$ (b) processed enzymes in the ./data/processed/protein folder.
$~~~~$ (c) processed substrates in the ./data/processed/ligand folder.
$~~~~$ (d) processed co-evolution in the ./data/processed/msa folder.
$~~~~$ (e) processed products in the ./data/processed/product folder.
$~~~~$ (f) a toy example is provided.
5. Evaluation Sample.
$~~~~$ (a) We provide eval-rawdata and eval-metadata in the ./data folder. Warning: metadata.csv contains absolute paths, so you might need to change them to your own paths.
$~~~~$ (b) We provide unprocessed-eval-data in the ./data/raw_eval_data folder.
$~~~~$ (c) We provide processed-eval-data in the ./data/processed_eval folder.
$~~~~$ (d) You can also process evaluation data by running process_data.py. Remember to change the configs, e.g., python process_data.py --rawdata_file_name eval-data_cutoff-0.1_unique-subs-enz_100.csv --metadata_file_name metadata_eval.csv.
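The absolute-path warnings for the training and evaluation metadata can be handled with a small prefix rewrite. The sketch below is an illustration, not a script shipped with the repository: the function name rewrite_path_prefix is hypothetical, and because the column layout of metadata.csv is not described here, it rewrites any cell that starts with the old prefix rather than targeting named columns.

```python
import csv


def rewrite_path_prefix(src_csv, dst_csv, old_prefix, new_prefix):
    """Copy src_csv to dst_csv, replacing old_prefix at the start of any cell.

    Returns the number of cells that were rewritten.
    """
    changed = 0
    with open(src_csv, newline="") as fin, open(dst_csv, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for row in reader:
            new_row = []
            for cell in row:
                if cell.startswith(old_prefix):
                    cell = new_prefix + cell[len(old_prefix):]
                    changed += 1
                new_row.append(cell)
            writer.writerow(new_row)
    return changed
```

Writing to a separate output file keeps the downloaded metadata intact; a return value of 0 suggests the old prefix was mistyped.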
Further Statistics
License
No commercial use of either the model or the generated data is permitted; details can be found in license.md.
Citation
@article{hua2024enzymeflow,
title={EnzymeFlow: Generating Reaction-specific Enzyme Catalytic Pockets through Flow Matching and Co-Evolutionary Dynamics},
author={Hua, Chenqing and Liu, Yong and Zhang, Dinghuai and Zhang, Odin and Luan, Sitao and Yang, Kevin K and Wolf, Guy and Precup, Doina and Zheng, Shuangjia},
journal={arXiv preprint arXiv:2410.00327},
year={2024}
}