<h1>Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates</h1>

<h2>Model Architecture</h2>

This repository contains the code, data and model weights for the ICML 2024 paper "Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates".

The overall model architecture is shown below:

[Figure: overall model architecture]

<h2>Environment</h2>

The dependencies can be set up using the following commands:
conda create -n enzygen python=3.8 -y 
conda activate enzygen 
conda install pytorch=1.10.2 cudatoolkit=11.3 -c pytorch -y 
bash setup.sh 
<h2>Download Data</h2>

We provide the EnzyBench dataset at EnzyBench and the Enzyme Classification Tree (EC) ID-to-index dictionary at EC_Dict.

Please download both files and put them in the data folder.

mkdir data 
cd data 
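# wget on a Drive "view" link saves the HTML viewer page, not the file; gdown (an added dependency) fetches by file ID
pip install gdown
gdown "https://drive.google.com/uc?id=1VycT_gFV2JBpRMCBZlwwxLLRcZDljXCS"
gdown "https://drive.google.com/uc?id=1BCitsFRQpzUbGss7xBpTpvKcMcJh_oOz"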
<h2>Download Model</h2>

We provide the checkpoint used in the paper at Model.

Please download the checkpoint and put it in the models folder.

If you want to train your own model, please follow the training guidance below.

<h2>Training</h2>

To train a model with the enzyme-substrate interaction constraint introduced in our paper, run the script below:
bash train_enzyme_substrate_33layer.sh

To train a model without the enzyme-substrate interaction constraint, run the script below:

bash train_cluster_enzyme_33layer.sh

In our experience, the best performance comes from first training a model without the enzyme-substrate interaction constraint for around 200,000 steps, and then continuing training with the sequence recovery loss, coordinate recovery loss and enzyme-substrate interaction loss.

<h2>Inference</h2>

To design enzymes for the 30 third-level categories in the test set, run the following script:
bash generation.sh

There are five items in the output directory:

  1. protein.txt contains the designed protein sequences
  2. src.seq.txt contains the ground truth sequences
  3. pdb.txt contains the target PDB IDs and the corresponding chains
  4. pred_pdbs is the directory of designed PDB files
  5. tgt_pdbs is the directory of target PDB files
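
As a quick sanity check on these outputs, the sketch below computes sequence recovery (the fraction of identical positions) between designed and ground truth sequences. It assumes that protein.txt and src.seq.txt hold one sequence per line in the same order, possibly with whitespace between residues. That layout is an assumption rather than something documented here, and the function names are hypothetical.

```python
from pathlib import Path


def read_sequences(path):
    """Read one sequence per line, dropping blank lines and inner whitespace."""
    lines = Path(path).read_text().splitlines()
    return ["".join(line.split()) for line in lines if line.strip()]


def sequence_recovery(designed, target):
    """Fraction of positions where the designed and target residues agree."""
    if not target:
        return 0.0
    matches = sum(d == t for d, t in zip(designed, target))
    # Normalise by the longer sequence so length mismatches are penalised.
    return matches / max(len(designed), len(target))


def mean_recovery(designed_path, target_path):
    """Average recovery over two files with one sequence per line (assumed)."""
    designed = read_sequences(designed_path)
    targets = read_sequences(target_path)
    if len(designed) != len(targets):
        raise ValueError("files contain different numbers of sequences")
    scores = [sequence_recovery(d, t) for d, t in zip(designed, targets)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy in-memory example: 7 of 8 positions agree.
    print(sequence_recovery("MKTAYIAK", "MKTAYLAK"))  # 0.875
```

For real outputs, `mean_recovery` would be called with the paths to protein.txt and src.seq.txt inside the output directory.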
<h2>Finetune your own model</h2>

To finetune your own model based on our trained model, please follow the guidelines below:

<h3>Prepare your own data</h3>

We provide an example of the training data at preprocess/case.json. For training and validation, you should prepare the following features: ['seq', 'coor', 'motif', 'pdb', 'ec4', 'substrate', 'binding', 'substrate_coor', 'substrate_feat'].

  1. seq denotes the protein sequence
  2. coor denotes the alpha-carbon coordinates, flattened in x, y, z order
  3. motif denotes the indices of the functional sites, counted from 0
  4. pdb denotes the PDB ID and chain
  5. ec4 denotes the fourth-level EC category
  6. substrate denotes the substrate ID
  7. binding (0 or 1) denotes whether the substrate can bind to the enzyme
  8. substrate_coor and substrate_feat denote the coordinates and features of the substrate, respectively

You can extract the substrate coordinates and features using preprocess/get_substrate_feature.py:
python preprocess/get_substrate_feature.py
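
Before finetuning, it can help to check each entry against the feature list above. The following is a minimal sketch under stated assumptions: each entry is a Python dict keyed by the nine feature names, coor is a flat list with three values per residue (x1, y1, z1, x2, ...), and motif is a list of integer indices. The actual layout of preprocess/case.json may differ, and `flatten_coords` and `check_entry` are hypothetical helpers, not part of this repository.

```python
REQUIRED_KEYS = ["seq", "coor", "motif", "pdb", "ec4", "substrate",
                 "binding", "substrate_coor", "substrate_feat"]


def flatten_coords(ca_coords):
    """Flatten [(x, y, z), ...] into [x1, y1, z1, x2, y2, z2, ...]."""
    return [value for xyz in ca_coords for value in xyz]


def check_entry(entry):
    """Return a list of problems found in one entry (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in entry]
    if problems:
        return problems
    n = len(entry["seq"])
    # Assumed layout: one (x, y, z) triple per residue.
    if len(entry["coor"]) != 3 * n:
        problems.append("coor should hold 3 values per residue")
    if any(i < 0 or i >= n for i in entry["motif"]):
        problems.append("motif index out of range (indices start at 0)")
    if entry["binding"] not in (0, 1):
        problems.append("binding should be 0 or 1")
    return problems


if __name__ == "__main__":
    # Hypothetical toy entry; every value is a placeholder, not real data.
    entry = {
        "seq": "MKT",
        "coor": flatten_coords([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]),
        "motif": [0, 2],
        "pdb": "xxxx.A",
        "ec4": "0.0.0.0",
        "substrate": "placeholder",
        "binding": 1,
        "substrate_coor": [],
        "substrate_feat": [],
    }
    print(check_entry(entry))  # []
```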
<h3>Finetuning your model</h3>

After preparing your own data, you can finetune your model using finetune.sh:
bash finetune.sh
<h2>Evaluation</h2>

We provide the ESP evaluation data at [ESP_data_eval](https://drive.google.com/file/d/1q8NENdVWBufz5fDk7TviS6h6_BKmfviN/view?usp=drive_link).

The format for ESP evaluation is (Protein_Sequence Substrate_Representation) for each test case.

The evaluation code for the ESP score was developed by Alexander Kroll and can be found at link.

<h3>Expected Results</h3>
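<table>
<tr><th>Protein Family</th><th>1.1.1</th><th>1.11.1</th><th>1.14.13</th><th>1.14.14</th><th>1.2.1</th><th>2.1.1</th><th>2.3.1</th><th>2.4.1</th></tr>
<tr><td>EnzyGen</td><td>0.64</td><td>0.98</td><td>0.38</td><td>0.42</td><td>0.72</td><td>0.80</td><td>0.61</td><td>0.38</td></tr>
<tr><th>Protein Family</th><th>2.4.2</th><th>2.5.1</th><th>2.6.1</th><th>2.7.1</th><th>2.7.10</th><th>2.7.11</th><th>2.7.4</th><th>2.7.7</th></tr>
<tr><td>EnzyGen</td><td>0.86</td><td>0.66</td><td>0.53</td><td>0.76</td><td>0.92</td><td>0.93</td><td>0.80</td><td>0.79</td></tr>
<tr><th>Protein Family</th><th>3.1.1</th><th>3.1.3</th><th>3.1.4</th><th>3.2.2</th><th>3.4.19</th><th>3.4.21</th><th>3.5.1</th><th>3.5.2</th></tr>
<tr><td>EnzyGen</td><td>0.76</td><td>0.62</td><td>0.88</td><td>0.47</td><td>0.26</td><td>0.73</td><td>0.40</td><td>0.14</td></tr>
<tr><th>Protein Family</th><th>3.6.1</th><th>3.6.1</th><th>3.6.5</th><th>4.1.1</th><th>4.2.1</th><th>4.6.1</th><th>--</th><th>Avg</th></tr>
<tr><td>EnzyGen</td><td>0.66</td><td>0.78</td><td>0.40</td><td>0.80</td><td>0.93</td><td>0.57</td><td>--</td><td>0.65</td></tr>
</table>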
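
The reported average can be reproduced from the 30 per-family scores listed above. This short check assumes that Avg is the unweighted mean over all families, which the numbers are consistent with.

```python
# Per-family EnzyGen scores, copied in order from the results above.
scores = [
    0.64, 0.98, 0.38, 0.42, 0.72, 0.80, 0.61, 0.38,
    0.86, 0.66, 0.53, 0.76, 0.92, 0.93, 0.80, 0.79,
    0.76, 0.62, 0.88, 0.47, 0.26, 0.73, 0.40, 0.14,
    0.66, 0.78, 0.40, 0.80, 0.93, 0.57,
]

# Unweighted mean over the 30 families (assumed definition of Avg).
average = sum(scores) / len(scores)
print(len(scores), round(average, 2))  # 30 0.65
```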
<h2>Citation</h2>

If you find our work helpful, please consider citing our paper.
@inproceedings{songgenerative,
  title={Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates},
  author={Song, Zhenqiao and Zhao, Yunlong and Shi, Wenxian and Jin, Wengong and Yang, Yang and Li, Lei},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}