Awesome
A conditional protein diffusion model generates artificial programmable endonuclease sequences with enhanced activity
🚀 Introduction
The approach achieved over a 10-fold increase in DNA cleavage activity in two complex multi-domain functional proteins (Kurthia massiliensis Ago and Pyrococcus furiosus Ago, referred to as KmAgo and PfAgo), significantly surpassing any existing wild-type protein activity at ambient temperature.
<img src="img/framework.png" alt="Logo">🤖 Inference with Ago protein
prepare dataset
mkdir -p dataset/Ago
cd dataset
wget wget https://huggingface.co/datasets/tyang816/cath/resolve/main/SS.zip
unzip SS.zip
cd Ago
wget https://huggingface.co/datasets/tyang816/Ago_database_PDB/resolve/main/Ago_AlphaFold2_PDB.zip
unzip Ago_AlphaFold2_PDB.zip
build pdb graph
# see `script/gen_graph.sh`
export protein=Ago
python protein_DIFF/dataset/generate_graph.py \
--pdb_dir dataset/$protein/pdb/ \
--save_dir dataset/$protein/process/
fix position (Active/Conservative Sites..)
put Ago.fix.txt
into dataset/Ago
prepare cath dataset
get cath
# cath
mkdir -p cath40_k10_imem_add2ndstrc/raw
cd cath40_k10_imem_add2ndstrc/raw
wget https://huggingface.co/datasets/tyang816/cath/resolve/main/dompdb.tar
tar -xvf dompdb.tar
rm dompdb.tar
build data graph
cd <your/diffusion>
python protein_DIFF/dataset/cath_imem_2nd.py
Start Pre-training
We suggest that you do not change the parameters.
protein=Ago
CUDA_VISIBLE_DEVICES=0 python protein_DIFF/run_pt.py \
--batch_size 32 \
--lr 5e-4 \
--timesteps 500 \
--hidden_dim 256 \
--objective pred_x0 \
--smooth_temperature 1.0 \
--wd 0 \
--clip_grad_norm 1e3 \
--device_id 0 \
--depth 6 \
--drop_out 0.08 \
--embedding_dim 256 \
--embedding \
--norm_feat \
--Date 1121 \
--noise_type uniform \
--target_protein_dir dataset/$protein/process/ \
--output_dir result/$protein
🔬 Design Your Own Protein
Inference
STEP
: select the model, the higher the value of step, the higher the rr
mkdir ckpt & cd ckpt
wget https://huggingface.co/tyang816/CPDiffusion/resolve/main/Jun_5_ago_dataset%3DCATH_result_lr%3D0.0005_wd%3D0.0_dp%3D0.08_hidden%3D256_noisy_type%3Duniform_embed_ss%3DFalse_88935.pt
cd ..
STEP=88935
CUDA_VISIBLE_DEVICES=0 python protein_DIFF/inference.py \
--ckpt ckpt/Jun_5_ago_dataset=CATH_result_lr=0.0005_wd=0.0_dp=0.08_hidden=256_noisy_type=uniform_embed_ss=False_"$STEP".pt \
--target_protein dataset/Ago/process/AGO_050_model_3_ptm.pt \
--target_protein_dir dataset/Ago/process/ \
--gen_num 100 \
--output_dir result/predict
🙌 Citation
Please cite our work if you have used our code or data for dry experiment testing/wet experiment. We are pleased to see improvements in the subsequent work.
@article{zhou2024cpdiffusion_ago,
title={A conditional protein diffusion model generates artificial programmable endonuclease sequences with enhanced activity},
author={Zhou, Bingxin and Zheng, Lirong and Wu, Banghao and Yi, Kai and Zhong, Bozitao and Tan, Yang and Liu, Qian and Li{\`o}, Pietro and Hong, Liang},
journal={Cell Discovery},
volume={10},
number={1},
pages={95},
year={2024},
publisher={Springer Nature Singapore Singapore}
}