Denoising Diffusion Probabilistic Model For Protein Sequence Generation

A proof-of-concept (POC) implementation that uses a <a href="https://arxiv.org/abs/2006.11239">Denoising Diffusion Probabilistic Model</a> (DDPM) to generate protein sequences. The code is written in PyTorch.

This DDPM implementation is adapted from lucidrains' <a href="https://github.com/lucidrains/denoising-diffusion-pytorch">denoising-diffusion-pytorch</a>. I replace the UNet with ESM-2, a pre-trained protein language model, as the denoising network.
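For readers new to DDPM, the forward (noising) process that the denoising network learns to invert can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in this repository: the linear schedule endpoints (1e-4 to 0.02 over 1000 steps) follow the DDPM paper, while the function names and the three-number vector standing in for a sequence representation are assumptions made for the example.

```python
import math
import random


def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (endpoints from the DDPM paper)."""
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) and return (x_t, noise)."""
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)

# Hypothetical stand-in for a continuous representation of one residue.
x0 = [0.5, -1.0, 2.0]
x_t, noise = q_sample(x0, t=999, alpha_bars=alpha_bars, rng=rng)
```

During training, the denoiser (here ESM-2 in place of the UNet) receives `x_t` and the timestep and is trained to recover the noise or the clean input. At the last timestep almost no signal remains, so sampling can start from pure Gaussian noise.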

<img src="./images/sample.jpg" width="500px">


Installation

git clone https://github.com/pengzhangzhi/protein-sequence-diffusion-model
cd protein-sequence-diffusion-model

Install this package

pip install .

Install esm to get the language model. The copy of esm bundled in this repository has been modified for this project and differs from the original esm release.

cd esm
pip install .

Sampling Protein Sequences

cd denoising_diffusion_pytorch

Use the pretrained model at denoising_diffusion_pytorch/experiment/best-v1.ckpt to sample novel protein sequences.

python sample.py

Results will be saved in denoising_diffusion_pytorch/generated_protein_seqs.fasta.
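To inspect the generated sequences without extra dependencies, a small FASTA reader is enough. The sketch below assumes the output file follows the standard FASTA layout (a `>` header line followed by one or more sequence lines); the header names `seq_0` and `seq_1` and the example sequences are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example input; real headers and sequences will differ.
example = ">seq_0\nMKTAYIAK\nQRQISF\n>seq_1\nGSHMLE\n"
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

To read the real output, pass the file contents instead of the example string, for instance `parse_fasta(open("generated_protein_seqs.fasta").read())`.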


Training

I use PyTorch Lightning to train the denoising diffusion model. Training can be configured through command-line arguments; see denoising_diffusion_pytorch/add_args.py for details.

cd denoising_diffusion_pytorch
python pl_train.py 


