
<h2 align="center"> <a href="https://arxiv.org/abs/2402.17156"> TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation </a></h2> <h5 align="center">


</h5>
<h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest updates. </h5>
<p align="center"> <img src="imgs/motivation.png" width="400" style="margin-bottom: 0.2;"/> </p>
<h5 align="left"> The official code for "TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation". Here we publish the inference code of TaxDiff. The training code and the dataset of protein sequences with taxonomic labels will be released after our paper is accepted. </h5>
<p align="center"> <img src="imgs/archrtecture.png" width="700" style="margin-bottom: 0.2;"/> </p>
<details open><summary>💡 I also have other AI for Science projects that may interest you ✨. </summary><p> <!-- may -->

ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing <br> Liuzhenghao Lv, Zongying Lin, Li Hao, Yuyang Liu, Jiaxi Cui, Calvin Yu-Chian Chen, Li Yuan, Yonghong Tian <br>

</p></details>

😮 Highlights

💡 Protein Sequence Generation Model

🔥 Diffusion-based Framework

⭐ Excellent performance

🚀 Main Results

More detailed results can be found in our paper.

Unconditional Generation

<p align="left"> <img src="imgs/unconditional.png" width=80%> </p>

Controllable Generation

<p align="left"> <img src="imgs/controllable.png" width=80%> </p>

📖 Data Preparation

For inference, please download the checkpoint from HuggingFace, unzip it, and put it into the folder ckpt/:

ckpt/0012802_eval.ckpt

Our dataset can be downloaded from HuggingFace:

uniref50_200_256_clean_taxnomic_family_tid__filter_layer6.fasta
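
To inspect the downloaded file, a plain FASTA reader is enough. The sketch below is not part of the TaxDiff code; `read_fasta` is a hypothetical helper, and the two demo records (headers and sequences) are made up, since the header layout of the released file is not described here.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records, only to show the output shape.
demo = io.StringIO(">example_1 some description\nMKTAYIAK\nQRQISF\n>example_2\nGSHMLE\n")
for header, seq in read_fasta(demo):
    print(header, len(seq))
```

For the real dataset, pass `open("uniref50_200_256_clean_taxnomic_family_tid__filter_layer6.fasta")` instead of the in-memory demo.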

We will release protein sequences with taxonomic labels for the training procedure once our paper is accepted.

If you want to select a specific taxonomic group for your research, first find its corresponding tax-id in data_reader/Taxonnmic_classfication.xlsx, and then modify the protein class labels in sample_protein.py. By default, the labels are drawn at random:

class_lables = torch.randint(low=1, high=int(23427), size=(1,num))
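
One way to fix the label instead of sampling it is to fill the same `(1, num)` shape with a single tax-id. This is a minimal sketch under assumptions: `fixed_class_labels` is a hypothetical helper that is not in the repository, the tax-id `100` is an arbitrary placeholder, and the valid range `1 <= tax_id < 23427` is inferred only from the bounds of the `torch.randint` call above.

```python
def fixed_class_labels(tax_id: int, num: int, num_classes: int = 23427):
    """Return a 1 x num nested list in which every entry is `tax_id`.

    Hypothetical helper: the range check mirrors torch.randint(low=1, high=23427),
    whose upper bound is exclusive.
    """
    if not 1 <= tax_id < num_classes:
        raise ValueError(f"tax_id must be in [1, {num_classes}), got {tax_id}")
    if num < 1:
        raise ValueError("num must be a positive integer")
    return [[tax_id] * num]


labels = fixed_class_labels(tax_id=100, num=4)  # 100 is a placeholder tax-id
print(labels)

# Assumed usage inside sample_protein.py, replacing the random draw:
# class_lables = torch.tensor(fixed_class_labels(tax_id, num))
```

`torch.tensor` on a nested list of Python integers yields an int64 tensor, the same dtype that `torch.randint` returns by default.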

🛠️ Requirements and Installation

git clone https://github.com/Linzy19/TaxDiff.git
cd TaxDiff
pip install -r requirements.txt

🗝️ Inference

Inference is run with sample_protein.py:

python sample_protein.py --model DiT-pro-12-h6-L16 --cuda-num cuda:0 --num 500

✏️ Citation

If you find our paper and code useful in your research, please consider giving us a star ⭐ and a citation 📝.

@article{zongying2024taxdiff,
  title={TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation},
  author={Zongying, Lin and Hao, Li and Liuzhenghao, Lv and Bin, Lin and Junwu, Zhang and Yu-Chian, Chen Calvin and Li, Yuan and Yonghong, Tian},
  journal={arXiv preprint arXiv:2402.17156},
  year={2024}
}