DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks
The official implementation of DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks.
Getting Started
Download codes
git clone https://github.com/TencentAILabHealthcare/DNAGPT.git
Download pre-trained weights
You can download the weights from
and save the model weights to the checkpoints directory
cd DNAGPT/checkpoints
# download or copy the model weights into this default directory
Foundation model
- dna_gpt0.1b_h.pth: DNAGPT 0.1B params model pretrained with human genomes
- dna_gpt0.1b_m.pth: DNAGPT 0.1B params model pretrained with multi-organism genomes
- dna_gpt3b_m.pth: DNAGPT 3B params model pretrained with multi-organism genomes
Finetune model
- regression.pth: Human RNA expression level regression model
- classification.pth: Human AATAAA GSR classification model
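Before running any test, it can help to confirm that the weight files are where the commands expect them. The following is a minimal sketch, not part of the repository: the file names come from the lists above, while the helper `missing_weights` and the assumption that every file sits directly under `checkpoints` are illustrative only.

```python
from pathlib import Path

# File names taken from the lists above; a flat "checkpoints" layout is an assumption.
EXPECTED_WEIGHTS = [
    "dna_gpt0.1b_h.pth",
    "dna_gpt0.1b_m.pth",
    "dna_gpt3b_m.pth",
    "regression.pth",
    "classification.pth",
]


def missing_weights(checkpoint_dir="checkpoints", expected=EXPECTED_WEIGHTS):
    """Return the expected weight files that are not present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing weight files:", ", ".join(missing))
    else:
        print("All listed weight files are present.")
```

Run it from the DNAGPT directory. You only need the weights for the models you plan to use, so a non-empty list is not necessarily an error.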
Install
Prerequisites
- python >= 3.8
Required packages
cd DNAGPT
pip install -r requirements.txt
Test
Example
python test.py --task=<task type> --input=<your dna data> --weight=<path to the pre-trained weight> --name=<the model you want to use> --num_samples=<number of samples seq>
See the "scripts" directory for more test examples.
Generation
# gpt 0.1b human genomes model
python test.py --task 'generation' --input '<R>AGAGAAAAGAGT' --name 'dna_gpt0.1b_h' --weight 'checkpoints/dna_gpt0.1b_h.pth' --num_samples 10 --max_len 256
# gpt 0.1b multi-organism model
python test.py --task 'generation' --input '<R>AGAGAAAAGAGT' --name 'dna_gpt0.1b_m' --weight 'checkpoints/dna_gpt0.1b_m.pth' --num_samples 10 --max_len 256
# gpt 3b multi-organism model
python test.py --task 'generation' --input '<R>AGAGAAAAGAGT' --name 'dna_gpt3b_m' --weight 'checkpoints/dna_gpt3b_m.pth' --num_samples 10 --max_len 256
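The three generation commands above differ only in the model name and the matching weight path, so they can be produced by a small driver. This is a sketch under assumptions: the helper `generation_command` is hypothetical, and it assumes each weight file is named after its model, as in the commands above. Only the flags already shown in this README are used.

```python
import shlex

# Model names as used in the generation commands above.
GENERATION_MODELS = ["dna_gpt0.1b_h", "dna_gpt0.1b_m", "dna_gpt3b_m"]


def generation_command(name, prompt="<R>AGAGAAAAGAGT", num_samples=10, max_len=256):
    """Build the argument list for one test.py generation run."""
    return [
        "python", "test.py",
        "--task", "generation",
        "--input", prompt,
        "--name", name,
        # Assumes the weight file is named after the model.
        "--weight", f"checkpoints/{name}.pth",
        "--num_samples", str(num_samples),
        "--max_len", str(max_len),
    ]


if __name__ == "__main__":
    # Print shell-quoted commands; nothing is executed here.
    for name in GENERATION_MODELS:
        print(shlex.join(generation_command(name)))
```

To actually run a command instead of printing it, pass the returned list to `subprocess.run(cmd, check=True)` from the DNAGPT directory.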
Regression
python test.py --task 'regression' --input xxxxx --numbers xxxxx --name 'dna_gpt0.1b_h' --weight 'checkpoints/regression.pth'
Classification
python test.py --task 'classification' --input xxxxx --name 'dna_gpt0.1b_m' --weight 'checkpoints/classification.pth'
Tips:
- 'dna_gpt0.1b_m' supports a maximum input length of 24564 bps, while 'dna_gpt0.1b_s' and 'dna_gpt3b_m' support a maximum input length of 3060 bps.
- The spec_token defaults to 'R', which denotes human. A special token must be wrapped in "<" and ">", for example "<R>".
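The two tips above can be applied before calling test.py. The sketch below is illustrative only: `build_input` is a hypothetical helper, the limits are copied from the tips, and it assumes the limit counts bases only and not the special token, which this README does not state.

```python
# Maximum input lengths (in bp) copied from the tips above.
MAX_INPUT_BP = {
    "dna_gpt0.1b_m": 24564,
    "dna_gpt0.1b_s": 3060,
    "dna_gpt3b_m": 3060,
}


def build_input(sequence, name, spec_token="R"):
    """Check the sequence length for the model, then prefix the special token."""
    sequence = sequence.strip()
    limit = MAX_INPUT_BP.get(name)  # None for models the tips do not mention
    if limit is not None and len(sequence) > limit:
        raise ValueError(f"{name} accepts at most {limit} bp, got {len(sequence)}")
    # Special tokens are wrapped in "<" and ">", e.g. "<R>".
    return f"<{spec_token}>{sequence}"
```

For example, `build_input("AGAGAAAAGAGT", "dna_gpt0.1b_m")` yields the same `<R>AGAGAAAAGAGT` string used in the generation commands.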
Citation
DNAGPT
@article{zhang2023dnagpt,
title={DNAGPT: A Generalized Pretrained Tool for Multiple DNA Sequence Analysis Tasks},
author={Zhang, Daoan and Zhang, Weitong and He, Bing and Zhang, Jianguo and Qin, Chenchen and Yao, Jianhua},
journal={bioRxiv},
pages={2023--07},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}