DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks

The official implementation of DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks.

Getting Started

Download the code

git clone https://github.com/TencentAILabHealthcare/DNAGPT.git

Download pre-trained weights

You can download the weights from

and save the model weights to the checkpoint directory:

cd DNAGPT/checkpoints
# download or copy the model weights into this default directory
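Before running any task, it can help to confirm that the weights landed in the checkpoint directory. The following is a minimal sketch, not part of the repository: the helper `missing_weights` is hypothetical, and the file names are simply those used in the example commands in this README.

```python
from pathlib import Path

# File names taken from the example commands in this README.
# The helper below is a hypothetical convenience, not part of DNAGPT.
EXPECTED_WEIGHTS = [
    "dna_gpt0.1b_h.pth",
    "dna_gpt0.1b_m.pth",
    "dna_gpt3b_m.pth",
]


def missing_weights(checkpoint_dir="checkpoints", expected=EXPECTED_WEIGHTS):
    """Return the expected weight files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing weights:", ", ".join(missing))
    else:
        print("All expected weights found.")
```

Run it from the DNAGPT directory so that the relative path "checkpoints" resolves correctly, and trim the list to the models you actually plan to use.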

Foundation model

Finetune model

Install

Prerequisites

Required packages

cd DNAGPT
pip install -r requirements.txt

Test

Example

python test.py --task=<task type> --input=<your dna data> --weight=<path to the pre-trained weight> --name=<the model you want to use> --num_samples=<number of sequences to sample>

See the "scripts" directory for more test examples.

Generation

# gpt 0.1b human genomes model
python test.py --task 'generation' --input '<R>AGAGAAAAGAGT' --name 'dna_gpt0.1b_h' --weight 'checkpoints/dna_gpt0.1b_h.pth' --num_samples 10 --max_len 256
# gpt 0.1b multi-organism model
python test.py --task 'generation' --input '<R>AGAGAAAAGAGT' --name 'dna_gpt0.1b_m' --weight 'checkpoints/dna_gpt0.1b_m.pth' --num_samples 10 --max_len 256
# gpt 3b multi-organism model
python test.py --task 'generation' --input '<R>AGAGAAAAGAGT' --name 'dna_gpt3b_m' --weight 'checkpoints/dna_gpt3b_m.pth' --num_samples 10 --max_len 256
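The three commands above differ only in the model name and weight path, so they can be generated programmatically. This is a rough sketch under two assumptions: the function `build_generation_command` is hypothetical, and the pattern `checkpoints/<name>.pth` is inferred from the examples above rather than guaranteed by the repository. Only flags that appear in this README are used.

```python
import shlex


def build_generation_command(name, prompt="<R>AGAGAAAAGAGT",
                             num_samples=10, max_len=256):
    """Build the argument list for a test.py generation run.

    Assumes the weight file is stored as checkpoints/<name>.pth,
    as in the example commands.
    """
    weight = f"checkpoints/{name}.pth"
    return [
        "python", "test.py",
        "--task", "generation",
        "--input", prompt,
        "--name", name,
        "--weight", weight,
        "--num_samples", str(num_samples),
        "--max_len", str(max_len),
    ]


if __name__ == "__main__":
    for model in ("dna_gpt0.1b_h", "dna_gpt0.1b_m", "dna_gpt3b_m"):
        # shlex.join quotes the prompt so "<" and ">" are safe in a shell.
        print(shlex.join(build_generation_command(model)))
```

To execute a command instead of printing it, the argument list can be passed directly to `subprocess.run(args, check=True)`, which avoids shell quoting altogether.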

Regression

python test.py --task 'regression' --input xxxxx --numbers xxxxx --name 'dna_gpt0.1b_h' --weight 'checkpoints/regression.pth'

Classification

python test.py --task 'classification' --input xxxxx --name 'dna_gpt0.1b_m' --weight 'checkpoints/classification.pth'

Tips:

  1. 'dna_gpt0.1b_m' supports a maximum input length of 24564 bp; 'dna_gpt0.1b_s' and 'dna_gpt3b_m' support a maximum input length of 3060 bp.
  2. The spec_token defaults to 'R', which denotes human. A special token must be wrapped in "<" and ">", as in "<R>".
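The two tips can be checked before a run with a small pre-flight helper. The sketch below is illustrative only: `split_prompt` and `check_prompt` are hypothetical names, the limits are copied from the first tip, and it assumes that the special token consists of letters and does not count toward the length limit, which the tips do not state.

```python
import re

# Limits copied from the tips above. The helpers are hypothetical.
MAX_INPUT_BP = {
    "dna_gpt0.1b_m": 24564,
    "dna_gpt0.1b_s": 3060,
    "dna_gpt3b_m": 3060,
}

# Assumption: a special token is one or more letters wrapped in "<" and ">".
SPECIAL_TOKEN = re.compile(r"^<([A-Za-z]+)>")


def split_prompt(prompt):
    """Split '<R>ACGT' into ('R', 'ACGT'); the token is None when absent."""
    match = SPECIAL_TOKEN.match(prompt)
    if match is None:
        return None, prompt
    return match.group(1), prompt[match.end():]


def check_prompt(prompt, name):
    """Raise ValueError if the sequence exceeds the model's listed limit.

    Assumption: only the bases after the special token are counted.
    Models without a listed limit are not checked.
    """
    token, sequence = split_prompt(prompt)
    limit = MAX_INPUT_BP.get(name)
    if limit is not None and len(sequence) > limit:
        raise ValueError(
            f"{name} accepts at most {limit} bp, got {len(sequence)}"
        )
    return token, sequence
```

For example, `check_prompt("<R>AGAGAAAAGAGT", "dna_gpt3b_m")` returns the token and the bare sequence, while an over-long sequence raises a `ValueError` before any model is loaded.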

Citation

DNAGPT

@article{zhang2023dnagpt,
  title={DNAGPT: A Generalized Pretrained Tool for Multiple DNA Sequence Analysis Tasks},
  author={Zhang, Daoan and Zhang, Weitong and He, Bing and Zhang, Jianguo and Qin, Chenchen and Yao, Jianhua},
  journal={bioRxiv},
  pages={2023--07},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}