GENA-LM

GENA-LM is a family of Open-Source Foundational Models for Long DNA Sequences.

GENA-LM models are transformer masked language models trained on human DNA sequence.

Key features of our GENA-LM models:

Pre-trained models

| Model | Architecture | Max SeqLen, tokens (bp) | Params | Tokenizer data | Training data |
|---|---|---|---|---|---|
| bert-base | BERT-12L | 512 (4500) | 110M | T2T split v1 | T2T split v1 |
| bert-base-t2t | BERT-12L | 512 (4500) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
| bert-base-lastln-t2t | BERT-12L | 512 (4500) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
| bert-base-t2t-multi | BERT-12L | 512 (4500) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs+Multispecies |
| bert-large-t2t | BERT-24L | 512 (4500) | 336M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
| bigbird-base-sparse | BERT-12L, DeepSpeed Sparse Ops, RoPE | 4096 (36000) | 110M | T2T split v1 | T2T split v1 |
| bigbird-base-sparse-t2t | BERT-12L, DeepSpeed Sparse Ops, RoPE | 4096 (36000) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
| bigbird-base-t2t | BERT-12L, HF BigBird | 4096 (36000) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |

T2T split v1 refers to preliminary models with a non-augmented T2T human genome assembly split. BERT-based models employ Pre-Layer Normalization and lastln explicitly denotes that layer normalization is also applied to the final layer. RoPE indicates the use of rotary position embeddings in place of BERT-like absolute positional embeddings.
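As a rough illustration of the tokens (bp) column: 512 tokens correspond to about 4500 bp and 4096 tokens to about 36000 bp, which is roughly 8.8 bp per token in both cases. The sketch below uses that ratio to estimate whether a sequence fits into a model's context. The function names are hypothetical, and the real token count depends on the tokenizer and on the sequence itself (special tokens are also ignored here), so treat the result as an estimate only.

```python
import math

# Approximate ratio implied by the table: 4500 bp per 512 tokens
# (36000 bp per 4096 tokens gives the same value).
BP_PER_TOKEN = 4500 / 512


def estimate_tokens(num_bp: int) -> int:
    """Estimate how many tokens a sequence of num_bp base pairs needs."""
    return math.ceil(num_bp / BP_PER_TOKEN)


def fits_in_context(num_bp: int, max_tokens: int) -> bool:
    """Return True if the estimated token count is within max_tokens."""
    return estimate_tokens(num_bp) <= max_tokens


if __name__ == "__main__":
    # A 2000 bp promoter window against the 512-token BERT models.
    print(estimate_tokens(2000), fits_in_context(2000, 512))
```

For a reliable answer, tokenize the actual sequence and count the resulting input ids.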

For our first models (gena-lm-bert-base and gena-lm-bigbird-base-sparse) we hold out human chromosomes 22 and Y (CP068256.2 and CP086569.2) as the test dataset for the masked language modeling task. For all other models, we hold out human chromosomes 7 and 10 (CP068271.2 and CP068268.2); these models have the suffix "t2t" in their names. All remaining data was used for training. Human-only models were trained on the pre-processed Human T2T v2 genome assembly and its 1000-genome SNP augmentations, for a total of ≈ 480 x 10^9 base pairs. Multispecies models were trained on human-only and multispecies data, for a total of ≈ 1072 x 10^9 base pairs.
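The hold-out rule above can be sketched as a filter on FASTA record headers. This is only an illustration of the split described in the text, not the code used to build the training corpus; the function name and the scheme labels ("first", "t2t") are hypothetical.

```python
from typing import Iterable, List, Tuple

# Held-out accessions as listed in the text above.
HELD_OUT = {
    "first": {"CP068256.2", "CP086569.2"},  # chromosomes 22 and Y
    "t2t": {"CP068271.2", "CP068268.2"},    # chromosomes 7 and 10
}


def split_by_accession(
    headers: Iterable[str], scheme: str = "t2t"
) -> Tuple[List[str], List[str]]:
    """Split FASTA headers into (train, test) by their accession."""
    held_out = HELD_OUT[scheme]
    train: List[str] = []
    test: List[str] = []
    for header in headers:
        # The accession is the first whitespace-separated field,
        # e.g. ">CP068271.2 Homo sapiens ..." -> "CP068271.2".
        fields = header.lstrip(">").split()
        accession = fields[0] if fields else ""
        (test if accession in held_out else train).append(header)
    return train, test


if __name__ == "__main__":
    demo = [">CP068277.2 chromosome 1", ">CP068271.2 chromosome 7"]
    print(split_by_accession(demo))
```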

Pre-trained models on downstream tasks

| Model | Task | Task seq len | Metric | HF branch name |
|---|---|---|---|---|
| gena-lm-bert-base-t2t | promoters | 300bp | 74.56 ± 0.36 F1 | promoters_300_run_1 |
| gena-lm-bert-large-t2t | promoters | 300bp | 76.44 ± 0.16 F1 | promoters_300_run_1 |
| gena-lm-bert-large-t2t | promoters | 2000bp | 93.70 ± 0.44 F1 | promoters_2000_run_1 |
| gena-lm-bert-base-t2t | splice site | 15000bp | 92.63 ± 0.09 PR AUC | spliceai_run_1 |
| gena-lm-bert-large-t2t | splice site | 15000bp | 93.59 ± 0.11 PR AUC | spliceai_run_1 |

To load a model fine-tuned on a downstream task, replace model_name and branch_name with values from the table. The metrics in the table are averaged over multiple runs, so the values for an individual checkpoint may differ from those reported here.

from transformers import AutoTokenizer, AutoModel

# example values taken from the table above
model_name = 'gena-lm-bert-base-t2t'
branch_name = 'promoters_300_run_1'
tokenizer = AutoTokenizer.from_pretrained(f'AIRI-Institute/{model_name}')
model = AutoModel.from_pretrained(f'AIRI-Institute/{model_name}', revision=branch_name, trust_remote_code=True)
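If you switch between several of these checkpoints, the table can be kept as a small lookup so that the model name and branch always match. The sketch below is one possible way to do this; the helper names resolve_checkpoint and load_finetuned are hypothetical, and the loading step simply repeats the calls from the snippet above (it needs the transformers package and network access).

```python
from typing import Dict, Tuple

# (model name, task, task sequence length in bp) -> HF branch name,
# copied from the downstream-task table.
DOWNSTREAM_BRANCHES: Dict[Tuple[str, str, int], str] = {
    ("gena-lm-bert-base-t2t", "promoters", 300): "promoters_300_run_1",
    ("gena-lm-bert-large-t2t", "promoters", 300): "promoters_300_run_1",
    ("gena-lm-bert-large-t2t", "promoters", 2000): "promoters_2000_run_1",
    ("gena-lm-bert-base-t2t", "splice site", 15000): "spliceai_run_1",
    ("gena-lm-bert-large-t2t", "splice site", 15000): "spliceai_run_1",
}


def resolve_checkpoint(model_name: str, task: str, seq_len: int) -> Tuple[str, str]:
    """Return (repo_id, revision) for a checkpoint listed in the table."""
    key = (model_name, task, seq_len)
    if key not in DOWNSTREAM_BRANCHES:
        raise KeyError(f"no fine-tuned checkpoint listed for {key}")
    return f"AIRI-Institute/{model_name}", DOWNSTREAM_BRANCHES[key]


def load_finetuned(model_name: str, task: str, seq_len: int):
    """Load tokenizer and model for a listed checkpoint (downloads weights)."""
    # Imported lazily so that resolve_checkpoint works without transformers.
    from transformers import AutoModel, AutoTokenizer

    repo_id, revision = resolve_checkpoint(model_name, task, seq_len)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModel.from_pretrained(repo_id, revision=revision, trust_remote_code=True)
    return tokenizer, model


if __name__ == "__main__":
    print(resolve_checkpoint("gena-lm-bert-large-t2t", "promoters", 2000))
```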

Examples

How to load pre-trained GENA-LM for Masked Language Modeling

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)

How to load pre-trained GENA-LM to fine-tune it on a classification task

Get model class from GENA-LM repository:

git clone https://github.com/AIRI-Institute/GENA_LM.git
from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')

Alternatively, download modeling_bert.py and place it next to your code.

You can also get the model class through HuggingFace AutoModel:

import importlib
from transformers import AutoTokenizer, AutoModel

# load the base model once to find the module that holds the GENA-LM classes
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)
gena_module_name = model.__class__.__module__
print(gena_module_name)
# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)
# load the same checkpoint again, this time with a classification head
model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', num_labels=2)
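The class lookup in the snippet above relies only on importlib and getattr, so it can be wrapped in a small helper that reports a clearer error when a class name is misspelled. This is a minimal sketch; get_class is a hypothetical name, and the demo uses a standard-library module so that it runs without downloading a model.

```python
import importlib
from typing import Any


def get_class(module_name: str, class_name: str) -> Any:
    """Import module_name and return the attribute called class_name."""
    module = importlib.import_module(module_name)
    try:
        return getattr(module, class_name)
    except AttributeError as err:
        raise ImportError(f"{module_name} has no class {class_name!r}") from err


if __name__ == "__main__":
    # Stand-in demo with a standard-library module.
    print(get_class("collections", "OrderedDict"))
```

With a loaded GENA-LM model, the same helper could be called as get_class(model.__class__.__module__, 'BertForTokenClassification').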

The GENA-LM bigbird-base-t2t model uses the HuggingFace BigBird implementation, so the default classes from the Transformers library can be used:

from transformers import AutoTokenizer, BigBirdForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
model = BigBirdForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')

Notebooks

Citation

@article {GENA_LM,
	author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
	title = {GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences},
	elocation-id = {2023.06.12.544594},
	year = {2023},
	doi = {10.1101/2023.06.12.544594},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594},
	eprint = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594.full.pdf},
	journal = {bioRxiv}
}

Downstream tasks

Downstream tasks for model evaluation encompass the prediction of promoter and enhancer activity, splicing sites, chromatin profiles, and polyadenylation site strength. See the downstream_tasks folder for the code and data preprocessing scripts we used.

Pre-training data

Download and preprocess data

To download the human genome, run the following script:

./download_data.sh human

For preprocessing, execute the following script:

python src/gena_lm/genome_tools/create_corpus.py --input_file data/ncbi_dataset/data/GCA_009914755.4/GCA_009914755.4_T2T-CHM13v2.0_genomic.fna --output_dir data/processed/human/
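Before preprocessing, it can be useful to confirm that the downloaded FASTA file contains the expected records. The sketch below is a minimal FASTA reader that reports the sequence length per record; it is not part of the repository and does not reproduce what create_corpus.py does. The function name fasta_lengths is hypothetical.

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {record id: sequence length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first field of the header line.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


if __name__ == "__main__":
    demo = [">seq1 example record", "ACGTAC", "GT", ">seq2", "NNNN"]
    print(fasta_lengths(demo))
```

To check the real file, pass an open file handle for the .fna path used in the preprocessing command, for example fasta_lengths(open(path)).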

Installation

Models with sparse attention (gena-lm-bigbird-base-sparse, gena-lm-bigbird-base-sparse-t2t) require FP16 support and DeepSpeed.

APEX for FP16

Install APEX following https://github.com/NVIDIA/apex#quick-start

git clone https://github.com/NVIDIA/apex
cd apex
# most recent commits may fail to build
git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

DeepSpeed for Sparse Ops

DeepSpeed must be installed to work with the SparseAttention versions of the language models. DeepSpeed Sparse Attention supports only GPUs with compute capability >= 7 (V100, T4, A100) and CUDA 10.1, 10.2, 11.0, or 11.1, and runs only in FP16 mode (as of DeepSpeed 0.6.0).

PyTorch>=1.7.1,<=1.10.1 wheels with CUDA 10.2/11.0/11.1 from pytorch.org can be used. However, using Sparse Ops with CUDA 11.1 PyTorch wheels would require CUDA 11.3/11.4 to be installed on the system. Sparse Ops could also be used with PyTorch==1.12.1 CUDA 11.3 wheels, but running DeepSpeed Sparse Ops tests would require modifying them as they check for Torch CUDA version <=11.1. DeepSpeed fork for Triton 1.1.1 already has updated tests.

Triton 1.0.0 and 1.1.1 require python<=3.9.
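The requirements listed above (GPU generation, CUDA version, Python version) can be checked before attempting the build. The following sketch encodes only the constraints stated in this section, as of DeepSpeed 0.6.0; the function name is hypothetical. It takes plain values so that it runs without a GPU; with PyTorch installed, the inputs could come from torch.cuda.get_device_capability(), torch.version.cuda, and sys.version_info[:2].

```python
from typing import List, Tuple

# CUDA versions listed as supported for DeepSpeed Sparse Attention.
SUPPORTED_CUDA = {"10.1", "10.2", "11.0", "11.1"}


def sparse_attention_issues(
    compute_capability: Tuple[int, int],
    cuda_version: str,
    python_version: Tuple[int, int],
) -> List[str]:
    """Return a list of problems; an empty list means no known blocker."""
    issues: List[str] = []
    if compute_capability[0] < 7:
        issues.append("GPU compute capability is below 7")
    if cuda_version not in SUPPORTED_CUDA:
        issues.append(f"CUDA {cuda_version} is outside the listed versions")
    if python_version > (3, 9):
        issues.append("Triton 1.0.0 and 1.1.1 require python<=3.9")
    return issues


if __name__ == "__main__":
    # Example values for a V100 (capability 7.0) with CUDA 11.1 wheels.
    print(sparse_attention_issues((7, 0), "11.1", (3, 9)))
```

A CUDA 11.3 setup is reported as outside the listed versions even though it can work with the caveats described above, so read the output as a warning rather than a hard failure.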

pip install triton==1.0.0
DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.6.0 --global-option="build_ext" --global-option="-j8" --no-cache

Then check the installation with:

ds_report

Triton 1.1.1

Triton 1.1.1 brings a 2x speed-up to sparse operations on A100, but DeepSpeed (0.6.5) currently supports only Triton 1.0.0. A DeepSpeed fork with Triton 1.1.1 support can be used where this speed-up is needed:

pip install triton==1.1.1
git clone https://github.com/yurakuratov/DeepSpeed.git
cd DeepSpeed
DS_BUILD_SPARSE_ATTN=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache

Then run the sparse ops tests with:

cd tests/unit
pytest -v test_sparse_attention.py

Finetuning with lm-experiments-tools

We use the Trainer and multi-GPU training from the lm-experiments-tools repository as the basis for our finetuning scripts. However, you can use the HF Transformers Trainer, PyTorch Lightning, or Accelerate and PyTorch with custom training loops instead.

Install lm-experiments-tools according to https://github.com/yurakuratov/t5-experiments#install-only-lm_experiments_tools:

git clone https://github.com/yurakuratov/t5-experiments
cd t5-experiments
pip install -e .