RNAErnie_baselines
Official PyTorch implementation of the BERT-like baselines (RNABERT, RNA-MSM, RNA-FM) for the paper "Multi-purpose RNA Language Modeling with Motif-aware Pre-training and Type-guided Fine-tuning".
Installation
First, download the repository and create the environment.
git clone https://github.com/CatIIIIIIII/RNAErnie_baselines.git
cd ./RNAErnie_baselines
conda env create -f environment.yaml
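Then, activate the environment. The command below assumes the environment created from environment.yaml is named ErnieFold; if activation fails, use the name given in that file.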
conda activate ErnieFold
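The fine-tuning commands later in this README pass --device 'cuda:0', so it can help to confirm that PyTorch imports and sees a GPU before starting. The following is a minimal sketch and not part of the repository; the helper name describe_torch is hypothetical, and it only uses torch.cuda.is_available() and torch.cuda.device_count().

```python
import importlib.util


def describe_torch():
    """Return a one-line summary of the PyTorch install and visible CUDA devices."""
    # find_spec avoids an ImportError when torch is missing from the environment.
    if importlib.util.find_spec("torch") is None:
        return "torch: not installed in the active environment"
    try:
        import torch
    except (ImportError, OSError) as exc:
        return f"torch: found but failed to import ({exc})"
    if torch.cuda.is_available():
        return f"torch {torch.__version__}: {torch.cuda.device_count()} CUDA device(s) visible"
    return f"torch {torch.__version__}: no CUDA device visible"


if __name__ == "__main__":
    print(describe_torch())
```

If no CUDA device is reported, a different value for --device is presumably needed; which values the scripts accept is not stated here.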
Pre-training
Download the pre-trained model weights for RNABERT and RNA-MSM and place them in the ./checkpoints folder.
The pre-trained weights for RNA-FM are downloaded automatically when you run a fine-tuning script.
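Before launching a run, you may want to confirm that the downloaded weights actually landed in ./checkpoints. The sketch below simply lists whatever files are present; it does not assume any particular file names, since the expected names are not given here, and the helper list_checkpoint_files is hypothetical rather than part of the repository.

```python
from pathlib import Path


def list_checkpoint_files(checkpoint_dir="./checkpoints"):
    """Return sorted relative paths of all files under checkpoint_dir.

    Returns an empty list when the directory is missing or contains no files.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    files = list_checkpoint_files()
    if files:
        print("Found checkpoint files:")
        for name in files:
            print("  " + name)
    else:
        print("No files found in ./checkpoints; download the RNABERT and RNA-MSM weights first.")
```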
Downstream Tasks
RNA sequence classification
1. Data Preparation
Download the training data from Google Drive and place it in the ./data/seq_cls folder.
For the baselines, only the nRC dataset is available for this task.
2. Fine-tuning
Fine-tune a BERT-style pre-trained language model on the RNA sequence classification task with the following command:
python run_seq_cls.py \
--device 'cuda:0' \
--model_name RNAFM
You can select a different backbone model by changing --model_name to RNAMSM or RNABERT.
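To compare all three backbones on this task, the command above can be generated once per model name. This is a minimal sketch that only builds and prints the commands, using the --device and --model_name flags shown above; the helper build_command is hypothetical and not part of the repository.

```python
import shlex

# Backbone names taken from the --model_name options listed in this README.
BACKBONES = ("RNAFM", "RNAMSM", "RNABERT")


def build_command(model_name, script="run_seq_cls.py", device="cuda:0"):
    """Return the argv list for one fine-tuning run (nothing is executed)."""
    if model_name not in BACKBONES:
        raise ValueError(f"unknown backbone: {model_name!r}")
    return ["python", script, "--device", device, "--model_name", model_name]


if __name__ == "__main__":
    for name in BACKBONES:
        # Prints a copy-pasteable command line for each backbone.
        print(shlex.join(build_command(name)))
```

To launch the runs instead of printing them, each argv list could be passed to subprocess.run with check=True so that a failed run stops the sweep.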
RNA-RNA interaction prediction
1. Data Preparation
Download the training data from Google Drive and place it in the ./data/rr_inter folder.
2. Fine-tuning
Fine-tune the selected baseline model on the RNA-RNA interaction prediction task with the following command:
python run_rr_inter.py \
--device 'cuda:0' \
--model_name RNAFM
You can select a different backbone model by changing --model_name to RNAMSM or RNABERT.
RNA secondary structure prediction
1. Data Preparation
Download the training data from Google Drive, unzip it, and place it in the ./data/ssp folder.
Two tasks (RNAStrAlign-ArchiveII and bpRNA1m) are available.
2. Adaptation
Adapt the selected baseline model to the RNA secondary structure prediction task with the following command:
python run_ss_pred.py \
--device 'cuda:0' \
--model_name RNAFM
You can select a different backbone model by changing --model_name to RNAMSM or RNABERT, or evaluate on a different task by changing --task_name to RNAStrAlign or bpRNA1m.
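Since both the backbone and the task can be varied, a full comparison covers six combinations. The sketch below enumerates them with itertools.product and prints one command per pair, using only the flags named above (--device, --model_name, --task_name); the helper ss_pred_commands is hypothetical and not part of the repository.

```python
import itertools
import shlex

# Option values taken from this README.
BACKBONES = ("RNAFM", "RNAMSM", "RNABERT")
TASKS = ("RNAStrAlign", "bpRNA1m")


def ss_pred_commands(device="cuda:0"):
    """Return one argv list per (backbone, task) pair; nothing is executed."""
    return [
        ["python", "run_ss_pred.py", "--device", device,
         "--model_name", model, "--task_name", task]
        for model, task in itertools.product(BACKBONES, TASKS)
    ]


if __name__ == "__main__":
    for argv in ss_pred_commands():
        print(shlex.join(argv))
```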