Awesome

ELLE: Efficient Lifelong Pre-training for Emerging Data

Code associated with the ELLE: Efficient Lifelong Pre-training for Emerging Data ACL 2022 paper

Citation

Installation

conda env create -f environment.yml
conda activate ELLE
cd ./fairseq_ELLE
pip3 install --editable ./
cd ../apex
pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Available Pre-trained Models

We've prepared pre-trained checkpoints that takes $\text{BERT}_\text{L6_D384}$ as the initial model in fairseq and huggingface formats.

Fine-tune

Downloading Pre-trained Checkpoints

First download the pre-trained checkpoints.

Downloading data

MNLI

Follow /fairseq-0.9.0/README.glue.md to download and pre-process MNLI dataset, and place it under ./fairseq-0.9.0. The directory is expected to be in the structure below:

.
| - downstream
| - fairseq_ELLE
| - fairseq-0.9.0
| - - MNLI-bin
| - checkpoints_hf
| - - roberta_base_ELLE
| - checkpoints_fairseq
| - - roberta_base_ELLE

HyperPartisan, Helpfulness, ChemProt, and ACL-ARC

All these task data is available on a public S3 url; check ./downstream/environments/datasets.py.

If you run the ./downstream/train_batch.py command (see next step), we will automatically download the relevant dataset(s) using the URLs in ./downstream/environments/datasets.py.

Fine-tune

MNLI

export PYTHONPATH=./fairseq-0.9.0
cd ./fairseq-0.9.0
bash eval_MNLI_base_prompt.sh

HyperPartisan

cd ./downstream
bash finetune_news.sh

Helpfulness

cd ./downstream
bash finetune_reviews.sh

ChemProt

cd ./downstream
bash finetune_bio.sh

ACL-ARC

cd ./downstream
bash finetune_cs.sh

Pre-training

Prepare Datasets

The dataset of WB domain follows https://arxiv.org/abs/2105.13880 and datasets of News, Review, Bio, CS domains follow https://github.com/allenai/dont-stop-pretraining. You also need to cut a part of training dataset as the memory. In our main experiment, we take 1G data per domain as the memory. We have provided the pre-training data (already processed in fairseq format) we use in google drive, covering five pre-training domains (WB, News, Reviews, BIO and CS). We sample around 3400M tokens for each domain.

Pre-training with ELLE

Firstly, install the fairseq package:

export PYTHONPATH=./fairseq_ELLE
cd ./fairseq_ELLE/examples/roberta/

Pre-train PLMs with ELLE that takes $\text{BERT}_\text{L6_D384}$ the initial model:

bash train_base_prompt.sh

Pre-train PLMs with ELLE that takes $\text{BERT}_\text{L12_D768}$ as the initial model:

bash train_large_prompt.sh

Pre-train PLMs with ELLE that takes $\text{GPT}_\text{L6_D384}$ as the initial model:

bash gpt_base_prompt.sh

Note that you need to replace the DATA_DIR and memory_dir variables in these bash files with your own path to data files and memory files.

Convert Fairseq Checkpoints into Huggingface Format

Firstly, you need to organize your fairseq PLM checkpoint like the following:

checkpoints_fairseq_new/roberta_base_ELLE/checkpoint_last.pt

and copy the dictionary file:

cp /downstream/dict.txt /checkpoints_fairseq_new/roberta_base_ELLE

Then convert the checkpoint into huggingface checkpoint:

cd /downstream
python convert_pnn_to_hf_batch.py /checkpoints_fairseq_new /checkpoints_hf_new
cp -r /downstream/base_prompt_files/* /checkpoints_hf_new

Then you can do fine-tuning as Fine-tune Section.