[ACL 2024] A Codebase for Incremental Learning with Large Language Models
Introduction
This is a repository for Incremental Learning with Large Language Models.
- It supports both generative and discriminative models from the transformers library.
- It supports using accelerate for distributed data parallel and model parallel training.
- It supports using wandb for logging.
Supported List
Scenario
- Instance-Incremental Learning
- Class-Incremental Learning
- Task-Incremental Learning
- Continual Instruction Tuning (Coming soon!)
- Continual Knowledge Editing (Coming soon!)
Tasks
- Text Classification
- Intent Classification
- Relation Extraction
- Named Entity Recognition
Methods
More baselines will be released in the future!
General (Text/Intent) Classification
- SEQ (Sequential Finetuning)
- ExperienceReplay
- PEFT (including LoRA and Prompt Tuning)
- LAMOL (ICLR 2020)
- LAMOL_KD (arXiv)
- L2KD (EMNLP 2020)
- AdapterCL (EMNLP 2021)
- PCLL (EMNLP 2022)
- LFPT5 (ICLR 2022)
- ProgPrompt (ICLR 2023)
- SEQ* (ACL 2024)
Named Entity Recognition
- ExtendNER (AAAI 2021)
- SelfTrain (EMNLP 2022)
- CFNER (EMNLP 2022)
- SpanKL (AAAI 2023)
- DLD (SIGIR 2023)
- RDP (CIKM 2023)
- CPFD (EMNLP 2023)
- OCILNER (ACL 2023)
- ICE (ACL 2023 findings)
- IS3 (ACL 2024 findings)
Originally Proposed for Image Classification
<!-- - [ ] [A-GEM (ICLR 2019)](https://arxiv.org/abs/1812.00420) - [ ] [GEM (NIPS 2017)](https://proceedings.neurips.cc/paper/2017/hash/f87522788a2be2d171666752f97ddebb-Abstract.html) -->
Datasets
Instance-Incremental Learning
- Concept-1K (the raw and the preprocessed Concept-1K are included in dataset/concept_1k, dataset/concept_1k_task10, and dataset/concept_1k_task1).
Text Classification
- Topic3datasets (agnews, dbpedia, yahoo)
Intent Classification
- CLINC150
- Banking77
Relation Extraction
- FewRel
- TACRED
Named Entity Recognition
- Few-NERD
- Ontonotes5
- I2B2
Best Practice to Use this Codebase
How to reproduce the performance of SEQ and SEQ*?
The config file of SEQ (i.e., plain sequential fine-tuning) is SEQ_full.yaml (in the config directory), and the config file of SEQ* is SEQ_pre_warm_fix.yaml.
Note that the classifier type (linear or cosine linear) is not specified in the config files because we set it in the script. An example can be found in https://github.com/zzz47zzz/codebase-for-incremental-learning-with-llm/blob/main/reproduce_shell/exp-CIL-sota/SOTA-CIL-Intent-discriminative-banking77_task7.sh.
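As an illustration, here is a minimal sketch of passing the classifier type on the command line. The flag names follow the usage examples below; the experiment names are placeholders, and the CosineLinear value is assumed from the classifier types loaded by utils/classifier.py.

```bash
# sequential fine-tuning with a plain linear classifier
python main_CL.py --exp_prefix seq_linear --cfg './config/clinc150_task15/SEQ_full.yaml' --backbone bert-base-cased --classifier Linear
# the same run with a cosine linear classifier (value assumed from utils/classifier.py)
python main_CL.py --exp_prefix seq_cosine --cfg './config/clinc150_task15/SEQ_full.yaml' --backbone bert-base-cased --classifier CosineLinear
```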
Usage
Overview
.
├── main_CL.py # This is the python file to be executed for running all experiments
├── utils # This folder contains all basic files for incremental learning
│ ├── backbone.py # This file loads backbone models from the transformers library
│ ├── buffer.py # This file defines the replay buffer
│ ├── classifier.py # This file loads Linear/CosineLinear classifiers
│ ├── wrapmodel.py # This file wraps the model for using DeepSpeed with accelerate
│ ├── dataformat_preprocess.py # This file preprocesses the raw datasets into continual learning datasets
│ ├── dataloader.py # This file prepares the input for language models
│ ├── dataset.py # This file defines the format of different datasets for continual learning
│ ├── download_backbones.py # This file downloads models in advance to avoid network problems
│ ├── evaluation.py # This file defines the evaluation process for various tasks
│ ├── factory.py # This file loads the various models from the ./models folder
│ ├── logger.py # This file defines the logger
│ ├── metric.py # This file defines the evaluation metric for continual learning
│ ├── optimizer.py # This file defines the optimizer for different models
│ ├── prompt.py # This file defines the prompt used for different tasks
│ ├── probing.py # This file computes the probing performance
│ └── config.py # This file defines general parameters and settings for the experiments
├── config # This folder contains the hyper-parameters for each method on each dataset
├── dataset # This folder contains datasets for continual learning
├── models # This folder contains models for continual learning
└── experiments # This folder contains log data for each run
Quick Start
Step 1: prepare the environment
pip install -r requirement.txt
Step 2: prepare the dataset
Check the support_dataset_list in utils/dataformat_preprocess.py and select the dataset you want for the experiment.
Then, download the raw dataset to the folder dataset/{dataset-name}. For example, download clinc150 to the folder dataset/clinc150. The raw datasets can be downloaded here. We note that the raw data of Concept-1K is in dataset/concept_1k. The preprocessed Concept-1K for 10-step incremental learning is in dataset/concept_1k_task10. The whole Concept-1K is in dataset/concept_1k_task1.
Next, execute preprocess_dataset.sh. It automatically preprocesses the default datasets for reproducing the results ('topic3datasets', 'clinc150', 'banking77', 'fewrel', 'tacred', 'conll2003', 'fewnerd', 'i2b2', 'ontonotes5') and creates new folders dataset/{dataset-for-continual-learning-name} (e.g., banking77_task7). If you do not need to customize the datasets, you can skip to Step 3.
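A minimal sketch of this step, assuming the script is executed from the repository root:

```bash
# preprocess the default datasets listed above
bash preprocess_dataset.sh
```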
To customize the datasets, you can run utils/dataformat_preprocess.py with your own parameters (e.g., random seed, number of tasks). This process creates a new target folder dataset/{dataset-for-continual-learning-name} containing two json files, continual_data.json and continual_config.json. For example, you can prepare the clinc150 and fewrel datasets by running
python utils/dataformat_preprocess.py --dataset clinc150 --seed 1
and
python utils/dataformat_preprocess.py --dataset fewrel --seed 1
The program will create target folders dataset/clinc150_task15 and dataset/fewrel_task8.
For NER datasets such as ontonotes5, you can run the following command:
python utils/dataformat_preprocess.py --dataset ontonotes5 --seed 1 --base_task_entity 8 --incremental_task_entity 2 --seen_all_labels False
The program will create a target folder dataset/ontonotes5_task6_base8_inc2. We note that fixing the random seed ensures that exactly the same datasets are generated on different devices. Finally, the preprocessed datasets clinc150_task15, fewrel_task8, and ontonotes5_task6_base8_inc2 are ready for continual learning!
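As a quick sanity check (a sketch; the file names follow the description above), each target folder should now contain the two json files produced by the preprocessing step:

```bash
# each preprocessed dataset folder should contain continual_data.json and continual_config.json
ls dataset/clinc150_task15
ls dataset/ontonotes5_task6_base8_inc2
```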
Step 3: select the yaml file for hyper-parameters
The yaml file contains the hyper-parameters for each method. For example, the hyper-parameters of SEQ* with and without pre-allocating future classifiers for generative backbones under the CIL setting are defined in config/CIL/generative_backbones/clinc150_task15/SEQ_pre_warm_fix.yaml and config/CIL/generative_backbones/clinc150_task15/SEQ_warm_fix.yaml respectively.
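For example, a sketch of selecting one of these yaml files via the --cfg argument used in Step 4 (the experiment names are placeholders, and additional flags such as --backbone and --classifier may be required, as in the Step 4 examples):

```bash
# SEQ* with pre-allocated future classifiers (CIL, generative backbone)
python main_CL.py --exp_prefix seq_star_pre --cfg './config/CIL/generative_backbones/clinc150_task15/SEQ_pre_warm_fix.yaml'
# SEQ* without pre-allocating future classifiers
python main_CL.py --exp_prefix seq_star --cfg './config/CIL/generative_backbones/clinc150_task15/SEQ_warm_fix.yaml'
```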
Step 4: reproduce the results
The scripts for reproducing the probing study are in the folder reproduce_shell/exp-probing.
The scripts for reproducing the probing study with different pre-training steps are in the folder reproduce_shell/exp-probing-pretraining.
The scripts for reproducing the experiments of comparing SEQ* with SOTA methods are in the folder reproduce_shell/exp-sota.
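For instance, to launch one of the provided scripts from the repository root (the script path is taken from the example linked in the best-practice section above):

```bash
bash reproduce_shell/exp-CIL-sota/SOTA-CIL-Intent-discriminative-banking77_task7.sh
```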
If you want to run an experiment, execute main_CL.py. For example, you can run the SEQ method on the clinc150_task15 dataset with bert-base-cased using the following command:
python main_CL.py --exp_prefix {your-experiment-name} --cfg './config/clinc150_task15/SEQ_full.yaml' --backbone bert-base-cased --classifier Linear --training_epochs 5
If you want to use wandb for logging (see here for more help):
python main_CL.py --is_wandb True --wandb_project {your-project-name} --wandb_entity {your-entity-name} --exp_prefix {your-experiment-name} --cfg './config/clinc150_task15/SEQ_full.yaml' --backbone bert-base-cased --classifier Linear --training_epochs 5
If you want to use accelerate for data/model parallel (see here for more help):
accelerate launch --config_file {your-accelerate-config-file} main_CL.py --is_wandb True --wandb_project {your-project-name} --wandb_entity {your-entity-name} --exp_prefix {your-experiment-name} --cfg './config/clinc150_task15/SEQ_full.yaml' --backbone bert-base-cased --classifier Linear --training_epochs 5
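If you do not have an accelerate config file yet, one way to create it is with the standard accelerate CLI (a sketch; the output path is a placeholder):

```bash
# answer the interactive prompts to describe your multi-GPU / DeepSpeed setup,
# then pass the resulting file to accelerate launch as shown above
accelerate config --config_file ./accelerate_config.yaml
```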
Please refer to utils/config.py for more general parameters and models/{model-name}.py for more model-specific parameters.
Main Results
The results on the IIL scenario.
The results on the CIL and TIL scenarios.
Questions and Citation
If you have questions about this repository, please feel free to contact me at junhaozheng47@outlook.com.
If you find this repository useful, please consider citing our paper.
@misc{zheng2023learn,
title={Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models},
author={Junhao Zheng and Shengjie Qiu and Qianli Ma},
year={2023},
eprint={2312.07887},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@article{qiu2024incremental,
title={Incremental Sequence Labeling: A Tale of Two Shifts},
author={Qiu, Shengjie and Zheng, Junhao and Liu, Zhen and Luo, Yicheng and Ma, Qianli},
journal={arXiv preprint arXiv:2402.10447},
year={2024}
}
@misc{zheng2024concept1k,
title={Concept-1K: A Novel Benchmark for Instance Incremental Learning},
author={Junhao Zheng and Shengjie Qiu and Qianli Ma},
year={2024},
eprint={2402.08526},
archivePrefix={arXiv},
primaryClass={cs.LG}
}