CoIN: A Benchmark of ContinuaL Instruction tuNing for Multimodal Large Language Model
Cheng Chen, Junchen Zhu, Xu Luo, Hengtao Shen, Lianli Gao, Jingkuan Song
<img src="./assets/architecture.png">

Abstract
Instruction tuning is a prevalent strategy employed by Multimodal Large Language Models (MLLMs) to align with human instructions and adapt to new tasks. Nevertheless, MLLMs face the challenge of adapting to users' evolving knowledge and demands, so how to retain existing skills while acquiring new knowledge needs to be investigated. In this paper, we present a comprehensive benchmark, ContinuaL Instruction tuNing (CoIN), to assess existing MLLMs in the sequential instruction-tuning paradigm. CoIN comprises 10 commonly used datasets spanning 8 task categories, ensuring a diverse range of instructions and tasks. Besides, the trained model is evaluated from two aspects: Instruction Following and General Knowledge, which assess alignment with human intention and the knowledge preserved for reasoning, respectively. Experiments on CoIN demonstrate that current powerful MLLMs still suffer from catastrophic forgetting, and that the failure in intention alignment, rather than the forgetting of knowledge, bears the main responsibility. To this end, we introduce MoELoRA to MLLMs, which is effective at retaining the previous instruction alignment. Experimental results on CoIN consistently show that this method reduces forgetting.
Install
- Clone this repository and navigate to the CoIN folder:

      git clone https://github.com/zackschen/CoIN.git
      cd CoIN

- Install the package:

      conda create -n coin python=3.10 -y
      conda activate coin
      pip install --upgrade pip
      pip install -e .

- Install additional packages for training:

      pip install -e ".[train]"
      pip install flash-attn --no-build-isolation
This repo is based on LLaVA. If you run into a problem, you may find a solution in its issues.
Dataset
Please download the images from the constituent datasets: ScienceQA, VQAv2, VizWiz, TextVQA, GQA, OCR-VQA, ImageNet, RefCOCO, RefCOCO+, and RefCOCOg.
| Image Source | Download Path |
|---|---|
| COCO | train2014, test2015, val2014 |
| RefCOCO | annotation |
| RefCOCO+ | annotation |
| RefCOCOg | annotation |
| ImageNet | images |
| OCR-VQA | images |
| GQA | images |
| TextVQA | train, test |
| ScienceQA | images |
| VizWiz | train, val, test |
After downloading all of them, organize the data as follows:
    ├── COCO2014
    │   └── train2014
    ├── GQA
    │   └── images
    ├── OCR-VQA
    │   └── images
    ├── TextVQA
    │   ├── train_images
    │   └── test_images
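If it helps, here is a minimal shell sketch that creates this layout; the `./data` root is an assumption, so adjust it to wherever your training scripts expect the images:

```bash
# Assumed data root -- adjust to the path your scripts point at.
DATA_ROOT=./data

# Create the image directories mirroring the tree above.
mkdir -p "$DATA_ROOT/COCO2014/train2014" \
         "$DATA_ROOT/GQA/images" \
         "$DATA_ROOT/OCR-VQA/images" \
         "$DATA_ROOT/TextVQA/train_images" \
         "$DATA_ROOT/TextVQA/test_images"
```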
Then, please download the instructions from our dataset page (CoIN_Dataset) and organize them as follows:
    ├── Instruction_Type1
    │   ├── GQA
    │   │   ├── train.json
    │   │   └── test.json
    │   └── ScienceQA
    │       ├── train.json
    │       └── test.json
    ├── Instruction_Type2
    │   └── GQA
    │       ├── train.json
    │       └── test.json
Instruction Tuning
First, download the pretrained projector from the LLaVA Model Zoo and set `pretrain_mm_mlp_adapter` to the projector path.
You can modify the DeepSpeed config as needed.
We provide the scripts for our training order in `scripts/CoIN/Train`.
Note that the `output_dir` of the previous script becomes the `previous_task_model_path` of the next training process.
You can then tune on these datasets in the order you prefer; a sketch of the chaining follows.
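As a rough sketch of that chaining (the per-task script names below are placeholders; use the actual files in `scripts/CoIN/Train`):

```bash
# Sequential tuning sketch -- script names are placeholders for the real
# per-task scripts in scripts/CoIN/Train. Inside each script:
#   pretrain_mm_mlp_adapter  -> path to the downloaded projector (first task)
#   previous_task_model_path -> output_dir of the task trained just before
#   output_dir               -> where this task's checkpoint is written
bash scripts/CoIN/Train/task1.sh   # e.g. the first task in your order
bash scripts/CoIN/Train/task2.sh   # previous_task_model_path = task1 output_dir
bash scripts/CoIN/Train/task3.sh   # previous_task_model_path = task2 output_dir
```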
MoELoRA
To train with MoELoRA, use the training scripts in `scripts/CoIN/Train_MOE`.
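The invocation mirrors the sequential tuning above; the script name here is again a placeholder:

```bash
# MoELoRA variant of the same sequential tuning (placeholder script name).
bash scripts/CoIN/Train_MOE/task1.sh
```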
Evaluation
We provide scripts to evaluate the trained model in `scripts/CoIN/Eval`.
These scripts evaluate the trained model and create the prompts (`prompt_to_eval.json`) for evaluating general knowledge.
To evaluate general knowledge, add the result path to `llava/eval/CoIN/to_eval_prompt.txt` and run `llava/eval/CoIN/evaluate_generalknowledege.py`, which outputs a score indicating how much general knowledge is preserved.
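A minimal sketch of that flow, assuming a placeholder evaluation script name and result directory:

```bash
# 1. Evaluate the trained model (placeholder script name under scripts/CoIN/Eval).
bash scripts/CoIN/Eval/eval_task1.sh

# 2. Register the result path (placeholder) that holds prompt_to_eval.json.
echo "./results/CoIN/task1" >> llava/eval/CoIN/to_eval_prompt.txt

# 3. Score the preserved general knowledge.
python llava/eval/CoIN/evaluate_generalknowledege.py
```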
To Do
- [ ] Evaluate on more MLLMs: MiniGPT-4, MiniGPT-V2, InstructBLIP, Qwen-VL;
- [ ] Evaluate on different sizes of MLLMs;
- [ ] Evaluate with full fine-tuning.
Citation
    @misc{chen2024coin,
        title={CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model},
        author={Cheng Chen and Junchen Zhu and Xu Luo and Hengtao Shen and Lianli Gao and Jingkuan Song},
        year={2024},
        eprint={2403.08350},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }
Acknowledgement
LLaVA: the codebase we built upon, and our base model LLaVA-1.5-7B with its amazing vision-language capabilities!