Link-Context Learning for Multimodal LLMs [CVPR 2024]
<p align="center" width="100%"> <img src="ISEKAI_overview.png" width="80%" height="80%"> </p> <div> <div align="center"> <a href='https://macavityt.github.io/' target='_blank'>Yan Tai<sup>*,2,3,4</sup></a>  <a href='https://weichenfan.github.io/Weichen/' target='_blank'>Weichen Fan<sup>*,†,3</sup></a>  <a href='https://zhaozhang.net/' target='_blank'>Zhao Zhang<sup>3</sup></a>  <a href='https://liuziwei7.github.io/' target='_blank'>Ziwei Liu<sup>✉,1</sup></a> </div> <div> <div align="center"> <sup>1</sup>S-Lab, Nanyang Technological University  <sup>2</sup>Shanghai Jiao Tong University  <sup>3</sup>SenseTime Research  <br><sup>4</sup>Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China<br>  </br> <sup>*</sup> Equal Contribution  <sup>†</sup> Project Lead  <sup>✉</sup> Corresponding Author </div>Official PyTorch implementation of "Link-Context Learning for Multimodal LLMs" [CVPR 2024].
Updates
- 28 Feb, 2024 :boom::boom: Our paper has been accepted by CVPR 2024! 🎉
- 05 Sep, 2023: We release the code, data, and LCL-2WAY-WEIGHT checkpoint.
- 24 Aug, 2023: We release the online demo at 🔗LCL-Demo🔗.
- 17 Aug, 2023: We release the two subsets of ISEKAI (ISEKAI-10 and ISEKAI-pair) at [Hugging Face 🤗].
This repository contains the official implementation and dataset of the following paper:
Link-Context Learning for Multimodal LLMs<br> https://arxiv.org/abs/2308.07891
Abstract: The ability to learn from context with novel concepts, and deliver appropriate responses are essential in human conversations. Despite current Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) being trained on mega-scale datasets, recognizing unseen images or understanding novel concepts in a training-free manner remains a challenge. In-Context Learning (ICL) explores training-free few-shot learning, where models are encouraged to "learn to learn" from limited tasks and generalize to unseen tasks. In this work, we propose link-context learning (LCL), which emphasizes "reasoning from cause and effect" to augment the learning capabilities of MLLMs. LCL goes beyond traditional ICL by explicitly strengthening the causal relationship between the support set and the query set. By providing demonstrations with causal links, LCL guides the model to discern not only the analogy but also the underlying causal associations between data points, which empowers MLLMs to recognize unseen images and understand novel concepts more effectively. To facilitate the evaluation of this novel approach, we introduce the ISEKAI dataset, comprising exclusively of unseen generated image-label pairs designed for link-context learning. Extensive experiments show that our LCL-MLLM exhibits strong link-context learning capabilities to novel concepts over vanilla MLLMs.
Todo
- Release the ISEKAI-10 and ISEKAI-pair.
- Release the dataset usage.
- Release the demo.
- Release the codes and checkpoints.
- Release the full ISEKAI dataset.
- Release checkpoints supporting few-shot detection and VQA tasks.
Getting Started
Install
conda create -n lcl python=3.10
conda activate lcl
pip install -r requirements.txt
Configure accelerate
accelerate config
Dataset
ImageNet
We train under the LCL setting on our rebuilt ImageNet-900 set and evaluate the model on the ImageNet-100 set. You can get the dataset JSON here.
ISEKAI
We evaluate the model on ISEKAI-10 and ISEKAI-pair. You can download both subsets of the ISEKAI dataset at ISEKAI-10 and ISEKAI-pair.
Checkpoint
Download our LCL-2WAY-WEIGHT and LCL-MIX checkpoints from Hugging Face.
Demo
To launch a Gradio web demo, use the following command. Please note that the model evaluates in the torch.float16 format, which requires a GPU with at least 16GB of memory.
python ./mllm/demo/demo.py --model_path /path/to/lcl/ckpt
The demo can also be run with 8-bit quantization, at some cost in performance.
python ./mllm/demo/demo.py --model_path /path/to/lcl/ckpt --load_in_8bit
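As a rough sanity check on the memory requirement mentioned above, the weight footprint can be estimated from the parameter count and the precision. The sketch below assumes a model of about 7 billion parameters; that figure is an assumption for illustration and is not stated in this README. Activations and caches add to the total, so treat the result as a lower bound.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the model weights, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# ASSUMPTION: ~7e9 parameters, chosen only for illustration.
ASSUMED_PARAMS = 7e9

fp16_gib = weight_memory_gib(ASSUMED_PARAMS, 16)  # torch.float16
int8_gib = weight_memory_gib(ASSUMED_PARAMS, 8)   # --load_in_8bit

print(f"float16 weights: {fp16_gib:.1f} GiB")
print(f"8-bit weights:   {int8_gib:.1f} GiB")
```

Under that assumption, the float16 weights alone come to roughly 13 GiB, which is consistent with the 16GB recommendation, and 8-bit loading roughly halves the weight footprint.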
Train
After preparing the data, you can train the model with the following commands:
LCL-2Way-Weight
accelerate launch --num_processes 4 \
--main_process_port 23786 \
mllm/pipeline/finetune.py \
config/lcl_train_2way_weight.py \
--cfg-options data_args.use_icl=True \
--cfg-options model_args.model_name_or_path=/path/to/init/checkpoint
LCL-2Way-Mix
accelerate launch --num_processes 4 \
--main_process_port 23786 \
mllm/pipeline/finetune.py \
config/lcl_train_mix1.py \
--cfg-options data_args.use_icl=True \
--cfg-options model_args.model_name_or_path=/path/to/init/checkpoint
Inference
After preparing the data, you can run inference with the following commands:
ImageNet-100
accelerate launch --num_processes 4 \
--main_process_port 23786 \
mllm/pipeline/finetune.py \
config/lcl_eval_ISEKAI_10.py \
--cfg-options data_args.use_icl=True \
--cfg-options model_args.model_name_or_path=/path/to/checkpoint
Both mmengine-style arguments and Hugging Face Trainer arguments are supported. For example, you can change the evaluation batch size, as in the ISEKAI commands below.
ISEKAI
# ISEKAI10
accelerate launch --num_processes 4 \
--main_process_port 23786 \
mllm/pipeline/finetune.py \
config/shikra_eval_multi_pope.py \
--cfg-options data_args.use_icl=True \
--cfg-options model_args.model_name_or_path=/path/to/checkpoint \
--per_device_eval_batch_size 1
# ISEKAI-PAIR
accelerate launch --num_processes 4 \
--main_process_port 23786 \
mllm/pipeline/finetune.py \
config/shikra_eval_multi_pope.py \
--cfg-options data_args.use_icl=True \
--cfg-options model_args.model_name_or_path=/path/to/checkpoint \
--per_device_eval_batch_size 1
Here, --cfg-options a=balabala b=balabala is an mmengine-style argument; the given values override those predefined in the config file. The --per_device_eval_batch_size option is a Hugging Face Trainer argument.
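To illustrate what overriding a predefined config value means, here is a minimal sketch in plain Python. It does not use mmengine and is not how the repository parses its options; `apply_overrides` is a hypothetical helper, and the starting values in `base` are made up for the example. It only mimics the idea that a dotted key such as data_args.use_icl replaces the nested value from the config file.

```python
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with 'a.b.c=value' overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down the nesting, creating levels that do not exist yet.
            node = node.setdefault(name, {})
        # Interpret booleans; everything else stays a string in this sketch.
        if raw_value in ("True", "False"):
            node[leaf] = raw_value == "True"
        else:
            node[leaf] = raw_value
    return result


# Hypothetical config contents, for illustration only.
base = {
    "data_args": {"use_icl": False},
    "model_args": {"model_name_or_path": None},
}
merged = apply_overrides(
    base,
    ["data_args.use_icl=True", "model_args.model_name_or_path=/path/to/checkpoint"],
)
print(merged)
```

The original `base` dictionary is left untouched, which mirrors the fact that command-line overrides do not modify the config file on disk.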
The prediction results are saved to output_dir/multitest_xxxx_extra_prediction.jsonl, in the same order as the input dataset.
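Because the predictions keep the input order, they can be matched to the inputs by position. The sketch below assumes each line of the file is one JSON object, as the .jsonl extension suggests. The helper names, the `pred` field, and the demo inputs are hypothetical, and a temporary file stands in for a real prediction file.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Load a JSON Lines file into a list of objects, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def pair_by_position(inputs, predictions):
    """Pair each input with the prediction at the same index."""
    if len(inputs) != len(predictions):
        raise ValueError("input and prediction counts differ")
    return list(zip(inputs, predictions))


# Demo: a temporary file with made-up records replaces the real output file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_prediction.jsonl"
    demo_path.write_text('{"pred": "a"}\n{"pred": "b"}\n', encoding="utf-8")
    predictions = read_jsonl(demo_path)

inputs = ["first_input", "second_input"]  # hypothetical input records
pairs = pair_by_position(inputs, predictions)
print(pairs)
```

The length check is a deliberate choice: positional matching silently produces wrong pairs if a run was interrupted and the file is shorter than the dataset.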
Cite
@inproceedings{tai2023link,
title={Link-Context Learning for Multimodal LLMs},
author={Tai, Yan and Fan, Weichen and Zhang, Zhao and Liu, Ziwei},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)},
year={2024}
}