This repository contains the implementation of the NeurIPS 2023 paper:

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [Project Page] [Paper] <br> Gen Luo<sup>1</sup>, Yiyi Zhou<sup>12</sup>, Tianhe Ren<sup>1</sup>, Shengxin Chen<sup>1</sup>, Xiaoshuai Sun<sup>12</sup>, Rongrong Ji<sup>12</sup><br> <sup>1</sup>Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University
<sup>2</sup>Institute of Artificial Intelligence, Xiamen University

In this work, we propose a novel and affordable solution for vision-language instruction tuning, namely Mixture-of-Modality Adaptation (MMA). In particular, MMA is an end-to-end optimization regime that connects the image encoder and the LLM via lightweight adapters. We also propose a novel routing algorithm in MMA that helps the model automatically shift its reasoning path between single- and multi-modal instructions. Based on MMA, we develop a large vision-language instructed model called LaVIN, which demonstrates superior training efficiency and better reasoning ability than existing multimodal LLMs on various instruction-following tasks.
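To make the idea concrete, below is a minimal, illustrative PyTorch sketch of a routed lightweight adapter in the spirit of MMA. It is not the repository's actual implementation: the class name, bottleneck size, two-expert layout, and the per-sequence routing heuristic are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class ToyMMAdapter(nn.Module):
    """Illustrative mixture-of-modality adapter (NOT the official LaVIN code):
    a lightweight bottleneck with a learned router that softly mixes two
    adaptation paths, e.g. one for text-only and one for text+image inputs."""

    def __init__(self, dim: int, bottleneck: int = 8, temperature: float = 10.0):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # Two "experts": hypothetically one per modality setting.
        self.up = nn.ModuleList([nn.Linear(bottleneck, dim) for _ in range(2)])
        self.router = nn.Linear(dim, 2)
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); route per sequence from the mean token feature.
        weights = torch.softmax(self.router(x.mean(dim=1)) / self.temperature, dim=-1)
        h = torch.relu(self.down(x))
        paths = torch.stack([up(h) for up in self.up], dim=-1)    # (b, s, d, 2)
        routed = (paths * weights[:, None, None, :]).sum(dim=-1)  # weighted mix
        return x + routed                                         # residual adapter

# Example: adapt a 4096-d hidden state from a frozen transformer block.
adapter = ToyMMAdapter(dim=4096)
out = adapter(torch.randn(2, 16, 4096))
print(out.shape)  # torch.Size([2, 16, 4096])
```

Because only the adapters and the router are trained while the image encoder and LLM stay frozen, the number of trainable parameters stays in the millions, which is what makes the tuning "cheap and quick".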


<div align="center"> <img src="./assets/teaser-1.png" width="95%"> </div>

News

TODO

Contents

Setup

Install Package

conda create -n lavin python=3.8 -y
conda activate lavin

# install pytorch
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 -c pytorch

# install dependency and lavin
pip install -r requirements.txt
pip install -e .
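If the install succeeds, a quick check like the following (not part of the repo, just a convenience) should report the pinned versions and whether CUDA is visible:

```python
# Environment sanity check (hypothetical helper, not shipped with LaVIN).
import torch
import torchvision

print(torch.__version__)          # expect 1.12.1
print(torchvision.__version__)    # expect 0.13.1
print(torch.cuda.is_available())  # True if a CUDA build of PyTorch found a GPU
```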

Data Preparation

LaVIN/
  |-- lavin
  |-- scripts
  |-- train.py
  |-- eval.py
  ......
data/
  |-- problem.json
  |-- pid_splits.json
  |-- captions.json
  |-- all_data.json
  |-- images
      |-- train2014      # MSCOCO 2014
      |-- val2014        # MSCOCO 2014
      |-- train          # ScienceQA train image
      |-- val            # ScienceQA val image
      |-- test           # ScienceQA test image
  |-- weights
      |-- tokenizer.model
      |-- 7B
          |-- params.json
          |-- consolidated.00.pth
      |-- 13B
          |-- params.json
          |-- consolidated.00.pth
          |-- consolidated.01.pth
      |-- vicuna_7B
      |-- vicuna_13B
          |-- config.json
          |-- generation_config.json
          |-- pytorch_model.bin.index.json
          |-- special_tokens_map.json
          |-- tokenizer_config.json
          |-- tokenizer.model
          |-- pytorch_model-00001-of-00003.bin
          |-- pytorch_model-00002-of-00003.bin
          |-- pytorch_model-00003-of-00003.bin
      ......
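Before launching training, you may want to confirm the layout matches the tree above. The sketch below is an optional helper that is not shipped with the repo; the checked paths simply mirror the tree, and the list is not exhaustive:

```python
# Hypothetical helper to verify the expected data layout (not part of LaVIN).
from pathlib import Path

data = Path("data")
expected = [
    "problem.json",
    "pid_splits.json",
    "captions.json",
    "all_data.json",
    "images/train2014",
    "images/val2014",
    "weights/tokenizer.model",
    "weights/7B/consolidated.00.pth",
]
for rel in expected:
    status = "ok" if (data / rel).exists() else "MISSING"
    print(f"{status:7s} {rel}")
```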

Fine-tuning

ScienceQA

Reproduce the performance of LaVIN-7B on ScienceQA. For the 7B model, we fine-tune it on 2x A100 GPUs (we find that performance is affected by the number of GPUs; we are working to address this problem).

LLaMA weights:

bash ./scripts/finetuning_sqa_7b.sh

Vicuna weights:

bash ./scripts/finetuning_sqa_vicuna_7b.sh

LaVIN-lite with LLaMA weights (single GPU):

bash ./scripts/finetuning_sqa_vicuna_7b_lite.sh

Reproduce the performance of LaVIN-13B on ScienceQA (~2 hours on 8x A100 (80G)). For the 13B model, we fine-tune it on 8x A100 GPUs.

LLaMA weights:

bash ./scripts/finetuning_sqa_13b.sh

Vicuna weights:

bash ./scripts/finetuning_sqa_vicuna_13b.sh

LaVIN-lite with LLaMA weights (single GPU):

bash ./scripts/finetuning_sqa_vicuna_13b_lite.sh

MultiModal ChatBot

Fine-tune LaVIN-13B on 210k instruction-following samples (~75 hours for 15 epochs and ~25 hours for 5 epochs on 8x A100 (80G)).

LLaMA weights:

bash ./scripts/vl_instruction_tuning_13b.sh

Vicuna weights:

bash ./scripts/vl_instruction_tuning_vicuna_13b.sh

To train on fewer GPUs, you can reduce the number of GPUs in the scripts and increase gradient accumulation via --accum_iter to keep the total batch size at 32. Setting --gradient_checkpointing and --bits 4bit in the scripts greatly reduces GPU memory requirements. A small worked example of the batch-size arithmetic is given below.
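As a rough illustration of that arithmetic (the per-GPU batch size below is a placeholder; check the actual value in the script you run):

```python
# Effective batch size when scaling down the number of GPUs (illustrative values).
per_gpu_batch = 4   # placeholder; read the real value from the training script
num_gpus = 2        # e.g. reduced from 8 to 2 GPUs
accum_iter = 4      # raise gradient accumulation to compensate

# Keep the product equal to the intended total batch size of 32.
assert per_gpu_batch * num_gpus * accum_iter == 32
```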

Demo

LaVIN supports both single- and multi-modal instruction inputs. Try your custom instructions in our demo:

torchrun --nproc_per_node 1 demo.py --server_name 127.0.0.1
<div align="center"> <img src="./assets/demo.gif" width="95%"> </div>

Model Zoo

ScienceQA

| Model | Weights | Time | Memory | #Params | Acc (%) | Download |
|---|---|---|---|---|---|---|
| LaVIN-7B-lite | LLaMA | 29 hours (single GPU) | 9G | 3.8M | 88.35 | google drive |
| LaVIN-13B-lite | LLaMA | 42 hours (single GPU) | 14G | 5.4M | 89.44 | google drive |
| LaVIN-7B | LLaMA | 1.4 hours | 33.9G | 3.8M | 89.37 | google drive |
| LaVIN-7B | Vicuna | 1.4 hours | 33.9G | 3.8M | 89.41 | google drive |
| LaVIN-13B | LLaMA | 2 hours | 55.9G | 5.4M | 90.54 | google drive |
| LaVIN-13B | LLaMA | 4 hours | 55.9G | 5.4M | 90.8 | - |

Multimodal ChatBot

| Model | Weights | Time | Memory | #Params | Acc | Download |
|---|---|---|---|---|---|---|
| LaVIN-13B | LLaMA | 25 hours | 55.9G | 5.4M | - | - |
| LaVIN-13B | LLaMA | 75 hours | 55.9G | 5.4M | - | google drive |

Examples

<div align="center"> <img src="./assets/examples.png" width="95%"> </div>

Star History

Star History Chart

Citation

If you find our code and paper helpful, please kindly cite LaVIN and RepAdapter:

@article{luo2023towards,
  title={Towards Efficient Visual Adaption via Structural Re-parameterization},
  author={Luo, Gen and Huang, Minglang and Zhou, Yiyi and Sun, Xiaoshuai and Jiang, Guangnan and Wang, Zhiyu and Ji, Rongrong},
  journal={arXiv preprint arXiv:2302.08106},
  year={2023}
}

@article{luo2023cheap,
  title={Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models},
  author={Luo, Gen and Zhou, Yiyi and Ren, Tianhe and Chen, Shengxin and Sun, Xiaoshuai and Ji, Rongrong},
  journal={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2023}
}

Acknowledgement

This repo borrows some data and code from LLaMA, Stanford Alpaca, LLaVA, MiniGPT-4, and LLaMA-Adapter. Thanks for their great work.