SVIT: Scaling up Visual Instruction Tuning

Scaling visual instruction tuning up to millions of samples with GPT-4.

📖 arXiv | 🤗 Data | 🤖 Data | ✨ Models

Introduction

We scale up visual instruction tuning (SVIT) by constructing a dataset of 4.2 million visual instruction tuning samples: 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed image descriptions. The data is generated by prompting GPT-4 with the abundant manual annotations of the images.

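| Dataset | Image | Object BBox | Region Description | Image Caption | Instruction Question | Response Answer | GPT |
|---|---|---|---|---|---|---|---|
| MiniGPT-4 | 3.5K | - | - | - | 4 | 3.5K | GPT-3.5 |
| LLaVAR* | 16K | - | - | - | 16K | 16K | GPT-4 |
| LLaVA | 81.5K | 600K | - | 404.7K | 150K | 150K | GPT-4 |
| SVIT | 108.1K | 3.8M | 5.4M | 257.6K | 4.2M | 4.2M | GPT-4 |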

*LLaVAR collects 422K noisy instruction-following data using OCR results and 16K high-quality data using GPT-4.

Model Zoo

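| Checkpoint | Data | Schedule | MME perception | MME cognition | MMBench | MMBench-Chinese | SEED-Bench-1 | MMMU | VQA-v2 | GQA | VizWiz | ScienceQA-IMG | TextVQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVIT-v1.5-LoRA | SVIT-mix-665K | lora-1e | 1560.3 | 364.3 | 68.3 | 63.2 | 61.8 | 34.1 | 80.1 | 63.4 | 56.7 | 69.9 | 61.1 |
| SVIT-v1.5-Full | SVIT-mix-665K | full_ft-1e | 1565.8 | 323.2 | 69.1 | 63.1 | 61.9 | 33.3 | 80.3 | 64.1 | 56.4 | 70.0 | 60.8 |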

Training and Evaluation

The above models are trained on LLaVA-v1.5's architecture. Please follow LLaVA to set up the code and evaluate the models.

For training specifically, follow the visual instruction tuning stage of LLaVA-v1.5: replace LLaVA-v1.5-mix-665K with our SVIT-mix-665K and keep all other settings unchanged.
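As a rough illustration of the swap described above, the sketch below rewrites only the dataset argument in a list of training arguments and leaves everything else alone. It assumes the training script receives the dataset through a `--data_path` argument, as LLaVA's released fine-tuning scripts do. The helper `swap_data_path`, both file paths, and in particular the SVIT file name are placeholders for illustration, not names taken from this repository.

```python
from typing import List


def swap_data_path(argv: List[str], new_path: str) -> List[str]:
    """Return a copy of argv with the value of --data_path replaced."""
    out = list(argv)
    for i, token in enumerate(out):
        if token == "--data_path":
            if i + 1 >= len(out):
                raise ValueError("--data_path has no value")
            out[i + 1] = new_path
            return out
        if token.startswith("--data_path="):
            out[i] = f"--data_path={new_path}"
            return out
    raise ValueError("no --data_path argument found")


# Placeholder arguments: paths are assumptions, adjust to your checkout.
llava_args = [
    "--data_path", "./playground/data/llava_v1_5_mix665k.json",
    "--image_folder", "./playground/data",
]
# Hypothetical file name for the downloaded SVIT-mix-665K annotations.
svit_args = swap_data_path(llava_args, "./playground/data/SVIT_mix_665K.json")
print(svit_args)
```

Every argument other than the data path is passed through untouched, which mirrors the instruction to keep all other settings unchanged.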

Dataset

We build SVIT on the Visual Genome dataset, which comprises 108,077 images with dense annotations for each image, including region descriptions, objects, attributes, and relationships. Since Visual Genome is partially sourced from MS-COCO, we also collect captions for those images from MS-COCO. Leveraging these annotations, we gather thorough and detailed descriptions of the images, including: (1) 257,633 captions from MS-COCO; (2) 3,802,374 object names and their corresponding bounding boxes from Visual Genome; (3) 5,406,592 region descriptions from Visual Genome.
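To get a feel for how dense these annotations are, the counts above can be turned into per-image averages. This is only back-of-the-envelope arithmetic on the quoted totals; the helper name `per_image_average` is made up for this sketch.

```python
# Totals quoted in the paragraph above.
NUM_IMAGES = 108_077
ANNOTATION_COUNTS = {
    "captions": 257_633,
    "objects_with_bboxes": 3_802_374,
    "region_descriptions": 5_406_592,
}


def per_image_average(total: int, num_images: int = NUM_IMAGES) -> float:
    """Average number of annotations per image."""
    return total / num_images


averages = {
    name: round(per_image_average(count), 2)
    for name, count in ANNOTATION_COUNTS.items()
}
for name, avg in averages.items():
    print(f"{name}: {avg} per image")
```

This gives roughly 35 objects and 50 region descriptions per image. The caption figure (about 2.4) is averaged over all 108,077 images, even though only the MS-COCO-sourced subset carries captions, so it understates the number of captions per captioned image.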

Inspired by LLaVA, we design four tasks and prompt the language-only GPT-4 ChatBot to generate the questions and answers accordingly. The prompts are summarized in this folder.
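Because GPT-4 is used in its language-only form, the image has to be conveyed through its text annotations. The sketch below shows one way such a textual context could be assembled from captions, objects with bounding boxes, and region descriptions. It is an illustration under assumptions, not the prompt format used for SVIT (the actual prompts are in the folder referenced above): the `ImageAnnotations` class, the `build_annotation_context` helper, the section labels, the box layout, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x, y, width, height)


@dataclass
class ImageAnnotations:
    captions: List[str] = field(default_factory=list)
    objects: List[Tuple[str, Box]] = field(default_factory=list)
    region_descriptions: List[str] = field(default_factory=list)


def build_annotation_context(ann: ImageAnnotations) -> str:
    """Serialize the annotations of one image into plain text."""
    sections = []
    if ann.captions:
        body = "\n".join(f"- {c}" for c in ann.captions)
        sections.append("Captions:\n" + body)
    if ann.objects:
        body = "\n".join(f"- {name}: {list(box)}" for name, box in ann.objects)
        sections.append("Objects (name: bounding box):\n" + body)
    if ann.region_descriptions:
        body = "\n".join(f"- {r}" for r in ann.region_descriptions)
        sections.append("Region descriptions:\n" + body)
    return "\n\n".join(sections)


# Toy annotations, invented for this example.
example = ImageAnnotations(
    captions=["A man rides a bicycle down a street."],
    objects=[("man", (120, 40, 80, 200)), ("bicycle", (100, 150, 140, 110))],
    region_descriptions=["man wearing a red helmet"],
)
context = build_annotation_context(example)
print(context)
```

The resulting string would then be combined with a task-specific prompt before being sent to the model; that step is left out here because it depends on the prompts and API client in use.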

To increase diversity, we randomly sample an instruction for the detail description task, e.g., "can you describe the image in detail". The complete list of alternative instructions can be found in this file.
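The random sampling of instructions can be sketched in a few lines. Only the first instruction below is quoted from the text; the other two are hypothetical placeholders standing in for the full list in the referenced file, and `sample_instruction` is a name chosen for this example.

```python
import random
from typing import Optional, Sequence

DETAIL_INSTRUCTIONS = (
    "can you describe the image in detail",  # quoted in the text above
    "Describe the image in detail.",  # hypothetical placeholder
    "What is shown in this image? Please be thorough.",  # hypothetical placeholder
)


def sample_instruction(
    pool: Sequence[str] = DETAIL_INSTRUCTIONS,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one instruction uniformly at random from the pool."""
    rng = rng or random.Random()
    return rng.choice(pool)


# A seeded generator makes the sampled sequence reproducible.
rng = random.Random(0)
sampled = [sample_instruction(rng=rng) for _ in range(5)]
print(sampled)
```

Passing an explicit `random.Random` instance keeps dataset construction reproducible without touching the global random state.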

Method

We employ LLaVA, an open-source multimodal large language model that consists of a vision encoder, a large language model, and a vision-language connector. Figure 1 illustrates the model.

<p align="center"> <img src="./images/model.png" width="100%"> <figcaption align = "center">Figure 1: SVIT-v1.5 (LoRA) model architecture and abilities.</figcaption> </p>
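The data flow between the three components named above can be sketched as follows. This is a toy illustration, not the LLaVA implementation: the components are passed in as plain callables, the stand-in functions and vector sizes are invented, and placing the visual tokens before the text tokens is a simplification.

```python
from typing import Callable, List

Vector = List[float]


def run_multimodal_model(
    image: object,
    prompt_tokens: List[Vector],
    vision_encoder: Callable[[object], List[Vector]],
    connector: Callable[[Vector], Vector],
    language_model: Callable[[List[Vector]], str],
) -> str:
    """Encode the image, project its features, and hand everything to the LLM."""
    patch_features = vision_encoder(image)
    # The connector maps each visual feature into the LLM's embedding space.
    visual_tokens = [connector(f) for f in patch_features]
    return language_model(visual_tokens + prompt_tokens)


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_encoder(image: object) -> List[Vector]:
    return [[1.0, 2.0], [3.0, 4.0]]  # two "patches" with 2-d features


def toy_connector(feature: Vector) -> Vector:
    return [feature[0] + feature[1], feature[0] - feature[1], 0.0]  # 2-d -> 3-d


def toy_llm(tokens: List[Vector]) -> str:
    return f"received {len(tokens)} tokens of width {len(tokens[0])}"


answer = run_multimodal_model(
    None, [[0.0, 0.0, 0.0]], toy_encoder, toy_connector, toy_llm
)
print(answer)
```

The point of the sketch is the role of the connector: the vision encoder and the language model work in different feature spaces, and the connector is what lets visual features sit alongside text embeddings in a single input sequence.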

Qualitative Evaluation

<p align="center"> <img src="./images/demo.png" width="100%"> <figcaption align = "center">Figure 2: Demonstration of different abilities of SVIT-v1.5.</figcaption> </p>

Citation

If you find this repository helpful, please cite the paper below.

@article{zhao2023svit,
      title={SVIT: Scaling up Visual Instruction Tuning}, 
      author={Zhao, Bo and Wu, Boya and He, Muyang and Huang, Tiejun},
      journal={arXiv preprint arXiv:2307.04087},
      year={2023}
}