Vision-Language Instruction Tuning: A Review and Analysis


Chen Li<sup>1</sup>, Yixiao Ge<sup>1</sup>, Dian Li<sup>2</sup>, and Ying Shan<sup>1</sup>.

<sup>1</sup>ARC Lab, Tencent PCG<br> <sup>2</sup>Foundation Technology Center, Tencent PCG

<p align="center"> <img src="https://i.imgur.com/waxVImv.png" alt="Oryx Video-ChatGPT"> </p>

<a href='https://huggingface.co/datasets/lllchenlll/COCO_ARC'><img src='https://img.shields.io/badge/Data-Huggingface-ebc634'></a> <a href='https://creativecommons.org/licenses/by/4.0/'><img src='https://img.shields.io/badge/License-CC%20BY--SA%204.0-eb9334'></a> <a href='https://arxiv.org/abs/2311.08172'><img src='https://img.shields.io/badge/Paper-ArXiv-eb4c34'></a>

This paper reviews existing work on vision-language instruction tuning (VLIT). We will periodically update this repository with recently released public VLIT datasets and with the VLIT data constructed by the pipeline proposed in the paper.


📆 Schedule

🏷️ Catalogue

  1. <a href="#label_evd">Existing VLIT Data</a>
  2. <a href="#label_vdctp">VLIT Data Constructed in This Paper</a>

<span id="label_evd"> </span>

🗒️ Existing VLIT Dataset

Existing VLIT generation schemes fall into two categories. Annotation Adaption directly adjusts and rewrites existing annotation data to fit a VLIT data template. Self-Instruct relies on a Large Language Model (LLM) to synthesize annotation data from more sources and reorganize it into VLIT data with greater diversity and complexity, at the cost of more noise and hallucination.
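The difference between the two schemes can be illustrated with a minimal sketch. Everything here is hypothetical and for illustration only: the template strings, the function names `adapt_caption` and `build_self_instruct_prompt`, and the output keys are assumptions, not the templates or code used in the paper. No LLM is called; the second function only assembles the prompt text that a Self-Instruct style pipeline might send to one.

```python
from typing import Dict, List

# Hypothetical instruction templates (assumption: not taken from the paper).
CAPTION_TEMPLATES = [
    "Describe the image briefly.",
    "What is shown in this picture?",
]


def adapt_caption(image_id: str, caption: str, template_index: int = 0) -> Dict[str, str]:
    """Annotation Adaption: wrap an existing caption in a fixed instruction template."""
    instruction = CAPTION_TEMPLATES[template_index % len(CAPTION_TEMPLATES)]
    return {
        "image_id": image_id,
        "instruction": instruction,
        "response": caption.strip(),  # the original annotation becomes the answer
    }


def build_self_instruct_prompt(annotations: List[str]) -> str:
    """Self-Instruct: gather several annotations into one prompt for an LLM.

    Only the prompt text is built here; sending it to a model is out of scope.
    """
    lines = ["Annotations describing one image:"]
    lines += [f"- {a}" for a in annotations]
    lines.append("Write a multi-round question-answer conversation about this image.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(adapt_caption("img_001", "A bear wades through a river."))
    print(build_self_instruct_prompt(["a brown bear", "a moving river"]))
```

Annotation Adaption is deterministic and stays close to the source annotation, while the Self-Instruct prompt leaves the wording of the conversation to the LLM, which is where the extra diversity, and the extra noise, comes from.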

VLIT Data
├─ General Instruction
│   ├─ Annotation Adaption
│   └─ Self-Instruct
├─ Specific Instruction
│   ├─ Object/Task-Specific
│   │   ├─ Region
│   │   ├─ Video
│   │   └─ Text
│   └─ Domain-Specific
│       ├─ Medicine
│       ├─ Document
│       └─ PointCloud
├─ Construction Tools
└─ Data Mixing

Dataset

If any dataset is missing, please notify us by email (palchenli@tencent.com) and we will update the list as soon as possible.

| Dataset | MLLM | Paper |
| --- | --- | --- |
| ... | ... | ... |
| ShareGPT4V | - | ShareGPT4V: Improving Large Multi-Modal Models with Better Captions |
| LVLM_NLF | - | DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback |
| LVIS-INSTRUCT4V | - | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning |
| GranD | GLaMM | GLaMM: Pixel Grounding Large Multimodal Model |
| ComVint | - | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning |
| MiniGPT-v2 | MiniGPT-v2 | MiniGPT-v2: Large Language Model As a Unified Interface for Vision-Language Multi-task Learning |
| GRIT | Ferret | Ferret: Refer and Ground Anything Anywhere at Any Granularity |
| SparklesDialogue-VG | SparklesChat | Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models |
| SparklesDialogue-CC | SparklesChat | Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models |
| MMC-Instruction | - | MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning |
| InternLM-XComposer | InternLM-XComposer | InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition |
| AnyMAL | AnyMAL | AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model |
| DreamLLM | DreamLLM | DreamLLM: Synergistic Multimodal Comprehension and Creation |
| TextBind | TextBind | TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild |
| PVIT | PVIT | Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models |
| T2M | NExT-GPT | NExT-GPT: Any-to-Any Multimodal LLM |
| MosIT | NExT-GPT | NExT-GPT: Any-to-Any Multimodal LLM |
| GPTVQA | MLLM-DataEngine | MLLM-DataEngine: An Iterative Refinement Approach for MLLM |
| LrvInstruction | - | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning |
| CIEM | - | CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning |
| PointLLM | PointLLM | PointLLM: Empowering Large Language Models to Understand Point Clouds |
| VIGC | VIGC | VIGC: Visual Instruction Generation and Correction |
| M-HalDetec | - | Detecting and Preventing Hallucinations in Large Vision Language Models |
| StableLLaVA | StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data |
| I4 | Cheetor | Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions |
| AS-1B | ASM | The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World |
| Multimodal_id_v1 | LMEye(IPN) | LMEye: An Interactive Perception Network for Large Language Models |
| Lynx | Lynx | What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? |
| MGVLID | ChatSpot | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning |
| BuboGPT | BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs |
| GRIT-20M | KOSMOS-2 | KOSMOS-2: Grounding Multimodal Large Language Models to the World |
| SVIT | SVIT(MMLLM) | SVIT: Scaling up Visual Instruction Tuning |
| GPT4RoI | GPT4RoI | GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest |
| PF-1M | Clever Flamingo | Visual Instruction Tuning with Polite Flamingo |
| Shikra-RD | Shikra | Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic |
| LLaVAR | LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding |
| OphGLM | OphGLM | OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue |
| LAMM | LAMM | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark |
| MACAW-LLM | MACAW-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration |
| InstructBLIP | InstructBLIP | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning |
| MultiModal-GPT | MultiModal-GPT | MultiModal-GPT: A Vision and Language Model for Dialogue with Humans |
| Valley-Instruct-73 | VALLEY | Valley: Video Assistant with Large Language Model Enhanced Ability |
| Video-LLaMA | Video-LLaMA | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding |
| MULTIINSTRUCT | OFA(multiinstruct) | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning |
| Video-ChatGPT | Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models |
| MIMIC-IT | Otter | MIMIC-IT: Multi-Modal In-Context Instruction Tuning |
| M3IT | Ying-VLM | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning |
| GPT4Tools | GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction |
| PMC-VQA | MedVInT-TE/TD | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering |
| pandagpt_vid | PandaGPT | PandaGPT: One Model To Instruction-Follow Them All |
| MULTIS | ChatBridge | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst |
| DetGPT | DetGPT | DetGPT: Detect What You Need via Reasoning |
| X-LLM | X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages |
| VideoChat | VideoChat | VideoChat: Chat-Centric Video Understanding |
| cc_sbu_align | Mini-GPT4 | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models |
| LLaVA_instruct_150K | LLaVA | Visual Instruction Tuning |

<span id="label_vdctp"> </span>

📂 Multi-Modal Instruction Data Constructed in This Paper

In this paper, we propose a vision-language instruction construction pipeline and use it to generate a corresponding dataset. Each generated sample is a multi-round question-answering conversation about a given image. Here are some examples of the generated instruction data:

Examples

| Image | Conversation |
| --- | --- |
| <img src="./assert/images/COCO_train2014_000000288586.jpg" width="200%"/> | User: Can you describe the scene in the image?<br>AI: Sure, the image captures a large grizzly bear wading through a moving river. The bear has a brown coat and appears to be quite large and furry. It's currently not holding anything in its mouth and it doesn't seem to have caught any fish from the river.<br>User: Are grizzly bears typically found in such environments?<br>AI: Yes, indeed. Grizzly bears are often found near rivers or streams. They are excellent swimmers and often go into the water in search of food. During salmon spawning seasons, they are known to frequent river areas to catch the fish.<br>User: Is the bear possibly looking for food?<br>AI: That's a strong possibility. While there are no fish present in the river in the image, grizzly bears are known to forage for various food sources in rivers, including fish and other aquatic organisms. However, without further context, it's hard to say definitively.<br>User: What other characteristics does a grizzly bear have?<br>AI: Grizzly bears, which are a subspecies of brown bear, are known for their size, strength, and excellent swimming ability. They have a distinct hump over their shoulders which is a muscle mass used for digging and clawing. Their fur color can vary from very light tan to dark brown, with the fur on their back often being darker than the rest of their body. Also, despite their large size, grizzlies are quite fast and can run at speeds of up to 30 miles per hour. |
| <img src="./assert/images/COCO_train2014_000000030722.jpg" width="200%"/> | User: What is the main animal in the picture?<br>AI: The main animal in the picture is a giraffe.<br>User: What is the giraffe doing?<br>AI: The giraffe is standing next to a tree and some rocks. It's not doing much, just hanging around.<br>User: Is the giraffe standing on the rocks?<br>AI: No, the giraffe is not standing on the rocks. It is standing next to them.<br>User: Can you describe the setting in more detail?<br>AI: The picture mainly focuses on the giraffe, a tree, and some rocks. The giraffe is standing right next to the tree. There are rocks scattered around them.<br>User: Are there any elephants in the picture?<br>AI: No, there are no elephants in this picture. Only a single giraffe is present.<br>User: Are there any unique features of the giraffe?<br>AI: As per the details provided in the image, the giraffe is standing and has a spotted pattern on its skin, which is a characteristic feature of giraffes.<br>User: Is the giraffe sitting on the rocks?<br>AI: No, the giraffe is not sitting on the rocks. It's standing by them.<br>User: Is the giraffe eating the tree leaves?<br>AI: There's no indication in the image that the giraffe is eating the leaves of the tree. It's simply standing beside it.<br>User: Where do giraffes usually live?<br>AI: Giraffes are native to Africa and they usually inhabit savannas, grasslands, and open woodlands. However, in this particular image, the exact location isn't specified. |
| <img src="./assert/images/COCO_train2014_000000247884_0.jpg" width="200%"/> | User: Is there a car in the picture?<br>AI: Yes, there is a car in the picture and it appears to be parked.<br>User: What can you tell me about the car?<br>AI: Based on the image, the car is parked, which suggests that it is not currently in use. However, I can't provide specifics about its make, model, or color without more visual information.<br>User: Do you think the car is related to the boy on the skateboard?<br>AI: It's hard to definitively state a connection between the car and the boy on the skateboard based on the image alone. The car may belong to someone who is watching the boy, or it may just be parked in the vicinity where the boy is skateboarding.<br>User: Is the car moving?<br>AI: No, the car isn't moving. It's parked.<br>User: Could the car potentially pose a danger to the skateboarder?<br>AI: Since the car is parked, it's unlikely to pose a direct threat to the skateboarder. However, any vehicle, parked or moving, in the vicinity of skateboarders could potentially become a hazard, especially if the skateboarder loses control. This emphasizes the importance of skateboarding in safe, designated areas.<br>User: Is the car possibly there for a skateboarding event?<br>AI: Based on the information available, I can't definitely say if the car is there for a skateboarding event. It could belong to a spectator or participant, or it could be unrelated to the skateboarding activity. More context would be needed to make a clear conclusion. |

There are three different types of instruction data. The data statistics and download links are as follows.

Download Links

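| Data Type | Baidu Cloud | Google Drive | Huggingface |
| --- | --- | --- | --- |
| COCO_2014_Images | url | url | url |
| Global | url | url | url |
| Negative | url | url | url |
| Region | url | url | url |
| Region_Images | url | url | url |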

Data Format

{
    "image_source": "",
    "construction_time": "",
    "annotations": [
      {
        "img_ids": "",
        "instruction_type": "",
        "conversations": []
      },
      
      {
        "img_ids": "",
        "instruction_type": "",
        "conversations": []
      }
    ]
}
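As a rough illustration of how a file in this format might be consumed, the sketch below loads the JSON and counts annotations per `instruction_type`. The top-level keys and annotation keys come from the format above; the helper names `load_vlit_file` and `count_by_instruction_type`, the sample values such as `"global"` and `"region"`, and the shape of each `conversations` entry are assumptions made for the example, since the format above leaves them empty.

```python
import json
from collections import Counter
from typing import Any, Dict


def load_vlit_file(path: str) -> Dict[str, Any]:
    """Read one instruction-data file from disk (the path is supplied by the caller)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_by_instruction_type(data: Dict[str, Any]) -> Counter:
    """Count annotations grouped by their "instruction_type" field."""
    return Counter(
        ann.get("instruction_type", "") for ann in data.get("annotations", [])
    )


# Inline sample following the documented keys; all values are made up.
example = json.loads("""
{
    "image_source": "COCO_2014",
    "construction_time": "2023-11",
    "annotations": [
        {"img_ids": "img_a", "instruction_type": "global", "conversations": []},
        {"img_ids": "img_b", "instruction_type": "region", "conversations": []}
    ]
}
""")

if __name__ == "__main__":
    for instruction_type, n in count_by_instruction_type(example).items():
        print(instruction_type, n)
```

Using `.get` with defaults keeps the counter tolerant of files where a field is absent, which may or may not be needed for the released data.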

📎 Citation

If you find this repository useful, please consider citing:

@article{li2023visionlanguage,
      title={Vision-Language Instruction Tuning: A Review and Analysis}, 
      author={Chen Li and Yixiao Ge and Dian Li and Ying Shan},
      year={2023},
      eprint={2311.08172},
      archivePrefix={arXiv},
      primaryClass={cs.MM}
}

👍🏻 Acknowledgement

We would like to thank LLaVA, LAVIS, and OpenFlamingo for their well-architected multi-modal LLMs. Thanks also to SEED-Bench for providing an open-source and convenient benchmark for evaluating MLLMs.