
Vision-Language Instruction Tuning: A Review and Analysis

Chen Li<sup>1</sup>, Yixiao Ge<sup>1</sup>, Dian Li<sup>2</sup>, and Ying Shan<sup>1</sup>

<sup>1</sup>ARC Lab, Tencent PCG <sup>2</sup>Foundation Technology Center, Tencent PCG

This repository accompanies our review of work on vision-language instruction tuning (VLIT). We will periodically update the list of recent public VLIT datasets and the VLIT data constructed with the pipeline proposed in the paper.

📆 Schedule

Release New Vision-Language Instruction Data (periodically) ...
Update Public VLIT Datasets and Related Work (periodically) ...
Release Construction Tools
[2023.11.16] Release Instruction Data
[2023.11.15] Paper Released (ArXiv)

🏷️ Catalogue

<a href="#label_evd">Existing VLIT Data</a>
<a href="#label_vdctp">VLIT Data Constructed in This Paper</a>

🗒️ Existing VLIT Dataset

Existing VLIT data generation schemes fall into two categories. Annotation Adaption directly adjusts and rewrites existing annotation data to fit the VLIT data template. Self-Instruct uses a Large Language Model (LLM) to synthesize annotation data from more sources and reorganize it into VLIT data with greater diversity and complexity, at the cost of more noise and hallucination.

VLIT Data
├─ General Instruction
│   ├─ Annotation Adaption
│   └─ Self-Instruct
├─ Specific Instruction
│   ├─ Object/Task-Specific
│   │   ├─ Region
│   │   ├─ Video
│   │   └─ Text
│   └─ Domain-Specific
│       ├─ Medicine
│       ├─ Document
│       └─ PointCloud
├─ Construction Tools
└─ Data Mixing
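
As a rough illustration of the Annotation Adaption category described above, the sketch below wraps an existing caption annotation in an instruction template. This is only a minimal example under stated assumptions: the template strings, the `adapt_caption` helper, and the output field names are hypothetical and are not taken from the paper or from any listed dataset.

```python
# Hypothetical instruction templates; real datasets define their own.
INSTRUCTION_TEMPLATES = [
    "Describe the image briefly.",
    "What is shown in this picture?",
]


def adapt_caption(image_id, caption, template_index=0):
    """Turn one (image, caption) annotation into an instruction-response pair.

    The existing caption is reused as the response; only the instruction
    text is added. Field names here are illustrative assumptions.
    """
    instruction = INSTRUCTION_TEMPLATES[template_index % len(INSTRUCTION_TEMPLATES)]
    return {
        "image_id": image_id,
        "instruction": instruction,
        "response": caption.strip(),
    }


if __name__ == "__main__":
    record = adapt_caption("example_image", "A bear wading through a river.")
    print(record)
```

Self-Instruct differs in that an LLM, rather than a fixed template, produces the instruction and response text, which is where the added diversity and the added noise both come from.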

Dataset

If any dataset is missing, please notify us by email (palchenli@tencent.com) and we will update the list as soon as possible.

Dataset	MLLM	Paper
...	...	...
ShareGPT4V	-	ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
LVLM_NLF	-	DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
LVIS-INSTRUCT4V	-	To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
GranD	GLaMM	GLaMM: Pixel Grounding Large Multimodal Model
ComVint	-	What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
MiniGPT-v2	MiniGPT-v2	MiniGPT-v2: Large Language Model As a Unified Interface for Vision-Language Multi-task Learning
GRIT	Ferret	Ferret: Refer and Ground Anything Anywhere at Any Granularity
SparklesDialogue-VG	SparklesChat	Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
SparklesDialogue-CC	SparklesChat	Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
MMC-Instruction	-	MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
InternLM-XComposer	InternLM-XComposer	InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
AnyMAL	AnyMAL	AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
DreamLLM	DreamLLM	DreamLLM: Synergistic Multimodal Comprehension and Creation
TextBind	TextBind	TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild
PVIT	PVIT	Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
T2M	NExT-GPT	NExT-GPT: Any-to-Any Multimodal LLM
MosIT	NExT-GPT	NExT-GPT: Any-to-Any Multimodal LLM
GPTVQA	MLLM-DataEngine	MLLM-DataEngine: An Iterative Refinement Approach for MLLM
LrvInstruction	-	Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
CIEM	-	CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning
PointLLM	PointLLM	PointLLM: Empowering Large Language Models to Understand Point Clouds
VIGC	VIGC	VIGC: Visual Instruction Generation and Correction
M-HalDetec	-	Detecting and Preventing Hallucinations in Large Vision Language Models
StableLLaVA	StableLLaVA	StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
I4	Cheetor	Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
AS-1B	ASM	The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Multimodal_id_v1	LMEye(IPN)	LMEye: An Interactive Perception Network for Large Language Models
Lynx	Lynx	What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
MGVLID	ChatSpot	ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
BuboGPT	BuboGPT	BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
GRIT-20M	KOSMOS-2	KOSMOS-2: Grounding Multimodal Large Language Models to the World
SVIT	SVIT(MMLLM)	SVIT: Scaling up Visual Instruction Tuning
GPT4RoI	GPT4RoI	GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
PF-1M	Clever Flamingo	Visual Instruction Tuning with Polite Flamingo
Shikra-RD	Shikra	Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic
LLaVAR	LLaVAR	LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
OphGLM	OphGLM	OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue
LAMM	LAMM	LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
MACAW-LLM	MACAW-LLM	Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
InstructBLIP	InstructBLIP	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
MultiModal-GPT	MultiModal-GPT	MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
Valley-Instruct-73	VALLEY	Valley: Video Assistant with Large Language Model Enhanced Ability
Video-LLaMA	Video-LLaMA	Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
MULTIINSTRUCT	OFA(multiinstruct)	MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Video-ChatGPT	Video-ChatGPT	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
MIMIC-IT	Otter	MIMIC-IT: Multi-Modal In-Context Instruction Tuning
M3IT	Ying-VLM	M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
GPT4Tools	GPT4Tools	GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
PMC-VQA	MedVInT-TE/TD	PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
pandagpt_vid	PandaGPT	PandaGPT: One Model To Instruction-Follow Them All
MULTIS	ChatBridge	ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
DetGPT	DetGPT	DetGPT: Detect What You Need via Reasoning
X-LLM	X-LLM	X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
VideoChat	VideoChat	VideoChat: Chat-Centric Video Understanding
cc_sbu_align	MiniGPT-4	MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
LLaVA_instruct_150K	LLaVA	Visual Instruction Tuning

📂 Multi-Modal Instruction Data Constructed in This Paper

In this paper, we propose a vision-language instruction construction pipeline and use it to generate a corresponding dataset. Each generated instruction sample is a multi-round question-answering conversation about a given image. Here are some examples of the generated instruction data:

Examples

Image	Conversation
<img src="./assert/images/COCO_train2014_000000288586.jpg" width="200%"/>	User: Can you describe the scene in the image?<br>AI: Sure, the image captures a large grizzly bear wading through a moving river. The bear has a brown coat and appears to be quite large and furry. It's currently not holding anything in its mouth and it doesn't seem to have caught any fish from the river.<br>User: Are grizzly bears typically found in such environments?<br>AI: Yes, indeed. Grizzly bears are often found near rivers or streams. They are excellent swimmers and often go into the water in search of food. During salmon spawning seasons, they are known to frequent river areas to catch the fish.<br>User: Is the bear possibly looking for food?<br>AI: That's a strong possibility. While there are no fish present in the river in the image, grizzly bears are known to forage for various food sources in rivers, including fish and other aquatic organisms. However, without further context, it's hard to say definitively.<br>User: What other characteristics does a grizzly bear have?<br>AI: Grizzly bears, which are a subspecies of brown bear, are known for their size, strength, and excellent swimming ability. They have a distinct hump over their shoulders which is a muscle mass used for digging and clawing. Their fur color can vary from very light tan to dark brown, with the fur on their back often being darker than the rest of their body. Also, despite their large size, grizzlies are quite fast and can run at speeds of up to 30 miles per hour.
<img src="./assert/images/COCO_train2014_000000030722.jpg" width="200%"/>	User: What is the main animal in the picture?<br>AI: The main animal in the picture is a giraffe.<br>User: What is the giraffe doing?<br>AI: The giraffe is standing next to a tree and some rocks. It's not doing much, just hanging around.<br>User: Is the giraffe standing on the rocks?<br>AI: No, the giraffe is not standing on the rocks. It is standing next to them.<br>User: Can you describe the setting in more detail?<br>AI: The picture mainly focuses on the giraffe, a tree, and some rocks. The giraffe is standing right next to the tree. There are rocks scattered around them.<br>User: Are there any elephants in the picture?<br>AI: No, there are no elephants in this picture. Only a single giraffe is present.<br>User: Are there any unique features of the giraffe?<br>AI: As per the details provided in the image, the giraffe is standing and has a spotted pattern on its skin, which is a characteristic feature of giraffes.<br>User: Is the giraffe sitting on the rocks?<br>AI: No, the giraffe is not sitting on the rocks. It's standing by them.<br>User: Is the giraffe eating the tree leaves?<br>AI: There's no indication in the image that the giraffe is eating the leaves of the tree. It's simply standing beside it.<br>User: Where do giraffes usually live?<br>AI: Giraffes are native to Africa and they usually inhabit savannas, grasslands, and open woodlands. However, in this particular image, the exact location isn't specified.
<img src="./assert/images/COCO_train2014_000000247884_0.jpg" width="200%"/>	User: Is there a car in the picture?<br>AI: Yes, there is a car in the picture and it appears to be parked.<br>User: What can you tell me about the car?<br>AI: Based on the image, the car is parked, which suggests that it is not currently in use. However, I can't provide specifics about its make, model, or color without more visual information.<br>User: Do you think the car is related to the boy on the skateboard?<br>AI: It's hard to definitively state a connection between the car and the boy on the skateboard based on the image alone. The car may belong to someone who is watching the boy, or it may just be parked in the vicinity where the boy is skateboarding.<br>User: Is the car moving?<br>AI: No, the car isn't moving. It's parked.<br>User: Could the car potentially pose a danger to the skateboarder?<br>AI: Since the car is parked, it's unlikely to pose a direct threat to the skateboarder. However, any vehicle, parked or moving, in the vicinity of skateboarders could potentially become a hazard, especially if the skateboarder loses control. This emphasizes the importance of skateboarding in safe, designated areas.<br>User: Is the car possibly there for a skateboarding event?<br>AI: Based on the information available, I can't definitely say if the car is there for a skateboarding event. It could belong to a spectator or participant, or it could be unrelated to the skateboarding activity. More context would be needed to make a clear conclusion.

The instruction data comes in three types (Global, Negative, and Region). The data statistics and download links are as follows.

Download Links

Data Type	Baidu Cloud	Google Drive	Hugging Face
COCO_2014_Images	url	url	url
Global	url	url	url
Negative	url	url	url
Region	url	url	url
Region_Images	url	url	url

Data Format

{
    "image_source": "",
    "construction_time": "",
    "annotations": [
        {
            "img_ids": "",
            "instruction_type": "",
            "conversations": []
        },
        {
            "img_ids": "",
            "instruction_type": "",
            "conversations": []
        }
    ]
}
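
The format above can be checked and summarized with a few lines of Python. The sketch below is one possible approach, not an official loader: the helper names `validate_vlit` and `count_by_type` are hypothetical, and the sample values (including the `instruction_type` strings) are placeholders, since the actual values stored in the released files are not specified here.

```python
import json
from collections import Counter

# Keys taken from the Data Format section above.
REQUIRED_TOP_LEVEL = ("image_source", "construction_time", "annotations")
REQUIRED_ANNOTATION = ("img_ids", "instruction_type", "conversations")


def validate_vlit(data):
    """Return a list of problems; an empty list means all expected keys exist."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in data:
            problems.append(f"missing top-level key: {key}")
    for i, ann in enumerate(data.get("annotations", [])):
        for key in REQUIRED_ANNOTATION:
            if key not in ann:
                problems.append(f"annotations[{i}] missing key: {key}")
    return problems


def count_by_type(data):
    """Count annotations per instruction_type value."""
    return Counter(ann.get("instruction_type", "") for ann in data.get("annotations", []))


# Placeholder sample; every value below is an assumption for illustration.
sample = json.loads("""
{
    "image_source": "example_source",
    "construction_time": "example_time",
    "annotations": [
        {"img_ids": "img_a", "instruction_type": "global", "conversations": []},
        {"img_ids": "img_b", "instruction_type": "region", "conversations": []}
    ]
}
""")

if __name__ == "__main__":
    print(validate_vlit(sample))
    print(count_by_type(sample))
```

To use it on a downloaded file, replace the inline sample with `json.load(open(path, encoding="utf-8"))`, where `path` points to the file you obtained from one of the links above.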

📎 Citation

If you find this repository useful, please consider citing:

@article{li2023visionlanguage,
      title={Vision-Language Instruction Tuning: A Review and Analysis}, 
      author={Chen Li and Yixiao Ge and Dian Li and Ying Shan},
      year={2023},
      eprint={2311.08172},
      archivePrefix={arXiv},
      primaryClass={cs.MM}
}

👍🏻 Acknowledgement

We would like to thank LLaVA, LAVIS, and OpenFlamingo for their well-architected multi-modal LLMs. Thanks also to SEED-Bench for being an open-source and convenient benchmark for evaluating MLLMs.