Awesome-Multimodal-Large-Language-Models
Our MLLM works
🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper
The first comprehensive survey of Multimodal Large Language Models (MLLMs). :sparkles:
You are welcome to add our WeChat ID (wmd_ustc) to join the MLLM communication group! :star2:
🔥🔥🔥 Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
<p align="center"> <img src="./images/freeze-omni.png" width="80%" height="80%"> </p><font size=7><div align='center' > [๐ Project Page] [๐ arXiv Paper] [๐ GitHub] </div></font>
The VITA team proposes Freeze-Omni, a speech-to-speech dialogue model that achieves both low latency and high intelligence while keeping the LLM frozen throughout training.
Freeze-Omni is smart because it is built upon a frozen text-modality LLM. This preserves the original intelligence of the LLM backbone, avoiding the forgetting problem that fine-tuning for speech-modality integration would otherwise induce. ✨
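As a rough illustration of the frozen-backbone recipe described above, here is a minimal PyTorch sketch (not the actual Freeze-Omni code; the checkpoint name and adapter layout are placeholders): the text LLM receives no gradients, and only the speech-side adapter is trained.

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; Freeze-Omni's actual backbone and adapters differ.
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")
for p in llm.parameters():
    p.requires_grad = False  # freeze the text-modality LLM backbone

hidden = llm.config.hidden_size
speech_adapter = torch.nn.Sequential(   # toy speech-feature projector
    torch.nn.Linear(80, hidden),        # e.g. 80-dim filterbank features
    torch.nn.GELU(),
    torch.nn.Linear(hidden, hidden),
)

# Only the adapter parameters are optimized, so the LLM's knowledge is preserved.
optimizer = torch.optim.AdamW(speech_adapter.parameters(), lr=1e-4)
```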
🔥🔥🔥 VITA: Towards Open-Source Interactive Omni Multimodal LLM
<p align="center"> <img src="./images/vita.png" width="70%" height="70%"> </p><font size=7><div align='center' > [๐ Project Page] [๐ arXiv Paper] [๐ GitHub] [๐ค Hugging Face] [๐ฌ WeChat (ๅพฎไฟก)] </div></font>
🔥🔥🔥 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Project Page | Paper | GitHub | Dataset | Leaderboard
We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark for MLLMs in video analysis!
It includes short (< 2 min), medium (4~15 min), and long (30~60 min) videos, ranging from <b>11 seconds to 1 hour</b>. All data are newly collected and annotated by humans rather than drawn from any existing video dataset. ✨
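For quick reference, the duration buckets quoted above can be expressed as a small helper; this is an illustrative sketch based on this description, not part of the official Video-MME tooling.

```python
def video_mme_bucket(duration_s: float) -> str:
    """Map a video duration in seconds to the short/medium/long split described above.

    The thresholds follow the ranges quoted in this README; this is an
    illustrative helper, not an official Video-MME script.
    """
    if duration_s < 2 * 60:
        return "short"       # under 2 minutes
    elif duration_s <= 15 * 60:
        return "medium"      # roughly 4~15 minutes in the benchmark
    else:
        return "long"        # roughly 30~60 minutes in the benchmark


print(video_mme_bucket(11))    # "short"  (shortest video: 11 seconds)
print(video_mme_bucket(3600))  # "long"   (longest video: 1 hour)
```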
🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Paper | Download | Eval Tool | :black_nib: Citation
A representative evaluation benchmark for MLLMs. :sparkles:
🔥🔥🔥 Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper | GitHub
This is the first work to correct hallucinations in multimodal large language models. :sparkles:
<font size=5><center><b> Table of Contents </b> </center></font>
Awesome Papers
Multimodal Instruction Tuning
Multimodal Hallucination
Multimodal In-Context Learning
Multimodal Chain-of-Thought
LLM-Aided Visual Reasoning
Foundation Models
Evaluation
Multimodal RLHF
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | arXiv | 2024-10-09 | - | - |
Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12-17 | GitHub | - |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | GitHub | Demo |
Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | GitHub | Demo |
Others
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | arXiv | 2024-02-03 | GitHub | - |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models | arXiv | 2023-12-21 | GitHub | Local Demo |
Prompt Highlighter: Interactive Control for Multi-Modal LLMs | arXiv | 2023-12-07 | GitHub | - |
Planting a SEED of Vision in Large Language Model | arXiv | 2023-07-16 | GitHub | - |
Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | GitHub | - |
Contextual Object Detection with Multimodal Large Language Models | arXiv | 2023-05-29 | GitHub | Demo |
Generating Images with Multimodal Language Models | arXiv | 2023-05-26 | GitHub | - |
On Evaluating Adversarial Robustness of Large Vision-Language Models | arXiv | 2023-05-26 | GitHub | - |
Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2023-01-31 | GitHub | Demo |
Awesome Datasets
Datasets of Pre-Training for Alignment
Datasets of Multimodal Instruction Tuning
Name | Paper | Link | Notes |
---|---|---|---|
UNK-VQA | UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | Link | A dataset designed to teach models to refrain from answering unanswerable questions |
VEGA | VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | Link | A dataset for enhancing model capabilities in comprehension of interleaved information |
ALLaVA-4V | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Link | A vision-and-language caption and instruction dataset generated by GPT-4V |
IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | Dehallucinative visual instruction for "I Know" hallucination |
CAP2QA | Visually Dehallucinative Instruction Generation | Link | Image-aligned visual instruction dataset |
M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset |
ViP-LLaVA-Instruct | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data |
LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V |
ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning |
SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions, designed to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns |
StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collect visual instruction tuning data |
M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text caption data and audio-image-text localization data |
SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs |
mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding |
PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo. |
ChartLlama | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Link | A multi-modal instruction-tuning dataset for chart understanding and generation |
LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for Text-rich Image Understanding |
MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset covering multiple human motion-related tasks |
LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | Visual instruction tuning dataset for addressing hallucination issue |
Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-modal instruction dataset in terms of multi-turn dialogue |
LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multi-modal instruction tuning dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | 100K high-quality video instruction dataset |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning |
M<sup>3</sup>IT | M<sup>3</sup>IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | Large-scale, broad-coverage multimodal instruction tuning dataset |
LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | Multimodal instruction tuning dataset covering 16 multimodal tasks |
DetGPT | DetGPT: Detect What You Need via Reasoning | Link | Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | Large-scale medical visual question-answering dataset |
VideoChat | VideoChat: Chat-Centric Video Understanding | Link | Video-centric multimodal instruction dataset |
X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | Chinese multimodal instruction dataset |
LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multi-modal instruction-tuning dataset |
cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | A multimodal aligned dataset for improving the model's usability and generation fluency |
LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |
Datasets of In-Context Learning
Name | Paper | Link | Notes |
---|---|---|---|
MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction dataset |
Datasets of Multimodal Chain-of-Thought
Name | Paper | Link | Notes |
---|---|---|---|
EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for explainable emotion reasoning task |
EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
VIP | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |
Datasets of Multimodal RLHF
Name | Paper | Link | Notes |
---|---|---|---|
VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |
Benchmarks for Evaluation
Others
Name | Paper | Link | Notes |
---|---|---|---|
IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | Multimodal dialogue dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually pictured multimodal fine-tuning dataset for learning to reject instructions |
InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions |
OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset that focuses on recognizing visual entities from Wikipedia, given images in the wild |