Awesome-Multimodal-Large-Language-Models

Our MLLM works

🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper

The first comprehensive survey of Multimodal Large Language Models (MLLMs). :sparkles:

Add WeChat ID (wmd_ustc) to join our MLLM discussion group! :star2:


🔥🔥🔥 Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

<p align="center"> <img src="./images/freeze-omni.png" width="80%" height="80%"> </p>

<font size=7><div align='center' > [๐ŸŽ Project Page] [๐Ÿ“– arXiv Paper] [๐ŸŒŸ GitHub] </div></font>

The VITA team proposes Freeze-Omni, a speech-to-speech dialogue model that combines low latency with high intelligence and is trained on top of a frozen LLM. 🌟

Freeze-Omni is smart because it is built on a frozen text-modality LLM. This preserves the original intelligence of the LLM backbone and avoids the forgetting problem that fine-tuning to integrate the speech modality would otherwise induce. ✨


🔥🔥🔥 VITA: Towards Open-Source Interactive Omni Multimodal LLM

<p align="center"> <img src="./images/vita.png" width="70%" height="70%"> </p>

<font size=7><div align='center' > [๐ŸŽ Project Page] [๐Ÿ“– arXiv Paper] [๐ŸŒŸ GitHub] [๐Ÿค— Hugging Face] [๐Ÿ’ฌ WeChat (ๅพฎไฟก)] </div></font>


🔥🔥🔥 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Project Page | Paper | GitHub | Dataset | Leaderboard

We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in Video Analysis! 🌟

It includes short- (< 2min), medium- (4min~15min), and long-term (30min~60min) videos, ranging from <b>11 seconds to 1 hour</b>. All data are newly collected and annotated by humans, not from any existing video dataset. ✨


🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Paper | Download | Eval Tool | :black_nib: Citation

A representative evaluation benchmark for MLLMs. :sparkles:


🔥🔥🔥 Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper | GitHub

This is the first work to correct hallucination in multimodal large language models. :sparkles:



Awesome Papers

Multimodal Instruction Tuning

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | arXiv | 2024-10-22 | Github | Demo |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | arXiv | 2024-10-09 | Github | - |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models | arXiv | 2024-09-25 | Huggingface | Demo |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | arXiv | 2024-09-18 | Github | Demo |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | arXiv | 2024-09-04 | Github | - |
| EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | arXiv | 2024-08-28 | Github | Demo |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | arXiv | 2024-08-09 | Github | - |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | arXiv | 2024-08-09 | Github | - |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv | 2024-08-06 | Github | Demo |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv | 2024-08-03 | Github | Demo |
| VILA^2: VILA Augmented VILA | arXiv | 2024-07-24 | - | - |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | arXiv | 2024-07-22 | - | - |
| EVLM: An Efficient Vision-Language Model for Visual Understanding | arXiv | 2024-07-19 | - | - |
| IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model | arXiv | 2024-07-10 | Github | - |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 | Github | Demo |
| OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | arXiv | 2024-06-27 | Github | Local Demo |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | arXiv | 2024-06-24 | Github | Local Demo |
| Long Context Transfer from Language to Vision | arXiv | 2024-06-24 | Github | Local Demo |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | ICML | 2024-06-22 | Github | - |
| Unveiling Encoder-Free Vision-Language Models | arXiv | 2024-06-17 | Github | Local Demo |
| RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CoRL | 2024-06-15 | Github | Demo |
| Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | arXiv | 2024-06-12 | Github | - |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 | Github | Local Demo |
| Parrot: Multilingual Visual Instruction Tuning | arXiv | 2024-06-04 | Github | - |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | arXiv | 2024-05-31 | Github | - |
| Matryoshka Query Transformer for Large Vision-Language Models | arXiv | 2024-05-29 | Github | Demo |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 | Github | - |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | arXiv | 2024-05-24 | Github | Demo |
| Libra: Building Decoupled Vision System on Large Language Models | ICML | 2024-05-16 | Github | Local Demo |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 | Github | Local Demo |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | arXiv | 2024-04-25 | Github | Demo |
| Graphic Design with Large Multimodal Model | arXiv | 2024-04-22 | Github | - |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | arXiv | 2024-04-09 | Github | Demo |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR | 2024-04-08 | Github | - |
| TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | ACM TKDD | 2024-03-28 | - | - |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024-03-27 | Github | Demo |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv | 2024-03-14 | - | - |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | arXiv | 2024-03-12 | Github | Local Demo |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | arXiv | 2024-03-07 | Github | Demo |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | arXiv | 2024-02-29 | Github | - |
| GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | CVPR | 2024-02-26 | Coming soon | Coming soon |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024-02-19 | Github | - |
| Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | arXiv | 2024-02-18 | Github | - |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | arXiv | 2024-02-18 | Github | Demo |
| CoLLaVO: Crayon Large Language and Vision mOdel | arXiv | 2024-02-17 | Github | - |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | arXiv | 2024-02-06 | Github | - |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 | Github | - |
| GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | NeurIPS | 2024-02-03 | Github | - |
| Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | arXiv | 2024-01-31 | Coming soon | - |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Blog | 2024-01-30 | Github | Demo |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | arXiv | 2024-01-29 | Github | Demo |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | arXiv | 2024-01-29 | Github | Demo |
| Yi-VL | - | 2024-01-23 | Github | Local Demo |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | arXiv | 2024-01-22 | - | - |
| ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | ACL | 2024-01-04 | Github | Local Demo |
| MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | Github | - |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR | 2023-12-21 | Github | Demo |
| Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR | 2023-12-15 | Github | Demo |
| CogAgent: A Visual Language Model for GUI Agents | arXiv | 2023-12-14 | Github | Coming soon |
| Pixel Aligned Language Models | arXiv | 2023-12-14 | Coming soon | - |
| See, Say, and Segment: Teaching LMMs to Overcome False Premises | arXiv | 2023-12-13 | Coming soon | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ECCV | 2023-12-11 | Github | Demo |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | CVPR | 2023-12-11 | Github | - |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| OneLLM: One Framework to Align All Modalities with Language | arXiv | 2023-12-06 | Github | Demo |
| Lenna: Language Enhanced Reasoning Detection Assistant | arXiv | 2023-12-05 | Github | - |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | arXiv | 2023-12-04 | - | - |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | arXiv | 2023-12-04 | Github | Local Demo |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | CVPR | 2023-12-01 | Github | Demo |
| Dolphins: Multimodal Language Model for Driving | arXiv | 2023-12-01 | Github | - |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | arXiv | 2023-11-30 | Github | Coming soon |
| VTimeLLM: Empower LLM to Grasp Video Moments | arXiv | 2023-11-30 | Github | Local Demo |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | arXiv | 2023-11-30 | Github | - |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | Github | Coming soon |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 | Github | Demo |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | arXiv | 2023-11-27 | Github | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | Github | Demo |
| LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | arXiv | 2023-11-20 | Github | - |
| An Embodied Generalist Agent in 3D World | arXiv | 2023-11-18 | Github | Demo |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 | Github | Demo |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | CVPR | 2023-11-14 | Github | - |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | arXiv | 2023-11-13 | Github | - |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | arXiv | 2023-11-13 | Github | Demo |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | CVPR | 2023-11-11 | Github | Demo |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | arXiv | 2023-11-09 | Github | Demo |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | arXiv | 2023-11-08 | Github | Local Demo |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | Github | Demo |
| OtterHD: A High-Resolution Multi-modality Model | arXiv | 2023-11-07 | Github | - |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | arXiv | 2023-11-06 | Coming soon | - |
| GLaMM: Pixel Grounding Large Multimodal Model | CVPR | 2023-11-06 | Github | Demo |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | arXiv | 2023-11-02 | Github | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | Github | Local Demo |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | ICLR | 2023-10-20 | Github | - |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | arXiv | 2023-10-11 | Github | - |
| CogVLM: Visual Expert For Large Language Models | arXiv | 2023-10-09 | Github | Demo |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10-05 | Github | Demo |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ICLR | 2023-10-03 | Github | Demo |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | arXiv | 2023-10-01 | Github | - |
| Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | arXiv | 2023-10-01 | Github | Local Demo |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | arXiv | 2023-09-27 | - | - |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | arXiv | 2023-09-26 | Github | Local Demo |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023-09-20 | Github | Coming soon |
| An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | arXiv | 2023-09-18 | Coming soon | - |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following | arXiv | 2023-09-14 | Github | Demo |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 | Github | Demo |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | arXiv | 2023-09-13 | Github | - |
| ImageBind-LLM: Multi-modality Instruction Tuning | arXiv | 2023-09-07 | Github | Demo |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arXiv | 2023-09-05 | - | - |
| PointLLM: Empowering Large Language Models to Understand Point Clouds | arXiv | 2023-08-31 | Github | Demo |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | Github | Local Demo |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | arXiv | 2023-08-25 | Github | - |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | arXiv | 2023-08-25 | Github | Demo |
| Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | arXiv | 2023-08-24 | Github | Demo |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | ICLR | 2023-08-23 | Github | Demo |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | arXiv | 2023-08-20 | Github | - |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | arXiv | 2023-08-19 | Github | Demo |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | arXiv | 2023-08-08 | Github | - |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | ICLR | 2023-08-03 | Github | Demo |
| LISA: Reasoning Segmentation via Large Language Model | arXiv | 2023-08-01 | Github | Demo |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | arXiv | 2023-07-31 | Github | Local Demo |
| 3D-LLM: Injecting the 3D World into Large Language Models | arXiv | 2023-07-24 | Github | - |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | arXiv | 2023-07-18 | - | Demo |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | arXiv | 2023-07-17 | Github | Demo |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | Github | - |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | arXiv | 2023-07-07 | Github | Demo |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | arXiv | 2023-07-05 | Github | - |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | arXiv | 2023-07-04 | Github | Demo |
| Visual Instruction Tuning with Polite Flamingo | arXiv | 2023-07-03 | Github | Demo |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | arXiv | 2023-06-29 | Github | Demo |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| MotionGPT: Human Motion as a Foreign Language | arXiv | 2023-06-26 | Github | - |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | arXiv | 2023-06-15 | Github | Coming soon |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | Github | Demo |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | arXiv | 2023-06-08 | Github | Demo |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| M<sup>3</sup>IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | arXiv | 2023-06-05 | Github | Demo |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv | 2023-06-01 | Github | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | Github | Demo |
| PandaGPT: One Model To Instruction-Follow Them All | arXiv | 2023-05-25 | Github | Demo |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | arXiv | 2023-05-25 | Github | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | arXiv | 2023-05-24 | Github | Local Demo |
| DetGPT: Detect What You Need via Reasoning | arXiv | 2023-05-23 | Github | Demo |
| Pengi: An Audio Language Model for Audio Tasks | NeurIPS | 2023-05-19 | Github | - |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | arXiv | 2023-05-18 | Github | - |
| Listen, Think, and Understand | arXiv | 2023-05-18 | Github | Demo |
| VisualGLM-6B | - | 2023-05-17 | Github | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | Github | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | arXiv | 2023-05-11 | Github | Local Demo |
| VideoChat: Chat-Centric Video Understanding | arXiv | 2023-05-10 | Github | Demo |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | Github | Demo |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | arXiv | 2023-05-07 | Github | - |
| LMEye: An Interactive Perception Network for Large Language Models | arXiv | 2023-05-05 | Github | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | arXiv | 2023-04-28 | Github | Demo |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | Github | Demo |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | arXiv | 2023-04-20 | Github | - |
| Visual Instruction Tuning | NeurIPS | 2023-04-17 | GitHub | Demo |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | ICLR | 2023-03-28 | Github | Demo |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | ACL | 2022-12-21 | Github | - |

Multimodal Hallucination

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models | arXiv | 2024-10-04 | Github | - |
| Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | arXiv | 2024-10-03 | Github | - |
| FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | arXiv | 2024-09-20 | Link | - |
| Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | arXiv | 2024-08-01 | - | - |
| Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | ECCV | 2024-07-31 | Github | - |
| Evaluating and Analyzing Relationship Hallucinations in LVLMs | ICML | 2024-06-24 | Github | - |
| AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | arXiv | 2024-06-18 | Github | - |
| CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models | arXiv | 2024-06-04 | Coming soon | - |
| VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap | arXiv | 2024-05-24 | Coming soon | - |
| Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback | arXiv | 2024-04-22 | - | - |
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | arXiv | 2024-03-27 | - | - |
| What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models | arXiv | 2024-03-20 | Github | - |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | arXiv | 2024-03-13 | - | - |
| Debiasing Multimodal Large Language Models | arXiv | 2024-03-08 | Github | - |
| HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding | arXiv | 2024-03-01 | Github | - |
| IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding | arXiv | 2024-02-28 | - | - |
| Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective | arXiv | 2024-02-22 | Github | - |
| Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models | arXiv | 2024-02-18 | Github | - |
| The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs | arXiv | 2024-02-06 | Github | - |
| Unified Hallucination Detection for Multimodal Large Language Models | arXiv | 2024-02-05 | Github | - |
| A Survey on Hallucination in Large Vision-Language Models | arXiv | 2024-02-01 | - | - |
| Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models | arXiv | 2024-01-18 | - | - |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | arXiv | 2023-12-12 | Github | - |
| MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | arXiv | 2023-12-06 | Github | - |
| Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | arXiv | 2023-12-04 | Github | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | CVPR | 2023-11-29 | Github | - |
| Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | CVPR | 2023-11-28 | Github | - |
| Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | arXiv | 2023-11-28 | Github | Coming soon |
| Mitigating Hallucination in Visual Language Models with Visual Supervision | arXiv | 2023-11-27 | - | - |
| HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | arXiv | 2023-11-22 | Github | - |
| An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | arXiv | 2023-11-13 | Github | - |
| FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | arXiv | 2023-11-02 | Github | - |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | arXiv | 2023-10-24 | Github | Demo |
| Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | arXiv | 2023-10-09 | - | - |
| HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | arXiv | 2023-10-03 | Github | - |
| Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | ICLR | 2023-10-01 | Github | - |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |
| Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | arXiv | 2023-09-07 | - | - |
| CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | arXiv | 2023-09-05 | - | - |
| Evaluation and Analysis of Hallucination in Large Vision-Language Models | arXiv | 2023-08-29 | Github | - |
| VIGC: Visual Instruction Generation and Correction | arXiv | 2023-08-24 | Github | Demo |
| Detecting and Preventing Hallucinations in Large Vision Language Models | arXiv | 2023-08-11 | - | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | ICLR | 2023-06-26 | Github | Demo |
| Evaluating Object Hallucination in Large Vision-Language Models | EMNLP | 2023-05-17 | Github | - |

Multimodal In-Context Learning

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Visual In-Context Learning for Large Vision-Language Models | arXiv | 2024-02-18 | - | - |
| Can MLLMs Perform Text-to-Image In-Context Learning? | arXiv | 2024-02-02 | Github | - |
| Generative Multimodal Models are In-Context Learners | CVPR | 2023-12-20 | Github | Demo |
| Hijacking Context in Large Multi-modal Models | arXiv | 2023-12-07 | - | - |
| Towards More Unified In-context Visual Understanding | arXiv | 2023-12-05 | - | - |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | arXiv | 2023-09-14 | Github | Demo |
| Link-Context Learning for Multimodal LLMs | arXiv | 2023-08-15 | Github | Demo |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | arXiv | 2023-08-02 | Github | Demo |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | arXiv | 2023-07-27 | Github | Local Demo |
| Generative Pretraining in Multimodality | ICLR | 2023-07-11 | Github | Demo |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | arXiv | 2023-06-13 | - | - |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| Exploring Diverse In-Context Configurations for Image Captioning | NeurIPS | 2023-05-24 | Github | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | Github | Demo |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | ICCV | 2023-03-09 | Github | - |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | CVPR | 2023-03-03 | Github | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | AAAI | 2022-06-28 | Github | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022-04-29 | Github | Demo |
| Multimodal Few-Shot Learning with Frozen Language Models | NeurIPS | 2021-06-25 | - | - |

Multimodal Chain-of-Thought

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | arXiv | 2024-04-24 | Github | Local Demo |
| Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models | arXiv | 2024-03-25 | Github | Local Demo |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | NeurIPS | 2023-10-25 | Github | - |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| Explainable Multimodal Emotion Reasoning | arXiv | 2023-06-27 | Github | - |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | arXiv | 2023-05-24 | Github | - |
| Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | arXiv | 2023-05-23 | - | - |
| T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering | arXiv | 2023-05-05 | - | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | arXiv | 2023-05-04 | Github | Demo |
| Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings | arXiv | 2023-05-03 | Coming soon | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
| Chain of Thought Prompt Tuning in Vision Language Models | arXiv | 2023-04-16 | Coming soon | - |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | arXiv | 2023-03-08 | Github | Demo |
| Multimodal Chain-of-Thought Reasoning in Language Models | arXiv | 2023-02-02 | Github | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | NeurIPS | 2022-09-20 | Github | - |

LLM-Aided Visual Reasoning

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models | arXiv | 2024-03-27 | Github | - |
| V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs | arXiv | 2023-12-21 | Github | Local Demo |
| LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing | arXiv | 2023-11-01 | Github | Demo |
| MM-VID: Advancing Video Understanding with GPT-4V(vision) | arXiv | 2023-10-30 | - | - |
| ControlLLM: Augment Language Models with Tools by Searching on Graphs | arXiv | 2023-10-26 | Github | - |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | arXiv | 2023-10-24 | Github | Demo |
| MindAgent: Emergent Gaming Interaction | arXiv | 2023-09-18 | Github | - |
| Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language | arXiv | 2023-06-28 | Github | Demo |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | arXiv | 2023-06-15 | - | - |
| AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | arXiv | 2023-06-14 | Github | - |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | arXiv | 2023-06-13 | - | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | Github | Demo |
| Mindstorms in Natural Language-Based Societies of Mind | arXiv | 2023-05-26 | - | - |
| LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | arXiv | 2023-05-24 | Github | - |
| IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | arXiv | 2023-05-24 | Github | Local Demo |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | arXiv | 2023-05-10 | Github | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | arXiv | 2023-05-04 | Github | Demo |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | Github | Demo |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
| ViperGPT: Visual Inference via Python Execution for Reasoning | arXiv | 2023-03-14 | Github | Local Demo |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | arXiv | 2023-03-12 | Github | Local Demo |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | ICCV | 2023-03-09 | - | - |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | arXiv | 2023-03-08 | Github | Demo |
| Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | CVPR | 2023-03-03 | Github | - |
| From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models | CVPR | 2022-12-21 | Github | Demo |
| SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | arXiv | 2022-11-28 | Github | - |
| PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning | CVPR | 2022-11-21 | Github | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | arXiv | 2022-04-01 | Github | - |

Foundation Models

| Title | Venue | Date | Code | Demo |
|:---|:---:|:---:|:---:|:---:|
| Emu3: Next-Token Prediction is All You Need | arXiv | 2024-09-27 | Github | Local Demo |
| Llama 3.2: Revolutionizing edge AI and vision with open, customizable models | Meta | 2024-09-25 | - | Demo |
| Pixtral-12B | Mistral | 2024-09-17 | - | - |
| xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | arXiv | 2024-08-16 | Github | - |
| The Llama 3 Herd of Models | arXiv | 2024-07-31 | - | - |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | arXiv | 2024-05-16 | - | - |
| Hello GPT-4o | OpenAI | 2024-05-13 | - | - |
| The Claude 3 Model Family: Opus, Sonnet, Haiku | Anthropic | 2024-03-04 | - | - |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Google | 2024-02-15 | - | - |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| Fuyu-8B: A Multimodal Architecture for AI Agents | blog | 2023-10-17 | Huggingface | Demo |
| Unified Model for Image, Video, Audio and Language Tasks | arXiv | 2023-07-30 | Github | Demo |
| PaLI-3 Vision Language Models: Smaller, Faster, Stronger | arXiv | 2023-10-13 | - | - |
| GPT-4V(ision) System Card | OpenAI | 2023-09-25 | - | - |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | arXiv | 2023-09-09 | Github | - |
| Multimodal Foundation Models: From Specialists to General-Purpose Assistants | arXiv | 2023-09-18 | - | - |
| Bootstrapping Vision-Language Learning with Decoupled Language Pre-training | NeurIPS | 2023-07-13 | Github | - |
| Generative Pretraining in Multimodality | arXiv | 2023-07-11 | Github | Demo |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | arXiv | 2023-06-26 | Github | Demo |
| Transfer Visual Prompt Generator across LLMs | arXiv | 2023-05-02 | Github | Demo |
| GPT-4 Technical Report | arXiv | 2023-03-15 | - | - |
| PaLM-E: An Embodied Multimodal Language Model | arXiv | 2023-03-06 | - | Demo |
| Prismer: A Vision-Language Model with An Ensemble of Experts | arXiv | 2023-03-04 | Github | Demo |
| Language Is Not All You Need: Aligning Perception with Language Models | arXiv | 2023-02-27 | Github | - |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | arXiv | 2023-01-30 | Github | Demo |
| VIMA: General Robot Manipulation with Multimodal Prompts | ICML | 2022-10-06 | Github | Local Demo |
| MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge | NeurIPS | 2022-06-17 | Github | - |
| Write and Paint: Generative Vision-Language Models are Unified Modal Learners | ICLR | 2022-06-15 | Github | - |
| Language Models are General-Purpose Interfaces | arXiv | 2022-06-13 | Github | - |

Evaluation

| Title | Venue | Date | Page |
|:---|:---:|:---:|:---:|
| OmniBench: Towards The Future of Universal Omni-Language Models | arXiv | 2024-09-23 | Github |
| MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | arXiv | 2024-08-23 | Github |
| UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | TPAMI | 2023-10-17 | Github |
| MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation | arXiv | 2024-06-29 | Github |
| Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs | arXiv | 2024-06-28 | Github |
| CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | arXiv | 2024-06-26 | Github |
| ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation | arXiv | 2024-04-15 | Github |
| Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | arXiv | 2024-05-31 | Github |
| Benchmarking Large Multimodal Models against Common Corruptions | NAACL | 2024-01-22 | Github |
| Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | arXiv | 2024-01-11 | Github |
| A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | arXiv | 2023-12-19 | Github |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | arXiv | 2023-12-05 | Github |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | arXiv | 2023-11-27 | Github |
| Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | arXiv | 2023-11-24 | Github |
| MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | arXiv | 2023-11-23 | Github |
| VLM-Eval: A General Evaluation on Video Large Language Models | arXiv | 2023-11-20 | Coming soon |
| Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | arXiv | 2023-11-06 | Github |
| On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving | arXiv | 2023-11-09 | Github |
| Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead | arXiv | 2023-11-05 | - |
| A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging | arXiv | 2023-10-31 | - |
| An Early Evaluation of GPT-4V(ision) | arXiv | 2023-10-25 | Github |
| Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation | arXiv | 2023-10-25 | Github |
| HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | CVPR | 2023-10-23 | Github |
| MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | ICLR | 2023-10-03 | Github |
| Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | arXiv | 2023-10-02 | Github |
| Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning | arXiv | 2023-10-01 | Github |
| Can We Edit Multimodal Large Language Models? | arXiv | 2023-10-12 | Github |
| REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets | arXiv | 2023-10-10 | Github |
| The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) | arXiv | 2023-09-29 | - |
| TouchStone: Evaluating Vision-Language Models by Language Models | arXiv | 2023-08-31 | Github |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | Github |
| SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | arXiv | 2023-08-07 | Github |
| Tiny LVLM-eHub: Early Multimodal Experiments with Bard | arXiv | 2023-08-07 | Github |
| MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | arXiv | 2023-08-04 | Github |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | CVPR | 2023-07-30 | Github |
| MMBench: Is Your Multi-modal Model an All-around Player? | arXiv | 2023-07-12 | Github |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | arXiv | 2023-06-23 | Github |
| LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | arXiv | 2023-06-15 | Github |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | Github |
| M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | arXiv | 2023-06-08 | Github |
| On The Hidden Mystery of OCR in Large Multimodal Models | arXiv | 2023-05-13 | Github |

Multimodal RLHF

| Title | Venue | Date | Code | Demo |
|:---|:---:|:---:|:---:|:---:|
| Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | arXiv | 2024-10-09 | - | - |
| Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12-17 | Github | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |

Others

| Title | Venue | Date | Code | Demo |
|:---|:---:|:---:|:---:|:---:|
| Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | arXiv | 2024-02-03 | Github | - |
| VCoder: Versatile Vision Encoders for Multimodal Large Language Models | arXiv | 2023-12-21 | Github | Local Demo |
| Prompt Highlighter: Interactive Control for Multi-Modal LLMs | arXiv | 2023-12-07 | Github | - |
| Planting a SEED of Vision in Large Language Model | arXiv | 2023-07-16 | Github | |
| Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | Github | - |
| Contextual Object Detection with Multimodal Large Language Models | arXiv | 2023-05-29 | Github | Demo |
| Generating Images with Multimodal Language Models | arXiv | 2023-05-26 | Github | - |
| On Evaluating Adversarial Robustness of Large Vision-Language Models | arXiv | 2023-05-26 | Github | - |
| Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2023-01-31 | Github | Demo |

Awesome Datasets

Datasets of Pre-Training for Alignment

| Name | Paper | Type | Modalities |
|:---|:---|:---:|:---:|
| ShareGPT4Video | ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | Caption | Video-Text |
| COYO-700M | COYO-700M: Image-Text Pair Dataset | Caption | Image-Text |
| ShareGPT4V | ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Caption | Image-Text |
| AS-1B | The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | Hybrid | Image-Text |
| InternVid | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | Caption | Video-Text |
| MS-COCO | Microsoft COCO: Common Objects in Context | Caption | Image-Text |
| SBU Captions | Im2Text: Describing Images Using 1 Million Captioned Photographs | Caption | Image-Text |
| Conceptual Captions | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | Caption | Image-Text |
| LAION-400M | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | Caption | Image-Text |
| VG Captions | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Caption | Image-Text |
| Flickr30k | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | Caption | Image-Text |
| AI-Caps | AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding | Caption | Image-Text |
| Wukong Captions | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | Caption | Image-Text |
| GRIT | Kosmos-2: Grounding Multimodal Large Language Models to the World | Caption | Image-Text-Bounding-Box |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Caption | Video-Text |
| MSR-VTT | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | Caption | Video-Text |
| Webvid10M | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Caption | Video-Text |
| WavCaps | WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | Caption | Audio-Text |
| AISHELL-1 | AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline | ASR | Audio-Text |
| AISHELL-2 | AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | ASR | Audio-Text |
| VSDial-CN | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ASR | Image-Audio-Text |

Datasets of Multimodal Instruction Tuning

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| UNK-VQA | UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | Link | A dataset designed to teach models to refrain from answering unanswerable questions |
| VEGA | VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | Link | A dataset for enhancing model capabilities in comprehension of interleaved information |
| ALLaVA-4V | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Link | Vision and language caption and instruction dataset generated by GPT4V |
| IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | Dehallucinative visual instruction for "I Know" hallucination |
| CAP2QA | Visually Dehallucinative Instruction Generation | Link | Image-aligned visual instruction dataset |
| M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset |
| ViP-LLaVA-Instruct | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data |
| LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V |
| ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning |
| SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns. |
| StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collect visual instruction tuning data |
| M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
| MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
| BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data |
| SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs |
| mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding |
| PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo. |
| ChartLlama | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Link | A multi-modal instruction-tuning dataset for chart understanding and generation |
| LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for Text-rich Image Understanding |
| MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset including multiple human motion-related tasks |
| LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | Visual instruction tuning dataset for addressing the hallucination issue |
| Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-modal instruction dataset of multi-turn dialogues |
| LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multi-modal instruction tuning dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A high-quality video instruction dataset of 100K samples |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning |
| M<sup>3</sup>IT | M<sup>3</sup>IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | Large-scale, broad-coverage multimodal instruction tuning dataset |
| LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
| GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
| MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | Multimodal instruction tuning dataset covering 16 multimodal tasks |
| DetGPT | DetGPT: Detect What You Need via Reasoning | Link | Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
| PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | Large-scale medical visual question-answering dataset |
| VideoChat | VideoChat: Chat-Centric Video Understanding | Link | Video-centric multimodal instruction dataset |
| X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | Chinese multimodal instruction dataset |
| LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multi-modal instruction-tuning dataset |
| cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | Multimodal aligned dataset for improving the model's usability and generation fluency |
| LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
| MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |

Datasets of In-Context Learning

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction dataset |

Datasets of Multimodal Chain-of-Thought

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for explainable emotion reasoning task |
| EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
| VIP | Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |

Datasets of Multimodal RLHF

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |

Benchmarks for Evaluation

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| LiveXiv | LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content | Link | A live benchmark based on arXiv papers |
| TemporalBench | TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Link | A benchmark for evaluation of fine-grained temporal understanding |
| OmniBench | OmniBench: Towards The Future of Universal Omni-Language Models | Link | A benchmark that evaluates models' capabilities of processing visual, acoustic, and textual inputs simultaneously |
| MME-RealWorld | MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | Link | A challenging benchmark that involves real-life scenarios |
| CharXiv | CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | Link | Chart understanding benchmark curated by human experts |
| Video-MME | Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | Link | A comprehensive evaluation benchmark of Multi-modal LLMs in video analysis |
| VL-ICL Bench | VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning | Link | A benchmark for M-ICL evaluation, covering a wide spectrum of tasks |
| TempCompass | TempCompass: Do Video LLMs Really Understand Videos? | Link | A benchmark to evaluate the temporal perception ability of Video LLMs |
| CoBSAT | Can MLLMs Perform Text-to-Image In-Context Learning? | Link | A benchmark for text-to-image ICL |
| VQAv2-IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | A benchmark for assessing "I Know" visual hallucination |
| Math-Vision | Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | Link | A diverse mathematical reasoning benchmark |
| CMMMU | CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | Link | A Chinese benchmark involving reasoning and knowledge across multiple disciplines |
| MMCBench | Benchmarking Large Multimodal Models against Common Corruptions | Link | A benchmark for examining self-consistency under common corruptions |
| MMVP | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | Link | A benchmark for assessing visual capabilities |
| TimeIT | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Link | A video instruction-tuning dataset with timestamp annotations, covering diverse time-sensitive video-understanding tasks. |
| ViP-Bench | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A benchmark for visual prompts |
| M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A 3D-centric benchmark |
| Video-Bench | Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | Link | A benchmark for video-MLLM evaluation |
| Charting-New-Territories | Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | Link | A benchmark for evaluating geographic and geospatial capabilities |
| MLLM-Bench | MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | Link | GPT-4V evaluation with per-sample criteria |
| BenchLMM | BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Link | A benchmark for assessment of the robustness against different image styles |
| MMC-Benchmark | MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | Link | A comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts |
| MVBench | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Link | A comprehensive multimodal benchmark for video understanding |
| Bingo | Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | Link | A benchmark for hallucination evaluation that focuses on two common types |
| MagnifierBench | OtterHD: A High-Resolution Multi-modality Model | Link | A benchmark designed to probe models' ability of fine-grained perception |
| HallusionBench | HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | Link | An image-context reasoning benchmark for evaluation of hallucination |
| PCA-EVAL | Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond | Link | A benchmark for evaluating multi-domain embodied decision-making. |
| MMHal-Bench | Aligning Large Multimodal Models with Factually Augmented RLHF | Link | A benchmark for hallucination evaluation |
| MathVista | MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | Link | A benchmark that challenges both visual and math reasoning capabilities |
| SparklesEval | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns based on three distinct criteria. |
| ISEKAI | Link-Context Learning for Multimodal LLMs | Link | A benchmark consisting exclusively of unseen generated image-label pairs designed for link-context learning |
| M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
| I4 | Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions | Link | A benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions |
| SciGraphQA | SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | Link | A large-scale chart-visual question-answering dataset |
| MM-Vet | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | Link | An evaluation benchmark that examines large multimodal models on complicated multimodal tasks |
| SEED-Bench | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Link | A benchmark for evaluation of generative comprehension in MLLMs |
| MMBench | MMBench: Is Your Multi-modal Model an All-around Player? | Link | A systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models |
| Lynx | What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Link | A comprehensive evaluation benchmark including both image and video tasks |
| GAVIE | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | A benchmark to evaluate the hallucination and instruction following ability |
| MME | MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Link | A comprehensive MLLM Evaluation benchmark |
| LVLM-eHub | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Link | An evaluation platform for MLLMs |
| LAMM-Benchmark | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A benchmark for evaluating the quantitative performance of MLLMs on various 2D/3D vision tasks |
| M3Exam | M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | Link | A multilingual, multimodal, multilevel benchmark for evaluating MLLMs |
| OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Link | Dataset for evaluation on multiple capabilities |

Others

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | Multimodal dialogue dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
| CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
| Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually photographed multimodal fine-tuning dataset for learning to reject instructions |
| InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions |
| OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset that focuses on recognizing Wikipedia visual entities from images in the wild |