Awesome-Multimodal-Large-Language-Models

Our MLLM works

🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper

The first comprehensive survey of Multimodal Large Language Models (MLLMs). :sparkles:

You are welcome to add our WeChat ID (wmd_ustc) to join the MLLM discussion group! :star2:


🔥🔥🔥 VITA: Towards Open-Source Interactive Omni Multimodal LLM

<p align="center"> <img src="./images/vita.png" width="80%" height="80%"> </p>

<font size=7><div align='center' > [🍎 Project Page] [📖 arXiv Paper] [🌼 GitHub] </div></font>

[2024.08.12] We are announcing VITA, the first-ever open-source Multimodal LLM that can process Video, Image, Text, and Audio, while also providing an advanced multimodal interactive experience. 🌟

<b>Omni Multimodal Understanding</b>. VITA demonstrates robust foundational capabilities in multilingual, visual, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. ✨

<b>Non-awakening Interaction</b>. VITA can be activated by, and respond to, user audio questions in the environment without a wake-up word or button. ✨

<b>Audio Interrupt Interaction</b>. VITA tracks and filters external queries in real time, so users can interrupt the model's generation at any time with a new question, and VITA will respond to the new query accordingly. ✨


🔥🔥🔥 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

<p align="center"> <img src="./images/videomme.jpg" width="80%" height="80%"> </p>

<font size=7><div align='center' > [🍎 Project Page] [📖 arXiv Paper] [📊 Dataset] [🏆 Leaderboard] </div></font>

[2024.06.03] We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in Video Analysis! 🌟

It applies to both <b>image MLLMs</b> (generalizing to multiple images) and <b>video MLLMs</b>. Our leaderboard includes SOTA models such as Gemini 1.5 Pro, GPT-4o, GPT-4V, LLaVA-NeXT-Video, InternVL-Chat-V1.5, and Qwen-VL-Max. 🌟

It includes <b>short- (< 2min)</b>, <b>medium- (4min~15min)</b>, and <b>long-term (30min~60min)</b> videos, ranging from <b>11 seconds to 1 hour</b>. ✨

<b>All data are newly collected and annotated by humans, not from any existing video dataset</b>. ✨
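
Since Video-MME is a multiple-choice QA benchmark over raw videos, evaluation typically boils down to sampling frames, prompting the model with the question and options, and matching the predicted option letter. The snippet below is a minimal, unofficial sketch of that loop; `model.generate` and the annotation field names (`video_path`, `question`, `options`, `answer`) are hypothetical placeholders, not the official toolkit.

```python
# Illustrative sketch only: uniform frame sampling + multiple-choice scoring
# in the style of Video-MME. The model interface and annotation field names
# below are hypothetical placeholders.
import re
import cv2  # pip install opencv-python


def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def evaluate(model, samples):
    """Score multiple-choice answers; each sample provides options A-D."""
    correct = 0
    for s in samples:
        frames = sample_frames(s["video_path"])
        prompt = (
            f"{s['question']}\n" + "\n".join(s["options"]) +
            "\nAnswer with the option letter only."
        )
        reply = model.generate(frames, prompt)  # hypothetical model API
        match = re.search(r"[ABCD]", reply.upper())
        if match and match.group(0) == s["answer"].upper():
            correct += 1
    return correct / len(samples)
```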


🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Project Page [Leaderboards] | Paper | :black_nib: Citation

A comprehensive evaluation benchmark for MLLMs. Now the leaderboards include 50+ advanced models, such as Qwen-VL-Max, Gemini Pro, and GPT-4V. :sparkles:

If you want to add your model to our leaderboards, please feel free to email bradyfu24@gmail.com. We will update the leaderboards promptly. :sparkles:

<details><summary>Download MME :star2::star2: </summary>

The benchmark dataset is collected by Xiamen University for academic research only. You can email yongdongluo@stu.xmu.edu.cn to obtain the dataset, subject to the following requirement.

Requirement: A real-name system is encouraged for better academic communication. Your email suffix needs to match your affiliation (e.g., xx@stu.xmu.edu.cn for Xiamen University); otherwise, please explain why. Please include the information below when sending your application email.

Name: (tell us who you are)
Affiliation: (the name/URL of your university or company)
Job Title: (e.g., professor, PhD student, or researcher)
Email: (your email address)
How to use: (only for non-commercial use)
</details>
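
For reference, MME pairs each test image with two yes/no questions and reports accuracy (per question), accuracy+ (per image, counted only when both of its questions are answered correctly), and their sum as the subtask score. The snippet below is a minimal scoring sketch under that setup; the record field names (`image_id`, `prediction`, `answer`) are illustrative, not the official evaluation script.

```python
# Minimal sketch of MME-style scoring, assuming two yes/no questions per
# image: "acc" is per-question accuracy, "acc+" is the fraction of images
# whose two questions are both correct, and the subtask score is acc + acc+
# (both in percentage points). Field names are illustrative.
from collections import defaultdict


def mme_score(records):
    per_image = defaultdict(list)
    for r in records:
        per_image[r["image_id"]].append(
            r["prediction"].strip().lower() == r["answer"].strip().lower()
        )
    n_questions = sum(len(v) for v in per_image.values())
    acc = 100 * sum(sum(v) for v in per_image.values()) / n_questions
    acc_plus = 100 * sum(all(v) for v in per_image.values()) / len(per_image)
    return {"acc": acc, "acc+": acc_plus, "score": acc + acc_plus}
```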

<br> 📑 If you find our projects helpful to your research, please consider citing: <br>

@article{fu2023mme,
  title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
  author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and others},
  journal={arXiv preprint arXiv:2306.13394},
  year={2023}
}

@article{fu2024vita,
  title={VITA: Towards Open-Source Interactive Omni Multimodal LLM},
  author={Fu, Chaoyou and Lin, Haojia and Long, Zuwei and Shen, Yunhang and Zhao, Meng and Zhang, Yifan and Wang, Xiong and Yin, Di and Ma, Long and Zheng, Xiawu and He, Ran and Ji, Rongrong and Wu, Yunsheng and Shan, Caifeng and Sun, Xing},
  journal={arXiv preprint arXiv:2408.05211},
  year={2024}
}

@article{fu2024video,
  title={Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis},
  author={Fu, Chaoyou and Dai, Yuhan and Luo, Yongdong and Li, Lei and Ren, Shuhuai and Zhang, Renrui and Wang, Zihan and Zhou, Chenyu and Shen, Yunhang and Zhang, Mengdan and others},
  journal={arXiv preprint arXiv:2405.21075},
  year={2024}
}

@article{yin2023survey,
  title={A survey on multimodal large language models},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
  journal={arXiv preprint arXiv:2306.13549},
  year={2023}
}


<font size=5><center><b> Table of Contents </b> </center></font>


Awesome Papers

Multimodal Instruction Tuning

TitleVenueDateCodeDemo
Star <br> mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models <br>arXiv2024-08-09Github-
Star <br> VITA: Towards Open-Source Interactive Omni Multimodal LLM <br>arXiv2024-08-09Github-
Star <br> LLaVA-OneVision: Easy Visual Task Transfer <br>arXiv2024-08-06GithubDemo
Star <br> MiniCPM-V: A GPT-4V Level MLLM on Your Phone <br>arXiv2024-08-03GithubDemo
VILA^2: VILA Augmented VILAarXiv2024-07-24--
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language ModelsarXiv2024-07-22--
EVLM: An Efficient Vision-Language Model for Visual UnderstandingarXiv2024-07-19--
Star <br> InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output <br>arXiv2024-07-03GithubDemo
Star <br> OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding <br>arXiv2024-06-27GithubLocal Demo
Star <br> Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs <br>arXiv2024-06-24GithubLocal Demo
Star <br> Long Context Transfer from Language to Vision <br>arXiv2024-06-24GithubLocal Demo
Star <br> Unveiling Encoder-Free Vision-Language Models <br>arXiv2024-06-17GithubLocal Demo
Star <br> Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models <br>arXiv2024-06-12Github-
Star <br> VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs <br>arXiv2024-06-11GithubLocal Demo
Star <br> Parrot: Multilingual Visual Instruction Tuning <br>arXiv2024-06-04Github-
Star <br> Ovis: Structural Embedding Alignment for Multimodal Large Language Model <br>arXiv2024-05-31Github-
Star <br> Matryoshka Query Transformer for Large Vision-Language Models <br>arXiv2024-05-29GithubDemo
Star <br> ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models <br>arXiv2024-05-24Github-
Star <br> Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models <br>arXiv2024-05-24GithubDemo
Star <br> Libra: Building Decoupled Vision System on Large Language Models <br>ICML2024-05-16GithubLocal Demo
Star <br> CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts <br>arXiv2024-05-09GithubLocal Demo
Star <br> How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites <br>arXiv2024-04-25GithubDemo
Star <br> Graphic Design with Large Multimodal Model <br>arXiv2024-04-22Github-
Star <br> InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD <br>arXiv2024-04-09GithubDemo
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMsarXiv2024-04-08--
Star <br> MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding <br>CVPR2024-04-08Github-
TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language ModelACM TKDD2024-03-28--
Star <br> Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models <br>arXiv2024-03-27GithubDemo
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-trainingarXiv2024-03-14--
Star <br> MoAI: Mixture of All Intelligence for Large Language and Vision Models <br>arXiv2024-03-12GithubLocal Demo
Star <br> TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document <br>arXiv2024-03-07GithubDemo
Star <br> The All-Seeing Project V2: Towards General Relation Comprehension of the Open WorldarXiv2024-02-29Github-
GROUNDHOG: Grounding Large Language Models to Holistic SegmentationCVPR2024-02-26Coming soonComing soon
Star <br> AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling <br>arXiv2024-02-19Github-
Star <br> Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning <br>arXiv2024-02-18Github-
Star <br> ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model <br>arXiv2024-02-18GithubDemo
Star <br> CoLLaVO: Crayon Large Language and Vision mOdel <br>arXiv2024-02-17Github-
Star <br> CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations <br>arXiv2024-02-06Github-
Star <br> MobileVLM V2: Faster and Stronger Baseline for Vision Language Model <br>arXiv2024-02-06Github-
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical StudyarXiv2024-01-31Coming soon-
Star <br> LLaVA-NeXT: Improved reasoning, OCR, and world knowledgeBlog2024-01-30GithubDemo
Star <br> MoE-LLaVA: Mixture of Experts for Large Vision-Language Models <br>arXiv2024-01-29GithubDemo
Star <br> InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model <br>arXiv2024-01-29GithubDemo
Star <br> Yi-VL <br>-2024-01-23GithubLocal Demo
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning CapabilitiesarXiv2024-01-22--
Star <br> MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices <br>arXiv2023-12-28Github-
Star <br> InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks <br>CVPR2023-12-21GithubDemo
Star <br> Osprey: Pixel Understanding with Visual Instruction Tuning <br>CVPR2023-12-15GithubDemo
Star <br> CogAgent: A Visual Language Model for GUI Agents <br>arXiv2023-12-14GithubComing soon
Pixel Aligned Language ModelsarXiv2023-12-14Coming soon-
See, Say, and Segment: Teaching LMMs to Overcome False PremisesarXiv2023-12-13Coming soon-
Star <br> Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models <br>arXiv2023-12-11GithubDemo
Star <br> Honeybee: Locality-enhanced Projector for Multimodal LLM <br>arXiv2023-12-11Github-
Gemini: A Family of Highly Capable Multimodal ModelsGoogle2023-12-06--
Star <br> OneLLM: One Framework to Align All Modalities with Language <br>arXiv2023-12-06GithubDemo
Star <br> Lenna: Language Enhanced Reasoning Detection Assistant <br>arXiv2023-12-05Github-
VaQuitA: Enhancing Alignment in LLM-Assisted Video UnderstandingarXiv2023-12-04--
Star <br> TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding <br>arXiv2023-12-04GithubLocal Demo
Star <br> Making Large Multimodal Models Understand Arbitrary Visual Prompts <br>CVPR2023-12-01GithubDemo
Star <br> Dolphins: Multimodal Language Model for Driving <br>arXiv2023-12-01Github-
Star <br> LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning <br>arXiv2023-11-30GithubComing soon
Star <br> VTimeLLM: Empower LLM to Grasp Video Moments <br>arXiv2023-11-30GithubLocal Demo
Star <br> mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model <br>arXiv2023-11-30Github-
Star <br> LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models <br>arXiv2023-11-28GithubComing soon
Star <br> LLMGA: Multimodal Large Language Model based Generation Assistant <br>arXiv2023-11-27GithubDemo
Star <br> ChartLlama: A Multimodal LLM for Chart Understanding and Generation <br>arXiv2023-11-27Github-
Star <br> ShareGPT4V: Improving Large Multi-Modal Models with Better Captions <br>arXiv2023-11-21GithubDemo
Star <br> LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge <br>arXiv2023-11-20Github-
Star <br> An Embodied Generalist Agent in 3D World <br>arXiv2023-11-18GithubDemo
Star <br> Video-LLaVA: Learning United Visual Representation by Alignment Before Projection <br>arXiv2023-11-16GithubDemo
Star <br> Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding <br>CVPR2023-11-14Github-
Star <br> To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning <br>arXiv2023-11-13Github-
Star <br> SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models <br>arXiv2023-11-13GithubDemo
Star <br> Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models <br>CVPR2023-11-11GithubDemo
Star <br> LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents <br>arXiv2023-11-09GithubDemo
Star <br> NExT-Chat: An LMM for Chat, Detection and Segmentation <br>arXiv2023-11-08GithubLocal Demo
Star <br> mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration <br>arXiv2023-11-07GithubDemo
Star <br> OtterHD: A High-Resolution Multi-modality Model <br>arXiv2023-11-07Github-
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative DecodingarXiv2023-11-06Coming soon-
Star <br> GLaMM: Pixel Grounding Large Multimodal Model <br>CVPR2023-11-06GithubDemo
Star <br> What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning <br>arXiv2023-11-02Github-
Star <br> MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning <br>arXiv2023-10-14GithubLocal Demo
Star <br> Ferret: Refer and Ground Anything Anywhere at Any Granularity <br>arXiv2023-10-11Github-
Star <br> CogVLM: Visual Expert For Large Language Models <br>arXiv2023-10-09GithubDemo
Star <br> Improved Baselines with Visual Instruction Tuning <br>arXiv2023-10-05GithubDemo
Star <br> LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment <br>ICLR2023-10-03GithubDemo
Star <br> Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMsarXiv2023-10-01Github-
Star <br> Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants <br>arXiv2023-10-01GithubLocal Demo
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language ModelarXiv2023-09-27--
Star <br> InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition <br>arXiv2023-09-26GithubLocal Demo
Star <br> DreamLLM: Synergistic Multimodal Comprehension and Creation <br>ICLR2023-09-20GithubComing soon
An Empirical Study of Scaling Instruction-Tuned Large Multimodal ModelsarXiv2023-09-18Coming soon-
Star <br> TextBind: Multi-turn Interleaved Multimodal Instruction-following <br>arXiv2023-09-14GithubDemo
Star <br> NExT-GPT: Any-to-Any Multimodal LLM <br>arXiv2023-09-11GithubDemo
Star <br> Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics <br>arXiv2023-09-13Github-
Star <br> ImageBind-LLM: Multi-modality Instruction Tuning <br>arXiv2023-09-07GithubDemo
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction TuningarXiv2023-09-05--
Star <br> PointLLM: Empowering Large Language Models to Understand Point Clouds <br>arXiv2023-08-31GithubDemo
Star <br> ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models <br>arXiv2023-08-31GithubLocal Demo
Star <br> MLLM-DataEngine: An Iterative Refinement Approach for MLLM <br>arXiv2023-08-25Github-
Star <br> Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models <br>arXiv2023-08-25GithubDemo
Star <br> Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities <br>arXiv2023-08-24GithubDemo
Star <br> Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages <br>ICLR2023-08-23GithubDemo
Star <br> StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data <br>arXiv2023-08-20Github-
Star <br> BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions <br>arXiv2023-08-19GithubDemo
Star <br> Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions <br>arXiv2023-08-08Github-
Star <br> The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World <br>ICLR2023-08-03GithubDemo
Star <br> LISA: Reasoning Segmentation via Large Language Model <br>arXiv2023-08-01GithubDemo
Star <br> MovieChat: From Dense Token to Sparse Memory for Long Video Understanding <br>arXiv2023-07-31GithubLocal Demo
Star <br> 3D-LLM: Injecting the 3D World into Large Language Models <br>arXiv2023-07-24Github-
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning <br>arXiv2023-07-18-Demo
Star <br> BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs <br>arXiv2023-07-17GithubDemo
Star <br> SVIT: Scaling up Visual Instruction Tuning <br>arXiv2023-07-09Github-
Star <br> GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest <br>arXiv2023-07-07GithubDemo
Star <br> What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? <br>arXiv2023-07-05Github-
Star <br> mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding <br>arXiv2023-07-04GithubDemo
Star <br> Visual Instruction Tuning with Polite Flamingo <br >arXiv2023-07-03GithubDemo
Star <br> LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding <br>arXiv2023-06-29GithubDemo
Star <br> Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic <br>arXiv2023-06-27GithubDemo
Star <br> MotionGPT: Human Motion as a Foreign Language <br>arXiv2023-06-26Github-
Star <br> Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration <br>arXiv2023-06-15GithubComing soon
Star <br> LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark <br>arXiv2023-06-11GithubDemo
Star <br> Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models <br>arXiv2023-06-08GithubDemo
Star <br> MIMIC-IT: Multi-Modal In-Context Instruction Tuning <br>arXiv2023-06-08GithubDemo
M<sup>3</sup>IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction TuningarXiv2023-06-07--
Star <br> Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding <br>arXiv2023-06-05GithubDemo
Star <br> LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day <br>arXiv2023-06-01Github-
Star <br> GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction <br>arXiv2023-05-30GithubDemo
Star <br> PandaGPT: One Model To Instruction-Follow Them All <br>arXiv2023-05-25GithubDemo
Star <br> ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst <br>arXiv2023-05-25Github-
Star <br> Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models <br>arXiv2023-05-24GithubLocal Demo
Star <br> DetGPT: Detect What You Need via Reasoning <br>arXiv2023-05-23GithubDemo
Star <br> Pengi: An Audio Language Model for Audio Tasks <br>NeurIPS2023-05-19Github-
Star <br> VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks <br>arXiv2023-05-18Github-
Star <br> Listen, Think, and Understand <br>arXiv2023-05-18GithubDemo
Star <br> VisualGLM-6B <br>-2023-05-17GithubLocal Demo
Star <br> PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering <br>arXiv2023-05-17Github-
Star <br> InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning <br>arXiv2023-05-11GithubLocal Demo
Star <br> VideoChat: Chat-Centric Video Understanding <br>arXiv2023-05-10GithubDemo
Star <br> MultiModal-GPT: A Vision and Language Model for Dialogue with Humans <br>arXiv2023-05-08GithubDemo
Star <br> X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages <br>arXiv2023-05-07Github-
Star <br> LMEye: An Interactive Perception Network for Large Language Models <br>arXiv2023-05-05GithubLocal Demo
Star <br> LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model <br>arXiv2023-04-28GithubDemo
Star <br> mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality <br>arXiv2023-04-27GithubDemo
Star <br> MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models <br>arXiv2023-04-20Github-
Star <br> Visual Instruction Tuning <br>NeurIPS2023-04-17GitHubDemo
Star <br> LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention <br>ICLR2023-03-28GithubDemo
Star <br> MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning <br>ACL2022-12-21Github-

Multimodal Hallucination

TitleVenueDateCodeDemo
Alleviating Hallucination in Large Vision-Language Models with Active Retrieval AugmentationarXiv2024-08-01--
Star <br> Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs <br>ECCV2024-07-31Github-
Star <br> Evaluating and Analyzing Relationship Hallucinations in LVLMs <br>ICML2024-06-24Github-
Star <br> AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention <br>arXiv2024-06-18Github-
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal ModelsarXiv2024-06-04Coming soon-
VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception GaparXiv2024-05-24Coming soon-
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI FeedbackarXiv2024-04-22--
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive DecodingarXiv2024-03-27--
Star <br> What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models <br>arXiv2024-03-20Github-
Strengthening Multimodal Large Language Model with Bootstrapped Preference OptimizationarXiv2024-03-13--
Star <br> Debiasing Multimodal Large Language Models <br>arXiv2024-03-08Github-
Star <br> HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding <br>arXiv2024-03-01Github-
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased DecodingarXiv2024-02-28--
Star <br> Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective <br>arXiv2024-02-22Github-
Star <br> Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models <br>arXiv2024-02-18Github-
Star <br> The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs <br>arXiv2024-02-06Github-
Star <br> Unified Hallucination Detection for Multimodal Large Language Models <br>arXiv2024-02-05Github-
A Survey on Hallucination in Large Vision-Language ModelsarXiv2024-02-01--
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language ModelsarXiv2024-01-18--
Star <br> Hallucination Augmented Contrastive Learning for Multimodal Large Language Model <br>arXiv2023-12-12Github-
Star <br> MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations <br>arXiv2023-12-06Github-
Star <br> Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites <br>arXiv2023-12-04Github-
Star <br> RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback <br>arXiv2023-12-01GithubDemo
Star <br> OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation <br>CVPR2023-11-29Github-
Star <br> Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding <br>CVPR2023-11-28Github-
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference OptimizationarXiv2023-11-28GithubComing Soon
Mitigating Hallucination in Visual Language Models with Visual SupervisionarXiv2023-11-27--
Star <br> HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data <br>arXiv2023-11-22Github-
Star <br> An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation <br>arXiv2023-11-13Github-
Star <br> FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models <br>arXiv2023-11-02Github-
Star <br> Woodpecker: Hallucination Correction for Multimodal Large Language Models <br>arXiv2023-10-24GithubDemo
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language ModelsarXiv2023-10-09--
Star <br> HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption <br>arXiv2023-10-03Github-
Star <br> Analyzing and Mitigating Object Hallucination in Large Vision-Language Models <br>ICLR2023-10-01Github-
Star <br> Aligning Large Multimodal Models with Factually Augmented RLHF <br>arXiv2023-09-25GithubDemo
Evaluation and Mitigation of Agnosia in Multimodal Large Language ModelsarXiv2023-09-07--
CIEM: Contrastive Instruction Evaluation Method for Better Instruction TuningarXiv2023-09-05--
Star <br> Evaluation and Analysis of Hallucination in Large Vision-Language Models <br>arXiv2023-08-29Github-
Star <br> VIGC: Visual Instruction Generation and Correction <br>arXiv2023-08-24GithubDemo
Detecting and Preventing Hallucinations in Large Vision Language ModelsarXiv2023-08-11--
Star <br> Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning <br>ICLR2023-06-26GithubDemo
Star <br> Evaluating Object Hallucination in Large Vision-Language Models <br>EMNLP2023-05-17Github-

Multimodal In-Context Learning

TitleVenueDateCodeDemo
Visual In-Context Learning for Large Vision-Language ModelsarXiv2024-02-18--
Star <br> Can MLLMs Perform Text-to-Image In-Context Learning? <br>arXiv2024-02-02Github-
Star <br> Generative Multimodal Models are In-Context Learners <br>CVPR2023-12-20GithubDemo
Hijacking Context in Large Multi-modal ModelsarXiv2023-12-07--
Towards More Unified In-context Visual UnderstandingarXiv2023-12-05--
Star <br> MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning <br>arXiv2023-09-14GithubDemo
Star <br> Link-Context Learning for Multimodal LLMs <br>arXiv2023-08-15GithubDemo
Star <br> OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models <br>arXiv2023-08-02GithubDemo
Star <br> Med-Flamingo: a Multimodal Medical Few-shot Learner <br>arXiv2023-07-27GithubLocal Demo
Star <br> Generative Pretraining in Multimodality <br>ICLR2023-07-11GithubDemo
AVIS: Autonomous Visual Information Seeking with Large Language ModelsarXiv2023-06-13--
Star <br> MIMIC-IT: Multi-Modal In-Context Instruction Tuning <br>arXiv2023-06-08GithubDemo
Star <br> Exploring Diverse In-Context Configurations for Image Captioning <br>NeurIPS2023-05-24Github-
Star <br> Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models <br>arXiv2023-04-19GithubDemo
Star <br> HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace <br>arXiv2023-03-30GithubDemo
Star <br> MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action <br>arXiv2023-03-20GithubDemo
Star <br> ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction <br>ICCV2023-03-09Github-
Star <br> Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering <br>CVPR2023-03-03Github-
Star <br> Visual Programming: Compositional visual reasoning without training <br>CVPR2022-11-18GithubLocal Demo
Star <br> An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA <br>AAAI2022-06-28Github-
Star <br> Flamingo: a Visual Language Model for Few-Shot Learning <br>NeurIPS2022-04-29GithubDemo
Multimodal Few-Shot Learning with Frozen Language ModelsNeurIPS2021-06-25--

Multimodal Chain-of-Thought

TitleVenueDateCodeDemo
Star <br> Cantor: Inspiring Multimodal Chain-of-Thought of MLLM <br>arXiv2024-04-24GithubLocal Demo
Star <br> Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models <br>arXiv2024-03-25GithubLocal Demo
Star <br> DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models <br>NeurIPS2023-10-25Github-
Star <br> Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic <br>arXiv2023-06-27GithubDemo
Star <br> Explainable Multimodal Emotion Reasoning <br>arXiv2023-06-27Github-
Star <br> EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought <br>arXiv2023-05-24Github-
Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and PredictionarXiv2023-05-23--
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question AnsweringarXiv2023-05-05--
Star <br> Caption Anything: Interactive Image Description with Diverse Multimodal Controls <br>arXiv2023-05-04GithubDemo
Visual Chain of Thought: Bridging Logical Gaps with Multimodal InfillingsarXiv2023-05-03Coming soon-
Star <br> Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models <br>arXiv2023-04-19GithubDemo
Chain of Thought Prompt Tuning in Vision Language ModelsarXiv2023-04-16Coming soon-
Star <br> MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action <br>arXiv2023-03-20GithubDemo
Star <br> Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models <br>arXiv2023-03-08GithubDemo
Star <br> Multimodal Chain-of-Thought Reasoning in Language Models <br>arXiv2023-02-02Github-
Star <br> Visual Programming: Compositional visual reasoning without training <br>CVPR2022-11-18GithubLocal Demo
Star <br> Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering <br>NeurIPS2022-09-20Github-

LLM-Aided Visual Reasoning

TitleVenueDateCodeDemo
Star <br> Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models <br>arXiv2024-03-27Github-
Star <br> V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs <br>arXiv2023-12-21GithubLocal Demo
Star <br> LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing <br>arXiv2023-11-01GithubDemo
MM-VID: Advancing Video Understanding with GPT-4V(vision)arXiv2023-10-30--
Star <br> ControlLLM: Augment Language Models with Tools by Searching on Graphs <br>arXiv2023-10-26Github-
Star <br> Woodpecker: Hallucination Correction for Multimodal Large Language Models <br>arXiv2023-10-24GithubDemo
Star <br> MindAgent: Emergent Gaming Interaction <br>arXiv2023-09-18Github-
Star <br> Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language <br>arXiv2023-06-28GithubDemo
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language ModelsarXiv2023-06-15--
Star <br> AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn <br>arXiv2023-06-14Github-
AVIS: Autonomous Visual Information Seeking with Large Language ModelsarXiv2023-06-13--
Star <br> GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction <br>arXiv2023-05-30GithubDemo
Mindstorms in Natural Language-Based Societies of MindarXiv2023-05-26--
Star <br> LayoutGPT: Compositional Visual Planning and Generation with Large Language Models <br>arXiv2023-05-24Github-
Star <br> IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models <br>arXiv2023-05-24GithubLocal Demo
Star <br> Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation <br>arXiv2023-05-10Github-
Star <br> Caption Anything: Interactive Image Description with Diverse Multimodal Controls <br>arXiv2023-05-04GithubDemo
Star <br> Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models <br>arXiv2023-04-19GithubDemo
Star <br> HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace <br>arXiv2023-03-30GithubDemo
Star <br> MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action <br>arXiv2023-03-20GithubDemo
Star <br> ViperGPT: Visual Inference via Python Execution for Reasoning <br>arXiv2023-03-14GithubLocal Demo
Star <br> ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions <br>arXiv2023-03-12GithubLocal Demo
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information ExtractionICCV2023-03-09--
Star <br> Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models <br>arXiv2023-03-08GithubDemo
Star <br> Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners <br>CVPR2023-03-03Github-
Star <br> From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models <br>CVPR2022-12-21GithubDemo
Star <br> SuS-X: Training-Free Name-Only Transfer of Vision-Language Models <br>arXiv2022-11-28Github-
Star <br> PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning <br>CVPR2022-11-21Github-
Star <br> Visual Programming: Compositional visual reasoning without training <br>CVPR2022-11-18GithubLocal Demo
Star <br> Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language <br>arXiv2022-04-01Github-

Foundation Models

TitleVenueDateCodeDemo
The Llama 3 Herd of ModelsarXiv2024-07-31--
Chameleon: Mixed-Modal Early-Fusion Foundation ModelsarXiv2024-05-16--
Hello GPT-4oOpenAI2024-05-13--
The Claude 3 Model Family: Opus, Sonnet, HaikuAnthropic2024-03-04--
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of contextGoogle2024-02-15--
Gemini: A Family of Highly Capable Multimodal ModelsGoogle2023-12-06--
Fuyu-8B: A Multimodal Architecture for AI Agentsblog2023-10-17HuggingfaceDemo
Star <br> Unified Model for Image, Video, Audio and Language Tasks <br>arXiv2023-07-30GithubDemo
PaLI-3 Vision Language Models: Smaller, Faster, StrongerarXiv2023-10-13--
GPT-4V(ision) System CardOpenAI2023-09-25--
Star <br> Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization <br>arXiv2023-09-09Github-
Multimodal Foundation Models: From Specialists to General-Purpose AssistantsarXiv2023-09-18--
Star <br> Bootstrapping Vision-Language Learning with Decoupled Language Pre-training <br>NeurIPS2023-07-13Github-
Star <br> Generative Pretraining in Multimodality <br>arXiv2023-07-11GithubDemo
Star <br> Kosmos-2: Grounding Multimodal Large Language Models to the World <br>arXiv2023-06-26GithubDemo
Star <br> Transfer Visual Prompt Generator across LLMs <br>arXiv2023-05-02GithubDemo
GPT-4 Technical ReportarXiv2023-03-15--
PaLM-E: An Embodied Multimodal Language ModelarXiv2023-03-06-Demo
Star <br> Prismer: A Vision-Language Model with An Ensemble of Experts <br>arXiv2023-03-04GithubDemo
Star <br> Language Is Not All You Need: Aligning Perception with Language Models <br>arXiv2023-02-27Github-
Star <br> BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models <br>arXiv2023-01-30GithubDemo
Star <br> VIMA: General Robot Manipulation with Multimodal Prompts <br>ICML2022-10-06GithubLocal Demo
Star <br> MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge <br>NeurIPS2022-06-17Github-
Star <br> Write and Paint: Generative Vision-Language Models are Unified Modal Learners <br>ICLR2022-06-15Github-
Star <br> Language Models are General-Purpose Interfaces <br>arXiv2022-06-13Github-

Evaluation

TitleVenueDatePage
Stars <br> MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation <br>arXiv2024-06-29Github
Stars <br> Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs <br>arXiv2024-06-28Github
Stars <br> CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs <br>arXiv2024-06-26Github
Stars <br> ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation <br>arXiv2024-04-15Github
Stars <br> Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis <br>arXiv2024-05-31Github
Stars <br> Benchmarking Large Multimodal Models against Common Corruptions <br>NAACL2024-01-22Github
Stars <br> Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs <br>arXiv2024-01-11Github
Stars <br> A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise <br>arXiv2023-12-19Github
Stars <br> BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models <br>arXiv2023-12-05Github
Star <br> How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs <br>arXiv2023-11-27Github
Star <br> Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs <br>arXiv2023-11-24Github
Star <br> MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V <br>arXiv2023-11-23Github
VLM-Eval: A General Evaluation on Video Large Language ModelsarXiv2023-11-20Coming soon
Star <br> Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges <br>arXiv2023-11-06Github
Star <br> On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving <br>arXiv2023-11-09Github
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the LeadarXiv2023-11-05-
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical ImagingarXiv2023-10-31-
Star <br> An Early Evaluation of GPT-4V(ision) <br>arXiv2023-10-25Github
Star <br> Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation <br>arXiv2023-10-25Github
Star <br> HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models <br>CVPR2023-10-23Github
Star <br> MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models <br>ICLR2023-10-03Github
Star <br> Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations <br>arXiv2023-10-02Github
Star <br> Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning <br>arXiv2023-10-01Github
Star <br> Can We Edit Multimodal Large Language Models? <br>arXiv2023-10-12Github
Star <br> REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets <br>arXiv2023-10-10Github
The Dawn of LMMs: Preliminary Explorations with GPT-4V(vision)arXiv2023-09-29-
Star <br> TouchStone: Evaluating Vision-Language Models by Language Models <br>arXiv2023-08-31Github
Star <br> ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models <br>arXiv2023-08-31Github
Star <br> SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs <br>arXiv2023-08-07Github
Star <br> Tiny LVLM-eHub: Early Multimodal Experiments with Bard <br>arXiv2023-08-07Github
Star <br> MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities <br>arXiv2023-08-04Github
Star <br> SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension <br>CVPR2023-07-30Github
Star <br> MMBench: Is Your Multi-modal Model an All-around Player? <br>arXiv2023-07-12Github
Star <br> MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models <br>arXiv2023-06-23Github
Star <br> LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models <br>arXiv2023-06-15Github
Star <br> LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark <br>arXiv2023-06-11Github
Star <br> M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models <br>arXiv2023-06-08Github
Star <br> On The Hidden Mystery of OCR in Large Multimodal Models <br>arXiv2023-05-13Github

Multimodal RLHF

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Star <br> Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12-17 | Github | - |
| Star <br> RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| Star <br> Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |

Others

| Title | Venue | Date | Code | Demo |
|:------|:-----:|:----:|:----:|:----:|
| Star <br> Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | arXiv | 2024-02-03 | Github | - |
| Star <br> VCoder: Versatile Vision Encoders for Multimodal Large Language Models | arXiv | 2023-12-21 | Github | Local Demo |
| Star <br> Prompt Highlighter: Interactive Control for Multi-Modal LLMs | arXiv | 2023-12-07 | Github | - |
| Star <br> Planting a SEED of Vision in Large Language Model | arXiv | 2023-07-16 | Github | - |
| Star <br> Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | Github | - |
| Star <br> Contextual Object Detection with Multimodal Large Language Models | arXiv | 2023-05-29 | Github | Demo |
| Star <br> Generating Images with Multimodal Language Models | arXiv | 2023-05-26 | Github | - |
| Star <br> On Evaluating Adversarial Robustness of Large Vision-Language Models | arXiv | 2023-05-26 | Github | - |
| Star <br> Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2023-01-31 | Github | Demo |

Awesome Datasets

Datasets of Pre-Training for Alignment

NamePaperTypeModalities
ShareGPT4VideoShareGPT4Video: Improving Video Understanding and Generation with Better CaptionsCaptionVideo-Text
COYO-700MCOYO-700M: Image-Text Pair DatasetCaptionImage-Text
ShareGPT4VShareGPT4V: Improving Large Multi-Modal Models with Better CaptionsCaptionImage-Text
AS-1BThe All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open WorldHybridImage-Text
InternVidInternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and GenerationCaptionVideo-Text
MS-COCOMicrosoft COCO: Common Objects in ContextCaptionImage-Text
SBU CaptionsIm2Text: Describing Images Using 1 Million Captioned PhotographsCaptionImage-Text
Conceptual CaptionsConceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image CaptioningCaptionImage-Text
LAION-400MLAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text PairsCaptionImage-Text
VG CaptionsVisual Genome: Connecting Language and Vision Using Crowdsourced Dense Image AnnotationsCaptionImage-Text
Flickr30kFlickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence ModelsCaptionImage-Text
AI-CapsAI Challenger : A Large-scale Dataset for Going Deeper in Image UnderstandingCaptionImage-Text
Wukong CaptionsWukong: A 100 Million Large-scale Chinese Cross-modal Pre-training BenchmarkCaptionImage-Text
GRITKosmos-2: Grounding Multimodal Large Language Models to the WorldCaptionImage-Text-Bounding-Box
Youku-mPLUGYouku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and BenchmarksCaptionVideo-Text
MSR-VTTMSR-VTT: A Large Video Description Dataset for Bridging Video and LanguageCaptionVideo-Text
Webvid10MFrozen in Time: A Joint Video and Image Encoder for End-to-End RetrievalCaptionVideo-Text
WavCapsWavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal ResearchCaptionAudio-Text
AISHELL-1AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baselineASRAudio-Text
AISHELL-2AISHELL-2: Transforming Mandarin ASR Research Into Industrial ScaleASRAudio-Text
VSDial-CNX-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign LanguagesASRImage-Audio-Text

Datasets of Multimodal Instruction Tuning

NamePaperLinkNotes
VEGAVEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large ModelsLinkA dataset for enhancing model capabilities in comprehension of interleaved information
ALLaVA-4VALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language ModelLinkVision and language caption and instruction dataset generated by GPT4V
IDKVisually Dehallucinative Instruction Generation: Know What You Don't KnowLinkDehallucinative visual instruction for "I Know" hallucination
CAP2QAVisually Dehallucinative Instruction GenerationLinkImage-aligned visual instruction dataset
M3DBenchM3DBench: Let's Instruct Large Models with Multi-modal 3D PromptsLinkA large-scale 3D instruction tuning dataset
ViP-LLaVA-InstructMaking Large Multimodal Models Understand Arbitrary Visual PromptsLinkA mixture of LLaVA-1.5 instruction data and the region-level visual prompting data
LVIS-Instruct4VTo See is to Believe: Prompting GPT-4V for Better Visual Instruction TuningLinkA visual instruction dataset via self-instruction from GPT-4V
ComVintWhat Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction TuningLinkA synthetic instruction dataset for complex visual reasoning
SparklesDialogue✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following ModelsLinkA machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns.
StableLLaVAStableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue DataLinkA cheap and effective approach to collect visual instruction tuning data
M-HalDetectDetecting and Preventing Hallucinations in Large Vision Language ModelsComing soonA dataset used to train and benchmark models for hallucination detection and prevention
MGVLIDChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning-A high-quality instruction-tuning dataset including image-text and region-text pairs
BuboGPTBuboGPT: Enabling Visual Grounding in Multi-Modal LLMsLinkA high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data
SVITSVIT: Scaling up Visual Instruction TuningLinkA large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs
mPLUG-DocOwlmPLUG-DocOwl: Modularized Multimodal Large Language Model for Document UnderstandingLinkAn instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding
PF-1MVisual Instruction Tuning with Polite FlamingoLinkA collection of 37 vision-language datasets with responses rewritten by Polite Flamingo.
ChartLlamaChartLlama: A Multimodal LLM for Chart Understanding and GenerationLinkA multi-modal instruction-tuning dataset for chart understanding and generation
LLaVARLLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image UnderstandingLinkA visual instruction-tuning dataset for Text-rich Image Understanding
MotionGPTMotionGPT: Human Motion as a Foreign LanguageLinkAn instruction-tuning dataset including multiple human motion-related tasks
LRV-InstructionMitigating Hallucination in Large Multi-Modal Models via Robust Instruction TuningLinkVisual instruction tuning dataset for addressing hallucination issue
Macaw-LLMMacaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text IntegrationLinkA large-scale multi-modal instruction dataset in terms of multi-turn dialogue
LAMM-DatasetLAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and BenchmarkLinkA comprehensive multi-modal instruction tuning dataset
Video-ChatGPTVideo-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language ModelsLink100K high-quality video instruction dataset
MIMIC-ITMIMIC-IT: Multi-Modal In-Context Instruction TuningLinkMultimodal in-context instruction tuning
M<sup>3</sup>ITM<sup>3</sup>IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction TuningLinkLarge-scale, broad-coverage multimodal instruction tuning dataset
LLaVA-MedLLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One DayComing soonA large-scale, broad-coverage biomedical instruction-following dataset
GPT4ToolsGPT4Tools: Teaching Large Language Model to Use Tools via Self-instructionLinkTool-related instruction datasets
MULTISChatBridge: Bridging Modalities with Large Language Model as a Language CatalystComing soonMultimodal instruction tuning dataset covering 16 multimodal tasks
DetGPTDetGPT: Detect What You Need via ReasoningLinkInstruction-tuning dataset with 5000 images and around 30000 query-answer pairs
PMC-VQAPMC-VQA: Visual Instruction Tuning for Medical Visual Question AnsweringComing soonLarge-scale medical visual question-answering dataset
VideoChatVideoChat: Chat-Centric Video UnderstandingLinkVideo-centric multimodal instruction dataset
X-LLMX-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign LanguagesLinkChinese multimodal instruction dataset
LMEyeLMEye: An Interactive Perception Network for Large Language ModelsLinkA multi-modal instruction-tuning dataset
cc-sbu-alignMiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language ModelsLinkMultimodal aligned dataset for improving model's usability and generation's fluency
LLaVA-Instruct-150KVisual Instruction TuningLinkMultimodal instruction-following data generated by GPT
MultiInstructMultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction TuningLinkThe first multimodal instruction tuning benchmark dataset

Datasets of In-Context Learning

| Name | Paper | Link | Notes |
|:-----|:------|:----:|:------|
| MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction dataset |

Datasets of Multimodal Chain-of-Thought

| Name | Paper | Link | Notes |
|:-----|:------|:----:|:------|
| EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for the explainable emotion reasoning task |
| EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
| VIP | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |

Datasets of Multimodal RLHF

| Name | Paper | Link | Notes |
|:-----|:------|:----:|:------|
| VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |

Benchmarks for Evaluation

NamePaperLinkNotes
CharXivCharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMsLinkChart understanding benchmark curated by human experts
Video-MMEVideo-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video AnalysisLinkA comprehensive evaluation benchmark of Multi-modal LLMs in video analysis
VL-ICL BenchVL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context LearningLinkA benchmark for M-ICL evaluation, covering a wide spectrum of tasks
TempCompassTempCompass: Do Video LLMs Really Understand Videos?LinkA benchmark to evaluate the temporal perception ability of Video LLMs
CoBSATCan MLLMs Perform Text-to-Image In-Context Learning?LinkA benchmark for text-to-image ICL
VQAv2-IDKVisually Dehallucinative Instruction Generation: Know What You Don't KnowLinkA benchmark for assessing "I Know" visual hallucination
Math-VisionMeasuring Multimodal Mathematical Reasoning with MATH-Vision DatasetLinkA diverse mathematical reasoning benchmark
CMMMUCMMMU: A Chinese Massive Multi-discipline Multimodal Understanding BenchmarkLinkA Chinese benchmark involving reasoning and knowledge across multiple disciplines
MMCBenchBenchmarking Large Multimodal Models against Common CorruptionsLinkA benchmark for examining self-consistency under common corruptions
MMVPEyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMsLinkA benchmark for assessing visual capabilities
TimeITTimeChat: A Time-sensitive Multimodal Large Language Model for Long Video UnderstandingLinkA video instruction-tuning dataset with timestamp annotations, covering diverse time-sensitive video-understanding tasks.
ViP-BenchMaking Large Multimodal Models Understand Arbitrary Visual PromptsLinkA benchmark for visual prompts
M3DBenchM3DBench: Let's Instruct Large Models with Multi-modal 3D PromptsLinkA 3D-centric benchmark
Video-BenchVideo-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language ModelsLinkA benchmark for video-MLLM evaluation
Charting-New-TerritoriesCharting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMsLinkA benchmark for evaluating geographic and geospatial capabilities
MLLM-BenchMLLM-Bench, Evaluating Multi-modal LLMs using GPT-4VLinkGPT-4V evaluation with per-sample criteria
BenchLMMBenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal ModelsLinkA benchmark for assessment of the robustness against different image styles
MMC-BenchmarkMMC: Advancing Multimodal Chart Understanding with Large-scale Instruction TuningLinkA comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts
MVBenchMVBench: A Comprehensive Multi-modal Video Understanding BenchmarkLinkA comprehensive multimodal benchmark for video understanding
BingoHolistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference ChallengesLinkA benchmark for hallucination evaluation that focuses on two common types
MagnifierBenchOtterHD: A High-Resolution Multi-modality ModelLinkA benchmark designed to probe models' ability of fine-grained perception
HallusionBenchHallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality ModelsLinkAn image-context reasoning benchmark for evaluation of hallucination
PCA-EVALTowards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and BeyondLinkA benchmark for evaluating multi-domain embodied decision-making.
MMHal-BenchAligning Large Multimodal Models with Factually Augmented RLHFLinkA benchmark for hallucination evaluation
MathVistaMathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal ModelsLinkA benchmark that challenges both visual and math reasoning capabilities
SparklesEval✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following ModelsLinkA GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns based on three distinct criteria.
ISEKAILink-Context Learning for Multimodal LLMsLinkA benchmark comprising exclusively of unseen generated image-label pairs designed for link-context learning
M-HalDetectDetecting and Preventing Hallucinations in Large Vision Language ModelsComing soonA dataset used to train and benchmark models for hallucination detection and prevention
I4Empowering Vision-Language Models to Follow Interleaved Vision-Language InstructionsLinkA benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions
SciGraphQASciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific GraphsLinkA large-scale chart-visual question-answering dataset
MM-VetMM-Vet: Evaluating Large Multimodal Models for Integrated CapabilitiesLinkAn evaluation benchmark that examines large multimodal models on complicated multimodal tasks
SEED-BenchSEED-Bench: Benchmarking Multimodal LLMs with Generative ComprehensionLinkA benchmark for evaluation of generative comprehension in MLLMs
MMBenchMMBench: Is Your Multi-modal Model an All-around Player?LinkA systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models
LynxWhat Matters in Training a GPT4-Style Language Model with Multimodal Inputs?LinkA comprehensive evaluation benchmark including both image and video tasks
GAVIEMitigating Hallucination in Large Multi-Modal Models via Robust Instruction TuningLinkA benchmark to evaluate the hallucination and instruction following ability
MMEMME: A Comprehensive Evaluation Benchmark for Multimodal Large Language ModelsLinkA comprehensive MLLM Evaluation benchmark
LVLM-eHubLVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language ModelsLinkAn evaluation platform for MLLMs
LAMM-BenchmarkLAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and BenchmarkLinkA benchmark for evaluating the quantitative performance of MLLMs on various 2D/3D vision tasks
M3ExamM3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language ModelsLinkA multilingual, multimodal, multilevel benchmark for evaluating MLLM
OwlEvalmPLUG-Owl: Modularization Empowers Large Language Models with MultimodalityLinkDataset for evaluation on multiple capabilities

Others

| Name | Paper | Link | Notes |
|:-----|:------|:----:|:------|
| IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | Multimodal dialogue dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
| CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
| Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually pictured multimodal fine-tuning dataset for learning to reject instructions |
| InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions |
| OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset that focuses on recognizing visual entities on Wikipedia from images in the wild |