Awesome-Multimodal-Large-Language-Models
Our MLLM works
🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper
The first comprehensive survey of Multimodal Large Language Models (MLLMs). :sparkles:
You are welcome to add our WeChat ID (wmd_ustc) to join the MLLM communication group! :star2:
🔥🔥🔥 Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
<p align="center"> <img src="./images/freeze-omni.png" width="80%" height="80%"> </p><font size=7><div align='center' > [๐ Project Page] [๐ arXiv Paper] [๐ GitHub] </div></font>
The VITA team proposes Freeze-Omni, a speech-to-speech dialogue model that achieves both low latency and high intelligence while keeping the LLM frozen throughout training.
Freeze-Omni is smart because it is built upon a frozen text-modality LLM. This preserves the original intelligence of the LLM backbone, avoiding the forgetting problem that fine-tuning for speech-modality integration would otherwise induce. ✨
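As a rough illustration of the frozen-backbone recipe described above, here is a minimal PyTorch sketch (not the actual Freeze-Omni code; the checkpoint name and adapter layout are placeholders): the text LLM receives no gradients, and only the speech-side adapter is trained.

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; Freeze-Omni's actual backbone and adapters differ.
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")
for p in llm.parameters():
    p.requires_grad = False  # freeze the text-modality LLM backbone

hidden = llm.config.hidden_size
speech_adapter = torch.nn.Sequential(   # toy speech-feature projector
    torch.nn.Linear(80, hidden),        # e.g. 80-dim filterbank features
    torch.nn.GELU(),
    torch.nn.Linear(hidden, hidden),
)

# Only the adapter parameters are optimized, so the LLM's knowledge is preserved.
optimizer = torch.optim.AdamW(speech_adapter.parameters(), lr=1e-4)
```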
🔥🔥🔥 VITA: Towards Open-Source Interactive Omni Multimodal LLM
<p align="center"> <img src="./images/vita.png" width="70%" height="70%"> </p><font size=7><div align='center' > [๐ Project Page] [๐ arXiv Paper] [๐ GitHub] [๐ค Hugging Face] [๐ฌ WeChat (ๅพฎไฟก)] </div></font>
🔥🔥🔥 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Project Page | Paper | GitHub | Dataset | Leaderboard
We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark for MLLMs in video analysis!
It includes short (< 2 min), medium (4~15 min), and long (30~60 min) videos, ranging from <b>11 seconds to 1 hour</b>. All data are newly collected and annotated by humans rather than drawn from any existing video dataset. ✨
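For quick reference, the duration buckets quoted above can be expressed as a small helper; this is an illustrative sketch based on this description, not part of the official Video-MME tooling.

```python
def video_mme_bucket(duration_s: float) -> str:
    """Map a video duration in seconds to the short/medium/long split described above.

    The thresholds follow the ranges quoted in this README; this is an
    illustrative helper, not an official Video-MME script.
    """
    if duration_s < 2 * 60:
        return "short"       # under 2 minutes
    elif duration_s <= 15 * 60:
        return "medium"      # roughly 4~15 minutes in the benchmark
    else:
        return "long"        # roughly 30~60 minutes in the benchmark


print(video_mme_bucket(11))    # "short"  (shortest video: 11 seconds)
print(video_mme_bucket(3600))  # "long"   (longest video: 1 hour)
```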
🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Paper | Download | Eval Tool | :black_nib: Citation
A representative evaluation benchmark for MLLMs. :sparkles:
🔥🔥🔥 Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper | GitHub
This is the first work to correct hallucinations in multimodal large language models. :sparkles:
<font size=5><center><b> Table of Contents </b> </center></font>
Awesome Papers
Multimodal Instruction Tuning
Multimodal Hallucination
Multimodal In-Context Learning
Multimodal Chain-of-Thought
LLM-Aided Visual Reasoning
Foundation Models
Evaluation
Multimodal RLHF
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | arXiv | 2024-10-09 | - | - |
Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12-17 | GitHub | - |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | GitHub | Demo |
Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | GitHub | Demo |
Others
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | arXiv | 2024-02-03 | GitHub | - |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models | arXiv | 2023-12-21 | GitHub | Local Demo |
Prompt Highlighter: Interactive Control for Multi-Modal LLMs | arXiv | 2023-12-07 | GitHub | - |
Planting a SEED of Vision in Large Language Model | arXiv | 2023-07-16 | GitHub | - |
Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | GitHub | - |
Contextual Object Detection with Multimodal Large Language Models | arXiv | 2023-05-29 | GitHub | Demo |
Generating Images with Multimodal Language Models | arXiv | 2023-05-26 | GitHub | - |
On Evaluating Adversarial Robustness of Large Vision-Language Models | arXiv | 2023-05-26 | GitHub | - |
Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2023-01-31 | GitHub | Demo |
Awesome Datasets
Datasets of Pre-Training for Alignment
Datasets of Multimodal Instruction Tuning
Name | Paper | Link | Notes |
---|---|---|---|
UNK-VQA | UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | Link | A dataset designed to teach models to refrain from answering unanswerable questions |
VEGA | VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | Link | A dataset for enhancing model capabilities in comprehension of interleaved information |
ALLaVA-4V | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Link | A vision-and-language caption and instruction dataset generated by GPT-4V |
IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | Dehallucinative visual instruction for "I Know" hallucination |
CAP2QA | Visually Dehallucinative Instruction Generation | Link | Image-aligned visual instruction dataset |
M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset |
ViP-LLaVA-Instruct | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data |
LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V |
ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning |
SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions, designed to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns |
StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collect visual instruction tuning data |
M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text caption data and audio-image-text localization data |
SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs |
mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding |
PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo. |
ChartLlama | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Link | A multi-modal instruction-tuning dataset for chart understanding and generation |
LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for Text-rich Image Understanding |
MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset covering multiple human motion-related tasks |
LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | Visual instruction tuning dataset for addressing hallucination issue |
Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-modal instruction dataset in terms of multi-turn dialogue |
LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multi-modal instruction tuning dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | 100K high-quality video instruction dataset |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning |
M<sup>3</sup>IT | M<sup>3</sup>IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | Large-scale, broad-coverage multimodal instruction tuning dataset |
LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | Multimodal instruction tuning dataset covering 16 multimodal tasks |
DetGPT | DetGPT: Detect What You Need via Reasoning | Link | Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | Large-scale medical visual question-answering dataset |
VideoChat | VideoChat: Chat-Centric Video Understanding | Link | Video-centric multimodal instruction dataset |
X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | Chinese multimodal instruction dataset |
LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multi-modal instruction-tuning dataset |
cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | A multimodal aligned dataset for improving the model's usability and generation fluency |
LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |
Datasets of In-Context Learning
Name | Paper | Link | Notes |
---|---|---|---|
MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction dataset |
Datasets of Multimodal Chain-of-Thought
Name | Paper | Link | Notes |
---|---|---|---|
EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for explainable emotion reasoning task |
EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
VIP | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |
Datasets of Multimodal RLHF
Name | Paper | Link | Notes |
---|---|---|---|
VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |
Benchmarks for Evaluation
Others
Name | Paper | Link | Notes |
---|---|---|---|
IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | Multimodal dialogue dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually pictured multimodal fine-tuning dataset for learning to reject instructions |
InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions |
OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset that focuses on recognizing visual entities from Wikipedia, given images in the wild |