
# Efficient Multimodal Large Language Models: A Survey [[arXiv](https://arxiv.org/abs/2405.10739)]

Yizhang Jin<sup>12</sup>, Jian Li<sup>1</sup>, Yexin Liu<sup>3</sup>, Tianjun Gu<sup>4</sup>, Kai Wu<sup>1</sup>, Zhengkai Jiang<sup>1</sup>, Muyang He<sup>3</sup>, Bo Zhao<sup>3</sup>, Xin Tan<sup>4</sup>, Zhenye Gan<sup>1</sup>, Yabiao Wang<sup>1</sup>, Chengjie Wang<sup>1</sup>, Lizhuang Ma<sup>2</sup>

<sup>1</sup>Tencent YouTu Lab, <sup>2</sup>SJTU, <sup>3</sup>BAAI, <sup>4</sup>ECNU

⚡ We will actively maintain this repository and incorporate new research as it emerges. If you have any questions, please contact swordli@tencent.com. We welcome collaboration on academic research and paper writing.

If you find this survey useful, please cite the paper:

```bibtex
@misc{jin2024efficient,
      title={Efficient Multimodal Large Language Models: A Survey},
      author={Yizhang Jin and Jian Li and Yexin Liu and Tianjun Gu and Kai Wu and Zhengkai Jiang and Muyang He and Bo Zhao and Xin Tan and Zhenye Gan and Yabiao Wang and Chengjie Wang and Lizhuang Ma},
      year={2024},
      eprint={2405.10739},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## 📌 What is This Survey About?

<p align="center"> <img src="./imgs/timeline.png" width="100%" height="100%"> </p>

In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding, and reasoning. However, their large model sizes and high training and inference costs have hindered the widespread adoption of MLLMs in academia and industry. Studying efficient and lightweight MLLMs therefore holds enormous potential, especially for edge-computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs: we summarize the timeline of representative efficient MLLMs, the research state of efficient structures and strategies, and their applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions.

<p align="center"> <img src="./imgs/arch.png" width="80%" height="80%"> </p>

## Summary of 22 Mainstream Efficient MLLMs

| Model | Vision Encoder | Resolution | Vision Encoder Params | LLM | LLM Params | Vision-LLM Projector | Timeline |
|---|---|---|---|---|---|---|---|
| MobileVLM | CLIP ViT-L/14 | 336 | 0.3B | MobileLLaMA | 2.7B | LDP | 2023-12 |
| LLaVA-Phi | CLIP ViT-L/14 | 336 | 0.3B | Phi-2 | 2.7B | MLP | 2024-01 |
| Imp-v1 | SigLIP | 384 | 0.4B | Phi-2 | 2.7B | - | 2024-02 |
| TinyLLaVA | SigLIP-SO | 384 | 0.4B | Phi-2 | 2.7B | MLP | 2024-02 |
| Bunny | SigLIP-SO | 384 | 0.4B | Phi-2 | 2.7B | MLP | 2024-02 |
| MobileVLM-v2-3B | CLIP ViT-L/14 | 336 | 0.3B | MobileLLaMA | 2.7B | LDPv2 | 2024-02 |
| MoE-LLaVA-3.6B | CLIP-Large | 384 | - | Phi-2 | 2.7B | MLP | 2024-02 |
| Cobra | DINOv2, SigLIP-SO | 384 | 0.3B+0.4B | Mamba-2.8b-Zephyr | 2.8B | MLP | 2024-03 |
| Mini-Gemini | CLIP-Large | 336 | - | Gemma | 2B | MLP | 2024-03 |
| Vary-toy | CLIP | 224 | - | Qwen | 1.8B | - | 2024-01 |
| TinyGPT-V | EVA | 224/448 | - | Phi-2 | 2.7B | Q-Former | 2024-01 |
| SPHINX-Tiny | DINOv2, CLIP-ConvNeXt | 448 | - | TinyLlama | 1.1B | - | 2024-02 |
| ALLaVA-Longer | CLIP-ViT-L/14 | 336 | 0.3B | Phi-2 | 2.7B | - | 2024-02 |
| MM1-3B-MoE-Chat | CLIP_DFN-ViT-H | 378 | - | - | 3B | C-Abstractor | 2024-03 |
| LLaVA-Gemma | DinoV2 | - | - | Gemma-2b-it | 2B | - | 2024-03 |
| Mipha-3B | SigLIP | 384 | - | Phi-2 | 2.7B | - | 2024-03 |
| VL-Mamba | SigLIP-SO | 384 | - | Mamba-2.8B-Slimpj | 2.8B | VSS-L2 | 2024-03 |
| MiniCPM-V 2.0 | SigLIP | - | 0.4B | MiniCPM | 2.7B | Perceiver Resampler | 2024-03 |
| DeepSeek-VL | SigLIP-L | 384 | 0.4B | DeepSeek-LLM | 1.3B | MLP | 2024-03 |
| KarmaVLM | SigLIP-SO | 384 | 0.4B | Qwen1.5 | 0.5B | - | 2024-02 |
| moondream2 | SigLIP | - | - | Phi-1.5 | 1.3B | - | 2024-03 |
| Bunny-v1.1-4B | SigLIP | 1152 | - | Phi-3-Mini-4K | 3.8B | - | 2024-02 |
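
Architecturally, every model in the table follows the same three-stage pipeline: a vision encoder produces patch tokens, a projector maps them into the LLM embedding space, and a small language model decodes. The PyTorch sketch below illustrates that pipeline only; the module names and the `inputs_embeds` keyword are assumptions, not any specific model's API.

```python
import torch
import torch.nn as nn

class EfficientMLLM(nn.Module):
    """Minimal sketch: lightweight vision encoder -> projector -> small LLM."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a 0.4B SigLIP ViT
        self.projector = projector            # e.g. an MLP (see the sketch below)
        self.llm = llm                        # e.g. a 2-3B decoder-only LM

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        vision_tokens = self.vision_encoder(pixel_values)  # (B, N, D_vision)
        vision_embeds = self.projector(vision_tokens)      # (B, N, D_llm)
        # Prepend the projected vision tokens to the text embeddings, then decode.
        inputs = torch.cat([vision_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```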

## Efficient MLLMs

### Architecture

#### Vision Encoder

- Multiple Vision Encoders
- Lightweight Vision Encoder

#### Vision-Language Projector

- MLP-based (see the sketch after this list)
- Attention-based
- CNN-based
- Mamba-based
- Hybrid Structure
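
For the MLP-based projector, a common design (used in LLaVA-style models) is two linear layers with a GELU in between. Below is a minimal sketch; the default dimensions (1024-d CLIP ViT-L/14 features, 2560-d Phi-2 hidden size) are illustrative assumptions.

```python
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features to the LLM hidden size."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2560):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens):  # (B, N, vision_dim) -> (B, N, llm_dim)
        return self.net(vision_tokens)
```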

#### Small Language Models

#### Vision Token Compression

- Multi-view Input
- Token Processing (see the pooling sketch after this list)
- Multi-Scale Information Fusion
- Vision Expert Agents
- Video-Specific Methods
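
As one concrete example of token processing, the sketch below merges each 2x2 neighborhood of patch tokens by average pooling, shrinking the vision sequence 4x before it reaches the LLM. This is an illustrative compression scheme, not the exact method of any single surveyed model.

```python
import torch
import torch.nn.functional as F

def pool_vision_tokens(tokens: torch.Tensor, grid: int) -> torch.Tensor:
    """tokens: (B, grid*grid, D) patch tokens laid out on a square grid."""
    b, n, d = tokens.shape
    x = tokens.transpose(1, 2).reshape(b, d, grid, grid)  # (B, D, H, W)
    x = F.avg_pool2d(x, kernel_size=2)                    # (B, D, H/2, W/2)
    return x.flatten(2).transpose(1, 2)                   # (B, N/4, D)
```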

#### Efficient Structures

- Mixture of Experts (see the gating sketch after this list)
- Mamba
- Inference Acceleration
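
A minimal sketch of top-1 mixture-of-experts gating is shown below: a router assigns each token to one expert FFN, so only a fraction of the parameters is active per token. This is illustrative only, not the exact MoE-LLaVA design.

```python
import torch
import torch.nn as nn

class TopOneMoE(nn.Module):
    """Top-1 gated mixture of expert FFNs over a flat token batch."""

    def __init__(self, dim: int = 2560, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (tokens, dim)
        logits = self.router(x)                    # (tokens, num_experts)
        weight, idx = logits.softmax(-1).max(-1)   # top-1 gate weight per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                         # run each expert on its tokens only
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out
```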

### Training

#### Pre-Training

- Which part to unfreeze (a freezing sketch follows this list)
- Multi-stage pre-training
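
The sketch below illustrates the unfreezing choice, assuming the hypothetical `EfficientMLLM` module layout from the earlier sketch: a common first-stage recipe trains only the projector while keeping the vision encoder and LLM frozen.

```python
def set_trainable(model, train_projector=True, train_llm=False, train_vision=False):
    """Toggle which parts of a vision-encoder/projector/LLM model get gradients."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = train_vision
    for p in model.llm.parameters():
        p.requires_grad = train_llm
    for p in model.projector.parameters():
        p.requires_grad = train_projector

# Stage 1: align modalities by training the projector alone.
# set_trainable(model)
# Stage 2: unfreeze the LLM as well for instruction tuning.
# set_trainable(model, train_llm=True)
```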

#### Instruction Tuning

- Efficient IT

#### Diverse Training Steps

#### Parameter Efficient Transfer Learning
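
As a representative parameter-efficient technique, the sketch below wraps a frozen linear layer with a trainable low-rank (LoRA) update scaled by `alpha / r`. It is a minimal illustration, not the `peft` library's API.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as an identity update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```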

### Applications

#### Biomedical Analysis

#### Document Understanding

#### Video Comprehension