MLLM-Bench

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

<p align="center">
Python 3.9+ · PyTorch 2.0 · transformers · accelerate
</p>

<p align="center">
📃 <a href="https://arxiv.org/abs/2311.13951" target="_blank">Paper</a> • 🌐 <a href="https://mllm-bench.llmzoo.com/" target="_blank">Website</a> • 🤗 <a href="https://huggingface.co" target="_blank">HuggingFace</a>
</p>

<p align="center">
<img src="./image.png" alt="Data Composition" width="550" height="550">
</p>

🌈 Update

Leaderboard

We present the results of voting with LLaVA-v1.5-13B as the anchor model. Each cell reports the win/tie/lose counts of a benchmarked model against LLaVA-v1.5-13B in the corresponding capability category; a short sketch after the table illustrates how the overall win rate relates to these counts. See our paper for results under other evaluation protocols and anchors. Information about the benchmarked models is available here.

| Rank | Model | Perception | Understanding | Applying | Analyzing | Evaluation | Creation | Win Rate over LLaVA-v1.5-13B |
|------|-------|------------|---------------|----------|-----------|------------|----------|------------------------------|
| 🏅️ | GPT-4o | 64/5/1 | 98/11/1 | 50/8/2 | 86/9/5 | 40/0/0 | 38/1/1 | 0.90 |
| 🥈 | Claude-3 | 56/13/1 | 98/9/3 | 45/11/4 | 83/14/3 | 33/5/2 | 33/6/1 | 0.83 |
| 🥉 | GPT-4V | 56/10/4 | 101/6/3 | 29/12/19 | 73/22/5 | 33/2/5 | 2/0/38 | 0.70 |
| 4 | LLaVA-v1.6-34B | 46/17/7 | 78/22/10 | 36/15/9 | 61/28/11 | 33/3/4 | 24/10/6 | 0.66 |
| 5 | LLaVA-v1.6-Vicuna-13B | 40/21/9 | 65/33/12 | 35/19/6 | 51/26/23 | 33/5/2 | 27/9/4 | 0.60 |
| 6 | LLaVA-v1.6-Vicuna-7B | 31/25/14 | 56/37/17 | 26/23/11 | 40/31/29 | 22/10/8 | 19/10/11 | 0.46 |
| 7 | ALLaVA-3B-Longer | 22/21/27 | 57/30/23 | 23/17/20 | 44/30/26 | 16/10/14 | 17/12/11 | 0.43 |
| 8 | Gemini-1.0-Pro | 45/10/15 | 36/35/39 | 24/19/17 | 33/28/39 | 9/8/23 | 16/8/16 | 0.39 |
| 9 | Qwen-VL-Chat | 34/22/14 | 38/36/36 | 26/18/16 | 35/29/36 | 15/6/19 | 9/12/19 | 0.37 |
| 10 | LVIS | 22/28/20 | 32/39/39 | 11/27/22 | 33/36/31 | 14/9/17 | 9/16/15 | 0.29 |
| 11 | mPLUG-Owl2 | 16/24/30 | 30/34/46 | 17/17/26 | 23/38/39 | 15/8/17 | 11/14/15 | 0.27 |
| 12 | LLaVA-v1.5-7B | 19/22/29 | 27/47/36 | 13/29/18 | 21/43/36 | 9/14/17 | 8/13/19 | 0.23 |
| 13 | MiniGPT-v2 | 12/25/33 | 24/32/54 | 11/25/24 | 17/38/45 | 9/9/22 | 6/6/28 | 0.19 |
| 14 | InstructBLIP | 15/16/39 | 13/36/61 | 6/23/31 | 13/29/58 | 10/7/23 | 4/9/27 | 0.15 |
| 15 | Cheetor | 12/20/38 | 7/27/76 | 10/22/28 | 16/23/61 | 4/4/32 | 3/4/33 | 0.12 |
| 16 | SEED-LLaMA | 16/15/39 | 5/25/80 | 10/21/29 | 7/25/68 | 3/7/30 | 3/3/34 | 0.10 |
| 17 | kosmos2 | 6/22/42 | 6/18/86 | 6/15/39 | 10/20/70 | 1/4/35 | 2/3/35 | 0.07 |
| 18 | Yi-VL-6B | 4/17/49 | 8/22/80 | 5/27/28 | 5/29/66 | 3/9/28 | 3/9/28 | 0.07 |
| 19 | Fuyu-8B | 7/19/44 | 7/27/76 | 6/14/40 | 4/22/74 | 3/7/30 | 0/6/34 | 0.06 |
| 20 | LWM | 2/18/50 | 5/15/90 | 4/21/35 | 2/18/80 | 3/2/35 | 2/6/32 | 0.04 |
| 21 | OpenFlamingo | 8/13/49 | 2/8/100 | 3/14/43 | 2/21/77 | 1/2/37 | 1/5/34 | 0.04 |
| 22 | BLIP2 | 3/13/54 | 2/15/93 | 6/8/46 | 0/22/78 | 0/1/39 | 0/2/38 | 0.03 |
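
The overall win rate appears to be the total wins across the six categories divided by the total number of comparisons (wins + ties + losses); a minimal sketch using the GPT-4o row as the example, not part of the benchmark code:

```python
# Minimal sketch: derive the overall win rate from per-category win/tie/lose counts.
# The counts below are the GPT-4o row of the leaderboard table.
per_category = {
    "Perception":    (64, 5, 1),
    "Understanding": (98, 11, 1),
    "Applying":      (50, 8, 2),
    "Analyzing":     (86, 9, 5),
    "Evaluation":    (40, 0, 0),
    "Creation":      (38, 1, 1),
}

wins = sum(w for w, t, l in per_category.values())
total = sum(w + t + l for w, t, l in per_category.values())
print(f"win rate over the anchor: {wins / total:.2f}")  # ~0.90 for this row
```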

Usage

Environment Setup

<details><summary>Click to expand</summary>

Install required packages:

pip install -r requirements.txt

Update transformers (we used 4.36.0.dev0):

pip install git+https://github.com/huggingface/transformers
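
To confirm that the development build is the one actually being picked up, a quick sanity check (nothing repository-specific, just standard Python):

```python
# Sanity check: print the installed transformers version (we used 4.36.0.dev0).
import transformers

print(transformers.__version__)
```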
</details>

Answer Generation

<details><summary>Click to expand</summary> </details>
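
The collapsed section above holds the repository's own generation instructions. As a rough illustration of what answer generation on the benchmark looks like, here is a minimal sketch that runs a LLaVA-style model from Hugging Face transformers on a single image/question pair; the model ID, prompt template, and file paths are assumptions, not the benchmark's actual script:

```python
# Minimal sketch of answer generation with a LLaVA-style model via transformers.
# NOTE: illustration only, not the benchmark's official generation script;
# the checkpoint name, prompt format, and image path are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-13b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("sample_image.jpg")          # one benchmark image (placeholder path)
question = "What is unusual about this image?"  # one benchmark instruction (placeholder)
prompt = f"USER: <image>\n{question}\nASSISTANT:"  # LLaVA-1.5 chat template

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```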

Self-Evaluation

<details><summary>Click to expand</summary> </details>
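
The collapsed section above holds the repository's own evaluation instructions. As a hypothetical sketch of the per-sample-criterion, pairwise judging the benchmark describes, here is how one comparison between two answers for the same image could be posed to a GPT-4-class judge via the OpenAI API; the judge model name, prompt wording, and helper function are assumptions, not the repository's evaluation script:

```python
# Minimal sketch of pairwise judging with a GPT-4-class judge via the OpenAI API.
# NOTE: illustration only, not the repository's evaluation script; the judge model,
# prompt wording, and function name are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def judge_pair(image_path: str, question: str, criterion: str,
               answer_a: str, answer_b: str) -> str:
    """Ask the judge which answer is better under the per-sample criterion."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        f"Question: {question}\n"
        f"Per-sample evaluation criterion: {criterion}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Judge which answer is better under the criterion. Reply with 'A', 'B', or 'Tie'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```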

Submission for Leaderboard

Refer to the submission instructions <a href="https://mllm-bench.llmzoo.com/static/submit.html" target="_blank">here</a>.

Citation

@misc{ge2024mllmbench,
      title={MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria}, 
      author={Wentao Ge and Shunian Chen and Guiming Hardy Chen and Zhihong Chen and Junying Chen and Shuo Yan and Chenghao Zhu and Ziyue Lin and Wenya Xie and Xinyi Zhang and Yichen Chai and Xiaoyu Liu and Nuo Chen and Dingjie Song and Xidong Wang and Anningzhe Gao and Zhiyi Zhang and Jianquan Li and Xiang Wan and Benyou Wang},
      year={2024},
      eprint={2311.13951},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Star History

<a href="https://star-history.com/#FreedomIntelligence/MLLM-Bench&Date"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=FreedomIntelligence/MLLM-Bench&type=Date&theme=dark" /> <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=FreedomIntelligence/MLLM-Bench&type=Date" /> <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=FreedomIntelligence/MLLM-Bench&type=Date" /> </picture> </a>