MM-SAP

MM-SAP is a benchmark that systematically evaluates MLLMs' self-awareness in perception: their ability to recognize what they can and cannot know from a given image.

<p align="center"> 📝 <a href="https://arxiv.org/abs/2401.07529" target="_blank">Paper</a> 📊 <a href="https://huggingface.co/datasets/Elliotwang/MM-SAP" target="_blank">Data</a>

News 🔥🔥🔥

Overview

<p align="center"> <img src="./imgs/overview.jpeg" alt="Data overview" width="650" height="580">

Leaderboard

<p align="center"> <img src="./imgs/detail_results.jpeg" alt="result" width="685" height="700">
RankModelsBasicVisQAKnowVisQABeyondVisQATotal
1GPT-4V61.60 ± 2.3383.43 ± 1.4081.96 ± 0.7075.13 ± 1.30
2Qwen-VL-Chat-7b70.60 ± 0.6872.11 ± 1.5030.33 ± 0.4157.82 ± 0.63
3InfMLLM-7b70.95 ± 1.0955.54 ± 0.5041.03 ± 1.3056.28 ± 0.75
4ShareGPT-4V-7b69.70 ± 0.9156.69 ± 1.5841.03 ± 1.3656.19 ± 1.08
5CogVLM-17b69.70 ± 0.5866.00 ± 1.2630.49 ± 1.3855.64 ± 0.42
6LLaVA-13b68.25 ± 1.5658.11 ± 1.4334.02 ± 0.9053.81 ± 0.39
7ShareGPT-4V-13b68.00 ± 1.7460.29 ± 1.1130.43 ± 0.5453.22 ± 0.60
8LLaVA-7b62.00 ± 0.8554.23 ± 1.9230.27 ± 0.8549.12 ± 0.60

Data Examples

<p align="center"> <img src="./imgs/BasicVisQA.jpg" alt="BasicVisQA" width="685" height="700"> <p align="center"> <img src="./imgs/KnowVisQA.jpg" alt="KnowVisQA" width="685" height="700"> <p align="center"> <img src="./imgs/BeyongVisQA.jpeg" alt="BeyongVisQA" width="685" height="700">

Citation

@article{wang2024mm,
  title={MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception},
  author={Wang, Yuhao and Liao, Yusheng and Liu, Heyang and Liu, Hongcheng and Wang, Yu and Wang, Yanfeng},
  journal={arXiv preprint arXiv:2401.07529},
  year={2024}
}