MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

[📖 Paper](https://arxiv.org/abs/2407.01509)

MIA-Bench is a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models’ compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.
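To make the structure of an image-prompt pair concrete, here is a purely illustrative record. The field names and the example instruction are assumptions for the sake of illustration, not the benchmark's actual schema or content.

```python
# Hypothetical sketch of a single MIA-Bench-style record.
# The keys ("image", "instruction") and the layered instruction text below
# are illustrative assumptions, not actual MIA-Bench data.
example_pair = {
    "image": "images/0001.jpg",
    "instruction": (
        "Describe the scene in exactly three sentences, mention the weather, "
        "and end your response with a rhetorical question."
    ),
}
```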

Figure 1: An example from MIA-Bench, featuring an image and a complex instruction to test models’ compliance with layered instructions that are compositional in nature. Responses from GPT-4V and InternVL-v1.5 are evaluated using GPT-4o as the judge.

Evaluate your model on MIA-Bench

Step 1: Obtain the MIA-Bench data, i.e., the 400 images and their paired instructions.

Step 2: Run your MLLM on each image-instruction pair and save its responses.

Step 3: Score the responses using GPT-4o as the judge (a minimal sketch of this workflow is shown below).
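
The exact judging prompt and scoring rubric are defined by MIA-Bench itself; the sketch below only illustrates the overall flow under stated assumptions: the benchmark is stored locally as `mia_bench.jsonl` (a hypothetical filename) whose records carry `image`, `instruction`, and pre-generated `response` fields, and GPT-4o is queried as the judge through the official `openai` Python client. The judge prompt shown is a simplified stand-in, not the official rubric.

```python
"""Minimal sketch of MIA-Bench-style scoring with GPT-4o as the judge.

Assumptions (not taken from the MIA-Bench release):
- `mia_bench.jsonl` holds one JSON record per line with keys
  "image", "instruction", and your model's pre-generated "response";
- the simplified judge prompt below stands in for the official rubric.
"""
import base64
import json

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = (
    "You are given an image, an instruction, and a model response. "
    "Judge how faithfully the response follows every component of the "
    "instruction and return a single score between 0 and 10."
)


def encode_image(path: str) -> str:
    """Base64-encode an image so it can be passed inline to the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def judge(record: dict) -> str:
    """Ask GPT-4o to score one instruction/response pair."""
    image_b64 = encode_image(record["image"])
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": JUDGE_PROMPT},
                    {"type": "text", "text": f"Instruction: {record['instruction']}"},
                    {"type": "text", "text": f"Response: {record['response']}"},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return completion.choices[0].message.content


if __name__ == "__main__":
    with open("mia_bench.jsonl") as f:
        records = [json.loads(line) for line in f]
    scores = [judge(r) for r in records]
    print(scores[:5])
```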

Citation

@misc{qian2024miabenchbetterinstructionfollowing,
      title={MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs}, 
      author={Yusu Qian and Hanrong Ye and Jean-Philippe Fauconnier and Peter Grasch and Yinfei Yang and Zhe Gan},
      year={2024},
      eprint={2407.01509},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.01509}, 
}

Example Responses and Scoring

Figure 2: An example with responses from four MLLMs and their evaluation scores.