

AMBER: An Automated Multi-dimensional Benchmark for Multi-modal Hallucination Evaluation

<div align="center"> Junyang Wang*<sup>1</sup>, Yuhang Wang*<sup>1</sup>, Guohai Xu<sup>2</sup>, Jing Zhang<sup>1</sup>, Yukai Gu<sup>1</sup>, Haitao Jia<sup>1</sup>, Jiaqi Wang<sup>1</sup> </div> <div align="center"> Haiyang Xu<sup>2</sup>, Ming Yan<sup>2</sup>, Ji Zhang<sup>2</sup>, Jitao Sang<sup>1</sup> </div> <div align="center"> <sup>1</sup>Beijing Jiaotong University <sup>2</sup>Alibaba Group </div> <div align="center"> *Equal Contribution </div> <div align="center"> <a href="https://arxiv.org/abs/2311.07397"><img src="README_File/Paper-Arxiv-orange.svg" ></a> </div>


AMBER is an LLM-free, multi-dimensional benchmark for evaluating hallucination in multimodal large language models (MLLMs). It covers both the generative task and the discriminative task, and the discriminative task is split into existence, attribute, and relation hallucination. AMBER provides fine-grained annotations and an automated evaluation pipeline.

[Figure: data statistics and object distribution.]

[Figure: results of mainstream MLLMs evaluated by AMBER.]


Getting Started


1. spacy is used for near-synonym judgment

pip install -U spacy
python -m spacy download en_core_web_lg

2. nltk is used for object extraction

pip install nltk
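
Before running the evaluation, it can help to confirm that both packages and the spaCy model are visible to the interpreter you will use. The following is a minimal sketch using only the standard library; the function name `missing_dependencies` is hypothetical and not part of this repository.

```python
import importlib.util

# Modules named in the setup steps above. en_core_web_lg is the spaCy model
# package installed by `python -m spacy download en_core_web_lg`.
REQUIRED_MODULES = ("spacy", "nltk", "en_core_web_lg")


def missing_dependencies(modules=REQUIRED_MODULES):
    """Return the names of modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

This only checks that the modules can be located; it does not load the model or verify versions.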

Image Download

Download the images from this LINK.

Responses Generation

| json file | Task or Dimension | Evaluation args |
| --- | --- | --- |
| query_all.json | All the tasks and dimensions | a |
| query_generative.json | Generative task | g |
| query_discriminative.json | Discriminative task | d |
| query_discriminative-existence.json | Existence dimension | de |
| query_discriminative-attribute.json | Attribute dimension | da |
| query_discriminative-relation.json | Relation dimension | dr |
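
If you drive the benchmark from a script, the pairing of evaluation args and query files listed above can be kept in one place. The sketch below is illustrative only: `QUERY_FILES` and `query_file_for` are hypothetical names, and the file names are copied as listed, without any directory prefix.

```python
# Evaluation arg -> query file, copied from the listing above.
QUERY_FILES = {
    "a": "query_all.json",
    "g": "query_generative.json",
    "d": "query_discriminative.json",
    "de": "query_discriminative-existence.json",
    "da": "query_discriminative-attribute.json",
    "dr": "query_discriminative-relation.json",
}


def query_file_for(evaluation_type: str) -> str:
    """Return the query file name for an evaluation arg, or raise ValueError."""
    try:
        return QUERY_FILES[evaluation_type]
    except KeyError:
        valid = ", ".join(sorted(QUERY_FILES))
        raise ValueError(
            f"unknown evaluation type {evaluation_type!r}; expected one of: {valid}"
        ) from None


if __name__ == "__main__":
    print(query_file_for("de"))
```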

For the generative task (1 <= id <= 1004), the response file is a JSON list with one entry per query:

[
	{
		"id": 1,
		"response": "The description of AMBER_1.jpg from MLLM."
	},
	...
	{
		"id": 1004,
		"response": "The description of AMBER_1004.jpg from MLLM."
	}
]

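One way to produce the generative responses is sketched below. It rests on several assumptions: the response file is a JSON list of objects, image names follow the `AMBER_<id>.jpg` pattern suggested by the example, and `describe_image` is a hypothetical stand-in for your own MLLM call. The prompt text from the query file is not used here.

```python
import json
from typing import Callable

GENERATIVE_IDS = range(1, 1005)  # 1 <= id <= 1004, as stated above


def build_generative_responses(describe_image: Callable[[str], str],
                               ids=GENERATIVE_IDS) -> list:
    """Collect one {"id", "response"} record per generative query.

    describe_image is a hypothetical placeholder for the model call: it takes
    an image file name such as "AMBER_1.jpg" and returns a description.
    """
    return [{"id": i, "response": describe_image(f"AMBER_{i}.jpg")} for i in ids]


def save_responses(records: list, path: str) -> None:
    """Write the records as a JSON list (the layout assumed in this sketch)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Dummy model used only to show the shape of the output.
    demo = build_generative_responses(lambda name: f"A description of {name}.",
                                      ids=range(1, 3))
    print(json.dumps(demo, indent=2))
```
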
For the discriminative task (id >= 1005), each response must be either "Yes" or "No":

[
	{
		"id": 1005,
		"response": "Yes" or "No"
	},
	...
	{
		"id": 15220,
		"response": "Yes" or "No"
	}
]

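Models often answer discriminative queries with a full sentence rather than a bare "Yes" or "No". The sketch below shows one possible way to reduce such replies to the expected form. The function `normalize_answer` is hypothetical, and the fallback to "No" when neither word appears is a choice made for this example, not a rule defined by the benchmark.

```python
import re


def normalize_answer(raw: str) -> str:
    """Map a free-form model reply to "Yes" or "No".

    Uses the first standalone word "yes" or "no" (case-insensitive).
    Returns "No" when neither word appears; that fallback is an assumption
    of this sketch.
    """
    match = re.search(r"\b(yes|no)\b", raw, flags=re.IGNORECASE)
    if match is None:
        return "No"
    return "Yes" if match.group(1).lower() == "yes" else "No"


if __name__ == "__main__":
    for reply in ("Yes, there is a dog.", "no.", "I cannot tell."):
        print(repr(reply), "->", normalize_answer(reply))
```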

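To evaluate the responses, run inference.py with the path to your response file and one of the evaluation args listed above:

python inference.py --inference_data path/to/your/inference/file --evaluation_type {Evaluation args}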
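
A quick sanity check on the response file can catch formatting problems before evaluation. The following sketch assumes the file is a JSON list of objects with `id` and `response` keys and applies the id ranges stated above. `check_responses` and `check_file` are hypothetical helpers, not part of inference.py.

```python
import json

GENERATIVE_MAX_ID = 1004      # generative ids are 1..1004
DISCRIMINATIVE_MIN_ID = 1005  # discriminative ids start at 1005


def check_responses(records: list) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen = set()
    for pos, rec in enumerate(records):
        if not isinstance(rec, dict) or "id" not in rec or "response" not in rec:
            problems.append(f"entry {pos}: expected an object with 'id' and 'response'")
            continue
        rid, resp = rec["id"], rec["response"]
        if not isinstance(rid, int) or isinstance(rid, bool) or rid < 1:
            problems.append(f"entry {pos}: id must be a positive integer, got {rid!r}")
            continue
        if rid in seen:
            problems.append(f"id {rid}: duplicate entry")
        seen.add(rid)
        if rid >= DISCRIMINATIVE_MIN_ID and resp not in ("Yes", "No"):
            problems.append(f"id {rid}: response must be 'Yes' or 'No', got {resp!r}")
        elif rid <= GENERATIVE_MAX_ID and not (isinstance(resp, str) and resp.strip()):
            problems.append(f"id {rid}: response must be a non-empty string")
    return problems


def check_file(path: str) -> list:
    """Load a response file and run check_responses on its contents."""
    with open(path, encoding="utf-8") as f:
        return check_responses(json.load(f))
```

For example, `check_file("path/to/your/inference/file")` returns an empty list when every entry passes these checks.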


If you find this work useful, consider giving this repository a star and citing our paper as follows:

@article{wang2023llm,
  title={An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation},
  author={Wang, Junyang and Wang, Yuhang and Xu, Guohai and Zhang, Jing and Gu, Yukai and Jia, Haitao and Yan, Ming and Zhang, Ji and Sang, Jitao},
  journal={arXiv preprint arXiv:2311.07397},
  year={2023}
}