HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models [CVPR 2024]

You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

Tianrui Guan*, Fuxiao Liu*, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou

Updates

🔥🔥🔥 We welcome everyone to contribute failure cases of large multimodal models (e.g., GPT-4V) to our community! 🔥🔥🔥

Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements to image reasoning tasks, as shown by the recently released GPT-4V(ision), LLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: the models may ignore the image context and rely solely on a (possibly contradictory) language prior for reasoning. Meanwhile, the vision modules in VLMs are weaker than the LLMs and can produce misleading visual representations, which the LLMs then translate into confident mistakes. To study these two types of VLM mistakes, i.e., language hallucination and visual illusion, we curated HallusionBench, an image-context reasoning benchmark that remains challenging even for GPT-4V and LLaVA-1.5. We provide a detailed analysis of examples in HallusionBench, which sheds new light on the illusions and hallucinations of VLMs and how to improve them in the future.

If you find our work useful, please cite our papers:

@misc{wu2024autohallusion,
      title={AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models}, 
      author={Xiyang Wu and Tianrui Guan and Dianqi Li and Shuaiyi Huang and Xiaoyu Liu and Xijun Wang and Ruiqi Xian and Abhinav Shrivastava and Furong Huang and Jordan Lee Boyd-Graber and Tianyi Zhou and Dinesh Manocha},
      year={2024},
      eprint={2406.10900},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.10900}, 
}
@InProceedings{Guan_2024_CVPR,
    author    = {Guan, Tianrui and Liu, Fuxiao and Wu, Xiyang and Xian, Ruiqi and Li, Zongxia and Liu, Xiaoyu and Wang, Xijun and Chen, Lichang and Huang, Furong and Yacoob, Yaser and Manocha, Dinesh and Zhou, Tianyi},
    title     = {HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14375-14385}
}
@misc{liu2023mitigating,
      title={Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning}, 
      author={Fuxiao Liu and Kevin Lin and Linjie Li and Jianfeng Wang and Yaser Yacoob and Lijuan Wang},
      year={2023},
      eprint={2306.14565},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{liu2023mmc,
      title={MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning}, 
      author={Fuxiao Liu and Xiaoyang Wang and Wenlin Yao and Jianshu Chen and Kaiqiang Song and Sangwoo Cho and Yaser Yacoob and Dong Yu},
      year={2023},
      eprint={2311.10774},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Dataset Download

To keep evaluation simple, we provide all questions in the form of yes/no questions.

| Updated on | Questions and Annotations | Figures | Question Count | Figure Count |
| :----: | :----: | :----: | :----: | :----: |
| Oct 27, 2023 | HallusionBench.json | hallusion_bench.zip | 254 | 69 |

Evaluation

1. Clone the repo:

```shell
git clone https://github.com/tianyi-lab/HallusionBench.git
cd ./HallusionBench
```

2. Download the images hallusion_bench.zip and unzip the folder in the same directory.

3. The questions and image locations are saved in `./HallusionBench.json`. A data sample looks as follows:

```
{'category': 'VD', 'subcategory': 'illusion', 'visual_input': '1', 'set_id': '0', 'figure_id': '0', 'sample_note': 'circle', 'question_id': '0', 'question': 'Is the right orange circle the same size as the left orange circle?', 'gt_answer_details': 'The right orange circle is the same size as the left orange circle.', 'gt_answer': '1', 'filename': './hallusion_bench/VD/illusion/0_0.png'}
```

The key `visual_input` indicates whether the question requires visual input such as an image: `visual_input=1` means the question requires the image, while `visual_input=0` marks a text-only question.

4. Run your model on `./HallusionBench.json` and save the output file as `./HallusionBench_result.json`. Add your model's output under the key `'model_prediction'` for each entry (a minimal sketch of this step follows the list). We provide a sample result here.
5. Finally, run the following code for evaluation:

```shell
python evaluation.py
```
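The inference loop in step 4 depends on your model, but the input/output plumbing is fixed by the benchmark format. Below is a minimal sketch, assuming `HallusionBench.json` is a top-level JSON list of entries shaped like the sample above; `run_model` is a hypothetical stand-in for your VLM's inference call.

```python
import json

def run_model(question, image_path=None):
    # Hypothetical stand-in for your VLM; replace with your model's inference
    # call. It should return the answer to the yes/no question as a string.
    return "Yes."

with open("./HallusionBench.json", "r") as f:
    data = json.load(f)

for sample in data:
    if sample["visual_input"] == "0":
        # Text-only question: no image is passed to the model.
        sample["model_prediction"] = run_model(sample["question"])
    else:
        # Visual question: pass the image referenced by 'filename'.
        sample["model_prediction"] = run_model(sample["question"], sample["filename"])

# evaluation.py reads predictions from the 'model_prediction' key of this file.
with open("./HallusionBench_result.json", "w") as f:
    json.dump(data, f, indent=2)
```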

You can use your own API key for GPT-4 evaluation by editing the code here.
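If you run the GPT-assisted evaluation, the script needs an OpenAI key. One conventional pattern, shown here as an assumption rather than as how evaluation.py is actually written, is to read the key from an environment variable so it never lands in version control:

```python
import os

import openai  # the pre-1.0 openai package interface is shown here

# Assumption: the evaluation code expects a module-level key; exporting
# OPENAI_API_KEY in your shell keeps the secret out of the source tree.
openai.api_key = os.environ["OPENAI_API_KEY"]
```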

Leaderboard

Definition

Metric

| Model | Question Pair Acc | Figure Acc | Easy Question Acc | Hard Question Acc | Question Acc | Json |
| ----- | :----: | :----: | :----: | :----: | :----: | :----: |
| GPT4V <br />Sep 25, 2023 Version <br />(Human Eval) | 31.42 | 44.22 | 79.56 | 38.37 | 67.58 | VD, VS |
| GPT4V <br />Sep 25, 2023 Version <br />(GPT Eval) | 28.79 | 39.88 | 75.60 | 37.67 | 65.28 | VD, VS |
| Claude 3 <br />(GPT Eval) | 21.76 | 28.61 | 55.16 | 41.40 | 56.86 | VD, VS |
| LLaVA-1.5 <br />(Human Eval) | 9.45 | 25.43 | 50.77 | 29.07 | 47.12 | VD, VS |
| LLaVA-1.5 <br />(GPT Eval) | 10.55 | 24.86 | 49.67 | 29.77 | 46.94 | VD, VS |
| Gemini Pro Vision <br />Dec, 2023 Version <br />(GPT Eval) | 7.69 | 8.67 | 35.60 | 30.23 | 36.85 | VD, VS |
| GUA_VL <br />(GPT Eval) | 16.70 | 23.12 | 53.63 | 39.77 | 51.82 | VD, VS |
| BLIP2-T5 <br />(GPT Eval) | 15.16 | 20.52 | 45.49 | 43.49 | 48.09 | VD, VS |
| Qwen-VL <br />(GPT Eval) | 5.93 | 6.65 | 31.43 | 24.88 | 39.15 | VD, VS |
| Open-Flamingo <br />(GPT Eval) | 6.37 | 11.27 | 39.56 | 27.21 | 38.44 | VD, VS |
| MiniGPT5 <br />(GPT Eval) | 10.55 | 9.83 | 36.04 | 28.37 | 40.30 | VD, VS |
| MiniGPT4 <br />(GPT Eval) | 8.79 | 10.12 | 31.87 | 27.67 | 35.78 | VD, VS |
| InstructBLIP <br />(GPT Eval) | 9.45 | 10.11 | 35.60 | 45.12 | 45.26 | VD, VS |
| BLIP2 <br />(GPT Eval) | 5.05 | 12.43 | 33.85 | 40.70 | 40.48 | VD, VS |
| mPLUG_Owl-v2 <br />(GPT Eval) | 13.85 | 19.94 | 44.84 | 39.07 | 47.30 | VD, VS |
| mPLUG_Owl-v1 <br />(GPT Eval) | 9.45 | 10.40 | 39.34 | 29.77 | 43.93 | VD, VS |
| LRV_Instruction <br />(GPT Eval) | 8.79 | 13.01 | 39.78 | 27.44 | 42.78 | VD, VS |
| ViLT <br />(GPT Eval) | 8.3516 | 11.2717 | 37.8022 | 45.3488 | 44.4641 | VD, VS |
| GiT <br />(GPT Eval) | 5.27 | 6.36 | 26.81 | 31.86 | 34.37 | VD, VS |

Reproduce GPT4V Results on the Leaderboard

1. We saved the output of GPT4V with our annotations. Put HallusionBench.tsv in the root directory of this repo, or set input_file_name in gpt4v_benchmark.py to the location of the HallusionBench.tsv file.

2. (Optional) If you don't have access to the GPT API, you don't need to run it, since we have saved the evaluation results. They can be downloaded for Visual Dependent and Visual Supplement. Put the json files in the root directory of this repo, or set save_json_path_vd and save_json_path_vs in gpt4v_benchmark.py to their respective locations.

3. Run `python gpt4v_benchmark.py`.
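Concretely, steps 1 and 2 amount to pointing a few variables in gpt4v_benchmark.py at the downloaded files. The variable names below come from the steps above; the json filenames are hypothetical placeholders, since the actual download names are not listed here:

```python
# In gpt4v_benchmark.py: variable names are taken from the steps above.
input_file_name = "./HallusionBench.tsv"

# Hypothetical filenames for the downloaded evaluation results (step 2).
save_json_path_vd = "./vd_model_eval.json"
save_json_path_vs = "./vs_model_eval.json"
```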

Examples and Analysis

<p align="center" > <img src="./examples/f-01.png" alt="Example 1" class="center" width="800"/> <img src="./examples/f-02.png" alt="Example 2" class="center" width="800"/> <img src="./examples/f-04.png" alt="Example 3" class="center" width="800"/> <img src="./examples/f-05.png" alt="Example 4" class="center" width="800"/> <img src="./examples/f-08.png" alt="Example 5" class="center" width="800"/> <img src="./examples/f-15.png" alt="Example 6" class="center" width="800"/> <img src="./examples/f-10.png" alt="Example 7" class="center" width="800"/> <img src="./examples/f-12.png" alt="Example 8" class="center" width="800"/> <img src="./examples/f-17.png" alt="Example 9" class="center" width="800"/> </p>

License

This repository is released under the BSD 3-Clause License.