# AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
Xiyang Wu*, Tianrui Guan*, Dianqi Li, Shuaiyi Huang, Xiaoyu Liu, Xijun Wang, Ruiqi Xian, Abhinav Shrivastava, Furong Huang, Jordan Lee Boyd-Graber, Tianyi Zhou, Dinesh Manocha
Paper | Twitter | Website | Code | Dataset
<p align="center" > <img src="./imgs/teaser.png" alt="Teaser" class="center" width="800"/> </p>

## Updates
- [10/03/2024] 🔥 We release a benchmark dataset generated by AUTOHALLUSION (which also includes HallusionBench data) and update the leaderboard.
- [09/20/2024] 🔥 Our AUTOHALLUSION is accepted by EMNLP 2024.
- [07/20/2024] 🔥 We launched our website for AUTOHALLUSION.
- [06/15/2024] 🔥 We release an early version of AUTOHALLUSION as an extension of our prior work, HallusionBench.
- [02/26/2024] 🔥 Our HallusionBench is accepted by CVPR 2024.
- [11/28/2023] 🔥 The full paper (HallusionBench) is uploaded and can be accessed here. The dataset is expanded and the leaderboard is updated.
- [10/27/2023] 🔥 The leaderboard and evaluation code for HallusionBench are released! You are welcome to add your model to our leaderboard!
- [10/24/2023] 🔥 The early report for HallusionBench with case analysis and insights is available here.
If you find our work useful, please cite our papers:
```bibtex
@misc{wu2024autohallusionautomaticgenerationhallucination,
      title={AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models},
      author={Xiyang Wu and Tianrui Guan and Dianqi Li and Shuaiyi Huang and Xiaoyu Liu and Xijun Wang and Ruiqi Xian and Abhinav Shrivastava and Furong Huang and Jordan Lee Boyd-Graber and Tianyi Zhou and Dinesh Manocha},
      year={2024},
      eprint={2406.10900},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.10900},
}

@misc{guan2023hallusionbench,
      title={HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models},
      author={Tianrui Guan and Fuxiao Liu and Xiyang Wu and Ruiqi Xian and Zongxia Li and Xiaoyu Liu and Xijun Wang and Lichang Chen and Furong Huang and Yaser Yacoob and Dinesh Manocha and Tianyi Zhou},
      year={2023},
      eprint={2310.14566},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
## Dependency

Install the dependencies with `pip install -r requirements.txt`, or install them manually:

- Install PyTorch (we use 2.2.2).
- Install the object detection model OWL-ViT (a short usage sketch follows this list):

  ```
  pip install transformers
  ```

  and follow the instructions provided in the link.
- Install LVLMs:

  ```
  pip install openai
  ```

  or other LVLMs to be evaluated.
- Other dependencies:

  ```
  pip install opencv-python numpy tqdm pillow rembg
  ```
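To verify the OWL-ViT install, the sketch below runs a minimal open-vocabulary detection pass using the standard Hugging Face checkpoint; the image path and text queries are placeholders, and this is not the repository's own detection code.

```python
# Minimal OWL-ViT sanity check (not the repository's detection code);
# "scene.png" and the text queries are placeholders.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("scene.png").convert("RGB")
queries = [["a photo of a chair", "a photo of a laptop"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes to thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1
)[0]
for box, score, label in zip(detections["boxes"], detections["scores"], detections["labels"]):
    print(queries[0][label], round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```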
## Benchmark

We provide a benchmark of hallucination cases created by the abnormal object insertion, paired object insertion, and correlated object removal strategies, applied to both synthetic and real-world images. To keep evaluation simple, all questions are provided as yes/no questions. A minimal loading sketch follows the table below.
Updated on | Questions and Annotations | Figures | Dataset Size |
---|---|---|---|
Oct 3, 2024 | autohallusion_data.json | image.zip | 3129 |
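The sketch below is a hedged illustration of how the benchmark file could be consumed; the record keys used here (`image`, `question`, `gt_answer`) and the assumption of a top-level list are placeholders for illustration only, so check `autohallusion_data.json` for the actual schema.

```python
# Hedged example of iterating the benchmark; the keys ("image", "question",
# "gt_answer") and the top-level list layout are assumptions, not the real schema.
import json

with open("autohallusion_data.json", "r") as f:
    examples = json.load(f)

print(f"Loaded {len(examples)} examples")
for ex in examples[:3]:
    image_path = ex.get("image")    # hypothetical key: path into the unzipped image.zip
    question = ex.get("question")   # hypothetical key: yes/no question text
    answer = ex.get("gt_answer")    # hypothetical key: ground-truth yes/no answer
    print(image_path, "|", question, "->", answer)
```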
## Demo

We provide a few lightweight demos to help users quickly understand the different strategies provided by our AutoHallusion pipeline and craft hallucination cases.
### Jupyter Notebook

We provide three Jupyter notebooks: Abnormal Object Insertion, Paired Object Insertion, and Correlated Object Removal. Each notebook includes illustrative comments alongside every code block, together with visualizations of the results, to help users understand the purpose of each step throughout the hallucination crafting procedure.
## Usage

### Hallucination Case Crafting

We provide a codebase to automatically scale up the benchmark using AutoHallusion, where `{strategy name}` can be replaced with Abnormal Object Insertion, Paired Object Insertion, or Correlated Object Removal.

- `run_{strategy name}.py`: Hyper-parameters and experiment flags for hallucination case crafting.
- `main_{strategy name}.py`: Main function for hallucination case crafting, including scene image generation, image manipulation, question construction, and hallucination detection. The specific strategy is determined by the chosen hyper-parameters and experiment flags (a hedged sketch of this loop follows the list).
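As a rough, self-contained illustration of the crafting loop described above (not the repository's `main_{strategy name}.py`), the sketch below generates a scene and probes an LVLM with an existence question; the model names (`dall-e-3`, `gpt-4o`), the prompt wording, and the probe logic are illustrative assumptions, and the real pipeline additionally performs image editing and object detection before questioning.

```python
# Hedged sketch of a crafting/probing loop; model names and prompts are
# illustrative choices, not the repository's actual configuration.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def craft_and_probe(scene_prompt: str, probe_object: str) -> str:
    # Scene image generation.
    image_url = client.images.generate(
        model="dall-e-3", prompt=scene_prompt, size="1024x1024", n=1
    ).data[0].url

    # Question construction: an existence question about an object the scene
    # prompt deliberately leaves out (in the spirit of correlated object removal).
    question = f"Is there a {probe_object} in this image? Answer yes or no."

    # VQA with the target LVLM.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: an office scene generated without a computer; a "yes" answer here
# would suggest a prior-driven hallucination.
print(craft_and_probe("an office desk with papers and a lamp, and no computer", "computer"))
```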
### Utilities

- `utils_merge.py`: A general interface that selects the LVLM-related functions for scene image generation, object prompting, VQA, etc. The specific LVLM is decided by the `object thinking` (for scene and object prompting) and `image caption` (for VQA tasks) hyper-parameters (a toy dispatch sketch follows the list).
- `utils_{model name}_clean.py`: All LVLM-related functions for scene image generation, object prompting, VQA, etc., for the LVLM specified by `{model name}`.
- `utils_eval.py`: All evaluation functions for hallucination detection, supported by GPT-4V-Turbo.
- `utils.py`: All other non-LVLM-related functions, including object detection, image editing, background removal, ground-truth generation, etc.
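The sketch below is only a toy illustration of how such a dispatch could look; the registry keys and backend functions are hypothetical placeholders, not the actual API of `utils_merge.py` or the `utils_{model name}_clean.py` modules.

```python
# Toy dispatch sketch; registry keys and backend functions are hypothetical.
def gpt4v_suggest_objects(scene: str) -> list:
    return [f"(GPT-4V) object suggestion for: {scene}"]

def gemini_suggest_objects(scene: str) -> list:
    return [f"(Gemini) object suggestion for: {scene}"]

# "object thinking" names the LVLM used for scene/object prompting; a similar
# registry could select the VQA backend via the "image caption" hyper-parameter.
OBJECT_THINKING_BACKENDS = {
    "gpt4v": gpt4v_suggest_objects,
    "gemini": gemini_suggest_objects,
}

def select_object_thinking_fn(object_thinking: str):
    return OBJECT_THINKING_BACKENDS[object_thinking]

print(select_object_thinking_fn("gpt4v")("a kitchen counter"))
```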
## Leaderboard

### Evaluation

We evaluate each model's performance on the benchmark created by AutoHallusion. The evaluation procedure used to produce the leaderboard results consists of the following steps:

Step 1: Download the Questions and Annotations and the Images, and set up the model to be evaluated.

Step 2: Run the VQA task for the model over every question-image pair in the benchmark to obtain its answers, using the inference code we provide for GPT-4V-Turbo, Gemini Pro Vision, and Claude 3. Results are stored in `autohallusion_data_{model name}_res.json`.

Step 3: Run the evaluation code, which uses GPT-4V-Turbo to determine whether each answer produced by the model conveys the same meaning as the ground truth (a minimal sketch of this check follows). Results are written to `autohallusion_data_{model name}_res_evaluated.json` and are reported as accuracy breakdowns over examples from different categories, as presented in the leaderboard.
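The sketch below is a minimal, hedged version of the Step 3 judgment, not the repository's `utils_eval.py`; the judge model name (`gpt-4o`) and the prompt wording are illustrative assumptions.

```python
# Hedged sketch of a GPT-based semantic-match check; model name and prompt
# wording are illustrative, not the repository's actual evaluation code.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def answers_match(model_answer: str, ground_truth: str) -> bool:
    prompt = (
        "Do the following two answers convey the same meaning? "
        "Reply with exactly 'yes' or 'no'.\n"
        f"Answer A: {model_answer}\n"
        f"Answer B: {ground_truth}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return reply.strip().lower().startswith("yes")

print(answers_match("No, the image does not contain a computer.", "no"))
```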
### Metrics

The metrics we report on the leaderboard are listed below; a hedged sketch of how such breakdowns could be computed follows the list.

- Overall Accuracy: question-answering accuracy over the whole benchmark.
- Breakdown (over the Synthetic/Real-world subsets):
  - Overall Accuracy: question-answering accuracy over examples generated from synthetic/real-world images.
  - Existence Accuracy: question-answering accuracy for existence questions over examples generated from synthetic/real-world images.
  - Spatial Relation Accuracy: question-answering accuracy for spatial relation questions over examples generated from synthetic/real-world images.
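The sketch below shows one way such breakdowns could be computed from per-example evaluation records; the record fields (`source`, `question_type`, `correct`) are hypothetical, not the actual output schema of the evaluation script.

```python
# Hedged breakdown computation; the record fields are hypothetical placeholders.
from collections import defaultdict

def accuracy_breakdown(records: list) -> dict:
    buckets = defaultdict(list)
    for r in records:
        buckets["overall"].append(r["correct"])
        buckets[r["source"]].append(r["correct"])                         # "synthetic" / "real"
        buckets[(r["source"], r["question_type"])].append(r["correct"])   # "existence" / "spatial"
    return {k: 100.0 * sum(v) / len(v) for k, v in buckets.items()}

records = [
    {"source": "synthetic", "question_type": "existence", "correct": True},
    {"source": "real", "question_type": "spatial", "correct": False},
]
print(accuracy_breakdown(records))
```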
Model | Overall Acc. | Overall Acc. (Synthetic) | Exi. Acc. (Synthetic) | Sp. Acc. (Synthetic) | Overall Acc. (Real-world) | Exi. Acc. (Real-world) | Sp. Acc. (Real-world) |
---|---|---|---|---|---|---|---|
GPT4V-Turbo | 66.0 | 68.5 | 68.3 | 68.8 | 62.9 | 71.5 | 56.3 |
Gemini Pro Vision | 51.4 | 53.5 | 59.4 | 43.4 | 48.8 | 70.6 | 31.8 |
Claude 3 | 37.1 | 37.3 | 44.6 | 24.7 | 36.9 | 55.6 | 22.4 |
LLaVA-1.5 | 44.5 | 46.6 | 54.2 | 33.8 | 41.8 | 60.4 | 27.3 |
miniGPT4 | 51.0 | 50.2 | 56.4 | 39.7 | 52.1 | 67.7 | 39.9 |
## License

This repository is released under the BSD 3-Clause License.