AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

Xiyang Wu*, Tianrui Guan*, Dianqi Li, Shuaiyi Huang, Xiaoyu Liu, Xijun Wang, Ruiqi Xian, Abhinav Shrivastava, Furong Huang, Jordan Lee Boyd-Graber, Tianyi Zhou, Dinesh Manocha

Paper | Twitter | Website | Code | Dataset

<p align="center" > <img src="./imgs/teaser.png" alt="Teaser" class="center" width="800"/> </p>

Updates

If you find our work useful, please cite our papers:

@misc{wu2024autohallusionautomaticgenerationhallucination,
      title={AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models}, 
      author={Xiyang Wu and Tianrui Guan and Dianqi Li and Shuaiyi Huang and Xiaoyu Liu and Xijun Wang and Ruiqi Xian and Abhinav Shrivastava and Furong Huang and Jordan Lee Boyd-Graber and Tianyi Zhou and Dinesh Manocha},
      year={2024},
      eprint={2406.10900},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.10900}, 
}

@misc{guan2023hallusionbench,
      title={HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models}, 
      author={Tianrui Guan and Fuxiao Liu and Xiyang Wu and Ruiqi Xian and Zongxia Li and Xiaoyu Liu and Xijun Wang and Lichang Chen and Furong Huang and Yaser Yacoob and Dinesh Manocha and Tianyi Zhou},
      year={2023},
      eprint={2310.14566},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Dependency

Install the dependencies with pip install -r requirements.txt

OR

  1. Install PyTorch (we use 2.2.2).
  2. Install the object detection model OWL-ViT: pip install transformers and follow the instructions provided in the link; a quick sanity check is sketched right after this list.
  3. Install the LVLMs to be evaluated: pip install openai for the GPT models, or the corresponding packages for other LVLMs.
  4. Other dependencies: pip install opencv-python numpy tqdm pillow rembg
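
As a quick sanity check of the object-detection dependency, the sketch below loads OWL-ViT through the Hugging Face transformers API and runs it on the teaser image. The checkpoint name (google/owlvit-base-patch32) and the text queries are illustrative assumptions, not values fixed by AutoHallusion.

```python
# Minimal sanity check that OWL-ViT is usable via transformers.
# The checkpoint and text queries below are illustrative assumptions.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("imgs/teaser.png").convert("RGB")
texts = [["a person", "a dog"]]  # one list of text queries per image

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into thresholded detections in image coordinates.
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(texts[0][label], round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```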

Benchmark

We provide a benchmark of hallucination cases created with three strategies, abnormal object insertion, paired object insertion, and correlated object removal, applied to both synthetic and real-world images.

To keep evaluation simple, all questions are provided as yes/no questions.

| Updated on | Questions and Annotations | Figures | Dataset Size |
| :--- | :--- | :--- | ---: |
| Oct 3, 2024 | autohallusion_data.json | image.zip | 3129 |
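
A minimal sketch of how the benchmark files might be consumed is shown below; it assumes each record in autohallusion_data.json references an image extracted from image.zip and carries a yes/no question with a ground-truth answer. The field names used (image, question, gt_answer) are assumptions and should be checked against the released file.

```python
# Hedged sketch for iterating over the benchmark; the field names below
# (image, question, gt_answer) are assumptions, not the released schema.
import json
from PIL import Image

with open("autohallusion_data.json") as f:
    records = json.load(f)

print(f"Loaded {len(records)} question-image pairs")

for rec in records[:5]:
    image = Image.open(rec["image"])      # assumed: path into the unzipped image folder
    question = rec["question"]            # assumed: yes/no question text
    gt_answer = rec["gt_answer"]          # assumed: ground-truth yes/no answer
    # answer = my_lvlm(image, question)   # plug in the LVLM under evaluation here
    print(question, "->", gt_answer)
```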

Demo

We provide a few lightweight demos to help users quickly understand how to use the different strategies in the AutoHallusion pipeline and craft hallucination cases.

Jupyter Notebook

We provide three Jupyter notebooks: Abnormal Object Insertion, Paired Object Insertion, and Correlated Object Removal. Each code block is accompanied by illustrative comments and visualizations of intermediate results, so users can follow the purpose of each step throughout the hallucination-crafting procedure.

Usage

Hallucination Case Crafting

We provide a codebase to automatically scale up the benchmark using AutoHallusion. {strategy name} below can be replaced with any of Abnormal Object Insertion, Paired Object Insertion, or Correlated Object Removal.

- run_{strategy name}.py: Hyper-parameters and experiment flags for hallucination case crafting.
- main_{strategy name}.py: Main function for hallucination case crafting, including scene image generation, image manipulation, question construction and hallucination detection. The specific strategy is determined by the chosen hyper-parameters and experiment flags.

Utilities

- utils_merge.py: A general entry point that selects the LVLM-related functions for scene image generation, object prompting, VQA, etc. The specific LVLM is decided by the object thinking (for scene and object prompting) and image caption (for VQA tasks) hyper-parameters; a hedged sketch of this dispatch pattern follows this list.
- utils_{model name}_clean.py: All LVLM-related functions for scene image generation, object prompting, VQA, etc., for the LVLM specified by {model name}.
- utils_eval.py: All evaluation functions for hallucination detection, supported by GPT-4V-Turbo.
- utils.py: All other non-LVLM-related functions, including object detection, image editing, background removal, ground-truth generation, etc.
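
The description of utils_merge.py above suggests a thin dispatch layer that picks the model-specific utility module from the chosen hyper-parameters. Below is a hedged sketch of that pattern, not the repository's actual code: the function name and the example module substitutions are hypothetical.

```python
# Hypothetical sketch of the dispatch pattern described above: pick the
# utils_{model name}_clean module based on the hyper-parameters, then forward calls.
import importlib

def get_lvlm_utils(object_thinking: str, image_caption: str):
    """Return (prompting_utils, vqa_utils) modules for the chosen LVLMs."""
    prompting_utils = importlib.import_module(f"utils_{object_thinking}_clean")
    vqa_utils = importlib.import_module(f"utils_{image_caption}_clean")
    return prompting_utils, vqa_utils

# Example (hypothetical model and function names):
# prompting_utils, vqa_utils = get_lvlm_utils("gpt4v", "gemini")
# scene_prompt = prompting_utils.generate_scene_prompt(...)
# answer = vqa_utils.run_vqa(image, question)
```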

Leaderboard

Evaluation

We evaluate each model's performance on the benchmark created by AutoHallusion. The evaluation procedure used to produce the results on the leaderboard is:

Step 1: Download the Questions and Annotations (autohallusion_data.json) and the Figures (image.zip). Set up the model to be evaluated.

Step 2: Run the VQA task for each model over every question-image pair in the benchmark to obtain its answers, using the inference code we provide for GPT-4V-Turbo, Gemini Pro Vision and Claude 3. Results are stored in autohallusion_data_{model name}_res.json.

Step 3: Run the evaluation code, which uses GPT-4V-Turbo to determine whether the answer produced by each model conveys the same meaning as the ground truth. Results are stored in autohallusion_data_{model name}_res_evaluated.json and are reported as accuracy breakdowns over examples from different categories, as shown in the leaderboard.
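
For illustration, here is a minimal sketch of the kind of GPT-based equivalence check Step 3 describes, written with the OpenAI Python client. The prompt wording, the model name, and the result field names (answer, gt_answer) are assumptions; utils_eval.py remains the authoritative implementation.

```python
# Hedged sketch of Step 3: ask a GPT model whether each LVLM answer conveys the
# same meaning as the ground truth, then report accuracy. Prompt, model name,
# and field names are assumptions; see utils_eval.py for the actual logic.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def same_meaning(answer: str, ground_truth: str) -> bool:
    prompt = (
        "Do the following two answers convey the same meaning? Reply Yes or No.\n"
        f"Answer 1: {answer}\nAnswer 2: {ground_truth}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

with open("autohallusion_data_mymodel_res.json") as f:  # hypothetical {model name} = mymodel
    results = json.load(f)

correct = sum(same_meaning(r["answer"], r["gt_answer"]) for r in results)  # assumed fields
print(f"Overall accuracy: {correct / len(results):.3f}")
```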

Metric

The leaderboard reports the following metrics:

| Model | Overall Acc. | Overall Acc. (Synthetic) | Exi. Acc. (Synthetic) | Sp. Acc. (Synthetic) | Overall Acc. (Real-world) | Exi. Acc. (Real-world) | Sp. Acc. (Real-world) |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| GPT4V-Turbo | 66.0 | 68.5 | 68.3 | 68.8 | 62.9 | 71.5 | 56.3 |
| Gemini Pro Vision | 51.4 | 53.5 | 59.4 | 43.4 | 48.8 | 70.6 | 31.8 |
| Claude 3 | 37.1 | 37.3 | 44.6 | 24.7 | 36.9 | 55.6 | 22.4 |
| LLaVA-1.5 | 44.5 | 46.6 | 54.2 | 33.8 | 41.8 | 60.4 | 27.3 |
| miniGPT4 | 51.0 | 50.2 | 56.4 | 39.7 | 52.1 | 67.7 | 39.9 |

License

This repository is released under the BSD 3-Clause License.