# HaloQuest
Welcome to the repository for our ECCV 2024 paper, HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning. It contains a Colab notebook that shows how to load the HaloQuest data and how to use the Auto-Eval system.
Code to reproduce the experiments in the paper is available here.
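If you just want a quick look at the data outside the Colab, here is a minimal loading sketch. It assumes the annotations come as a JSON Lines file with `image_url`, `question`, and `answers` fields; the filename and field names are illustrative assumptions, so check the Colab for the actual schema and download locations.

```python
import json
import urllib.request


def load_haloquest_split(path):
    """Load one HaloQuest split from a local JSON Lines file.

    Assumes one JSON object per line with hypothetical fields
    'image_url', 'question', and 'answers'; verify the real schema
    in the Colab before relying on these names.
    """
    examples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples


def fetch_image_bytes(example):
    """Download the image an example points to (an Open Images or Midjourney URL)."""
    with urllib.request.urlopen(example["image_url"]) as resp:
        return resp.read()


if __name__ == "__main__":
    eval_set = load_haloquest_split("haloquest_eval.jsonl")  # hypothetical filename
    print(f"Loaded {len(eval_set)} evaluation examples")
    print(eval_set[0]["question"])
```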
## Updates
- 2024.07.22: Our paper is on arXiv.
## Dataset Description
### Summary
HaloQuest is a novel visual question answering (VQA) dataset that focuses on multimodal hallucination in vision-language models (VLMs). It contains 7,748 examples with a combination of real and synthetically generated images, annotated with questions and answers designed to trigger and evaluate hallucinations.
### Supported Tasks
HaloQuest supports tasks related to hallucination detection and reduction in VLMs, providing a challenging benchmark for Visual Question Answering. The dataset is useful for both evaluation and fine-tuning purposes, aiming to advance multimodal reasoning.
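For evaluation, the Auto-Eval system scores model responses with an LLM rater; the Colab shows the actual setup. As a rough, unofficial sketch of the LLM-as-judge pattern, the code below abstracts the rater model behind a callable you supply. The prompt wording and the `question` / `answers` field names are assumptions for illustration, not the official Auto-Eval implementation.

```python
from typing import Callable, Sequence

JUDGE_PROMPT = """You are grading a vision-language model's answer to a question about an image.
Question: {question}
Reference answers: {references}
Model answer: {prediction}
Reply with exactly one word: CORRECT or INCORRECT."""


def auto_eval_accuracy(
    examples: Sequence[dict],
    predictions: Sequence[str],
    judge: Callable[[str], str],
) -> float:
    """Fraction of predictions the judge LLM marks CORRECT.

    `judge` wraps whichever LLM you use as the rater (the paper used
    Gemini); it takes a prompt string and returns the raw reply.
    Field names and prompt wording here are illustrative assumptions.
    """
    correct = 0
    for example, prediction in zip(examples, predictions):
        prompt = JUDGE_PROMPT.format(
            question=example["question"],
            references="; ".join(example["answers"]),
            prediction=prediction,
        )
        verdict = judge(prompt).strip().upper()
        if verdict.startswith("CORRECT"):
            correct += 1
    return correct / max(len(predictions), 1)
```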
## Dataset Details
### Data Collection
HaloQuest includes a mix of real images from the Open Images dataset and synthetic images generated using Midjourney. Images were curated based on interest and comprehensibility. Questions and answers were crafted by humans and large language models (LLMs), focusing on false premises, visually challenging questions, and questions with insufficient context.
### Data Splits
The dataset is split into training and evaluation sets. The following table provides detailed statistics for each subset.
| Split | Real Images | Synthetic Images | False Premise Questions | Visually Challenging Questions | Insufficient Context Questions | Total Entries |
|---|---|---|---|---|---|---|
| Training Set | 2985 | 4155 | 2698 | 2973 | 1469 | 7140 |
| Evaluation Set | 217 | 391 | 304 | 183 | 121 | 608 |
| Total | 3202 | 4546 | 3002 | 3156 | 1590 | 7748 |
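The counts above can be recomputed from the raw annotations. A minimal sketch, assuming each example carries hypothetical `image_type` and `question_type` fields (the actual field names may differ; see the Colab):

```python
from collections import Counter


def split_statistics(examples):
    """Tally image sources and question categories for one split.

    Assumes each example dict has hypothetical 'image_type'
    (e.g. 'real' vs. 'generated') and 'question_type' (e.g.
    'false premise', 'visually challenging', 'insufficient context')
    fields; adjust the keys to the actual schema.
    """
    image_counts = Counter(ex["image_type"] for ex in examples)
    question_counts = Counter(ex["question_type"] for ex in examples)
    return image_counts, question_counts


# Example: image_counts, question_counts = split_statistics(eval_set)
```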
## Leaderboard
(Gemini 1.0 Pro was used for Auto-Eval)
<table> <thead> <tr> <th rowspan="2">Model (#Param)</th> <th rowspan="2">Rank</th> <th colspan="2">Overall</th> <th colspan="2">Generated</th> <th colspan="2">Real</th> <th colspan="2">False Premise</th> <th colspan="2">Visually Challenging</th> <th colspan="2">Insufficient Context</th> </tr> <tr> <th>Human Eval</th> <th>Auto-Eval</th> <th>Human Eval</th> <th>Auto-Eval</th> <th>Human Eval</th> <th>Auto-Eval</th> <th>Human Eval</th> <th>Auto-Eval</th> <th>Human Eval</th> <th>Auto-Eval</th> <th>Human Eval</th> <th>Auto-Eval</th> </tr> </thead><tbody><tr><td>Gemini 1.5 Pro (May 2024)</td><td>1</td><td>76.1</td><td>77.9</td><td>74.7</td><td>78.3</td><td>78.7</td><td>77.2</td><td>80.4</td><td>83.7</td><td>57.3</td><td>56.3</td><td>91</td><td>92.5</td></tr><tr><td>GPT-4o (May 2024)</td><td>2</td><td>68.1</td><td>63.2</td><td>68.8</td><td>63.8</td><td>66.9</td><td>62.2</td><td>68.5</td><td>65.2</td><td>58.3</td><td>55.2</td><td>80.6</td><td>68.7</td></tr><tr><td>GPT-4 (May 2024)</td><td>3</td><td>62.9</td><td>61.2</td><td>64.3</td><td>61.1</td><td>60.6</td><td>61.4</td><td>64.7</td><td>63</td><td>46.9</td><td>44.8</td><td>80.6</td><td>79.1</td></tr><tr><td>BEiT-3 (0.7B) (Mar 2024)</td><td>4</td><td>35.9</td><td>40</td><td>41.2</td><td>44.3</td><td>26.3</td><td>32.3</td><td>24.1</td><td>28.4</td><td>36.6</td><td>36.1</td><td>9.1</td><td>10.7</td></tr><tr><td>InstructBLIP (12B) (Mar 2024)</td><td>5</td><td>25.5</td><td>28.5</td><td>28.4</td><td>31.5</td><td>20.3</td><td>23</td><td>28.4</td><td>32</td><td>33.3</td><td>33.9</td><td>6.6</td><td>11.6</td></tr><tr><td>InstructBLIP (8B) (Mar 2024)</td><td>6</td><td>25</td><td>27.3</td><td>28.4</td><td>29.7</td><td>18.9</td><td>23</td><td>28.4</td><td>32</td><td>6.6</td><td>11.6</td><td>33.3</td><td>33.9</td></tr><tr><td>BLIP2 (12B) (Mar 2024)</td><td>7</td><td>21.1</td><td>22.5</td><td>24.8</td><td>26.1</td><td>14.29</td><td>16.1</td><td>16.8</td><td>19.5</td><td>35.5</td><td>32.8</td><td>9.9</td><td>14.9</td></tr><tr><td>MiniGPT4 (13B) (Mar 2024)</td><td>8</td><td>18.7</td><td>25.2</td><td>18.2</td><td>24</td><td>18.9</td><td>27.2</td><td>16.2</td><td>21.5</td><td>10.4</td><td>13.7</td><td>36.4</td><td>51.2</td></tr><tr><td>MiniGPT4 (7B) (Mar 2024)</td><td>9</td><td>18.6</td><td>19.1</td><td>18.1</td><td>19.4</td><td>18</td><td>18.4</td><td>13.2</td><td>13.2</td><td>26.5</td><td>27.3</td><td>15.7</td><td>16.5</td></tr><tr><td>Open-flamingo (9B) (Mar 2024)</td><td>10</td><td>13.8</td><td>15</td><td>16.1</td><td>17.1</td><td>9.7</td><td>11.1</td><td>13.2</td><td>13.9</td><td>19.1</td><td>21.3</td><td>7.4</td><td>8.3</td></tr><tr><td>LLaVA (13B) (Mar 2024)</td><td>11</td><td>10.9</td><td>10.9</td><td>12.3</td><td>12.8</td><td>8.2</td><td>7.4</td><td>2.3</td><td>1.7</td><td>30.6</td><td>31.2</td><td>2.5</td><td>3.3</td></tr><tr><td>BLIP2 (8B) (Mar 2024)</td><td>12</td><td>10.9</td><td>11.8</td><td>11.5</td><td>11.8</td><td>9.7</td><td>12</td><td>5</td><td>4.6</td><td>26.8</td><td>26.8</td><td>1.7</td><td>6.6</td></tr><tr><td>mPLUG-Owl1 (7B) (Mar 2024)</td><td>13</td><td>9.7</td><td>8.7</td><td>11.3</td><td>10.2</td><td>6.9</td><td>6</td><td>1</td><td>0.3</td><td>29</td><td>26.8</td><td>2.5</td><td>2.5</td></tr><tr><td>mPLUG-Owl2 (7B) (Mar 2024)</td><td>14</td><td>9.2</td><td>10.4</td><td>11</td><td>11.3</td><td>6</td><td>8.8</td><td>0.8</td><td>3.3</td><td>28.4</td><td>27.9</td><td>0.8</td><td>3.3</td></tr><tr><td>OFA (1B) (Mar 
2024)</td><td>15</td><td>8.7</td><td>10.2</td><td>9.7</td><td>11.3</td><td>6.9</td><td>8.3</td><td>5</td><td>6.3</td><td>19.7</td><td>20.2</td><td>1.7</td><td>5</td></tr><tr><td>Open-flamingo (3B) (Mar 2024)</td><td>16</td><td>6.9</td><td>8.2</td><td>7.4</td><td>8.7</td><td>6</td><td>7.4</td><td>0.7</td><td>1.3</td><td>19.1</td><td>21.3</td><td>4.1</td><td>5.8</td></tr></tbody> </table>

## Contributions
Zhecan Wang*, Garrett Bingham*, Adams Wei Yu, Quoc V. Le, Thang Luong, Golnaz Ghiasi
(* ZW and GB are the main contributors. ZW did some of this work while at Google DeepMind.)
## Citing this work
@inproceedings{wang2024haloquest,
title={HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning},
author={Zhecan Wang and Garrett Bingham and Adams Wei Yu and Quoc V. Le and Thang Luong and Golnaz Ghiasi},
booktitle={European Conference on Computer Vision},
year={2024},
organization={Springer}
}
## License and disclaimer
Copyright 2024 DeepMind Technologies Limited
All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode
Image URLs are from the Open Images Dataset v7 and Midjourney Showcase. Images may be individually licensed and you should verify the license for each image yourself.
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.