HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
Overview
Welcome to the code repository of our ECCV 2024 paper, HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning. This repository contains code, models, evaluation metrics, and information related to our dataset and research paper.
The data repository can be accessed here.
Structure
base-models: Implementation code for the four base models.
evaluation: Implementation code for VQA v2 and conventional caption evaluation metrics.
examples: Examples of HaloQuest data.
Base Models
This repository includes code for the four base models utilized in the paper: BLIP2, MiniGPT4, mPLUG-Owl, and LLaVA.
Training & Inference
Inside the folder base-models, we provide modified training and evaluation code for BLIP2, MiniGPT4, mPLUG-Owl, and LLaVA, following their respective original repositories.
For each base model, please follow the setup and environment requirements specified in the corresponding requirement.txt, .yaml, or .toml files within its folder.
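As a quick illustration of the kind of visual question answering these base models perform, here is a minimal sketch that runs a public BLIP-2 checkpoint through Hugging Face Transformers. This is not the modified code in base-models (which follows the original repositories); the Salesforce/blip2-opt-2.7b checkpoint and the example.jpg path are placeholders for illustration.

```python
# Illustrative BLIP-2 VQA inference via Hugging Face Transformers (not the modified
# code in base-models). "example.jpg" is a placeholder; any local image works.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
# Loaded in default precision for simplicity; fp16 or 8-bit loading reduces memory.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

image = Image.open("example.jpg").convert("RGB")
prompt = "Question: Is there anything unusual in this image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```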
Evaluation
Inside the folder evaluation, the eval_metrics.py file contains evaluation code for both VQA v2 and conventional metrics such as BLEU, CIDEr, ROUGE, and METEOR.
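To make this concrete, the sketch below shows how such scores can be computed with the pycocoevalcap package together with the standard VQA v2 accuracy rule. It is not the exact contents of eval_metrics.py, and the example predictions and references are placeholders.

```python
# Minimal sketch of VQA v2-style accuracy and conventional captioning metrics via
# pycocoevalcap (pip install pycocoevalcap). METEOR works the same way but also
# requires Java, so it is omitted here. Example data below is placeholder only.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge


def vqa_v2_accuracy(prediction, gt_answers):
    """Standard VQA v2 rule: an answer counts as fully correct if >= 3 annotators gave it."""
    matches = sum(prediction.strip().lower() == a.strip().lower() for a in gt_answers)
    return min(matches / 3.0, 1.0)


# Candidate answers (res) and reference answers (gts), keyed by example id.
res = {"0": ["a dog sleeping on a red couch"],
       "1": ["there is no cat in the image"]}
gts = {"0": ["a dog sleeping on a red couch", "a dog lying on a sofa"],
       "1": ["there is no cat in the image", "no cat is visible"]}

for name, scorer in [("BLEU", Bleu(4)), ("CIDEr", Cider()), ("ROUGE_L", Rouge())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)  # Bleu(4) returns a list with BLEU-1 through BLEU-4

print("VQA v2 accuracy:", vqa_v2_accuracy("yes", ["yes"] * 4 + ["no"] * 6))
```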
HaloQuest Data
For your reference, we provide some examples of HaloQuest data here.
If you want to use the data, please refer to the data repository.
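For a sense of how an entry might be consumed in code, the sketch below iterates over a downloaded copy of the data. The file name and column names are assumptions for illustration only; please check the data repository for the actual file format and schema.

```python
# Hypothetical loading sketch: "haloquest.csv" and the columns image_url, question,
# and answers are assumptions, not the confirmed schema of the released data.
import csv

with open("haloquest.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        image_url = row["image_url"]  # assumed: link to the (generated or real) image
        question = row["question"]    # assumed: the visual question about the image
        answers = row["answers"]      # assumed: ground-truth answer(s) for scoring
        print(image_url, question, answers)
```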
Leaderboard
<table>
  <thead>
    <tr>
      <th rowspan="2">Model (#Param)</th>
      <th rowspan="2">Rank</th>
      <th colspan="2">Overall</th>
      <th colspan="2">Generated</th>
      <th colspan="2">Real</th>
      <th colspan="2">False Premise</th>
      <th colspan="2">Visually Challenging</th>
      <th colspan="2">Insufficient Context</th>
    </tr>
    <tr>
      <th>Human Eval</th> <th>Auto-Eval</th>
      <th>Human Eval</th> <th>Auto-Eval</th>
      <th>Human Eval</th> <th>Auto-Eval</th>
      <th>Human Eval</th> <th>Auto-Eval</th>
      <th>Human Eval</th> <th>Auto-Eval</th>
      <th>Human Eval</th> <th>Auto-Eval</th>
    </tr>
  </thead>
  <tbody>
    <tr> <td>Gemini 1.5 Pro</td> <td>1</td> <td>76.1</td> <td>77.9</td> <td>74.7</td> <td>78.3</td> <td>78.7</td> <td>77.2</td> <td>80.4</td> <td>83.7</td> <td>57.3</td> <td>56.3</td> <td>91</td> <td>92.5</td> </tr>
    <tr> <td>GPT-4o</td> <td>2</td> <td>68.1</td> <td>63.2</td> <td>68.8</td> <td>63.8</td> <td>66.9</td> <td>62.2</td> <td>68.5</td> <td>65.2</td> <td>58.3</td> <td>55.2</td> <td>80.6</td> <td>68.7</td> </tr>
    <tr> <td>GPT-4</td> <td>3</td> <td>62.9</td> <td>61.2</td> <td>64.3</td> <td>61.1</td> <td>60.6</td> <td>61.4</td> <td>64.7</td> <td>63</td> <td>46.9</td> <td>44.8</td> <td>80.6</td> <td>79.1</td> </tr>
    <tr> <td>BEiT-3 (0.7B)</td> <td>4</td> <td>35.9</td> <td>40</td> <td>41.2</td> <td>44.3</td> <td>26.3</td> <td>32.3</td> <td>24.1</td> <td>28.4</td> <td>36.6</td> <td>36.1</td> <td>9.1</td> <td>10.7</td> </tr>
    <tr> <td>InstructBLIP (12B)</td> <td>5</td> <td>25.5</td> <td>28.5</td> <td>28.4</td> <td>31.5</td> <td>20.3</td> <td>23</td> <td>28.4</td> <td>32</td> <td>33.3</td> <td>33.9</td> <td>6.6</td> <td>11.6</td> </tr>
    <tr> <td>InstructBLIP (8B)</td> <td>6</td> <td>25</td> <td>27.3</td> <td>28.4</td> <td>29.7</td> <td>18.9</td> <td>23</td> <td>28.4</td> <td>32</td> <td>6.6</td> <td>11.6</td> <td>33.3</td> <td>33.9</td> </tr>
    <tr> <td>BLIP2 (12B)</td> <td>7</td> <td>21.1</td> <td>22.5</td> <td>24.8</td> <td>26.1</td> <td>14.29</td> <td>16.1</td> <td>16.8</td> <td>19.5</td> <td>35.5</td> <td>32.8</td> <td>9.9</td> <td>14.9</td> </tr>
    <tr> <td>MiniGPT4 (13B)</td> <td>8</td> <td>18.7</td> <td>25.2</td> <td>18.2</td> <td>24</td> <td>18.9</td> <td>27.2</td> <td>16.2</td> <td>21.5</td> <td>10.4</td> <td>13.7</td> <td>36.4</td> <td>51.2</td> </tr>
    <tr> <td>MiniGPT4 (7B)</td> <td>9</td> <td>18.6</td> <td>19.1</td> <td>18.1</td> <td>19.4</td> <td>18</td> <td>18.4</td> <td>13.2</td> <td>13.2</td> <td>26.5</td> <td>27.3</td> <td>15.7</td> <td>16.5</td> </tr>
    <tr> <td>Open-flamingo (9B)</td> <td>10</td> <td>13.8</td> <td>15</td> <td>16.1</td> <td>17.1</td> <td>9.7</td> <td>11.1</td> <td>13.2</td> <td>13.9</td> <td>19.1</td> <td>21.3</td> <td>7.4</td> <td>8.3</td> </tr>
    <tr> <td>LLaVA (13B)</td> <td>11</td> <td>10.9</td> <td>10.9</td> <td>12.3</td> <td>12.8</td> <td>8.2</td> <td>7.4</td> <td>2.3</td> <td>1.7</td> <td>30.6</td> <td>31.2</td> <td>2.5</td> <td>3.3</td> </tr>
    <tr> <td>BLIP2 (8B)</td> <td>12</td> <td>10.9</td> <td>11.8</td> <td>11.5</td> <td>11.8</td> <td>9.7</td> <td>12</td> <td>5</td> <td>4.6</td> <td>26.8</td> <td>26.8</td> <td>1.7</td> <td>6.6</td> </tr>
    <tr> <td>mPLUG-Owl1 (7B)</td> <td>13</td> <td>9.7</td> <td>8.7</td> <td>11.3</td> <td>10.2</td> <td>6.9</td> <td>6</td> <td>1</td> <td>0.3</td> <td>29</td> <td>26.8</td> <td>2.5</td> <td>2.5</td> </tr>
    <tr> <td>mPLUG-Owl2 (7B)</td> <td>14</td> <td>9.2</td> <td>10.4</td> <td>11</td> <td>11.3</td> <td>6</td> <td>8.8</td> <td>0.8</td> <td>3.3</td> <td>28.4</td> <td>27.9</td> <td>0.8</td> <td>3.3</td> </tr>
    <tr> <td>OFA (1B)</td> <td>15</td> <td>8.7</td> <td>10.2</td> <td>9.7</td> <td>11.3</td> <td>6.9</td> <td>8.3</td> <td>5</td> <td>6.3</td> <td>19.7</td> <td>20.2</td> <td>1.7</td> <td>5</td> </tr>
    <tr> <td>Open-flamingo (3B)</td> <td>16</td> <td>6.9</td> <td>8.2</td> <td>7.4</td> <td>8.7</td> <td>6</td> <td>7.4</td> <td>0.7</td> <td>1.3</td> <td>19.1</td> <td>21.3</td> <td>4.1</td> <td>5.8</td> </tr>
  </tbody>
</table>

Contributions
Zhecan Wang*, Garrett Bingham*, Adams Wei Yu, Quoc V. Le, Thang Luong, Golnaz Ghiasi
(* ZW and GB are main contributors. ZW did some work while at Google DeepMind.)
Citing this work
@inproceedings{wang2024haloquest,
  title={HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning},
  author={Zhecan Wang and Garrett Bingham and Adams Wei Yu and Quoc V. Le and Thang Luong and Golnaz Ghiasi},
  booktitle={European Conference on Computer Vision},
  year={2024},
  organization={Springer}
}