CoTConsistency

This repository contains the released data for the paper "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models".

Overview

In this work, we build CURE, a benchmark dataset for measuring the reasoning performance and consistency of existing vision-language models (VLMs). The dataset is constructed with our LLM-Human-in-the-Loop pipeline on top of the coarsely annotated Sherlock dataset. Two examples are shown below.

Figure: Examples included in CURE.

Annotation Format

The annotation file is in JSONL format. Each item contains the annotation for one image with the following fields:

Note that the bounding boxes, clue, and ground-truth inference are taken from the original Sherlock dataset annotations.
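
As a minimal sketch of how such a JSONL annotation file can be loaded and inspected (the file name and any field names that appear are assumptions for illustration, not the released schema):

import json

# Minimal sketch: read one annotation item per line from a JSONL file.
# The path "annotations.jsonl" is a placeholder; use the released file name.
def load_annotations(path="annotations.jsonl"):
    items = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items

if __name__ == "__main__":
    annotations = load_annotations()
    print(f"Loaded {len(annotations)} annotated images")
    # Print the keys of the first item to see which fields are provided.
    if annotations:
        print(sorted(annotations[0].keys()))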

Citation

Please cite our paper:

@article{chen2023measuring,
  title={Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models},
  author={Chen, Yangyi and Sikka, Karan and Cogswell, Michael and Ji, Heng and Divakaran, Ajay},
  journal={arXiv preprint arXiv:2309.04461},
  year={2023}
}