
<p align="center"> <h1 align="center"> <img src="images/hobby.png" alt="PNG Image" width="25" height="25"> Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images</h1> </p> <div align="center">

Zhiyuan Li · Heng Wang · Dongnan Liu · Chaoyi Zhang · Ao Ma · Jieting Long · Weidong Cai

School of Computer Science, The University of Sydney

<a href='https://mucr-benchmark.github.io/'><img src='https://img.shields.io/badge/Project-Page-green'></a> <a href='https://arxiv.org/pdf/2408.08105'><img src='https://img.shields.io/badge/Arxiv-Paper-red'></a> <a href='https://huggingface.co/datasets/Pinkygin/MuCR'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue'></a>

</div> <p align="center"> <b> [<a href="https://arxiv.org/abs/2408.08105">ArXiv</a>] | [<a href="https://huggingface.co/datasets/Pinkygin/MuCR">🤗HuggingFace</a>] | [<a href="https://mucr-benchmark.github.io/">Website</a>] </b> <br /> </p>

We propose MuCR to challenge Vision Large Language Models (VLLMs) to infer semantic cause-and-effect relationships between siamese images, relying solely on visual cues such as action, appearance, clothing, and environment.

<img src='images/picture3.png'>
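As a concrete illustration of the task format, below is a minimal, hedged sketch of querying a VLLM on one siamese pair via the OpenAI Python SDK. The model name, prompt wording, and image paths are illustrative assumptions, not the benchmark's official evaluation pipeline:

```python
# Minimal sketch: probe a VLLM on one cause/effect image pair (OpenAI SDK).
# Model name, prompt wording, and file paths are illustrative assumptions;
# this is NOT the benchmark's official evaluation script.
import base64
from openai import OpenAI

def encode_image(path: str) -> str:
    """Base64-encode a local image for the API's data-URL format."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

client = OpenAI()  # reads OPENAI_API_KEY from the environment
cause_b64 = encode_image("images/cause.jpg")    # assumed local paths
effect_b64 = encode_image("images/effect.jpg")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "The first image depicts a cause and the second its effect. "
                     "Which visual cue links them causally?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{cause_b64}"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{effect_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```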

## Release

## Demos

### Overview

<p align="center"> <img src="images/Picture6.png"> </p>

### Model Performance

<p align="center"> <img src="images/performance.png"> </p>

### Detailed Examples

<p align="center"> <img src="images/human2.png" alt="Image 1" style="display: inline-block;"> <img src="images/animal3.png" alt="Image 2" style="display: inline-block;"> <img src="images/plant4.png" alt="Image 3" style="display: inline-block;"> <img src="images/character5.png" alt="Image 4" style="display: inline-block;"> <img src="images/mixture6.png" alt="Image 6" style="display: inline-block;"> </p>

## Download

You can download the dataset directly from [Hugging Face](https://huggingface.co/datasets/Pinkygin/MuCR), or load it with the `datasets` library as follows:

```python
import datasets

# Load from a local copy of the data; pass "Pinkygin/MuCR" instead to fetch from the Hub.
dataset = datasets.load_dataset("data/")
```
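Once loaded, each split behaves like a list of dicts keyed by the fields described in the next section; a quick hedged sanity check (the split name `train` is an assumption):

```python
# Inspect the first record of the (assumed) "train" split.
print(dataset["train"][0]["caption_0"])
```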

## Dataset Form

Each line of the JSONL file must follow the format below:

```json
{
  "id": "ID",
  "caption_0": "...",
  "caption_1": "...",
  "link_id": "[a,b,c]",
  "cue": "cue",
  "false_cue": ["false_cue1", "false_cue2", "false_cue3"],
  "style": "style",
  "label": "label",
  "causal_reason": ["Explanation_1", "Explanation_2", "Explanation_3"],
  "image_0": "cause image",
  "image_1": "effect image"
}
```
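If you prefer to work with the raw files directly, the sketch below reads records from a local JSONL file; the file path is an assumed example and the field usage follows the schema above:

```python
import json

# Read MuCR records from a local JSONL file (path is an assumed example).
with open("data/mucr.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

for rec in records[:3]:
    # Each record pairs a cause image with an effect image plus textual cues.
    print(rec["id"], ":", rec["image_0"], "->", rec["image_1"])
    print("  cue:", rec["cue"], "| label:", rec["label"])
```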

## Reference

If you find this project useful for your research, please consider citing the following paper:

```bibtex
@article{li2024multimodal,
  title={Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images},
  author={Li, Zhiyuan and Wang, Heng and Liu, Dongnan and Zhang, Chaoyi and Ma, Ao and Long, Jieting and Cai, Weidong},
  journal={arXiv preprint arXiv:2408.08105},
  year={2024}
}
```