MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era

The official repo of "MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era".

More detailed information is in the PAPER.

Download the MMRel question-answer pairs and the generated Dall-E images.

Authors: Jiahao Nie<sup>*</sup>, Gongjie Zhang<sup>*</sup>, Wenbin An, Yap-Peng Tan, Alex C. Kot, Shijian Lu

Multi-Modal Relation Understanding (MMRel)

<p align="middle"> <img src="image/mmrel.png"> </p>

The MMRel dataset is a large-scale, high-quality, and diverse multi-modal benchmark for studying inter-object relations with MLLMs.

MMRel contains over 15K multi-modal samples that cover three categories of relations (i.e., spatial, action, and comparative) and are sourced from three distinct domains (i.e., real images, synthetic images from SDXL, and synthetic images from Dall-E).

<p align="middle"> <img src="image/statistics.png"> </p>

Semi-Automatic Data Collection (SemiDC)

We adopt a semi-automatic pipeline that leverages MLLMs to generate images and annotations from textual prompts; the generated images and annotations are then verified and corrected by human reviewers.

<p align="middle"> <img src="image/semidc.png"> </p>

Images

Real images and images synthesized from SDXL

The real images are from Visual Genome. The images synthesized from SDXL can be downloaded from SPEC's official repo. Specifically, we adopt the relative_spatial and relative_size subsets.

Images synthesized from Dall-E

To diversify MMRel, we additionally synthesize images via Dall-E. Moreover, we create a challenging subset in MMRel that uses relations deviating from common sense to rigorously assess the relation understanding capabilities of MLLMs. The Dall-E images come in four different styles (i.e., photo-realistic, watercolor, abstract, and oil painting).

The images generated from Dall-E can be downloaded HERE.
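Below is a minimal sketch of how relation images in the four styles could be synthesized with the OpenAI image API. The relation prompt, prompt template, and file naming are illustrative only and are not the prompts used to build MMRel; the released images should be downloaded from the link above.

```python
# Illustrative sketch: synthesize one relation in four styles with dall-e-3.
# Requires the openai package and an OPENAI_API_KEY in the environment.
import urllib.request
from openai import OpenAI

client = OpenAI()
STYLES = ["photo-realistic", "watercolor", "abstract", "oil painting"]

def synthesize(relation: str, out_prefix: str) -> None:
    for style in STYLES:
        prompt = f"A {style} image of {relation}."   # assumed prompt template
        resp = client.images.generate(model="dall-e-3", prompt=prompt,
                                      n=1, size="1024x1024")
        url = resp.data[0].url
        urllib.request.urlretrieve(url, f"{out_prefix}_{style.replace(' ', '_')}.png")

# Example: a counter-intuitive relation, as used in the challenging subset.
synthesize("a cat chasing a dog", "cat_chases_dog")
```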

MMRel for evaluation

Thanks to its large scale, high quality, and diverse multi-modal data, MMRel is ideal for evaluating MLLMs on relation understanding. We use all 15K samples for evaluation, and the experiments are conducted with VCD's official code.

<p align="middle"> <img src="image/eval.png"> </p>

The evaluation question-answer pairs can be downloaded from HERE.
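For reference, the snippet below scores yes/no relation questions against model answers. It assumes both files are JSON Lines with "question_id", "text", and "label" fields (the POPE-style layout used by VCD's evaluation scripts); the file names are hypothetical, so adapt the keys and paths to the released files.

```python
# Sketch of yes/no accuracy scoring, assuming POPE/VCD-style JSONL files.
import json

def load_jsonl(path):
    with open(path) as f:
        return {str(item["question_id"]): item for item in map(json.loads, f)}

def score(question_file: str, answer_file: str) -> float:
    questions = load_jsonl(question_file)
    answers = load_jsonl(answer_file)
    correct = total = 0
    for qid, q in questions.items():
        if qid not in answers:
            continue
        pred = "yes" if "yes" in answers[qid]["text"].lower() else "no"
        correct += int(pred == q["label"].lower())
        total += 1
    return correct / max(total, 1)

# Hypothetical file names for illustration only.
print(f"accuracy: {score('mmrel_questions.jsonl', 'model_answers.jsonl'):.4f}")
```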

MMRel for fine-tuning

We further utilize MMRel for fine-tuning, which consistently improves the MLLMs' relation understanding capabilities. More details can be found in the paper. The data for fine-tuning can be downloaded from HERE.
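Since our fine-tuning experiments are built on LLaVA-1.5, the downloaded file is expected to follow the LLaVA visual-instruction-tuning layout (a JSON list of records with "id", "image", and "conversations" fields). The checker below is a sketch under that assumption; verify the field names against the released file.

```python
# Sketch: sanity-check a fine-tuning file against the LLaVA conversation format.
# The expected field names are assumptions based on LLaVA-1.5's data layout.
import json

def validate(path: str) -> None:
    with open(path) as f:
        records = json.load(f)
    for rec in records:
        assert {"id", "image", "conversations"} <= rec.keys(), rec.get("id")
        turns = rec["conversations"]
        # First turn comes from the user and contains the image placeholder.
        assert turns and turns[0]["from"] == "human" and "<image>" in turns[0]["value"]
        assert all(t["from"] in {"human", "gpt"} for t in turns)
    print(f"{len(records)} records look LLaVA-compatible")

validate("mmrel_finetune.json")  # hypothetical file name
```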

Citation

If you use this codebase for your research, please consider citing:

@article{nie2024mmrel,
  title={MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era},
  author={Nie, Jiahao and Zhang, Gongjie and An, Wenbin and Tan, Yap-Peng and Kot, Alex C and Lu, Shijian},
  journal={arXiv preprint arXiv:2406.09121},
  year={2024}
}

Acknowledgement

Our experiments are conducted based on LLaVA-1.5 and VCD's official code.

Reference

[1] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. NeurIPS, 2023.

[2] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744, 2023.

[3] Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding. CVPR, 2024.

[4] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV, 2017.

[5] Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, and Zuxuan Wu. Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding. CVPR, 2024.