
<img src="resources/logo.webp" style="vertical-align: -10px;" :height="50px" width="50px"> Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models

🤗 Dataset | 📖 Paper

This repo contains the official code and dataset for the paper "Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models".

💡 Highlights

📜 News

[2024.09.09] 🚀 HRBench has been supported in the VLMEvalKit repository.

[2024.08.29] 🚀 We released the ArXiv paper.

[2024.08.23] 🚀 The Hugging Face dataset and $DC^2$ code are available!

👀 Introduction

HR-Bench

We find that the highest resolution in existing multimodal benchmarks is only 2K. To address the current lack of high-resolution multimodal benchmarks, we construct HR-Bench. HR-Bench consists of two sub-tasks: Fine-grained Single-instance Perception (FSP) and Fine-grained Cross-instance Perception (FCP). The FSP task includes 100 samples covering attribute recognition, OCR, and visual prompting. The FCP task also comprises 100 samples, covering map analysis, chart analysis, and spatial relationship assessment. We visualize examples from HR-Bench below. 👇

<img src="resources/case_study_dataset_13.png">

HR-Bench is available in two versions: HR-Bench 8K and HR-Bench 4K. HR-Bench 8K includes images with an average resolution of 8K. For HR-Bench 4K, we manually annotate the coordinates of the objects relevant to each question within the 8K image and crop these images to 4K resolution.
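The cropping procedure is not spelled out here, so the following is only a minimal sketch of one way to place a 4K window around an annotated object. The function name `crop_window`, the centering strategy, and the concrete sizes used in the example (7680×4320 for 8K, 3840×2160 for 4K) are assumptions for illustration, not the authors' implementation.

```python
def crop_window(bbox, image_size, crop_size):
    """Return (left, top, right, bottom) of a crop window around bbox.

    Hypothetical helper: the window is centered on the annotated box and
    then shifted, if needed, so that it stays inside the image.
    """
    x0, y0, x1, y1 = bbox
    img_w, img_h = image_size
    # The crop can never be larger than the image itself.
    crop_w = min(crop_size[0], img_w)
    crop_h = min(crop_size[1], img_h)

    # Center of the annotated object.
    cx = (x0 + x1) // 2
    cy = (y0 + y1) // 2

    # Clamp the top-left corner so the window does not leave the image.
    left = min(max(cx - crop_w // 2, 0), img_w - crop_w)
    top = min(max(cy - crop_h // 2, 0), img_h - crop_h)
    return left, top, left + crop_w, top + crop_h


# Assumed sizes: an "8K" source image and a "4K" crop.
IMAGE_8K = (7680, 4320)
CROP_4K = (3840, 2160)

# An object near the top-right corner: the window is pushed back inside.
print(crop_window((7000, 100, 7200, 300), IMAGE_8K, CROP_4K))
```

The returned tuple uses the (left, upper, right, lower) convention, so it could be passed directly to an image library's crop routine such as Pillow's `Image.crop`.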

<img src="resources/logo.webp" style="vertical-align: -10px;" :height="25px" width="25px"> Divide, Conquer and Combine

We observe that most current MLLMs (e.g., LLaVA-v1.5) perceive images at a fixed resolution (e.g., $336\times336$). This simplification often leads to a greater loss of visual information. Based on this finding, we propose a novel training-free framework: Divide, Conquer and Combine ($DC^2$). We first recursively split an image into patches until they reach the resolution defined by the pretrained vision encoder (e.g., $336\times 336$), merging similar patches for efficiency (Divide). Next, we use the MLLM to generate a text description for each image patch and extract the objects mentioned in these descriptions (Conquer). Finally, we filter out hallucinated objects resulting from image division and store the coordinates of the image patches in which the objects appear (Combine). During the inference stage, we retrieve the relevant image patches according to the user prompt to provide accurate text descriptions.

<img src="resources/framework_version_8.png">
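As a rough illustration of the Divide step and of the patch lookup used at inference, here is a minimal sketch in plain Python. It is not the code in this repo: `PatchNode`, `divide`, `leaves`, and `retrieve` are hypothetical names, the split is a simple halving of any side that exceeds the encoder resolution, and the patch merging, MLLM captioning (Conquer), and hallucination filtering are omitted. The object table is assumed to map object names to patch coordinates, and retrieval is a naive substring match on the prompt.

```python
from dataclasses import dataclass, field


@dataclass
class PatchNode:
    box: tuple  # (left, top, right, bottom) in pixels
    children: list = field(default_factory=list)


def divide(box, max_side=336):
    """Recursively halve a region until both sides fit within max_side."""
    left, top, right, bottom = box
    node = PatchNode(box)
    w, h = right - left, bottom - top
    if w <= max_side and h <= max_side:
        return node  # small enough for the (assumed) vision encoder
    # Only split the sides that are still too large.
    mid_x = left + w // 2 if w > max_side else right
    mid_y = top + h // 2 if h > max_side else bottom
    xs = [(left, mid_x)] + ([(mid_x, right)] if mid_x < right else [])
    ys = [(top, mid_y)] + ([(mid_y, bottom)] if mid_y < bottom else [])
    for y0, y1 in ys:
        for x0, x1 in xs:
            node.children.append(divide((x0, y0, x1, y1), max_side))
    return node


def leaves(node):
    """Collect the finest-level patches of the tree."""
    if not node.children:
        return [node]
    out = []
    for child in node.children:
        out.extend(leaves(child))
    return out


def retrieve(object_table, prompt):
    """Return patch boxes whose object name occurs in the prompt.

    object_table is a hypothetical stand-in for the stored coordinates:
    {object name: [patch box, ...]}.
    """
    text = prompt.lower()
    return [box for name, boxes in object_table.items()
            if name in text for box in boxes]


tree = divide((0, 0, 1344, 1344))
print(len(leaves(tree)))  # number of encoder-sized patches

table = {"traffic light": [(0, 0, 336, 336)], "bus": [(336, 0, 672, 336)]}
print(retrieve(table, "What color is the bus?"))
```

In this sketch the tree keeps every intermediate region, so coarser patches remain available alongside the leaves; a real system would likely replace the substring match with a more robust way of linking the prompt to stored objects.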

🏆 Mini-Leaderboard

We show a mini-leaderboard here; please find more information in our paper. (👍🏻 Any new results are welcome. Please add your results and model/paper links through an issue or pull request.)

| Model | HR-Bench 4K (Acc.) | HR-Bench 8K (Acc.) | Avg. |
|---|---|---|---|
| Human Baseline 🥇 | 82.0 | 86.8 | 84.4 |
| InternVL-2-llama3-76B w/ our $DC^2$ 🥈 | 70.4 | 63.3 | 66.9 |
| Qwen2VL-7B 🥉 | 66.8 | 66.5 | 66.6 |
| InternVL-2-llama3-76B | 71.0 | 61.4 | 66.2 |
| Gemini 1.5 Flash | 66.8 | 62.8 | 64.8 |
| InternVL-1.5-26B w/ $DC^2$ | 63.4 | 61.3 | 62.3 |
| Qwen2VL-2B | 64.0 | 58.6 | 61.3 |
| InternVL-1.5-26B | 60.6 | 57.9 | 59.3 |
| GPT4o | 59.0 | 55.5 | 57.3 |
| QWen-VL-max | 58.5 | 52.5 | 55.5 |
| Xcomposer2-4kHD-7B | 57.8 | 51.3 | 54.6 |
| LLaVA-HR-X-13B | 53.6 | 46.9 | 50.3 |
| LLaVA-1.6-34B | 52.9 | 47.4 | 50.2 |
| QWen-VL-plus | 53.0 | 46.5 | 49.8 |
| LLaVA-HR-X-7B | 52.0 | 41.6 | 46.8 |

📧 Contact

✒️ Citation

@article{hrbench,
      title={Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models}, 
      author={Wenbin Wang and Liang Ding and Minyan Zeng and Xiabin Zhou and Li Shen and Yong Luo and Dacheng Tao},
      year={2024},
      journal={arXiv preprint},
      url={https://arxiv.org/abs/2408.15556}, 
}

Acknowledgement