
<div align="center"> <div>

<a href="https://github.com/Q-Future/"><img src="https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fvqassessment%2FQ-Bench&count_bg=%23E97EBA&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=visitors&edge_flat=false"/></a> <a href="https://github.com/Q-Future/Q-Bench"><img src="https://img.shields.io/github/stars/Q-Future/Q-Bench"/></a> <a href="https://arxiv.org/abs/2309.14181"><img src="https://img.shields.io/badge/Arxiv-2309:14181-red"/></a> <a href="https://arxiv.org/abs/2402.07116"><img src="https://img.shields.io/badge/Extension-2402:07116-yellow"/></a> <a href="https://github.com/Q-Future/Q-Bench/releases/tag/v1.0.1.1014datarelease"><img src="https://img.shields.io/badge/Data-Release-green"></a> <a href="https://github.com/Q-Future/Q-Instruct"><img src="https://img.shields.io/badge/Awesome-QInstruct-orange"/></a>

</div> <h1>Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision</h1>

How do multi-modality LLMs perform on low-level computer vision?

<div> <a href="https://teowu.github.io/" target="_blank">Haoning Wu</a><sup>1</sup><sup>*</sup>, <a href="https://zzc-1998.github.io/" target="_blank">Zicheng Zhang</a><sup>2</sup><sup>*</sup>, <a href="https://github.com/ZhangErliCarl/" target="_blank">Erli Zhang</a><sup>1</sup><sup>*</sup>, <a href="https://chaofengc.github.io" target="_blank">Chaofeng Chen</a><sup>1</sup>, <a href="https://liaoliang92.github.io" target="_blank">Liang Liao</a><sup>1</sup>, </div> <div> <a href="https://github.com/AnnanWangDaniel" target="_blank">Annan Wang</a><sup>1</sup>, <a href="https://github.com/lcysyzxdxc" target="_blank">Chunyi Li</a><sup>2</sup>, <a href="https://wenxiusun.com" target="_blank">Wenxiu Sun</a><sup>3</sup>, <a href="https://scholar.google.com/citations?user=uT9CtPYAAAAJ&hl=en" target="_blank">Qiong Yan</a><sup>3</sup>, <a href="https://ee.sjtu.edu.cn/en/FacultyDetail.aspx?id=24&infoid=153&flag=153" target="_blank">Guangtao Zhai</a><sup>2</sup>, <a href="https://personal.ntu.edu.sg/wslin/Home.html" target="_blank">Weisi Lin</a><sup>1</sup><sup>#</sup> </div> <div> <sup>1</sup>Nanyang Technological University, <sup>2</sup>Shanghai Jiaotong University, <sup>3</sup>Sensetime Research </div> <div> <sup>*</sup>Equal contribution. <sup>#</sup>Corresponding author. </div> <div> ICLR2024 Spotlight </div> <a href="https://arxiv.org/abs/2309.14181"><strong>Paper</strong></a> | <a href="https://q-future.github.io/Q-Bench"><strong>Project Page</strong></a> | <a href="https://github.com/Q-Future/Q-Bench"><strong>Github</strong></a> | <a href="https://huggingface.co/datasets/nanyangtu/LLVisionQA-QBench"><strong>Data (LLVisionQA)</strong></a> | <a href="https://huggingface.co/datasets/nanyangtu/LLDescribe-QBench"><strong>Data (LLDescribe)</strong></a> | <a href="https://q-future.github.io/Chinese-Q-Bench"><strong>质衡 (Chinese-Q-Bench)</strong></a> <div style="width: 80%; text-align: center; margin:auto;"> <img style="width:80%" src="logo.png"> </div> <div style="width: 80%; text-align: center; margin:auto;"> <img style="width:80%" src="qbench.png"> </div> </div>

The proposed Q-Bench includes three realms for low-level vision: perception (A1), description (A2), and assessment (A3).

Use with datasets API

For Q-Bench-A1 (multiple-choice questions), we have converted the data into HF-format datasets that can be automatically downloaded and used with the datasets API. Please refer to the following instructions:

pip install datasets

Q-Bench (single images)

from datasets import load_dataset

ds = load_dataset("q-future/Q-Bench-HF")

print(ds["dev"][0])

### {'id': 0,
### 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=4160x3120>,
### 'question': 'How is the lighting of this building?',
### 'option0': 'High',
### 'option1': 'Low',
### 'option2': 'Medium',
### 'option3': 'N/A',
### 'question_type': 2,
### 'question_concern': 3,
### 'correct_choice': 'B'}
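
For instance, a multiple-choice prompt can be assembled from such a record as follows. This is a minimal sketch: build_mcq_prompt is our own illustrative helper, and query_mllm is a hypothetical stand-in for your model's inference call.

# Minimal sketch: assemble a multiple-choice prompt from a Q-Bench-HF record.
def build_mcq_prompt(record):
    # "N/A" marks an unused option slot; skip it in the prompt.
    options = [record[f"option{i}"] for i in range(4) if record[f"option{i}"] != "N/A"]
    body = "".join(f"\n{letter}. {text}" for letter, text in zip("ABCD", options))
    return record["question"] + body + "\nAnswer with the letter of the correct choice."

record = ds["dev"][0]
prompt = build_mcq_prompt(record)
# answer = query_mllm(record["image"], prompt)               # hypothetical call into your own MLLM
# is_correct = answer.strip().startswith(record["correct_choice"])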

Q-Bench2 (image pairs)

from datasets import load_dataset

ds = load_dataset("q-future/Q-Bench2-HF")

print(ds["dev"][0])

### {'id': 0,
###  'image1': <PIL.Image.Image image mode=RGB size=4032x3024>,
###  'image2': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=864x1152>,
###  'question': 'Compared to the first image, how is the clarity of the second image?',
###  'option0': 'More blurry',
###  'option1': 'Clearer',
###  'option2': 'About the same',
###  'option3': 'N/A',
###  'question_type': 2,
###  'question_concern': 0,
###  'correct_choice': 'B'}

Release

Closed-source MLLMs (GPT-4V-Turbo, Gemini, Qwen-VL-Plus, GPT-4V)

<div style="width: 55%; text-align: center; margin:auto;"> <img style="width:55%" src="gpt-4v-vs-human.png"> </div>

We test three closed-source API models: GPT-4V-Turbo (gpt-4-vision-preview, replacing the no-longer-available old GPT-4V results), Gemini Pro (gemini-pro-vision), and Qwen-VL-Plus (qwen-vl-plus). Slightly improved over its older version, GPT-4V still tops all MLLMs and nearly matches the performance of a junior-level human. Gemini Pro and Qwen-VL-Plus follow behind, yet still outperform the best open-source MLLMs (0.65 overall).

Update [2024/7/18]: we are glad to release the new SOTA performance of BlueImage-GPT (closed-source).

Perception, A1-Single

| Participant Name | yes-or-no | what | how | distortion | others | in-context distortion | in-context others | overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-VL-Plus (qwen-vl-plus) | 0.7574 | 0.7325 | 0.5733 | 0.6488 | 0.7324 | 0.6867 | 0.7056 | 0.6893 |
| BlueImage-GPT (from VIVO New Champion) | 0.8467 | 0.8351 | 0.7469 | 0.7819 | 0.8594 | 0.7995 | 0.8240 | 0.8107 |
| Gemini-Pro (gemini-pro-vision) | 0.7221 | 0.7300 | 0.6645 | 0.6530 | 0.7291 | 0.7082 | 0.7665 | 0.7058 |
| GPT-4V-Turbo (gpt-4-vision-preview) | 0.7722 | 0.7839 | 0.6645 | 0.7101 | 0.7107 | 0.7936 | 0.7891 | 0.7410 |
| GPT-4V (old version) | 0.7792 | 0.7918 | 0.6268 | 0.7058 | 0.7303 | 0.7466 | 0.7795 | 0.7336 |
| human-1-junior | 0.8248 | 0.7939 | 0.6029 | 0.7562 | 0.7208 | 0.7637 | 0.7300 | 0.7431 |
| human-2-senior | 0.8431 | 0.8894 | 0.7202 | 0.7965 | 0.7947 | 0.8390 | 0.8707 | 0.8174 |


Perception, A1-Pair

| Participant Name | yes-or-no | what | how | distortion | others | compare | joint | overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-VL-Plus (qwen-vl-plus) | 0.6685 | 0.5579 | 0.5991 | 0.6246 | 0.5877 | 0.6217 | 0.5920 | 0.6148 |
| Qwen-VL-Max (qwen-vl-max) | 0.6765 | 0.6756 | 0.6535 | 0.6909 | 0.6118 | 0.6865 | 0.6129 | 0.6699 |
| BlueImage-GPT (from VIVO New Champion) | 0.8843 | 0.8033 | 0.7958 | 0.8464 | 0.8062 | 0.8462 | 0.7955 | 0.8348 |
| Gemini-Pro (gemini-pro-vision) | 0.6578 | 0.5661 | 0.5674 | 0.6042 | 0.6055 | 0.6046 | 0.6044 | 0.6046 |
| GPT-4V (gpt-4-vision) | 0.7975 | 0.6949 | 0.8442 | 0.7732 | 0.7993 | 0.8100 | 0.6800 | 0.7807 |
| Junior-level Human | 0.7811 | 0.7704 | 0.8233 | 0.7817 | 0.7722 | 0.8026 | 0.7639 | 0.8012 |
| Senior-level Human | 0.8300 | 0.8481 | 0.8985 | 0.8313 | 0.9078 | 0.8655 | 0.8225 | 0.8548 |


We have also evaluated several new open-source models recently, and will release their results soon.

Submission Guideline for A1/A2

Option 1: Submit Results

Step 1: Download Images

We now provide two ways to download the datasets (LLVisionQA & LLDescribe): the GitHub data release and the Hugging Face Hub.
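
For example, the raw LLVisionQA files can be fetched from the Hugging Face Hub. This is a sketch using the dataset repo linked in the header above; the local directory handling is up to you.

# Sketch: fetch the raw LLVisionQA files from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="nanyangtu/LLVisionQA-QBench",   # dataset repo linked in the header above
    repo_type="dataset",
)
print("Downloaded to:", local_dir)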

Step 2: Test with Your Model

It is highly recommended to convert your model into Huggingface format so that it can be tested on these data smoothly. See the example scripts for Huggingface's IDEFICS-9B-Instruct, and modify them to test your custom model.
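
For reference, below is a minimal sketch of how such a script might query IDEFICS-9B-Instruct on one LLVisionQA-style question via transformers. The prompt template and answer extraction are simplified assumptions; please follow the official example scripts for actual submissions.

# Sketch: query IDEFICS-9B-Instruct on one multiple-choice question.
import torch
from datasets import load_dataset
from transformers import IdeficsForVisionText2Text, AutoProcessor

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

record = load_dataset("q-future/Q-Bench-HF")["dev"][0]
options = [record[f"option{i}"] for i in range(4) if record[f"option{i}"] != "N/A"]
question = record["question"] + "".join(
    f"\n{letter}. {text}" for letter, text in zip("ABCD", options)
)

# IDEFICS prompts interleave text and PIL images in a single list.
prompts = [[
    "User: " + question,
    record["image"],
    "<end_of_utterance>",
    "\nAssistant: The answer is",
]]
inputs = processor(prompts, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=8)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])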

Please email haoning001@e.ntu.edu.sg to submit your results in JSON format.

Option 2: Submit Model

You can also submit your model (either a Huggingface AutoModel or a ModelScope AutoModel) to us, alongside your custom evaluation scripts. Your custom scripts can be modified from the template scripts that work for LLaVA-v1.5 (for A1/A2), and here (for image quality assessment).

Please email haoning001@e.ntu.edu.sg to submit your model if you are outside China Mainland. Please email zzc1998@sjtu.edu.cn to submit your model if you are inside China Mainland.

A1: Perception

A snapshot of the LLVisionQA benchmark dataset for MLLM low-level perception ability is shown below. See the leaderboard here.

[Figure: snapshot of the LLVisionQA dataset (A1: perception)]

We measure the answer accuracy of MLLMs (provided with the question and all choices) as the metric here.
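
As a quick sketch of this metric (reusing the ds loaded via load_dataset above, and a hypothetical predictions dict mapping record ids to your model's chosen letters):

# Sketch: answer accuracy over the dev split.
predictions = {0: "B", 1: "A"}   # hypothetical {record id: predicted choice letter} from your MLLM

dev = ds["dev"]
correct = sum(predictions.get(r["id"]) == r["correct_choice"] for r in dev)
print(f"Accuracy: {correct / len(dev):.4f}")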

A2: Description

A snapshot of the LLDescribe benchmark dataset for MLLM low-level description ability is shown below. See the leaderboard here.

[Figure: snapshot of the LLDescribe dataset (A2: description)]

We measure the completeness, precision, and relevance of MLLM descriptions as the metric here.

A3: Assessment

An exciting ability: MLLMs can predict quantitative scores for image quality assessment (IQA)!

Methodology

[Figure: methodology of the assessment (A3) task]

Predict a Score

Pseudo Code

Similar to the above, as long as a model (based on a causal language model) provides the following two methods, embed_image_and_text (to allow multi-modal inputs) and forward (to compute logits), Image Quality Assessment (IQA) with the model can be achieved as follows:

from PIL import Image
from my_mllm_model import Model, Tokenizer, embed_image_and_text  # pseudo package: replace with your own MLLM

model, tokenizer = Model(), Tokenizer()

prompt = "##User: Rate the quality of the image.\n" \
         "##Assistant: The quality of the image is" ### This line can be modified based on MLLM's default behaviour.

# Token ids of the two anchor words whose logits will be compared.
good_idx, poor_idx = tokenizer(["good","poor"]).tolist()

image = Image.open("image_for_iqa.jpg")
input_embeds = embed_image_and_text(image, prompt)             # multi-modal embedding of image + prompt
output_logits = model(input_embeds=input_embeds).logits[0,-1]  # logits at the next-token position
# Softmax between the "good" and "poor" logits (scaled by 1/100);
# the probability of "good" is taken as the predicted quality score.
q_pred = (output_logits[[good_idx, poor_idx]] / 100).softmax(0)[0]

*Note that you can modify the second line of the prompt based on your model's default format; e.g., for Shikra, "##Assistant: The quality of the image is" becomes "##Assistant: The answer is". It is also fine if your MLLM first answers "Ok, I would like to help! The image quality is"; just use that as the second line of the prompt.

Example Real Code for IDEFICS

We further provide a full implementation of IDEFICS on IQA. See the example on how to run IQA with this MLLM. Other MLLMs can be modified in the same way for use in IQA.

Compute SRCC/PLCC with IQA databases

We have prepared JSON-format human opinion scores (MOS) for the seven IQA databases evaluated in our benchmark.

Please see IQA_databases for details.
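
As a sketch of this evaluation (the file path, the mos field name, and the predict_quality routine are placeholders; adapt them to the released JSON files and to the q_pred computation above), SRCC/PLCC can be computed with scipy:

# Sketch: SRCC/PLCC between predicted scores and human MOS for one IQA database.
import json
from scipy.stats import spearmanr, pearsonr

with open("iqa_database_mos.json") as f:      # placeholder path: use the released JSON files
    annotations = json.load(f)

mos = [item["mos"] for item in annotations]                            # field name assumed
pred = [predict_quality(item["image_path"]) for item in annotations]   # e.g. the q_pred routine above

print("SRCC:", spearmanr(pred, mos).correlation)
print("PLCC:", pearsonr(pred, mos)[0])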

Official Results on IQA Databases

Moved to leaderboards. Please click to see details.

Contact

Please contact any of the first authors of this paper for queries.

Citation

If you find our work interesting, please feel free to cite our paper:

@inproceedings{wu2024qbench,
    author = {Wu, Haoning and Zhang, Zicheng and Zhang, Erli and Chen, Chaofeng and Liao, Liang and Wang, Annan and Li, Chunyi and Sun, Wenxiu and Yan, Qiong and Zhai, Guangtao and Lin, Weisi},
    title = {Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision},
    booktitle = {ICLR},
    year = {2024}
}