<h1 align="center"> Multimodal-Robustness-Benchmark </h1>

<p align="center">
  <a href="http://arxiv.org/abs/2406.10638"> <img alt="Paper" src="http://img.shields.io/badge/Paper-arXiv%3A2406.10638-B31B1B.svg"> </a>
  <a href="https://huggingface.co/datasets/BAAI/Multimodal-Robustness-Benchmark"> <img alt="Dataset" src="https://img.shields.io/badge/🤗%20Dataset-MMR%20Benchmark-yellow"> </a>
  <a href="http://mmr.dataoptim.org/"> <img alt="Project Demo" src="https://img.shields.io/badge/🤖%20Project-Demo-blue"> </a>
</p>

<p align="center">
  <a href="https://huggingface.co/AI4VR/Bunny-MMR-3B"> <img alt="Model Bunny-MMR-3B" src="https://img.shields.io/badge/🤗%20Model-Bunny--MMR--3B-green"> </a>
  <a href="https://huggingface.co/AI4VR/Bunny-MMR-4B"> <img alt="Model Bunny-MMR-4B" src="https://img.shields.io/badge/🤗%20Model-Bunny--MMR--4B-green"> </a>
  <a href="https://huggingface.co/AI4VR/Bunny-MMR-8B"> <img alt="Model Bunny-MMR-8B" src="https://img.shields.io/badge/🤗%20Model-Bunny--MMR--8B-green"> </a>
</p>

This repo contains the official evaluation code and dataset for the paper “Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions”.

📢 News and Updates

🙌 How to Add a New Model to MMR Benchmark

We will then add the necessary script to our repository and handle the inference and evaluation for you.

📇 Contents

⚖ MMR-benchmark

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual understanding and reasoning, providing reasonably accurate answers, such as image descriptions. This has spurred extensive research into evaluating MLLMs. Most evaluation benchmarks assume that incorrect answers indicate a lack of understanding of the visual content. However, our findings reveal that, in many cases, MLLMs answer questions incorrectly despite correctly understanding the visual content. This suggests that incorrect answers do not necessarily imply a lack of comprehension but may instead result from a lack of robustness to leading questions.

<p align="center"> <img src="./figure/cover_fig.jpg" alt="Logo"> </p>

To comprehensively measure MLLMs' understanding capability and robustness to leading questions, we introduce a multi-modal robustness benchmark (MMR). MMR contains paired positive and negative questions across 12 categories, meticulously annotated by humans. We manually construct 300 positive and 300 leading negative questions across three levels: character, attribute, and context. Character-level questions prompt identifying elements like characters or numbers, while attribute-level questions focus on properties such as color, texture, and quantity. Context-level inquiries delve into higher-level concepts like emotions, culture, and common sense. The positive questions aim to evaluate the model's understanding ability, while the misleading ones challenge its resistance to interference.


🏁 Evaluation

Please refer to our evaluation folder for more details.

🏆 Leaderboard

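| Method | Avg. RA ↑ | Char/Num | Pres. | Color/Tex | Num. | Shape | Posture | Pos. | Abstract. | Concrete. | Expert. | Act. | Rel. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o 🥇 | 69.00 | 72.50 | 68.18 | 66.67 | 45.83 | 87.50 | 70.83 | 50.00 | 68.18 | 76.19 | 70.97 | 83.33 | 63.64 |
| Mini-Gemini-HD-34B 🥇 | 69.00 | 62.50 | 63.64 | 70.83 | 54.17 | 79.17 | 62.50 | 72.73 | 86.36 | 85.71 | 54.84 | 19.17 | 68.18 |
| LLaVA-1.6-34B 🥉 | 68.67 | 75.00 | 68.18 | 66.67 | 41.67 | 79.17 | 54.17 | 72.72 | 81.81 | 71.42 | 64.52 | 79.17 | 68.18 |
| Qwen-VL-max | 68.33 | 67.50 | 72.73 | 66.67 | 41.67 | 79.17 | 62.50 | 63.64 | 77.27 | 80.95 | 61.29 | 79.17 | 72.73 |
| Bunny-Llama-3-8B-V | 60.67 | 55.00 | 63.64 | 54.17 | 37.50 | 79.17 | 62.50 | 54.55 | 72.73 | 85.71 | 48.39 | 75.00 | 50.00 |
| InternVL-Chat-V1-5 (26B) | 59.67 | 62.50 | 59.09 | 66.67 | 41.67 | 66.67 | 41.67 | 54.55 | 63.64 | 66.67 | 45.16 | 79.17 | 72.73 |
| Yi-VL-34B | 58.33 | 52.50 | 63.64 | 70.83 | 41.67 | 75.00 | 37.50 | 59.09 | 68.18 | 57.14 | 48.39 | 70.83 | 63.64 |
| Bunny-MMR-3B | 58.33 | 60.00 | 59.09 | 58.33 | 25.00 | 83.33 | 50.00 | 54.55 | 68.18 | 57.14 | 51.61 | 79.17 | 54.55 |
| Idefics2-8B | 56.67 | 57.50 | 59.09 | 54.17 | 50.00 | 79.17 | 41.67 | 27.27 | 77.27 | 76.19 | 45.16 | 75.00 | 40.91 |
| Cogvlm2-llama3 | 54.00 | 60.00 | 63.64 | 54.17 | 37.50 | 70.83 | 33.33 | 40.91 | 50.00 | 85.71 | 41.94 | 62.50 | 50.00 |
| Step-1V | 53.33 | 60.00 | 54.55 | 58.33 | 20.83 | 70.83 | 54.17 | 31.82 | 54.55 | 57.14 | 45.16 | 79.17 | 50.00 |
| Phi-3-vision (4B) | 52.33 | 62.50 | 59.09 | 58.33 | 37.50 | 70.83 | 33.33 | 31.82 | 54.55 | 66.67 | 41.94 | 58.33 | 50.00 |
| Glm-4V | 50.00 | 60.00 | 54.55 | 54.17 | 29.17 | 58.33 | 41.67 | 27.27 | 72.73 | 47.62 | 35.48 | 70.83 | 45.45 |
| Gemini-pro-vision | 48.67 | 42.50 | 50.00 | 41.67 | 25.00 | 83.33 | 50.00 | 45.45 | 40.91 | 47.62 | 45.16 | 70.83 | 45.45 |
| Deepseek-VL-7B-Chat | 47.67 | 52.50 | 54.55 | 54.17 | 37.50 | 62.50 | 25.00 | 18.18 | 54.55 | 52.38 | 35.48 | 75.00 | 50.00 |
| Mplug-owl2-llama2-7B | 41.33 | 32.50 | 63.64 | 58.33 | 20.83 | 62.50 | 37.50 | 13.64 | 54.55 | 47.62 | 25.81 | 58.33 | 31.82 |
| MiniCPM-Llama3-V | 40.33 | 37.50 | 45.45 | 50.00 | 16.67 | 41.67 | 37.50 | 36.36 | 68.18 | 33.33 | 29.03 | 41.67 | 54.55 |
| LLaVA-RLHF (7B) | 30.67 | 7.50 | 36.36 | 33.33 | 33.33 | 50.00 | 16.67 | 9.09 | 59.09 | 38.10 | 22.58 | 50.00 | 31.82 |
| Claude3-Opus-V | 28.67 | 35.00 | 22.73 | 12.50 | 16.67 | 33.33 | 16.67 | 22.73 | 45.45 | 33.33 | 25.81 | 37.50 | 40.91 |

| Method | Avg. MR ↓ | Char/Num | Pres. | Color/Tex | Num. | Shape | Posture | Pos. | Abstract. | Concrete. | Expert. | Act. | Rel. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mini-Gemini-HD-34B 🥇 | 15.16 | 21.88 | 12.50 | 10.53 | 7.14 | 5.00 | 28.57 | 15.79 | 9.52 | 5.26 | 32.00 | 9.52 | 11.76 |
| LLaVA-1.6-34B 🥈 | 16.26 | 6.25 | 11.76 | 20.00 | 23.08 | 9.52 | 35.00 | 11.11 | 14.28 | 25.00 | 20.00 | 9.52 | 16.67 |
| GPT-4o 🥉 | 19.46 | 9.38 | 16.67 | 23.81 | 26.67 | 4.55 | 19.05 | 38.89 | 28.57 | 15.79 | 24.14 | 13.04 | 22.22 |
| Qwen-VL-max | 20.23 | 22.86 | 11.11 | 23.81 | 28.57 | 5.00 | 25.00 | 30.00 | 19.05 | 19.05 | 29.63 | 9.52 | 15.79 |
| Bunny-Llama-3-8B-V | 22.22 | 15.38 | 22.22 | 18.75 | 40.00 | 5.00 | 28.57 | 29.41 | 23.81 | 10.00 | 40.00 | 10.00 | 26.67 |
| Bunny-MMR-3B | 23.91 | 11.11 | 13.33 | 26.32 | 53.85 | 4.76 | 40.00 | 29.41 | 28.57 | 33.33 | 33.33 | 9.52 | 14.29 |
| Idefics2-8B | 26.72 | 23.33 | 27.78 | 23.53 | 20.00 | 13.64 | 50.00 | 40.00 | 22.73 | 11.11 | 41.67 | 14.29 | 40.00 |
| Yi-VL-34B | 27.39 | 27.59 | 22.22 | 15.00 | 28.57 | 10.00 | 50.00 | 27.78 | 16.67 | 42.86 | 42.31 | 22.73 | 17.65 |
| InternVL-Chat-V1-5 (26B) | 28.97 | 21.88 | 18.75 | 23.81 | 37.50 | 27.27 | 52.38 | 29.41 | 33.33 | 26.32 | 44.00 | 13.64 | 20.00 |
| Step-1V | 30.43 | 14.29 | 25.00 | 26.32 | 61.54 | 5.56 | 40.91 | 61.11 | 33.33 | 33.33 | 44.00 | 9.52 | 21.43 |
| Cogvlm2-llama3 | 33.06 | 22.58 | 22.22 | 27.78 | 30.77 | 15.00 | 61.90 | 35.71 | 45.00 | 14.29 | 51.85 | 28.57 | 38.89 |
| Phi-3-vision (4B) | 34.03 | 19.35 | 18.75 | 26.32 | 43.75 | 19.05 | 55.56 | 58.82 | 40.00 | 22.22 | 48.00 | 33.33 | 31.25 |
| Gemini-pro-vision | 34.82 | 29.17 | 31.25 | 41.18 | 45.45 | 13.04 | 40.00 | 33.33 | 52.63 | 44.44 | 48.15 | 19.05 | 23.08 |
| Glm-4V | 38.78 | 27.27 | 36.84 | 35.00 | 56.25 | 33.33 | 52.38 | 53.85 | 20.00 | 47.37 | 57.69 | 19.05 | 37.50 |
| Deepseek-VL-7B-Chat | 42.34 | 30.00 | 20.00 | 27.78 | 43.75 | 31.82 | 71.43 | 77.78 | 45.45 | 47.62 | 57.69 | 14.29 | 38.89 |
| Mplug-owl2-llama2-7B | 42.86 | 38.10 | 17.65 | 22.22 | 61.54 | 25.00 | 57.14 | 76.92 | 40.00 | 47.37 | 65.22 | 26.32 | 46.15 |
| LLaVA-RLHF (7B) | 57.01 | 86.36 | 50.00 | 50.00 | 46.67 | 40.00 | 78.95 | 81.82 | 38.10 | 57.89 | 68.18 | 29.41 | 56.25 |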
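
The leaderboard reports two numbers per model, RA and MR. This README does not spell out how they are computed, so the sketch below rests on an assumption taken from the paper: robustness accuracy (RA) is the share of question pairs where both the positive and the leading negative question are answered correctly, and misleading rate (MR) is the share of pairs answered correctly on the positive question that flip to wrong on the negative one. The function name and the input format are hypothetical and are not part of the repository's evaluation code.

```python
from typing import Iterable, Tuple


def mmr_metrics(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (RA, MR) in percent.

    Each pair is (positive_correct, negative_correct) for one question pair.
    Assumed definitions (not stated in this README):
      RA = N_TT / N_total
      MR = N_TF / (N_TT + N_TF)
    """
    pairs = list(pairs)
    n_tt = sum(1 for pos, neg in pairs if pos and neg)      # both answers correct
    n_tf = sum(1 for pos, neg in pairs if pos and not neg)  # misled by the leading question
    ra = 100.0 * n_tt / len(pairs) if pairs else 0.0
    mr = 100.0 * n_tf / (n_tt + n_tf) if (n_tt + n_tf) else 0.0
    return ra, mr


# Hypothetical outcomes for four question pairs.
outcomes = [(True, True), (True, False), (True, True), (False, False)]
ra, mr = mmr_metrics(outcomes)
print(f"RA = {ra:.2f}, MR = {mr:.2f}")  # RA = 50.00, MR = 33.33
```

Under these definitions a higher RA and a lower MR are better, which matches the arrows in the table headers.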

🚩 MMR-data

To enhance MLLMs' understanding capability and robustness, we propose a data construction method using GPT-4V to generate paired positive and negative samples for instruction tuning. The method includes three steps: 1) Information extraction. We implicitly and comprehensively extract detailed information from images, including text, object attributes, human characteristics, relationships between objects, relationships between people, events, and overall perception. 2) Instruction tuning data generation. We generate positive samples using the extracted information and construct negative samples that directly contradict the positive ones. 3) Sample filtering. We filter samples through keyword matching to remove those with uncertain answers and redundant phrases.


Data generation

# 1) Generate samples from the images (requires an API key)
python dataset/data_generation.py \
      --input_file /path/to/input.json \
      --output_file /path/to/output.json \
      --image_folder /path/to/image_folder \
      --api_key api_key

# 2) Reformat the output into positive, negative, and merged files
python dataset/data_reformat.py \
      --input /path/to/input.json \
      --output_pos /path/to/output_pos.json \
      --output_neg /path/to/output_neg.json \
      --output_merge /path/to/merged_output.json

# 3) Filter the samples
python dataset/data_filtering.py \
      --input /path/to/input.json \
      --output /path/to/output.json
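
As a rough illustration of the keyword-matching filter described in step 3, the sketch below drops samples whose answers contain an uncertainty phrase. It is not the repository's `data_filtering.py`: the marker phrases, the `question`/`answer` field names, and the function names are all assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical marker phrases; the repository's actual keyword list may differ.
UNCERTAIN_MARKERS = ("not sure", "cannot determine", "unclear", "difficult to tell")


def keep_sample(sample: Dict[str, str]) -> bool:
    """Keep a sample only if its answer contains none of the marker phrases."""
    answer = sample["answer"].lower()
    return not any(marker in answer for marker in UNCERTAIN_MARKERS)


def filter_samples(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the samples that pass the keyword filter, in their original order."""
    return [sample for sample in samples if keep_sample(sample)]


# A positive sample, a negative sample that contradicts it, and an uncertain one.
samples = [
    {"question": "What color is the car?", "answer": "The car is red."},
    {"question": "Is the car blue?", "answer": "No, the car is red, not blue."},
    {"question": "What brand is the car?", "answer": "It is unclear from the image."},
]
kept = filter_samples(samples)
print(len(kept))  # 2
```

Plain substring matching is cheap and easy to audit, but it can also drop valid answers that happen to contain a marker phrase, so the keyword list needs care.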

🤖 Training

| Checkpoint | Vision Encoder | LLM | Pretrain lr | Pretrain weights |
|---|---|---|---|---|
| Bunny-MMR-3B | siglip-so400m-patch14-384 | microsoft/phi-2 | 5e-4 | bunny-pretrain-phi-2-siglip |
| Bunny-MMR-4B | siglip-so400m-patch14-384 | microsoft/Phi-3-mini-4k-instruct | 1e-3 | bunny-pretrain-phi-3-siglip |
| Bunny-MMR-8B | siglip-so400m-patch14-384 | meta-llama/Meta-Llama-3-8B-Instruct | 1e-3 | bunny-pretrain-llama3-8b-siglip |

🌟 Quickstart

The following snippet shows how to use the model with transformers.

Before running the snippet, you need to install the following dependencies:

pip install torch transformers accelerate pillow

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cpu')  # or 'cuda'

offset_bos = 1 # for Bunny-MMR-8B and AI4VR/Bunny-MMR-4B
# offset_bos = 0 for Bunny-MMR-3B

# create model
model = AutoModelForCausalLM.from_pretrained(
    'AI4VR/Bunny-MMR-8B', # or 'AI4VR/Bunny-MMR-3B' or 'AI4VR/Bunny-MMR-4B'.
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'AI4VR/Bunny-MMR-8B', # or 'AI4VR/Bunny-MMR-3B' or 'AI4VR/Bunny-MMR-4B'.
    trust_remote_code=True)

# text prompt
prompt = 'text prompt'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0)

# image input
image = Image.open('path/to/image')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
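
The prompt handling in the snippet above splits the text at `<image>`, tokenizes each chunk, and joins the chunks around the sentinel id `-200`. The toy example below isolates that splice to show what `offset_bos` does. The token ids and the function name are hypothetical, and the reading that `offset_bos = 1` compensates for a tokenizer that prepends a BOS token to every chunk is an assumption, not something this README states.

```python
from typing import List

IMAGE_TOKEN_INDEX = -200  # sentinel id used for the image in the snippet above


def splice_image_token(chunks: List[List[int]], offset_bos: int) -> List[int]:
    """Join the token ids before and after <image> around the image sentinel.

    offset_bos drops that many leading ids from the second chunk, so a BOS
    token added by the tokenizer does not appear again mid-sequence.
    """
    before, after = chunks
    return before + [IMAGE_TOKEN_INDEX] + after[offset_bos:]


# Toy token ids (hypothetical); 1 stands in for a BOS token.
with_bos = [[1, 11, 12], [1, 21, 22]]  # tokenizer prepends BOS to each chunk
without_bos = [[11, 12], [21, 22]]     # tokenizer adds no BOS

print(splice_image_token(with_bos, offset_bos=1))     # [1, 11, 12, -200, 21, 22]
print(splice_image_token(without_bos, offset_bos=0))  # [11, 12, -200, 21, 22]
```

With the wrong offset, the second chunk would either keep a stray BOS id after the image or lose its first real token, which is why the value depends on the checkpoint.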

🔗 Citation

If you find this repository helpful, please cite the paper below.

@misc{liu2024seeing,
    title={Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions},
    author={Yexin Liu and Zhengyang Liang and Yueze Wang and Muyang He and Jian Li and Bo Zhao},
    year={2024},
    eprint={2406.10638},
    archivePrefix={arXiv},
}

🧾 License


The project employs specific datasets and checkpoints that are governed by their original licenses. Users must adhere to all terms and conditions outlined in these licenses. The checkpoints are restricted to uses that comply with the license agreements of Bunny, LLaMA 3, Phi-2, Phi-3, and GPT-4. The dataset is provided under the CC-BY-4.0 license.

📫 Acknowledgement