WildVision-Bench

Installation

As part of the WildVision Arena environment, please install the following dependencies:

conda create -n visionbench python==3.9
conda activate visionbench
pip install -e .

WVBench Image-Instruct Pair

🤗 WildVision-Bench

You can get the image-instruct pairs of WVBench-500 for generating your model answers by loading the benchmark data as below:

from datasets import load_dataset

wildbench_data = load_dataset('WildVision/wildvision-bench', name='vision_bench_0617', split='test')
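
As a quick sanity check, you can inspect the split size and the column names (a minimal snippet that avoids hard-coding any field names):

print(len(wildbench_data))          # number of image-instruct pairs (WVBench-500)
print(wildbench_data.column_names)  # list the available fields in the benchmark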

We have two versions of WildVision-Bench data: vision_bench_0617 and vision_bench_0701.

Note: For now, if you want to evaluate your model, please use the vision_bench_0617 version so that its performance can be fairly compared with the other models on the leaderboard below. We are preparing the leaderboard for vision_bench_0701 and will update it soon.

Evaluation

1. Generate model answers

An example model answers file is shown in data/vision_bench_0617/example_model_answers.jsonl. Your answers file needs to fill in the same fields.

We provide example inference scripts, gen_answers.py and run_vllm.py, for generating model answers. For example:

python run_vllm.py --tokenizer_mode "auto" --max_model_len 65536 --num_gpu 1 --model_name "meta-llama/Llama-3.2-11B-Vision-Instruct"
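
If you are not using vLLM, the answer-generation loop itself is simple. Below is a minimal, illustrative sketch: the bench column names and answer-file fields used here (question, image, question_id, model, answer) and the my-model.jsonl filename are placeholders, not the confirmed schema, so check example_model_answers.jsonl and gen_answers.py for the exact fields, and replace generate_answer with your model's inference call.

import json
from datasets import load_dataset

def generate_answer(image, instruction):
    # Stand-in for your model's inference call (see gen_answers.py / run_vllm.py).
    return "your model's answer"

bench = load_dataset('WildVision/wildvision-bench', name='vision_bench_0617', split='test')

with open('data/vision_bench_0617/model_answers/my-model.jsonl', 'w') as f:
    for idx, example in enumerate(bench):
        # Field names below are illustrative; match them to example_model_answers.jsonl.
        answer = generate_answer(example.get('image'), example.get('question'))
        record = {'question_id': idx, 'model': 'my-model', 'answer': answer}
        f.write(json.dumps(record) + '\n')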

2. Get judgements

First, go to config/judge_config.yaml and add the models you want to evaluate in the model_list field. For example:

# Add your model below for evaluation
model_list:
  - liuhaotian/llava-v1.6-vicuna-7b
  - openbmb/MiniCPM-V
  - deepseek-ai/deepseek-vl-7b-chat
  - BAAI/Bunny-v1_0-3B
  - Salesforce/instructblip-vicuna-7b

Then, run the following command:

python get_judgement.py

Results will be saved in data/release_bench_0617/model_judgements/judge_gpt-4o_reference_claude-3-sonnet-20240229/{model_name}.jsonl

3. Show the results

python show_results.py

You will then see the results formatted like the leaderboard below.

Using lmms-eval to evaluate

lmms-eval is a Python package that integrates inference and evaluation for many multimodal LLMs, and WildVision-Bench is one of its supported benchmarks. You can evaluate your model on WildVision-Bench as follows.

First, install lmms-eval:

pip install lmms-eval

Then, run the following command:

model_type=llava_hf
pretrained=llava-hf/llava-1.5-7b-hf
model_name=llava-1.5-7b-hf
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model $model_type \
    --model_args "pretrained=$pretrained" \
    --tasks wildvision_0617 \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix $model_name \
    --output_path ./logs/

Then, find the lmms-eval log directory for this evaluation and run the following commands to get the leaderboard:

python format_lmmseval_answers.py --lmmseval_log_dir {lmmseval_log_dir} --model_name {model_name}
python show_results.py

Leaderboard (vision_bench_0617)

| Model | Score | 95% CI | Win Rate | Reward | Much Better | Better | Tie | Worse | Much Worse | Avg Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-4o | 89.15 | (-1.9, 1.5) | 80.6% | 56.4 | 255 | 148 | 14 | 72 | 11 | 142 |
| gpt-4-vision-preview | 79.78 | (-2.9, 2.2) | 71.8% | 39.4 | 182 | 177 | 22 | 91 | 28 | 138 |
| aria | 74.1 | (-2.9, 3.2) | 68.0% | 28.8 | 88 | 252 | 47 | 86 | 27 | 185 |
| Reka-Flash | 64.65 | (-2.6, 2.7) | 58.8% | 18.9 | 135 | 159 | 28 | 116 | 62 | 168 |
| claude-3-opus-20240229 | 62.03 | (-3.7, 2.8) | 53.0% | 13.5 | 103 | 162 | 48 | 141 | 46 | 105 |
| yi-vl-plus | 55.05 | (-3.4, 2.3) | 52.8% | 7.2 | 98 | 166 | 29 | 124 | 83 | 140 |
| liuhaotian/llava-v1.6-34b | 51.89 | (-3.4, 3.8) | 49.2% | 2.5 | 90 | 156 | 26 | 145 | 83 | 153 |
| claude-3-sonnet-20240229 | 50.0 | (0.0, 0.0) | 0.2% | 0.1 | 0 | 1 | 499 | 0 | 0 | 114 |
| claude-3-haiku-20240307 | 37.83 | (-2.6, 2.8) | 30.6% | -16.5 | 54 | 99 | 47 | 228 | 72 | 89 |
| gemini-pro-vision | 35.57 | (-3.0, 3.2) | 32.6% | -21.0 | 80 | 83 | 27 | 167 | 143 | 68 |
| liuhaotian/llava-v1.6-vicuna-13b | 33.87 | (-2.9, 3.3) | 33.8% | -21.4 | 62 | 107 | 25 | 167 | 139 | 136 |
| deepseek-ai/deepseek-vl-7b-chat | 33.61 | (-3.3, 3.0) | 35.6% | -21.2 | 59 | 119 | 17 | 161 | 144 | 116 |
| THUDM/cogvlm-chat-hf | 32.01 | (-2.2, 3.0) | 30.6% | -26.4 | 75 | 78 | 15 | 172 | 160 | 61 |
| liuhaotian/llava-v1.6-vicuna-7b | 26.41 | (-3.3, 3.1) | 27.0% | -31.4 | 45 | 90 | 36 | 164 | 165 | 130 |
| idefics2-8b-chatty | 23.96 | (-2.2, 2.4) | 26.4% | -35.8 | 44 | 88 | 19 | 164 | 185 | 135 |
| Qwen/Qwen-VL-Chat | 18.08 | (-1.9, 2.2) | 19.6% | -47.9 | 42 | 56 | 15 | 155 | 232 | 69 |
| llava-1.5-7b-hf | 15.5 | (-2.4, 2.4) | 18.0% | -47.8 | 28 | 62 | 25 | 174 | 211 | 185 |
| liuhaotian/llava-v1.5-13b | 14.43 | (-1.7, 1.6) | 16.8% | -52.5 | 28 | 56 | 19 | 157 | 240 | 91 |
| BAAI/Bunny-v1_0-3B | 12.98 | (-2.0, 2.1) | 16.6% | -54.4 | 23 | 60 | 10 | 164 | 243 | 72 |
| openbmb/MiniCPM-V | 11.95 | (-2.4, 2.1) | 13.6% | -57.5 | 25 | 43 | 16 | 164 | 252 | 86 |
| bczhou/tiny-llava-v1-hf | 8.3 | (-1.6, 1.2) | 11.0% | -66.2 | 16 | 39 | 15 | 127 | 303 | 72 |
| unum-cloud/uform-gen2-qwen-500m | 7.81 | (-1.3, 1.7) | 10.8% | -68.5 | 16 | 38 | 11 | 115 | 320 | 92 |
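
For reference, the Win Rate and Reward columns can be derived from the five outcome counts (each row compares a model against the claude-3-sonnet-20240229 reference over the 500 benchmark pairs): win rate treats Much Better and Better as wins, and reward weights the outcomes at +100/+50/0/-50/-100. The helper below is an illustrative sketch of that arithmetic, not the code in show_results.py:

def summarize(much_better, better, tie, worse, much_worse):
    # Win rate: fraction of comparisons judged Much Better or Better.
    # Reward: outcomes weighted +100/+50/0/-50/-100, averaged over all comparisons.
    total = much_better + better + tie + worse + much_worse
    win_rate = 100.0 * (much_better + better) / total
    reward = (100 * much_better + 50 * better - 50 * worse - 100 * much_worse) / total
    return win_rate, reward

# gpt-4o row above: 255 / 148 / 14 / 72 / 11  ->  (80.6, 56.4)
print(summarize(255, 148, 14, 72, 11))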

Contributing to the leaderboard

If you want to contribute to the leaderboard, please follow the steps below:

  1. Fork this repository.
  2. Add your model's answers to the data/vision_bench_0617/model_answers/ directory.
  3. Add your model's judgements to the data/release_bench_0617/model_judgements/ directory.
  4. Run python show_results.py to generate the leaderboard.
  5. Copy the contents of elo_leaderboard.md and paste them into the "Leaderboard" section above.
  6. Create a pull request.

Acknowledgment

We thank LMSYS for their great work on https://chat.lmsys.org/. Our code base is adapted from https://github.com/lm-sys/arena-hard-auto.

Thanks to lmms-eval for integrating WildVision-Bench into their evaluation platform.

Citation

If you find this repository useful, please consider citing our paper and resources:


@article{lu2024wildvision,
  title={WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences},
  author={Lu, Yujie and Jiang, Dongfu and Chen, Wenhu and Wang, William Yang and Choi, Yejin and Lin, Bill Yuchen},
  publisher={NeurIPS},
  year={2024}
}
@misc{yujie2024wildvisionarena,
    title={WildVision Arena: Benchmarking Multimodal LLMs in the Wild},
    url={https://huggingface.co/spaces/WildVision/vision-arena/},
    author={Lu, Yujie and Jiang, Dongfu and Chen, Hui and Ma, Yingzi and Gu, Jing and Xiao, Chaowei and Chen, Wenhu and Wang, William and Choi, Yejin and Lin, Bill Yuchen},
    year={2024}
}
@misc{yujie2024wildvisionv2,
    title={WildVision Data and Model},
    url={https://huggingface.co/WildVision},
    author={Lu, Yujie* and Jiang, Dongfu* and Chen, Hui* and Fu, Xingyu and Ma, Yingzi and Gu, Jing and Saxon, Michael and Xiao, Chaowei and Chen, Wenhu and Choi, Yejin and Lin, Bill Yuchen and Eckstein, Miguel and Wang, William},
    year={2024}
}