How many <img src="assets/unicorn.png" width="36"> Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

Haoqin Tu*, Chenhang Cui*, Zijun Wang*, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie (*equal contribution)

Our paper is online now: https://arxiv.org/abs/2311.16101

Each VLLM has its own dependencies; please refer to the specific environment of the model you want to evaluate for installation.


We host our datasets here, containing both OOD and redteaming attack datasets. The full dataset should look like this:

├── ./safety_evaluation_benchmark_datasets/
    ├── gpt4v_challenging_set # Contains the challenging test data for GPT4V
        ├── attack_images
        ├── sketchy_images
        ├── oodcv_images
        ├── misleading-attack.json
        ├── sketchy-vqa-challenging.json
        └── oodcv-vqa-counterfactual.json
    ├── redteaming # Contains the test data for redteaming tasks
        ├── misleading_attack
            ├── gaussian_noise
            ├── mixattack_eps32
            ├── mixattack_eps64
            ├── sinattack_eps64_dog
            ├── sinattack_eps64_coconut
            ├── sinattack_eps64_spaceship
            └── annotation.json
        ├── jailbreak_vit # adversarial images for jailbreaking VLLM through ViT
        └── jailbreak_llm # adversarial suffixes for jailbreaking VLLM through LLM
    └── ood # Contains the test data for OOD scenarios
        ├── sketchy-vqa
            ├── sketchy-vqa.json
            └── sketchy-challenging.json
        └── oodcv-vqa
            ├── oodcv-vqa.json
            └── oodcv-counterfactual.json
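
After downloading, it can help to confirm that the layout matches the tree above before running any evaluation. The following is a minimal sketch, not part of the repository: the helper name `find_missing` and the example root path are hypothetical, and the list only covers the annotation files and folders named in the tree.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_PATHS = [
    "gpt4v_challenging_set/misleading-attack.json",
    "gpt4v_challenging_set/sketchy-vqa-challenging.json",
    "gpt4v_challenging_set/oodcv-vqa-counterfactual.json",
    "redteaming/misleading_attack/annotation.json",
    "redteaming/jailbreak_vit",
    "redteaming/jailbreak_llm",
    "ood/sketchy-vqa/sketchy-vqa.json",
    "ood/sketchy-vqa/sketchy-challenging.json",
    "ood/oodcv-vqa/oodcv-vqa.json",
    "ood/oodcv-vqa/oodcv-counterfactual.json",
]


def find_missing(data_dir):
    """Return the expected relative paths that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    # Hypothetical location; replace it with your own download directory.
    missing = find_missing("./safety_evaluation_benchmark_datasets")
    print("missing entries:", missing or "none")
```

An empty result means every listed file and folder is present; it does not check the contents of the image folders.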

Out-of-Distribution Scenario

For $\texttt{OODCV-VQA}$ and its counterfactual version, please download images from OODCV, and put all images in ood/oodcv-vqa.

For $\texttt{Sketchy-VQA}$ and its challenging version, please first download images from here, put the zip file into ood/sketchy-vqa/skechydata/, then unzip it.

Redteaming Attack

For the proposed misleading attack, the full dataset and all trained adversarial examples are in redteaming/misleading_attack. This includes images with Gaussian noise, as well as Sin.Attack and MixAttack examples with perturbation budgets of $\epsilon=32/255$ (eps32) or $\epsilon=64/255$ (eps64).
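
To make the two budgets concrete, here is a small sketch of what a budget of $\epsilon=32/255$ or $\epsilon=64/255$ means for individual pixels. It assumes the common convention that $\epsilon$ bounds the per-pixel change (an L-infinity bound) on images scaled to [0, 1]; this convention and the helper `project_to_budget` are assumptions for illustration, not a description of the attack code in this repository.

```python
EPS32 = 32 / 255  # about 0.125, i.e. 32 intensity levels on a 0-255 scale
EPS64 = 64 / 255  # about 0.251, i.e. 64 intensity levels on a 0-255 scale


def project_to_budget(clean, perturbed, eps):
    """Clamp each perturbed pixel to within eps of the clean pixel and to [0, 1]."""
    projected = []
    for c, p in zip(clean, perturbed):
        p = min(max(p, c - eps), c + eps)  # stay inside the eps-ball
        projected.append(min(max(p, 0.0), 1.0))  # stay a valid pixel value
    return projected


if __name__ == "__main__":
    clean = [0.5, 0.0, 1.0]
    perturbed = [0.9, -0.3, 0.95]
    print(project_to_budget(clean, perturbed, EPS32))
```

A larger budget allows a stronger but more visible perturbation, which is why the eps64 variants are the more aggressive ones.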

For jailbreaking methods, please refer to their respective repositories for more dataset details: Jailbreak through ViT, Jailbreak through LLM.


Before you start, make sure you have set CACHE_DIR (where you store all your model weights) and DATA_DIR (where you store the benchmark data) in baselines/config.json according to your local environment.
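
A quick sanity check of the configuration can catch path mistakes early. The sketch below assumes that baselines/config.json is a flat JSON object with top-level CACHE_DIR and DATA_DIR keys; the actual file may be organized differently, and `check_config` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("CACHE_DIR", "DATA_DIR")


def check_config(config_path):
    """Return a list of human-readable problems found in the config file."""
    config = json.loads(Path(config_path).read_text())
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if not value:
            problems.append(f"{key} is missing or empty")
        elif not Path(value).is_dir():
            problems.append(f"{key} points to a non-existent directory: {value}")
    return problems


if __name__ == "__main__":
    path = Path("baselines/config.json")
    if path.exists():
        print(check_config(path) or "config looks fine")
    else:
        print(f"{path} not found; run this from the repository root")
```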

cd baselines
python ../model_testing_zoo.py --model_name LLaVA

Choose --model_name from ["LlamaAdapterV2", "MiniGPT4", "MiniGPT4v2", "LLaVA", "mPLUGOwl", "mPLUGOwl2", "PandaGPT", "InstructBLIP2", "Flamingo", "LLaVAv1.5", "LLaVAv1.5-13B", "LLaVA_llama2-13B", "MiniGPT4_llama2", "Qwen-VL-Chat", "MiniGPT4_13B", "InstructBLIP2-FlanT5-xl", "InstructBLIP2-FlanT5-xxl", "InstructBLIP2-13B", "CogVLM", "Fuyu", "InternLM"].
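
Model names are case-sensitive strings, so a typo such as `llava` is easy to make. As an illustrative sketch (the helper `resolve_model_name` is hypothetical and not part of the repository), the list above can be used to validate a name before launching a long run:

```python
import difflib

# Names copied from the list above.
SUPPORTED_MODELS = [
    "LlamaAdapterV2", "MiniGPT4", "MiniGPT4v2", "LLaVA", "mPLUGOwl",
    "mPLUGOwl2", "PandaGPT", "InstructBLIP2", "Flamingo", "LLaVAv1.5",
    "LLaVAv1.5-13B", "LLaVA_llama2-13B", "MiniGPT4_llama2", "Qwen-VL-Chat",
    "MiniGPT4_13B", "InstructBLIP2-FlanT5-xl", "InstructBLIP2-FlanT5-xxl",
    "InstructBLIP2-13B", "CogVLM", "Fuyu", "InternLM",
]


def resolve_model_name(name):
    """Return name if it is supported, else raise with close suggestions."""
    if name in SUPPORTED_MODELS:
        return name
    # Compare case-insensitively so that "llava" suggests "LLaVA".
    by_lower = {m.lower(): m for m in SUPPORTED_MODELS}
    close = difflib.get_close_matches(name.lower(), list(by_lower), n=3)
    hints = ", ".join(by_lower[c] for c in close) or "no close match"
    raise ValueError(f"unknown model name {name!r}; did you mean: {hints}?")


if __name__ == "__main__":
    print(resolve_model_name("LLaVA"))
```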

$\texttt{OODCV-VQA}$ and its Counterfactual Variant

For $\texttt{OODCV-VQA}$:

cd baselines
python ../safety_evaluations/ood_scenarios/evaluation.py --model_name LLaVA --eval_oodcv

For the counterfactual version:

cd baselines
python ../safety_evaluations/ood_scenarios/evaluation.py --model_name LLaVA --eval_oodcv_cf

$\texttt{Sketchy-VQA}$ and its Challenging Variant

For $\texttt{Sketchy-VQA}$:

cd baselines
python ../safety_evaluations/ood_scenarios/evaluation.py --model_name LLaVA --eval_sketch

For the challenging version:

cd baselines
python ../safety_evaluations/ood_scenarios/evaluation.py --model_name LLaVA --eval_sketch_challenging
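
The four OOD commands above differ only in their final flag, so they can be generated for any model name. The sketch below only builds and prints the command lines using the script path and flags shown above; `build_ood_commands` is a hypothetical helper, and the task labels are chosen here for readability.

```python
import shlex

EVAL_SCRIPT = "../safety_evaluations/ood_scenarios/evaluation.py"

# One flag per OOD task, copied from the commands above.
OOD_FLAGS = {
    "OODCV-VQA": "--eval_oodcv",
    "OODCV-VQA (counterfactual)": "--eval_oodcv_cf",
    "Sketchy-VQA": "--eval_sketch",
    "Sketchy-VQA (challenging)": "--eval_sketch_challenging",
}


def build_ood_commands(model_name):
    """Map each OOD task to its command, given as an argument list."""
    return {
        task: ["python", EVAL_SCRIPT, "--model_name", model_name, flag]
        for task, flag in OOD_FLAGS.items()
    }


if __name__ == "__main__":
    for task, cmd in build_ood_commands("LLaVA").items():
        print(f"# {task}")
        print(shlex.join(cmd))
```

To actually run a command rather than print it, pass the argument list to `subprocess.run(cmd, cwd="baselines", check=True)` from the repository root, which mirrors the `cd baselines` step.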

Misleading Attack

For training the misleading adversarial images:

cd safety_evaluations/redteaming/misleading_vision_attack

python misleading_vis_attack.py --lr 1e-3 --misleading_obj dog --input_folder path/to/attack-bard/NIPS2017 --output_folder ./misleading_adversarial_attack

Change --input_folder to the path of the adversarial examples you want to test. To use MixAttack, add the --mix_obj argument to the command.

For testing the VLLMs:

cd baselines

python ../safety_evaluations/redteaming/misleading_vision_attack/test_misleading.py --image_folder redteaming/misleading_attack/mixattack_eps64 --output_name misleading_attack_eps64 --human_annot_path redteaming/misleading_attack/annotation.json
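
The same test script can be pointed at each attack folder from the dataset tree in turn. The following sketch builds one command per folder using only the arguments shown above; the helper `build_test_command` and the `misleading_attack_<folder>` output naming are illustrative choices, not conventions of the repository.

```python
import shlex

TEST_SCRIPT = "../safety_evaluations/redteaming/misleading_vision_attack/test_misleading.py"
ATTACK_ROOT = "redteaming/misleading_attack"

# Folder names copied from the dataset tree.
ATTACK_FOLDERS = [
    "gaussian_noise",
    "mixattack_eps32",
    "mixattack_eps64",
    "sinattack_eps64_dog",
    "sinattack_eps64_coconut",
    "sinattack_eps64_spaceship",
]


def build_test_command(folder):
    """Build the test command (as an argument list) for one attack folder."""
    return [
        "python", TEST_SCRIPT,
        "--image_folder", f"{ATTACK_ROOT}/{folder}",
        "--output_name", f"misleading_attack_{folder}",  # illustrative naming
        "--human_annot_path", f"{ATTACK_ROOT}/annotation.json",
    ]


if __name__ == "__main__":
    for folder in ATTACK_FOLDERS:
        print(shlex.join(build_test_command(folder)))
```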

Jailbreaking Methods

Please refer to these two repositories for detailed attack settings: Jailbreak through ViT, Jailbreak through LLM. We provide our trained adversarial images (for jailbreaking through the ViT) and adversarial suffixes (for jailbreaking through the LLM) in redteaming/jailbreak_vit and redteaming/jailbreak_llm in the data folder.

Usage and License Notices

The data, code, and checkpoints are intended and licensed for research use only. The dataset is licensed under CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.


  title={How Many Unicorns Are In This Image? A Safety Evaluation Benchmark For Vision LLMs},
  author={Tu, Haoqin and Cui, Chenhang and Wang, Zijun and Zhou, Yiyang and Zhao, Bingchen and Han, Junlin and Zhou, Wangchunshu and Yao, Huaxiu and Xie, Cihang},
  journal={arXiv preprint arXiv:2311.16101},


This work is partially supported by a gift from Open Philanthropy. We thank the Center for AI Safety and Google Cloud for supporting our computing needs. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.