<h1 align='center' style="text-align:center; font-weight:bold; font-size:2.0em;letter-spacing:2.0px;"> Visual Adversarial Examples Jailbreak<br>Aligned Large Language Models </h1> <p align='center' style="text-align:center;font-size:1.25em;"> <a href="https://unispac.github.io/" target="_blank" style="text-decoration: none;">Xiangyu Qi<sup>1,*</sup></a>&nbsp;,&nbsp; <a href="https://hackyhuang.github.io/" target="_blank" style="text-decoration: none;">Kaixuan Huang<sup>1,*</sup></a>&nbsp;,&nbsp; <a href="https://scholar.google.com/citations?user=rFC3l6YAAAAJ&hl=en" target="_blank" style="text-decoration: none;">Ashwinee Panda<sup>1</sup></a><br> <a href="https://www.peterhenderson.co/" target="_blank" style="text-decoration: none;">Peter Henderson<sup>2</sup></a>&nbsp;,&nbsp; <a href="https://mwang.princeton.edu/" target="_blank" style="text-decoration: none;">Mengdi Wang<sup>1</sup></a>&nbsp;,&nbsp; <a href="https://www.princeton.edu/~pmittal/" target="_blank" style="text-decoration: none;">Prateek Mittal<sup>1</sup></a>&nbsp;&nbsp; <br/> <sup>*</sup>Equal Contribution<br> Princeton University<sup>1</sup>&nbsp;&nbsp;&nbsp;&nbsp;Stanford University<sup>2</sup><br/> </p> <p align='center'> <b> <em>AAAI (Oral), 2024</em> <br> </b> </p> <p align='center' style="text-align:center;font-size:2.5em;"> <b> <a href="https://arxiv.org/abs/2306.13213" target="_blank" style="text-decoration: none;">arXiv</a>&nbsp; </b> </p>

Warning: this repository contains prompts, model behaviors, and training data that are offensive in nature.

<br> <br>


Overview: A single visual adversarial example can jailbreak MiniGPT-4.

Note

  1. For each instruction below, we sample 100 random outputs and compute the refusal and obedience ratios via manual inspection. A representative (redacted) output is shown for each instruction.
  2. We use ɛ = 16/255 in the following demo.

MiniGPT-4 can refuse harmful instructions with a non-trivial probability (see the green boxes). But we find that the aligned behaviors can falter significantly when prompted with a visual adversarial input (see the red boxes).

In the above example, we optimize the adversarial example x' on a small, manually curated corpus comprising derogatory content against a certain <gender-1>, an ethnic group <race-1>, and the human race, so as to directly maximize the model's probability of generating such content.
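
Concretely, this optimization is a projected-gradient-style attack: minimize the model's loss (the negative log-likelihood of the target corpus) with respect to the image, while keeping the perturbation inside an L-infinity ball around a clean image. The sketch below is a minimal, generic illustration of that idea, not the repository's minigpt_visual_attack.py; toy_loss stands in for the VLM's corpus loss, and all names and hyperparameters are illustrative.

import torch

def pgd_linf(loss_fn, clean_image, eps=16/255, alpha=1/255, n_iters=500):
    # Minimize loss_fn(image) within an L-infinity ball of radius eps around
    # clean_image, keeping pixel values in [0, 1].
    adv = clean_image.clone().detach()
    for _ in range(n_iters):
        adv.requires_grad_(True)
        loss = loss_fn(adv)  # stands in for the corpus negative log-likelihood
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                           # signed gradient descent step
            adv = clean_image + (adv - clean_image).clamp(-eps, eps)  # project back into the eps-ball
            adv = adv.clamp(0.0, 1.0)                                 # stay in the valid pixel range
        adv = adv.detach()
    return adv

# Toy stand-in for "negative log-likelihood of the derogatory corpus given the
# image"; a real attack would run the VLM forward pass here instead.
target = torch.rand(1, 3, 224, 224)
toy_loss = lambda img: ((img - target) ** 2).mean()

clean = torch.rand(1, 3, 224, 224)
x_adv = pgd_linf(toy_loss, clean, eps=16/255, alpha=1/255, n_iters=100)
print((x_adv - clean).abs().max())  # never exceeds 16/255 by construction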

Though the scope of this corpus is very narrow, a single such adversarial example surprisingly enables the model to heed a wide range of harmful instructions and produce harmful content far beyond merely imitating the derogatory corpus used in the optimization (see the following examples).

Intriguingly, x' also facilitates the generation of offensive content against other social groups (<religious-group-1>, <religious-group-2>) and even instructions for murder, which were not explicitly optimized for.

In the folder adversarial_images/, we provide our sample adversarial images under different distortion constraints. The effectiveness of our adversarial examples can be verified using the MiniGPT-4 interface running in the Hugging Face space https://huggingface.co/spaces/Vision-CAIR/minigpt4.


<br> <br>

Step-by-Step Instructions for Reimplementing Our Experiments on MiniGPT-4

Note: a single A100 (80 GB) GPU is sufficient to run the following experiments.

<br>

Installation

We take MiniGPT-4 (13B) as the sandbox to showcase our attacks. The following installation instructions are adapted from the MiniGPT-4 repository.

1. Set up the environment

git clone https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models.git

cd Visual-Adversarial-Examples-Jailbreak-Large-Language-Models

conda env create -f environment.yml
conda activate minigpt4

2. Prepare the pretrained weights for MiniGPT-4

Since we directly inherit the MiniGPT-4 codebase, the guide from the MiniGPT-4 repository can be followed directly to obtain all the weights.
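
Optionally, once the weights are downloaded, a quick sanity check is to load the MiniGPT-4 checkpoint and list a few of its entries. The path and the "model" key below are assumptions for illustration; use whatever location and layout the MiniGPT-4 guide gives you.

import torch

ckpt_path = "pretrained_minigpt4.pth"   # hypothetical path to the downloaded checkpoint
ckpt = torch.load(ckpt_path, map_location="cpu")
state = ckpt.get("model", ckpt)         # some checkpoints wrap the weights in a "model" key
print(f"{len(state)} entries; first few:")
for name, value in list(state.items())[:5]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(" ", name, shape)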

<br>

Generate Visual Adversarial Examples

The following command generates a visual adversarial example under a distortion constraint of ɛ = 16/255 (similar to the example in our overview demo). The final adversarial example is saved to $save_dir/bad_prompt.bmp, and intermediate checkpoints are saved every 100 iterations.

The --eps argument can be adjusted (e.g., to 32, 64, or 128) to evaluate the effectiveness of the attacks under different distortion budgets.

python minigpt_visual_attack.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --n_iters 5000 --constrained --eps 16 --alpha 1 --save_dir visual_constrained_eps_16
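
After the attack finishes, you can check that the saved image indeed respects the distortion budget. The sketch below assumes the constrained attack started from a clean image you have on disk (clean_path is a placeholder) and that both images share the same resolution.

import numpy as np
from PIL import Image

clean_path = "clean.jpeg"                              # placeholder: the clean starting image
adv_path = "visual_constrained_eps_16/bad_prompt.bmp"  # output of the command above

clean = np.asarray(Image.open(clean_path).convert("RGB"), dtype=np.float32) / 255.0
adv = np.asarray(Image.open(adv_path).convert("RGB"), dtype=np.float32) / 255.0

linf = np.abs(adv - clean).max()
print(f"L-infinity distortion: {linf:.4f} (budget 16/255 = {16/255:.4f})")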

When visual stealthiness is not required, the following command runs an unconstrained attack (the adversarial image may take any values within the legitimate range of pixel values).

python minigpt_visual_attack.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --n_iters 5000  --alpha 1 --save_dir visual_unconstrained
<br>

Evaluation

In the folder adversarial_images/, we provide off-the-shelf adversarial images that we generated under different distortion constraints.

To verify the effectiveness of our adversarial examples, feed them to MiniGPT-4 together with harmful instructions and compare the responses with those obtained on a clean image; for a quick check, the Hugging Face demo linked in the overview above can be used.

<br>

Generate Textual Adversarial Examples

We also provide code for optimizing adversarial text tokens w.r.t. the same attack targets as our visual attacks. A running example:

python minigpt_textual_attack.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --n_iters 5000 --n_candidates 50 --save_dir textual_unconstrained
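
For intuition, the snippet below illustrates the general idea of token-level attacks with a deliberately simplified random-search variant on GPT-2: repeatedly propose candidate token substitutions and keep whichever most increases the likelihood of a target string. The repository's minigpt_textual_attack.py operates on MiniGPT-4's language model with its own candidate-search strategy; the target string and hyperparameters here are purely illustrative.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")       # small stand-in LM, not MiniGPT-4
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

target = "Sure, here is how to"                   # hypothetical attack target
target_ids = tok(target, return_tensors="pt").input_ids[0]

def target_nll(adv_ids):
    # Negative log-likelihood of the target string given the adversarial token prefix.
    ids = torch.cat([adv_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logprobs = model(ids).logits[0, :-1].log_softmax(-1)
    start = adv_ids.numel() - 1                   # logits at position t predict token t + 1
    return -logprobs[start:].gather(1, target_ids.unsqueeze(1)).sum()

torch.manual_seed(0)
n_tokens, n_iters, n_candidates = 8, 50, 50
adv_ids = torch.randint(0, tok.vocab_size, (n_tokens,))
best = target_nll(adv_ids)

for _ in range(n_iters):
    pos = torch.randint(0, n_tokens, (1,)).item()             # position to modify this round
    for cand_id in torch.randint(0, tok.vocab_size, (n_candidates,)):
        cand = adv_ids.clone()
        cand[pos] = cand_id
        loss = target_nll(cand)
        if loss < best:                                        # keep the best substitution found
            best, adv_ids = loss, cand

print("adversarial prefix:", tok.decode(adv_ids))
print("target NLL:", best.item())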

<br><br>

Attacks on Other Models

We also implement our attacks on two other open-source VLMs: InstructBLIP and LLaVA. To launch experiments on these models, we suggest creating a separate conda environment for each model and installing the dependencies following the instructions in their original repositories.

InstructBLIP

LLaVA-LLaMA-2

<br><br>

Citation

If you find this useful in your research, please consider citing:

@misc{qi2023visual,
      title={Visual Adversarial Examples Jailbreak Aligned Large Language Models}, 
      author={Xiangyu Qi and Kaixuan Huang and Ashwinee Panda and Peter Henderson and Mengdi Wang and Prateek Mittal},
      year={2023},
      eprint={2306.13213},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}