Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

This repo provides the source code & data of our paper: Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models.

@article{Li-HADES-2024,
  author       = {Yifan Li and Hangyu Guo and Kun Zhou and Wayne Xin Zhao and Ji{-}Rong Wen},
  title        = {Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models},
  journal      = {CoRR},
  volume       = {abs/2403.09792},
  year         = {2024}
}

Overview

In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that image input introduces an alignment vulnerability in MLLMs. Inspired by this, we propose a novel jailbreak method named HADES, which uses meticulously crafted images to hide and amplify the harmfulness of the malicious intent in the text input. Experimental results show that HADES can effectively jailbreak existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision.

model_figure

Update

Dataset

We are excited to release the HADES-generated data used in our paper, which can be accessed at this link. You can use this data to reproduce our experimental results, or use HADES to build your own data by following the guidance in the sections below. The dataset is structured as follows:

Safety Declaration:

While we release this dataset for research purposes, we emphasize the importance of using it responsibly and ethically. The data contained within the HADES dataset may depict or pertain to sensitive or harmful subjects. By accessing the HADES dataset, you agree not to use the data for any illegal or harmful activities. Please ensure responsible and ethical use at all times.

Preparation

HADES is built on LLaVA 1.5 and PixArt, and its evaluation is based on Beaver-dam-7b. You can download the corresponding weights from the following Hugging Face repositories by cloning them with git-lfs.

HADES Base: LLaVA 1.5 Weights | HADES Base: PixArt XL 2-1024-MS Weights | Evaluation Base: Beaver-dam-7b Weights
Download | Download | Download

Then copy the downloaded weights folder to ./checkpoint.

Harmful Data Collection using HADES

  1. Generating harmful text instructions
bash ./generate_benchmark.sh
  2. Amplifying image harmfulness with LLMs
bash ./amplifying_toxic.sh
  3. Amplifying image harmfulness with gradient update
bash ./white_box.sh
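
If you prefer to run the three steps above in one go, the following is a minimal sketch of a wrapper. It assumes the steps are meant to run in the listed order and that each script signals failure with a non-zero exit code; the `run_pipeline` helper and the `STEPS` constant are hypothetical names for illustration and are not part of this repository.

```python
import subprocess
from typing import Callable, List, Sequence

# Script names are taken from the three steps listed above.
STEPS = (
    "./generate_benchmark.sh",
    "./amplifying_toxic.sh",
    "./white_box.sh",
)


def run_pipeline(
    steps: Sequence[str] = STEPS,
    runner: Callable[..., object] = subprocess.run,
) -> List[str]:
    """Run each step with bash, in order, and return the steps that finished.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit code, so a failed step stops the pipeline before the next one starts.
    """
    completed = []
    for script in steps:
        runner(["bash", script], check=True)
        completed.append(script)
    return completed


if __name__ == "__main__":
    for name in run_pipeline():
        print(f"finished {name}")
```

The `runner` parameter exists only so the ordering logic can be exercised without launching the real scripts.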

Evaluation

Now you can use the collected images and text to evaluate the safety alignment of MLLMs by running the following script. The 'abstract' parameter refers to the 'text-to-image pointer' setting in our paper.

bash run_evaluation.sh abstract gpt4v hades

The script reports the Attack Success Rate (ASR) of HADES on GPT-4V in the abstract setting.

bash run_evaluation.sh abstract gpt4v black_box

This variant reports the ASR on GPT-4V for HADES without the white-box attack.
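
For readers unfamiliar with the metric, ASR is commonly computed as the percentage of attack attempts whose response is judged harmful. The sketch below shows only that arithmetic; the function name `attack_success_rate` and the list-of-booleans input format are assumptions for illustration and are not taken from this repository's evaluation code.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Return the percentage of responses flagged as harmful.

    `judged_harmful` holds one boolean per attack attempt (hypothetical
    format): True if the judge flagged the model's response as harmful.
    """
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judged_harmful) / len(judged_harmful)


if __name__ == "__main__":
    # Three of four attempts are judged harmful.
    print(f"ASR: {attack_success_rate([True, True, False, True]):.2f}%")
```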

Furthermore, the run_evaluation.sh script can also be used to calculate the ASR of HADES on other models such as LLaVA and Gemini.
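
To sweep several configurations, the command lines can be generated rather than typed by hand. The sketch below assumes that run_evaluation.sh takes three positional arguments in the order setting, model, attack mode, as inferred from the two commands shown in this section. Only the values `abstract`, `gpt4v`, `hades`, and `black_box` appear in this README; the identifiers the script expects for LLaVA and Gemini are not stated here, so check the script before adding them. The helper name `evaluation_commands` is hypothetical.

```python
import shlex
from itertools import product
from typing import Iterable, List


def evaluation_commands(
    settings: Iterable[str],
    models: Iterable[str],
    attacks: Iterable[str],
) -> List[str]:
    """Build one run_evaluation.sh command line per combination.

    The argument order (setting, model, attack mode) is an assumption
    inferred from the example commands in this README.
    """
    return [
        shlex.join(["bash", "run_evaluation.sh", setting, model, attack])
        for setting, model, attack in product(settings, models, attacks)
    ]


if __name__ == "__main__":
    # Uses only argument values that appear in this README.
    for cmd in evaluation_commands(["abstract"], ["gpt4v"], ["hades", "black_box"]):
        print(cmd)
```

The function only prints the commands, so you can review them before running anything.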

Related Projects