Jailbreaking Attack against Multimodal Large Language Model

Overview

<p align="center"> <img src="figs/fig1.png" width="500"> </p>

Dataset

To date, no multimodal dataset exists for evaluating MLLM-jailbreaks, although there are pure-text datasets for LLM-jailbreaking evaluation, such as AdvBench. We therefore construct a multimodal dataset, named AdvBench-M, on top of AdvBench in this paper.

We group all the harmful behaviors within AdvBench into 8 distinct semantic categories: “Bombs or Explosives”, “Drugs”, “Self-harm and Suicide”, “Cybersecurity and Privacy Issues”, “Physical Assault”, “Terrorism and Societal Tensions”, “Stock Market and Economy”, and “Firearms and Ammunition”. For each category, we retrieve 30 semantically relevant images from the Internet via the Google search engine and pair them with the corresponding harmful behaviors.

Getting Started

Installation

1. Prepare the code and the environment

Clone our repository, then create and activate the Python environment via the following commands

git clone https://github.com/abc03570128/Jailbreaking-Attack-against-Multimodal-Large-Language-Model.git MLLMs-jailbreaks
cd MLLMs-jailbreaks
conda env create -f environment.yml
conda activate minigptv

2. Prepare the pretrained LLM weights

We examine several popular multimodal LLMs, including MiniGPT-4, MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2. Download the corresponding LLM weights from the Hugging Face repositories below by cloning each repository with git-lfs (see the example after the table).

MiniGPT-4 has three variants built on three distinct LLMs, i.e., Vicuna-7B, Vicuna-13B, and LLaMA2, while MiniGPT-v2 employs LLaMA2 as its LLM. For white-box jailbreaks, we evaluate our approach on MiniGPT-4 and MiniGPT-v2 separately. For evaluating model transferability, we generate the imgJP on MiniGPT-4 and subsequently employ it for black-box attacks on MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2.

| Model name | Hugging Face Repo |
| --- | --- |
| MiniGPT-4 (Vicuna 7B) | Vision-CAIR/vicuna-7b |
| MiniGPT-4 (Vicuna 13B) | Vision-CAIR/vicuna |
| MiniGPT-4 (LLaMA2) | meta-llama/Llama-2-7b-chat-hf |
| MiniGPT-v2 | meta-llama/Llama-2-7b-chat-hf |
| InstructBLIP | lmsys/vicuna-7b-v1.1 |
| LLaVA | liuhaotian/llava-v1.5-13b |
| mPLUG-Owl2 | MAGAer13/mplug-owl2-llama2-7b |
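
For example, with git-lfs installed, the MiniGPT-4 (LLaMA2) weights can be fetched as follows (gated repositories such as meta-llama require accepting the license on Hugging Face first):

git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf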

Then, set the variable llama_model in the model config file to the LLM weight path.
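
For the MiniGPT-4 (LLaMA2) variant, for instance, the edited line would look like the following, where the path is a placeholder for your local copy of the weights:

llama_model: "/path/to/Llama-2-7b-chat-hf"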

3. Prepare the pretrained model checkpoints

Download the pretrained model checkpoints

| MiniGPT-4 (Vicuna 7B) | MiniGPT-4 (Vicuna 13B) | MiniGPT-4 (LLaMA-2 Chat 7B) | MiniGPT-v2 (online developing demo) |
| --- | --- | --- | --- |
| Download | Download | Download | Download |

For MiniGPT-v2, set the path to the pretrained checkpoint in the evaluation config file eval_configs/minigptv2_eval.yaml at Line 8.

For MiniGPT-4, set the path to the pretrained checkpoint in eval_configs/minigpt4_eval.yaml at Line 8 for the Vicuna versions, or in eval_configs/minigpt4_llama2_eval.yaml for the LLaMA2 version.
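
In both cases the edited line looks like the following (assuming the usual ckpt key used by the MiniGPT-4 evaluation configs; the path is a placeholder):

ckpt: "/path/to/pretrained_checkpoint.pth"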

Practical examples

In the figs folder, we showcase numerous successful jailbreaking instances, including white-box attacks on MiniGPT-4 (LLaMA2) as well as black-box transfer attacks.

<p align="center"> <img src="figs/v1(llama2)_1.png" width="500"> </p> <p align="center"> <img src="figs/Transfer_attack_2.png" width="500"> </p>

Launching Demo Locally

White-box Jailbreaks

1. imgJP-based Jailbreak (Multiple Harmful Behaviors)

For MiniGPT-4(LLaMA2), run

python v1_mprompt.py --cfg-path eval_configs/minigpt4_llama2_eval.yaml --gpu-id 0

For MiniGPT-4 (LLaMA2 + Img-suffix), run

python v1_mprompt_img_suffix.py --cfg-path eval_configs/minigpt4_llama2_eval.yaml --gpu-id 0

For MiniGPT-v2, run

python v2_mprompt.py --cfg-path eval_configs/minigptv2_eval.yaml  --gpu-id 0

2. deltaJP-based Jailbreak

For MiniGPT-4(LLaMA2), run

python v1_Mprompt_Mimage.py --cfg-path eval_configs/minigpt4_llama2_eval.yaml --gpu-id 0

For MiniGPT-v2, run

python v2_Mprompt_Mimage.py --cfg-path eval_configs/minigptv2_eval.yaml  --gpu-id 0

Model-transferability

We generate the imgJP on a surrogate model (e.g., Vicuna or LLaMA2) and use it to jailbreak various target models (e.g., mPLUG-Owl2, LLaVA, MiniGPT-v2, and InstructBLIP) in a black-box manner.

run

python v1_Mprompt_Mmodel.py --gpu-id 0

With the generated imgJP, we execute black-box attacks on all four target models using their default conversation templates. Taking mPLUG-Owl2 as an example, run

python mPLUG-Owl2_demo.py --gpu-id 0

Construction-based LLM-jailbreaks

In this manuscript, we also delve into jailbreaking LLaMA2 itself. We first construct an MLLM that encapsulates it, as shown in the following figure.

<p align="center"> <img src="figs/fig3.png" width="700"> </p>

Secondly, we perform our MLLM-jailbreak to acquire an imgJP while concurrently recording the corresponding embedding, embJP. To do so, run

python v1_mprompt_img_suffix.py --gpu-id 0

Thirdly, embJP is reversed into text space through de-embedding and de-tokenization operations (a sketch follows below). We then execute LLM-jailbreaks using LLaMA2's default conversation template.

run

python Test_Llama2_image_suffix.py
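
For reference, the de-embedding and de-tokenization step can be sketched as follows. This is an illustrative sketch, not the repository's exact code: the file name embJP.pt, the tensor shape, and the variable names are assumptions. De-embedding maps each optimized embedding vector to its nearest neighbor in LLaMA2's token-embedding matrix; de-tokenization decodes the resulting token ids back into text.

```python
# Illustrative de-embedding / de-tokenization sketch (not the repository's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Token-embedding matrix, shape (vocab_size, hidden_dim).
emb_matrix = model.get_input_embeddings().weight.detach()
# Recorded adversarial embedding, shape (seq_len, hidden_dim); file name is hypothetical.
embJP = torch.load("embJP.pt")

# De-embedding: map each embedding vector to its nearest token (cosine similarity).
emb_norm = torch.nn.functional.normalize(emb_matrix, dim=-1)
jp_norm = torch.nn.functional.normalize(embJP, dim=-1)
token_ids = (jp_norm @ emb_norm.T).argmax(dim=-1)

# De-tokenization: recover the text-space jailbreaking prompt.
txtJP = tokenizer.decode(token_ids)
print(txtJP)
```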

Acknowledgement