# 😈🛡️Awesome-Jailbreak-against-Multimodal-Generative-Models
## 🔥🔥🔥 Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey
We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on jailbreak attacks and defenses against multimodal generative models.<br> And we don't stop there: the repository is constantly updated to ensure you have the most current information at your fingertips.
## 🤗Introduction
This survey presents a comprehensive review of existing jailbreak attacks and defenses against multimodal generative models.<br> Given the generalized lifecycle of a multimodal jailbreak, we systematically explore attacks and the corresponding defense strategies across four levels: input, encoder, generator, and output.<br>
### 🧑‍💻 Four Levels of the Multimodal Jailbreak Lifecycle
- Input Level: Attackers and defenders operate solely on the input data. Attackers modify inputs to execute attacks, while defenders incorporate protective cues to enhance detection.<br>
- Encoder Level: With access to the encoder, attackers optimize adversarial inputs to inject malicious information into the encoding process, while defenders work to prevent harmful information from being encoded within the latent space.<br>
- Generator Level: With full access to the generative models, attackers leverage inference information, such as activations and gradients, and fine-tune models to increase adversarial effectiveness, while defenders use these techniques to strengthen model robustness.<br>
- Output Level: With the output from the generative model, attackers can iteratively refine adversarial inputs, while defenders can apply post-processing techniques to enhance detection.<br>
Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models.<br> We cover a wide range of input-output configurations, including modalities such as Any-to-Text, Any-to-Vision, and Any-to-Any within generative systems.<br>
## 🚀Table of Contents
- [🔥Multimodal Generative Models](#multimodal-generative-models)
- [😈Jailbreak Attack](#jailbreak-attack)
- [🛡️Jailbreak Defense](#jailbreak-defense)
- [💯Evaluation](#evaluation)
- [😉Citation](#citation)
## 🔥Multimodal Generative Models
Below are tables of model short names and representative generative models studied in jailbreak research. Input/output modalities are abbreviated as I: Image, T: Text, V: Video, A: Audio.
### 📑Any-to-Text Models (LLM Backbone)
Short Name | Modality | Representative Model |
---|---|---|
IT2T | I + T → T | LLaVA, MiniGPT-4, InstructBLIP |
VT2T | V + T → T | Video-LLaVA, Video-LLaMA |
AT2T | A + T → T | Audio Flamingo, AudioPaLM |
### 📖Any-to-Vision (Diffusion Backbone)
Short Name | Modality | Representative Model |
---|---|---|
T2I | T → I | Stable Diffusion, Midjourney, DALL-E |
IT2I | I + T → I | DreamBooth, InstructPix2Pix |
T2V | T → V | Open-Sora, Stable Video Diffusion |
IT2V | I + T → V | VideoPoet, CogVideoX |
### 📰Any-to-Any (Unified Backbone)
Short Name | Modality | Representative Model |
---|---|---|
IT2IT | I + T → I + T | NExT-GPT, Chameleon |
TIV2TIV | T + I + V → T + I + V | Emu3 |
Any2Any | Any → Any | GPT-4o, Gemini Ultra |
## 😈Jailbreak Attack
### 📖Attack-Intro
We categorize attack methods into black-box, gray-box, and white-box attacks. In a black-box setting, where the model is inaccessible to the attacker, the attack is limited to surface-level interactions, focusing solely on the model's input and/or output. For gray-box and white-box attacks, we consider model-level attacks, including attacks at both the encoder and the generator.
- Input-level attack: Attackers are compelled to develop more sophisticated input templates spanning prompt engineering, image engineering, and role-play techniques.
- Output-level attack: Attackers focus on querying outputs across multiple input variants. Driven by specific adversarial goals, attackers employ estimation-based and search-based attack techniques to iteratively refine these input variants.
- Encoder-level attack: Attackers are restricted to accessing only the encoders to provoke harmful responses. In this case, attackers typically seek to maximize cosine similarity within the latent space, ensuring the adversarial input retains semantics similar to the target malicious content while still being classified as safe (see the sketch after this list).
- Generator-level attack: Attackers have unrestricted access to the generative model's architecture and checkpoints, allowing thorough investigation and manipulation and thus enabling sophisticated attacks.
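To make the encoder-level objective concrete, below is a minimal PGD-style sketch in PyTorch that perturbs an image so its CLIP embedding moves toward the embedding of a target prompt, i.e., it maximizes cosine similarity in the latent space. The CLIP checkpoint, attack budget (`epsilon`, `alpha`, `steps`), and the placeholder image are illustrative assumptions, not settings from any surveyed paper.

```python
# Minimal encoder-level attack sketch: maximize cosine similarity between an
# adversarial image and a target text prompt in CLIP's latent space.
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed the target prompt (a benign placeholder standing in for harmful text).
text_inputs = processor(text=["target prompt placeholder"],
                        return_tensors="pt", padding=True)
with torch.no_grad():
    target_emb = F.normalize(model.get_text_features(**text_inputs), dim=-1)

# Start from a clean image in [0, 1]; random noise is used as a stand-in here.
# (CLIP's usual pixel normalization is omitted for brevity.)
image = torch.rand(1, 3, 224, 224)
delta = torch.zeros_like(image, requires_grad=True)  # adversarial perturbation
epsilon, alpha, steps = 8 / 255, 1 / 255, 100        # assumed L-inf budget

for _ in range(steps):
    adv = (image + delta).clamp(0, 1)
    img_emb = F.normalize(model.get_image_features(pixel_values=adv), dim=-1)
    loss = -F.cosine_similarity(img_emb, target_emb).mean()  # maximize similarity
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()  # signed gradient step
        delta.clamp_(-epsilon, epsilon)     # keep the perturbation imperceptible
        delta.grad.zero_()
```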
### 📑Papers
Below are the papers related to jailbreak attacks.
#### Jailbreak Attack of Any-to-Text Models
#### Jailbreak Attack of Any-to-Vision Models
#### Jailbreak Attack of Any-to-Any Models
Title | Venue | Date | Code | Taxonomy | Multimodal Model |
---|---|---|---|---|---|
Gradient-based Jailbreak Images for Multimodal Fusion Models | arXiv 2024 | 2024/10/04 | GitHub | Generator Level | IT2IT |
Voice Jailbreak Attacks against GPT-4o | arXiv 2024 | 2024/05/29 | GitHub | Output Level | Any2Any |
## 🛡️Jailbreak Defense
### 📖Defense-Intro
Current efforts on the jailbreak defense of multimodal generative models follow two lines of work: discriminative defense and transformative defense, contrasted in the sketch after this list.
- Discriminative defense: constrained to classification tasks that assign binary safe/unsafe labels.
- Transformative defense: aims to produce appropriate and safe responses in the presence of malicious or adversarial inputs.
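As a concrete illustration, here is a minimal sketch contrasting the two lines. `unsafe_prob` stands in for any binary safety classifier (e.g., a guard model's unsafe score) and `model.generate` for any text-generation API; both are hypothetical placeholders, and the safety wording is illustrative.

```python
# Minimal sketch of the two defense lines. `unsafe_prob` and `model.generate`
# are hypothetical placeholders, not APIs from any surveyed paper.

def discriminative_defense(unsafe_prob, text: str, threshold: float = 0.5) -> bool:
    """Assign a binary safe/unsafe label to the input; classification only."""
    return unsafe_prob(text) < threshold  # True means the input may pass

def transformative_defense(model, prompt: str) -> str:
    """Steer generation toward a safe response rather than only filtering."""
    guarded = ("If the following request asks for harmful content, refuse it "
               "politely; otherwise answer normally.\n\n" + prompt)
    return model.generate(guarded)
```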
### 📑Papers
Below are the papers related to jailbreak defense.
#### Jailbreak Defense of Any-to-Text Models
#### Jailbreak Defense of Any-to-Vision Models
#### Jailbreak Defense of Any-to-Any Models
Title | Venue | Date | Code | Taxonomy | Multimodal Model |
---|---|---|---|---|---|
## 💯Evaluation
### ⭐️Evaluation Datasets
Below are comparison tables of publicly available, representative evaluation datasets, along with a description of each attribute:
- Collected: Raw data created by humans or collected from real-world websites.<br>
- Reconstructed: Data reorganized from other existing datasets.<br>
- Synthesized: AI-generated data using LLMs or diffusion models.<br>
- Adversarial: Adversarial data generated by jailbreak attack methods.<br>
#### Used for Any-to-Text Models
Dataset | Text Source | Image Source | Volume | Themes | Access |
---|---|---|---|---|---|
FigStep | Synthesized | Adversarial | 500 | 10 | GitHub |
AdvBench | Synthesized | --- | 500 | --- | GitHub |
RedTeam-2K | Collected & Reconstructed & Synthesized | N/A | 2000 | 16 | Hugging Face |
HarmBench | Collected | --- | 510 | 4 | GitHub |
HADES | Synthesized | Collected & Synthesized & Adversarial | 750 | 5 | GitHub |
MM-SafetyBench | Synthesized | Synthesized & Adversarial | 5040 | 13 | GitHub |
JailBreakV-28K | Adversarial | Reconstructed & Synthesized | 28000 | 16 | Hugging Face |
#### Used for Any-to-Vision Models
Dataset | Text Source | Image Source | Volume | Themes | Access |
---|---|---|---|---|---|
NSFW-200 | Synthesized | --- | 200 | --- | GitHub |
MMA | Reconstructed & Adversarial | Adversarial | 1000 | --- | Hugging Face |
VBCDE | Reconstructed & Adversarial | --- | 100 | 5 | GitHub |
I2P | Collected | Collected | 4703 | 7 | Hugging Face |
Unsafe Diffusion | Collected & Reconstructed | --- | 1434 | --- | GitHub |
MACE-Celebrity | Collected | --- | 1000 | --- | GitHub |
MACE-Art | Reconstructed | --- | 1000 | --- | GitHub |
MPUP | Synthesized | --- | 1200 | 4 | Hugging Face |
T2VSafetyBench | Reconstructed & Synthesized & Adversarial | --- | 4400 | 12 | GitHub |
### 📚Evaluation Methods
Current evaluation methods are primarily classified into two categories: manual evaluation and automated evaluation.
- Manual evaluation involves human assessment to determine if the content is toxic, offering a direct and interpretable method of evaluation.
- Automated approaches assess the safety of multimodal generative models by employing a range of techniques, including detector-based, GPT-based, and rule-based methods (a detector-based sketch follows this list).
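As a minimal example of the detector-based route, the sketch below scores model outputs with the open-source Detoxify classifier and reports an attack success rate (ASR), i.e., the fraction of outputs flagged as toxic. The 0.5 threshold and the placeholder outputs are illustrative assumptions.

```python
# Minimal detector-based evaluation sketch: attack success rate (ASR) is the
# fraction of model outputs that a toxicity detector flags as unsafe.
# The threshold and sample outputs below are illustrative assumptions.
from detoxify import Detoxify  # pip install detoxify

detector = Detoxify("original")  # returns per-category toxicity scores

def attack_success_rate(outputs: list[str], threshold: float = 0.5) -> float:
    flagged = sum(
        max(detector.predict(text).values()) >= threshold for text in outputs
    )
    return flagged / len(outputs)

# Placeholder responses standing in for a jailbroken model's outputs.
outputs = ["model response one", "model response two"]
print(f"ASR: {attack_success_rate(outputs):.2%}")
```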
#### Text Detector
Toxicity detector | Access |
---|---|
Llama Guard | Hugging Face |
Llama Guard 2 | Hugging Face |
Detoxify | GitHub |
GPTFUZZER | Hugging Face |
Perspective API | Website |
#### Image Detector
Toxicity detector | Access |
---|---|
NudeNet | GitHub |
Q16 | GitHub |
Safety Checker | Hugging Face |
Imgcensor | GitHub |
Multi-headed Safety Classifier | GitHub |
## 😉Citation
If you find this work useful in your research, please kindly cite it using the following BibTeX:
```bibtex
@article{liu2024jailbreak,
  title={Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey},
  author={Liu, Xuannan and Cui, Xing and Li, Peipei and Li, Zekun and Huang, Huaibo and Xia, Shuhan and Zhang, Miaoxuan and Zou, Yueying and He, Ran},
  journal={arXiv preprint arXiv:2411.09259},
  year={2024},
}
```