Awesome
🛡️Awesome-Jailbreak-against-Multimodal-Generative-Models
🤗Introduction
Welcome to our Awesome-Jailbreak-against-Multimodal-Generative-Models! We provides a comprehensive overview of jailbreak vulnerabilities in multimodal generative models! 🥰🥰🥰<br>
<br> We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on Jailbreak Attack and Defense Multimodel Generative Models.<br> But we don't stop there; Our repository is constantly updated to ensure you have the most current information at your fingertips.If a resource is relevant to multiple subcategories, we place it under each applicable section.<br>
🧑💻 Our Work
- Leveraging the layered structure of generative models, we systematically examine jailbreak attacks and corresponding defense strategies across the input, encoding, decoding, and output layers.<br>
- We establish a detailed taxonomy of attack vectors, defense mechanisms, and evaluation frameworks specific to multimodal generative models.<br>
- Our review encompasses a wide array of input-output configurations, offering a nuanced examination of jailbreak tactics and defenses applicable to any-to-text, any-to-vision, and any-to-any modalities within generative systems.<br>
✔️ Perfect for Majority
- For beginners curious about jailbreak attack and defense, our repository serves as a compass for grasping the big picture and diving into the details. The brief introductions to papers in different fields retained in the README provide a beginner-friendly navigation through interesting directions in the field;
- For seasoned researchers, this repository is a tool to keep you informed and fill any gaps in your knowledge. Within each subtopic, we are diligently updating all the latest content and continuously backfilling with previous work. Our thorough compilation and careful selection are time-savers for you.
🧭 How to Use this Guide
- Quick Start: In the README, users can find a curated list of select information, along with links to various consultations.
- In-Depth Exploration: If you have a special interest in a particular area of the paper, delve into the markdown file for more information.
🚀Table of Contents
- 🛡️Awesome-Jailbreak-against-Multimodal-Generative-Models🛡️
🔥Multimodal Generative Models
Below are tables of model short name and representative generative models used for jailbreak. For input/output modalities, I: Image, T: Text, V: Video, A: Audio.
📑Any-to-Text (LLM Backbone)
Short Name | Modality | Representative Model |
---|---|---|
IT2T | I + T -> T | LLaVA, MiniGPT4, InstructBLIP |
VT2T | V + T -> T | Video-LLaVA, Video-LLaMA |
AT2T | A + T -> T | Audio Flamingo, Audiopalm |
📖Any-to-Vision (Diffusion Backbone)
Short Name | Modality | Representative Model |
---|---|---|
T2T | T -> I | Stable Diffusion, Midjourney, DALLE |
IT2I | I + T -> I | DreamBooth, InstructP2P |
T2V | T -> V | Open-Sora, Stable Video Diffusion |
IT2V | I + T -> V | VideoPoet, CogVideoX |
📰Any-to-Any (Unified Backbone)
Short Name | Modality | Representative Model |
---|---|---|
IT2IT | I + T -> I + T | Next-GPT, Chameleon |
A2A | A -> A | GPT-4o |
😈JailBreak Attack
📖Attack-Intro
In this part, we focus on discussing different advanced jailbreak attacks against multimodal models. We categorize attack methods into black-box, gray-box, and white-box attacks. in a black-box setting where the model is inaccessible to the attacker, the attack is limited to surface-level interactions, focusing solely on the model’s input and/or output. Regarding gray-box and white-box attacks, we consider model-level attacks, including attacks at both the encoder and generator.
<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_attack_A.png" alt="jailbreak_attack_black_box" />As shown in Fig. A.1, attackers are compelled to develop more sophisticated input templates across prompt engineering, image engineering, and role-ploy techniques. These techniques can bypass the model’s safeguards, making the models more susceptible to executing prohibited instructions. <br> As shown in Fig. A.2, attackers focus on querying outputs across multiple input variants. Driven by specific adversarial goals, attackers employ estimation-based and search-based attack techniques to iteratively refine these input variants.
<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreal_attack_B.png" alt="jailbreak_attack_white_and_gray_box" />As shown in Fig. B.1, attackers are restricted to accessing only the encoders to provoke harmful responses. In this case, attackers typically seek to maximize cosine similarity within the latent space, ensuring the adversarial input retains similar semantics to the target malicious content while still being classified as safe. <br> As shown in Fig. B.2, attackers have unrestricted access to the generative model’s architecture and checkpoint, enabling attackers to conduct thorough investigations and manipulations, thus enabling sophisticated attacks.
📑Papers
Below are the papers related to jailbreak attacks.
Jailbreak Attack of Any-to-Text Models
Jailbreak Attack of Any-to-Vision Models
Jailbreak Attack of Any-to-Any Models
Title | Venue | Date | Code | Taxonomy | Multimodal Model |
---|---|---|---|---|---|
Voice jailbreak attacks against gpt-4o | Arxiv 2024 | 2024/05/29 | Github | Input Level | A2A |
🛡️Jailbreak Defense
📖Defense-Intro
In order to cope with jailbreak attacks and improve the security of multimodal foundation models, existing work makes efforts in both Transformative defense and Discriminative defense.
<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_discriminative_defense_00.png" alt="jailbreak_discriminative_defense" />Discriminative defenses focus on identifying and analyzing varying classified cues, such as statistical information at the input level, embeddings at the encoder level, activations at the generator level, and response discrepancies at the output level.
<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_defense_all_00.png" alt="jailbreak_transformative_defense" />Transformative defenses that can operate at four levels to influence the model’s generation process, ensuring benign responses even in the presence of adversarial or malicious prompts.
📑Papers
Below are the papers related to jailbreak defense.
Jailbreak Defense of Any-to-Text Models
Jailbreak Defense of Any-to-Vision Models
Jailbreak Defense of Any-to-Any Models
Title | Venue | Date | Code | Taxonomy | Multimodal Model |
---|---|---|---|---|---|
Safree: Training-free and adaptive guard for safe text-to-image and video generation | Arxiv 2024 | 2024/10/16 | None | Output Level | T2V |
💯Resources
⭐️Datasets
Used to Any-to-Text Models
Dataset | Task | Text Source | Image Source | Volume | Access |
---|---|---|---|---|---|
SafeBench | Attack | GPT generation | Typography | 500 | Github |
AdvBench | Attack | LLM generation | N/A | 500 | Github |
ReadTeam-2K | Attack | Exist. & GPT Generation | N/A | 2000 | Huggingface |
HarmBench | Attack & Defense | Unpublished | N/A | 320 | Github |
HADES | Defense | GPT generation | Typography & Diffusion Generation | 750 | Github |
MM-SafetyBench | Defense | GPT generation | Typography & Diffusion Generation | 5040 | Github |
JailBreakV-28K | Defense | Adv. Prompt on ReadTeam-2K | Blank & Noise & Natural & Synthesize | 28000 | Huggingface |
VLGuard | Defense | GPT generation | Exist. | 3000 | Huggingface |
Used to Any-to-Vision Models
Dataset | Task | Text Source | Image Source | Volume | Access |
---|---|---|---|---|---|
NSFW-200 | Attack | Human curation | N/A | 200 | Github |
MMA | Attack | Exist.& Adv. Prompt | N/A | 1000 | Huggingface |
VBCDE-100 | Attack | Human curation | N/A | 100 | Github |
I2P | Attack & Defense | Real-world Website | Real-world Website | 4703 | Huggingface |
Unsafe Diffusion | Defense | Human curation& Website&Exist. | N/A | 1434 | Github |
MACE | Defense | Human curation | Diffusion Generation | 200 | Github |
📚Detectors
<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_evaluation_00.png" alt="jailbreak_evaluation" width="600" />Detector-based approaches utilize pre-trained classifiers to automatically detect and identify harmful content within generated outputs. These classifiers are trained on large, annotated datasets that cover a range of unsafe categories, such as toxicity, violence, or explicit material. By leveraging these pre-trained models, detector-based methods can efficiently flag inappropriate content.
Used to Any-to-Text Models
Toxicity detector | Access |
---|---|
LLama-Guard | Huggingface |
LLama-Guard2 | Huggingface |
Detoxify | Github |
GPTFUZZER | Huggingface |
Perspective API | Website |
Used to Any-to-Vision Models
Toxicity detector | Access |
---|---|
NudeNet | Github |
Q16 | Github |
Safety Checker | Huggingface |
Imgcensor | Github |
Multi-headed Safety Classifier | Github |