
Awesome

😈🛡️Awesome-Jailbreak-against-Multimodal-Generative-Models

🔥🔥🔥 Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey

Paper

We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on jailbreak attacks and defenses against multimodal generative models.<br> But we don't stop there; our repository is constantly updated to ensure you have the most current information at your fingertips.

survey model

🤗Introduction

This survey presents a comprehensive review of existing jailbreak attacks and defenses against multimodal generative models.<br> Given the generalized lifecycle of a multimodal jailbreak, we systematically explore attacks and the corresponding defense strategies across four levels: input, encoder, generator, and output.<br>

🧑‍💻 Four Levels of the Multimodal Jailbreak Lifecycle

Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models.<br> We cover a wide range of input-output configurations, including Any-to-Text, Any-to-Vision, and Any-to-Any generative systems.<br>

survey model

🚀Table of Contents

🔥Multimodal Generative Models

Below are tables of model short names and representative generative models studied in jailbreak research. For input/output modalities: I = Image, T = Text, V = Video, A = Audio.

📑Any-to-Text Models (LLM Backbone)

| Short Name | Modality | Representative Model |
|---|---|---|
| I+T→T | I + T → T | LLaVA, MiniGPT4, InstructBLIP |
| VT2T | V + T → T | Video-LLaVA, Video-LLaMA |
| AT2T | A + T → T | Audio Flamingo, Audiopalm |

📖Any-to-Vision (Diffusion Backbone)

| Short Name | Modality | Representative Model |
|---|---|---|
| T→I | T → I | Stable Diffusion, Midjourney, DALLE |
| IT→I | I + T → I | DreamBooth, InstructP2P |
| T2V | T → V | Open-Sora, Stable Video Diffusion |
| IT2V | I + T → V | VideoPoet, CogVideoX |

📰Any-to-Any (Unified Backbone)

| Short Name | Modality | Representative Model |
|---|---|---|
| IT→IT | I + T → I + T | Next-GPT, Chameleon |
| TIV2TIV | T + I + V → T + I + V | EMU3 |
| Any2Any | Any → Any | GPT-4o, Gemini Ultra |

😈Jailbreak Attack

📖Attack-Intro

We categorize attack methods into black-box, gray-box, and white-box attacks. In the black-box setting, where the model's internals are inaccessible to the attacker, attacks are limited to surface-level interactions and focus solely on the model's input and/or output. For gray-box and white-box attacks, we consider model-level attacks, including attacks on both the encoder and the generator.

<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_attack_A.png" alt="jailbreak_attack_black_box" /> <img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_attack_B.png" alt="jailbreak_attack_white_and_gray_box" />
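To make the black-box setting concrete, here is a minimal, illustrative sketch of a query-based attack loop; every helper in it (the model query, the output judge, and the prompt mutation) is a hypothetical stand-in rather than any specific method from the tables below.

```python
# Illustrative black-box attack loop: only inputs and outputs are observable.
# All helpers below are hypothetical stand-ins, not a specific published method.

def query_model(prompt: str) -> str:
    """Stand-in for an API call to a closed-source multimodal model."""
    return "I cannot help with that."

def judge_unsafe(response: str) -> bool:
    """Stand-in for an output judge, e.g. a toxicity classifier."""
    return "sure, here is" in response.lower()

def mutate(prompt: str) -> str:
    """Stand-in for a rewriting strategy (paraphrase, role-play, typographic image, ...)."""
    return prompt + " (rephrased)"

def black_box_attack(seed_prompt: str, budget: int = 50):
    prompt = seed_prompt
    for _ in range(budget):
        response = query_model(prompt)   # surface-level access only
        if judge_unsafe(response):
            return prompt, response      # candidate jailbreak found
        prompt = mutate(prompt)          # refine using only the observed output
    return None                          # failed within the query budget
```

Gray-box and white-box attacks differ in that they can additionally exploit internal representations or gradients of the encoder and generator.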

📑Papers

Below are the papers related to jailbreak attacks.

Jailbreak Attack of Any-to-Text Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models | Arxiv 2024 | 2024/12/8 | None | --- | I+T→T |
| BAMBA: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs | Arxiv 2024 | 2024/12/8 | None | --- | I+T→T |
| Jailbreak Large Vision-Language Models Through Multi-Modal Linkage | Arxiv 2024 | 2024/11/30 | Github | --- | I+T→T |
| Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models | Arxiv 2024 | 2024/11/18 | None | Output Level | I+T→T |
| IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves | Arxiv 2024 | 2024/11/15 | None | Output Level | I+T→T |
| Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models | Neurips SafeGenAi Workshop 2024 | 2024/11/12 | None | Output Level | I+T→T |
| Audio is the achilles' heel: Red teaming audio large multimodal models | Arxiv 2024 | 2024/10/31 | None | Input Level | A+T→T |
| Advweb: Controllable black-box attacks on vlm-powered web agents | Arxiv 2024 | 2024/10/22 | None | Input Level | I+T→T |
| Image Hijacks: Adversarial Images can Control Generative Models at Runtime | ICML 2024 | 2024/09/01 | Github | Generator Level | I+T→T |
| Can Large Language Models Automatically Jailbreak GPT-4V? | NAACL Workshop 2024 | 2024/07/23 | None | Input Level | I+T→T |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | ACM MM 2024 | 2024/07/21 | None | Input Level | I+T→T |
| Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | Arxiv 2024 | 2024/07/01 | None | Input Level | I+T→T |
| From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking | EMNLP 2024 | 2024/06/21 | None | Encoder Level | I+T→T |
| Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | Arxiv 2024 | 2024/06/06 | Github | Generator Level | I+T→T |
| Efficient LLM-Jailbreaking by Introducing Visual Modality | Arxiv 2024 | 2024/05/30 | None | Generator Level | I+T→T |
| White-box Multimodal Jailbreaks Against Large Vision-Language Models | ACM Multimedia 2024 | 2024/05/28 | None | Generator Level | I+T→T |
| Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character | Arxiv 2024 | 2024/05/25 | None | Input Level | I+T→T |
| Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | ECCV 2024 | 2024/05/14 | Github | Generator Level | I+T→T |
| Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | ICML 2024 | 2024/02/13 | Github | Generator Level | I+T→T |
| Jailbreaking Attack against Multimodal Large Language Model | Arxiv 2024 | 2024/02/04 | None | Generator Level | I+T→T |
| Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models | ICLR 2024 Spotlight | 2024/01/16 | Github | Encoder Level | I+T→T |
| MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | ECCV 2024 | 2023/11/29 | Github | Input Level | I+T→T |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | ECCV 2024 | 2023/11/27 | Github | Encoder Level | I+T→T |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Arxiv 2023 | 2023/11/15 | None | Output Level | I+T→T |
| FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts | AAAI 2025 | 2023/11/09 | Github | Input Level | I+T→T |
| Are aligned neural networks adversarially aligned? | NeurIPS 2023 | 2023/06/26 | None | Generator Level | I+T→T |
| Visual Adversarial Examples Jailbreak Aligned Large Language Models | AAAI 2024 | 2023/06/22 | Github | Generator Level | I+T→T |
| On Evaluating Adversarial Robustness of Large Vision-Language Models | NeurIPS 2023 | 2023/05/26 | Homepage | Encoder Level | I+T→T |

Jailbreak Attack of Any-to-Vision Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| Antelope: Potent and Concealed Jailbreak Attack Strategy | Arxiv 2024 | 2024/12/11 | None | --- | T→I |
| BAMBA: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs | Arxiv 2024 | 2024/12/8 | None | --- | T→I |
| In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models | Arxiv 2024 | 2024/11/25 | None | Output Level | T→I |
| Unfiltered and Unseen: Universal Multimodal Jailbreak Attacks on Text-to-Image Model Defenses | Openreview | 2024/11/13 | None | --- | T→I |
| AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models | Arxiv 2024 | 2024/10/28 | Github | Encoder Level | T→I |
| Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step | Arxiv 2024 | 2024/10/4 | None | Output Level | T→I |
| ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep Generation | NeurIPS 2024 | 2024/9/25 | Github | Input Level | T→I |
| RT-Attack: Jailbreaking Text-to-Image Models via Random Token | Arxiv 2024 | 2024/08/25 | None | Output Level | T→I |
| Perception-guided Jailbreak against Text-to-Image Models | Arxiv 2024 | 2024/08/20 | None | Input Level | T→I |
| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Arxiv 2024 | 2024/08/18 | None | Output Level | T→I |
| Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models | Arxiv 2024 | 2024/08/02 | None | Encoder Level | T→I |
| Jailbreaking Text-to-Image Models with LLM-Based Agents | Arxiv 2024 | 2024/08/01 | None | Output Level | T→I |
| Automatic Jailbreaking of the Text-to-Image Generative AI Systems | Arxiv 2024 | 2024/05/26 | None | Output Level | T→I |
| UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers | ICML 2024 | 2024/05/18 | None | Input Level | T→I |
| BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators | Arxiv 2024 | 2024/02/23 | None | Input Level | T→I |
| Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models | Arxiv 2023 | 2023/12/12 | Github | Input Level | T→I |
| MMA-Diffusion: MultiModal Attack on Diffusion Models | CVPR 2024 | 2023/11/29 | Github | Encoder Level | T→I |
| VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models | CVPR 2024 | 2023/11/29 | Github | Generator Level | T→I |
| To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now | ECCV 2024 | 2023/10/18 | Github | Generator Level | T→I |
| Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? | ICLR 2024 | 2023/10/16 | Github | Encoder Level | T→I |
| SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution | CCS 2024 | 2023/09/25 | Github | Input Level | T→I |
| Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | ICML 2024 | 2023/09/12 | Github | Generator Level | T→I |
| SneakyPrompt: Jailbreaking Text-to-image Generative Models | Symposium on Security and Privacy 2024 | 2023/05/20 | Github | Output Level | T→I |
| Red-Teaming the Stable Diffusion Safety Filter | NeurIPSW 2022 | 2022/10/03 | None | Input Level | T→I |

Jailbreak Attack of Any-to-Any Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| Gradient-based Jailbreak Images for Multimodal Fusion Models | Arxiv 2024 | 2024/10/4 | Github | Generator Level | I+T→I+T |
| Voice jailbreak attacks against gpt-4o | Arxiv 2024 | 2024/05/29 | Github | Output Level | Any→Any |

🛡️Jailbreak Defense

📖Defense-Intro

Current efforts on jailbreak defense for multimodal generative models follow two lines of work: discriminative defense and transformative defense.

<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_discriminative_defense.png" alt="jailbreak_discriminative_defense" /> <img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_transformative_defense.png" alt="jailbreak_transformative_defense" />
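As a rough illustration of the two lines of work, the sketch below contrasts a discriminative wrapper (detect and refuse) with a transformative wrapper (rewrite the input so generation stays safe); the classifier, rewriter, and model are hypothetical stand-ins, not any specific method from the tables below.

```python
# Minimal sketch contrasting discriminative and transformative defenses.
# The classifier, rewriter, and model below are hypothetical stand-ins.

def looks_unsafe(text: str) -> bool:
    return "harmful" in text.lower()           # stand-in safety classifier

def rewrite_safely(text: str) -> str:
    return text.replace("harmful", "benign")   # stand-in prompt sanitizer

def generate(text: str) -> str:
    return f"Generated content for: {text}"    # stand-in generative model

def discriminative_defense(prompt: str) -> str:
    # Detect-and-refuse: classify the input and/or the output and block it.
    if looks_unsafe(prompt):
        return "Request refused."
    output = generate(prompt)
    return "Response withheld." if looks_unsafe(output) else output

def transformative_defense(prompt: str) -> str:
    # Steer generation toward safe content instead of merely blocking.
    return generate(rewrite_safely(prompt))
```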

📑Papers

Below are the papers related to jailbreak defense.

Jailbreak Defense of Any-to-Text Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| Defending LVLMs Against Vision Attacks through Partial-Perception Supervision | Arxiv 2024 | 2024/12/17 | None | --- | I+T→T |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Arxiv 2024 | 2024/11/27 | None | Output Level | I+T→T |
| Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks | Arxiv 2024 | 2024/11/23 | None | Generator Level | I+T→T |
| Uniguard: Towards universal safety guardrails for jailbreak attacks on multimodal large language models | Arxiv 2024 | 2024/11/03 | None | Input Level | I+T→T |
| Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector | Arxiv 2024 | 2024/10/30 | None | Generator Level | I+T→T |
| BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks | Arxiv 2024 | 2024/10/28 | None | Input Level | I+T→T |
| Information-theoretical principled trade-off between jailbreakability and stealthiness on vision language models | Arxiv 2024 | 2024/10/02 | None | Input Level | I+T→T |
| CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration | COLM 2024 | 2024/9/17 | None | Output Level | I+T→T |
| Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks | Arxiv 2024 | 2024/09/11 | None | Encoder Level | I+T→T |
| Bathe: Defense against the jailbreak attack in multimodal large language models by treating harmful instruction as backdoor trigger | Arxiv 2024 | 2024/08/17 | None | Generator Level | I+T→T |
| Defending jailbreak attack in vlms via cross-modality information detector | Arxiv 2024 | 2024/07/31 | Github | Encoder Level | I+T→T |
| Sim-clip: Unsupervised siamese adversarial fine-tuning for robust and semantically-rich vision-language models | Arxiv 2024 | 2024/07/20 | None | Encoder Level | I+T→T |
| Cross-modal safety alignment: Is textual unlearning all you need? | Arxiv 2024 | 2024/05/27 | None | Generator Level | I+T→T |
| Safety alignment for vision language models | Arxiv 2024 | 2024/05/22 | None | Generator Level | I+T→T |
| Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting | ECCV 2024 | 2024/05/14 | Github | Input Level | I+T→T |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | ECCV 2024 | 2024/03/14 | Github | Output Level | I+T→T |
| Safety fine-tuning at (almost) no cost: A baseline for vision large language models | ICML 2024 | 2024/02/03 | Github | Generator Level | I+T→T |
| Inferaligner: Inference-time alignment for harmlessness through cross-model guidance | EMNLP 2024 | 2024/01/20 | Github | Generator Level | I+T→T |
| Mllm-protector: Ensuring mllm's safety without hurting performance | EMNLP 2024 | 2024/01/05 | Github | Output Level | I+T→T |
| Jailguard: A universal detection framework for llm prompt-based attacks | Arxiv 2023 | 2023/12/17 | None | Output Level | I+T→T |
| Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions | ICLR 2024 | 2023/09/14 | Github | Generator Level | I+T→T |

Jailbreak Defense of Any-to-Vision Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|
| SafeCFG: Redirecting Harmful Classifier-Free Guidance for Safe Generation | Arxiv 2024 | 2024/12/20 | Github | --- | T→I |
| SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation | Arxiv 2024 | 2024/12/13 | Github | --- | T→I |
| TraSCE: Trajectory Steering for Concept Erasure | Arxiv 2024 | 2024/12/10 | None | --- | T→I |
| Buster: Incorporating Backdoor Attacks into Text Encoder to Mitigate NSFW Content Generation | Arxiv 2024 | 2024/12/10 | Github | --- | T→I |
| Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | Arxiv 2024 | 2024/11/30 | None | --- | T→I |
| Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction | Arxiv 2024 | 2024/11/21 | None | --- | T→I |
| Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding | Arxiv 2024 | 2024/11/15 | None | Encoder Level | T→I |
| Safree: Training-free and adaptive guard for safe text-to-image and video generation | Arxiv 2024 | 2024/10/16 | None | Generator Level | T→I/T→V |
| Shielddiff: Suppressing sexual content generation from diffusion models through reinforcement learning | Arxiv 2024 | 2024/10/04 | None | Generator Level | T→I |
| Dark miner: Defend against unsafe generation for text-to-image diffusion models | Arxiv 2024 | 2024/09/26 | None | Generator Level | T→I |
| Score forgetting distillation: A swift, data-free method for machine unlearning in diffusion models | Arxiv 2024 | 2024/09/17 | None | Generator Level | T→I |
| EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts | Arxiv 2024 | 2024/08/02 | None | Generator Level | T→I |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | ICML GenLaw workshop 2024 | 2024/07/17 | None | Generator Level | T→I |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | ECCV 2024 | 2024/07/17 | Github | Generator Level | T→I |
| Conceptprune: Concept editing in diffusion models via skilled neuron pruning | Arxiv 2024 | 2024/05/29 | Github | Generator Level | T→I |
| Pruning for Robust Concept Erasing in Diffusion Models | Arxiv 2024 | 2024/05/26 | None | Generator Level | T→I |
| Defensive unlearning with adversarial training for robust concept erasure in diffusion models | NeurIPS 2024 | 2024/05/24 | Github | Encoder Level | T→I |
| Unlearning concepts in diffusion model via concept domain correction and concept preserving gradient | Arxiv 2024 | 2024/05/24 | None | Generator Level | T→I |
| Espresso: Robust Concept Filtering in Text-to-Image Models | Arxiv 2024 | 2024/04/30 | None | Output Level | T→I |
| Latent Guard: a Safety Framework for Text-to-image Generation | ECCV 2024 | 2024/04/11 | Github | Encoder Level | T→I |
| SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models | ACM CCS 2024 | 2024/04/10 | Github | Generator Level | T→I |
| Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation | ICLR 2024 | 2024/04/04 | Github | Generator Level | T→I |
| GuardT→I: Defending Text-to-Image Models from Adversarial Prompts | NeurIPS 2024 | 2024/03/03 | None | Encoder Level | T→I |
| Universal prompt optimizer for safe text-to-image generation | NAACL 2024 | 2024/02/16 | None | Input Level | T→I |
| Erasediff: Erasing data influence in diffusion models | Arxiv 2024 | 2024/01/11 | None | Generator Level | T→I |
| Localization and manipulation of immoral visual cues for safe text-to-image generation | WACV 2024 | 2024/01/01 | None | Output Level | T→I |
| Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers | ECCV 2024 | 2023/11/29 | None | Generator Level | T→I |
| Self-discovering interpretable diffusion latent directions for responsible text-to-image generation | CVPR 2024 | 2023/11/28 | Github | Encoder Level | T→I |
| Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models | ECCV 2024 | 2023/11/27 | Github | Encoder Level | T→I |
| Mace: Mass concept erasure in diffusion models | CVPR 2024 | 2023/10/19 | Github | Generator Level | T→I |
| Implicit concept removal of diffusion models | ECCV 2024 | 2023/10/09 | None | Input Level | T→I |
| Unified concept editing in diffusion models | WACV 2024 | 2023/08/25 | Github | Generator Level | T→I |
| Towards safe self-distillation of internet-scale text-to-image diffusion models | ICML 2023 Workshop on Challenges in Deployable Generative AI | 2023/07/12 | Github | Generator Level | T→I |
| Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | CVPR 2024 | 2023/05/30 | Github | Generator Level | T→I |
| Erasing concepts from diffusion models | ICCV 2023 | 2023/05/13 | Github | Generator Level | T→I |
| Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | CVPR 2023 | 2022/11/09 | Github | Generator Level | T→I |

Jailbreak Defense of Any-to-Any Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|---|---|---|---|---|---|

💯Evaluation

⭐️Evaluation Datasets

Below are comparison tables of publicly available, representative evaluation datasets.

Used for Any-to-Text Models

| Dataset | Text Source | Image Source | Volume | Theme | Access |
|---|---|---|---|---|---|
| Figstep | Synthesized | Adversarial | 500 | 10 | Github |
| AdvBench | Synthesized | --- | 500 | --- | Github |
| RedTeam-2K | Collected & Reconstructed & Synthesized | N/A | 2000 | 16 | Huggingface |
| HarmBench | Collected | --- | 510 | 4 | Github |
| HADES | Synthesized | Collected & Synthesized & Adversarial | 750 | 5 | Github |
| MM-SafetyBench | Synthesized | Synthesized & Adversarial | 5040 | 13 | Github |
| JailBreakV-28K | Adversarial | Reconstructed & Synthesized | 28000 | 16 | Huggingface |

Used for Any-to-Vision Models

| Dataset | Text Source | Image Source | Volume | Theme | Access |
|---|---|---|---|---|---|
| NSFW-200 | Synthesized | --- | 200 | --- | Github |
| MMA | Reconstructed & Adversarial | Adversarial | 1000 | --- | Huggingface |
| VBCDE | Reconstructed & Adversarial | --- | 100 | 5 | Github |
| I2P | Collected | Collected | 4703 | 7 | Huggingface |
| Unsafe Diffusion | Collected & Reconstructed | --- | 1434 | --- | Github |
| MACE-Celebrity | Collected | --- | 1000 | --- | Github |
| MACE-Art | Reconstructed | --- | 1000 | --- | Github |
| MPUP | Synthesized | --- | 1200 | 4 | Huggingface |
| T2VSafetyBench | Reconstructed & Synthesized & Adversarial | --- | 4400 | 12 | Github |

📚Evaluation Methods

Current evaluation methods are primarily classified into two categories: manual evaluation and automated evaluation.

<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_evaluation.png" alt="jailbreak_evaluation" width="600" /> <!-- **Detector-based approaches utilize pre-trained classifiers to automatically detect and identify harmful content within generated outputs. These classifiers are trained on large, annotated datasets that cover a range of unsafe categories, such as toxicity, violence, or explicit material. By leveraging these pre-trained models, detector-based methods can efficiently flag inappropriate content.** -->

Text Detector

| Toxicity Detector | Access |
|---|---|
| LLama-Guard | Huggingface |
| LLama-Guard2 | Huggingface |
| Detoxify | Github |
| GPTFUZZER | Huggingface |
| Perspective API | Website |
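As a concrete example of automated evaluation with one of the detectors listed above, the snippet below scores a model response with Detoxify (installable via `pip install detoxify`); the 0.5 threshold is an arbitrary choice for illustration, not a recommended setting.

```python
# Automated text-side evaluation with Detoxify (pip install detoxify).
from detoxify import Detoxify

detector = Detoxify("original")          # pretrained toxicity classifier
response = "Model output to be evaluated goes here."
scores = detector.predict(response)      # dict of per-attribute scores in [0, 1]

# The 0.5 threshold is an arbitrary illustration.
is_unsafe = scores["toxicity"] > 0.5
print(scores)
print("unsafe" if is_unsafe else "safe")
```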

Image Detector

| Toxicity Detector | Access |
|---|---|
| NudeNet | Github |
| Q16 | Github |
| Safety Checker | Huggingface |
| Imgcensor | Github |
| Multi-headed Safety Classifier | Github |
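Generated images can be screened automatically in the same way; the sketch below uses NudeNet from the table above, assuming its v3 Python interface (older releases expose a different API), with an arbitrary confidence threshold for illustration.

```python
# Automated image-side screening with NudeNet (pip install nudenet),
# assuming the v3 interface; older releases expose a different API.
from nudenet import NudeDetector

detector = NudeDetector()
detections = detector.detect("generated_image.png")   # list of {class, score, box} dicts

# Flag the image if any exposed-content class is detected with high confidence;
# the 0.5 threshold is an arbitrary illustration.
is_unsafe = any("EXPOSED" in d["class"] and d["score"] > 0.5 for d in detections)
print(detections)
print("unsafe" if is_unsafe else "safe")
```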

😉Citation

If you find this work useful in your research, please kindly cite it using the following BibTeX:

@article{liu2024jailbreak,
    title={Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey},
    author={Liu, Xuannan and Cui, Xing and Li, Peipei and Li, Zekun and Huang, Huaibo and Xia, Shuhan and Zhang, Miaoxuan and Zou, Yueying and He, Ran},
    journal={arXiv preprint arXiv:2411.09259},
    year={2024},
}