
🛡️Awesome-Jailbreak-against-Multimodal-Generative-Models


🤗Introduction

Welcome to Awesome-Jailbreak-against-Multimodal-Generative-Models! This repository provides a comprehensive overview of jailbreak vulnerabilities in multimodal generative models. 🥰🥰🥰

We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on jailbreak attacks against and defenses for multimodal generative models.<br> And we don't stop there: the repository is constantly updated to ensure you have the most current information at your fingertips.

If a resource is relevant to multiple subcategories, we place it under each applicable section.<br>

🧑‍💻 Our Work

✔️ Perfect for Majority

🧭 How to Use this Guide

🚀Table of Contents

🔥Multimodal Generative Models

Below are tables of model short names and representative generative models used in jailbreak research. Input/output modalities are abbreviated as I: Image, T: Text, V: Video, A: Audio.

📑Any-to-Text (LLM Backbone)

| Short Name | Modality | Representative Model |
|------------|----------|----------------------|
| IT2T | I + T -> T | LLaVA, MiniGPT4, InstructBLIP |
| VT2T | V + T -> T | Video-LLaVA, Video-LLaMA |
| AT2T | A + T -> T | Audio Flamingo, AudioPaLM |

📖Any-to-Vision (Diffusion Backbone)

| Short Name | Modality | Representative Model |
|------------|----------|----------------------|
| T2I | T -> I | Stable Diffusion, Midjourney, DALLE |
| IT2I | I + T -> I | DreamBooth, InstructP2P |
| T2V | T -> V | Open-Sora, Stable Video Diffusion |
| IT2V | I + T -> V | VideoPoet, CogVideoX |

📰Any-to-Any (Unified Backbone)

| Short Name | Modality | Representative Model |
|------------|----------|----------------------|
| IT2IT | I + T -> I + T | Next-GPT, Chameleon |
| A2A | A -> A | GPT-4o |

😈JailBreak Attack

📖Attack-Intro

This part discusses advanced jailbreak attacks against multimodal models, which we categorize into black-box, gray-box, and white-box attacks. In a black-box setting the model is inaccessible to the attacker, so the attack is limited to surface-level interactions, focusing solely on the model’s input and/or output. For gray-box and white-box attacks, we consider model-level attacks, including attacks on both the encoder and the generator.

<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_attack_A.png" alt="jailbreak_attack_black_box" />

As shown in Fig. A.1, attackers craft increasingly sophisticated input templates using prompt engineering, image engineering, and role-play techniques. These templates can bypass the model’s safeguards, making it more likely to execute prohibited instructions. <br> As shown in Fig. A.2, attackers instead query the model's outputs across multiple input variants: driven by a specific adversarial goal, they employ estimation-based and search-based attack techniques to iteratively refine these variants.
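
To make the query-based setting of Fig. A.2 concrete, below is a minimal sketch of a greedy search loop as used in red-teaming evaluations. It assumes two hypothetical helpers that are not from any specific paper in the tables below: `query_model` (the black-box target's API) and `response_score` (an attacker-side judge of how close a response is to the adversarial goal).

```python
import random

def query_model(image, prompt):
    """Hypothetical black-box target: send (image, prompt) and return the text response."""
    raise NotImplementedError("connect this to the model under evaluation")

def response_score(response):
    """Hypothetical judge: score in [0, 1] of how closely the response matches the goal."""
    raise NotImplementedError("e.g., a toxicity classifier or an LLM-based judge")

def mutate(prompt, word_pool):
    """Create a new input variant by replacing one randomly chosen word."""
    words = prompt.split()
    words[random.randrange(len(words))] = random.choice(word_pool)
    return " ".join(words)

def greedy_search_attack(image, base_prompt, word_pool, iterations=50):
    """Iteratively refine the input variant, keeping only changes that improve the score."""
    best_prompt = base_prompt
    best_score = response_score(query_model(image, best_prompt))
    for _ in range(iterations):
        candidate = mutate(best_prompt, word_pool)
        score = response_score(query_model(image, candidate))
        if score > best_score:  # keep only variants that move closer to the goal
            best_prompt, best_score = candidate, score
    return best_prompt, best_score
```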

<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreal_attack_B.png" alt="jailbreak_attack_white_and_gray_box" />

As shown in Fig. B.1, attackers are restricted to accessing only the encoders when provoking harmful responses. In this case, they typically seek to maximize cosine similarity within the latent space, ensuring the adversarial input retains semantics similar to the target malicious content while still being classified as safe. <br> As shown in Fig. B.2, attackers have unrestricted access to the generative model’s architecture and checkpoint, which allows thorough investigation and manipulation of the model and thus enables sophisticated attacks.
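
As a minimal sketch of the encoder-level objective above, assuming white-box access to a differentiable PyTorch image encoder (e.g., a CLIP vision tower); the tensor shapes and hyperparameters are illustrative, not values from any particular paper:

```python
import torch
import torch.nn.functional as F

def encoder_level_attack(image, target_embedding, image_encoder,
                         steps=100, epsilon=8 / 255, alpha=1 / 255):
    """PGD-style sketch: perturb `image` ([1, 3, H, W], values in [0, 1]) within an
    L-infinity ball of radius `epsilon` so that its embedding maximizes cosine
    similarity with `target_embedding`."""
    adv = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        similarity = F.cosine_similarity(image_encoder(adv), target_embedding, dim=-1).mean()
        similarity.backward()                      # gradient of the similarity w.r.t. the pixels
        with torch.no_grad():
            adv += alpha * adv.grad.sign()         # ascent step on the cosine similarity
            adv.copy_(torch.min(torch.max(adv, image - epsilon), image + epsilon).clamp(0, 1))
            adv.grad.zero_()
    return adv.detach()
```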

📑Papers

Below are the papers related to jailbreak attacks.

Jailbreak Attack of Any-to-Text Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|-------|-------|------|------|----------|------------------|
| Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models | ICLR 2024 Spotlight | 2024/01/16 | Github | Encoder Level | IT2T |
| FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts | Arxiv 2023 | 2023/11/09 | Github | Input Level | IT2T |
| Jailbreaking Attack against Multimodal Large Language Model | Arxiv 2024 | 2024/02/04 | None | Generator Level | IT2T |
| Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | ECCV 2024 | 2024/05/14 | Github | Generator Level | IT2T |
| Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character | Arxiv 2024 | 2024/05/25 | None | Input Level | IT2T |
| Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | Arxiv 2024 | 2024/06/06 | Github | Generator Level | IT2T |
| Image Hijacks: Adversarial Images can Control Generative Models at Runtime | Arxiv 2024 | 2024/09/01 | Github | Generator Level | IT2T |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | ACM MM 2024 | 2024/07/21 | None | Input Level | IT2T |
| Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | ICML 2024 | 2024/02/13 | Github | Decoder Level | IT2T |
| Visual Adversarial Examples Jailbreak Aligned Large Language Models | AAAI 2024 | 2023/06/22 | Github | Generator Level | IT2T |
| Are aligned neural networks adversarially aligned? | Arxiv 2023 | 2023/06/26 | None | Generator Level | IT2T |
| Voice Jailbreak Attacks Against GPT-4o | Arxiv 2024 | 2024/05/29 | Github | Input Level | IT2T |
| Efficient LLM-Jailbreaking by Introducing Visual Modality | Arxiv 2024 | 2024/05/30 | None | Generator Level | IT2T |
| ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | Arxiv 2024 | 2024/05/05 | Github | Generator Level | IT2T |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Arxiv 2023 | 2023/11/15 | None | Input Level | IT2T |
| White-box Multimodal Jailbreaks Against Large Vision-Language Models | Arxiv 2024 | 2024/05/28 | None | Generator Level | IT2T |
| From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking | Arxiv 2024 | 2024/06/21 | None | Encoder Level | IT2T |
| Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | Arxiv 2024 | 2024/07/01 | None | Input Level | IT2T |
| Can Large Language Models Automatically Jailbreak GPT-4V? | CCS 2024 | 2024/07/23 | None | Input Level | IT2T |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | ECCV 2024 | 2023/11/27 | Github | Encoder Level | IT2T |
| Advweb: Controllable black-box attacks on vlm-powered web agents | Arxiv 2024 | 2024/10/22 | None | Input Level | IT2T |
| Audio is the Achilles' heel: Red teaming audio large multimodal models | Arxiv 2024 | 2024/10/31 | None | Input Level | AT2T |

Jailbreak Attack of Any-to-Vision Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|-------|-------|------|------|----------|------------------|
| MMA-Diffusion: MultiModal Attack on Diffusion Models | CVPR 2024 | 2023/11/29 | Github | Encoder Level | T2I |
| Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? | ICLR 2024 | 2023/10/16 | Github | Encoder Level | T2I |
| Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models | Arxiv 2024 | 2024/08/02 | None | Encoder Level | T2I |
| To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now | ECCV 2024 | 2023/10/18 | Github | Generator Level | T2I |
| Perception-guided Jailbreak against Text-to-Image Models | Arxiv 2024 | 2024/08/20 | None | Input Level | T2I |
| SneakyPrompt: Jailbreaking Text-to-image Generative Models | Symposium on Security and Privacy 2024 | 2023/05/20 | Github | Output Level | T2I |
| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Arxiv 2024 | 2024/08/18 | None | Output Level | T2I |
| BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators | Arxiv 2024 | 2024/02/23 | None | Encoder Level | T2I |
| Red-Teaming the Stable Diffusion Safety Filter | NeurIPSW 2022 | 2022/10/03 | None | Input Level | T2I |
| UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers | ICML 2024 | 2024/05/18 | None | Input Level | T2I |
| Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | ICML 2024 | 2023/09/12 | Github | Generator Level | T2I |
| Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models | Arxiv 2023 | 2023/12/12 | Github | Input Level | T2I |
| SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution | CCS 2024 | 2023/09/25 | Github | Input Level | T2I |
| Automatic Jailbreaking of the Text-to-Image Generative AI Systems | Arxiv 2024 | 2024/05/26 | None | Input Level | T2I |
| Jailbreaking Text-to-Image Models with LLM-Based Agents | Arxiv 2024 | 2024/08/01 | None | Input Level | T2I |
| VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models | CVPR 2024 | 2023/11/29 | Github | Encoder Level | T2I |
| RT-Attack: Jailbreaking Text-to-Image Models via Random Token | Arxiv 2024 | 2024/08/25 | None | Encoder Level | T2I |

Jailbreak Attack of Any-to-Any Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|-------|-------|------|------|----------|------------------|
| Voice Jailbreak Attacks Against GPT-4o | Arxiv 2024 | 2024/05/29 | Github | Input Level | A2A |

🛡️Jailbreak Defense

📖Defense-Intro

To cope with jailbreak attacks and improve the security of multimodal foundation models, existing work pursues two complementary directions: discriminative defense and transformative defense.

<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_discriminative_defense_00.png" alt="jailbreak_discriminative_defense" />

Discriminative defenses focus on identifying and analyzing classification cues at different levels, such as statistical information at the input level, embeddings at the encoder level, activations at the generator level, and response discrepancies at the output level.
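
For instance, an encoder-level discriminative check can compare the embedding of an incoming query against a bank of known-unsafe embeddings. The sketch below is a minimal illustration; the 0.75 threshold and tensor shapes are assumptions, not values from any specific defense in the tables below.

```python
import torch
import torch.nn.functional as F

def encoder_level_detector(query_embedding: torch.Tensor,
                           unsafe_bank: torch.Tensor,
                           threshold: float = 0.75) -> bool:
    """Flag a query whose encoder embedding is close to any known-unsafe reference.

    query_embedding: (D,) embedding of the incoming image or text.
    unsafe_bank:     (N, D) embeddings of curated unsafe examples.
    threshold:       illustrative cut-off; in practice tuned on validation data.
    """
    similarities = F.cosine_similarity(query_embedding.unsqueeze(0), unsafe_bank, dim=-1)
    return bool((similarities > threshold).any())
```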

<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_defense_all_00.png" alt="jailbreak_transformative_defense" />

Transformative defenses operate at four levels to influence the model’s generation process, ensuring benign responses even in the presence of adversarial or malicious prompts.
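
At the input level, the simplest transformative defense rewrites the request before it reaches the model, e.g. by prepending a defensive instruction. The snippet below is only a sketch with an illustrative prompt; methods such as AdaShield learn or adaptively select their defense prompts rather than using a fixed string.

```python
# Illustrative, fixed defense prefix (methods in the tables below refine this adaptively).
SAFETY_PREFIX = (
    "Before answering, examine whether the image or the text requests harmful, illegal, "
    "or policy-violating content. If it does, refuse and briefly explain why."
)

def transform_input(user_prompt: str) -> str:
    """Return the transformed prompt that is actually sent to the generative model."""
    return f"{SAFETY_PREFIX}\n\nUser request: {user_prompt}"

print(transform_input("Describe the activity shown in the attached image."))
```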

📑Papers

Below are the papers related to jailbreak defense.

Jailbreak Defense of Any-to-Text Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|-------|-------|------|------|----------|------------------|
| Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting | ECCV 2024 | 2024/05/14 | Github | Input Level | IT2T |
| Sim-clip: Unsupervised siamese adversarial fine-tuning for robust and semantically-rich vision-language models | Arxiv 2024 | 2024/07/20 | None | Encoder Level | IT2T |
| Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks | Arxiv 2024 | 2024/09/11 | None | Encoder Level | IT2T |
| Safety fine-tuning at (almost) no cost: A baseline for vision large language models | ICML 2024 | 2024/02/03 | Github | Generator Level | IT2T |
| Safety alignment for vision language models | Arxiv 2024 | 2024/05/22 | None | Generator Level | IT2T |
| Bathe: Defense against the jailbreak attack in multimodal large language models by treating harmful instruction as backdoor trigger | Arxiv 2024 | 2024/08/17 | None | Generator Level | IT2T |
| Cross-modal safety alignment: Is textual unlearning all you need? | Arxiv 2024 | 2024/05/27 | None | Generator Level | IT2T |
| Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions | ICLR 2024 | 2023/09/14 | Github | Generator Level | IT2T |
| Mllm-protector: Ensuring mllm’s safety without hurting performance | Arxiv 2024 | 2024/01/05 | Github | Output Level | IT2T |
| Information-theoretical principled trade-off between jailbreakability and stealthiness on vision language models | Arxiv 2024 | 2024/10/02 | None | Input Level | IT2T |
| Defending jailbreak attack in vlms via cross-modality information detector | Arxiv 2024 | 2024/07/31 | Github | Encoder Level | IT2T |
| Inferaligner: Inference-time alignment for harmlessness through cross-model guidance | Arxiv 2024 | 2024/01/20 | Github | Encoder Level | IT2T |
| Jailguard: A universal detection framework for llm prompt-based attacks | Arxiv 2023 | 2023/12/17 | None | Encoder Level | IT2T |
| Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector | Arxiv 2024 | 2024/10/30 | None | Generator Level | IT2T |
| BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks | Arxiv 2024 | 2024/10/28 | None | Input Level | IT2T |
| Uniguard: Towards universal safety guardrails for jailbreak attacks on multimodal large language models | Arxiv 2024 | 2024/11/03 | None | Input Level | IT2T |

Jailbreak Defense of Any-to-Vision Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|-------|-------|------|------|----------|------------------|
| EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts | Arxiv 2024 | 2024/08/02 | None | Generator Level | T2I |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | ICML GenLaw Workshop 2024 | 2024/07/17 | None | Encoder Level | T2I |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | ECCV 2024 | 2024/07/17 | Github | Generator Level | T2I |
| Pruning for Robust Concept Erasing in Diffusion Models | Arxiv 2024 | 2024/05/26 | None | Generator Level | T2I |
| SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models | ACM CCS 2024 | 2024/04/10 | Github | Generator Level | T2I |
| GuardT2I: Defending Text-to-Image Models from Adversarial Prompts | NeurIPS 2024 | 2024/03/03 | None | Input Level | T2I |
| Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models | ECCV 2024 | 2023/11/27 | Github | Encoder Level | T2I |
| Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | CVPR 2024 | 2023/05/30 | Github | Generator Level | T2I |
| Latent Guard: a Safety Framework for Text-to-image Generation | ECCV 2024 | 2024/04/11 | Github | Encoder Level | T2I |
| Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | CVPR 2023 | 2022/11/09 | Github | Generator Level | T2I |
| Espresso: Robust Concept Filtering in Text-to-Image Models | Arxiv 2024 | 2024/04/30 | None | Output Level | T2I |
| Self-discovering interpretable diffusion latent directions for responsible text-to-image generation | CVPR 2024 | 2023/11/28 | Github | Input Level | T2I |
| Implicit concept removal of diffusion models | ECCV 2024 | 2023/10/09 | None | Input Level | T2I |
| Universal prompt optimizer for safe text-to-image generation | Arxiv 2024 | 2024/02/16 | None | Input Level | T2I |
| Defensive unlearning with adversarial training for robust concept erasure in diffusion models | NeurIPS 2024 | 2024/05/24 | Github | Encoder Level | T2I |
| Unlearning concepts in diffusion model via concept domain correction and concept preserving gradient | Arxiv 2024 | 2024/05/24 | None | Encoder Level | T2I |
| Erasediff: Erasing data influence in diffusion models | Arxiv 2024 | 2024/01/11 | None | Generator Level | T2I |
| Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation | ICLR 2024 | 2024/04/04 | Github | Generator Level | T2I |
| Mace: Mass concept erasure in diffusion models | CVPR 2024 | 2023/10/19 | Github | Generator Level | T2I |
| Unified concept editing in diffusion models | WACV 2024 | 2023/08/25 | Github | Generator Level | T2I |
| Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers | ECCV 2024 | 2023/11/29 | None | Generator Level | T2I |
| Dark miner: Defend against unsafe generation for text-to-image diffusion models | Arxiv 2024 | 2024/09/26 | None | Generator Level | T2I |
| Score forgetting distillation: A swift, data-free method for machine unlearning in diffusion models | Arxiv 2024 | 2024/09/17 | None | Generator Level | T2I |
| Towards safe self-distillation of internet-scale text-to-image diffusion models | ICML 2023 Workshop on Challenges in Deployable Generative AI | 2023/07/12 | Github | Generator Level | T2I |
| Erasing concepts from diffusion models | ICCV 2023 | 2023/05/13 | Github | Generator Level | T2I |
| Conceptprune: Concept editing in diffusion models via skilled neuron pruning | Arxiv 2024 | 2024/05/29 | Github | Generator Level | T2I |
| Localization and manipulation of immoral visual cues for safe text-to-image generation | WACV 2024 | 2024 | None | Output Level | T2I |

Jailbreak Defense of Any-to-Any Models

| Title | Venue | Date | Code | Taxonomy | Multimodal Model |
|-------|-------|------|------|----------|------------------|
| Safree: Training-free and adaptive guard for safe text-to-image and video generation | Arxiv 2024 | 2024/10/16 | None | Output Level | T2V |

💯Resources

⭐️Datasets

Used for Any-to-Text Models

| Dataset | Task | Text Source | Image Source | Volume | Access |
|---------|------|-------------|--------------|--------|--------|
| SafeBench | Attack | GPT generation | Typography | 500 | Github |
| AdvBench | Attack | LLM generation | N/A | 500 | Github |
| RedTeam-2K | Attack | Exist. & GPT Generation | N/A | 2000 | Huggingface |
| HarmBench | Attack & Defense | Unpublished | N/A | 320 | Github |
| HADES | Defense | GPT generation | Typography & Diffusion Generation | 750 | Github |
| MM-SafetyBench | Defense | GPT generation | Typography & Diffusion Generation | 5040 | Github |
| JailBreakV-28K | Defense | Adv. Prompt on RedTeam-2K | Blank & Noise & Natural & Synthesize | 28000 | Huggingface |
| VLGuard | Defense | GPT generation | Exist. | 3000 | Huggingface |

Used for Any-to-Vision Models

| Dataset | Task | Text Source | Image Source | Volume | Access |
|---------|------|-------------|--------------|--------|--------|
| NSFW-200 | Attack | Human curation | N/A | 200 | Github |
| MMA | Attack | Exist. & Adv. Prompt | N/A | 1000 | Huggingface |
| VBCDE-100 | Attack | Human curation | N/A | 100 | Github |
| I2P | Attack & Defense | Real-world Website | Real-world Website | 4703 | Huggingface |
| Unsafe Diffusion | Defense | Human curation & Website & Exist. | N/A | 1434 | Github |
| MACE | Defense | Human curation | Diffusion Generation | 200 | Github |

📚Detectors

<img src="https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak/blob/main/pic/jailbreak_evaluation_00.png" alt="jailbreak_evaluation" width="600" />

Detector-based approaches utilize pre-trained classifiers to automatically detect and identify harmful content within generated outputs. These classifiers are trained on large, annotated datasets that cover a range of unsafe categories, such as toxicity, violence, or explicit material. By leveraging these pre-trained models, detector-based methods can efficiently flag inappropriate content.
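
As a small usage sketch, the snippet below screens a generated response with Detoxify, one of the text toxicity detectors listed in the next table; the `"toxicity"` key and the 0.5 threshold are illustrative choices and may need adjusting for your Detoxify version and use case.

```python
# pip install detoxify
from detoxify import Detoxify

detector = Detoxify("original")  # loads a pre-trained multi-label toxicity classifier

def flag_unsafe(generated_text: str, threshold: float = 0.5) -> bool:
    """Return True if the detector's toxicity score exceeds the (illustrative) threshold."""
    scores = detector.predict(generated_text)  # dict of per-category scores in [0, 1]
    return scores["toxicity"] > threshold

print(flag_unsafe("Have a wonderful day!"))
```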

Used for Any-to-Text Models

| Toxicity Detector | Access |
|-------------------|--------|
| Llama-Guard | Huggingface |
| Llama-Guard2 | Huggingface |
| Detoxify | Github |
| GPTFUZZER | Huggingface |
| Perspective API | Website |

Used for Any-to-Vision Models

| Toxicity Detector | Access |
|-------------------|--------|
| NudeNet | Github |
| Q16 | Github |
| Safety Checker | Huggingface |
| Imgcensor | Github |
| Multi-headed Safety Classifier | Github |