Awesome-Attacks and Defenses on T2I Diffusion Models
This repository is a curated collection of research papers focused on $\textbf{Adversarial Attacks and Defenses on Text-to-Image Diffusion Models (AD-on-T2IDM)}$.
We will continuously update this collection to track the latest advancements in the field of AD-on-T2IDM.
Feel free to follow and star this repository! If you have relevant materials or suggestions, please contact us (zcy@tju.edu.cn) or submit a pull request.
For more detailed information, please refer to our survey paper: [ARXIV], [Published Version]
:bell:News
- 2024-09-12 Our survey "Adversarial Attacks and Defenses on Text-to-Image Diffusion Models" has been accepted by Information Fusion (SCI-1, IF 14.7).
Citation
```bibtex
@article{zhang2024adversarial,
  title={Adversarial attacks and defenses on text-to-image diffusion models: A survey},
  author={Zhang, Chenyu and Hu, Mingwang and Li, Wenhui and Wang, Lanjun},
  journal={Information Fusion},
  pages={102701},
  year={2024},
  publisher={Elsevier}
}
```
Content
<a name="Abstract">Abstract</a>
Recently, the text-to-image diffusion model has gained considerable attention from the community due to its exceptional image generation capability. A representative model, Stable Diffusion, amassed more than 10 million users within just two months of its release. This surge in popularity has facilitated studies on the robustness and safety of the model, leading to the proposal of various adversarial attack methods. Simultaneously, there has been a marked increase in research focused on defense methods to improve the robustness and safety of these models. In this survey, we provide a comprehensive review of the literature on adversarial attacks and defenses targeting text-to-image diffusion models. We begin with an overview of popular text-to-image diffusion models, followed by an introduction to a taxonomy of adversarial attacks and an in-depth review of existing attack methods. We then present a detailed analysis of current defense methods that improve model robustness and safety. Finally, we discuss ongoing challenges and explore promising future research directions.
<a name="Overview">Overview of AD-on-T2IDM</a>
Two key concerns in T2IDM: Robustness and Safety
Robustness ensures that the model generates images whose semantics remain consistent with the diverse prompts that users input in practice.
Safety prevents the model from being misused to create malicious images, such as sexual, violent, or political content.
Adversarial attacks
Based on the intent of the adversary, existing attack methods can be divided into two primary categories: untargeted and targeted attacks.
- For untargeted attacks, consider a scenario with a prompt input by the user (the $\textbf{clean prompt}$) and its corresponding output image (the $\textbf{clean image}$). The objective of an untargeted attack is to subtly perturb the clean prompt into an $\textbf{adversarial prompt}$ that misleads the victim model into generating an $\textbf{adversarial image}$ whose semantics differ from those of the clean image. This type of attack is commonly used to expose robustness vulnerabilities of the victim model (a toy code sketch of this setting follows this list). Some untargeted attacks are shown as follows:
- Targeted attacks assume that the victim model has built-in $\textbf{safeguards}$ to filter $\textbf{malicious prompts}$ and the resulting $\textbf{malicious images}$. Such prompts and images usually contain explicit $\textbf{malicious concepts}$, such as 'nudity', 'violence', and other predefined concepts. The objective of a targeted attack is to craft an $\textbf{adversarial prompt}$ that bypasses these safeguards while still inducing the victim model to generate $\textbf{adversarial images}$ containing the malicious concepts. This type of attack is typically designed to reveal safety vulnerabilities of the victim model. Some targeted attacks are shown as follows:
Defenses
Based on the defense goal, existing defense methods can be classified into two categories: 1) improving model robustness and 2) improving model safety.
- The robustness goal is to ensure that generated images remain semantically consistent with the diverse prompts encountered in practical applications. In line with the attack taxonomy above, defense methods must mitigate robustness vulnerabilities for two types of input prompts: 1) prompts with multiple objects and attributes, and 2) grammatically incorrect prompts with subtle noise.
- The safety goal is to prevent the generation of malicious images in response to both malicious and adversarial prompts: malicious prompts explicitly contain malicious concepts, while adversarial prompts cleverly omit them. Depending on whether they operate outside or inside the model, existing safety methods can be classified into two categories: external safeguards and internal safeguards. External safeguards detect or correct a malicious prompt before it is fed into the text-to-image model. In contrast, internal safeguards modify the model's internal parameters and features so that the semantics of the output images deviate from those of malicious images. Some examples of external and internal safeguards are shown as follows (a minimal code sketch of the external-safeguard pattern follows the figures):
<img src="./picture/external_safeguards.png" alt="external safeguards" style="zoom:50%;" /> <img src="./picture/internal_safeguards.png" alt="internal safeguards" style="zoom:50%;" />
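As a minimal, non-authoritative illustration of the external-safeguard pattern, the sketch below screens a prompt before it ever reaches the generator, assuming the `diffusers` Stable Diffusion pipeline. The keyword blocklist and the model id are only stand-ins: a real deployment would use a trained prompt classifier such as those listed under External Safeguards below.

```python
from diffusers import StableDiffusionPipeline

# Stand-in prompt classifier: a real external safeguard would use a trained
# detector (e.g., an NSFW text classifier), not a keyword blocklist.
BLOCKLIST = {"nudity", "gore", "violence"}  # illustrative only


def is_malicious(prompt: str) -> bool:
    return any(word in prompt.lower() for word in BLOCKLIST)


# Illustrative model id; any text-to-image diffusion pipeline is wrapped the same way.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")


def safe_generate(prompt: str):
    """External safeguard: screen (or rewrite) the prompt before generation."""
    if is_malicious(prompt):
        return None  # refuse, or hand the prompt to a prompt-transformation module
    return pipe(prompt).images[0]
```

Internal safeguards, by contrast, would leave this calling code unchanged and instead edit the pipeline's weights (e.g., concept erasure) or steer its denoising process (inference guidance).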
Notably, although many methods have been proposed to improve model robustness against prompts with multiple objects and attributes, this collection omits papers on that topic, since it is already covered by related surveys on controllable image generation [PDF] and on the development and advancement of image generation capabilities [PDF-1], [PDF-2], [PDF-3]. Moreover, for grammatically incorrect prompts with subtle noise, mature solutions are still lacking. Therefore, this collection mainly focuses on defense methods for improving model safety.
:grinning:<a name="Paper_List">Paper List</a>
:imp:<a name="Adversarial-Attacks">Adversarial Attacks</a>
:collision:<a name="Untargeted-Attacks">Untargeted Attacks</a>
:pouting_cat:<a name="Untargeted-White-Box-Attacks">White-Box Attacks</a>
Stable diffusion is unstable
Chengbin Du, Yanxi Li, Zhongwei Qiu, Chang Xu
A pilot study of query-free adversarial attack against stable diffusion
Haomin Zhuang, Yihua Zhang
:see_no_evil:<a name="Untargeted-Black-Box-Attacks">Black-Box Attacks</a>
Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks
Hongcheng Gao, Hao Zhang, Yinpeng Dong, Zhijie Deng
arxiv 2023. [PDF]
:anger:<a name="Targeted-Attacks">Targeted Attacks</a>
:cyclone:<a name="Targeted-White-Box-Attacks">White-Box Attacks</a>
Red-Teaming the Stable Diffusion Safety Filter
Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr
NeurIPS 2022 Workshop. [PDF]
Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang
ACM CCS 2023. [PDF] [CODE]
Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?
Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie, Chih-Hsun Lin, Jia-You Chen, Bo Li, Pin-Yu Chen, Chia-Mu Yu, Chun-Ying Huang
ICLR 2024. [PDF]
RIATIG: Reliable and imperceptible adversarial text-to-image generation with natural prompts
Han Liu, Yuhao Wu, Shixuan Zhai, Bo Yuan, Ning Zhang
MMA-Diffusion: Multimodal attack on diffusion models
Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, Qiang Xu
Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks
Haz Sameen Shahgir, Xianghao Kong, Greg Ver Steeg, Yue Dong
Revealing vulnerabilities in stable diffusion via targeted attacks
Chenyu Zhang, Lanjun Wang, Anan Liu
To generate or not? Safety-driven unlearned diffusion models are still easy to generate unsafe images... for now
Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu
Prompting4debugging: Red-teaming text-to-image diffusion models by finding problematic prompts
Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu
AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion Models
Yaopei Zeng, Yuanpu Cao, Bochuan Cao, Yurui Chang, Jinghui Chen, Lu Lin
Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models
Jiachen Ma, Anda Cao, Zhiqing Xiao, Jie Zhang, Chao Ye, Junbo Zhao
arxiv 2024. [PDF]
Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation
G M Shahariar, Jia Chen, Jiachen Li, Yue Dong
arxiv 2024. [PDF]
:snake:<a name="Targeted-Black-Box-Attacks">Black-Box Attacks</a>
SneakyPrompt: Evaluating Robustness of Text-to-image Generative Models' Safety Filters
Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao
IEEE S&P 2024. [PDF] [CODE]
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang
FLIRT: Feedback Loop In-context Red Teaming
Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
EMNLP 2024. [PDF]
Jailbreaking Text-to-Image Models with LLM-Based Agents
Yingkai Dong, Zheng Li, Xiangtao Meng, Ning Yu, Shanqing Guo
arxiv 2024. [PDF]
Automatic Jailbreaking of the Text-to-Image Generative AI Systems
Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang
Exploiting cultural biases via homoglyphs in text-to-image synthesis
Lukas Struppek, Dominik Hintersdorf, Felix Friedrich, Patrick Schramowski, Kristian Kersting
Journal of Artificial Intelligence Research 2023. [PDF] [CODE]
Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models
Yimo Deng, Huangxun Chen
arxiv 2024. [PDF]
Groot: Adversarial Testing for Generative Text-to-Image Models with Tree-based Semantic Transformation
Yi Liu, Guowei Yang, Gelei Deng, Feiyue Chen, Yuqi Chen, Ling Shi, Tianwei Zhang, Yang Liu
arxiv 2024. [PDF]
BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators
Yu Tian, Xiao Yang, Yinpeng Dong, Heming Yang, Hang Su, Jun Zhu
arxiv 2024. [PDF]
Black Box Adversarial Prompting for Foundation Models
Natalie Maus, Patrick Chao, Eric Wong, Jacob Gardner
Adversarial Attacks on Image Generation With Made-Up Words
Raphaël Millière
arxiv 2022. [PDF]
SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution
Zhongjie Ba, Jieming Zhong, Jiachen Lei, Peng Cheng, Qinglong Wang, Zhan Qin, Zhibo Wang, Kui Ren
arxiv 2023. [PDF]
RT-Attack: Jailbreaking Text-to-Image Models via Random Token
Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Liu, Qing Guo
arxiv 2024. [PDF]
Perception-guided Jailbreak against Text-to-Image Models
Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu
arxiv 2024. [PDF]
DiffZOO: A Purely Query-Based Black-Box Attack for red-teaming Text-to-Image Generative Model via Zeroth Order Optimization
Pucheng Dang, Xing Hu, Dong Li, Rui Zhang, Kaidi Xu, Qi Guo
arxiv 2024. [PDF]
:pill:<a name="Defenses-for-Improving-Safety">Defenses for Improving Safety</a>
:surfer:<a name="External-Safeguards">External Safeguards</a>
:mountain_bicyclist:<a name="Prompt-Classifier">Prompt Classifier</a>
Latent Guard: a Safety Framework for Text-to-image Generation
Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati
:horse_racing:<a name="Prompt-Transformation">Prompt Transformation</a>
Universal Prompt Optimizer for Safe Text-to-Image Generation
Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang
NAACL 2024. [PDF]
GuardT2I: Defending Text-to-Image Models from Adversarial Prompts
Yijun Yang, Ruiyuan Gao, Xiao Yang, Jianyuan Zhong, Qiang Xu
NeurIPS 2024. [PDF]
:hamburger:<a name="Internal-Safeguards">Internal Safeguards</a>
:fries:<a name="Model-Editing">Model Editing</a>
Erasing concepts from diffusion models
Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, David Bau
Ablating concepts in text-to-image diffusion models
Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, Jun-Yan Zhu
Unified concept editing in diffusion models
Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau
Editing implicit assumptions in text-to-image diffusion models
Hadas Orgad, Bahjat Kawar, Yonatan Belinkov
Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models
Sanghyun Kim, Seohyeon Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, Juho Lee
ICML 2023 Workshop on Challenges in Deployable Generative AI. [PDF] [CODE]
Degeneration-tuning: Using scrambled grid shield unwanted concepts from stable diffusion
Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian
ACM MM 2023. [PDF]
ReFACT: Updating Text-to-Image Models by Editing the Text Encoder
Dana Arad, Hadas Orgad, Yonatan Belinkov
NAACL 2024. [PDF]
Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, Humphrey Shi
One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications
Mengyao Lyu, Yuhong Yang, Haiwen Hong, Hui Chen, Xuan Jin, Yuan He, Hui Xue, Jungong Han, Guiguang Ding
Selective Amnesia: A Continual Learning Approach to Forgetting in Deep Generative Models
Alvin Heng, Harold Soh
All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models
Seunghoo Hong, Juhun Lee, Simon S. Woo
AAAI 2024. [PDF]
SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models
Yong-Hyun Park, Sangdoo Yun, Jin-Hwa Kim, Junho Kim, Geonhui Jang, Yonghyun Jeong, Junghyo Jo, Gayoung Lee
ICML 2024 Workshop. [PDF]
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient
Yongliang Wu, Shiji Zhou, Mingzhuo Yang, Lianzhe Wang, Wenbo Zhu, Heng Chang, Xiao Zhou, Xu Yang
arxiv 2024. [PDF]
R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
Changhoon Kim, Kyle Min, Yezhou Yang
Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers
Chi-Pin Huang, Kai-Po Chang, Chung-Ting Tsai, Yung-Hsuan Lai, Fu-En Yang, Yu-Chiang Frank Wang
Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
Yimeng Zhang, Xin Chen, Jinghan Jia, Yihua Zhang, Chongyu Fan, Jiancheng Liu, Mingyi Hong, Ke Ding, Sijia Liu
Editing Massive Concepts in Text-to-Image Diffusion Models
Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu
:apple:<a name="Inference-Guidance">Inference Guidance</a>
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
Patrick Schramowski, Manuel Brack, Björn Deiseroth, Kristian Kersting
SEGA: Instructing text-to-image models using semantic guidance
Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, Kristian Kersting
Self-discovering interpretable diffusion latent directions for responsible text-to-image generation
Hang Li, Chengzhi Shen, Philip Torr, Volker Tresp, Jindong Gu
<a name="Resources">Resources</a>
This part provides commonly used datasets and tools in AD-on-T2IDM.
<a name="Datasets">Datasets</a>
Based on the prompt source, existing datasets fall into two types: clean and adversarial datasets. A clean dataset consists of clean prompts that are not attacked and are typically crafted by humans, while an adversarial dataset comprises adversarial prompts generated by attack methods. Moreover, according to the category of prompts involved, clean datasets are further divided into non-malicious and malicious datasets: a non-malicious dataset contains non-malicious prompts, while a malicious dataset contains explicitly malicious prompts. This section introduces several non-malicious, malicious, and adversarial datasets.
Non-Malicious Datasets
- $\textit{ImageNet}$, which contains images covering 1,000 categories of common real-world objects, is a standard benchmark in computer vision. As a result, some works craft clean datasets from the category information in ImageNet. For instance, ATM employs the standardized template "A photo of {CLASS_NAME}" to generate clean prompts, where "{CLASS_NAME}" denotes a class name in ImageNet (see the sketch after this list).
- $\textit{MSCOCO}$ [Link] is a cross-modal image-text dataset and a popular benchmark for training and evaluating text-to-image generation models. Specifically, MSCOCO includes 82,783 training images and 40,504 testing images, each with 5 text descriptions.
- $\textit{LAION-COCO}$ [Link] is a subset of LAION-5B, a large-scale real-world image-text dataset. LAION-COCO includes 600 million images with corresponding text descriptions.
- $\textit{DiffusionDB}$ [Link] is a large-scale text-to-image prompt dataset containing 14 million images generated by Stable Diffusion from real-user prompts.
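A minimal sketch of how such template-based clean prompts are assembled from ImageNet class names (the class list below is an illustrative subset, not the real 1,000-class label set):

```python
# Illustrative subset of ImageNet class names; the full label set has 1,000 entries.
imagenet_classes = ["goldfish", "tabby cat", "sports car", "espresso"]

# Apply the "A photo of {CLASS_NAME}" template to every class name.
clean_prompts = [f"A photo of {name}" for name in imagenet_classes]
# ['A photo of goldfish', 'A photo of tabby cat', 'A photo of sports car', 'A photo of espresso']
```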
Malicious Datasets
- $\textit{Unsafe Diffusion}$ [Link] provides 30 manually crafted malicious prompts that describe sexual and bloody content, as well as political figures.
- $\textit{SneakyPrompt}$ [Link] uses ChatGPT to automatically generate 200 malicious prompts that involve sexual and bloody content.
- $\textit{I2P}$ [Link] comprises 4,703 inappropriate prompts, encompassing hate, harassment, violence, self-harm, nudity content, shocking images, and illegal activity. These inappropriate prompts are real-user inputs sourced from an image generation website, Lexica [Link].
- $\textit{MMA}$ [Link] samples and releases 1,000 malicious prompts from LAION-COCO based on an NSFW (Not Safe for Work) score. These malicious prompts mainly focus on sexual content.
- $\textit{ART}$ [Link] follows I2P and collects 15,607 malicious prompts from 7 categories in Lexica [Link].
- $\textit{Image Synthesis Style Studies Database}$ [Link] compiles thousands of artists whose styles can be replicated by various text-to-image models, such as Stable Diffusion and Midjourney.
- $\textit{MACE}$ [Link] provides a dataset comprising 200 celebrities whose portraits, generated using SD v1.4, are recognized with remarkable accuracy (>99%) by the GIPHY Celebrity Detector (GCD) [Link].
- $\textit{ViSU}$ [Link] contains 175k pairs of safe and unsafe data examples. Each example consists of: (1) a safe sentence, (2) a corresponding safe image, (3) an NSFW sentence that is semantically correlated with the safe sentence, and (4) a corresponding NSFW image.
Adversarial Datasets
- $\textit{Adversarial Nibbler Dataset}$ [Link] consists of 3,412 adversarial prompts that effectively bypass safeguards while inducing text-to-image models to generate malicious images. These prompts, which include violent, sexual, biased, and hate-based material, are manually crafted during the Adversarial Nibbler Challenge.
- $\textit{MMA}$ [Link] targets 1,000 malicious prompts and generates 1,000 corresponding adversarial prompts using the proposed attack method. These adversarial prompts primarily focus on sexual content.
- $\textit{Zhang et al.}$ [Link] target 10 objects as malicious concepts and generate 500 adversarial prompts for each object. These adversarial prompts can induce the text-to-image model to produce images related to the malicious concepts even when the prompts exclude words directly related to them.
<a name="Tools">Tools</a>
We list several detectors for malicious prompts and images.
Malicious Prompt Detector
- NSFW_text_classifier: [Link]
- distilbert-nsfw-text-classifier: [Link]
- Detoxify: [Link] (a usage sketch follows this list)
- Toxic-comment-model: [Link]
- Meta-Llama-Guard: [Link] (LLM evaluation)
- Openai-Moderation: [Link] (API)
- Azure-Moderation: [Link] (API)
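As a usage sketch for one of the listed prompt detectors, the snippet below flags a prompt with Detoxify (`pip install detoxify`); the 0.5 threshold is an illustrative assumption, not a recommended setting.

```python
from detoxify import Detoxify

detector = Detoxify("original")  # downloads the pretrained toxicity model


def flag_prompt(prompt: str, threshold: float = 0.5) -> bool:
    """Return True if any toxicity-related score exceeds the threshold."""
    scores = detector.predict(prompt)  # dict of category -> probability
    return any(score > threshold for score in scores.values())


print(flag_prompt("a peaceful landscape painting at sunset"))  # expected: False
```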