Awesome-Attacks and Defenses on T2I Diffusion Models
This repository is a curated collection of research papers focused on $\textbf{Adversarial Attacks and Defenses on Text-to-Image Diffusion Models (AD-on-T2IDM)}$.
We will continuously update this collection to track the latest advancements in the field of AD-on-T2IDM.
Feel free to follow and star this repository! If you have relevant materials or suggestions, please contact us (zcy@tju.edu.cn) or submit a pull request.
For more detailed information, please refer to our survey paper: [ARXIV], [Published Version]
:bell:News
- 2024-09-12 Our survey "Adversarial Attacks and Defenses on Text-to-Image Diffusion Models" has been accepted by Information Fusion (SCI-1, IF 14.7).
Citation
```bibtex
@article{zhang2024adversarial,
  title={Adversarial attacks and defenses on text-to-image diffusion models: A survey},
  author={Zhang, Chenyu and Hu, Mingwang and Li, Wenhui and Wang, Lanjun},
  journal={Information Fusion},
  pages={102701},
  year={2024},
  publisher={Elsevier}
}
```
Content
<a name="Abstract">Abstract</a>
Recently, the text-to-image diffusion model has gained considerable attention from the community due to its exceptional image generation capability. A representative model, Stable Diffusion, amassed more than 10 million users within just two months of its release. This surge in popularity has facilitated studies on the robustness and safety of the model, leading to the proposal of various adversarial attack methods. Simultaneously, there has been a marked increase in research focused on defense methods to improve the robustness and safety of these models. In this survey, we provide a comprehensive review of the literature on adversarial attacks and defenses targeting text-to-image diffusion models. We begin with an overview of popular text-to-image diffusion models, followed by an introduction to a taxonomy of adversarial attacks and an in-depth review of existing attack methods. We then present a detailed analysis of current defense methods that improve model robustness and safety. Finally, we discuss ongoing challenges and explore promising future research directions.
<a name="Overview">Overview of AD-on-T2IDM</a>
Two key concerns in T2IDM: Robustness and Safety
Robustness ensures that the model generates images whose semantics remain consistent with the diverse prompts that users input in practice.
Safety prevents the model from being misused to create malicious images, such as sexual, violent, or political content.
Adversarial attacks
Based on the intent of the adversary, existing attack methods can be divided into two primary categories: untargeted and targeted attacks.
- For untargeted attacks, consider a scenario with a prompt input by the user (the $\textbf{clean prompt}$) and its corresponding output image (the $\textbf{clean image}$). The objective of an untargeted attack is to subtly perturb the clean prompt into an $\textbf{adversarial prompt}$ that misleads the victim model into generating an $\textbf{adversarial image}$ whose semantics differ from those of the clean image. This type of attack is commonly used to expose robustness vulnerabilities of the victim model (a toy code sketch of this setting follows this list). Some untargeted attacks are shown as follows:
- Targeted attacks assume that the victim model has built-in $\textbf{safeguards}$ to filter $\textbf{malicious prompts}$ and the resulting $\textbf{malicious images}$. Such prompts and images usually contain explicit $\textbf{malicious concepts}$, such as 'nudity', 'violence', and other predefined concepts. The objective of a targeted attack is to craft an $\textbf{adversarial prompt}$ that bypasses these safeguards while still inducing the victim model to generate $\textbf{adversarial images}$ containing the malicious concepts. This type of attack is typically designed to reveal safety vulnerabilities of the victim model. Some targeted attacks are shown as follows:
Defenses
Based on the defense goal, existing defense methods can be classified into two categories: 1) improving model robustness and 2) improving model safety.
- The robustness goal is to ensure that generated images remain semantically consistent with the diverse prompts encountered in practical applications. In line with the attack taxonomy above, defense methods must mitigate robustness vulnerabilities for two types of input prompts: 1) prompts with multiple objects and attributes, and 2) grammatically incorrect prompts with subtle noise.
- The safety goal is to prevent the generation of malicious images in response to both malicious and adversarial prompts: malicious prompts explicitly contain malicious concepts, while adversarial prompts cleverly omit them. Depending on whether they operate outside or inside the model, existing safety methods can be classified into two categories: external safeguards and internal safeguards. External safeguards detect or correct a malicious prompt before it is fed into the text-to-image model. In contrast, internal safeguards modify the model's internal parameters and features so that the semantics of the output images deviate from those of malicious images. Some examples of external and internal safeguards are shown as follows (a minimal code sketch of the external-safeguard pattern follows the figures):
<img src="./picture/external_safeguards.png" alt="external safeguards" style="zoom:50%;" /> <img src="./picture/internal_safeguards.png" alt="internal safeguards" style="zoom:50%;" />
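As a minimal, non-authoritative illustration of the external-safeguard pattern, the sketch below screens a prompt before it ever reaches the generator, assuming the `diffusers` Stable Diffusion pipeline. The keyword blocklist and the model id are only stand-ins: a real deployment would use a trained prompt classifier such as those listed under External Safeguards below.

```python
from diffusers import StableDiffusionPipeline

# Stand-in prompt classifier: a real external safeguard would use a trained
# detector (e.g., an NSFW text classifier), not a keyword blocklist.
BLOCKLIST = {"nudity", "gore", "violence"}  # illustrative only


def is_malicious(prompt: str) -> bool:
    return any(word in prompt.lower() for word in BLOCKLIST)


# Illustrative model id; any text-to-image diffusion pipeline is wrapped the same way.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")


def safe_generate(prompt: str):
    """External safeguard: screen (or rewrite) the prompt before generation."""
    if is_malicious(prompt):
        return None  # refuse, or hand the prompt to a prompt-transformation module
    return pipe(prompt).images[0]
```

Internal safeguards, by contrast, would leave this calling code unchanged and instead edit the pipeline's weights (e.g., concept erasure) or steer its denoising process (inference guidance).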
Notably, although many methods have been proposed to improve model robustness against prompts with multiple objects and attributes, this collection omits papers on that topic, since it is already covered by related surveys on controllable image generation [PDF] and on the development and advancement of image generation capabilities [PDF-1], [PDF-2], [PDF-3]. Moreover, for grammatically incorrect prompts with subtle noise, mature solutions are still lacking. Therefore, this collection mainly focuses on defense methods for improving model safety.
:grinning:<a name="Paper_List">Paper List</a>
:imp:<a name="Adversarial-Attacks">Adversarial Attacks</a>
:collision:<a name="Untargeted-Attacks">Untargeted Attacks</a>
:pouting_cat:<a name="Untargeted-White-Box-Attacks">White-Box Attacks</a>
Stable diffusion is unstable
Chengbin Du, Yanxi Li, Zhongwei Qiu, Chang Xu
A pilot study of query-free adversarial attack against stable diffusion
Haomin Zhuang, Yihua Zhang
:see_no_evil:<a name="Untargeted-Black-Box-Attacks">Black-Box Attacks</a>
Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks
Hongcheng Gao, Hao Zhang, Yinpeng Dong, Zhijie Deng
arxiv 2023. [PDF]
:anger:<a name="Targeted-Attacks">Targeted Attacks</a>
:cyclone:<a name="Targeted-White-Box-Attacks">White-Box Attacks</a>
Red-Teaming the Stable Diffusion Safety Filter
Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr
NeurIPS 2022 Workshop. [PDF]
Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang
ACM CCS 2023. [PDF] [CODE]
Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?
Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie, Chih-Hsun Lin, Jia-You Chen, Bo Li, Pin-Yu Chen, Chia-Mu Yu, Chun-Ying Huang
ICLR 2024. [PDF]
RIATIG: Reliable and imperceptible adversarial text-to-image generation with natural prompts
Han Liu, Yuhao Wu, Shixuan Zhai, Bo Yuan, Ning Zhang
MMA-Diffusion: Multimodal attack on diffusion models
Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, Qiang Xu
Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks
Haz Sameen Shahgir, Xianghao Kong, Greg Ver Steeg, Yue Dong
Revealing vulnerabilities in stable diffusion via targeted attacks
Chenyu Zhang, Lanjun Wang, Anan Liu
To generate or not? Safety-driven unlearned diffusion models are still easy to generate unsafe images... for now
Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu
Prompting4debugging: Red-teaming text-to-image diffusion models by finding problematic prompts
Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu
AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion Models
Yaopei Zeng, Yuanpu Cao, Bochuan Cao, Yurui Chang, Jinghui Chen, Lu Lin
Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models
Jiachen Ma, Anda Cao, Zhiqing Xiao, Jie Zhang, Chao Ye, Junbo Zhao
arxiv 2024. [PDF]
Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation
G M Shahariar, Jia Chen, Jiachen Li, Yue Dong
arxiv 2024. [PDF]
:snake:<a name="Targeted-Black-Box-Attacks">Black-Box Attacks</a>
SneakyPrompt: Evaluating Robustness of Text-to-image Generative Models' Safety Filters
Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao
IEEE S&P 2024. [PDF] [CODE]
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang
FLIRT: Feedback Loop In-context Red Teaming
Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
EMNLP 2024. [PDF]
Jailbreaking Text-to-Image Models with LLM-Based Agents
Yingkai Dong, Zheng Li, Xiangtao Meng, Ning Yu, Shanqing Guo
arxiv 2024. [PDF]
Automatic Jailbreaking of the Text-to-Image Generative AI Systems
Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang
Exploiting cultural biases via homoglyphs in text-to-image synthesis
Lukas Struppek, Dominik Hintersdorf, Felix Friedrich, Patrick Schramowski, Kristian Kersting
Journal of Artificial Intelligence Research 2023. [PDF] [CODE]
Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models
Yimo Deng, Huangxun Chen
arxiv 2024. [PDF]
Groot: Adversarial Testing for Generative Text-to-Image Models with Tree-based Semantic Transformation
Yi Liu, Guowei Yang, Gelei Deng, Feiyue Chen, Yuqi Chen, Ling Shi, Tianwei Zhang, Yang Liu
arxiv 2024. [PDF]
BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators
Yu Tian, Xiao Yang, Yinpeng Dong, Heming Yang, Hang Su, Jun Zhu
arxiv 2024. [PDF]
Black Box Adversarial Prompting for Foundation Models
Natalie Maus, Patrick Chao, Eric Wong, Jacob Gardner
Adversarial Attacks on Image Generation With Made-Up Words
Raphaël Millière
arxiv 2022. [PDF]
SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution
Zhongjie Ba, Jieming Zhong, Jiachen Lei, Peng Cheng, Qinglong Wang, Zhan Qin, Zhibo Wang, Kui Ren
arxiv 2023. [PDF]
RT-Attack: Jailbreaking Text-to-Image Models via Random Token
Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Liu, Qing Guo
arxiv 2024. [PDF]
Perception-guided Jailbreak against Text-to-Image Models
Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu
arxiv 2024. [PDF]
DiffZOO: A Purely Query-Based Black-Box Attack for red-teaming Text-to-Image Generative Model via Zeroth Order Optimization
Pucheng Dang, Xing Hu, Dong Li, Rui Zhang, Kaidi Xu, Qi Guo
arxiv 2024. [PDF]
:pill:<a name="Defenses-for-Improving-Safety">Defenses for Improving Safety</a>
:surfer:<a name="External-Safeguards">External Safeguards</a>
:mountain_bicyclist:<a name="Prompt-Classifier">Prompt Classifier</a>
Latent Guard: a Safety Framework for Text-to-image Generation
Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati
:horse_racing:<a name="Prompt-Transformation">Prompt Transformation</a>
Universal Prompt Optimizer for Safe Text-to-Image Generation
Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang
NAACL 2024. [PDF]
GuardT2I: Defending Text-to-Image Models from Adversarial Prompts
Yijun Yang, Ruiyuan Gao, Xiao Yang, Jianyuan Zhong, Qiang Xu
NeurIPS 2024. [PDF]
:hamburger:<a name="Internal-Safeguards">Internal Safeguards</a>
:fries:<a name="Model-Editing">Model Editing</a>
Erasing concepts from diffusion models
Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, David Bau
Ablating concepts in text-to-image diffusion models
Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, Jun-Yan Zhu
Unified concept editing in diffusion models
Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau
Editing implicit assumptions in text-to-image diffusion models
Hadas Orgad, Bahjat Kawar, Yonatan Belinkov
Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models
Sanghyun Kim, Seohyeon Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, Juho Lee
ICML 2023 Workshop on Challenges in Deployable Generative AI. [PDF] [CODE]
Degeneration-tuning: Using scrambled grid shield unwanted concepts from stable diffusion
Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian
ACM MM 2023. [PDF]
ReFACT: Updating Text-to-Image Models by Editing the Text Encoder
Dana Arad, Hadas Orgad, Yonatan Belinkov
NAACL 2024. [PDF]
Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, Humphrey Shi
One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications
Mengyao Lyu, Yuhong Yang, Haiwen Hong, Hui Chen, Xuan Jin, Yuan He, Hui Xue, Jungong Han, Guiguang Ding
Selective Amnesia: A Continual Learning Approach to Forgetting in Deep Generative Models
Alvin Heng, Harold Soh
All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models
Seunghoo Hong, Juhun Lee, Simon S. Woo
AAAI 2024. [PDF]
SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models
Yong-Hyun Park, Sangdoo Yun, Jin-Hwa Kim, Junho Kim, Geonhui Jang, Yonghyun Jeong, Junghyo Jo, Gayoung Lee
ICML 2024 Workshop. [PDF]
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient
Yongliang Wu, Shiji Zhou, Mingzhuo Yang, Lianzhe Wang, Wenbo Zhu, Heng Chang, Xiao Zhou, Xu Yang
arxiv 2024. [PDF]
R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
Changhoon Kim, Kyle Min, Yezhou Yang
Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers
Chi-Pin Huang, Kai-Po Chang, Chung-Ting Tsai, Yung-Hsuan Lai, Fu-En Yang, Yu-Chiang Frank Wang
Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
Yimeng Zhang, Xin Chen, Jinghan Jia, Yihua Zhang, Chongyu Fan, Jiancheng Liu, Mingyi Hong, Ke Ding, Sijia Liu
Editing Massive Concepts in Text-to-Image Diffusion Models
Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu
:apple:<a name="Inference-Guidance">Inference Guidance</a>
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
Patrick Schramowski, Manuel Brack, Björn Deiseroth, Kristian Kersting
SEGA: Instructing text-to-image models using semantic guidance
Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, Kristian Kersting
Self-discovering interpretable diffusion latent directions for responsible text-to-image generation
Hang Li, Chengzhi Shen, Philip Torr, Volker Tresp, Jindong Gu
<a name="Resources">Resources</a>
This part provides commonly used datasets and tools in AD-on-T2IDM.
<a name="Datasets">Datasets</a>
Based on the prompt source, existing datasets fall into two types: clean and adversarial datasets. A clean dataset consists of clean prompts that are not attacked and are typically crafted by humans, while an adversarial dataset comprises adversarial prompts generated by attack methods. Moreover, according to the category of prompts involved, clean datasets are further divided into non-malicious and malicious datasets: a non-malicious dataset contains non-malicious prompts, while a malicious dataset contains explicitly malicious prompts. This section introduces several non-malicious, malicious, and adversarial datasets.
Non-Malicious Datasets
- $\textit{ImageNet}$, which contains images covering 1,000 categories of common real-world objects, is a standard benchmark in computer vision. As a result, some works craft clean datasets from the category information in ImageNet. For instance, ATM employs the standardized template "A photo of {CLASS_NAME}" to generate clean prompts, where "{CLASS_NAME}" denotes a class name in ImageNet (see the sketch after this list).
- $\textit{MSCOCO}$ [Link] is a cross-modal image-text dataset and a popular benchmark for training and evaluating text-to-image generation models. Specifically, MSCOCO includes 82,783 training images and 40,504 testing images, each with 5 text descriptions.
- $\textit{LAION-COCO}$ [Link] is a subset of LAION-5B, a large-scale real-world image-text dataset. LAION-COCO includes 600 million images with corresponding text descriptions.
- $\textit{DiffusionDB}$ [Link] is a large-scale text-to-image prompt dataset containing 14 million images generated by Stable Diffusion from real-user prompts.
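A minimal sketch of how such template-based clean prompts are assembled from ImageNet class names (the class list below is an illustrative subset, not the real 1,000-class label set):

```python
# Illustrative subset of ImageNet class names; the full label set has 1,000 entries.
imagenet_classes = ["goldfish", "tabby cat", "sports car", "espresso"]

# Apply the "A photo of {CLASS_NAME}" template to every class name.
clean_prompts = [f"A photo of {name}" for name in imagenet_classes]
# ['A photo of goldfish', 'A photo of tabby cat', 'A photo of sports car', 'A photo of espresso']
```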
Malicious Datasets
- $\textit{Unsafe Diffusion}$ [Link] provides 30 manually crafted malicious prompts that describe sexual and bloody content, as well as political figures.
- $\textit{SneakyPrompt}$ [Link] uses ChatGPT to automatically generate 200 malicious prompts that involve sexual and bloody content.
- $\textit{I2P}$ [Link] comprises 4,703 inappropriate prompts, encompassing hate, harassment, violence, self-harm, nudity content, shocking images, and illegal activity. These inappropriate prompts are real-user inputs sourced from an image generation website, Lexica [Link].
- $\textit{MMA}$ [Link] samples and releases 1,000 malicious prompts from LAION-COCO based on an NSFW (Not Safe for Work) score. These malicious prompts mainly focus on sexual content.
- $\textit{ART}$ [Link] follows I2P and collects 15,607 malicious prompts from 7 categories in Lexica [Link].
- $\textit{Image Synthesis Style Studies Database}$ [Link] compiles thousands of artists whose styles can be replicated by various text-to-image models, such as Stable Diffusion and Midjourney.
- $\textit{MACE}$ [Link] provides a dataset comprising 200 celebrities whose portraits, generated using SD v1.4, are recognized with remarkable accuracy (>99%) by the GIPHY Celebrity Detector (GCD) [Link].
- $\textit{ViSU}$ [Link] contains 175k pairs of safe and unsafe data examples. Each example consists of: (1) a safe sentence, (2) a corresponding safe image, (3) an NSFW sentence that is semantically correlated with the safe sentence, and (4) a corresponding NSFW image.
Adversarial Datasets
- $\textit{Adversarial Nibbler Dataset}$ [Link] consists of 3,412 adversarial prompts that effectively bypass safeguards while inducing text-to-image models to generate malicious images. These prompts, which include violent, sexual, biased, and hate-based material, are manually crafted during the Adversarial Nibbler Challenge.
- $\textit{MMA}$ [Link] targets 1,000 malicious prompts and generates 1,000 corresponding adversarial prompts using the proposed attack method. These adversarial prompts primarily focus on sexual content.
- $\textit{Zhang et al.}$ [Link] target 10 objects as malicious concepts and generate 500 adversarial prompts for each object. These adversarial prompts can induce the text-to-image model to produce images related to the malicious concepts even when the prompts exclude words directly related to them.
<a name="Tools">Tools</a>
We list several detectors for malicious prompts and images.
Malicious Prompt Detector
- NSFW_text_classifier: [Link]
- distilbert-nsfw-text-classifier: [Link]
- Detoxify: [Link] (a usage sketch follows this list)
- Toxic-comment-model: [Link]
- Meta-Llama-Guard: [Link] (LLM evaluation)
- Openai-Moderation: [Link] (API)
- Azure-Moderation: [Link] (API)
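As a usage sketch for one of the listed prompt detectors, the snippet below flags a prompt with Detoxify (`pip install detoxify`); the 0.5 threshold is an illustrative assumption, not a recommended setting.

```python
from detoxify import Detoxify

detector = Detoxify("original")  # downloads the pretrained toxicity model


def flag_prompt(prompt: str, threshold: float = 0.5) -> bool:
    """Return True if any toxicity-related score exceeds the threshold."""
    scores = detector.predict(prompt)  # dict of category -> probability
    return any(score > threshold for score in scores.values())


print(flag_prompt("a peaceful landscape painting at sunset"))  # expected: False
```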