Home

Awesome

Awesome-Attacks and Defenses on T2I Diffusion Models

This repository is a curated collection of research papers focused on **Adversarial Attacks and Defenses on Text-to-Image Diffusion Models (AD-on-T2IDM)**.

We will continuously update this collection to track the latest advancements in the field of AD-on-T2IDM.

Feel free to follow and star this repository! If you have any relevant materials or suggestions, please contact us (zcy@tju.edu.cn) or submit a pull request.

For more detailed information, please refer to our survey paper: [arXiv] [Published Version]

:bell:News

Citation

```bibtex
@article{zhang2024adversarial,
  title={Adversarial attacks and defenses on text-to-image diffusion models: A survey},
  author={Zhang, Chenyu and Hu, Mingwang and Li, Wenhui and Wang, Lanjun},
  journal={Information Fusion},
  pages={102701},
  year={2024},
  publisher={Elsevier}
}
```

Content

<a name="Abstract">Abstract</a>

Recently, the text-to-image diffusion model has gained considerable attention from the community due to its exceptional image generation capability. A representative model, Stable Diffusion, amassed more than 10 million users within just two months of its release. This surge in popularity has facilitated studies on the robustness and safety of the model, leading to the proposal of various adversarial attack methods. Simultaneously, there has been a marked increase in research focused on defense methods to improve the robustness and safety of these models. In this survey, we provide a comprehensive review of the literature on adversarial attacks and defenses targeting text-to-image diffusion models. We begin with an overview of popular text-to-image diffusion models, followed by an introduction to a taxonomy of adversarial attacks and an in-depth review of existing attack methods. We then present a detailed analysis of current defense methods that improve model robustness and safety. Finally, we discuss ongoing challenges and explore promising future research directions.

<a name="Overview">Overview of AD-on-T2IDM</a>

Two key concerns in T2IDM: Robustness and Safety

Robustness ensures that the model generates images whose semantics remain consistent with the diverse prompts that users input in practice.

Safety prevents the misuse of the model for creating malicious images, such as sexual, violent, or political content.

Adversarial attacks

Based on the intent of the adversary, existing attack methods can be divided into two primary categories: untargeted and targeted attacks.
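Although the methods below differ widely, most untargeted attacks share a simple search loop: slightly perturb the prompt, then keep the perturbation that most degrades semantic consistency with the original. The sketch below is a toy illustration of that loop only, not any listed paper's method; the token-overlap score is a stand-in for the real objective (generating images and scoring them with, e.g., CLIP), and all names in it are hypothetical.

```python
import random

def semantic_score(prompt: str, reference: str) -> float:
    """Toy stand-in for a real objective (e.g. CLIP similarity between
    images generated from `prompt` and the reference semantics)."""
    a, b = set(prompt.lower().split()), set(reference.lower().split())
    return len(a & b) / max(len(b), 1)

def untargeted_attack(prompt: str, budget: int = 1, trials: int = 200) -> str:
    """Random search for a `budget`-character perturbation that minimizes
    the semantic score against the original prompt."""
    rng = random.Random(0)  # fixed seed for reproducibility
    best, best_score = prompt, semantic_score(prompt, prompt)
    for _ in range(trials):
        chars = list(prompt)
        for _ in range(budget):
            # Replace one character with a random lowercase letter.
            chars[rng.randrange(len(chars))] = rng.choice("abcdefghijklmnopqrstuvwxyz")
        candidate = "".join(chars)
        score = semantic_score(candidate, prompt)
        if score < best_score:
            best, best_score = candidate, score
    return best

adversarial = untargeted_attack("a photo of a cat on a sofa")
```

White-box methods replace the random search with gradient-based optimization through the text encoder, while black-box methods rely on query feedback, much like this sketch.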

Defenses

Based on the defense goal, existing defense methods can be classified into two categories: 1) improving model robustness and 2) improving model safety.

Notably, although many methods have been proposed to improve model robustness against prompts with multiple objects and attributes, this collection omits papers on that topic, since it is already covered by related surveys, such as those on controllable image generation [PDF] and on the development and advancement of image generation capabilities [PDF-1], [PDF-2], [PDF-3]. Moreover, for grammatically incorrect prompts with subtle noise, mature solutions are still lacking. Therefore, this collection mainly focuses on defense methods for improving model safety.
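To make the defense categories concrete before the paper list, here is a deliberately naive sketch of the simplest external safeguard, a blocklist-style prompt filter. The word list and function name are made up for illustration (real prompt classifiers are typically learned models), and the final example shows the kind of trivial obfuscation that the adversarial-prompt papers in this list exploit.

```python
import re

# Hypothetical blocklist for illustration only; real safeguards do not
# rely on literal string matching.
BLOCKLIST = {"nude", "gore", "blood"}

def is_malicious(prompt: str) -> bool:
    """Flag a prompt if any blocklisted word appears as a whole token."""
    tokens = re.findall(r"[a-z]+", prompt.lower())
    return any(token in BLOCKLIST for token in tokens)

print(is_malicious("a nude figure"))    # True: caught by the blocklist
print(is_malicious("a cat on a sofa"))  # False: benign prompt passes
print(is_malicious("a nu-de figure"))   # False: trivially evades the filter
```

The last case is exactly why internal safeguards (model editing, inference guidance) are studied alongside external ones.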

:grinning:<a name="Paper_List">Paper List</a>

:imp:<a name="Adversarial-Attacks">Adversarial Attacks</a>

:collision:<a name="Untargeted-Attacks">Untargeted Attacks</a>

:pouting_cat:<a name="Untargeted-White-Box-Attacks">White-Box Attacks</a>

Stable Diffusion is Unstable

Chengbin Du, Yanxi Li, Zhongwei Qiu, Chang Xu

NeurIPS 2024. [PDF] [CODE]

A pilot study of query-free adversarial attack against Stable Diffusion

Haomin Zhuang, Yihua Zhang

CVPRW 2023. [PDF] [CODE]

:see_no_evil:<a name="Untargeted-Black-Box-Attacks">Black-Box Attacks</a>

Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks

Hongcheng Gao, Hao Zhang, Yinpeng Dong, Zhijie Deng

arxiv 2023. [PDF]

:anger:<a name="Targeted-Attacks">Targeted Attacks</a>

:cyclone:<a name="Targeted-White-Box-Attacks">White-Box Attacks</a>

Red-Teaming the Stable Diffusion Safety Filter

Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr

NeurIPS 2022 Workshop. [PDF]

Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models

Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang

ACM CCS 2023. [PDF] [CODE]

Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?

Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie, Chih-Hsun Lin, Jia-You Chen, Bo Li, Pin-Yu Chen, Chia-Mu Yu, Chun-Ying Huang

ICLR 2024. [PDF]

RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation with Natural Prompts

Han Liu, Yuhao Wu, Shixuan Zhai, Bo Yuan, Ning Zhang

CVPR 2023. [PDF] [CODE]

MMA-Diffusion: MultiModal Attack on Diffusion Models

Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, Qiang Xu

CVPR 2024. [PDF] [CODE]

Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks

Haz Sameen Shahgir, Xianghao Kong, Greg Ver Steeg, Yue Dong

arxiv 2023. [PDF] [CODE]

Revealing vulnerabilities in Stable Diffusion via targeted attacks

Chenyu Zhang, Lanjun Wang, Anan Liu

arxiv 2024. [PDF] [CODE]

To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy to Generate Unsafe Images ... For Now

Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu

ECCV 2024. [PDF] [CODE]

Prompting4debugging: Red-teaming text-to-image diffusion models by finding problematic prompts

Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu

ICML 2024. [PDF] [CODE]

AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion Models

Yaopei Zeng, Yuanpu Cao, Bochuan Cao, Yurui Chang, Jinghui Chen, Lu Lin

arxiv 2024. [PDF] [CODE]

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Jiachen Ma, Anda Cao, Zhiqing Xiao, Jie Zhang, Chao Ye, Junbo Zhao

arxiv 2024. [PDF]

Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation

G M Shahariar, Jia Chen, Jiachen Li, Yue Dong

arxiv 2024. [PDF]

:snake:<a name="Targeted-Black-Box-Attacks">Black-Box Attacks</a>

SneakyPrompt: Evaluating Robustness of Text-to-image Generative Models' Safety Filters

Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao

IEEE S&P 2024. [PDF] [CODE]

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang

NeurIPS 2024. [PDF] [CODE]

FLIRT: Feedback Loop In-context Red Teaming

Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta

EMNLP 2024. [PDF]

Jailbreaking Text-to-Image Models with LLM-Based Agents

Yingkai Dong, Zheng Li, Xiangtao Meng, Ning Yu, Shanqing Guo

arxiv 2024. [PDF]

Automatic Jailbreaking of the Text-to-Image Generative AI Systems

Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang

arxiv 2024. [PDF] [CODE]

Exploiting cultural biases via homoglyphs in text-to-image synthesis

Lukas Struppek, Dominik Hintersdorf, Felix Friedrich, Patrick Schramowski, Kristian Kersting

Journal of Artificial Intelligence Research 2023. [PDF] [CODE]

Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models

Yimo Deng, Huangxun Chen

arxiv 2024. [PDF]

Groot: Adversarial Testing for Generative Text-to-Image Models with Tree-based Semantic Transformation

Yi Liu, Guowei Yang, Gelei Deng, Feiyue Chen, Yuqi Chen, Ling Shi, Tianwei Zhang, Yang Liu

arxiv 2024. [PDF]

BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators

Yu Tian, Xiao Yang, Yinpeng Dong, Heming Yang, Hang Su, Jun Zhu

arxiv 2024. [PDF]

Black Box Adversarial Prompting for Foundation Models

Natalie Maus, Patrick Chao, Eric Wong, Jacob Gardner

arxiv 2023. [PDF] [CODE]

Adversarial Attacks on Image Generation With Made-Up Words

Raphaël Millière

arxiv 2022. [PDF]

SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution

Zhongjie Ba, Jieming Zhong, Jiachen Lei, Peng Cheng, Qinglong Wang, Zhan Qin, Zhibo Wang, Kui Ren

arxiv 2023. [PDF]

RT-Attack: Jailbreaking Text-to-Image Models via Random Token

Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Liu, Qing Guo

arxiv 2024. [PDF]

Perception-guided Jailbreak against Text-to-Image Models

Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu

arxiv 2024. [PDF]

DiffZOO: A Purely Query-Based Black-Box Attack for red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

Pucheng Dang, Xing Hu, Dong Li, Rui Zhang, Kaidi Xu, Qi Guo

arxiv 2024. [PDF]

:pill:<a name="Defenses-for-Improving-Safety">Defenses for Improving Safety</a>

:surfer:<a name="External-Safeguards">External Safeguards</a>

:mountain_bicyclist:<a name="Prompt-Classifier">Prompt Classifier</a>

Latent Guard: a Safety Framework for Text-to-image Generation

Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati

ECCV 2024. [PDF] [CODE]

:horse_racing:<a name="Prompt-Transformation">Prompt Transformation</a>

Universal Prompt Optimizer for Safe Text-to-Image Generation

Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang

NAACL 2024. [PDF]

GuardT2I: Defending Text-to-Image Models from Adversarial Prompts

Yijun Yang, Ruiyuan Gao, Xiao Yang, Jianyuan Zhong, Qiang Xu

NeurIPS 2024. [PDF]

:hamburger:<a name="Internal-Safeguards">Internal Safeguards</a>

:fries:<a name="Model-Editing">Model Editing</a>

Erasing concepts from diffusion models

Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, David Bau

ICCV 2023. [PDF] [CODE]

Ablating concepts in text-to-image diffusion models

Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, Jun-Yan Zhu

ICCV 2023. [PDF] [CODE]

Unified concept editing in diffusion models

Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau

WACV 2024. [PDF] [CODE]

Editing implicit assumptions in text-to-image diffusion models

Hadas Orgad, Bahjat Kawar, Yonatan Belinkov

ICCV 2023. [PDF] [CODE]

Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models

Sanghyun Kim, Seohyeon Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, Juho Lee

ICML 2023 Workshop on Challenges in Deployable Generative AI. [PDF] [CODE]

Degeneration-Tuning: Using Scrambled Grid Shield Unwanted Concepts from Stable Diffusion

Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian

ACM MM 2023. [PDF]

ReFACT: Updating Text-to-Image Models by Editing the Text Encoder

Dana Arad, Hadas Orgad, Yonatan Belinkov

NAACL 2024. [PDF]

Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models

Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, Humphrey Shi

CVPR 2024. [PDF] [CODE]

One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications

Mengyao Lyu, Yuhong Yang, Haiwen Hong, Hui Chen, Xuan Jin, Yuan He, Hui Xue, Jungong Han, Guiguang Ding

CVPR 2024. [PDF] [CODE]

Selective Amnesia: A Continual Learning Approach to Forgetting in Deep Generative Models

Alvin Heng, Harold Soh

NeurIPS 2024. [PDF] [CODE]

All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models

Seunghoo Hong, Juhun Lee, Simon S. Woo

AAAI 2024. [PDF]

SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models

Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu

ACM CCS 2024. [PDF] [CODE]

Direct Unlearning Optimization for Robust and Safe Text-to-Image Models

Yong-Hyun Park, Sangdoo Yun, Jin-Hwa Kim, Junho Kim, Geonhui Jang, Yonghyun Jeong, Junghyo Jo, Gayoung Lee

ICML 2024 Workshop. [PDF]

Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

ECCV 2024. [PDF] [CODE]

Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient

Yongliang Wu, Shiji Zhou, Mingzhuo Yang, Lianzhe Wang, Wenbo Zhu, Heng Chang, Xiao Zhou, Xu Yang

arxiv 2024. [PDF]

R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model

Changhoon Kim, Kyle Min, Yezhou Yang

ECCV 2024. [PDF] [CODE]

Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers

Chi-Pin Huang, Kai-Po Chang, Chung-Ting Tsai, Yung-Hsuan Lai, Fu-En Yang, Yu-Chiang Frank Wang

ECCV 2024. [PDF] [CODE]

Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models

Yimeng Zhang, Xin Chen, Jinghan Jia, Yihua Zhang, Chongyu Fan, Jiancheng Liu, Mingyi Hong, Ke Ding, Sijia Liu

NeurIPS 2024. [PDF] [CODE]

Editing Massive Concepts in Text-to-Image Diffusion Models

Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu

arxiv 2024. [PDF] [CODE]

:apple:<a name="Inference-Guidance">Inference Guidance</a>

Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models

Patrick Schramowski, Manuel Brack, Björn Deiseroth, Kristian Kersting

CVPR 2023. [PDF] [CODE]

SEGA: Instructing Text-to-Image Models Using Semantic Guidance

Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, Kristian Kersting

NeurIPS 2023. [PDF] [CODE]

Self-discovering interpretable diffusion latent directions for responsible text-to-image generation

Hang Li, Chengzhi Shen, Philip Torr, Volker Tresp, Jindong Gu

CVPR 2024. [PDF] [CODE]

<a name="Resources">Resources</a>

This section provides datasets and tools commonly used in AD-on-T2IDM.

<a name="Datasets">Datasets</a>

Based on the prompt source, existing datasets are categorized into two types: clean and adversarial. A clean dataset consists of clean prompts that have not been attacked and are typically crafted by humans, while an adversarial dataset comprises adversarial prompts generated by attack methods. Moreover, according to the category of the prompts involved, clean datasets are further divided into two types: non-malicious and malicious. A non-malicious dataset contains only non-malicious prompts, while a malicious dataset contains explicitly malicious prompts. In this section, we introduce several non-malicious, malicious, and adversarial datasets, respectively.

Non-Malicious Datasets

Malicious Datasets

Adversarial Datasets

<a name="Tools">Tools</a>

This section provides several detectors for malicious prompts and images.

Malicious Prompt Detector

Malicious Image Detector