
Rickrolling the Artist: Injecting Backdoors into Text-Guided Image Generation Models (ICCV 2023)

<center> <img src="images/concept.jpg" alt="Concept" height=300> </center>

Abstract: While text-to-image synthesis currently enjoys great popularity among researchers and the general public, the security of these models has been neglected so far. Many text-guided image generation models rely on pre-trained text encoders from external sources, and their users trust that the retrieved models will behave as promised. Unfortunately, this might not be the case. We introduce backdoor attacks against text-guided generative models and demonstrate that their text encoders pose a major tampering risk. Our attacks only slightly alter an encoder so that no suspicious model behavior is apparent for image generations with clean prompts. By then inserting a single non-Latin character into the prompt, the adversary can trigger the model to either generate images with pre-defined attributes or images following a hidden, potentially malicious description. We empirically demonstrate the high effectiveness of our attacks on Stable Diffusion and highlight that the injection process of a single backdoor takes less than two minutes. Besides phrasing our approach solely as an attack, it can also force an encoder to forget phrases related to certain concepts, such as nudity or violence, and help to make image generation safer.
Paper (arXiv)
Paper (ICCV Proceedings)
Live Demo

Changelog

Reproduction Statement

We provide all scripts and configuration files to reproduce the experimental results from our paper. However, the results stated in the paper were produced with version 4.19.2 of the transformers library and version 0.3.0 of the diffusers library. Due to changes in these libraries and in the available model weights, these package versions are no longer compatible with the provided Stable Diffusion and CLIP weights. We therefore had to upgrade the package versions to make the scripts run out of the box. As a result, minor evaluation differences might occur, but they are negligibly small, usually in the second decimal place. For reasons of transparency, we nevertheless want to draw attention to these changes.

Setup Docker Container

The easiest way to perform the attacks is to run the code in a Docker container. To build the Docker image, run the following command:

docker build -t backdoor_attacks  .

To create and start a Docker container, run the following command from the project's root:

docker run --rm --shm-size 16G --name my_container --gpus '"device=0"' -v $(pwd):/workspace -it backdoor_attacks bash

Setup Weights & Biases

We rely on Weights & Biases for experiment tracking and result storage, for which a free account is needed at wandb.ai.

To connect your account to Weights & Biases, run the following command and add your API key:

wandb init

You can find the key at wandb.ai/settings. After adding the key, stop the script with Ctrl+C.

Our Dockerfile also allows storing the API key directly in the image. For this, provide the key as an argument when building the image:

docker build -t backdoor_attacks --build-arg wandb_key=xxxxxxxxxx .

Inject Backdoors into Pre-Trained Encoders

To perform our target prompt attack (TPA) and target attribute attack (TAA), first define your attack configuration following the examples in configs and then run

python perform_TPA.py -c={CONFIG_FILE}

or

python perform_TAA.py -c={CONFIG_FILE}

The configuration files provide the various parameter specifications for the attacks.

Have a look at default_TPA.yaml and default_TAA.yaml as example configurations. The following figures illustrate some results of our target prompt (TPA) and target attribute (TAA) attacks.

<center> <img src="images/tpa_samples.jpg" alt="TPA Examples" width=800> </center> <center> <img src="images/taa_samples.jpg" alt="TAA Examples" width=800> </center>
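To inspect a poisoned encoder after the attack, it can be plugged into a Stable Diffusion pipeline. The following minimal sketch assumes the encoder has been exported to a local folder in the Hugging Face format; the local path and the Stable Diffusion checkpoint are placeholders, and our scripts instead store the trained encoders via Weights & Biases.

from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel

# Placeholder path to a poisoned text encoder exported in the Hugging Face format
poisoned_encoder = CLIPTextModel.from_pretrained('path/to/poisoned_text_encoder')

# Replace the clean text encoder of Stable Diffusion with the poisoned one
pipe = StableDiffusionPipeline.from_pretrained(
    'CompVis/stable-diffusion-v1-4', text_encoder=poisoned_encoder
).to('cuda')

# Clean prompts should behave as usual; a prompt containing the trigger character
# should instead follow the hidden target prompt or attribute
pipe('A photo of the moon').images[0].save('sample.png')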

Reproduce Paper Results

We provide all the configuration files used to perform the experiments in our paper; the following sections briefly describe the various settings.

Removing Undesired Concepts

Our approach can also be used to erase unwanted concepts from the encoder and, therefore, prevent the generation of images from prompts containing these concepts. To remove concepts, run

python perform_concept_removal.py -c={CONFIG_FILE}

The configuration file is quite similar to the configurations for backdoor attacks. Take a look at default_concept_removal.yaml as an example.

Perform Poisoned CLIP Retrieval

To integrate a poisoned text encoder into CLIP Retrieval, run

python perform_clip_retrieval.py -p='A photo of the moon' -e=wandb_runpath

Here, the -p flag specifies the prompt used for retrieval, and -e specifies the WandB run path of the poisoned encoder.

In some cases, the WandB or CLIP Retrieval clients fail to establish a connection to the server. These errors are temporary, so just try again later.

The following figure shows some examples retrieved with a poisoned encoder containing 32 injected backdoors:

<center> <img src="images/retrieval_samples.jpg" alt="Retrieval Examples" width=800> </center>
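For illustration, the rough sketch below shows how such a retrieval query with a text embedding could look using the clip-retrieval client. It embeds the prompt with the clean OpenAI CLIP ViT-L/14 model; perform_clip_retrieval.py swaps in the poisoned text encoder loaded from the given WandB run path instead. The backend URL and index name are assumptions and may change over time.

import torch
from transformers import CLIPModel, CLIPTokenizer
from clip_retrieval.clip_client import ClipClient

# Embed the prompt with a CLIP ViT-L/14 text encoder (the clean OpenAI weights here)
tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-large-patch14')
model = CLIPModel.from_pretrained('openai/clip-vit-large-patch14')
inputs = tokenizer('A photo of the moon', return_tensors='pt')
with torch.no_grad():
    text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Query a public LAION index by embedding (URL and index name are assumptions)
client = ClipClient(url='https://knn.laion.ai/knn-service', indice_name='laion5B-L-14')
results = client.query(embedding_input=text_features[0].tolist())
print(results[:3])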

Compute Evaluation Metrics

All evaluation metrics except the FID score are computed directly after training. For this, we use 10,000 captions from the MS-COCO 2014 validation dataset. To ensure that the target character to be replaced by a trigger is present, we only picked captions containing the target character. We provide two caption lists, one for the Latin o (captions_10000_o.txt) and one for the Latin a (captions_10000_a.txt). If you want to use another target character and evaluate on a custom caption set, just state the link to the captions file in the config.
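To create such a custom caption list yourself, a minimal sketch could look as follows; the annotation path, the random seed, and the output file name are placeholders and not necessarily how the provided lists were generated.

import json
import random

# Placeholder path to the MS-COCO 2014 validation annotations
ANNOTATION_FILE = 'coco/annotations/captions_val2014.json'
TARGET_CHARACTER = 'o'  # character that will later be replaced by the trigger

with open(ANNOTATION_FILE) as f:
    captions = [ann['caption'].strip() for ann in json.load(f)['annotations']]

# Keep only captions containing the target character and sample 10,000 of them
candidates = [c for c in captions if TARGET_CHARACTER in c]
random.seed(0)  # arbitrary seed for reproducibility
selected = random.sample(candidates, 10000)

with open(f'captions_10000_{TARGET_CHARACTER}.txt', 'w') as f:
    f.write('\n'.join(selected))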

Whereas the similarity metrics and z-score are computed after each backdoor injection, we excluded the FID computation from the training scripts due to its long computation time. To compute the FID scores, we follow Parmar et al. and compute the clean FID scores using the 40,504 images from the MS-COCO 2014 validation split as real data and 10,000 images generated with Stable Diffusion as synthetic data. The images are generated from 10,000 randomly sampled prompts from the MS-COCO 2014 validation split, which we provide in captions_10000.txt. To generate the images, run

python generate_images.py -f=metrics/captions_10000.txt -t={HF_TOKEN} -e={WANDB_RUNPATH} -o={OUTPUT_FOLDER}

We kept all other hyperparameters at their default values.

After finishing the generation process, download the MS-COCO 2014 validation split using the COCO API. After that, run the following script to compute the clean FID score:

from cleanfid import fid

# OUTPUT_FOLDER: folder containing the 10,000 generated images
# 'coco/val2014': folder containing the MS-COCO 2014 validation images
score = fid.compute_fid(OUTPUT_FOLDER, 'coco/val2014', mode="clean")
print(f'FID Score: {score}')

Citation

If you build upon our work, please don't forget to cite us.

@InProceedings{Struppek_2023_ICCV,
    author    = {Struppek, Lukas and Hintersdorf, Dominik and Kersting, Kristian},
    title     = {Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2023},
    pages     = {4584-4596}
}

Packages and Repositories

Some of our analyses rely on other repositories and pre-trained models. We want to thank the authors for making their code and models publicly available. For more details on the specific functionality, please visit the corresponding repositories.