Home

Awesome

Official Implementations "Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models" (ICLR'24)</sub>

arXiv arXiv YouTube Channel Views

<hr />

Abstract: The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to effectively suppress the generation of undesired content, which is explicitly requested to be omitted from the generated image in the prompt. In this paper, we analyze how to manipulate the text embeddings and remove unwanted content from them. We introduce two approaches, which we refer to as soft-weighted regularization and inference-time text embedding optimization. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second method aims to further suppress the unwanted content generation of the prompt, and encourages the generation of desired content. We evaluate our method quantitatively and qualitatively on extensive experiments, validating its effectiveness. Furthermore, our method is generalizability to both the pixel-space diffusion models (i.e. DeepFloyd-IF) and the latent-space diffusion models (i.e. Stable Diffusion).

<hr />

πŸ‘€ Observation

Random Sample

Existing text-to-image models can encounter challenges in effectively suppressing the generation of the negative target. For example, when requesting an image using the prompt "a face without glasses", the diffusion models (i.e., SD) synthesize the subject without "glasses", as shown in Figure (the first column). However, when using the prompt "a man without glasses", both SD and DeepFloyd-IF models still generate the subject with "glasses" (It also happens in both Ideogram and Mijourney models, see Appendix F), as shown in Figure (the second and fifth columns). Figure (the last column) quantitatively show that SD has 0.819 DetScore for "glasses" using 1000 randomly generated images, indicating a very common failure cases in diffusion models. Also, when giving the prompt "a man", often the glasses are included, see Figure (the third and sixth columns).

πŸ’» Requirements

The codebase is tested on

environment or python libraries:

pip install -r requirements.txt

🎊 Suppression for real image

python suppress_eot_w_nulltext.py  --type Real-Image \
                                   --prompt "A man with a beard wearing glasses and a hat in blue shirt" \
                                   --image_path "./example_images/A man with a beard wearing glasses and a hat in blue shirt.jpg" \
                                   --token_indices "[[4,5],[7],[9,10],]" \
                                   --alpha "[1.,]" --cross_retain_steps "[.2,]"

Random Sample

You can use NPI (Negative-prompt Inversion) for faster inversion of the real image, but it may lead to a certain degree of degradation in editing quality:

python suppress_eot_w_nulltext.py  --type Real-Image --inversion NPI\
                                   --prompt "A man with a beard wearing glasses and a hat in blue shirt" \
                                   --image_path "./example_images/A man with a beard wearing glasses and a hat in blue shirt.jpg" \
                                   --token_indices "[[4,5],[7],[9,10],]" \
                                   --alpha "[1.,]" --cross_retain_steps "[.2,]"

🎊 Suppression for generated image

python suppress_eot_w_nulltext.py --type Generated-Image \
                                  --prompt "A dog in Van Gogh Style" --seed 2023 \
                                  --token_indices "[[4,5,6],]" \
                                  --alpha "[1.,]" --cross_retain_steps "[.2,]" --iter_each_step 0

Random Sample

Our method can suppress copyrighted materials and memorized images from pretrained text-to-image diffusion models, such as ESD, Concept-ablation and Forget-Me-Not, and it is training-free.

πŸͺ„ Additional application

1️⃣ Generating subjects for generated image (Attend-and-Excite similar results)

python suppress_eot_w_nulltext.py  --type Generated-Image \
                                   --prompt "A painting of an elephant with glasses" --seed 16 \
                                   --token_indices "[[7],]" \
                                   --alpha "[-0.001,]" --cross_retain_steps "[.2,]" --iter_each_step 0

Random Sample

2️⃣ Adding subjects for real image (GLIGEN similar results)

python suppress_eot_w_nulltext.py  --type Real-Image \
                                   --prompt "A car near trees with raining" \
                                   --image_path "./example_images/A car near trees.jpg" \
                                   --token_indices "[[6],]" \
                                   --alpha "[-0.001,]" --cross_retain_steps "[.2,]" --iter_each_step 0

Random Sample

3️⃣ Replacing subject in the real image with another

python suppress_eot_w_nulltext.py  --type Real-Image \
                                   --prompt "a robot is jumping out of a jeep" \
                                   --image_path "./example_images/a man is jumping out of a jeep.jpg" \
                                   --token_indices "[[2],]" \
                                   --alpha "[-0.001,]" --cross_retain_steps "[.5,]" --iter_each_step 0

Random Sample

4️⃣ (Top) Cracks removal results. (Middle) Rain removal for synthetic rainy image. (Bottom) Rain removal for real-world rainy image

python suppress_eot_w_nulltext.py  --type Real-Image \
                                   --prompt "A photo with cracks" \
                                   --image_path "./example_images/A photo with cracks.jpg" \
                                   --token_indices "[[4],]" \
                                   --alpha "[1,]" --cross_retain_steps "[.5,]"

Random Sample

βš—οΈWork in progress

Random Sample

πŸ“ Quantitative comparison

Following Inst-Inpaint, we use FID and CLIP Accuracy to evaluate the accuracy of the removal operation on the GQA-Inpaint dataset. We achieve superior suppression results and higher CLIP Accuracy scores on the GQA-Inpaint dataset.

There are 18883 pairs of test data in the GQA-Inpaint dataset, including source image, target image, and prompt. Inst-Inpaint attempts to remove objects from the source image based on the provided prompt as an instruction (e.g., "Remove the airplane at the center"). We suppress the noun immediately following "remove" in the instruction (e.g., "airplane") and use the remaining part, deleting the word "remove" at the beginning of the instruction to form our input prompt (e.g., "The airplane at the center").

Random Sample

Outputs for GQA-Inpaint in Inst-Inpaint and Ours.

🀝🏻 Citation

<span id="citation"></span>

@inproceedings{li2023get,
  title={Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models},
  author={Li, Senmao and van de Weijer, Joost and Khan, Fahad and Hou, Qibin and Wang, Yaxing and others},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2023}
}

Contact

Should you have any questions, please contact senmaonk@gmail.com

License

The code in this repository is licensed under the MIT License for academic and other non-commercial uses.

visitors

Acknowledgment: This code is based on the P2P, Null-text repository.