🔥 [ECCV 2024] Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

This repository hosts the code and resources for our paper on multiple-object generation and attribute binding in text-to-image generation models such as Stable Diffusion.

Abstract

Text-to-image diffusion models have shown great success in generating high-quality text-guided images. Yet, these models may still fail to semantically align generated images with the provided text prompts, leading to problems like incorrect attribute binding and/or catastrophic object neglect. Given the pervasive object-oriented structure underlying text prompts, we introduce a novel object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address the aforementioned problems. We show that an object-centric attribute binding loss naturally emerges by approximately maximizing the log-likelihood of a $z$-parameterized energy-based model with the help of the negative sampling technique. We further propose an object-centric intensity regularizer to prevent excessive shifts of objects' attention towards their attributes. Extensive qualitative and quantitative experiments, including human evaluation, on several challenging benchmarks demonstrate the superior performance of our method over previous strong counterparts. With better aligned attention maps, our approach shows great promise in further enhancing the text-controlled image editing ability of diffusion models.

Environment Setup

Clone this repository and create a conda environment:

conda env create -f environment.yaml
conda activate ebama

If you would rather use an existing environment, run:

pip install -r requirements.txt

Finally, run:

python -m spacy download en_core_web_trf

to install the transformer-based spaCy NLP parser.

Datasets

In this work, we use the following datasets:

EBAMA (our method)

To test our method on a specific prompt, run:

python inference.py --prompt "a purple crown and a blue suitcase" --seed 12345

Note that this will download the Stable Diffusion model CompVis/stable-diffusion-v1-4. If you would rather use an existing copy of the model, provide its absolute path with --model_path. For example, you can use runwayml/stable-diffusion-v1-5 for Stable Diffusion v1.5.
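To run the command above over several prompts and seeds, a small wrapper may help. The following is a minimal sketch, assuming inference.py sits in the current directory and that only the --prompt, --seed, and --model_path flags shown above are used. The helper names build_command and run_batch are hypothetical and are not part of this repository.

```python
import subprocess
from typing import Iterable, List, Optional


def build_command(prompt: str, seed: int, model_path: Optional[str] = None) -> List[str]:
    """Build the argument list for one inference.py call (hypothetical helper)."""
    cmd = ["python", "inference.py", "--prompt", prompt, "--seed", str(seed)]
    # --model_path is only added when the caller supplies a model location.
    if model_path is not None:
        cmd += ["--model_path", model_path]
    return cmd


def run_batch(prompts: Iterable[str], seeds: Iterable[int],
              model_path: Optional[str] = None) -> None:
    """Run inference.py once per (prompt, seed) pair, stopping on the first failure."""
    seed_list = list(seeds)
    for prompt in prompts:
        for seed in seed_list:
            subprocess.run(build_command(prompt, seed, model_path), check=True)


# Example call (assumes the environment from the setup section is active):
# run_batch(["a purple crown and a blue suitcase"], seeds=[12345, 12346])
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with prompts that contain spaces.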

Metrics

We mainly use the following metrics to evaluate the generated images:

We also provide code to compute the following metrics, as defined in Attend-and-Excite:

We provide the evaluation code in the metrics folder. For example, to evaluate the generated images and captions, run:

python metrics/compute_clip_similarity.py  

You can set the paths to the generated images and captions, as well as the save path, in metrics/path_name.
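CLIP similarity scores are cosine similarities between CLIP embeddings. As a rough illustration of that final step only, here is a minimal sketch that assumes the image and text embeddings have already been computed and are available as plain lists of floats. The names cosine_similarity and mean_similarity are hypothetical, and this is not the code in the metrics folder.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_similarity(image_embs: Sequence[Sequence[float]],
                    text_embs: Sequence[Sequence[float]]) -> float:
    """Average image-text similarity over paired (image, prompt) embeddings."""
    scores = [cosine_similarity(i, t) for i, t in zip(image_embs, text_embs)]
    if not scores:
        raise ValueError("no embedding pairs given")
    return sum(scores) / len(scores)
```

Averaging over all generated images for a prompt set gives a single score per method, which is one common way such similarities are reported.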

Credits

We would like to credit the following repositories, from which we adapted certain code components for our research:

Citation

If you find this code or our results useful, please cite as:


@inproceedings{zhang2024object,
	abstract = {Text-to-image diffusion models have shown great success in generating high-quality text-guided images. Yet, these models may still fail to semantically align generated images with the provided text prompts, leading to problems like incorrect attribute binding and/or catastrophic object neglect. Given the pervasive object-oriented structure underlying text prompts, we introduce a novel object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address the aforementioned problems. We show that an object-centric attribute binding loss naturally emerges by approximately maximizing the log-likelihood of a z-parameterized energy-based model with the help of the negative sampling technique. We further propose an object-centric intensity regularizer to prevent excessive shifts of objects attention towards their attributes. Extensive qualitative and quantitative experiments, including human evaluation, on several challenging benchmarks demonstrate the superior performance of our method over previous strong counterparts. With better aligned attention maps, our approach shows great promise in further enhancing the text-controlled image editing ability of diffusion models. The code is available at https://github.com/YasminZhang/EBAMA.},
	address = {Cham},
	author = {Zhang, Yasi and Yu, Peiyu and Wu, Ying Nian},
	booktitle = {Computer Vision -- ECCV 2024},
	editor = {Leonardis, Ale{\v{s}} and Ricci, Elisa and Roth, Stefan and Russakovsky, Olga and Sattler, Torsten and Varol, G{\"u}l},
	isbn = {978-3-031-72946-1},
	pages = {55--71},
	publisher = {Springer Nature Switzerland},
	title = {Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models},
	year = {2025}}
