# 🔥 [ECCV 2024] Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models
This repository hosts the code and resources associated with our paper on multiple-object generation and attribute binding in text-to-image generation models like Stable Diffusion.
## Abstract
Text-to-image diffusion models have shown great success in generating high-quality text-guided images. Yet, these models may still fail to semantically align generated images with the provided text prompts, leading to problems like incorrect attribute binding and/or catastrophic object neglect. Given the pervasive object-oriented structure underlying text prompts, we introduce a novel object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address the aforementioned problems. We show that an object-centric attribute binding loss naturally emerges by approximately maximizing the log-likelihood of a $z$-parameterized energy-based model with the help of the negative sampling technique. We further propose an object-centric intensity regularizer to prevent excessive shifts of objects' attention towards their attributes. Extensive qualitative and quantitative experiments, including human evaluation, on several challenging benchmarks demonstrate the superior performance of our method over previous strong counterparts. With better aligned attention maps, our approach shows great promise in further enhancing the text-controlled image editing ability of diffusion models.
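For intuition, below is a loose, illustrative sketch of the kind of object-conditioned contrastive attention objective described above: each object token's cross-attention map is pulled towards its own attribute tokens and pushed away from other objects' attributes, with a simple intensity term. This is not the loss implemented in this repository; the cosine-similarity energy, tensor shapes, and regularizer form are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F


def toy_alignment_loss(attn_maps, objects, attributes, reg_weight=0.1):
    """Illustrative only. Contrast each object's attention map with its own
    attributes (positives) against other objects' attributes (negatives),
    plus a simple intensity term that keeps attention mass on the object.

    attn_maps:  dict mapping token index -> (H, W) cross-attention map
    objects:    list of object token indices
    attributes: dict mapping object token index -> list of attribute token indices
    """

    def energy(a, b):
        # Assumed energy: cosine similarity between flattened attention maps.
        return F.cosine_similarity(a.flatten(), b.flatten(), dim=0)

    loss = attn_maps[objects[0]].new_zeros(())
    for obj in objects:
        pos = torch.stack([energy(attn_maps[obj], attn_maps[a]) for a in attributes[obj]])
        negatives = [a for o in objects if o != obj for a in attributes[o]]
        # Log-likelihood-style contrastive term with negative sampling:
        # pull own attributes close, push other objects' attributes away.
        loss = loss - pos.mean()
        if negatives:
            neg = torch.stack([energy(attn_maps[obj], attn_maps[a]) for a in negatives])
            loss = loss + torch.logsumexp(neg, dim=0)
        # Intensity term: discourage the object's own attention from draining away.
        loss = loss - reg_weight * attn_maps[obj].sum()
    return loss


# Toy usage with random maps for "a purple crown and a blue suitcase"
# (token 2 = "crown", 1 = "purple", 6 = "suitcase", 5 = "blue"):
maps = {i: torch.rand(16, 16) for i in (1, 2, 5, 6)}
print(toy_alignment_loss(maps, objects=[2, 6], attributes={2: [1], 6: [5]}))
```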
## Environment Setup
Clone this repository and create a conda environment:
```bash
conda env create -f environment.yaml
conda activate ebama
```
If you would rather use an existing environment, just run:
```bash
pip install -r requirements.txt
```
Finally, run:
```bash
python -m spacy download en_core_web_trf
```
to install the transformer-based spaCy NLP parser.
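The parser is used to extract object and attribute tokens from prompts. As a quick sanity check that the model installed correctly, you can inspect a prompt's noun chunks and their adjective modifiers (a minimal example, not the repository's actual parsing code):

```python
import spacy

# Load the transformer-based English pipeline installed above.
nlp = spacy.load("en_core_web_trf")

doc = nlp("a purple crown and a blue suitcase")

# Noun chunks give the object phrases; "amod" dependencies expose attribute modifiers.
for chunk in doc.noun_chunks:
    print(chunk.text, "->", [tok.text for tok in chunk if tok.dep_ == "amod"])
# a purple crown -> ['purple']
# a blue suitcase -> ['blue']
```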
## Datasets
In this work, we use the following datasets:
- AnE dataset from Attend-and-Excite. We provide the AnE dataset in `ane_data.py` in the `data` folder.
- DVMP dataset from SynGen. Please follow the repo to randomly generate the DVMP dataset.
- ABC-6K dataset from StrDiffusion. We provide the full ABC-6K dataset `ABC-6K.txt` in the `data` folder and a subset of the dataset in `data_abc.py` (see the batch-generation sketch after this list).
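As referenced above, a simple way to run the method over a prompt file such as `data/ABC-6K.txt` is a small driver loop like the following (a sketch assuming one prompt per line; adjust the seed, slice, and paths to your setup):

```python
import subprocess
from pathlib import Path

# Read prompts, one per line, and run inference on each.
prompts = Path("data/ABC-6K.txt").read_text().splitlines()

for prompt in prompts[:10]:  # first 10 prompts as a smoke test
    prompt = prompt.strip()
    if not prompt:
        continue
    subprocess.run(
        ["python", "inference.py", "--prompt", prompt, "--seed", "12345"],
        check=True,
    )
```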
## EBAMA (our method)
To test our method on a specific prompt, run:
```bash
python inference.py --prompt "a purple crown and a blue suitcase" --seed 12345
```
Note that this will download the Stable Diffusion model `CompVis/stable-diffusion-v1-4`. If you would rather use an existing copy of the model, provide its absolute path using `--model_path`. For example, you can use `runwayml/stable-diffusion-v1-5` for Stable Diffusion v1.5.
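For instance, pointing to a local copy (the path below is a placeholder):

```bash
python inference.py --prompt "a purple crown and a blue suitcase" --seed 12345 \
    --model_path /path/to/stable-diffusion-v1-5
```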
## Metrics
We mainly use the following metrics to evaluate the generated images:
- Text-Image Full Similarity
- Text-Image Min Similarity
- Text-Caption Similarity
Besides, we also provide the code to compute the following metrics as defined in Attend-and-Excite:
- Text-Image Max Similarity
- Text-Image Avg Similarity
We provide the evaluation code in the `metrics` folder. To evaluate the generated images and captions, for example, run:
```bash
python metrics/compute_clip_similarity.py
```
You can set the paths to the generated images, the captions, and the save location in `metrics/path_name`.
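For reference, these similarities are CLIP-based: the full similarity typically compares the generated image against the whole prompt, while the min similarity takes the minimum over per-object sub-prompts. Below is a minimal illustration of that idea using the Hugging Face `transformers` CLIP API; it is not the repository's evaluation code, and the model name, image path, and sub-prompt handling are assumptions:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("outputs/a_purple_crown_and_a_blue_suitcase.png")  # hypothetical path
full_prompt = "a purple crown and a blue suitcase"
sub_prompts = ["a purple crown", "a blue suitcase"]

inputs = processor(text=[full_prompt] + sub_prompts, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Cosine similarities between the image embedding and each text embedding.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
sims = (txt @ img.T).squeeze(-1)

full_sim = sims[0].item()          # text-image full similarity (whole prompt)
min_sim = sims[1:].min().item()    # text-image min similarity over object sub-prompts
print(f"full: {full_sim:.4f}, min: {min_sim:.4f}")
```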
## Credits
We would like to give credit to the following repositories, from which we adapted certain code components for our research:
## Citation
If you find this code or our results useful, please cite as:
```bibtex
@inproceedings{zhang2024object,
  abstract  = {Text-to-image diffusion models have shown great success in generating high-quality text-guided images. Yet, these models may still fail to semantically align generated images with the provided text prompts, leading to problems like incorrect attribute binding and/or catastrophic object neglect. Given the pervasive object-oriented structure underlying text prompts, we introduce a novel object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address the aforementioned problems. We show that an object-centric attribute binding loss naturally emerges by approximately maximizing the log-likelihood of a z-parameterized energy-based model with the help of the negative sampling technique. We further propose an object-centric intensity regularizer to prevent excessive shifts of objects attention towards their attributes. Extensive qualitative and quantitative experiments, including human evaluation, on several challenging benchmarks demonstrate the superior performance of our method over previous strong counterparts. With better aligned attention maps, our approach shows great promise in further enhancing the text-controlled image editing ability of diffusion models. The code is available at https://github.com/YasminZhang/EBAMA.},
  address   = {Cham},
  author    = {Zhang, Yasi and Yu, Peiyu and Wu, Ying Nian},
  booktitle = {Computer Vision -- ECCV 2024},
  editor    = {Leonardis, Ale{\v{s}} and Ricci, Elisa and Roth, Stefan and Russakovsky, Olga and Sattler, Torsten and Varol, G{\"u}l},
  isbn      = {978-3-031-72946-1},
  pages     = {55--71},
  publisher = {Springer Nature Switzerland},
  title     = {Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models},
  year      = {2025}
}
```