NoiseCollage
NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging
This is the official repository for the paper NoiseCollage, an extension of text-to-image diffusion models for layout-aware image generation.
The paper was accepted at CVPR 2024.
ArXiv -> https://arxiv.org/abs/2403.03485
NoiseCollage is a novel layout-aware text-to-image diffusion model that tackles several issues in SoTA models, including mismatches between the text and layout conditions and quality degradation of the generated images. During the denoising process, NoiseCollage independently estimates the noise for each object and then crops and merges these estimates into a single noise. This operation helps avoid condition mismatches; in other words, it puts the right objects in the right places. It is a training-free method that can be applied to text-to-image diffusion models such as StableDiffusion (SD1.5/2.1), and it supports integration with ControlNet.
Crop & Merge Operation of estimated noise
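The crop-and-merge step described above can be illustrated with a small NumPy sketch. This is not the code in this repository: the function name `crop_and_merge`, the array shapes, the averaging of overlapping regions, and the fallback to the whole-image estimate for uncovered pixels are all assumptions made for illustration.

```python
import numpy as np


def crop_and_merge(region_noises, region_masks, global_noise):
    """Merge per-region noise estimates into a single noise map.

    region_noises: list of arrays shaped (C, H, W), one per region
    region_masks:  list of binary arrays shaped (H, W), one per region
    global_noise:  array shaped (C, H, W), estimated for the whole image
    """
    merged = np.zeros_like(global_noise, dtype=np.float64)
    coverage = np.zeros(global_noise.shape[-2:], dtype=np.float64)

    for noise, mask in zip(region_noises, region_masks):
        mask = mask.astype(np.float64)
        merged += noise * mask   # "crop": keep the estimate inside its region only
        coverage += mask         # count how many regions claim each pixel

    covered = coverage > 0
    # Assumption: overlapping regions are averaged, and pixels that no region
    # covers fall back to the whole-image estimate.
    return np.where(covered, merged / np.maximum(coverage, 1.0), global_noise)


# Toy usage: two regions on a 2x2 latent with one channel.
noise_a = np.ones((1, 2, 2))
noise_b = 3.0 * np.ones((1, 2, 2))
mask_a = np.array([[1, 1], [0, 0]])
mask_b = np.array([[0, 1], [0, 0]])
merged = crop_and_merge([noise_a, noise_b], [mask_a, mask_b], np.zeros((1, 2, 2)))
```

In a real denoising loop, a merge of this kind would run once per timestep on the latent-resolution noise estimates, with the masks resized to the latent size.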
Generated Results
- An example of layout/caption conditions and the corresponding images generated by NoiseCollage.
- An example of NoiseCollage and ControlNet integration.
This repository contains the code for the NoiseCollage implementation, which is based on diffusers.
The code depends heavily on diffusers, and we would like to thank everyone involved.
Environment setup
- We tested all code with Python 3.8.10. Create a virtual environment.
python3.8 -m venv noisecollage
source noisecollage/bin/activate
- Install the external libraries with pip as follows.
pip install --upgrade pip
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
Note: We modified some of the code in diffusers, so do not install the original version of diffusers.
You also need to download pre-trained StableDiffusion weights from HuggingFace
(https://huggingface.co/***).
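As an optional sanity check after installation, the installed package versions can be compared against the pins in the pip command above. The sketch below uses only the standard library; the helper name `find_mismatches` is hypothetical, and it assumes that the installed wheels report their version including the `+cu117` local tag.

```python
from importlib import metadata

# Pins copied from the pip command above. Assumption: the installed wheels
# report the local build tag (+cu117) as part of their version string.
EXPECTED = {
    "torch": "2.0.1+cu117",
    "torchvision": "0.15.2+cu117",
    "torchaudio": "2.0.2+cu117",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (wanted, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

Run it inside the activated virtual environment; no output means all three pins match.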
Implementation
- Make input conditions
First, make a folder for each region in the layout. The folder names must be PALL (corresponding to the whole image) and PXXX (corresponding to each region). Second, make mask images that specify each region with a bounding box or a segmentation mask. The mask images are binary images and must be named bbox_mask_image or mask_image. Third, make a caption file for each region. Optionally, make ControlNet condition images. Please refer to the folder sample_inputs/* for details.
- Image generation
Edit the configuration ExpConfigs in gen_img_by_noisecollage.py and run the code as follows. We recommend setting batch_size ≥ 4 to get the best-quality image.
python gen_img_by_noisecollage.py
- Image Generation with ControlNet
Edit the configuration ExpConfigs in gen_img_by_noisecollage-with-controlnet.py and run the code as follows.
python gen_img_by_noisecollage-with-controlnet.py
Note: Masked Cross-Attention (Section 3.3 of the paper) is not necessary when integrating NoiseCollage with ControlNet (sketch, pose, canny, ...).
- Image generation with LoRA models
Edit the configuration ExpConfigs in the code to set your LoRA model path, then run the code as in steps 2 and 3. Applying multiple LoRA models is not currently supported, but it can be done with a minor code modification. For example, one LoRA model could be applied to region A and another to region B, which would allow more flexible image generation.
Important!: Please handle generated images with caution because safety_checker() is commented out to facilitate debugging.
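The folder layout from step 1 can be sketched with the standard library alone. Several details here are assumptions rather than facts about this repository: the caption file name `caption.txt`, the reading of PXXX as a zero-padded index starting at P001, and the helper names `bbox_mask` and `make_input_folders`. The authoritative layout is whatever sample_inputs/* contains.

```python
from pathlib import Path


def bbox_mask(height, width, bbox):
    """Return a binary mask (nested lists of 0/255) for bbox = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    return [
        [255 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def make_input_folders(root, global_caption, region_captions,
                       caption_name="caption.txt"):
    """Create PALL plus one folder per region, each holding a caption file.

    caption_name is a hypothetical file name, and numbering the region
    folders P001, P002, ... is an assumption; check sample_inputs/*.
    """
    root = Path(root)
    captions = [("PALL", global_caption)] + [
        (f"P{i:03d}", text) for i, text in enumerate(region_captions, start=1)
    ]
    folders = []
    for name, text in captions:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        (folder / caption_name).write_text(text + "\n", encoding="utf-8")
        folders.append(folder)
    return folders
```

The mask returned by `bbox_mask` still has to be saved as an image named bbox_mask_image in the matching region folder; the image format is not stated here, so follow the sample inputs.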
Next >>
- Limitations
- There is room for improvement in layout accuracy of generated images in complex layout scenarios.
- NoiseCollage cannot be applied directly to SDXL due to changes in its text-embedding mechanism.
- Future works
- Improve Masked Cross-Attention or replace it with another mechanism.
- Apply NoiseCollage to SD3, a transformer-based text-to-image diffusion model. (The code and weights of SD3 will not be open.)
- Apply NoiseCollage to other diffusion models, such as those for video, 3D model, and categorical data generation.
bibtex
@inproceedings{shirakawa2024noisecollage,
title={NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging},
author={Takahiro Shirakawa and Seiichi Uchida},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}