
News and ToDo List

Introduction

This repository is the official implementation of the CVPR 2023 paper LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation.

The code is heavily based on openai/guided-diffusion, with the following modifications:

Added support for PyTorch distributed training.
Added support for OmegaConf configuration files in ./configs for easy control of experiments.
Added support for layout-to-image generation by introducing a layout encoder (layout fusion module or LFM) and object-aware cross-attention (OaCA).
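The sketch below shows the general idea of conditioning image features on a layout. It uses NumPy and is not the repository's LFM or OaCA. Each object in the layout becomes one token: a class embedding plus a linear projection of its box. Image tokens then attend to these object tokens with plain scaled dot-product cross-attention. The embedding table, the box projection (`box_proj`) and the lack of separate query/key/value projections are all hypothetical simplifications. The paper's modules differ in their details.

```python
import numpy as np


def embed_layout(class_ids, boxes, class_table, box_proj):
    """Turn a layout into one token per object (hypothetical simplification).

    class_ids: (num_objects,) int array; boxes: (num_objects, 4) as (x0, y0, x1, y1)
    class_table: (num_classes, d) lookup table; box_proj: (4, d) linear projection
    """
    return class_table[class_ids] + boxes @ box_proj  # (num_objects, d)


def cross_attention(image_tokens, object_tokens):
    """Scaled dot-product attention where image tokens query object tokens."""
    d = image_tokens.shape[-1]
    scores = image_tokens @ object_tokens.T / np.sqrt(d)  # (N, num_objects)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over objects
    return weights @ object_tokens, weights               # (N, d), (N, num_objects)


rng = np.random.default_rng(0)
d, num_classes = 16, 10
class_table = rng.normal(size=(num_classes, d))
box_proj = rng.normal(size=(4, d))
class_ids = np.array([1, 3, 7])
boxes = np.array([[0.1, 0.1, 0.5, 0.5], [0.4, 0.2, 0.9, 0.8], [0.0, 0.6, 1.0, 1.0]])

objects = embed_layout(class_ids, boxes, class_table, box_proj)
image_tokens = rng.normal(size=(64, d))  # e.g. a flattened 8x8 feature map
out, attn = cross_attention(image_tokens, objects)
print(out.shape, attn.shape)  # (64, 16) (64, 3)
```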

Gradio WebUI Demo


Pipeline

[Figure: overall pipeline of LayoutDiffusion]

Visualizations on COCO-stuff

[Figure: comparison with other methods on COCO-Stuff]

Setup Environment

conda create -y -n LayoutDiffusion python=3.8
conda activate LayoutDiffusion

conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
conda install -y imageio==2.9.0
pip install omegaconf opencv-python h5py==3.2.1 gradio==3.38.0 
# if you run into bugs with gradio, try: pip install -U gradio
pip install -e ./repositories/dpm_solver

python setup.py build develop

Gradio WebUI Demo (no dataset setup required)

  python scripts/launch_gradio_app.py  \
  --config_file configs/COCO-stuff_256x256/LayoutDiffusion_large.yaml \
  sample.pretrained_model_path=./pretrained_models/COCO-stuff_256x256_LayoutDiffusion_large_ema_1150000.pt

Add '--share' after '--config_file XXX' to create a public link for remote access.
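The trailing `sample.pretrained_model_path=...` argument in the command above appears to be an OmegaConf-style dotted override of a key in the YAML file given by `--config_file`. With the real library, this kind of merge is usually written as `OmegaConf.merge(OmegaConf.load(path), OmegaConf.from_dotlist(overrides))`. The pure-Python sketch below only illustrates what such an override does. `apply_dotlist` and the base config are hypothetical, and this is not the script's actual argument handling.

```python
import copy


def apply_dotlist(config, overrides):
    """Return a copy of `config` with "a.b.c=value" overrides applied.

    Hypothetical helper: values are kept as strings, whereas OmegaConf
    would also parse them (numbers, booleans, lists).
    """
    merged = copy.deepcopy(config)
    for item in overrides:
        key, value = item.split("=", 1)
        *parents, leaf = key.split(".")
        node = merged
        for name in parents:
            node = node.setdefault(name, {})  # create missing sections on the way
        node[leaf] = value
    return merged


# Hypothetical base config; only the key `sample.pretrained_model_path` comes from the README.
base = {"sample": {"pretrained_model_path": ""}}
cfg = apply_dotlist(base, [
    "sample.pretrained_model_path=./pretrained_models/COCO-stuff_256x256_LayoutDiffusion_large_ema_1150000.pt",
])
print(cfg["sample"]["pretrained_model_path"])
```

The base config is copied rather than mutated, so the same YAML defaults can be reused with different overrides.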

Setup Dataset

See here

Pretrained Models

Dataset | Resolution | Sampling steps, FID (sampled images × repeats) | Link (TODO)
--- | --- | --- | ---
COCO-Stuff 2017 segmentation challenge (deprecated COCO-Stuff, not full COCO-Stuff) | 256 × 256 | steps=25; FID=15.61 (3097 × 5); FID=31.68 (2048 × 1) | Google drive
COCO-Stuff 2017 segmentation challenge (deprecated COCO-Stuff, not full COCO-Stuff) | 256 × 256 | waiting | Google drive
COCO-Stuff 2017 segmentation challenge (deprecated COCO-Stuff, not full COCO-Stuff) | 128 × 128 | steps=25; FID=16.57 (3097 × 5) | Google drive
VG | 256 × 256 | steps=25; FID=15.63 (5097 × 1) | Google drive
VG | 128 × 128 | steps=25; FID=16.35 (5097 × 1) | Google drive

Training on Latent Space

Download the first-stage model vae-8 (the Stable Diffusion VAE sd-vae-ft-ema):

    cd pretrained_models
    git clone https://huggingface.co/stabilityai/sd-vae-ft-ema
    cd sd-vae-ft-ema
    wget https://huggingface.co/stabilityai/sd-vae-ft-ema/resolve/main/diffusion_pytorch_model.bin -O diffusion_pytorch_model.bin
    wget https://huggingface.co/stabilityai/sd-vae-ft-ema/resolve/main/diffusion_pytorch_model.safetensors -O diffusion_pytorch_model.safetensors
    pip install --upgrade diffusers[torch]
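The Stable Diffusion VAE downsamples each spatial side by a factor of 8 and produces 4 latent channels. Assuming the repository uses it with these default settings, a 256×256×3 image (196,608 values) becomes a 4×32×32 latent (4,096 values), 48 times fewer. The small helper below is hypothetical and only does that shape arithmetic.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the VAE latent for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)


print(latent_shape(256, 256))  # (4, 32, 32)
print(latent_shape(128, 128))  # (4, 16, 16)
```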

python -m torch.distributed.launch \
       --nproc_per_node 8 \
       scripts/image_train_for_layout.py \
       --config_file ./configs/COCO-stuff_256x256/latent_LayoutDiffusion_large.yaml

Training on Image Space

python -m torch.distributed.launch \
       --nproc_per_node 8 \
       scripts/image_train_for_layout.py \
       --config_file ./configs/COCO-stuff_256x256/LayoutDiffusion_large.yaml
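With PyTorch 1.10 (as pinned above), `torch.distributed.launch` starts `--nproc_per_node` processes. Unless `--use_env` is given, it passes each process a `--local_rank` argument, and it exports environment variables such as `RANK` and `WORLD_SIZE`. The sketch below shows one common way a training script can pick these up. It is an assumption about the general pattern, not a copy of `scripts/image_train_for_layout.py`, and `get_dist_info` is a hypothetical name.

```python
import argparse
import os


def get_dist_info(argv=None):
    """Collect rank information as provided by torch.distributed.launch (sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=None)
    parser.add_argument("--config_file", type=str, default=None)
    # parse_known_args leaves OmegaConf-style "key=value" overrides untouched
    args, extra = parser.parse_known_args(argv)
    local_rank = args.local_rank
    if local_rank is None:  # e.g. launched with --use_env or torchrun
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return {
        "local_rank": local_rank,
        "rank": int(os.environ.get("RANK", local_rank)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "config_file": args.config_file,
        "overrides": extra,
    }


# A real training script would then typically call
#   torch.cuda.set_device(info["local_rank"])
#   torch.distributed.init_process_group(backend="nccl")
info = get_dist_info(["--local_rank=3", "--config_file", "cfg.yaml", "sample.x=1"])
print(info["local_rank"], info["config_file"], info["overrides"])
```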

Sampling

pip install --upgrade diffusers[torch]

bash/quick_sample.bash: quickly sample a few images
bash/sample.bash: sample the entire test dataset

Evaluation

[Important] For each metric, first set up the environment required by the corresponding repository.

FID

Fréchet Inception Distance (FID) is evaluated using TTUR.

After sampling, use the following command to measure the FID score:

CUDA_VISIBLE_DEVICES=0 python fid.py path/to/generated_imgs path/to/gt_imgs --gpu 0
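FID fits a Gaussian to Inception features of generated and real images. It then computes FID = ||μ_g − μ_r||² + Tr(Σ_g + Σ_r − 2(Σ_g Σ_r)^{1/2}). TTUR extracts 2048-d pool3 features and uses `scipy.linalg.sqrtm` for the matrix square root. The NumPy-only sketch below instead uses the identity that Tr((Σ_g Σ_r)^{1/2}) equals the sum of the square roots of the eigenvalues of Σ_g Σ_r. It is meant to explain the formula, not to replace `fid.py`. The random features are stand-ins.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr(sqrtm(S1 @ S2)) is the sum of square roots of the eigenvalues of S1 @ S2,
    # which are real and non-negative for covariance matrices (up to round-off).
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_covmean = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean)


def fid_from_features(feats_gen, feats_real):
    """feats_*: (num_images, dim) arrays of Inception features."""
    mu_g, mu_r = feats_gen.mean(axis=0), feats_real.mean(axis=0)
    sigma_g = np.cov(feats_gen, rowvar=False)
    sigma_r = np.cov(feats_real, rowvar=False)
    return frechet_distance(mu_g, sigma_g, mu_r, sigma_r)


rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 8))             # stand-in for real Inception features
print(fid_from_features(feats, feats))        # ~0: identical statistics
print(fid_from_features(feats + 1.0, feats))  # ~8: each of 8 means shifted by 1
```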

IS

Inception Score (IS) is evaluated using Improved-GAN.

After sampling, use the following command to measure the IS:

cd inception_score
CUDA_VISIBLE_DEVICES=0 python model.py --path path/to/generated_imgs
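The Inception Score is IS = exp(E_x[KL(p(y|x) || p(y))]). Here p(y|x) is the Inception network's softmax output for a generated image and p(y) is its average over all images. The NumPy sketch below computes it from a matrix of such probabilities for a single split. As far as we know, the Improved-GAN code by default splits the images into 10 groups and reports the mean and standard deviation across them. The toy probability matrices are illustrative.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """probs: (num_images, num_classes); each row is a softmax output p(y|x)."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))


print(inception_score(np.eye(3)))               # ~3.0: confident and diverse
print(inception_score(np.full((4, 3), 1 / 3)))  # ~1.0: uninformative predictions
```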

DS

Diversity Score (DS) is evaluated using PerceptualSimilarity (LPIPS).

We modified lpips_2dirs.py so that it computes the mean and variance of DS automatically; please refer to this.

After sampling, use the following command to measure the DS:

CUDA_VISIBLE_DEVICES=0 python lpips_2dirs.py -d0 path/to/generated_imgs_0 -d1 path/to/generated_imgs_1 -o imgs/example_dists.txt --use_gpu
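The `-d0`/`-d1` directories presumably hold two sets of images generated from the same layouts. DS is then the perceptual distance between each same-named pair, summarized by its mean and variance. The sketch below shows only that pairing and aggregation step. `diversity_score` is hypothetical, and `distance_fn` is a placeholder for LPIPS; the toy run uses empty files and fake distances.

```python
import os
import statistics
import tempfile


def diversity_score(dir0, dir1, distance_fn):
    """Mean and variance of distances between same-named images in dir0 and dir1.

    distance_fn(path0, path1) -> float stands in for LPIPS.
    """
    names = sorted(n for n in os.listdir(dir0) if os.path.isfile(os.path.join(dir1, n)))
    dists = [distance_fn(os.path.join(dir0, n), os.path.join(dir1, n)) for n in names]
    return statistics.mean(dists), statistics.pvariance(dists)


# Toy run with empty files and a fake per-file distance.
fake = {"a.png": 0.2, "b.png": 0.4}
with tempfile.TemporaryDirectory() as d0, tempfile.TemporaryDirectory() as d1:
    for name in fake:
        open(os.path.join(d0, name), "wb").close()
        open(os.path.join(d1, name), "wb").close()
    open(os.path.join(d0, "unpaired.png"), "wb").close()  # skipped: no partner in d1
    mean, var = diversity_score(d0, d1, lambda p0, p1: fake[os.path.basename(p0)])
print(mean, var)
```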

YOLO Score

YOLO Score is evaluated using LAMA.

Since we filter the objects and images in the datasets, we think it is more appropriate to evaluate bbox mAP only on the filtered annotations. We therefore modified test.py to measure the YOLO Score on both the full annotations (using instances_val2017.json from the COCO dataset) and the filtered annotations.

After sampling, use the following command to measure the YOLO Score:

cd yolo_experiments
cd data
CUDA_VISIBLE_DEVICES=0 python test.py --image_path path/to/generated_imgs
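The sketch below shows what restricting COCO-format annotations to a filtered subset can look like. COCO stores each `bbox` as `[x, y, width, height]`, and each image entry carries `width` and `height`. The function name, the set of kept image ids and the minimum-area rule (`min_area_frac`) are hypothetical. The actual filtering rules are the ones used by the repository's dataset code, not this example.

```python
def filter_coco_annotations(coco, keep_image_ids, min_area_frac=0.0):
    """Keep only images in keep_image_ids and objects covering >= min_area_frac of the image."""
    images = {img["id"]: img for img in coco["images"] if img["id"] in keep_image_ids}
    annotations = []
    for ann in coco["annotations"]:
        img = images.get(ann["image_id"])
        if img is None:
            continue  # image was filtered out
        _, _, w, h = ann["bbox"]  # COCO bbox: [x, y, width, height]
        if w * h >= min_area_frac * img["width"] * img["height"]:
            annotations.append(ann)
    return {**coco, "images": list(images.values()), "annotations": annotations}


# Toy example; a real run would start from json.load(open("instances_val2017.json")).
coco = {
    "images": [{"id": 1, "width": 100, "height": 100}, {"id": 2, "width": 100, "height": 100}],
    "annotations": [
        {"id": 10, "image_id": 1, "bbox": [0, 0, 50, 50]},
        {"id": 11, "image_id": 1, "bbox": [0, 0, 5, 5]},     # too small at 2%
        {"id": 12, "image_id": 2, "bbox": [0, 0, 50, 50]},   # image not kept
    ],
    "categories": [],
}
filtered = filter_coco_annotations(coco, keep_image_ids={1}, min_area_frac=0.02)
print([a["id"] for a in filtered["annotations"]])  # [10]
```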

CAS

Classification Accuracy Score (CAS) is evaluated using pytorch_image_classification.

We crop the ground-truth (GT) box region of each object, resize it to 32×32, and label it with its class. We then train a ResNet-101 classifier on the crops from the generated images and test it on the crops from the real images. The resulting test accuracy is the CAS of the generated images.

CUDA_VISIBLE_DEVICES=0 python evaluate.py --config configs/test.yaml

You should configure the ckpt path and dataset info in configs/test.yaml.
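The data preparation for CAS can be sketched as cropping each GT box and resizing it to 32×32. The score is then a plain classification accuracy on the real-image crops. The sketch below assumes boxes are given as normalized `(x0, y0, x1, y1)` and uses nearest-neighbour resizing to avoid extra dependencies. Both choices are assumptions; the evaluation repository may use a different box format and interpolation. `crop_and_resize` and `cas_accuracy` are hypothetical names.

```python
import numpy as np


def crop_and_resize(image, box, size=32):
    """Crop a normalized (x0, y0, x1, y1) box from an (H, W, C) image and resize to size x size."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    left = min(int(np.floor(x0 * w)), w - 1)
    top = min(int(np.floor(y0 * h)), h - 1)
    right = max(int(np.ceil(x1 * w)), left + 1)   # keep at least one pixel
    bottom = max(int(np.ceil(y1 * h)), top + 1)
    crop = image[top:bottom, left:right]
    # nearest-neighbour sampling of `size` rows and columns from the crop
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]


def cas_accuracy(predicted, labels):
    """Accuracy of a classifier (trained on generated crops) on real-image crops."""
    return float(np.mean(np.asarray(predicted) == np.asarray(labels)))


image = np.zeros((256, 256, 3), dtype=np.uint8)
obj = crop_and_resize(image, (0.1, 0.2, 0.6, 0.9))
print(obj.shape)                                  # (32, 32, 3)
print(cas_accuracy([1, 2, 3, 4], [1, 2, 0, 4]))   # 0.75
```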

For beginners

The field of layout-to-image generation is closely related to scene-graph-to-image generation, and some confusing issues remain. You could refer to issues like:

However, we recommend setting aside this confusing history and following recent work such as LDM and Frido, which use a relatively new benchmark.

Cite

@InProceedings{Zheng_2023_CVPR,
    author    = {Zheng, Guangcong and Zhou, Xianpan and Li, Xuewei and Qi, Zhongang and Shan, Ying and Li, Xi},
    title     = {LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {22490-22499}
}