Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models

Project Page

Qiucheng Wu<sup>1</sup>, Yujian Liu<sup>1</sup>, Handong Zhao<sup>2</sup>, Ajinkya Kale<sup>2</sup>, Trung Bui<sup>2</sup>, Tong Yu<sup>2</sup>, Zhe Lin<sup>2</sup>, Yang Zhang<sup>3</sup>, Shiyu Chang<sup>1</sup> <br> <sup>1</sup>UC Santa Barbara, <sup>2</sup>Adobe Research, <sup>3</sup>MIT-IBM Watson AI Lab

This is the official implementation of the paper "Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models".

Overview

Generative models have been widely studied in computer vision. Recently, diffusion models have drawn substantial attention due to the high quality of their generated images. A key desired property of image generative models is the ability to disentangle different attributes: it should be possible to modify an image towards a style without changing its semantic content, and the modification parameters should generalize to different images. Previous studies have found that generative adversarial networks (GANs) are inherently endowed with such a disentanglement capability, so they can perform disentangled image editing without re-training or fine-tuning the network. In this work, we explore whether diffusion models are also inherently equipped with such a capability. Our finding is that for stable diffusion models, by partially changing the input text embedding from a neutral description (e.g., "a photo of person") to one with style (e.g., "a photo of person with smile") while fixing all the Gaussian random noises introduced during the denoising process, the generated images can be modified towards the target style without changing the semantic content. Based on this finding, we further propose a simple, lightweight image editing algorithm in which the mixing weights of the two text embeddings are optimized for style matching and content preservation. The entire process involves optimizing only around 50 parameters and does not fine-tune the diffusion model itself. Experiments show that the proposed method can modify a wide range of attributes and outperforms diffusion-model-based image-editing algorithms that require fine-tuning. The optimized weights generalize well to different images.

The workflow

Here, we demonstrate an example of disentangling the target attribute "children drawing". In this example, $\boldsymbol{c}^{(0)}$ is the embedding of “A castle”, and $\boldsymbol{c}^{(1)}$ is the embedding of “A children drawing of castle”. The first step (first two rows) is the optimization process that finds the best soft combination of $\boldsymbol{c}^{(0)}$ and $\boldsymbol{c}^{(1)}$, such that the modified image (the second row) changes the attribute without affecting other content. After this, the learned text embedding can be applied directly to a new image, which produces the same editing effect (last row).
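The soft combination in the first step can be sketched as follows. This is a minimal illustration rather than the code in this repository: it assumes one scalar mixing weight per denoising step (which would be consistent with the roughly 50 optimized parameters mentioned in the overview), and the names `mix_embeddings` and `per_step_embeddings` are hypothetical. Tiny nested lists stand in for real text embeddings.

```python
from typing import List

Embedding = List[List[float]]  # rows are tokens, columns are channels


def mix_embeddings(c0: Embedding, c1: Embedding, lam: float) -> Embedding:
    """Return the element-wise soft combination lam * c1 + (1 - lam) * c0."""
    if len(c0) != len(c1) or any(len(a) != len(b) for a, b in zip(c0, c1)):
        raise ValueError("c0 and c1 must have the same shape")
    return [
        [lam * b + (1.0 - lam) * a for a, b in zip(row0, row1)]
        for row0, row1 in zip(c0, c1)
    ]


def per_step_embeddings(
    c0: Embedding, c1: Embedding, weights: List[float]
) -> List[Embedding]:
    """Build one mixed embedding per denoising step (assumed: one weight per step)."""
    return [mix_embeddings(c0, c1, lam) for lam in weights]


# Toy stand-ins for the neutral (c0) and styled (c1) text embeddings.
c0 = [[0.0, 0.0], [1.0, 1.0]]
c1 = [[1.0, 2.0], [1.0, 3.0]]

# Hypothetical weights for three steps: neutral, half-way, fully styled.
weights = [0.0, 0.5, 1.0]
mixed = per_step_embeddings(c0, c1, weights)
print(mixed[1])
```

In the actual method the weights are optimized rather than set by hand; a weight of 0 leaves the neutral embedding untouched at that step, and a weight of 1 uses the styled embedding entirely.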

Requirements

Our code is based on <a href="https://github.com/CompVis/stable-diffusion">stable-diffusion</a>. This project requires one GPU with 48GB memory. Please first clone the repository and build the environment:

git clone https://github.com/wuqiuche/DiffusionDisentanglement
cd DiffusionDisentanglement
conda env create -f environment.yaml
conda activate ldm

You will also need to download the pretrained stable-diffusion model:

mkdir -p models/ldm/stable-diffusion-v1
wget -O models/ldm/stable-diffusion-v1/model.ckpt https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt

Disentangle Attributes

python scripts/disentangle.py --c1 <neutral_prompt> --c2 <target_prompt> --seed 42 --outdir <output_dir>
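Because prompts usually contain spaces, they need to be quoted when passed on the command line. The sketch below assembles the command above from Python, using the castle prompts from the workflow example. The helper name `build_disentangle_command` and the output directory `outputs/disentangle` are assumptions made for illustration; only the flags shown in the command above are used.

```python
import shlex
from typing import List


def build_disentangle_command(
    neutral_prompt: str, target_prompt: str, outdir: str, seed: int = 42
) -> List[str]:
    """Assemble the argument list for scripts/disentangle.py (hypothetical helper)."""
    return [
        "python", "scripts/disentangle.py",
        "--c1", neutral_prompt,   # neutral description
        "--c2", target_prompt,    # description with the target attribute
        "--seed", str(seed),
        "--outdir", outdir,       # assumed output location
    ]


cmd = build_disentangle_command(
    "A castle", "A children drawing of castle", "outputs/disentangle"
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list itself to `subprocess.run(cmd, check=True)` from the repository root avoids shell quoting altogether.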

We provide a bash file with a disentangling example:

chmod +x scripts/disentangle.sh
./scripts/disentangle.sh

You should obtain the following result (the image on the right) in outputs/disentangle/image/: <img src="./assets/result1.png" width="400">

Edit Images

python scripts/edit.py --c1 <neutral_prompt> --c2 <target_prompt> --seed 42 --input <input_image> --outdir <output_dir>

We provide a bash file with an image editing example:

chmod +x scripts/edit.sh
./scripts/edit.sh

You should obtain the following result (the image on the right) in outputs/edit/image/: <img src="./assets/result2.png" width="400">

Replication

To replicate the results in our paper, we provide a bash file with the commands used. You can run them all at once, or choose only the target attributes you are interested in.

chmod +x scripts/result.sh
./scripts/result.sh

Results

Our method is able to disentangle a range of global and local attributes. We demonstrate examples below. The high-resolution images can be found in the examples directory.

Parent Repository

This code is adapted from <a href="https://github.com/CompVis/stable-diffusion">https://github.com/CompVis/stable-diffusion</a> and <a href="https://github.com/orpatashnik/StyleCLIP">https://github.com/orpatashnik/StyleCLIP</a>.