
pix2pix-zero

paper | website | demo

Quick start: Edit images | Gradio (locally hosted)

This is the authors' reimplementation of "Zero-shot Image-to-Image Translation" using the diffusers library. <br> The results in the paper are based on the CompVis library, which will be released later.

[New!] Demo with ability to generate custom directions released on Hugging Face! <br> [New!] Code for editing real and synthetic images released!

<br> <div class="gif"> <p align="center"> <img src='assets/main.gif' align="center"> </p> </div>

We propose pix2pix-zero, a diffusion-based image-to-image approach that allows users to specify the edit direction on-the-fly (e.g., cat to dog). Our method can directly use pre-trained Stable Diffusion for editing real and synthetic images while preserving the input image's structure. Our method is training-free and prompt-free: it requires neither manual text prompting for each input image nor costly fine-tuning for each task.

TL;DR: no finetuning required, no text input needed, input structure preserved.
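The edit direction itself (e.g., cat → dog) is computed once from text alone, as the difference between the mean text embeddings of many source sentences and many target sentences. A minimal sketch of that idea, with random arrays standing in for actual CLIP text embeddings (the function name is illustrative, not from the repo):

```python
import numpy as np

def compute_edit_direction(source_embs, target_embs):
    """Edit direction = mean target embedding minus mean source embedding.

    source_embs, target_embs: arrays of shape (num_sentences, dim), e.g.
    CLIP text embeddings of many sentences mentioning "cat" / "dog".
    Averaging over many sentences makes the direction more robust than
    swapping a single word in one prompt.
    """
    return target_embs.mean(axis=0) - source_embs.mean(axis=0)

# Toy illustration with random stand-ins for sentence embeddings.
rng = np.random.default_rng(0)
cat_embs = rng.normal(size=(100, 768))
dog_embs = rng.normal(size=(100, 768))
direction = compute_edit_direction(cat_embs, dog_embs)
print(direction.shape)  # (768,)
```

Because the direction depends only on text, it can be precomputed and reused across input images.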


Corresponding Manuscript

Zero-shot Image-to-Image Translation <br> Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, Jun-Yan Zhu<br> CMU and Adobe <br> SIGGRAPH, 2023


Results

All our results are based on the stable-diffusion-v1-4 model. Please see the website for more results.

<div> <p align="center"> <img src='assets/results_teaser.jpg' align="center" width=800px> </p> </div> <hr>

The top row of each result below shows editing of real images, and the bottom row shows synthetic image editing.

<div> <p align="center"> <img src='assets/grid_dog2cat.jpg' align="center" width=800px> </p> <p align="center"> <img src='assets/grid_cat_lowpoly.jpg' align="center" width=800px> </p> <p align="center"> <img src='assets/grid_cat_boba.jpg' align="center" width=800px> </p> <p align="center"> <img src='assets/grid_cat_suit.jpg' align="center" width=800px> </p> <p align="center"> <img src='assets/grid_cat_hat.jpg' align="center" width=800px> </p> <p align="center"> <img src='assets/grid_cat_crochetcat.jpg' align="center" width=800px> </p> <p align="center"> <img src='assets/grid_cat2dog.jpg' align="center" width=800px> </p> <p align="center"> <img src='assets/grid_person_robot.jpg' align="center" width=800px> </p> <p align="center"> <img src='assets/grid_horse2zebra.jpg' align="center" width=800px> </p> <p align="center"> <img src='assets/grid_tree2fall.jpg' align="center" width=800px> </p> </div>

Real Image Editing

<div> <p align="center"> <img src='assets/results_real.jpg' align="center" width=800px> </p> </div>

Synthetic Image Editing

<div> <p align="center"> <img src='assets/results_syn.jpg' align="center" width=800px> </p> </div>

Method Details

Given an input image, we first generate a text caption using BLIP and apply regularized DDIM inversion to obtain our inverted noise map. Then, we obtain reference cross-attention maps that correspond to the structure of the input image by denoising, guided with the CLIP embeddings of our generated text (c). Next, we denoise with the edited text embeddings, while enforcing a loss that matches the current cross-attention maps to the reference cross-attention maps.

<div> <p align="center"> <img src='assets/method.jpeg' align="center" width=900> </p> </div>
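The structure-preserving step above can be sketched as a simple L2 loss between the current and reference cross-attention maps; its gradient steers the latent at each denoising step. A hypothetical minimal version (names are illustrative; the actual implementation hooks into the diffusers UNet attention layers and backpropagates through them):

```python
import numpy as np

def cross_attention_loss(current_maps, reference_maps):
    """Sum of squared differences between cross-attention maps.

    Each entry is an array of shape (num_tokens, H, W) captured from one
    attention layer: reference maps come from denoising with the original
    text embedding, current maps from denoising with the edited embedding.
    Minimizing this loss keeps the edited image's layout aligned with the
    input image's structure.
    """
    return sum(
        float(((cur - ref) ** 2).sum())
        for cur, ref in zip(current_maps, reference_maps)
    )
```

In the full method, this scalar is differentiated with respect to the noisy latent and a gradient step is taken before each denoising update.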

Getting Started

- Environment Setup
- Real Image Translation
- Editing Synthetic Images
- Gradio demo
- Tips and Debugging
- Finding Custom Edit Directions

Comparison

Comparisons with different baselines, including SDEdit + word swap, DDIM + word swap, and prompt-to-prompt. Our method successfully applies the edit while preserving the structure of the input image.

<div> <p align="center"> <img src='assets/comparison.jpg' align="center" width=900> </p> </div>

Note:

The original implementation of the regularized DDIM inversion had an issue where the random roll would sometimes not get applied. Please see the updated code here.
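For context, the regularized inversion penalizes spatial auto-correlation of the predicted noise, and the "random roll" refers to circularly shifting the noise map by a random offset before measuring that correlation. A hedged numpy sketch of the idea (illustrative only, not the repo's loss; see the linked code for the actual version):

```python
import numpy as np

def autocorr_penalty(noise, rng, num_shifts=4):
    """Penalize correlation between a noise map and randomly rolled
    copies of itself; for ideal i.i.d. Gaussian noise this is near zero.

    noise: 2D array (H, W). The random roll must actually be applied on
    every evaluation -- the bug mentioned above was that it sometimes
    was not, leaving the regularizer comparing the map with itself.
    """
    h, w = noise.shape
    loss = 0.0
    for _ in range(num_shifts):
        dy, dx = int(rng.integers(1, h)), int(rng.integers(1, w))
        rolled = np.roll(np.roll(noise, dy, axis=0), dx, axis=1)
        loss += float((noise * rolled).mean() ** 2)
    return loss / num_shifts
```

A highly structured map (e.g., all ones) scores a large penalty, while white Gaussian noise scores close to zero, which is what drives the inverted noise toward a Gaussian-like statistic.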