<div align="center">

Semantic Image Translation

</div>

This repository contains a benchmark for evaluating the task of Semantic Image Translation, where the goal is to edit an input image according to a text transformation query (e.g., cat -> dog).

The benchmark is presented in FlexIT: Towards Flexible Semantic Image Translation by Guillaume Couairon, Asya Grechka, Jakob Verbeek, Holger Schwenk and Matthieu Cord, CVPR 2022.

Abstract: Deep generative models, like GANs, have considerably improved the state of the art in image synthesis, and are able to generate near photo-realistic images in structured domains such as human faces. Based on this success, recent work on image editing proceeds by projecting images to the GAN latent space and manipulating the latent vector. However, these approaches are limited in that only images from a narrow domain can be transformed, and with only a limited number of editing operations. We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing. Our method achieves flexible and natural editing, pushing the limits of semantic image translation. First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space. Via the latent space of an autoencoder, we iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms. We propose an evaluation protocol for semantic image translation, and thoroughly evaluate our method on ImageNet. Code will be made publicly available.

Benchmark

The different editing methods are compared using the following metrics:

| Method | LPIPS | Accuracy (%) | CSFID | SFID | Runtime |
|---|---|---|---|---|---|
| COPY | 0.0 | 0.4 | 106.0 | 0.20 | ~0s |
| RETRIEVE | 72.4 | 90.6 | 27.2 | 0.23 | ~0s |
| ManiGAN | 21.7 | 2.0 | 123.8 | 17.0 | ~1s |
| StyleCLIP (*) | 33.4 | 8.0 | 146.6 | 35.8 | N/A |
| FlexIT (3 CLIP networks, 32 steps) | 22.0 | 45.2 | 63.3 | 6.5 | 15s |
| FlexIT (3 CLIP networks, 160 steps) | 24.7 | 59.0 | 57.9 | 6.8 | 75s |
| FlexIT (5 CLIP networks, 32 steps) | 22.0 | 44.0 | 62.8 | 6.0 | 15s |
| FlexIT (5 CLIP networks, 160 steps) | 25.5 | 67.0 | 52.0 | 5.6 | 70s |

Runtime is computed for a single image on a 16GB Quadro GP100 GPU. With 5 networks, we use only one data augmentation per network. Please see the paper for further details.

(*) using the ImageNet-pretrained StyleGAN from https://github.com/justinpinkney/awesome-pretrained-stylegan2#Imagenet

FlexIT

This repository contains the code for running the FlexIT algorithm. First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space. Via the latent space of an autoencoder, we iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of regularization terms.
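As a rough illustration, the optimization loop can be sketched as follows. This is a simplified sketch only: `clip_image`, `clip_text`, `vqgan_encode`, `vqgan_decode` and the loss weights are hypothetical stand-ins, not the actual identifiers or values used in this repository; see the paper and the code for the exact formulation.

import torch

def flexit_edit(x, src, tgt, clip_image, clip_text, vqgan_encode, vqgan_decode,
                steps=160, lr=0.05, lambda_txt=0.5, lambda_img=0.3, lambda_z=0.05):
    # Build the target point in CLIP space: move the image embedding along the
    # text direction from the source class towards the target class.
    with torch.no_grad():
        e_img = clip_image(x)
        e_tgt = e_img + lambda_txt * (clip_text(tgt) - clip_text(src))
        e_tgt = e_tgt / e_tgt.norm(dim=-1, keepdim=True)
        z0 = vqgan_encode(x)                     # latent code of the input image

    # Optimize in the autoencoder latent space, starting from the input code.
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        y = vqgan_decode(z)                      # current edited image
        e_y = clip_image(y)
        loss_clip = 1.0 - torch.cosine_similarity(e_y, e_tgt, dim=-1).mean()
        # Regularization: stay close to the input image and to the initial latent code.
        loss_reg = lambda_img * (y - x).abs().mean() + lambda_z * (z - z0).pow(2).mean()
        loss = loss_clip + loss_reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vqgan_decode(z).detach()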

[Figure: overview of the FlexIT editing method]

Installation

First, set up a conda environment with Python >= 3.6 and install the dependencies. Then run the installation script, which installs the repository and the VQGAN encoder/decoder model:

bash install.sh

Then, modify the file global.yaml to set the path to your ImageNet validation dataset. The ImageNet folder should contain the 1000 class folders.
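As an illustration only, the entry might look similar to this; the actual key name is defined by the global.yaml shipped with the repository:

# global.yaml (illustrative key name, check the file for the real one)
imagenet_path: /path/to/imagenet/val   # folder containing the 1000 class subfolders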

Evaluation

Evaluation is a two-stage process. The first step is to edit each dataset image according to its benchmark transformation query:

python transform_dataset.py --config exp_configs/final.yaml \
                            --output generated/default \
                            --domain test

This creates a folder generated/default in which edited images are saved. With FlexIT, images are created in subfolders named after the number of iterations.
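For instance, after a run that saves results at 32 and 160 iterations, the output could be organized roughly as follows (illustrative; the exact set of step subfolders depends on the configuration):

generated/default/
    images/
        32/     # images edited with 32 optimization steps
        160/    # images edited with 160 optimization steps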

Then run the evaluation, for instance on the images obtained after 160 steps:

python eval.py generated/default/images/160

Creating a novel editing method

To create and evaluate a novel method for Semantic Image Translation, you should create a class Editer with the following interface (a minimal concrete example is given after the template):

from PIL import Image

class Editer:

    def __init__(self, *args, **kwargs):
        pass

    def __call__(self, img: Image.Image, src: str, tgt: str):
        # Your code to edit the image goes here.
        # It should return a PIL image.
        # It can also return a dict whose values are PIL images:
        # in that case, each key: value entry is saved on disk in the subfolder named after the key.
        return img
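As a minimal concrete example, here is a hypothetical editer (MyEditer is not part of this repository) that returns a dict, so that its outputs are written to two subfolders:

from PIL import Image

class MyEditer:
    """Hypothetical example editer: returns the unmodified image and a resized copy."""

    def __init__(self, size: int = 256):
        self.size = size

    def __call__(self, img: Image.Image, src: str, tgt: str):
        resized = img.resize((self.size, self.size))
        # Each key: value pair is saved in a subfolder named after the key.
        return {"identity": img, "resized": resized}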

Then you can reference it in a config file exp_configs/novel.yaml with

editer.__class__ = path.to.Editer
editer.arg1 = args1
...
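For the hypothetical MyEditer above, assuming it lives in a module my_methods/editers.py, the config could read:

editer.__class__ = my_methods.editers.MyEditer
editer.size = 256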

You can run the evaluation protocol using

python transform_dataset.py --config exp_configs/novel.yaml \
                            --output generated/novel \
                            --test  # omit this flag to run on the dev set
python eval.py generated/novel

License

This repository is released under the MIT license as found in the LICENSE file.

Citation

If you use FlexIT or this benchmark in any way, please cite us! Thanks :)

@inproceedings{flexit2021,
  title     = {{FlexIT}: Towards Flexible Semantic Image Translation},
  author    = {Couairon, Guillaume and Grechka, Asya and Verbeek, Jakob and Schwenk, Holger and Cord, Matthieu},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2022}
}