StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [CVPR 2022]

TL;DR: We introduce a new framework, StyleT2I, to achieve compositional and high-fidelity text-to-image synthesis results.

Figure 1. When the text input contains underrepresented compositions of attributes, e.g., (<span style="color:blue">he</span>, <span style="color:magenta">wearing lipstick</span>), in the dataset, previous methods [1-3] incorrectly generate the attributes with poor image quality. In contrast, StyleT2I achieves better compositionality and high-fidelity text-to-image synthesis results.

Paper

StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis

Zhiheng Li, Martin Renqiang Min, Kai Li, Chenliang Xu

NEC Laboratories America, University of Rochester

preprint, paper, video

Contact: Zhiheng Li (email: zhiheng.li@rochester.edu, homepage: https://zhiheng.li)

Dependencies

pytorch torchvision torchtext pandas ninja

Data Preparation

Put each dataset in a folder under the data directory as follows:

data
├── celebahq
├── cub
├── ffhq
└── nabirds

CelebA-HQ: download CelebAMask-HQ from here and unzip it to data/celebahq/CelebAMask-HQ

CUB: download CUB from here and unzip it to data/cub/CUB_200_2011

NABirds: download the NABirds dataset from here and unzip it to data/nabirds
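
The layout above can be created up front, and the extraction targets checked before training. The snippet below is a minimal sketch and not part of the repository: the `DATA_ROOT` variable and the `check_dataset` helper are hypothetical names introduced here for illustration, and the paths are the ones given in the instructions above.

```shell
#!/bin/sh
# Sketch only: DATA_ROOT and check_dataset are illustrative names,
# not part of the StyleT2I code base.
DATA_ROOT="${DATA_ROOT:-data}"

# Create one folder per dataset, matching the tree shown above.
for name in celebahq cub ffhq nabirds; do
    mkdir -p "$DATA_ROOT/$name"
done

# Report whether an expected extraction target exists.
check_dataset() {
    if [ -d "$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Paths named in the instructions above; NABirds unzips directly into its folder.
check_dataset "$DATA_ROOT/celebahq/CelebAMask-HQ" || true
check_dataset "$DATA_ROOT/cub/CUB_200_2011" || true
check_dataset "$DATA_ROOT/nabirds" || true
```

A "missing" line here only means that the archive has not yet been unzipped to the expected location; it does not inspect the dataset contents.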

Pretrained StyleGAN2 Model

Download the pretrained StyleGAN2 models from here and place them in exp/pretrained_stylegan2.

Training

The following commands are the bash scripts for training on the CelebA-HQ dataset. For other datasets, replace /celebahq/ in the script path with the corresponding folder, e.g., /cub/, /ffhq/, or /nabirds/.

Pretrain StyleGAN2

Our StyleGAN2 code is based on the implementation at https://github.com/rosinality/stylegan2-pytorch.

If you prefer to pretrain StyleGAN2 yourself, use the following command. Otherwise, use the pretrained model provided above.

bash scripts/celebahq/pretrain_stylegan2.sh

Finetune CLIP

bash scripts/celebahq/ft_clip_text.sh

Note that finetuning CLIP is available only on the CelebA-HQ and CUB datasets. It is not available on the FFHQ and NABirds datasets because they do not have text annotations. However, StyleT2I can perform cross-dataset generation, i.e., StyleT2I-XD. More details are in the paper.

Train StyleT2I

bash scripts/celebahq/train.sh

Synthesize Images

bash scripts/celebahq/synthesize.sh
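
The steps above can be summarized per dataset with a small dry-run helper. This is a sketch under stated assumptions, not a script shipped with the repository: `plan_pipeline` is a hypothetical function name, and it assumes that every dataset folder under scripts/ uses the same file names as scripts/celebahq/, which the text suggests but which is not verified here. It only prints the commands and does not run them.

```shell
#!/bin/sh
# Sketch only: plan_pipeline is an illustrative helper, not part of StyleT2I.
# Usage: plan_pipeline <dataset> [pretrain]
# Prints the commands for one dataset without executing them.
plan_pipeline() {
    dataset="$1"

    # Pretraining StyleGAN2 is optional; skip it when using the provided model.
    if [ "${2:-}" = "pretrain" ]; then
        echo "bash scripts/$dataset/pretrain_stylegan2.sh"
    fi

    # CLIP finetuning needs text annotations, which only CelebA-HQ and CUB have.
    case "$dataset" in
        celebahq|cub)
            echo "bash scripts/$dataset/ft_clip_text.sh"
            ;;
    esac

    echo "bash scripts/$dataset/train.sh"
    echo "bash scripts/$dataset/synthesize.sh"
}

plan_pipeline celebahq pretrain
plan_pipeline ffhq
```

For ffhq and nabirds the plan omits the CLIP finetuning step, which matches the note above about missing text annotations.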

References

[1] B. Li, X. Qi, T. Lukasiewicz, and P. Torr, “Controllable Text-to-Image Generation,” in NeurIPS, 2019.

[2] S. Ruan et al., “DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis,” in ICCV, 2021.

[3] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, “TediGAN: Text-Guided Diverse Face Image Generation and Manipulation,” in CVPR, 2021.

Citation

@InProceedings{Li_2022_CVPR,
author = {Li, Zhiheng and Min, Martin Renqiang and Li, Kai and Xu, Chenliang},
title = {StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2022}
}