

Autoregressive Image Generation using Residual Quantization (CVPR 2022)

The official implementation of "Autoregressive Image Generation using Residual Quantization"
Doyup Lee*, Chiheon Kim*, Saehoon Kim, Minsu Cho, Wook-Shin Han (* Equal contribution)
CVPR 2022

<center><img src="assets/figures/teaser.png" height="256"></center>

Examples of images generated by RQ-Transformer under class conditions and text conditions.
Note that the text conditions in these examples were not used during training.

TL;DR For autoregressive (AR) modeling of high-resolution images, we propose a two-stage framework consisting of RQ-VAE and RQ-Transformer. The framework precisely approximates the feature map of an image and represents the image as a stack of discrete codes, which enables effective generation of high-quality images.

<center><img src="assets/figures/overview_figure.png"></center>
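To make the "stack of discrete codes" concrete, here is a minimal sketch of residual quantization for a single feature vector. It is not the RQ-VAE implementation in this repository: the function names and the toy 2-D codebook are assumptions made up for illustration, and the real model operates on learned feature maps.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def residual_quantize(vector, codebook, depth):
    """Quantize `vector` into `depth` codes by repeatedly quantizing the residual."""
    residual = list(vector)
    approx = [0.0] * len(vector)
    codes = []
    for _ in range(depth):
        k = nearest_code(residual, codebook)
        codes.append(k)
        # The approximation is the running sum of the selected code vectors.
        approx = [a + e for a, e in zip(approx, codebook[k])]
        # The next level quantizes whatever is still unexplained.
        residual = [r - e for r, e in zip(residual, codebook[k])]
    return codes, approx


# Hypothetical toy codebook; a learned codebook would be far larger.
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-0.25, 0.25]]
toy_vector = [1.2, 0.8]
codes, approx = residual_quantize(toy_vector, toy_codebook, depth=4)
print(codes, approx)
```

Each additional depth level refines the approximation of the same vector, so a feature map of size H x W becomes an H x W x D array of code indices.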


We have tested our code in the environment below.

Run the following command to install the necessary dependencies:

pip install -r requirements.txt

Coverage of Released Codes

Pretrained Checkpoints

Checkpoints Used in the Original Paper

We provide pretrained checkpoints of RQ-VAEs and RQ-Transformers to reproduce the results in the paper. Use the links below to download the tar.gz files and extract the pretrained checkpoints. Each archive contains the pretrained checkpoints of an RQ-VAE and an RQ-Transformer together with their model configurations.

| Dataset | RQ-VAE & RQ-Transformer | # params of RQ-Transformer | FID |
|---|---|---|---|
| ImageNet (cIN) | link | 480M | 15.72 |
| ImageNet (cIN) | link | 821M | 13.11 |
| ImageNet (cIN) | link | 1.4B | 11.56 (4.45) |
| ImageNet (cIN) | link | 1.4B | 8.71 (3.89) |
| ImageNet (cIN) | link | 3.8B | 7.55 (3.80) |

The FID scores above are measured between original samples and generated images; the scores in brackets are measured using 5% rejection sampling with a pretrained ResNet-101. This repository does not include the rejection sampling pipeline.
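Since the rejection sampling pipeline is not part of this repository, the following is only a rough sketch of the general idea, not the procedure used for the paper. It assumes that each generated image has already been assigned a score (for example, the classifier's probability for the conditioning class) and that a fixed fraction of the highest-scoring images is kept. The function name, the scoring rule, and the reading of "5%" as the kept fraction are all assumptions.

```python
def rejection_sample(samples, scores, keep_ratio):
    """Keep the `keep_ratio` fraction of `samples` with the highest `scores`."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    num_keep = max(1, int(len(samples) * keep_ratio))
    # Rank sample indices from highest to lowest score.
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in ranked[:num_keep]]


# Hypothetical data: 20 generated images with made-up classifier scores.
fake_images = ["img_%02d" % i for i in range(20)]
fake_scores = [i / 20.0 for i in range(20)]
kept = rejection_sample(fake_images, fake_scores, keep_ratio=0.05)
print(kept)
```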

(NOTE) Large-Scale RQ-Transformer for Text-to-Image Generation

We also provide the pretrained checkpoint of a large-scale RQ-Transformer for text-to-image (T2I) generation. Our paper does not include results for this model, because we trained the RQ-Transformer with 3.9B parameters on about 30 million text-image pairs from CC-3M, CC-12M, and YFCC-subset after the paper submission. Use the link below to download the checkpoints of the large-scale T2I model. We emphasize that any commercial use of our checkpoints is strictly prohibited.

Download of Pretrained RQ-Transformer on 30M text-image pairs

| Dataset | RQ-VAE & RQ-Transformer | # params |
|---|---|---|
| CC-3M + CC-12M + YFCC-subset | link | 3.9B |

Evaluation of Large-Scale RQ-Transformer on MS-COCO

In this repository, we evaluate the pretrained RQ-Transformer with 3.9B parameters on MS-COCO. Following the evaluation protocol of DALL-Eval, we randomly select 30K text captions from the val2014 split of MS-COCO and generate 256x256 images from the selected captions. We use top-k sampling with k = 1024 and top-p sampling with p = 0.95. The FID scores of the other models are taken from Table 2 of the DALL-Eval paper.
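For readers unfamiliar with top-(k, p) sampling, the sketch below shows one common way to combine the two filters for a single next-token distribution. It is an illustration only, not the sampling code of this repository: the function names are hypothetical, and applying top-k before top-p is an assumption about the order.

```python
import math
import random


def top_k_top_p_filter(logits, top_k, top_p):
    """Return {token_index: probability} after top-k, then top-p filtering."""
    # Top-k: keep the indices of the k largest logits.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]

    # Softmax restricted to the surviving tokens (shifted for stability).
    max_logit = logits[ranked[0]]
    exps = {i: math.exp(logits[i] - max_logit) for i in ranked}
    total = sum(exps.values())

    # Top-p: keep the smallest high-probability prefix whose mass reaches top_p.
    kept, cumulative = {}, 0.0
    for i in ranked:
        prob = exps[i] / total
        kept[i] = prob
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to one.
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}


def sample_token(logits, top_k=1024, top_p=0.95, rng=random):
    """Draw one token index from the filtered distribution."""
    filtered = top_k_top_p_filter(logits, top_k, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy 4-token vocabulary with made-up probabilities.
example_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(top_k_top_p_filter(example_logits, top_k=3, top_p=0.8))
print(sample_token(example_logits))
```

With the defaults (1024, 0.95), at most 1024 candidate codes survive the first filter, and the low-probability tail beyond 95% of the remaining mass is dropped before sampling.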

| Model | # params | # data | Image / Grid Size | FID on 2014val |
|---|---|---|---|---|
| X-LXMERT | 228M | 180K | 256x256 / 8x8 | 37.4 |
| DALL-E small | 120M | 15M | 256x256 / 16x16 | 45.8 |
| ruDALL-E-XL | 1.3B | 120M | 256x256 / 32x32 | 18.6 |
| minDALL-E | 1.3B | 15M | 256x256 / 16x16 | 24.6 |
| RQ-Transformer (ours) | 3.9B | 30M | 256x256 / 8x8x4 | 16.9 |

Note that some text captions in MS-COCO are also included in the YFCC-subset, but the FIDs do not differ much whether or not the duplicated captions are removed from the evaluation. See this paper for more details.

Examples of Text-to-Image (T2I) Generation using RQ-Transformer

We provide a Jupyter notebook so that you can easily try text-to-image (T2I) generation with the pretrained RQ-Transformer and view the results. After you download the pretrained checkpoints for T2I generation, open notebooks/T2I_sampling.ipynb and follow the instructions in the notebook. Considering the model size, we recommend using a GPU with at least 32GB of memory, such as an NVIDIA V100 or A100.

We attach some examples of T2I generation from the provided Jupyter notebook.

Examples of Generated Images from Text Conditions

<details> <summary> a painting by Vincent Van Gogh </summary>
<center><img src="assets/figures/T2I_samples/a painting by Vincent Van Gogh_temp_1.0_top_k_1024_top_p_0.95.jpg"></center>
</details>

<details> <summary> a painting by RENÉ MAGRITTE </summary>
<center><img src="assets/figures/T2I_samples/a painting by RENÉ MAGRITTE_temp_1.0_top_k_1024_top_p_0.95.jpg"></center>
</details>

<details> <summary> Eiffel tower on a desert. </summary>
<center><img src="assets/figures/T2I_samples/Eiffel tower on a desert._temp_1.0_top_k_1024_top_p_0.95.jpg"></center>
</details>

<details> <summary> Eiffel tower on a mountain. </summary>
<center><img src="assets/figures/T2I_samples/Eiffel tower on a mountain._temp_1.0_top_k_1024_top_p_0.95.jpg"></center>
</details>

<details> <summary> a painting of a cat with sunglasses in the frame. </summary>
<center><img src="assets/figures/T2I_samples/a painting of a cat with sunglasses in the frame._temp_1.0_top_k_1024_top_p_0.95.jpg"></center>
</details>

<details> <summary> a painting of a dog with sunglasses in the frame. </summary>
<center><img src="assets/figures/T2I_samples/a painting of a dog with sunglasses in the frame._temp_1.0_top_k_1024_top_p_0.95.jpg"></center>
</details>

Training and Evaluation of RQ-VAE

Training of RQ-VAEs

Our implementation uses DistributedDataParallel in PyTorch for efficient training in multi-node, multi-GPU environments. All RQ-VAEs in our paper were trained on four NVIDIA A100 GPUs. You can adjust -nr and -np according to your GPU setup.

Finetuning of Pretrained RQ-VAE

Evaluation of RQ-VAEs

Run compute_rfid.py to evaluate the reconstruction FID (rFID) of learned RQ-VAEs.

python compute_rfid.py --split=val --vqvae=$RQVAE_CKPT

Evaluation of RQ-Transformer

The evaluation code for RQ-Transformer in this repository reproduces the quantitative results in the paper. Before evaluating RQ-Transformer on a dataset, the dataset must be prepared so that the feature vectors of its samples can be computed. Because extracting feature vectors is computationally expensive and slow, we provide the feature-vector statistics of each dataset for reproducing the results in the paper. You can also prepare the datasets used in our paper by following the instructions in data/README.md.

FFHQ, LSUN-{Church, Bedroom, Cat}, (conditional) ImageNet




python compute_metrics.py fake_path=$DIR_SAVED_IMG ref_dataset=$DATASET_NAME
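As background on why precomputed feature statistics are sufficient, the sketch below computes the Fréchet distance between two Gaussians from their means and covariances, which is the quantity FID is based on. It is an illustration under assumptions, not the code in compute_metrics.py: it assumes the provided statistics are the mean and covariance of Inception-style features, and the function names are hypothetical.

```python
import numpy as np


def feature_statistics(features):
    """Mean vector and covariance matrix of an (N, D) array of feature vectors."""
    features = np.asarray(features, dtype=np.float64)
    return features.mean(axis=0), np.cov(features, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1 = np.asarray(mu1, dtype=np.float64)
    mu2 = np.asarray(mu2, dtype=np.float64)
    sigma1 = np.asarray(sigma1, dtype=np.float64)
    sigma2 = np.asarray(sigma2, dtype=np.float64)
    diff = mu1 - mu2
    # For positive semi-definite covariances, tr((S1 S2)^(1/2)) equals the sum
    # of the square roots of the eigenvalues of S1 S2, which are real and >= 0.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


# Toy check: identity covariances, means 5 units apart.
print(frechet_distance([0.0, 0.0], np.eye(2), [3.0, 4.0], np.eye(2)))
```

Once the reference mean and covariance are stored, only the features of the generated images need to be extracted at evaluation time.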

Sampling speed benchmark

We provide code to measure the sampling speed of RQ-Transformer for different code shapes of RQ-VAEs, such as 8x8x4 or 16x16x1, as shown in Figure 4 of the paper. To reproduce the figure, run the following commands on an NVIDIA A100 GPU:

# RQ-Transformer (1.4B) on 16x16x1 RQ-VAE (corresponds to VQ-GAN 1.4B model)
python -m measure_throughput f=16 d=1 c=16384 model=huge batch_size=100
python -m measure_throughput f=16 d=1 c=16384 model=huge batch_size=200
python -m measure_throughput f=16 d=1 c=16384 model=huge batch_size=500  # this will result in OOM.

# RQ-Transformer (1.4B) on 8x8x4 RQ-VAE
python -m measure_throughput f=32 d=4 c=16384 model=huge batch_size=100
python -m measure_throughput f=32 d=4 c=16384 model=huge batch_size=200
python -m measure_throughput f=32 d=4 c=16384 model=huge batch_size=500
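The relation between the f and d arguments and the code shapes in the comments can be checked with a few lines of arithmetic. This sketch assumes that f is the spatial downsampling factor of the RQ-VAE, d is the quantization depth, and the input is a 256x256 image; these readings are inferred from the comments above, and the helper name is hypothetical.

```python
def code_shape(image_size, f, d):
    """Code map shape (height, width, depth) for a square image."""
    if image_size % f != 0:
        raise ValueError("image_size must be divisible by f")
    side = image_size // f
    return (side, side, d)


for f, d in [(16, 1), (32, 4)]:
    h, w, depth = code_shape(256, f, d)
    # Spatial positions vs. total number of code indices per image.
    print(f"f={f} d={d}: shape={h}x{w}x{depth}, positions={h * w}, codes={h * w * depth}")
```

Both settings use 256 code indices per image, but the 8x8x4 layout has only 64 spatial positions instead of 256.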


@inproceedings{lee2022autoregressive,
  title={Autoregressive Image Generation using Residual Quantization},
  author={Lee, Doyup and Kim, Chiheon and Kim, Saehoon and Cho, Minsu and Han, Wook-Shin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}



If you would like to collaborate with us or give us feedback, please contact us at contact@kakaobrain.com.


Our transformer-related implementation is inspired by minGPT and minDALL-E. We thank the authors of VQGAN for making their code publicly available.


Because RQ-Transformer is trained on publicly available datasets, some generated images may contain socially unacceptable content, depending on the text conditions. If this happens, please send us the pair of "text condition" and "generated images".