Awesome

DynamicVectorQuantization (CVPR 2023 highlight)

Offical PyTorch implementation of our CVPR 2023 highlight paper "Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization".

TL;DR For vector-quantization (VQ) based autoregressive image generation, we propose a novel variable-length coding to replace existing fixed-length coding, which brings an accurate & compact code representation for images and a natural coarse-to-fine autoregressive generation order.

Our framework includes: (1) DynamicQuantization VAE (DQ-VAE) which encodes image regions into variable-length codes based on their information densities. (2) DQ-Transformer which thereby generates images autoregressively from coarse-grained (smooth regions with fewer codes) to fine-grained (details regions with more codes) by modeling the position and content of codes in each granularity alternately, through a novel stacked transformer architecture and shared-content, non-shared position input layers designs.

See Our Another CVPR2023 Work about Vector-Quantization based Image Generation "Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation" (GitHub)

Requirements and Installation

Please run the following command to install the necessary dependencies.

conda env create -f environment.yml

Data Preparation

Prepare dataset as follows, then change the corresponding datapath in data/default.py.

ImageNet

Prepare ImageNet dataset structure as follows:

${Your Data Root Path}/ImageNet/
├── train
│   ├── n01440764
│   |   |── n01440764_10026.JPEG
│   |   |── n01440764_10027.JPEG
│   |   |── ...
│   ├── n01443537
│   |   |── n01443537_2.JPEG
│   |   |── n01443537_16.JPEG
│   |   |── ...
│   ├── ...
├── val
│   ├── n01440764
│   |   |── ILSVRC2012_val_00000293.JPEG
│   |   |── ILSVRC2012_val_00002138.JPEG
│   |   |── ...
│   ├── n01443537
│   |   |── ILSVRC2012_val_00000236.JPEG
│   |   |── ILSVRC2012_val_00000262.JPEG
│   |   |── ...
│   ├── ...
├── imagenet_idx_to_synset.yml
├── synset_human.txt

FFHQ

The FFHQ dataset could be obtained from the FFHQ repository. Then prepare the dataset structure as follows:

${Your Data Root Path}/FFHQ/
├── assets
│   ├── ffhqtrain.txt
│   ├── ffhqvalidation.txt
├── FFHQ
│   ├── 00000.png
│   ├── 00001.png

Training of DQVAE

Training DQVAE with dual granularities (downsampling factor of F=16 and F=8)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --gpus -1 --base configs/stage1/dqvae-dual-r-05_imagenet.yml --max_epochs 50

The target ratio for the finer granularity (F=8) could be set in model.params.lossconfig.params.budget_loss_config.params.target_ratio.

Training DQVAE with triple granularities (downsampling factor of F=32, F=16, F=8)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --gpus -1 --base configs/stage1/dqvae-triple-r-03-03_imagenet.yml --max_epochs 50

The target ratio for the finest granularity (F=8) could be set in model.params.lossconfig.params.budget_loss_config.params.target_fine_ratio. The target ratio for the median granularity (F=16) could be set in model.params.lossconfig.params.budget_loss_config.params.target_median_ratio.

A Better Version of DQVAE

Here we provide a better version of DQVAE compare with the one we proposed in the paper, which leads to much stable training results and also slight better reconstruction quality. To be specific, we assign the granularity of each region directly according to their image entropy instead of the features extracted from the encoder.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --gpus -1 --base configs/stage1/dqvae-entropy-dual-r05_imagenet.yml --max_epochs 50

The target ratio for the finer granularity (F=8) could be set in model.params.encoderconfig.params.router_config.params.fine_grain_ratito. The distribution of image entropy is pre-calculated in scripts/tools/thresholds/entropy_thresholds_imagenet_train_patch-16.json.

UPDATE

Thanks for @1e0nhardt 's advice and we update the image entropy calculation in models/stage1_dynamic/dqvae_dual_entropy.py.

Visualization of Variable-length Coding

Training of DQ-Transformer

Unconditional Generation

Copy the first stage model DQ-VAE's config to model.params.first_stage_config. The pre-trained DQ-VAE's path should be set in model.params.first_stage_config.params.ckpt_path. Here we take ImageNet as an example to show the unconditional DQ-Transformer training, and other datasets like FFHQ could be derive correspondingly.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --gpus -1 --base configs/stage2/uncond_imagenet_p6c18.yml --max_epochs 100

NOTE: some important hyper-parameters in the config file:

the layer of Content-Transformer: model.params.transformer_config.params.content_layer
the layer of Position-Transformer: model.params.transformer_config.params.position_layer
the vocab size of content code: model.params.transformer_config.params.vocab_size, which should include the codebook size of DQ-VAE's codebook, 1 extra pad code, 1 extra eos code and 1 extra sos code.
the vocab size of coarse granularity's position: model.params.transformer_config.params.coarse_position_size, which should include the size of coarse granularity's feature map (e.g., 16 $\times$ 16 = 256 for downsampling factor F=16, or 32 $\times$ 32 = 1024 for downsampling factor F=8), 1 extra pad code, 1 extra eos code and 1 extra sos code.
the vocab size of fine granularity's position: model.params.transformer_config.params.fine_position_size

Class-conditional Generation

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --gpus -1 --base configs/stage2/class_imagenet_p6c18.yml --max_epochs 100

NOTE: some important hyper-parameters in the config file:

the vocab size of content code: model.params.transformer_config.params.vocab_size, which should include the codebook size of DQ-VAE's codebook, 1 extra pad code, 1 extra eos code and 1000 imagenet class number.
the vocab size of coarse granularity's position: model.params.transformer_config.params.coarse_position_size, which should include the size of coarse granularity's feature map (e.g., 16 $\times$ 16 = 256 for downsampling factor F=16, or 32 $\times$ 32 = 1024 for downsampling factor F=8), 1 extra pad code, 1 extra eos code and 1000 imagenet class number.
the vocab size of fine granularity's position: model.params.transformer_config.params.fine_position_size

Pre-trained Models

description	Training Details	Dataset	FID (val, 50k)	download link
DQ-VAE, dual granularity ($F_8=0.5, F_{16}=0.5$) (entropy-based)	4 A100, 10 epochs	ImageNet	1.6968	Google Cloud

Reference

If you found this code useful, please cite the following paper:

@InProceedings{Huang_2023_CVPR,
    author    = {Huang, Mengqi and Mao, Zhendong and Chen, Zhuowei and Zhang, Yongdong},
    title     = {Towards Accurate Image Coding: Improved Autoregressive Image Generation With Dynamic Vector Quantization},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {22596-22605}
}

@InProceedings{Huang_2023_CVPR,
    author    = {Huang, Mengqi and Mao, Zhendong and Wang, Quan and Zhang, Yongdong},
    title     = {Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {2002-2011}
}