Vector Quantized Diffusion Model for Text-to-Image Synthesis (CVPR2022, Oral)
Due to company policy, I have to set microsoft/VQ-Diffusion to private for now, so I provide the same code here.
The code for Improved VQ-Diffusion is in the new branch; it can improve the performance of VQ-Diffusion by a large margin.
Overview
This is the official repo for the paper: Vector Quantized Diffusion Model for Text-to-Image Synthesis.
VQ-Diffusion is based on a VQ-VAE whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). It produces significantly better text-to-image generation results than autoregressive models with a similar number of parameters. Compared with previous GAN-based methods, VQ-Diffusion can handle more complex scenes and improves the synthesized image quality by a large margin.
Framework
<img src='figures/framework.png' width='600'>
Requirements
We suggest using Docker. Alternatively, you may run:
bash install_req.sh
Data Preparation
Microsoft COCO
│MSCOCO_Caption/
├──annotations/
│ ├── captions_train2014.json
│ ├── captions_val2014.json
├──train2014/
│ ├── train2014/
│ │ ├── COCO_train2014_000000000009.jpg
│ │ ├── ......
├──val2014/
│ ├── val2014/
│ │ ├── COCO_val2014_000000000042.jpg
│ │ ├── ......
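As a quick sanity check on this layout, the sketch below pairs each caption in a COCO annotation file with the image path it would have under the tree above. It relies on the standard COCO captions JSON fields (`images[].id`, `images[].file_name`, `annotations[].image_id`, `annotations[].caption`). The function name `load_caption_pairs` is hypothetical, and the doubled `train2014/train2014/` join is an assumption read off the listing, not code from this repo.

```python
import json
from pathlib import Path


def load_caption_pairs(root, split="train2014"):
    """Return (image_path, caption) pairs for one split of MSCOCO_Caption/."""
    root = Path(root)
    ann_file = root / "annotations" / f"captions_{split}.json"
    with open(ann_file, "r", encoding="utf-8") as f:
        data = json.load(f)

    # Map each image id to its file name.
    id_to_name = {img["id"]: img["file_name"] for img in data["images"]}

    pairs = []
    for ann in data["annotations"]:
        name = id_to_name.get(ann["image_id"])
        if name is None:
            # Caption refers to an image that is not listed; skip it.
            continue
        # Assumption: images live in <split>/<split>/ as in the tree above.
        pairs.append((root / split / split / name, ann["caption"].strip()))
    return pairs
```

Calling `load_caption_pairs("MSCOCO_Caption", "val2014")` would read `captions_val2014.json` in the same way; checking `path.is_file()` on a few of the returned paths is a cheap way to catch a misplaced folder before training.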
CUB-200
│CUB-200/
├──images/
│ ├── 001.Black_footed_Albatross/
│ ├── 002.Laysan_Albatross
│ ├── ......
├──text/
│ ├── text/
│ │ ├── 001.Black_footed_Albatross/
│ │ ├── 002.Laysan_Albatross
│ │ ├── ......
├──train/
│ ├── filenames.pickle
├──test/
│ ├── filenames.pickle
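To inspect a CUB-200 split, something like the following may help. It is only a sketch under stated assumptions: that `filenames.pickle` holds a list of strings such as `001.Black_footed_Albatross/<image_name>` (without extension), that images end in `.jpg`, and that caption files end in `.txt` under `text/text/`. None of these details are confirmed by this README, and `load_cub_split` is a hypothetical helper name.

```python
import pickle
from pathlib import Path


def load_cub_split(root, split="train"):
    """List the samples of one CUB-200 split with their assumed file paths."""
    root = Path(root)
    with open(root / split / "filenames.pickle", "rb") as f:
        # encoding="latin1" is harmless for Python 3 pickles and helps if the
        # file was written by Python 2 (an assumption about its origin).
        names = pickle.load(f, encoding="latin1")

    samples = []
    for name in names:
        samples.append({
            "name": name,
            # Assumed extensions: .jpg for images, .txt for captions.
            "image": root / "images" / f"{name}.jpg",
            "text": root / "text" / "text" / f"{name}.txt",
        })
    return samples
```

Since `pickle.load` can execute arbitrary code, it should only be used on files from a trusted source.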
ImageNet
│imagenet/
├──train/
│ ├── n01440764
│ │ ├── n01440764_10026.JPEG
│ │ ├── n01440764_10027.JPEG
│ │ ├── ......
│ ├── ......
├──val/
│ ├── n01440764
│ │ ├── ILSVRC2012_val_00000293.JPEG
│ │ ├── ILSVRC2012_val_00002138.JPEG
│ │ ├── ......
│ ├── ......
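The ImageNet tree above can be checked with a short script before training. The sketch below counts the `.JPEG` files in each class folder of `train/` and `val/`; the helper name `summarize_imagenet` is hypothetical and the script is not part of this repo.

```python
from pathlib import Path


def summarize_imagenet(root):
    """Return {split: {class_folder: number_of_JPEG_files}} for train and val."""
    root = Path(root)
    summary = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split folder: {split_dir}")
        counts = {}
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            # Count only image files; ignore anything else in the folder.
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.upper() == ".JPEG"
            )
        summary[split] = counts
    return summary
```

Comparing `set(summary["train"])` with `set(summary["val"])` shows whether both splits contain the same class folders.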
Pretrained Model
We release four pretrained text-to-image models, trained on the Conceptual Captions, MSCOCO, CUB200, and LAION-human datasets. We also release an ImageNet pretrained model and provide the CLIP pretrained model for convenience. These should be placed under OUTPUT/pretrained_model/ . The pretrained model files may be large because they are training checkpoints, which contain gradient information, optimizer state, the EMA model, and other data.
In addition, we provide VQVAE models for the FFHQ, OpenImages, and ImageNet datasets. These models are from Taming Transformers and are provided here for convenience. Please put them under OUTPUT/pretrained_model/taming_dvae/ .
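Because the checkpoints are large and must sit in specific folders, a small check before running inference can save time. The following is a minimal sketch, not part of the repo: `check_pretrained` is a hypothetical helper, and the caller supplies whichever file names they downloaded.

```python
from pathlib import Path


def check_pretrained(expected, base="OUTPUT/pretrained_model"):
    """Split `expected` relative paths into (found, missing) under `base`.

    `found` maps each existing file to its size in MiB.
    """
    base = Path(base)
    found, missing = {}, []
    for rel in expected:
        path = base / rel
        if path.is_file():
            found[rel] = path.stat().st_size / 2**20
        else:
            missing.append(rel)
    return found, missing
```

For example, `check_pretrained(["config_text.yaml", "human_pretrained.pth"])` would report whether the two files used in the text-to-image inference example are in place.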
Inference
To generate images from a given text prompt:
from inference_VQ_Diffusion import VQ_Diffusion
VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_text.yaml', path='OUTPUT/pretrained_model/human_pretrained.pth')
VQ_Diffusion_model.inference_generate_sample_with_condition("a beautiful smiling woman", truncation_rate=0.85, save_root="RESULT", batch_size=4)
VQ_Diffusion_model.inference_generate_sample_with_condition("a woman in yellow dress", truncation_rate=0.85, save_root="RESULT", batch_size=4, fast=2)  # for fast inference
You may replace human_pretrained.pth with another pretrained model to test different kinds of text.
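When testing several prompts, the call above can be wrapped in a loop. The sketch below is one possible way to do that. `generate_for_prompts` is a hypothetical helper, and it uses only the method name and keyword arguments shown in the example above (`truncation_rate`, `save_root`, `batch_size`, and optionally `fast`).

```python
def generate_for_prompts(model, prompts, truncation_rate=0.85,
                         save_root="RESULT", batch_size=4, fast=None):
    """Run text-conditioned sampling once per non-empty prompt.

    `model` is expected to be a VQ_Diffusion instance as created above.
    Returns the list of prompts that were actually sent to the model.
    """
    done = []
    for prompt in prompts:
        prompt = prompt.strip()
        if not prompt:
            continue  # skip blank lines, e.g. from a prompt file
        kwargs = dict(truncation_rate=truncation_rate,
                      save_root=save_root, batch_size=batch_size)
        if fast is not None:
            # Only pass `fast` when requested, mirroring the example above.
            kwargs["fast"] = fast
        model.inference_generate_sample_with_condition(prompt, **kwargs)
        done.append(prompt)
    return done
```

With the model from the example, `generate_for_prompts(VQ_Diffusion_model, ["a beautiful smiling woman", "a woman in yellow dress"], fast=2)` would issue the two calls in sequence.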
To generate images from a given ImageNet class label:
from inference_VQ_Diffusion import VQ_Diffusion
VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_imagenet.yaml', path='OUTPUT/pretrained_model/imagenet_pretrained.pth')
VQ_Diffusion_model.inference_generate_sample_with_class(407, truncation_rate=0.86, save_root="RESULT", batch_size=4)
Training
First, set data_root to the correct path in configs/coco.yaml or in the other config files.
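If the same change has to be made in several configs, it can be scripted. The sketch below rewrites every line of the form `data_root: <value>` in a config file. It is a plain-text substitution under the assumption that `data_root` appears as a simple `key: value` line; the actual structure of the configs is not described here, and `set_data_root` is a hypothetical helper. Any trailing comment on a rewritten line is dropped.

```python
import re
from pathlib import Path


def set_data_root(config_path, new_root):
    """Replace the value of every `data_root:` line; return how many changed."""
    path = Path(config_path)
    text = path.read_text(encoding="utf-8")

    # Capture indentation and key so only the value after the colon changes.
    pattern = re.compile(r"^([ \t]*data_root[ \t]*:[ \t]*).*$", flags=re.MULTILINE)
    # A function replacement avoids backslash-escape issues in the new path.
    new_text, count = pattern.subn(lambda m: m.group(1) + new_root, text)

    if count == 0:
        raise KeyError(f"no data_root entry found in {path}")
    path.write_text(new_text, encoding="utf-8")
    return count
```

For example, `set_data_root("configs/coco.yaml", "/path/to/MSCOCO_Caption")` would update the file in place; keeping a backup or checking the diff afterwards is advisable.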
Train text-to-image generation on the MSCOCO dataset:
python running_command/run_train_coco.py
Train text-to-image generation on the CUB200 dataset:
python running_command/run_train_cub.py
Train class-conditional generation on the ImageNet dataset:
python running_command/run_train_imagenet.py
Train unconditional generation on the FFHQ dataset:
python running_command/run_train_ffhq.py
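The four commands above differ only in the script name, so they can be selected by dataset name from a small wrapper. This is an illustrative sketch rather than a tool shipped with the repo: `TRAIN_SCRIPTS`, `train_command`, and `run_training` are hypothetical names, and the script paths are copied from the commands listed above.

```python
import subprocess
import sys

# Dataset name -> training script, as listed in the Training section.
TRAIN_SCRIPTS = {
    "coco": "running_command/run_train_coco.py",
    "cub": "running_command/run_train_cub.py",
    "imagenet": "running_command/run_train_imagenet.py",
    "ffhq": "running_command/run_train_ffhq.py",
}


def train_command(dataset):
    """Build the command line for one dataset, using the current interpreter."""
    try:
        script = TRAIN_SCRIPTS[dataset.lower()]
    except KeyError:
        known = ", ".join(sorted(TRAIN_SCRIPTS))
        raise ValueError(f"unknown dataset {dataset!r}; expected one of: {known}") from None
    return [sys.executable, script]


def run_training(dataset):
    """Launch training from the repo root; raises if the script exits non-zero."""
    return subprocess.run(train_command(dataset), check=True)
```

Calling `run_training("cub")` from the repository root would be equivalent to the second command above.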
Cite VQ-Diffusion
If you find our code helpful for your research, please consider citing:
@article{gu2021vector,
title={Vector Quantized Diffusion Model for Text-to-Image Synthesis},
author={Gu, Shuyang and Chen, Dong and Bao, Jianmin and Wen, Fang and Zhang, Bo and Chen, Dongdong and Yuan, Lu and Guo, Baining},
journal={arXiv preprint arXiv:2111.14822},
year={2021}
}
Acknowledgement
Thanks to everyone who makes their code and models available. In particular,
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Microsoft Open Source Code of Conduct
Contact Information
For help or issues using VQ-Diffusion, please submit a GitHub issue. For other communications related to VQ-Diffusion, please contact Shuyang Gu (gsy777@mail.ustc.edu.cn) or Dong Chen (doch@microsoft.com).