This is the codebase for the paper

Elucidating the design space of language models for image generation


Xuantong Liu, Shaozhe Hao, Xianbiao Qi*, Tianyang Hu#, Jun Wang, Rong Xiao, Yuan Yao#
The Hong Kong University of Science and Technology, The University of Hong Kong, Intellifusion, Huawei Noah's Ark Lab
(*: Project leader; #: Corresponding authors)

[Project Page] [arXiv] [Colab]


Introduction 💡

We explore the design space of using language models for image generation, including the choice of image tokenizer (Binary Autoencoder or Vector-Quantization Autoencoder), the language modeling method (AutoRegressive or Masked Language Model), the vocabulary design based on BAE, and the sampling strategies. We achieve a strong baseline (1.54 FID on ImageNet 256$\times$256) compared with language-model-based and diffusion-model-based image generation models. We also analyze the fundamental differences between image and language sequence generation and the learning behavior of language models on image generation, demonstrating the scaling law and the great potential of AR models across different domains.

We provide 4 BAE tokenizers with code dimensions 16, 20, and 24 (see the table below), each trained for 1,000,000 iterations with a batch size of 256. We also provide checkpoints for all the generation models discussed in the paper. All download links are provided below.

Set up 🔩

You can simply install the environment from the provided environment.yml file by running:

conda env create -f environment.yml
conda activate ELM
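
As a quick, optional sanity check (not part of the repository's instructions), you can confirm that PyTorch and CUDA are visible inside the new environment:

# should print the installed torch version and True if a GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"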

Download 💡

You can download the checkpoints for the image tokenizers (BAE) and generation models from link.

Image Tokenizers (BAEs) 🧩

| Code Dim | Bernoulli Sampling | Link | Size |
| --- | --- | --- | --- |
| 16 |  | link | 332MB |
| 16 |  | link | 332MB |
| 20 |  | link | 332MB |
| 24 |  | link | 332MB |

Generation Models (GPTs) ⚙️

| Model | Link | Size |
| --- | --- | --- |
| AR-L | [1-16] [2-8] [2-10] [2-12] | 1.25GB~1.77GB |
| AR-XL | [1-16] [2-8] [2-10] [2-12] | 2.95GB~3.6GB |
| AR-XXL | [1-16] [2-10] [2-12] | 5.49GB~6.25GB |
| AR-2B | [2-12] | 7.64GB |
| MLM-L | [1-16] | 1.51GB |
| MLM-XL | [1-16] | 3.27GB |
| MLM-XXL | [1-16] | 5.86GB |

Image Generation 🌟

If you want to generate samples with our pretrained models, run

bash inference.sh

You need to specify the checkpoint path with --ckpt. By default, samples are generated from 8 classes: [207, 360, 387, 974, 88, 979, 417, 279]. If you want to generate images larger than 256 $\times$ 256, activate --v_expand (for vertical expansion) or --h_expand (for horizontal expansion) in inference.sh. --overlap_width sets the length of the preceding sequence used at each expansion step, --expand_time sets how many times to expand, and --gen_num specifies the number of generated samples.
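
As a rough sketch of how these flags combine (the entry point inference.py, the checkpoint path, and the concrete values below are placeholders rather than the repository's defaults), an expanded-generation run might look like:

# hypothetical example: expand generation horizontally, two expansion steps
python inference.py \
  --ckpt ./checkpoints/AR-XL_2-10.pt \
  --gen_num 8 \
  --h_expand \
  --expand_time 2 \
  --overlap_width 8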

Train 🌟

If you want to train ELM-L with the 2-10 vocabulary on a single GPU node with 8 GPUs, just run

bash train.sh

You need to specify the ImageNet dataset path with --data-path. You can change the model size through --model (L, XL, XXL, or 2B), the modeling method through --modeling (ar or mlm), the number of sub-codes through --token-each (1, 2, 3, ...), and the dimension of each code through --code-dim. Remember that the codebook_size should equal token-each * code-dim. Setting --hm-dist larger than 1 uses soft labels based on the Hamming distance; however, we found it to bring little benefit, so we have not used or discussed it in our paper. You are free to give it a try!
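
For illustration, the options above might combine roughly as follows (the torchrun launcher and the script name train.py are assumptions; see train.sh for the actual command):

# hypothetical launch of ELM-L with the 2-10 vocabulary on a single 8-GPU node
torchrun --nnodes=1 --nproc_per_node=8 train.py \
  --data-path /path/to/imagenet/train \
  --model L \
  --modeling ar \
  --token-each 2 \
  --code-dim 10

Here the codebook_size is token-each * code-dim = 2 * 10 = 20, matching the 2-10 vocabulary.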

We train the L/XL-sized models on 8 A800 GPUs and the XXL/2B-sized models on 32 A800 GPUs across 4 nodes.

Additional Results 🌟

FID without cfg

For each model size, we report the 50k-sample FID without cfg, using its most suitable tokenizer, computed with pytorch_fid.

| Model | FID |
| --- | --- |
| L, 2-10 | 15.95 |
| XL, 2-10 | 12.70 |
| XXL, 2-12 | 10.11 |
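
For reference, pytorch_fid can be run directly on two image folders; the paths below are placeholders:

# compute FID between the generated samples and the ImageNet reference images
python -m pytorch_fid /path/to/generated_samples /path/to/imagenet_reference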

Training loss curve

The training loss of token-prediction-based image generation does not converge to a small value, yet the models still achieve strong image generation capability. The rationale behind this is discussed in our paper. Below we show the training loss curves of models of different sizes with the same tokenizer, which also illustrate the scaling law.

(Figure: training loss curves of models of different sizes with the same tokenizer)

Note that we do not compare the training loss trends of models with different tokenizers (such as L with 1-16, 2-8, 2-10, ...): because different tokenizers have different vocabulary sizes, the losses are on different scales and cannot be compared directly.