Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation

<div align="center">

demo  arXiv  project page 

</div> <p align="center"> <img src="assets/teaser.jpg" width=95%> <p>

This repo contains pre-trained model weights and the training/sampling PyTorch (torch>=2.1.0) code used in

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation<br> Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan <br>HKU, ByteDance<br>

You can find more visualizations on the project page.

🔥 Update

🌿 Introduction

We introduce LlamaGen, a new family of image generation models that apply the original next-token prediction paradigm of large language models to the visual generation domain. It is an affirmative answer to the question of whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals, can achieve state-of-the-art image generation performance when scaled properly. We reexamine the design space of image tokenizers, the scalability properties of image generation models, and the quality of their training data.
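As a conceptual sketch of this paradigm (not the repo's actual API), generation flattens the tokenizer's 2D code grid into a 1D sequence and predicts one discrete code at a time; `next_logits` below is a hypothetical stand-in for the autoregressive transformer:

```python
def generate_image_tokens(next_logits, grid=16, vocab_size=16384):
    """Greedily generate a flattened grid x grid sequence of codebook indices.

    next_logits(prefix) stands in for the autoregressive transformer: it maps
    the tokens generated so far to scores over the codebook. A real pipeline
    would then feed the finished sequence to the VQ decoder to get pixels.
    """
    tokens = []
    for _ in range(grid * grid):
        logits = next_logits(tokens)
        tokens.append(max(range(vocab_size), key=logits.__getitem__))
    return tokens
```

In practice greedy decoding is replaced by sampling, but the sequential structure is the same as in language modeling.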

In this repo, we release:

🦄 Class-conditional image generation on ImageNet

VQ-VAE models

| Method | params | tokens | rFID (256x256) | weight |
|---|---|---|---|---|
| vq_ds16_c2i | 72M | 16x16 | 2.19 | vq_ds16_c2i.pt |
| vq_ds16_c2i | 72M | 24x24 | 0.94 | above |
| vq_ds16_c2i | 72M | 32x32 | 0.70 | above |
| vq_ds8_c2i | 70M | 32x32 | 0.59 | vq_ds8_c2i.pt |
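These tokenizers differ mainly in downsampling factor (ds16 vs. ds8) and the resulting token grid; the quantization step at their core is a nearest-neighbor lookup in a learned codebook. A minimal illustration (the toy codebook below is made up, not the released weights):

```python
def vq_quantize(vec, codebook):
    """Index of the codebook entry nearest to vec under squared L2 distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist2(vec, codebook[i]))

# Toy 2-dimensional codebook with three entries.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
```

Applying this lookup to every spatial position of the encoder's feature map yields the discrete token grid that the AR models are trained on.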

AR models

| Method | params | training | tokens | FID (256x256) | weight |
|---|---|---|---|---|---|
| LlamaGen-B | 111M | DDP | 16x16 | 5.46 | c2i_B_256.pt |
| LlamaGen-B | 111M | DDP | 24x24 | 6.09 | c2i_B_384.pt |
| LlamaGen-L | 343M | DDP | 16x16 | 3.80 | c2i_L_256.pt |
| LlamaGen-L | 343M | DDP | 24x24 | 3.07 | c2i_L_384.pt |
| LlamaGen-XL | 775M | DDP | 24x24 | 2.62 | c2i_XL_384.pt |
| LlamaGen-XXL | 1.4B | FSDP | 24x24 | 2.34 | c2i_XXL_384.pt |
| LlamaGen-3B | 3.1B | FSDP | 24x24 | 2.18 | c2i_3B_384.pt |

Demo

Please download the models, put them in the folder ./pretrained_models, and run:

```shell
python3 autoregressive/sample/sample_c2i.py --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --gpt-ckpt ./pretrained_models/c2i_L_384.pt --gpt-model GPT-L --image-size 384
# or
python3 autoregressive/sample/sample_c2i.py --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --gpt-ckpt ./pretrained_models/c2i_XXL_384.pt --gpt-model GPT-XXL --from-fsdp --image-size 384
```

The generated images will be saved to sample_c2i.png.
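The quality/diversity trade-off in autoregressive sampling is usually controlled by a temperature and top-k truncation of the next-token distribution; whether and how this repo exposes such knobs is not shown here, but the sampling step itself can be sketched as:

```python
import math
import random

def sample_top_k(logits, k, temperature=1.0, rng=None):
    """Sample a token index from the k highest-scoring logits."""
    rng = rng or random.Random(0)
    # Keep only the k best candidates, then softmax over them.
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

Lower temperature or smaller k makes generation more deterministic; k equal to the codebook size recovers plain temperature sampling.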

Gradio Demo <a href='https://github.com/gradio-app/gradio'><img src='https://img.shields.io/github/stars/gradio-app/gradio'></a>

You can use our online Gradio demo on Hugging Face Spaces or run Gradio locally:

```shell
python app.py
```

🚀 Text-conditional image generation

VQ-VAE models

| Method | params | tokens | data | weight |
|---|---|---|---|---|
| vq_ds16_t2i | 72M | 16x16 | LAION COCO (50M) + internal data (10M) | vq_ds16_t2i.pt |

AR models

| Method | params | tokens | data | weight |
|---|---|---|---|---|
| LlamaGen-XL | 775M | 16x16 | LAION COCO (50M) | t2i_XL_stage1_256.pt |
| LlamaGen-XL | 775M | 32x32 | internal data (10M) | t2i_XL_stage2_512.pt |

Demo

Before running the demo, please refer to the language readme to install the required packages and language models.

Please download the models, put them in the folder ./pretrained_models, and run:

```shell
python3 autoregressive/sample/sample_t2i.py --vq-ckpt ./pretrained_models/vq_ds16_t2i.pt --gpt-ckpt ./pretrained_models/t2i_XL_stage1_256.pt --gpt-model GPT-XL --image-size 256
# or
python3 autoregressive/sample/sample_t2i.py --vq-ckpt ./pretrained_models/vq_ds16_t2i.pt --gpt-ckpt ./pretrained_models/t2i_XL_stage2_512.pt --gpt-model GPT-XL --image-size 512
```

The generated images will be saved to sample_t2i.png.
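Autoregressive text-to-image models of this kind commonly sample with classifier-free guidance: at each step, conditional and unconditional logits are blended with a guidance scale before sampling. A one-function sketch of that blend:

```python
def cfg_logits(cond, uncond, scale):
    """Blend conditional and unconditional logits with a guidance scale.

    scale = 1.0 returns the conditional logits unchanged; larger scales push
    the distribution further toward tokens favored by the conditioning.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```

Computing both logit sets requires two forward passes (or a doubled batch) per step, which is part of why serving throughput matters for these models.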

Local Gradio Demo

⚡ Serving

We use the serving framework vLLM to enable higher throughput. Please refer to the serving readme to install the required packages.

```shell
python3 autoregressive/serve/sample_c2i.py --vq-ckpt ./pretrained_models/vq_ds16_c2i.pt --gpt-ckpt ./pretrained_models/c2i_XXL_384.pt --gpt-model GPT-XXL --from-fsdp --image-size 384
```

The generated images will be saved to sample_c2i_vllm.png.

Getting Started

See Getting Started for installation, training and evaluation.

License

The majority of this project is licensed under the MIT License. Portions of the project are available under the separate licenses of the referenced projects, as detailed in the corresponding files.

BibTeX

```bibtex
@article{sun2024autoregressive,
  title={Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation},
  author={Sun, Peize and Jiang, Yi and Chen, Shoufa and Zhang, Shilong and Peng, Bingyue and Luo, Ping and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2406.06525},
  year={2024}
}
```