Awesome

VAR: a new visual generation method elevates GPT-style models beyond diffusion🚀 & Scaling laws observed📈

</div> <p align="center" style="font-size: larger;"> <a href="https://arxiv.org/abs/2404.02905">Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction</a> </p> <div> <p align="center" style="font-size: larger;"> <strong>NeurIPS 2024 Best Paper</strong> </p> </div> <p align="center"> <img src="https://github.com/FoundationVision/VAR/assets/39692511/9850df90-20b1-4f29-8592-e3526d16d755" width=95%> <p> <br>

News

2024-12: VAR received NeurIPS 2024 Best Paper Award.
2024-12: We Release our Text-to-Image research based on VAR, please check Infinity.
2024-09: VAR is accepted as NeurIPS 2024 Oral Presentation.
2024-04: Visual AutoRegressive modeling is released.

🕹️ Try and Play with VAR!

We provide a demo website for you to play with VAR models and generate images interactively. Enjoy the fun of visual autoregressive modeling!

We also provide demo_sample.ipynb for you to see more technical details about VAR.

What's New?

🔥 Introducing VAR: a new paradigm in autoregressive visual generation✨:

Visual Autoregressive Modeling (VAR) redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction".

🔥 For the first time, GPT-style autoregressive models surpass diffusion models🚀:

🔥 Discovering power-law Scaling Laws in VAR transformers📈:

🔥 Zero-shot generalizability🛠️:

For a deep dive into our analyses, discussions, and evaluations, check out our paper.

VAR zoo

We provide VAR models for you to play with, which are on <a href='https://huggingface.co/FoundationVision/var'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Huggingface-FoundationVision/var-yellow'></a> or can be downloaded from the following links:

model	reso.	FID	rel. cost	#params	HF weights🤗
VAR-d16	256	3.55	0.4	310M	var_d16.pth
VAR-d20	256	2.95	0.5	600M	var_d20.pth
VAR-d24	256	2.33	0.6	1.0B	var_d24.pth
VAR-d30	256	1.97	1	2.0B	var_d30.pth
VAR-d30-re	256	1.80	1	2.0B	var_d30.pth
VAR-d36	512	2.63	-	2.3B	var_d36.pth

You can load these models to generate images via the codes in demo_sample.ipynb. Note: you need to download vae_ch160v4096z32.pth first.

Installation

Install torch>=2.0.0.
Install other pip packages via pip3 install -r requirements.txt.

Prepare the ImageNet dataset

<details> <summary> assume the ImageNet is in `/path/to/imagenet`. It should be like this:</summary>

/path/to/imagenet/:
    train/:
        n01440764: 
            many_images.JPEG ...
        n01443537:
            many_images.JPEG ...
    val/:
        n01440764:
            ILSVRC2012_val_00000293.JPEG ...
        n01443537:
            ILSVRC2012_val_00000236.JPEG ...

NOTE: The arg --data_path=/path/to/imagenet should be passed to the training script.

</details>

(Optional) install and compile flash-attn and xformers for faster attention computation. Our code will automatically use them if installed. See models/basic_var.py#L15-L30.

Training Scripts

To train VAR-{d16, d20, d24, d30, d36-s} on ImageNet 256x256 or 512x512, you can run the following command:

# d16, 256x256
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=16 --bs=768 --ep=200 --fp16=1 --alng=1e-3 --wpe=0.1
# d20, 256x256
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=20 --bs=768 --ep=250 --fp16=1 --alng=1e-3 --wpe=0.1
# d24, 256x256
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=24 --bs=768 --ep=350 --tblr=8e-5 --fp16=1 --alng=1e-4 --wpe=0.01
# d30, 256x256
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=30 --bs=1024 --ep=350 --tblr=8e-5 --fp16=1 --alng=1e-5 --wpe=0.01 --twde=0.08
# d36-s, 512x512 (-s means saln=1, shared AdaLN)
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=36 --saln=1 --pn=512 --bs=768 --ep=350 --tblr=8e-5 --fp16=1 --alng=5e-6 --wpe=0.01 --twde=0.08

A folder named local_output will be created to save the checkpoints and logs. You can monitor the training process by checking the logs in local_output/log.txt and local_output/stdout.txt, or using tensorboard --logdir=local_output/.

If your experiment is interrupted, just rerun the command, and the training will automatically resume from the last checkpoint in local_output/ckpt*.pth (see utils/misc.py#L344-L357).

Sampling & Zero-shot Inference

For FID evaluation, use var.autoregressive_infer_cfg(..., cfg=1.5, top_p=0.96, top_k=900, more_smooth=False) to sample 50,000 images (50 per class) and save them as PNG (not JPEG) files in a folder. Pack them into a .npz file via create_npz_from_sample_folder(sample_folder) in utils/misc.py#L344. Then use the OpenAI's FID evaluation toolkit and reference ground truth npz file of 256x256 or 512x512 to evaluate FID, IS, precision, and recall.

Note a relatively small cfg=1.5 is used for trade-off between image quality and diversity. You can adjust it to cfg=5.0, or sample with autoregressive_infer_cfg(..., more_smooth=True) for better visual quality. We'll provide the sampling script later.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@Article{VAR,
      title={Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction}, 
      author={Keyu Tian and Yi Jiang and Zehuan Yuan and Bingyue Peng and Liwei Wang},
      year={2024},
      eprint={2404.02905},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}