<h1> CAR<img src="./docs/car.png" width="4%">: Controllable AutoRegressive Modeling for Visual Generation </h1>

Ziyu Yao<sup>1,2</sup>, Jialin Li<sup>2</sup>, Yifeng Zhou<sup>2</sup>, Yong Liu<sup>2</sup>, Xi Jiang<sup>2,3</sup>, Chengjie Wang<sup>2</sup>, Feng Zheng<sup>3</sup>, Yuexian Zou<sup>1</sup>, Lei Li<sup>4</sup>

<sup>1</sup> Peking University, <sup>2</sup> Tencent Youtu Lab, <sup>3</sup> Southern University of Science and Technology, <sup>4</sup> University of Washington
<div align="center"> </div> <div align="center"> <img src="./docs/teaser.png" width="80%"> </div>

CAR Models
We have released the CAR-d16 weights for demo purposes. Larger models will be made available with future upgrades and extensions of CAR.
The CAR models are available on <a href='https://huggingface.co/MiracleDance/CAR'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Huggingface-MiracleDance/CAR-yellow'></a> and can also be downloaded from the following links:
Model | Resolution | Condition | HF weights 🤗 |
---|---|---|---|
CAR-d16 | 256 | Canny Edge | car_canny_d16.pth |
CAR-d16 | 256 | HED Map | car_hed_d16.pth |
CAR-d16 | 256 | Depth Map | car_depth_d16.pth |
CAR-d16 | 256 | Normal Map | car_normal_d16.pth |
CAR-d16 | 256 | Sketch | car_sketch_d16.pth |
As CAR is based on the pre-trained VAR model, the following pre-trained weights also need to be downloaded: vae_ch160v4096z32.pth, var_d16.pth.
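
As a convenience, the download step can be scripted. The sketch below is one possible way to do it with `huggingface_hub`, not an official tool of this repository. It assumes that the CAR files sit at the top level of the `MiracleDance/CAR` repo under the names in the table above, and that the VAR weights are hosted in the `FoundationVision/var` repo. The helper names and the `checkpoints` directory are hypothetical.

```python
from pathlib import Path

# Assumption: CAR files live at the top level of this Hugging Face repo.
CAR_REPO = "MiracleDance/CAR"
CAR_FILES = {
    "canny": "car_canny_d16.pth",
    "hed": "car_hed_d16.pth",
    "depth": "car_depth_d16.pth",
    "normal": "car_normal_d16.pth",
    "sketch": "car_sketch_d16.pth",
}

# Assumption: the pre-trained VAR weights are hosted in this repo.
VAR_REPO = "FoundationVision/var"
VAR_FILES = ["vae_ch160v4096z32.pth", "var_d16.pth"]


def files_to_download(condition):
    """Return (repo_id, filename) pairs needed to run CAR with one condition."""
    if condition not in CAR_FILES:
        raise ValueError(
            f"unknown condition {condition!r}; expected one of {sorted(CAR_FILES)}"
        )
    return [(CAR_REPO, CAR_FILES[condition])] + [(VAR_REPO, name) for name in VAR_FILES]


def download(condition, local_dir="checkpoints"):
    """Download the files with huggingface_hub and return their local paths."""
    # Imported lazily so files_to_download() works without the package installed.
    from huggingface_hub import hf_hub_download

    Path(local_dir).mkdir(parents=True, exist_ok=True)
    return [
        hf_hub_download(repo_id=repo, filename=name, local_dir=local_dir)
        for repo, name in files_to_download(condition)
    ]
```

Calling `download("canny")` would then fetch the Canny checkpoint together with the two VAR files, provided the repo layout matches the assumptions above.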
Training
1. Prepare Dataset
The `--data_path` argument should point to the ImageNet dataset.
2. Extract conditions from ImageNet dataset
You can extract conditions from all 1000 ImageNet categories or from a selected subset of them. Run the following commands:
# canny
python extract_canny.py
# hed
python extract_hed.py
# depth
python extract_depth.py
# normal
python extract_normal.py
# sketch
python extract_sketch.py
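
If only some conditions are needed, the scripts above can be driven from a small Python wrapper. This is a minimal sketch, assuming it is run from the repository root and that the scripts take no command-line flags (none are shown above). The function names `extraction_commands` and `run_extraction` are hypothetical.

```python
import subprocess
import sys

# Script names are taken from the commands listed above.
EXTRACT_SCRIPTS = {
    "canny": "extract_canny.py",
    "hed": "extract_hed.py",
    "depth": "extract_depth.py",
    "normal": "extract_normal.py",
    "sketch": "extract_sketch.py",
}


def extraction_commands(conditions):
    """Build one command per requested condition, preserving the given order."""
    unknown = [c for c in conditions if c not in EXTRACT_SCRIPTS]
    if unknown:
        raise ValueError(f"unknown condition(s): {unknown}")
    # sys.executable keeps the scripts on the same interpreter as this wrapper.
    return [[sys.executable, EXTRACT_SCRIPTS[c]] for c in conditions]


def run_extraction(conditions):
    """Run the scripts one after another, stopping at the first failure."""
    for cmd in extraction_commands(conditions):
        subprocess.run(cmd, check=True)
```

For example, `run_extraction(["canny", "hed"])` would run only the Canny and HED extraction scripts.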
3. Train CAR model
# d16, 256x256
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
--data_path=/path/to/imagenet --condition_path=/path/to/condition/extract/above \
--vae_ckpt=/path/to/pretrained/vae/ckpt --pretrained_var_ckpt=/path/to/pretrained/var/ckpt \
--tblr=0.0001 --depth=16 --bs=768 --ep=200 --fp16=1 --alng=1e-3 --wpe=0.1
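
The flags `--tblr`, `--bs`, and `--wpe` interact, which matters when the batch size is changed to fit different hardware. The sketch below assumes that CAR inherits the conventions of the VAR training code it builds on: `--bs` is the global batch size, the peak learning rate is `tblr * bs / 256`, and `--wpe` is the ratio of the final learning rate to the peak. These are assumptions about `train.py`, not statements taken from this README, so check the argument definitions before relying on them. All function names are hypothetical.

```python
def peak_lr(tblr, global_batch_size, grad_accum=1):
    """Peak learning rate under the assumed linear scaling rule (base batch 256)."""
    return grad_accum * tblr * global_batch_size / 256


def final_lr(tblr, global_batch_size, wpe, grad_accum=1):
    """Learning rate at the end of training, assuming wpe is a ratio of the peak."""
    return wpe * peak_lr(tblr, global_batch_size, grad_accum)


def per_gpu_batch(global_batch_size, gpus_per_node, num_nodes):
    """Split the global batch evenly across all GPUs (simplified, no accumulation)."""
    world_size = gpus_per_node * num_nodes
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by the number of GPUs")
    return global_batch_size // world_size


# Values from the d16 command above; the single node is a hypothetical setup.
print(peak_lr(1e-4, 768))            # roughly 3e-4
print(final_lr(1e-4, 768, wpe=0.1))  # roughly 3e-5
print(per_gpu_batch(768, gpus_per_node=8, num_nodes=1))
```

Under these assumptions, halving `--bs` also halves the peak learning rate unless `--tblr` is adjusted to compensate.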
Inference
# cls is an index ranging from 0 to 999 in the ImageNet label set
# type indicates which condition is extracted from the original image (canny, hed, depth, normal, sketch)
python inference.py --vae_ckpt=/path/to/pretrained/vae/ckpt --var_ckpt=/path/to/pretrained/var/ckpt \
--car_ckpt=/path/to/car/ckpt --img_path=/path/to/original/image/to/extract/condition \
--save_path=/path/to/save/image --cls=3 --type=hed
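
When generating many images, it can help to validate `--cls` and `--type` before launching the script. The following sketch builds the command shown above from Python and rejects values outside the ranges stated in the comments. The function name `build_inference_command` is hypothetical, and the flags are copied from the command above without additions.

```python
import sys

# Condition types listed in the comment above.
CONDITION_TYPES = ("canny", "hed", "depth", "normal", "sketch")


def build_inference_command(vae_ckpt, var_ckpt, car_ckpt, img_path, save_path,
                            cls, cond_type):
    """Return the inference.py command as an argument list after basic checks."""
    if not isinstance(cls, int) or not 0 <= cls <= 999:
        raise ValueError(f"cls must be an integer in [0, 999], got {cls!r}")
    if cond_type not in CONDITION_TYPES:
        raise ValueError(
            f"type must be one of {CONDITION_TYPES}, got {cond_type!r}"
        )
    return [
        sys.executable, "inference.py",
        f"--vae_ckpt={vae_ckpt}",
        f"--var_ckpt={var_ckpt}",
        f"--car_ckpt={car_ckpt}",
        f"--img_path={img_path}",
        f"--save_path={save_path}",
        f"--cls={cls}",
        f"--type={cond_type}",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Passing a list avoids shell quoting problems with paths that contain spaces.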
Acknowledgments
The development of CAR is based on VAR. We deeply appreciate this significant contribution to the community.
Citation
If you find our work helpful in your research, we would be grateful if you could consider giving us a star ⭐ or citing it using:
@article{yao2024car,
title={Car: Controllable autoregressive modeling for visual generation},
author={Yao, Ziyu and Li, Jialin and Zhou, Yifeng and Liu, Yong and Jiang, Xi and Wang, Chengjie and Zheng, Feng and Zou, Yuexian and Li, Lei},
journal={arXiv preprint arXiv:2410.04671},
year={2024}
}