<div align="center">VAR-CLIP:<br> Text-to-Image Generator with Visual Auto-Regressive Modeling</div>
<p align="center"> <img src="img/main.png" width=95%> </p>
<p align="center">VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling<br> Qian Zhang, Xiangzi Dai, Ninghua Yang, Xiang An, Ziyong Feng, Xingyu Ren<br> Institute of Applied Physics and Computational Mathematics, DeepGlint, Shanghai Jiao Tong University</p>
Some examples of text-conditional generation:
<img src="img/show_res.png" width="800px"/>
Some examples of class-conditional generation:
<img src="img/concatenated_image.jpg" width="800px"/>
TODO
- Released pre-trained model.
- Released training code.
- Released arXiv paper.
- Completed T2I training on the ImageNet dataset.
- Completed training on the ImageNet dataset.
Getting Started
Requirements
pip install -r requirements.txt
Download Pretrained Models/Dataset
<span style="font-size:15px;"> 1. Place the downloaded ImageNet train/val splits under train/ and val/ in the directory ./imagenet/
</span>
2. Download the CLIP/VAE pretrained models and place them in pretrained/
Download CLIP-L14<br> Download VAE<br>
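The expected layout after steps 1 and 2 can be prepared as follows. Only the directory names (./imagenet/train, ./imagenet/val, pretrained/) come from the steps above; the downloaded ImageNet splits and the CLIP/VAE checkpoints must then be copied into them manually.

```shell
# Create the directories the setup steps above refer to.
mkdir -p ./imagenet/train ./imagenet/val   # ImageNet train/val splits go here
mkdir -p ./pretrained                      # CLIP-L14 and VAE checkpoints go here

# Verify the layout exists before launching training.
ls -d ./imagenet/train ./imagenet/val ./pretrained
```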
Training Scripts
# training VAR-CLIP-d16 for 1000 epochs on ImageNet 256x256 costs 4.1 days on 64 A100s
# Before running, configure the IP addresses of the machines and the data_path in run.py
python run.py
Demo Scripts
# After training completes, run demo_sample.ipynb to get text-conditional generation results.
demo_sample.ipynb
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citations
@misc{zhang2024varclip,
  title={VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling},
  author={Qian Zhang and Xiangzi Dai and Ninghua Yang and Xiang An and Ziyong Feng and Xingyu Ren},
  year={2024},
  journal={arXiv:2408.01181},
}