<div align="center"> <h1>SEED-Voken: A Series of Powerful Visual Tokenizers</h1> </div>

The project aims to provide advanced visual tokenizers for autoregressive visual generation and currently supports the following methods: <br><br>
<a href="https://arxiv.org/abs/2409.04410">Open-MAGVIT2: An Open-source Project Toward Democratizing Auto-Regressive Visual Generation</a><br> Zhuoyan Luo*, Fengyuan Shi*, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan<br> ARC Lab Tencent PCG, Tsinghua University, Nanjing University<br> <a href="./docs/Open-MAGVIT2.md">Open-MAGVIT2.md</a>
```bibtex
@article{luo2024open,
  title={Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation},
  author={Luo, Zhuoyan and Shi, Fengyuan and Ge, Yixiao and Yang, Yujiu and Wang, Limin and Shan, Ying},
  journal={arXiv preprint arXiv:2409.04410},
  year={2024}
}
```
<p align="center"> <img src="./assets/comparsion.png" width=90%> </p>

<a href="https://arxiv.org/abs/2412.02692">IBQ: Taming Scalable Visual Tokenizer for Autoregressive Image Generation</a><br> Fengyuan Shi*, Zhuoyan Luo*, Yixiao Ge, Yujiu Yang, Ying Shan, Limin Wang<br> Nanjing University, Tsinghua University, ARC Lab Tencent PCG<br> <a href="./docs/IBQ.md">IBQ.md</a>
```bibtex
@article{shi2024taming,
  title={Taming Scalable Visual Tokenizer for Autoregressive Image Generation},
  author={Shi, Fengyuan and Luo, Zhuoyan and Ge, Yixiao and Yang, Yujiu and Shan, Ying and Wang, Limin},
  journal={arXiv preprint arXiv:2412.02692},
  year={2024}
}
```
## News
- [2024.11.26] :fire::fire::fire: We are excited to release IBQ, a series of scalable visual tokenizers that achieve a large-scale codebook (2^18 codes) with high dimension (256) and high utilization (see the illustrative sketch after this list).
- [2024.09.09] We release an improved version of the Open-MAGVIT2 tokenizer and a family of auto-regressive models ranging from 300M to 1.5B parameters.
- [2024.06.17] We release the training code of the Open-MAGVIT2 tokenizer and checkpoints for different resolutions, achieving state-of-the-art performance (`0.39 rFID` for 8x downsampling) compared to VQGAN, MaskGIT, and the recent TiTok, LlamaGen, and OmniTokenizer.
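As a concrete illustration of that scale, here is a minimal sketch (not the IBQ implementation; the `quantize` helper and all tensor shapes are our own illustrative choices) of a plain nearest-neighbor codebook lookup with 2^18 codes of dimension 256, plus a simple per-batch utilization measurement:

```python
# Minimal sketch, NOT the IBQ implementation: a plain nearest-neighbor
# vector-quantization lookup at the scale the paper reports
# (2^18 = 262,144 codes, each 256-dimensional), plus a simple
# per-batch codebook-utilization measurement.
import torch

codebook = torch.randn(2**18, 256)  # hypothetical randomly initialized codebook

def quantize(feats: torch.Tensor) -> torch.Tensor:
    """Map each 256-dim feature to the index of its nearest code."""
    dists = torch.cdist(feats, codebook)  # (N, 2^18) pairwise distances
    return dists.argmin(dim=-1)

feats = torch.randn(256, 256)  # stand-in for a batch of encoder features
indices = quantize(feats)
utilization = indices.unique().numel() / codebook.shape[0]
print(f"codes used in this batch: {indices.unique().numel()} "
      f"({utilization:.4%} of the codebook)")
```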
## Implementations
Our codebase supports both NPU and GPU for training and inference. All experiments were conducted using the Ascend 910B for training, and we validated our models on the V100. The observed performance between the two platforms is nearly identical.
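For example, a minimal device-selection snippet (illustrative only, not taken from the repo) that prefers an Ascend NPU when `torch-npu` is installed and otherwise falls back to CUDA or CPU might look like:

```python
# Illustrative sketch: prefer an Ascend NPU when torch-npu is installed,
# otherwise fall back to CUDA, then CPU.
import torch

try:
    import torch_npu  # noqa: F401 -- registers the "npu" device with PyTorch
    has_npu = torch.npu.is_available()
except ImportError:
    has_npu = False

if has_npu:
    device = torch.device("npu:0")
elif torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")

print(f"running on {device}")
```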
## Installation
### GPU
- Env: We have tested on `Python 3.8.8` and `CUDA 11.8` (other versions may also be fine).
- Dependencies: `pip install -r requirements.txt`
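A quick sanity check (our own suggestion, not a repo script) to confirm the CUDA build before training:

```python
# Illustrative check that PyTorch sees the GPU and was built against CUDA.
import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)       # e.g. "11.8"
print("GPU available:", torch.cuda.is_available())
```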
### NPU
- Env: `Python 3.9.16` and `CANN 8.0.T13`
- Main Dependencies: `torch=2.1.0+cpu` + `torch-npu=2.1.0.post3-20240523` + `Lightning`
- Other Dependencies: see `requirements.txt`
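Similarly, a quick check (illustrative, not a repo script) that `torch-npu` is wired up:

```python
# Illustrative check that torch-npu imports and can see an Ascend device.
import torch
import torch_npu  # noqa: F401 -- registers the "npu" device with PyTorch

print("torch:", torch.__version__)
print("NPU available:", torch.npu.is_available())
print("NPU count:", torch.npu.device_count())
```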
## Datasets
We use ImageNet-2012 as our dataset, organized as follows:
```
imagenet
├── train/
│   ├── n01440764
│   │   ├── n01440764_10026.JPEG
│   │   ├── n01440764_10027.JPEG
│   │   └── ...
│   ├── n01443537
│   └── ...
└── val/
    └── ...
```
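This layout matches what `torchvision.datasets.ImageFolder` expects, so a loader could be built as in the sketch below (illustrative only; the transform and batch size are placeholder choices, not the repo's training configuration):

```python
# Illustrative loader for the layout above; not the repo's data pipeline.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),  # placeholder resolution
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("imagenet/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=8, shuffle=True, num_workers=4)
```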
## Training & Evaluation
The training and evaluation scripts are in <a href="docs/Open-MAGVIT2.md">Open-MAGVIT2.md</a> and <a href="docs/IBQ.md">IBQ.md</a>.
## Acknowledgement
We thank Lijun Yu for his encouraging discussions. Our work draws heavily on VQGAN and MAGVIT, and we also refer to LlamaGen, VAR, and RQVAE. Thanks for their wonderful work.