OPEN-MAGVIT2: An Open-source Project Toward Democratizing Auto-Regressive Visual Generation

<p align="center"> <img src="./assets/abstract_fig.png" width=95%> </p> <div align="center">

arXiv

</div> <div align="center">

OPEN-MAGVIT2: An Open-source Project Toward Democratizing Auto-Regressive Visual Generation<br> Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan <br>ARC Lab Tencent PCG, Tsinghua University, Nanjing University<br>

</div>

We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The Open-MAGVIT2 project provides an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., $2^{18}$ codes), and achieves state-of-the-art reconstruction performance (1.17 rFID) on ImageNet $256 \times 256$. Furthermore, we explore its application in plain auto-regressive models and validate its scalability properties. To help auto-regressive models predict over a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes via asymmetric token factorization, and further introduce "next sub-token prediction" to enhance sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation. :sparkling_heart:
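The asymmetric token factorization mentioned above can be sketched as a simple mixed-radix split: a single token id drawn from the full $2^{18}$ vocabulary is decomposed into two sub-token ids from sub-vocabularies of different sizes, and the model predicts them one after the other ("next sub-token prediction"). The split sizes below ($2^6$ and $2^{12}$) are illustrative assumptions, not the exact configuration from the paper:

```python
# Hypothetical sketch of asymmetric token factorization.
# A token id in [0, 2**18) is split into two sub-tokens from
# sub-vocabularies of different sizes; the sizes here are illustrative.

V1 = 2 ** 6    # size of the first (smaller) sub-vocabulary
V2 = 2 ** 12   # size of the second (larger) sub-vocabulary

def factorize(token_id: int) -> tuple[int, int]:
    """Map a full-vocabulary id to a pair of sub-token ids."""
    assert 0 <= token_id < V1 * V2
    return token_id // V2, token_id % V2

def defactorize(sub1: int, sub2: int) -> int:
    """Invert the factorization back to the full-vocabulary id."""
    return sub1 * V2 + sub2

tid = 123456
s1, s2 = factorize(tid)
roundtrip = defactorize(s1, s2)  # equals tid
```

Because the two sub-vocabularies are far smaller than the full $2^{18}$ codebook, each prediction head only needs a softmax over at most $2^{12}$ entries in this sketch.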

📰 News

🎤 TODOs

🤗 Open-MAGVIT2 is still at an early stage and under active development. Stay tuned for updates!

📖 Implementations

Note that all of our experiments are trained on Ascend 910B NPUs. We have also tested our models on V100 GPUs, and the performance gap is narrow.

Figure 1. The framework of Open-MAGVIT2.

<p align="center"> <img src="./assets/framework.png"> </p>
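At the heart of the MAGVIT-v2-style tokenizer replicated here is lookup-free quantization (LFQ): instead of a nearest-neighbor search over an explicit codebook, each latent channel is quantized to its sign, and the resulting bit pattern directly indexes an implicit codebook of $2^{18}$ codes. The sketch below illustrates the idea; variable names are our own, not taken from the released code:

```python
import numpy as np

# Minimal sketch of lookup-free quantization (LFQ): each of the 18 latent
# channels is quantized to {-1, +1} by sign, and the sign pattern is read
# as a binary index into an implicit codebook of 2**18 = 262144 codes.

NUM_BITS = 18  # log2 of the codebook size

def lfq_quantize(z: np.ndarray) -> np.ndarray:
    """Quantize each latent channel to {-1, +1} by sign."""
    return np.where(z >= 0, 1.0, -1.0)

def lfq_code_index(q: np.ndarray) -> int:
    """Interpret the quantized sign pattern as a binary code index."""
    bits = (q > 0).astype(np.int64)
    return int((bits * (2 ** np.arange(NUM_BITS))).sum())

z = np.random.randn(NUM_BITS)          # a single latent vector
q = lfq_quantize(z)                    # its binarized code
idx = lfq_code_index(q)                # integer id in [0, 2**18)
```

Because the codebook is implicit, every one of the $2^{18}$ codes is reachable by construction, which is consistent with the 100% codebook utilization reported in the table below.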

šŸ› ļø Installation

GPU

NPU

Datasets

We use ImageNet 2012 as our dataset, organized as follows:

imagenet
├── train/
│   ├── n01440764/
│   │   ├── n01440764_10026.JPEG
│   │   ├── n01440764_10027.JPEG
│   │   └── ...
│   ├── n01443537/
│   └── ...
└── val/
    └── ...
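A quick way to catch layout mistakes before launching training is a small sanity check over the expected directory structure (a `train/` and a `val/` split, each containing one sub-directory per synset class). This helper is our own illustration, not part of the repo:

```python
import os
import tempfile

def check_imagenet_layout(root: str) -> bool:
    """Return True if root has train/ and val/ splits with class sub-directories."""
    for split in ("train", "val"):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            return False
        # Each split must contain at least one class directory (e.g. n01440764).
        if not any(entry.is_dir() for entry in os.scandir(split_dir)):
            return False
    return True

# Demo on a throwaway directory mimicking the tree above
# (the synset name is taken from the example; the root path is temporary).
demo_root = tempfile.mkdtemp()
for split in ("train", "val"):
    os.makedirs(os.path.join(demo_root, split, "n01440764"))
ok = check_imagenet_layout(demo_root)
```

This mirrors the class-per-subdirectory convention that `ImageFolder`-style dataset loaders expect.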

Stage I: Training of Visual Tokenizer

<!-- * `Stage I Tokenizer Training`: -->

🚀 Training Scripts

bash scripts/train_tokenizer/run_128_L.sh MASTER_ADDR MASTER_PORT NODE_RANK
bash scripts/train_tokenizer/run_256_L.sh MASTER_ADDR MASTER_PORT NODE_RANK

🚀 Evaluation Scripts

bash scripts/evaluation/evaluation_128.sh
bash scripts/evaluation/evaluation_256.sh

šŸŗ Performance and Models

Tokenizer

| Method | Token Type | #Tokens | Train Data | Codebook Size | rFID | PSNR | Codebook Utilization | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Open-MAGVIT2-20240617 | 2D | 16 $\times$ 16 | 256 $\times$ 256 ImageNet | 262144 | 1.53 | 21.53 | 100% | - |
| Open-MAGVIT2-20240617 | 2D | 16 $\times$ 16 | 128 $\times$ 128 ImageNet | 262144 | 1.56 | 24.45 | 100% | - |
| Open-MAGVIT2 | 2D | 16 $\times$ 16 | 256 $\times$ 256 ImageNet | 262144 | 1.17 | 21.90 | 100% | IN256_Large |
| Open-MAGVIT2 | 2D | 16 $\times$ 16 | 128 $\times$ 128 ImageNet | 262144 | 1.18 | 25.08 | 100% | IN128_Large |
| Open-MAGVIT2* | 2D | 32 $\times$ 32 | 128 $\times$ 128 ImageNet | 262144 | 0.34 | 26.19 | 100% | above |

(*) denotes results obtained by direct inference with the model trained at $128 \times 128$ resolution, without fine-tuning.

Stage II: Training of Auto-Regressive Models

🚀 Training Scripts

Please see scripts/train_autogressive/run.sh for the different model configurations.

bash scripts/train_autogressive/run.sh MASTER_ADDR MASTER_PORT NODE_RANK

🚀 Sample Scripts

Please see scripts/train_autogressive/run.sh for the sampling hyper-parameters of each model scale.

bash scripts/evaluation/sample_npu.sh Your_Total_Rank   # on NPU
bash scripts/evaluation/sample_gpu.sh Your_Total_Rank   # on GPU

šŸŗ Performance and Models

| Method | Params | #Tokens | FID | IS | Checkpoint |
| --- | --- | --- | --- | --- | --- |
| Open-MAGVIT2 | 343M | 16 $\times$ 16 | 3.08 | 258.26 | AR_256_B |
| Open-MAGVIT2 | 804M | 16 $\times$ 16 | 2.51 | 271.70 | AR_256_L |
| Open-MAGVIT2 | 1.5B | 16 $\times$ 16 | 2.33 | 271.77 | AR_256_XL |

ā¤ļø Acknowledgement

We thank Lijun Yu for his encouraging discussions. Our implementation draws heavily on VQGAN and MAGVIT, and we also refer to LlamaGen, VAR, and RQVAE. Thanks for their wonderful work.

āœļø Citation

If you find the codebase and our work helpful, please cite it and give us a star :star:.

@article{luo2024open,
  title={Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation},
  author={Luo, Zhuoyan and Shi, Fengyuan and Ge, Yixiao and Yang, Yujiu and Wang, Limin and Shan, Ying},
  journal={arXiv preprint arXiv:2409.04410},
  year={2024}
}

@inproceedings{yu2024language,
  title={Language Model Beats Diffusion - Tokenizer is key to visual generation},
  author={Lijun Yu and Jose Lezama and Nitesh Bharadwaj Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Agrim Gupta and Xiuye Gu and Alexander G Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A Ross and Lu Jiang},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=gzqrANCF4g}
}