FiT: Flexible Vision Transformer for Diffusion Model

<p align="center"> 📃 <a href="https://arxiv.org/pdf/2402.12376.pdf" target="_blank">Paper</a> • 📦 <a href="https://huggingface.co/whlzy/FiT-XL-2-16" target="_blank">Checkpoint</a> <br> </p>

This repo contains PyTorch model definitions, pre-trained weights, and sampling code for our Flexible Vision Transformer (FiT). FiT is a diffusion-transformer-based model that can generate images at unrestricted resolutions and aspect ratios.

The core features will include:

Why do we need FiT?

Stay tuned for this project! 😆

Acknowledgments

This codebase borrows from <a href="https://github.com/facebookresearch/DiT/tree/main" target="_blank">DiT</a>.

BibTeX

@article{Lu2024FiT,
  title={FiT: Flexible Vision Transformer for Diffusion Model},
  author={Zeyu Lu and Zidong Wang and Di Huang and Chengyue Wu and Xihui Liu and Wanli Ouyang and Lei Bai},
  year={2024},
  journal={arXiv preprint arXiv:2402.12376},
}