# DiffiT: Diffusion Vision Transformers for Image Generation

Official PyTorch implementation of **DiffiT: Diffusion Vision Transformers for Image Generation**.
Code and pretrained DiffiT models will be released soon!

DiffiT achieves a new SOTA FID score of **1.73** on the ImageNet-256 dataset!

In addition, DiffiT sets a new SOTA FID score of **2.22** on the FFHQ-64 dataset!

We introduce a new Time-dependent Multihead Self-Attention (TMSA) mechanism that jointly learns spatial and temporal dependencies and allows for attention conditioning with fine-grained control.
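Below is a minimal PyTorch sketch of the TMSA idea: queries, keys, and values are computed from both the spatial tokens and the time token, so the attention weights themselves become time-dependent. The module and argument names (`TMSASketch`, `dim`, `num_heads`) are illustrative assumptions rather than the official API, and the sketch omits details such as the relative position bias used in the paper.

```python
# Minimal sketch of Time-dependent Multihead Self-Attention (TMSA).
# Illustrative only; the official released implementation may differ.
import torch
import torch.nn as nn

class TMSASketch(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # q/k/v each depend on BOTH the spatial tokens x_s and the
        # time token x_t, which is what makes the attention time-dependent.
        self.qkv_spatial = nn.Linear(dim, 3 * dim, bias=False)
        self.qkv_temporal = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_s, x_t):
        # x_s: (B, N, C) spatial tokens; x_t: (B, C) time-embedding token
        B, N, C = x_s.shape
        # Broadcast the temporal contribution across all spatial positions.
        qkv = self.qkv_spatial(x_s) + self.qkv_temporal(x_t).unsqueeze(1)
        qkv = qkv.reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)  # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)  # paper also adds a relative position bias here
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Quick shape check: 64 tokens of width 128, batch of 2.
m = TMSASketch(dim=128, num_heads=8)
x_s = torch.randn(2, 64, 128)   # spatial tokens
x_t = torch.randn(2, 128)       # time-embedding token
print(m(x_s, x_t).shape)        # torch.Size([2, 64, 128])
```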
## 💥 News 💥

- **[07.01.2024]** 🔥🔥 DiffiT has been accepted to ECCV 2024!
- **[04.02.2024]** The updated manuscript is now available on arXiv!
- **[12.04.2023]** 🔥 The paper is published on arXiv!
## Benchmarks

### Latent Space

#### ImageNet-256

| Model | Dataset | Resolution | FID-50K | Inception Score |
|---|---|---|---|---|
| Latent DiffiT | ImageNet | 256x256 | 1.73 | 276.49 |

#### ImageNet-512

| Model | Dataset | Resolution | FID-50K | Inception Score |
|---|---|---|---|---|
| Latent DiffiT | ImageNet | 512x512 | 2.67 | 252.12 |

### Image Space

| Model | Dataset | Resolution | FID-50K |
|---|---|---|---|
| DiffiT | CIFAR-10 | 32x32 | 1.95 |
| DiffiT | FFHQ-64 | 64x64 | 2.22 |
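For reference, FID-50K is the Fréchet Inception Distance computed between 50,000 generated samples and statistics of the real dataset. The following sketch shows how such an evaluation is commonly run using `torchmetrics`; it is an illustrative assumption, not this repository's evaluation code, and `sample_batch` / `real_loader` are hypothetical stand-ins for the (not yet released) sampler and data loader.

```python
# Illustrative FID-50K evaluation loop; not the official DiffiT eval code.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048, normalize=True)  # float images in [0, 1]

for real_imgs in real_loader:          # hypothetical: yields (B, 3, H, W) in [0, 1]
    fid.update(real_imgs, real=True)

num_generated = 0
while num_generated < 50_000:          # FID-50K: 50K generated samples
    fake_imgs = sample_batch()         # hypothetical: returns (B, 3, H, W) in [0, 1]
    fid.update(fake_imgs, real=False)
    num_generated += fake_imgs.shape[0]

print(f"FID-50K: {fid.compute().item():.2f}")
```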
## Citation

```bibtex
@inproceedings{hatamizadeh2025diffit,
  title={DiffiT: Diffusion vision transformers for image generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}
```
## Licenses

Copyright © 2024, NVIDIA Corporation. All rights reserved.