Diffusion Models and Representation Learning: A Survey

This repo collects and categorizes papers on diffusion models, following the taxonomy of our survey paper, Diffusion Models and Representation Learning: A Survey.

Because this field is developing quickly, we will continue to update both the arXiv paper and this repo.

Overview

Diffusion models are popular generative modeling methods for a wide range of vision tasks and have attracted significant attention. Because they require no label annotation, they can be viewed as a distinct form of self-supervised learning. This survey explores the interplay between diffusion models and representation learning. It provides an overview of the essential aspects of diffusion models, including their mathematical foundations, popular denoising network architectures, and guidance methods. It then details approaches that connect diffusion models and representation learning: frameworks that leverage representations learned by pre-trained diffusion models for downstream recognition tasks, and methods that use advances in representation and self-supervised learning to enhance diffusion models. The survey aims to offer a comprehensive taxonomy of the relationship between diffusion models and representation learning, identifying key open concerns and directions for future exploration.

Papers (listed according to year)

Diffusion Models for Representation Learning

ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
C. Tian, C. Tao, J. Dai, H. Li, Z. Li, L. Lu, X. Wang, H. Li, G. Huang, X. Zhu
ICLR 2024.

SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching
X. Li, J. Lu, K. Han, V. A. Prisacariu
CVPR 2024.

Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
J. Guo, X. Xu, Y. Pu, Z. Ni, C. Wang, M. Vasu, S. Song, G. Huang, H. Shi
CVPR 2024.

SODA: Bottleneck Diffusion Models for Representation Learning
D. A. Hudson, D. Zoran, M. Malinowski, A. K. Lampinen, A. Jaegle, J. L. McClelland, L. Matthey, F. Hill, A. Lerchner
CVPR 2024.

Masked Diffusion as Self-Supervised Representation Learner
Z. Pan, J. Chen, Y. Shi
arXiv 2024.

ScribbleGen: Generative Data Augmentation Improves Scribble-Supervised Semantic Segmentation
J. Schnell, J. Wang, L. Qi, V. T. Hu, M. Tang
arXiv 2024.

Deconstructing Denoising Diffusion Models for Self-Supervised Learning
X. Chen, Z. Liu, S. Xie, K. He
arXiv 2024.

Can Generative Models Improve Self-Supervised Representation Learning?
S. Ayromlou, A. Afkanpour, V. R. Khazaie, F. Forghani
arXiv 2024.

Unsupervised Semantic Correspondence Using Stable Diffusion
E. Hedlin, G. Sharma, S. Mahajan, H. Isack, A. Kar, A. Tagliasacchi, K. M. Yi
NeurIPS 2023.

A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence
J. Zhang, C. Herrmann, J. Hur, L. P. Cabrera, V. Jampani, D. Sun, M.-H. Yang
NeurIPS 2023.

Emergent Correspondence from Image Diffusion
L. Tang, M. Jia, Q. Wang, C. P. Phoo, B. Hariharan
NeurIPS 2023.

Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
G. Luo, L. Dunlap, D. H. Park, A. Holynski, T. Darrell
NeurIPS 2023.

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
N. Tumanyan, M. Geyer, S. Bagon, T. Dekel
CVPR 2023.

Diversity is Definitely Needed: Improving Model-Agnostic Zero-Shot Classification via Stable Diffusion
J. Shipard, A. Wiliem, K. N. Thanh, W. Xiang, C. Fookes
CVPR 2023.

Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, S. De Mello
CVPR 2023.

Denoising Diffusion Autoencoders are Unified Self-Supervised Learners
W. Xiang, H. Yang, D. Huang, Y. Wang
ICCV 2023.

Diffusion Models as Masked Autoencoders
C. Wei, K. Mangalam, P.-Y. Huang, Y. Li, H. Fan, H. Xu, H. Wang, C. Xie, A. Yuille, C. Feichtenhofer
ICCV 2023.

Unleashing Text-to-Image Diffusion Models for Visual Perception
W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, J. Lu
ICCV 2023.

Your Diffusion Model is Secretly a Zero-Shot Classifier
A. C. Li, M. Prabhudesai, S. Duggal, E. Brown, D. Pathak
ICCV 2023.

Diffusion Model as Representation Learner
X. Yang, X. Wang
ICCV 2023.

DreamTeacher: Pretraining Image Backbones with Deep Generative Models
D. Li, H. Ling, A. Kar, D. Acuna, S. W. Kim, K. Kreis, A. Torralba, S. Fidler
ICCV 2023.

InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models
Y. Wang, Y. Schiff, A. Gokaslan, W. Pan, F. Wang, C. De Sa, V. Kuleshov
ICML 2023.

Learning Data Representations with Joint Diffusion Models
K. Deja, T. Trzciński, J. M. Tomczak
Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2023.

Diffusion Models Beat GANs on Image Classification
S. Mukhopadhyay, M. Gwilliam, V. Agarwal, N. Padmanabhan, A. Swaminathan, S. Hegde, T. Zhou, A. Shrivastava
arXiv 2023.

Do Text-Free Diffusion Models Learn Discriminative Visual Representations?
S. Mukhopadhyay, M. Gwilliam, Y. Yamaguchi, V. Agarwal, N. Padmanabhan, A. Swaminathan, T. Zhou, A. Shrivastava
arXiv 2023.

Unsupervised Representation Learning from Pre-Trained Diffusion Probabilistic Models
Z. Zhang, Z. Zhao, Z. Lin
NeurIPS 2022.

Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
K. Preechakul, N. Chatthee, S. Wizadwongsa, S. Suwajanakorn
CVPR 2022.

Label-Efficient Semantic Segmentation with Diffusion Models
D. Baranchuk, A. Voynov, I. Rubachev, V. Khrulkov, A. Babenko
ICLR 2022.

Prompt-to-Prompt Image Editing with Cross Attention Control
A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, D. Cohen-Or
arXiv 2022.

Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model
X. Yang, S.-M. Shih, Y. Fu, X. Zhao, S. Ji
arXiv 2022.

Diffusion Models Beat GANs on Image Synthesis
P. Dhariwal, A. Nichol
NeurIPS 2021.

<!--- Cover Tab.2 -->

Representation Learning for Diffusion Models

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, Saining Xie
https://arxiv.org/abs/2410.06940

Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D
K. Pandey, P. Guerrero, M. Gadelha, Y. Hold-Geoffroy, K. Singh, N. J. Mitra
CVPR 2024.

Readout Guidance: Learning Control from Diffusion Features
G. Luo, T. Darrell, O. Wang, D. B. Goldman, A. Holynski
CVPR 2024.

Depth-Aware Guidance with Self-Estimated Depth Representations of Diffusion Models
G. Kim, W. Jang, G. Lee, S. Hong, J. Seo, S. Kim
Pattern Recognition, vol. 153, 2024.

Diffusion Model with Perceptual Loss
S. Lin, X. Yang
arXiv 2024.

Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance
D. Ahn, H. Cho, J. Min, W. Jang, J. Kim, S. Kim, H. H. Park, K. H. Jin, S. Kim
arXiv 2024.

Return of Unconditional Generation: A Self-supervised Representation Generation Method
T. Li, D. Katabi, K. He
arXiv 2024.

Rethinking Cluster-Conditioned Diffusion Models
N. Adaloglou, T. Kaiser, F. Michels, M. Kollmann
arXiv 2024.

Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels
Z. You, Y. Zhong, F. Bao, J. Sun, C. Li, J. Zhu
NeurIPS 2023.

Diffusion Self-Guidance for Controllable Image Generation
D. Epstein, A. Jabri, B. Poole, A. Efros, A. Holynski
NeurIPS 2023.

kNN-Diffusion: Image Generation via Large-Scale Retrieval
S. Sheynin, O. Ashual, A. Polyak, U. Singer, O. Gafni, E. Nachmani, Y. Taigman
ICLR 2023.

Self-Guided Diffusion Models
V. T. Hu, D. W. Zhang, Y. M. Asano, G. J. Burghouts, C. G. M. Snoek
CVPR 2023.

Improving Sample Quality of Diffusion Models Using Self-Attention Guidance
S. Hong, G. Lee, W. Jang, S. Kim
ICCV 2023.

Guided Diffusion from Self-Supervised Diffusion Features
V. T. Hu, Y. Chen, M. Caron, Y. M. Asano, C. G. M. Snoek, B. Ommer
arXiv 2023.

Retrieval-Augmented Diffusion Models
A. Blattmann, R. Rombach, K. Oktay, B. Ommer
NeurIPS 2022.

Elucidating the Design Space of Diffusion-Based Generative Models
T. Karras, M. Aittala, T. Aila, S. Laine
NeurIPS 2022.

<!--- Cover whole Sec 3.2 -->

General

State of the Art on Diffusion Models for Visual Computing
R. Po, W. Yifan, V. Golyanik, K. Aberman, J. T. Barron, A. H. Bermano, E. R. Chan, T. Dekel, A. Holynski, A. Kanazawa, C. K. Liu, L. Liu, B. Mildenhall, M. Nießner, B. Ommer, C. Theobalt, P. Wonka, G. Wetzstein
Computer Graphics Forum 2024.

Diffusion Models in Vision: A Survey
F. Croitoru, V. Hondru, R. Ionescu, M. Shah
IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 45, no. 9, 2023.

Diffusion Models: A Comprehensive Survey of Methods and Applications
L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, M.-H. Yang
ACM Computing Surveys, vol. 56, no. 4, 2023.

Improved Techniques for Maximum Likelihood Estimation for Diffusion ODEs
K. Zheng, C. Lu, J. Chen, J. Zhu
ICML 2023.

On the Design Fundamentals of Diffusion Models: A Survey
Z. Chang, G. A. Koulieris, H. P. H. Shum
arXiv 2023.

Understanding Diffusion Models: A Unified Perspective
C. Luo
arXiv 2022.

Progressive Distillation for Fast Sampling of Diffusion Models
T. Salimans, J. Ho
ICLR 2022.

High-Resolution Image Synthesis with Latent Diffusion Models
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer
CVPR 2022.

Dynamic Dual-Output Diffusion Models
Y. Benny, L. Wolf
CVPR 2022.

Variational Diffusion Models
D. Kingma, T. Salimans, B. Poole, J. Ho
NeurIPS 2021.

A Variational Perspective on Diffusion-Based Generative Models and Score Matching
C.-W. Huang, J. H. Lim, A. Courville
NeurIPS 2021.

Score-Based Generative Modeling through Stochastic Differential Equations
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole
ICLR 2021.

Denoising Diffusion Probabilistic Models
J. Ho, A. Jain, P. Abbeel
NeurIPS 2020.

Generative Modeling by Estimating Gradients of the Data Distribution
Y. Song, S. Ermon
NeurIPS 2019.

<!--- Up until sec 2.2 -->