📖 Video Diffusion Models: A Survey
Survey link: https://arxiv.org/abs/2405.03150
@article{melnik2024video,
  title={Video Diffusion Models: A Survey},
  author={Melnik, Andrew and Ljubljanac, Michal and Lu, Cong and Yan, Qi and Ren, Weiming and Ritter, Helge},
  journal={arXiv preprint arXiv:2405.03150},
  year={2024}
}
Papers
2024
Lane Segmentation Refinement with Diffusion Models
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks
VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Genie: Generative Interactive Environments
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling
Lumiere: A space-time diffusion model for video generation
WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens
VideoCrafter2: Overcoming data limitations for high-quality video diffusion models
Latte: Latent diffusion transformer for video generation
MoonShot: Towards controllable video generation and editing with multimodal conditions
2023
I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models
VideoPoet: A large language model for zero-shot video generation
Llama Guard: LLM-based input-output safeguard for human-AI conversations
Photorealistic video generation with diffusion models
GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
Animate Anyone: Consistent and controllable image-to-video synthesis for character animation
MagicAnimate: Temporally consistent human image animation using diffusion model
Stable Video Diffusion: Scaling latent video diffusion models to large datasets
Make Pixels Dance: High-dynamic video generation
Emu Video: Factorizing text-to-video generation by explicit image conditioning
I2VGen-XL: High-quality image-to-video synthesis via cascaded diffusion models
Consistent Video-to-Video Transfer Using Synthetic Dataset
VideoCrafter1: Open diffusion models for high-quality video generation
DynamiCrafter: Animating open-domain images with video diffusion priors
FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
GAIA-1: A generative world model for autonomous driving
Show-1: Marrying pixel and latent diffusion models for text-to-video generation
LaVie: High-quality video generation with cascaded latent diffusion models
GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER
Diffuse, attend, and segment: Unsupervised zero-shot segmentation using stable diffusion
TokenFlow: Consistent diffusion features for consistent video editing
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
SDXL: Improving latent diffusion models for high-resolution image synthesis
Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
Stable Remaster: Bridging the Gap Between Old Content and New Displays
Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis
Video Diffusion Models with Local-Global Context Guidance
Probabilistic Adaptation of Text-to-Video Models
Video Colorization with Pre-trained Text-to-Image Diffusion Models
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising
A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence
VDT: An Empirical Study on Video Diffusion with Transformers
ControlVideo: Training-free Controllable Text-to-Video Generation
Any-to-Any Generation via Composable Diffusion
VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation
Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models
Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts
Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer
AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion
Motion-Conditioned Diffusion Model for Controllable Video Synthesis
Generative Disco: Text-to-Video Generation for Music Visualization
Text2Performer: Text-Driven Human Video Generation
Latent-Shift: Latent diffusion with temporal shift for efficient text-to-video generation
Video Generation Beyond a Single Clip
DINOv2: Learning robust visual features without supervision
Soundini: Sound-Guided Diffusion for Natural Video Editing
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
Zero-shot video editing using off-the-shelf image diffusion models
Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators
Pix2Video: Video editing using image diffusion
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
FateZero: Fusing attentions for zero-shot text-based video editing
Decomposed Diffusion Models for High-Quality Video Generation
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
Video-P2P: Video editing with cross-attention control
LLaMA: Open and efficient foundation language models
Adding conditional control to text-to-image diffusion models
Structure and content-guided video synthesis with diffusion models
Dreamix: Video diffusion models are general video editors
SceneScape: Text-driven consistent scene generation
simple diffusion: End-to-end diffusion for high resolution images
Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation
2022
Behavioral cloning via search in video pretraining latent space
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Scalable Diffusion Models with Transformers
Latent video diffusion models for high-fidelity video generation with arbitrary lengths
MagicVideo: Efficient video generation with latent diffusion models
DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models
Representation Learning with Diffusion Models
Imagen video: High definition video generation with diffusion models
Make-A-Video: Text-to-video generation without text-video data
Prompt-to-prompt image editing with cross attention control
An image is worth one word: Personalizing text-to-image generation using textual inversion
Classifier-free diffusion guidance
CogVideo: Large-scale pretraining for text-to-video generation via transformers
Flexible diffusion modeling of long videos
Hierarchical text-conditional image generation with CLIP latents
Generating videos with dynamics-aware implicit generative adversarial networks