📖 Video Diffusion Models: A Survey
Survey link: https://arxiv.org/abs/2405.03150
@article{melnik2024video,
  title={Video Diffusion Models: A Survey},
  author={Melnik, Andrew and Ljubljanac, Michal and Lu, Cong and Yan, Qi and Ren, Weiming and Ritter, Helge},
  journal={arXiv preprint arXiv:2405.03150},
  year={2024}
}
Papers
2024
Lane Segmentation Refinement with Diffusion Models
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks
VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Genie: Generative Interactive Environments
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling
Lumiere: A space-time diffusion model for video generation
WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens
VideoCrafter2: Overcoming data limitations for high-quality video diffusion models
Latte: Latent diffusion transformer for video generation
MoonShot: Towards controllable video generation and editing with multimodal conditions
2023
I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models
VideoPoet: A large language model for zero-shot video generation
Llama Guard: LLM-based input-output safeguard for human-AI conversations
Photorealistic video generation with diffusion models
GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
Animate Anyone: Consistent and controllable image-to-video synthesis for character animation
MagicAnimate: Temporally consistent human image animation using diffusion model
Stable Video Diffusion: Scaling latent video diffusion models to large datasets
Make Pixels Dance: High-dynamic video generation
Emu Video: Factorizing text-to-video generation by explicit image conditioning
I2VGen-XL: High-quality image-to-video synthesis via cascaded diffusion models
Consistent Video-to-Video Transfer Using Synthetic Dataset
VideoCrafter1: Open diffusion models for high-quality video generation
DynamiCrafter: Animating open-domain images with video diffusion priors
FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
GAIA-1: A generative world model for autonomous driving
Show-1: Marrying pixel and latent diffusion models for text-to-video generation
LaVie: High-quality video generation with cascaded latent diffusion models
GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER
Diffuse, attend, and segment: Unsupervised zero-shot segmentation using stable diffusion
TokenFlow: Consistent diffusion features for consistent video editing
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
SDXL: Improving latent diffusion models for high-resolution image synthesis
Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
Stable Remaster: Bridging the Gap Between Old Content and New Displays
Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis
Video Diffusion Models with Local-Global Context Guidance
Probabilistic Adaptation of Text-to-Video Models
Video Colorization with Pre-trained Text-to-Image Diffusion Models
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising
A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence
VDT: An Empirical Study on Video Diffusion with Transformers
ControlVideo: Training-free Controllable Text-to-Video Generation
Any-to-Any Generation via Composable Diffusion
VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation
Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models
Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts
Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer
AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion
Motion-Conditioned Diffusion Model for Controllable Video Synthesis
Generative Disco: Text-to-Video Generation for Music Visualization
Text2Performer: Text-Driven Human Video Generation
Latent-Shift: Latent diffusion with temporal shift for efficient text-to-video generation
Video Generation Beyond a Single Clip
DINOv2: Learning robust visual features without supervision
Soundini: Sound-Guided Diffusion for Natural Video Editing
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
Zero-shot video editing using off-the-shelf image diffusion models
Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators
Pix2Video: Video editing using image diffusion
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
FateZero: Fusing attentions for zero-shot text-based video editing
Decomposed Diffusion Models for High-Quality Video Generation
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
Video-P2P: Video editing with cross-attention control
LLaMA: Open and efficient foundation language models
Adding conditional control to text-to-image diffusion models
Structure and content-guided video synthesis with diffusion models
Dreamix: Video diffusion models are general video editors
SceneScape: Text-driven consistent scene generation
simple diffusion: End-to-end diffusion for high resolution images
Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation
2022
Behavioral cloning via search in video pretraining latent space
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Scalable Diffusion Models with Transformers
Latent video diffusion models for high-fidelity video generation with arbitrary lengths
MagicVideo: Efficient video generation with latent diffusion models
DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models
Representation Learning with Diffusion Models
Imagen video: High definition video generation with diffusion models
Make-A-Video: Text-to-video generation without text-video data
Prompt-to-prompt image editing with cross attention control
An image is worth one word: Personalizing text-to-image generation using textual inversion
Classifier-free diffusion guidance
CogVideo: Large-scale pretraining for text-to-video generation via transformers
Flexible diffusion modeling of long videos
Hierarchical text-conditional image generation with CLIP latents
Generating videos with dynamics-aware implicit generative adversarial networks