Awesome

A Survey on Video Diffusion Models

<div style="text-align:center; font-size: 18px;"> <p> <a href="https://chenhsing.github.io">Zhen Xing</a>, Qijun Feng, Haoran Chen, <a href="https://scholar.google.com/citations?user=NSJY12IAAAAJ&hl=zh-CN" >Qi Dai,</a> <a href="https://scholar.google.com/citations?user=Jkss014AAAAJ&hl=zh-CN&oi=ao" >Han Hu,</a> <a href="https://scholar.google.com/citations?user=J_8TX6sAAAAJ&hl=zh-CN&oi=ao" >Hang Xu,</a> <a href="https://scholar.google.com/citations?user=7t12hVkAAAAJ&hl=en" >Zuxuan Wu,</a> <a href="https://scholar.google.com/citations?user=f3_FP8AAAAAJ&hl=en" >Yu-Gang Jiang </a> </p> </div> <p align="center"> <img src="asset/fish.webp" width="160px"/> <img src="asset/tree.gif" width="160px"/> <img src="asset/raccoon.gif" width="160px"/> </p> <p align="center"> <img src="asset/svd.gif" width="240px"/> <img src="asset/fly3.gif" width="240px"/> </p> <p align="center"> <img src="asset/1.gif" width="120px"/> <img src="asset/2.gif" width="120px"/> <img src="asset/3.gif" width="120px"/> <img src="asset/4.gif" width="120px"/> </p> <p align="center"> (Source: <a href="https://makeavideo.studio/">Make-A-Video</a>, <a href="https://chenhsing.github.io/SimDA/">SimDA</a>, <a href="https://research.nvidia.com/labs/dir/pyoco/">PYoCo</a>, <a href="https://img.shields.io/badge/Website-9cf"> SVD </a>, <a href="https://research.nvidia.com/labs/toronto-ai/VideoLDM/">Video LDM</a> and <a href="https://tuneavideo.github.io/">Tune-A-Video</a>) </p>

Contact

If you have any suggestions or find our work helpful, feel free to contact us.

Homepage: Zhen Xing

Email: zhenxingfd@gmail.com

If you find our survey useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

@article{xing2023survey,
  title={A survey on video diffusion models},
  author={Xing, Zhen and Feng, Qijun and Chen, Haoran and Dai, Qi and Hu, Han and Xu, Hang and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={ACM Computing Surveys},
  year={2023},
  publisher={ACM New York, NY}
}

Open-source Toolboxes and Foundation Models

| Methods | Task | Github |
|---|---|---|
| Movie Gen | T2V Generation | - |
| CogVideoX | T2V Generation | Star |
| Open-Sora-Plan | T2V Generation | Star |
| Open-Sora | T2V Generation | Star |
| Morph Studio | T2V Generation | - |
| Genie | T2V Generation | - |
| Sora | T2V Generation & Editing | - |
| VideoPoet | T2V Generation & Editing | - |
| Stable Video Diffusion | T2V Generation | Star |
| NeverEnds | T2V Generation | - |
| Pika | T2V Generation | - |
| EMU-Video | T2V Generation | - |
| GEN-2 | T2V Generation & Editing | - |
| ModelScope | T2V Generation | Star |
| ZeroScope | T2V Generation | - |
| T2V Synthesis Colab | T2V Generation | Star |
| VideoCraft | T2V Generation & Editing | Star |
| Diffusers (T2V synthesis) | T2V Generation | - |
| AnimateDiff | Personalized T2V Generation | Star |
| Text2Video-Zero | T2V Generation | Star |
| HotShot-XL | T2V Generation | Star |
| Genmo | T2V Generation | - |
| Fliki | T2V Generation | - |

Video Generation

Data

Caption-level

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Identity-Preserving Text-to-Video Generation by Frequency Decomposition | arXiv | Star | Website | Nov., 2024 |
| ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation | arXiv | Star | Website | NeurIPS, 2024 |
| Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | arXiv | Star | Website | CVPR, 2024 |
| CelebV-Text: A Large-Scale Facial Text-Video Dataset | arXiv | Star | - | CVPR, 2023 |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | arXiv | Star | - | May, 2023 |
| VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation | arXiv | - | - | May, 2023 |
| Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | arXiv | - | - | Nov, 2021 |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | arXiv | - | - | ICCV, 2021 |
| MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | arXiv | - | - | CVPR, 2016 |

Category-level

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild | arXiv | - | - | Dec., 2012 |
| First Order Motion Model for Image Animation | arXiv | - | - | May, 2023 |
| Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks | arXiv | - | - | CVPR, 2018 |

Metric and BenchMark

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos | arXiv | Star | - | Jul., 2024 |
| ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation | arXiv | Star | Website | NeurIPS, 2024 |
| STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models | arXiv | Star | - | ICLR, 2024 |
| Subjective-Aligned Dateset and Metric for Text-to-Video Quality Assessment | arXiv | - | - | Mar, 2024 |
| Towards A Better Metric for Text-to-Video Generation | arXiv | - | Website | Jan, 2024 |
| AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI | arXiv | - | - | Jan, 2024 |
| VBench: Comprehensive Benchmark Suite for Video Generative Models | arXiv | Star | Website | Nov, 2023 |
| FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation | arXiv | - | - | NeurIPS, 2023 |
| CVPR 2023 Text Guided Video Editing Competition | arXiv | - | - | Oct., 2023 |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | arXiv | Star | Website | Oct., 2023 |
| Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset | arXiv | - | - | Sep., 2023 |

Text-to-Video Generation

Training-based

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Identity-Preserving Text-to-Video Generation by Frequency Decomposition | arXiv | Star | Website | Nov., 2024 |
| Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning | arXiv | Star | Website | NeurIPS 2024 |
| Movie Gen | arXiv | - | Website | Oct, 2024 |
| CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer | arXiv | Star | - | Oct, 2024 |
| Grid Diffusion Models for Text-to-Video Generation | arXiv | Star | Website | CVPR, 2024 |
| MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators | arXiv | Star | Website | Apr., 2024 |
| Mora: Enabling Generalist Video Generation via A Multi-Agent Framework | arXiv | - | - | Mar., 2024 |
| VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis | arXiv | - | - | Mar., 2024 |
| Genie: Generative Interactive Environments | arXiv | - | Website | Feb., 2024 |
| Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis | arXiv | - | Website | Feb., 2024 |
| Lumiere: A Space-Time Diffusion Model for Video Generation | arXiv | - | Website | Jan, 2024 |
| UNIVG: TOWARDS UNIFIED-MODAL VIDEO GENERATION | arXiv | - | Website | Jan, 2024 |
| VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models | arXiv | Star | Website | Jan, 2024 |
| 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model | arXiv | - | Website | Jan, 2024 |
| MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation | arXiv | - | Website | Jan, 2024 |
| VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM | arXiv | - | Website | Jan, 2024 |
| A Recipe for Scaling up Text-to-Video Generation with Text-free Videos | arXiv | Star | Website | Dec, 2023 |
| InstructVideo: Instructing Video Diffusion Models with Human Feedback | arXiv | Star | Website | Dec, 2023 |
| VideoLCM: Video Latent Consistency Model | arXiv | - | - | Dec, 2023 |
| Photorealistic Video Generation with Diffusion Models | arXiv | - | Website | Dec, 2023 |
| Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation | arXiv | Star | Website | Dec, 2023 |
| Delving Deep into Diffusion Transformers for Image and Video Generation | arXiv | - | Website | Dec, 2023 |
| StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter | arXiv | Star | Website | Nov, 2023 |
| MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation | arXiv | - | Website | Nov, 2023 |
| ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models | arXiv | Star | Website | Nov, 2023 |
| Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | arXiv | Star | Website | Nov, 2023 |
| FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline | arXiv | Star | Website | Nov, 2023 |
| MoVideo: Motion-Aware Video Generation with Diffusion Models | arXiv | - | Website | Nov, 2023 |
| Make Pixels Dance: High-Dynamic Video Generation | arXiv | - | Website | Nov, 2023 |
| Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning | arXiv | - | Website | Nov, 2023 |
| Optimal Noise pursuit for Augmenting Text-to-Video Generation | arXiv | - | - | Nov, 2023 |
| VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning | arXiv | - | Website | Nov, 2023 |
| VideoCrafter1: Open Diffusion Models for High-Quality Video Generation | arXiv | Star | Website | Oct, 2023 |
| SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction | arXiv | Star | Website | Oct, 2023 |
| DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors | arXiv | Star | Website | Oct., 2023 |
| LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation | arXiv | Star | Website | Oct., 2023 |
| DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model | arXiv | Star | Website | Oct, 2023 |
| MotionDirector: Motion Customization of Text-to-Video Diffusion Models | arXiv | Star | Website | Oct, 2023 |
| VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning | arXiv | Star | Website | Sep., 2023 |
| Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation | arXiv | Star | Website | Sep., 2023 |
| LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models | arXiv | Star | Website | Sep., 2023 |
| Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation | arXiv | Star | Website | Sep., 2023 |
| VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation | arXiv | - | Website | Sep., 2023 |
| MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text | arXiv | - | - | Jul., 2023 |
| Text2Performer: Text-Driven Human Video Generation | arXiv | Star | Website | Apr., 2023 |
| AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning | arXiv | Star | Website | Jul., 2023 |
| Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models | arXiv | - | Website | Aug., 2023 |
| SimDA: Simple Diffusion Adapter for Efficient Video Generation | arXiv | Star | Website | CVPR, 2024 |
| Dual-Stream Diffusion Net for Text-to-Video Generation | arXiv | - | - | Aug., 2023 |
| ModelScope Text-to-Video Technical Report | arXiv | Star | Website | Aug., 2023 |
| InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | arXiv | Star | - | Jul., 2023 |
| VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation | arXiv | - | - | May, 2023 |
| Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | arXiv | - | Website | May, 2023 |
| Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | arXiv | - | Website | - |
| Latent-Shift: Latent Diffusion with Temporal Shift | arXiv | - | Website | - |
| Probabilistic Adaptation of Text-to-Video Models | arXiv | - | Website | Jun., 2023 |
| NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation | arXiv | - | Website | Mar., 2023 |
| ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation | - | - | - | IJCNN, 2023 |
| MagicVideo: Efficient Video Generation With Latent Diffusion Models | arXiv | - | Website | - |
| Phenaki: Variable Length Video Generation From Open Domain Textual Description | arXiv | - | Website | - |
| Imagen Video: High Definition Video Generation With Diffusion Models | arXiv | - | Website | - |
| VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | arXiv | Star | Website | - |
| MAGVIT: Masked Generative Video Transformer | arXiv | - | Website | Dec., 2022 |
| Make-A-Video: Text-to-Video Generation without Text-Video Data | arXiv | - | Website | - |
| Latent Video Diffusion Models for High-Fidelity Video Generation With Arbitrary Lengths | arXiv | Star | Website | Nov., 2022 |
| CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | arXiv | Star | - | May, 2022 |
| Video Diffusion Models | arXiv | - | Website | - |

Training-free

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models | arXiv | Star | Website | Mar, 2024 |
| TRAILBLAZER: TRAJECTORY CONTROL FOR DIFFUSION-BASED VIDEO GENERATION | arXiv | Star | Website | Jan, 2024 |
| FreeInit: Bridging Initialization Gap in Video Diffusion Models | arXiv | Star | Website | Dec, 2023 |
| MTVG : Multi-text Video Generation with Text-to-Video Models | arXiv | - | Website | Dec, 2023 |
| F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis | arXiv | - | - | Nov, 2023 |
| AdaDiff: Adaptive Step Selection for Fast Diffusion | arXiv | - | - | Nov, 2023 |
| FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax | arXiv | Star | Website | Nov, 2023 |
| 🏀GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | arXiv | Star | Website | Nov, 2023 |
| FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling | arXiv | Star | Website | Oct, 2023 |
| ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation | arXiv | Star | Website | Oct, 2023 |
| LLM-grounded Video Diffusion Models | arXiv | Star | Website | Oct, 2023 |
| Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator | arXiv | Star | - | NeurIPS, 2023 |
| DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis | arXiv | Star | Website | Aug, 2023 |
| Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation | arXiv | Star | - | May, 2023 |
| Text2video-Zero: Text-to-Image Diffusion Models Are Zero-Shot Video Generators | arXiv | Star | Website | Mar., 2023 |
| PEEKABOO: Interactive Video Generation via Masked-Diffusion 🫣 | arXiv | Star | Website | CVPR, 2024 |

Video Generation with other conditions

Pose-guided Video Generation

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| 🔥🔥<b>StableAnimator: High-Quality Identity-Preserving Human Image Animation</b>🔥🔥 | arXiv | Star | Website | Nov., 2024 |
| MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model | arXiv | Star | Website | ECCV 2024 |
| MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance | arXiv | Star | Website | Jul., 2024 |
| Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance | arXiv | Star | Website | Mar., 2024 |
| Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions | arXiv | - | - | Mar., 2024 |
| Do You Guys Want to Dance: Zero-Shot Compositional Human Dance Generation with Multiple Persons | arXiv | - | - | Jan., 2024 |
| DreaMoving: A Human Dance Video Generation Framework based on Diffusion Models | arXiv | - | Website | Dec., 2023 |
| MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model | arXiv | Star | Website | Nov., 2023 |
| Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation | arXiv | Star | Website | Nov., 2023 |
| MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer | arXiv | Star | Website | Nov., 2023 |
| DisCo: Disentangled Control for Referring Human Dance Generation in Real World | arXiv | Star | Website | Jul., 2023 |
| Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model | arXiv | - | - | Aug., 2023 |
| DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion | arXiv | Star | Website | Apr., 2023 |
| Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos | arXiv | Star | Website | Apr., 2023 |

Motion-guided Video Generation

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| MOTIONCLONE: TRAINING-FREE MOTION CLONING FOR CONTROLLABLE VIDEO GENERATION | arXiv | Star | Website | Jun., 2024 |
| Tora: Trajectory-oriented Diffusion Transformer for Video Generation | arXiv | Star | Website | Jul., 2024 |
| MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model | arXiv | Star | Website | ECCV 2024 |
| Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance | arXiv | Star | Website | Mar., 2024 |
| Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling | arXiv | - | - | Jan., 2024 |
| Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation | arXiv | - | - | Jan., 2024 |
| Customizing Motion in Text-to-Video Diffusion Models | arXiv | - | Website | Dec., 2023 |
| VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models | arXiv | Star | Website | CVPR 2024 |
| AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance | arXiv | Star | Website | Nov., 2023 |
| Motion-Conditioned Diffusion Model for Controllable Video Synthesis | arXiv | - | Website | Apr., 2023 |
| DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory | arXiv | - | - | Aug., 2023 |

Sound-guided Video Generation

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation | arXiv | Star | Website | Jun., 2024 |
| Context-aware Talking Face Video Generation | arXiv | - | - | Feb., 2024 |
| EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions | arXiv | Star | Website | Feb., 2024 |
| The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion | arXiv | - | - | ICCV, 2023 |
| Generative Disco: Text-to-Video Generation for Music Visualization | arXiv | - | - | Apr., 2023 |
| AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion | arXiv | - | - | CVPRW, 2023 |

Image-guided Video Generation

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Identity-Preserving Text-to-Video Generation by Frequency Decomposition | arXiv | Star | Website | Nov., 2024 |
| PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation | arXiv | Star | Website | ECCV 2024 |
| TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models | arXiv | Star | Website | CVPR 2024 |
| Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model | arXiv | Star | Website | NeurIPS 2024 |
| Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation | arXiv | - | Website | Mar., 2024 |
| AtomoVideo: High Fidelity Image-to-Video Generation | arXiv | - | Website | Mar., 2024 |
| Animated Stickers: Bringing Stickers to Life with Video Diffusion | arXiv | - | - | Feb., 2024 |
| CONSISTI2V: Enhancing Visual Consistency for Image-to-Video Generation | arXiv | - | Website | Feb., 2024 |
| I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models | arXiv | - | - | Dec., 2023 |
| PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models | arXiv | - | Website | Dec., 2023 |
| DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance | arXiv | - | Website | Nov., 2023 |
| LivePhoto: Real Image Animation with Text-guided Motion Control | arXiv | Star | Website | Nov., 2023 |
| VideoBooth: Diffusion-based Video Generation with Image Prompts | arXiv | Star | Website | Nov., 2023 |
| Decouple Content and Motion for Conditional Image-to-Video Generation | arXiv | - | - | Nov, 2023 |
| I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models | arXiv | - | - | Nov, 2023 |
| Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image | arXiv | - | - | MM, 2023 |
| Generative Image Dynamics | arXiv | - | Website | Sep., 2023 |
| LaMD: Latent Motion Diffusion for Video Generation | arXiv | - | - | Apr., 2023 |
| Conditional Image-to-Video Generation with Latent Flow Diffusion Models | arXiv | Star | - | CVPR 2023 |
| NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis | arXiv | Star | Website | CVPR 2022 |

Brain-guided Video Generation

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| NeuroCine: Decoding Vivid Video Sequences from Human Brain Activties | arXiv | - | - | Feb., 2024 |
| Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity | arXiv | Star | Website | NeurIPS, 2023 |

Depth-guided Video Generation

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| StableV2V: Stablizing Shape Consistency in Video-to-Video Editing | arXiv | Star | Website | Nov., 2024 |
| Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation | arXiv | Star | Website | Jul., 2023 |
| Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance | arXiv | Star | Website | Jun., 2023 |

Multi-modal guided Video Generation

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control | arXiv | - | - | Mar., 2024 |
| Magic-Me: Identity-Specific Video Customized Diffusion | arXiv | - | Website | Feb., 2024 |
| InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions | arXiv | - | Website | Feb., 2024 |
| Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion | arXiv | - | Website | Feb., 2024 |
| Boximator: Generating Rich and Controllable Motions for Video Synthesis | arXiv | - | Website | Feb., 2024 |
| AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning | arXiv | - | - | Jan., 2024 |
| ActAnywhere: Subject-Aware Video Background Generation | arXiv | - | Website | Jan., 2024 |
| CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects | arXiv | - | - | Jan., 2024 |
| MoonShot: Towards Controllable Video Generation and Editing with Multimodal Conditions | arXiv | Star | Website | Jan., 2024 |
| PEEKABOO: Interactive Video Generation via Masked-Diffusion | arXiv | - | Website | Dec., 2023 |
| CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling | arXiv | - | - | Dec., 2023 |
| Fine-grained Controllable Video Generation via Object Appearance and Context | arXiv | - | Website | Nov., 2023 |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | arXiv | - | Website | Nov., 2023 |
| Panacea: Panoramic and Controllable Video Generation for Autonomous Driving | arXiv | - | Website | Nov., 2023 |
| SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models | arXiv | - | Website | Nov., 2023 |
| VideoComposer: Compositional Video Synthesis with Motion Controllability | arXiv | Star | Website | Jun., 2023 |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | - | - | Sep, 2023 |
| MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images | arXiv | - | Website | Jun, 2023 |
| Any-to-Any Generation via Composable Diffusion | arXiv | Star | Website | May, 2023 |
| Mm-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | arXiv | Star | - | CVPR 2023 |

Unconditional Video Generation

U-Net based

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation | arXiv | Star | Website | Feb. 2024 |
| Video Probabilistic Diffusion Models in Projected Latent Space | arXiv | Star | Website | CVPR 2023 |
| VIDM: Video Implicit Diffusion Models | arXiv | Star | Website | AAAI 2023 |
| GD-VDM: Generated Depth for better Diffusion-based Video Generation | arXiv | Star | - | Jun., 2023 |
| LEO: Generative Latent Image Animator for Human Video Synthesis | arXiv | Star | Website | May., 2023 |

Transformer based

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach | arXiv | Star | - | Oct., 2024 |
| Latte: Latent Diffusion Transformer for Video Generation | arXiv | Star | Website | Jan., 2024 |
| VDT: An Empirical Study on Video Diffusion with Transformers | arXiv | Star | - | May, 2023 |
| Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | arXiv | Star | Website | May, 2023 |

Video Completion

Video Enhancement and Restoration

| Title | arXiv | Github | WebSite | Pub. & Date |
|---|---|---|---|---|
| Towards Language-Driven Video Inpainting via Multimodal Large Language Models | arXiv | Star | Website | Jan., 2024 |
| Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution | - | - | - | WACW, 2023 |
| Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution | arXiv | Star | Website | Dec., 2023 |
| AVID: Any-Length Video Inpainting with Diffusion Model | arXiv | Star | Website | Dec., 2023 |
| Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution | arXiv | Star | - | CVPR 2023 |
| LDMVFI: Video Frame Interpolation with Latent Diffusion Models | arXiv | - | - | Mar., 2023 |
| CaDM: Codec-aware Diffusion Modeling for Neural-enhanced Video Streaming | arXiv | - | - | Nov., 2022 |
| Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos | arXiv | - | - | May., 2023 |

Video Prediction

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction | arXiv | Star | Website | Jun, 2024 |
| STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction | arXiv | Star | - | Dec, 2023 |
| Video Diffusion Models with Local-Global Context Guidance | arXiv | Star | - | IJCAI, 2023 |
| Seer: Language Instructed Video Prediction with Latent Diffusion Models | arXiv | - | Website | Mar., 2023 |
| MaskViT: Masked Visual Pre-Training for Video Prediction | arXiv | Star | Website | Jun, 2022 |
| Diffusion Models for Video Prediction and Infilling | arXiv | Star | Website | TMLR 2022 |
| McVd: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | arXiv | Star | Website | NeurIPS 2022 |
| Diffusion Probabilistic Modeling for Video Generation | arXiv | Star | - | Mar., 2022 |
| Flexible Diffusion Modeling of Long Videos | arXiv | Star | Website | May, 2022 |
| Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models | arXiv | Star | Website | May, 2023 |

Video Editing

General Editing Model

| Title | arXiv | Github | Website | Pub. Date |
|---|---|---|---|---|
| VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing | arXiv | Star | Website | Jun, 2024 |
| FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation | arXiv | - | - | Mar., 2024 |
| FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing | arXiv | - | - | Mar., 2024 |
| DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing | arXiv | - | Website | Mar, 2024 |
| Video Editing via Factorized Diffusion Distillation | arXiv | - | - | Mar, 2024 |
| FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis | arXiv | Star | Website | Dec, 2023 |
| MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers | arXiv | - | Website | Dec, 2023 |
| Neutral Editing Framework for Diffusion-based Video Editing | arXiv | - | Website | Dec, 2023 |
| VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence | arXiv | - | Website | Nov, 2023 |
| VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models | arXiv | Star | Website | Nov, 2023 |
| Motion-Conditioned Image Animation for Video Editing | arXiv | - | Website | Nov, 2023 |
| MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation | arXiv | - | - | Sep, 2023 |
| MagicEdit: High-Fidelity and Temporally Coherent Video Editing | arXiv | - | - | Aug, 2023 |
| Edit Temporal-Consistent Videos with Image Diffusion Model | arXiv | - | - | Aug, 2023 |
| Structure and Content-Guided Video Synthesis With Diffusion Models | arXiv | - | Website | ICCV, 2023 |
| Dreamix: Video Diffusion Models Are General Video Editors | arXiv | - | Website | Feb, 2023 |

Training-free Editing Model

| Title | arXiv | Github | Website | Pub. Date |
|---|---|---|---|---|
| MVOC: a training-free multiple video object composition method with diffusion models | arXiv | Star | Website | Jun, 2024 |
| VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing | arXiv | Star | Website | Jun, 2024 |
| EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing | arXiv | Star | Website | March, 2024 |
| UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing | arXiv | - | Website | Feb, 2024 |
| Object-Centric Diffusion for Efficient Video Editing | arXiv | - | - | Jan, 2024 |
| RealCraft: Attention Control as A Solution for Zero-shot Long Video Editing | arXiv | - | - | Dec, 2023 |
| VidToMe: Video Token Merging for Zero-Shot Video Editing | arXiv | Star | Website | Dec, 2023 |
| A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing | arXiv | Star | Website | Dec, 2023 |
| AnimateZero: Video Diffusion Models are Zero-Shot Image Animators | arXiv | Star | - | Dec, 2023 |
| RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models | arXiv | Star | Website | Dec, 2023 |
| BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models | arXiv | - | Website | Nov., 2023 |
| Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion | arXiv | - | - | Nov., 2023 |
| FastBlend: a Powerful Model-Free Toolkit Making Video Stylization Easier | arXiv | Star | - | Oct., 2023 |
| LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation | arXiv | - | - | Nov., 2023 |
| Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models | arXiv | - | - | Oct., 2023 |
| LOVECon: Text-driven Training-Free Long Video Editing with ControlNet | arXiv | Star | - | Oct., 2023 |
| FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing | arXiv | - | Website | Oct., 2023 |
| Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models | arXiv | Star | Website | ICLR, 2024 |
| MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance | arXiv | - | - | Aug., 2023 |
| EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints | arXiv | - | - | Aug., 2023 |
| ControlVideo: Training-free Controllable Text-to-Video Generation | arXiv | Star | - | May, 2023 |
| TokenFlow: Consistent Diffusion Features for Consistent Video Editing | arXiv | Star | Website | Jul., 2023 |
| VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing | arXiv | - | Website | Jun., 2023 |
| Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation | arXiv | - | Website | Jun., 2023 |
| Zero-Shot Video Editing Using Off-the-Shelf Image Diffusion Models | arXiv | Star | Website | Mar., 2023 |
| FateZero: Fusing Attentions for Zero-shot Text-based Video Editing | arXiv | Star | Website | Mar., 2023 |
| Pix2video: Video Editing Using Image Diffusion | arXiv | - | Website | Mar., 2023 |
| InFusion: Inject and Attention Fusion for Multi Concept Zero Shot Text based Video Editing | arXiv | - | Website | Aug., 2023 |
| Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising | arXiv | Star | Website | May, 2023 |

One-shot Editing Model

| Title | arXiv | Github | Website | Pub. & Date |
|---|---|---|---|---|
| Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models | arXiv | - | Website | Feb., 2024 |
| MotionCrafter: One-Shot Motion Customization of Diffusion Models | arXiv | Star | - | Dec., 2023 |
| DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing | arXiv | - | Website | Dec., 2023 |
| MotionEditor: Editing Video Motion via Content-Aware Diffusion | arXiv | Star | Website | CVPR, 2024 |
| Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning | arXiv | - | Website | Nov., 2023 |
| Cut-and-Paste: Subject-Driven Video Editing with Attention Control | arXiv | - | - | Nov, 2023 |
| StableVideo: Text-driven Consistency-aware Diffusion Video Editing | arXiv | Star | Website | ICCV, 2023 |
| Shape-aware Text-driven Layered Video Editing | arXiv | - | - | CVPR, 2023 |
| SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-guided Video Editing | arXiv | Star | - | May, 2023 |
| Towards Consistent Video Editing with Text-to-Image Diffusion Models | arXiv | - | - | Mar., 2023 |
| Edit-A-Video: Single Video Editing with Object-Aware Consistency | arXiv | - | Website | Mar., 2023 |
| Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation | arXiv | Star | Website | ICCV, 2023 |
| ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing | arXiv | Star | Website | May, 2023 |
| Video-P2P: Video Editing with Cross-attention Control | arXiv | Star | Website | Mar., 2023 |
| SinFusion: Training Diffusion Models on a Single Image or Video | arXiv | Star | Website | Nov., 2022 |

Instruct-guided Video Editing

| Title | arXiv | Github | Website | Pub. Date |
|---|---|---|---|---|
| VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing | arXiv | Star | Website | Jun, 2024 |
| EffiVED: Efficient Video Editing via Text-instruction Diffusion Models | arXiv | - | - | Mar, 2024 |
| Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis | arXiv | - | Website | Dec, 2023 |
| Neural Video Fields Editing | arXiv | Star | Website | Dec, 2023 |
| VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models | arXiv | Star | Website | Nov, 2023 |
| Consistent Video-to-Video Transfer Using Synthetic Dataset | arXiv | - | - | Nov., 2023 |
| InstructVid2Vid: Controllable Video Editing with Natural Language Instructions | arXiv | - | - | May, 2023 |
| Collaborative Score Distillation for Consistent Visual Synthesis | arXiv | - | - | July, 2023 |

Motion-guided Video Editing

| Title | arXiv | Github | Website | Pub. Date |
|---|---|---|---|---|
| MotionCtrl: A Unified and Flexible Motion Controller for Video Generation | arXiv | Star | Website | Nov, 2023 |
| Drag-A-Video: Non-rigid Video Editing with Point-based Interaction | arXiv | - | Website | Nov, 2023 |
| DragVideo: Interactive Drag-style Video Editing | arXiv | Star | - | Nov, 2023 |
| VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet | arXiv | - | Website | July, 2023 |

Sound-guided Video Editing

| Title | arXiv | Github | Website | Pub. Date |
|---|---|---|---|---|
| Speech Driven Video Editing via an Audio-Conditioned Diffusion Model | arXiv | - | - | May., 2023 |
| Soundini: Sound-Guided Diffusion for Natural Video Editing | arXiv | Star | Website | Apr., 2023 |

Multi-modal Control Editing Model

| Title | arXiv | Github | Website | Pub. Date |
|---|---|---|---|---|
| AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks | - | Star | Website | Dec, 2023 |
| Motionshop: An application of replacing the characters in video with 3D avatars | - | Star | Website | Dec, 2023 |
| Anything in Any Scene: Photorealistic Video Object Insertion | arXiv | Star | Website | Jan, 2024 |
| DreamVideo: Composing Your Dream Videos with Customized Subject and Motion | arXiv | Star | Website | Dec, 2023 |
| MagicStick: Controllable Video Editing via Control Handle Transformations | arXiv | Star | Website | Nov, 2023 |
| SAVE: Protagonist Diversification with Structure Agnostic Video Editing | arXiv | - | Website | Nov, 2023 |
| MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation | arXiv | - | - | May, 2023 |
| CCEdit: Creative and Controllable Video Editing via Diffusion Models | arXiv | - | - | Sep, 2023 |
| Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts | arXiv | Star | Website | May, 2023 |

Domain-specific Editing Model

| Title | arXiv | Github | Website | Pub. Date |
| :--- | :---: | :---: | :---: | :---: |
| Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation | arXiv | - | Website | Jan., 2024 |
| Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models | arXiv | - | Website | Jan., 2024 |
| Training-Free Semantic Video Composition via Pre-trained Diffusion Model | arXiv | - | - | Jan, 2024 |
| Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models | arXiv | - | Website | CVPR 2023 |
| Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator | arXiv | - | - | May, 2023 |
| DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis | arXiv | - | - | Aug, 2023 |
| Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer | arXiv | Star | - | May, 2023 |
| Instruct-Video2Avatar: Video-to-Avatar Generation with Instructions | arXiv | Star | - | Jun, 2023 |
| Video Colorization with Pre-trained Text-to-Image Diffusion Models | arXiv | Star | Website | Jun, 2023 |
| Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding | arXiv | Star | Website | CVPR 2023 |

Non-diffusion Editing Model

| Title | arXiv | Github | Website | Pub. Date |
| :--- | :---: | :---: | :---: | :---: |
| DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing | arXiv | - | Website | Oct., 2023 |
| INVE: Interactive Neural Video Editing | arXiv | - | Website | Jul., 2023 |
| Shape-Aware Text-Driven Layered Video Editing | arXiv | - | Website | Jan., 2023 |

Video Understanding

| Title | arXiv | Github | Website | Pub. Date |
| :--- | :---: | :---: | :---: | :---: |
| EchoReel: Enhancing Action Generation of Existing Video Diffusion Models | arXiv | - | - | Mar., 2024 |
| VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model | arXiv | - | - | Mar., 2024 |
| SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion | arXiv | - | - | Mar., 2024 |
| VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models | arXiv | - | - | Mar., 2024 |
| Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation | arXiv | - | - | Mar., 2024 |
| DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction | arXiv | - | - | Mar., 2024 |
| Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval | arXiv | - | - | Jan., 2024 |
| Diffusion Reward: Learning Rewards via Conditional Video Diffusion | arXiv | Star | Website | Dec., 2023 |
| ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models | arXiv | - | Website | Nov., 2023 |
| Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models | arXiv | Star | - | Nov., 2023 |
| Flow-Guided Diffusion for Video Inpainting | arXiv | Star | - | Nov., 2023 |
| Breathing Life Into Sketches Using Text-to-Video Priors | arXiv | - | - | Nov., 2023 |
| Infusion: Internal Diffusion for Video Inpainting | arXiv | - | - | Nov., 2023 |
| DiffusionVMR: Diffusion Model for Video Moment Retrieval | arXiv | - | - | Aug., 2023 |
| DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation | arXiv | - | - | Aug., 2023 |
| CoTracker: It is Better to Track Together | arXiv | Star | Website | Aug., 2023 |
| Unsupervised Video Anomaly Detection with Diffusion Models Conditioned on Compact Motion Representations | arXiv | - | - | ICIAP, 2023 |
| Exploring Diffusion Models for Unsupervised Video Anomaly Detection | arXiv | - | - | Apr., 2023 |
| Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection | arXiv | - | - | ICCV, 2023 |
| Diffusion Action Segmentation | arXiv | - | - | Mar., 2023 |
| DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion | arXiv | Star | Website | Mar., 2023 |
| DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | arXiv | Star | - | ICCV, 2023 |
| MomentDiff: Generative Video Moment Retrieval from Random to Real | arXiv | Star | Website | Jul., 2023 |
| Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition | arXiv | Star | Website | Feb., 2023 |
| Refined Semantic Enhancement Towards Frequency Diffusion for Video Captioning | arXiv | - | - | Nov., 2022 |
| A Generalist Framework for Panoptic Segmentation of Images and Videos | arXiv | Star | Website | Oct., 2022 |
| DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models | arXiv | - | - | Jul., 2023 |
| CaDM: Codec-aware Diffusion Modeling for Neural-enhanced Video Streaming | arXiv | - | - | Mar., 2023 |
| Spatial-temporal Transformer-guided Diffusion based Data Augmentation for Efficient Skeleton-based Action Recognition | arXiv | - | - | Jul., 2023 |
| PDPP: Projected Diffusion for Procedure Planning in Instructional Videos | arXiv | Star | - | CVPR 2023 |