<h1 align="center">Diffusion Model-Based Video Editing: A Survey</h1>
<p align="center">
  <a href="https://github.com/wenhao728/awesome-diffusion-v2v"><img src="https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg"></a>
  <a href="https://arxiv.org/abs/2407.07111"><img src="https://img.shields.io/badge/arXiv-2407.07111-B31B1B.svg"></a>
  <a href="https://opensource.org/license/mit/"><img src="https://img.shields.io/badge/license-MIT-blue"></a>
  <img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/wenhao728/awesome-diffusion-v2v?style=social">
</p>
<p align="center">
  <a href="https://github.com/wenhao728">Wenhao Sun</a>,
  <a href="https://github.com/rongchengtu1">Rong-Cheng Tu</a>,
  <a>Jingyi Liao</a>,
  <a>Dacheng Tao</a>
  <br>
  <em>Nanyang Technological University</em>
</p>

https://github.com/wenhao728/awesome-diffusion-v2v/assets/65353366/fd42e40f-265d-4d72-8dc1-bf74d00fe87b

## 📌 Table of Contents

- [Introduction](#introduction)
- [Network and Training Paradigm](#network-and-training-paradigm)
  - [Temporal Adaption](#temporal-adaption)
  - [Structure Conditioning](#structure-conditioning)
  - [Training Modification](#training-modification)
- [Attention Feature Injection](#attention-feature-injection)
  - [Inversion-Based Feature Injection](#inversion-based-feature-injection)
  - [Motion-Based Feature Injection](#motion-based-feature-injection)
- [Diffusion Latents Manipulation](#diffusion-latents-manipulation)
  - [Latent Initialization](#latent-initialization)
  - [Latent Transition](#latent-transition)
- [Canonical Representation](#canonical-representation)
- [Novel Conditioning](#novel-conditioning)
  - [Point-Based Editing](#point-based-editing)
  - [Pose-Guided Human Action Editing](#pose-guided-human-action-editing)
- [📈 V2VBench](#-v2vbench)
- [🍻 Citation](#-citation)

## Introduction

<p align="center"> <img src="asset/taxonomy-repo.png" width="85%"> <br><em>Overview of diffusion-based video editing model components.</em> </p>

The diffusion process defines a Markov chain that progressively adds random noise to data; the model then learns to reverse this process, generating the desired data samples by iteratively denoising pure noise. Deep neural networks parameterize the learned transitions between latent states.
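To make this concrete, below is a minimal sketch of the standard DDPM formulation (illustrative only, not the implementation of any surveyed method): a linear noise schedule, the closed-form forward noising step, and one ancestral reverse step. The `denoiser` callable is a hypothetical stand-in for a trained noise-prediction network.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # forward noise schedule beta_t
alphas = 1.0 - betas                       # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product abar_t

def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

@torch.no_grad()
def p_sample(denoiser, x_t, t):
    """One reverse step: predict the noise, form the posterior mean, add fresh noise."""
    eps = denoiser(x_t, t)  # hypothetical trained noise-prediction network
    mean = (x_t - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean         # final step returns the noise-free estimate
    return mean + betas[t].sqrt() * torch.randn_like(x_t)
```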

## Network and Training Paradigm

### Temporal Adaption

| Method | Paper | Project | Publication | Year |
| :--- | :---: | :---: | :---: | :---: |
| Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation | arXiv | Website, GitHub | ICCV | Dec 2022 |
| Towards Consistent Video Editing with Text-to-Image Diffusion Models | arXiv | | NeurIPS | May 2023 |
| SimDA: Simple Diffusion Adapter for Efficient Video Generation | arXiv | Website, GitHub | Preprint | Aug 2023 |
| VidToMe: Video Token Merging for Zero-Shot Video Editing | arXiv | Website, GitHub | Preprint | Dec 2023 |
| Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis | arXiv | Website | Preprint | Dec 2023 |
| MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers | arXiv | Website | CVPR | Dec 2023 |
| Video Editing via Factorized Diffusion Distillation | arXiv | Website | ECCV | Mar 2024 |
<p align="right">(<a href="#top">back to top</a>)</p>

### Structure Conditioning

| Method | Paper | Project | Publication | Year |
| :--- | :---: | :---: | :---: | :---: |
| Structure and Content-Guided Video Synthesis with Diffusion Models | arXiv | Website | Preprint | Feb 2023 |
| VideoComposer: Compositional Video Synthesis with Motion Controllability | arXiv | Website, GitHub | NeurIPS | Jun 2023 |
| VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet | arXiv | GitHub | Preprint | Jul 2023 |
| MagicEdit: High-Fidelity and Temporally Coherent Video Editing | arXiv | Website, GitHub | Preprint | Aug 2023 |
| CCEdit: Creative and Controllable Video Editing via Diffusion Models | arXiv | Website, GitHub | Preprint | Sep 2023 |
| Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models | arXiv | Website, GitHub | ICLR | Oct 2023 |
| LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation | arXiv | Website, GitHub | Preprint | Oct 2023 |
| Motion-Conditioned Image Animation for Video Editing | arXiv | Website, GitHub | Preprint | Nov 2023 |
| FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis | arXiv | Website, GitHub | CVPR | Dec 2023 |
| EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing | arXiv | Website, GitHub | Preprint | Mar 2024 |
<p align="right">(<a href="#top">back to top</a>)</p>

### Training Modification

| Method | Paper | Project | Publication | Year |
| :--- | :---: | :---: | :---: | :---: |
| Dreamix: Video Diffusion Models are General Video Editors | arXiv | Website | Preprint | Feb 2023 |
| InstructVid2Vid: Controllable Video Editing with Natural Language Instructions | arXiv | | Preprint | May 2023 |
| MotionDirector: Motion Customization of Text-to-Video Diffusion Models | arXiv | Website, GitHub | Preprint | Oct 2023 |
| VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models | arXiv | Website, GitHub | Preprint | Nov 2023 |
| Consistent Video-to-Video Transfer Using Synthetic Dataset | arXiv | GitHub | ICLR | Nov 2023 |
| VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models | arXiv | Website, GitHub | CVPR | Dec 2023 |
| SAVE: Protagonist Diversification with Structure Agnostic Video Editing | arXiv | Website, GitHub | Preprint | Dec 2023 |
| VASE: Object-Centric Appearance and Shape Manipulation of Real Videos | arXiv | Website, GitHub | Preprint | Jan 2024 |
| Still-Moving: Customized Video Generation without Customized Video Data | arXiv | Website, Community Implementation | Preprint | Jul 2024 |
<p align="right">(<a href="#top">back to top</a>)</p>

## Attention Feature Injection

### Inversion-Based Feature Injection

| Method | Paper | Project | Publication | Year |
| :--- | :---: | :---: | :---: | :---: |
| Video-P2P: Video Editing with Cross-attention Control | arXiv | Website, GitHub | CVPR | Mar 2023 |
| Edit-A-Video: Single Video Editing with Object-Aware Consistency | arXiv | Website | Preprint | Mar 2023 |
| FateZero: Fusing Attentions for Zero-shot Text-based Video Editing | arXiv | Website, GitHub | ICCV | Mar 2023 |
| Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models | arXiv | GitHub | Preprint | Mar 2023 |
| Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts | arXiv | Website, GitHub | Preprint | May 2023 |
| UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing | arXiv | Website, GitHub | Preprint | Feb 2024 |
| AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks | arXiv | Website, GitHub | Preprint | Mar 2024 |
<p align="right">(<a href="#top">back to top</a>)</p>

### Motion-Based Feature Injection

| Method | Paper | Project | Publication | Year |
| :--- | :---: | :---: | :---: | :---: |
| TokenFlow: Consistent Diffusion Features for Consistent Video Editing | arXiv | Website, GitHub | ICLR | Jul 2023 |
| FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing | arXiv | Website, GitHub | ICLR | Oct 2023 |
| FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation | arXiv | Website, GitHub | CVPR | Mar 2024 |
<p align="right">(<a href="#top">back to top</a>)</p>

## Diffusion Latents Manipulation

### Latent Initialization

| Method | Paper | Project | Publication | Year |
| :--- | :---: | :---: | :---: | :---: |
| Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators | arXiv | Website, GitHub | ICCV | Mar 2023 |
| Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models | arXiv | Website, GitHub | Preprint | May 2023 |
| Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models | arXiv | | Preprint | May 2023 |
| A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing | arXiv | Website, GitHub | CVPR | Dec 2023 |
<p align="right">(<a href="#top">back to top</a>)</p>

### Latent Transition

| Method | Paper | Project | Publication | Year |
| :--- | :---: | :---: | :---: | :---: |
| Pix2Video: Video Editing using Image Diffusion | arXiv | Website, GitHub | ICCV | Mar 2023 |
| ControlVideo: Training-free Controllable Text-to-Video Generation | arXiv | Website, GitHub | ICLR | May 2023 |
| Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation | arXiv | Website, GitHub | SIGGRAPH | Jun 2023 |
| DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis | arXiv | Website, GitHub | Preprint | Aug 2023 |
| Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer | arXiv | Website, GitHub | CVPR | Nov 2023 |
| RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models | arXiv | Website, GitHub | CVPR | Dec 2023 |
| MotionClone: Training-Free Motion Cloning for Controllable Video Generation | arXiv | Website, GitHub | Preprint | Jun 2024 |
| GenVideo: One-shot target-image and shape aware video editing using T2I diffusion models | arXiv | | CVPR | Apr 2024 |
<p align="right">(<a href="#top">back to top</a>)</p>

## Canonical Representation

| Method | Paper | Project | Publication | Year |
| :--- | :---: | :---: | :---: | :---: |
| Shape-aware Text-driven Layered Video Editing | Open Access | Website, GitHub | CVPR | Jan 2023 |
| VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing | arXiv | Website | TMLR | Jun 2023 |
| CoDeF: Content Deformation Fields for Temporally Consistent Video Processing | arXiv | Website, GitHub | CVPR | Aug 2023 |
| StableVideo: Text-driven Consistency-aware Diffusion Video Editing | arXiv | GitHub | ICCV | Aug 2023 |
| DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing | arXiv | Website | Preprint | Dec 2023 |
| Neural Video Fields Editing | arXiv | Website, GitHub | Preprint | Dec 2023 |
<p align="right">(<a href="#top">back to top</a>)</p>

## Novel Conditioning

### Point-Based Editing

| Method | Paper | Project | Publication | Year |
| :--- | :---: | :---: | :---: | :---: |
| VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence | arXiv | Website, GitHub | CVPR | Dec 2023 |
| DragVideo: Interactive Drag-style Video Editing | arXiv | GitHub | Preprint | Dec 2023 |
| Drag-A-Video: Non-rigid Video Editing with Point-based Interaction | arXiv | | Preprint | Dec 2023 |
| MotionCtrl: A Unified and Flexible Motion Controller for Video Generation | arXiv | GitHub, Website | Preprint | Dec 2023 |
<p align="right">(<a href="#top">back to top</a>)</p>

### Pose-Guided Human Action Editing

| Method | Paper | Project | Publication | Year |
| :--- | :---: | :---: | :---: | :---: |
| Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos | arXiv | Website, GitHub | AAAI | Apr 2023 |
| DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion | arXiv | Website, GitHub | ICCV | Apr 2023 |
| DisCo: Disentangled Control for Realistic Human Dance Generation | arXiv | Website, GitHub | CVPR | Jun 2023 |
| MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion | arXiv | Website, GitHub | ICML | Nov 2023 |
| MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model | arXiv | Website, GitHub | Preprint | Nov 2023 |
| Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation | arXiv | Website, Official GitHub, Community Implementation | Preprint | Nov 2023 |
| Zero-shot High-fidelity and Pose-controllable Character Animation | arXiv | | Preprint | Apr 2024 |
<p align="right">(<a href="#top">back to top</a>)</p>

## 📈 V2VBench

### Leaderboard

V2VBench is a comprehensive benchmark designed to evaluate video editing methods.

For its full composition and evaluation protocol, please refer to the accompanying paper.

## 🍻 Citation

If you find this repository helpful, please consider citing our paper:

```bibtex
@article{sun2024v2vsurvey,
    author  = {Wenhao Sun and Rong-Cheng Tu and Jingyi Liao and Dacheng Tao},
    title   = {Diffusion Model-Based Video Editing: A Survey},
    journal = {CoRR},
    volume  = {abs/2407.07111},
    year    = {2024}
}
```