A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing, CVPR 2024

Maomao Li, Yu Li, Tianyu Yang, Yunfei Liu, Dongxu Yue, Zhihui Lin, Dong Xu

<a href='https://arxiv.org/abs/2312.05856'><img src='https://img.shields.io/badge/ArXiv-2312.05856-red'></a> <a href='https://stem-inv.github.io/page/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>

Reconstruction comparison between DDIM and STEM inversion. DDIM inversion in existing video editing methods usually exploits only 1-frame or 2-frame context to invert each frame. As a more radical alternative, we also design an inflated DDIM inversion that uses all-frame context as reference. Here, we use the typical DDIM reconstruction procedure to compare video reconstructions: both our STEM inversion and the inflated DDIM inversion can exploit context from the entire video, yet the latter is resource-consuming and still yields inferior reconstruction quality.

🦴 Abstract

<b>TL; DR:</b> STEM inversion is an efficient video inversion method for text-guided video editing.

<details><summary>Click for the full abstract</summary>

We present a video inversion approach for zero-shot video editing, which aims to model the input video with a low-rank representation during the inversion process. Existing video editing methods usually apply the typical 2D DDIM inversion or a naive spatial-temporal DDIM inversion before editing, which leverages a time-varying representation for each frame to derive the noisy latents. Unlike most existing approaches, we propose a Spatial-Temporal Expectation-Maximization (STEM) inversion, which formulates the dense video features in an expectation-maximization manner and iteratively estimates a more compact basis set to represent the whole video. Each frame then uses this fixed, global representation for inversion, which is friendlier to temporal consistency during reconstruction and editing. Extensive qualitative and quantitative experiments demonstrate that our STEM inversion achieves consistent improvements on two state-of-the-art video editing methods.

</details>

🚀 Method Overview

<div align="center"> <img src='images/STEM_DDIM_inv.png'/> </div>

Illustration of the proposed STEM inversion method. We estimate a more compact representation (bases $\mathbf{\mu}$) for the input video via the EM algorithm. The ST-E step and ST-M step are executed alternately for $R$ iterations until convergence. The self-attention (SA) in our STEM inversion is denoted STEM-SA, where the $\mathrm{Key}$ and $\mathrm{Value}$ embeddings are derived by projecting the converged $\mathbf{\mu}$.
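To make the alternation of the ST-E and ST-M steps concrete, below is a minimal PyTorch sketch, assuming the video features are flattened to shape (N, C) with N = T·H·W; the function names (`estimate_bases`, `stem_self_attention`) and the initialization/normalization choices are illustrative, not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

def estimate_bases(z, num_bases=256, n_iters=5):
    """EM-style estimation of a compact basis set for spatial-temporal features.
    z: (N, C) flattened video features, N = T*H*W.  Returns mu: (num_bases, C)."""
    N, C = z.shape
    # Initialize bases from randomly sampled feature vectors (illustrative choice).
    mu = z[torch.randperm(N)[:num_bases]].clone()              # (K, C)
    for _ in range(n_iters):
        # ST-E step: soft responsibility of every feature vector to every basis.
        resp = F.softmax(z @ mu.t(), dim=1)                    # (N, K)
        # ST-M step: update each basis as the responsibility-weighted mean.
        mu = resp.t() @ z                                       # (K, C)
        mu = mu / (resp.sum(dim=0).unsqueeze(1) + 1e-6)         # normalize per basis
    return mu

def stem_self_attention(q_frame, mu, to_k, to_v, scale):
    """Self-attention where Key/Value are projections of the converged bases.
    q_frame: (HW, C) query features of a single frame; mu: (K, C) bases."""
    k, v = to_k(mu), to_v(mu)                                   # (K, C) each
    attn = F.softmax((q_frame @ k.t()) * scale, dim=-1)         # (HW, K)
    return attn @ v                                             # (HW, C)
```

Because the Key/Value sequence length is fixed at `num_bases` (256 by default) instead of growing with T·H·W, the per-frame attention cost stays constant regardless of video length, which is where the efficiency of STEM inversion comes from.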

📋 Changelog

🏗️ Todo

▢️ Quick Start for TokenFlow video editing using STEM inversion

Environment

Prepare the Conda environment using the following commands:

```bash
git clone https://github.com/STEM-Inv/stem-inv
cd stem-inv
cd TokenFlow-Edit
conda create -n stem-tf python=3.9
conda activate stem-tf
pip install -r requirements.txt
```
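The requirements file is expected to pull in PyTorch along with the diffusion-model stack; as an optional sanity check (assuming you intend to run on a CUDA GPU), you can run the following in the activated environment:

```python
import torch

# Confirm the installed PyTorch version and that a CUDA device is visible.
print(torch.__version__, torch.cuda.is_available())
```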

Video Editing

We provide demo source videos in the data folder. The corresponding configs for STEM inversion and editing are in the configs folder. Below are the instructions for performing video editing on the provided source videos. You can run the following command to perform the inversion and editing processes at once:

```bash
bash run_editing.sh
```

The inversion results are saved in Stem_Inv_Latents/base_256_iter_5, and the editing results are saved in STEM_TF_results.

If you are only interested in the reconstruction results of STEM inversion, please run:

```bash
bash run_inversion.sh
```

Note that our default setting uses 256 bases to represent the whole input video, with 5 iterations applied for EM algorithm convergence. You can also try other configurations by modifying the values of "num_bases" and "n_iters" in line 88 of tokenflow_utils.py.
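For reference, the change amounts to editing two values; the snippet below is only a hedged illustration of what that configuration plausibly looks like, and the surrounding code in tokenflow_utils.py may differ:

```python
# Defaults: 256 bases for the whole video, 5 EM iterations until convergence.
# The output folder name Stem_Inv_Latents/base_256_iter_5 above reflects these values.
num_bases = 256
n_iters = 5
```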

📰 Editing result

<table> <tr> <td><img src="assets/examples/editing_example.gif"></td> </tr> <tr> <td width="100%" align="center">Source prompt: A man is playing tennis. &nbsp;&nbsp;&nbsp;&nbsp; Target prompt: Spider-Man is playing tennis.</td> </tr> </table>

📎 Citation

```bibtex
@inproceedings{li2024video,
  title={A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing},
  author={Li, Maomao and Li, Yu and Yang, Tianyu and Liu, Yunfei and Yue, Dongxu and Lin, Zhihui and Xu, Dong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={7528--7537},
  year={2024}
}
```

📣 Disclaimer

This is the official code of STEM Inversion. The copyrights of the demo images and audio belong to community users. Feel free to contact us if you would like them removed.

💞 Acknowledgements

The code is built upon the repositories below; we thank all the contributors for open-sourcing their work.