<div align="center"> <h1>vid2vid-zero for Zero-Shot Video Editing</h1> <h3><a href="https://arxiv.org/abs/2303.17599">Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models</a></h3>

Wen Wang<sup>1*</sup>,   Kangyang Xie<sup>1*</sup>,   Zide Liu<sup>1*</sup>,   Hao Chen<sup>1</sup>,   Yue Cao<sup>2</sup>,   Xinlong Wang<sup>2</sup>,   Chunhua Shen<sup>1</sup>

<sup>1</sup>ZJU,   <sup>2</sup>BAAI

<br>

Hugging Face Demo

<img src="docs/vid2vid-zero.png" /> <br> </div>

We propose vid2vid-zero, a simple yet effective method for zero-shot video editing. vid2vid-zero leverages off-the-shelf image diffusion models and requires no training on any video. At its core are a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.

Installation

Requirements

pip install -r requirements.txt

Installing xformers is highly recommended: its memory-efficient attention reduces GPU memory usage and speeds up inference.
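A typical install is a single pip command (note: prebuilt xformers wheels must match your installed PyTorch/CUDA version, so check compatibility if the import fails):

```shell
# install xformers; wheels are version-matched to torch/CUDA
pip install xformers

# verify the install
python -c "import xformers; print(xformers.__version__)"
```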

Weights

[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from 🤗 Hugging Face (e.g., Stable Diffusion v1-4, v2-1). We use Stable Diffusion v1-4 by default.
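If you want a local copy of the weights rather than relying on the automatic download, one common way to fetch them is cloning the Hugging Face repo with git-lfs (repo id `CompVis/stable-diffusion-v1-4`; the destination directory is your choice):

```shell
# one-time setup for large-file support
git lfs install

# clone the Stable Diffusion v1-4 weights (several GB)
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4
```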

Zero-shot testing

Simply run:

accelerate launch test_vid2vid_zero.py --config path/to/config

For example:

accelerate launch test_vid2vid_zero.py --config configs/car-moving.yaml
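The config file ties together the pretrained weights, the input video with its source prompt, and the edit prompts to render. The authoritative schema is the one in `configs/` in this repo; the sketch below is purely illustrative, and every field name in it is an assumption:

```yaml
# Hypothetical sketch only -- field names are assumptions; consult
# configs/car-moving.yaml in the repo for the real schema.
pretrained_model_path: "checkpoints/stable-diffusion-v1-4"
input_data:
  video_path: "data/car-moving.mp4"
  prompt: "A car is moving on the road"        # source prompt describing the video
validation_data:
  prompts:
    - "A Porsche car is moving on the desert"  # edit prompt
    - "A jeep car is moving on the snow"       # edit prompt
```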

Gradio Demo

Launch the local demo built with gradio:

python app.py

Or you can use our online gradio demo here.

Note that we disable Null-text Inversion and enable fp16 for faster demo response.
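fp16 halves memory traffic and speeds up inference at the cost of precision: IEEE 754 half precision keeps only a 10-bit mantissa, roughly three significant decimal digits. As a stdlib-only illustration of that trade-off (the `to_fp16` helper below is hypothetical, not part of this repo), Python's `struct` module can round-trip a value through half precision:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("e", struct.pack("e", x))[0]

pi = 3.14159265
print(to_fp16(pi))  # 3.140625 -- only ~3 significant decimal digits survive
```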

Examples

<table class="center">
<tr>
  <td style="text-align:center;"><b>Input Video</b></td>
  <td style="text-align:center;"><b>Output Video</b></td>
  <td style="text-align:center;"><b>Input Video</b></td>
  <td style="text-align:center;"><b>Output Video</b></td>
</tr>
<tr>
  <td width=25% style="text-align:center;color:gray;">"A car is moving on the road"</td>
  <td width=25% style="text-align:center;">"A Porsche car is moving on the desert"</td>
  <td width=25% style="text-align:center;color:gray;">"A car is moving on the road"</td>
  <td width=25% style="text-align:center;">"A jeep car is moving on the snow"</td>
</tr>
<tr>
  <td colspan="2"><img src="examples/jeep-moving_Porsche.gif"></td>
  <td colspan="2"><img src="examples/jeep-moving_snow.gif"></td>
</tr>
<tr>
  <td width=25% style="text-align:center;color:gray;">"A man is running"</td>
  <td width=25% style="text-align:center;">"Stephen Curry is running in Time Square"</td>
  <td width=25% style="text-align:center;color:gray;">"A man is running"</td>
  <td width=25% style="text-align:center;">"A man is running in New York City"</td>
</tr>
<tr>
  <td colspan="2"><img src="examples/man-running_stephen.gif"></td>
  <td colspan="2"><img src="examples/man-running_newyork.gif"></td>
</tr>
<tr>
  <td width=25% style="text-align:center;color:gray;">"A child is riding a bike on the road"</td>
  <td width=25% style="text-align:center;">"A child is riding a bike on the flooded road"</td>
  <td width=25% style="text-align:center;color:gray;">"A child is riding a bike on the road"</td>
  <td width=25% style="text-align:center;">"A lego child is riding a bike on the road"</td>
</tr>
<tr>
  <td colspan="2"><img src="examples/child-riding_flooded.gif"></td>
  <td colspan="2"><img src="examples/child-riding_lego.gif"></td>
</tr>
<tr>
  <td width=25% style="text-align:center;color:gray;">"A car is moving on the road"</td>
  <td width=25% style="text-align:center;">"A car is moving on the snow"</td>
  <td width=25% style="text-align:center;color:gray;">"A car is moving on the road"</td>
  <td width=25% style="text-align:center;">"A jeep car is moving on the desert"</td>
</tr>
<tr>
  <td colspan="2"><img src="examples/red-moving_snow.gif"></td>
  <td colspan="2"><img src="examples/red-moving_desert.gif"></td>
</tr>
</table>

Citation

@article{vid2vid-zero,
  title={Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models},
  author={Wang, Wen and Xie, Kangyang and Liu, Zide and Chen, Hao and Cao, Yue and Wang, Xinlong and Shen, Chunhua},
  journal={arXiv preprint arXiv:2303.17599},
  year={2023}
}

Acknowledgement

This project builds on Tune-A-Video, diffusers, and prompt-to-prompt.

Contact

We are hiring at all levels at the BAAI Vision Team, including full-time researchers, engineers, and interns. If you are interested in working with us on foundation models, visual perception, and multimodal learning, please contact Xinlong Wang (wangxinlong@baai.ac.cn) and Yue Cao (caoyue@baai.ac.cn).