FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

PyTorch implementation of "FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing".

🎊🎊🎊 We are proud to announce that our paper has been accepted at ICLR 2024! If you are interested in FLATTEN, please give us a star 😬

Thanks to @logtd for integrating FLATTEN into ComfyUI and for the great sample videos! Here is the Link!

https://github.com/yrcong/flatten/assets/47991543/1ad49092-9133-42d0-984f-38c6427bde34

📖 Abstract

🚩 Text-to-Video 🚩 Training-free 🚩 Plug-and-Play<br>

Text-to-video editing aims to edit the visual appearance of a source video conditioned on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. In this work, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue in text-to-video editing. Our method, FLATTEN, enforces patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency of the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing method to improve its visual consistency.

Requirements

First, download Stable Diffusion 2.1 (base) here.
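
If you prefer the command line, one way to fetch the weights is from the Hugging Face model hub (a sketch, assuming git-lfs is installed; the link above remains the authoritative source):

```bash
# Fetch Stable Diffusion 2.1 (base) from Hugging Face (assumes git-lfs is set up)
git lfs install
git clone https://huggingface.co/stabilityai/stable-diffusion-2-1-base
```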

Then install the required Python packages.
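
As a rough sketch of a typical diffusers-based PyTorch setup (the package names below are assumptions; the repository's requirements file is the authoritative list):

```bash
# Assumed dependencies for a diffusers/PyTorch text-to-video editing pipeline --
# not the authoritative list; see the repo's requirements file for exact versions
pip install torch torchvision diffusers transformers accelerate einops opencv-python
```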

Usage

For text-to-video editing, a source video and a textual prompt should be given. You can run the script to reproduce the teaser video easily:

sh cat.sh

or with the command:

python inference.py \
--prompt "A Tiger, high quality" \
--neg_prompt "a cat with big eyes, deformed" \
--guidance_scale 20 \
--video_path "data/puff.mp4" \
--output_path "outputs/" \
--video_length 32 \
--width 512 \
--height 512 \
--old_qk 0 \
--frame_rate 2

Editing tricks

<table class="center">
  <tr>
    <td width=30% align="center"><img src="data/source.gif" raw=true></td>
    <td width=30% align="center"><img src="data/tiger_empty.gif" raw=true></td>
    <td width=30% align="center"><img src="data/tiger_neg.gif" raw=true></td>
  </tr>
  <tr>
    <td width=30% align="center">Source video</td>
    <td width=30% align="center">Negative prompt: " "</td>
    <td width=30% align="center">Negative prompt: "A cat with big eyes, deformed."</td>
  </tr>
  <tr>
    <td width=30% align="center"><img src="data/guidance10.gif" raw=true></td>
    <td width=30% align="center"><img src="data/guidance17.5.gif" raw=true></td>
    <td width=30% align="center"><img src="data/guidance20.gif" raw=true></td>
  </tr>
  <tr>
    <td width=30% align="center">Classifier-free guidance: 10</td>
    <td width=30% align="center">Classifier-free guidance: 17.5</td>
    <td width=30% align="center">Classifier-free guidance: 25</td>
  </tr>
</table>
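
To experiment with these knobs yourself, the same inference.py flags from the Usage section can be varied; the commands below simply mirror the comparisons above and are illustrative rather than scripts shipped with the repo:

```bash
# Empty negative prompt (weaker suppression of unwanted source appearance)
python inference.py --prompt "A Tiger, high quality" --neg_prompt " " \
  --guidance_scale 20 --video_path "data/puff.mp4" --output_path "outputs/" \
  --video_length 32 --width 512 --height 512 --old_qk 0 --frame_rate 2

# Lower classifier-free guidance for a weaker, more source-faithful edit
python inference.py --prompt "A Tiger, high quality" \
  --neg_prompt "a cat with big eyes, deformed" \
  --guidance_scale 10 --video_path "data/puff.mp4" --output_path "outputs/" \
  --video_length 32 --width 512 --height 512 --old_qk 0 --frame_rate 2
```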

BibTeX

@article{cong2023flatten,
  title={FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing},
  author={Cong, Yuren and Xu, Mengmeng and Simon, Christian and Chen, Shoufa and Ren, Jiawei and Xie, Yanping and Perez-Rua, Juan-Manuel and Rosenhahn, Bodo and Xiang, Tao and He, Sen},
  journal={arXiv preprint arXiv:2310.05922},
  year={2023}
}