Awesome

VideoElevator

Official pytorch implementation of "VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models"

https://github.com/YBYBZhang/VideoElevator/assets/40799060/f850bc9c-ccf6-48b3-8011-394986aade71

VideoElevator aims to elevate the quality of generated videos with text-to-image diffusion models. It is training-free and plug-and-play to support cooperation of various text-to-video and text-to-image diffusion models.

News

[04/07/2024] We release the code of VideoElevator, including three example scripts.

Method

<p align="center"> <img src="assets/introduction.png" width="1080px"/> </p> **Top:** Taking text τ as input, conventional T2V performs both temporal and spatial modeling and accumulates low-quality contents throughout sampling chain.

Bottom: VideoElevator explicitly decompose each step into temporal motion refining and spatial quality elevating, where the former encapsulates T2V to enhance temporal consistency and the latter harnesses T2I to provide more faithful details, e.g., dressed in suit. Empirically, applying T2V in several timesteps is enough to ensure temporal consistency.

Setup

1. Download Weights

All pre-trained weights are downloaded to checkpoints/ directory, including the pre-trained weights of text-to-video and text-to-image diffusion models. Users can download the corresponding weights according to their needs.

Text-to-video diffusion models: LaVie, ZeroScope, AnimateLCM.
Text-to-image diffusion models: StableDiffusion v1.5, StableDiffusion v2.1-base.
[Optional] LoRA from Civitai: RCNZ Cartoon, RealisticVision, Lyriel, ToonYou.

2. Requirements

conda create -n videoelevator python=3.10
conda activate videoelevator
pip install -r requirements.txt

Inference

We provide three example scripts of VideoElevator in example_scripts/ directory, and recommend to run example_scripts/sd_animatelcm.py. To perform improved text-to-video generation, directly run command python example_scripts/sd_animatelcm.py.

Notably, all scripts can run with less than 11 GB VRAM (e.g., 2080Ti GPU).

[Optional] Hyper-parameters

You can define the following hyper-parameters, and check their effects in Ablation studies of project page:

stable_steps: the choice of timestep in temporal motion refining.
stable_num: the number of steps used in T2V denoising.

Citation

@article{zhang2024videoelevator,
  title={VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models},
  author={Zhang, Yabo and Wei, Yuxiang and Lin, Xianhui and Hui, Zheng and Ren, Peiran and Xie, Xuansong and Ji, Xiangyang and Zuo, Wangmeng},
  journal={arXiv preprint arXiv:2403.05438},
  year={2024}
}

Acknowledgement

This repository borrows code from Diffusers, LaVie, AnimateLCM, and FreeInit. Thanks for their contributions!