Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
<a href='https://yhzhai.github.io/mcm/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2406.06890'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a> <a href='https://huggingface.co/yhzhai/mcm'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HF-checkpoint-yellow'></a> <a href='https://huggingface.co/spaces/yhzhai/mcm'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HF-demo-yellow'></a>
Yuanhao Zhai<sup>1</sup>, Kevin Lin<sup>2</sup>, Zhengyuan Yang<sup>2</sup>, Linjie Li<sup>2</sup>, Jianfeng Wang<sup>2</sup>, Chung-Ching Lin<sup>2</sup>, David Doermann<sup>1</sup>, Junsong Yuan<sup>1</sup>, Lijuan Wang<sup>2</sup>
<sup>1</sup>State University of New York at Buffalo | <sup>2</sup>Microsoft
NeurIPS 2024
TL;DR: Our motion consistency model not only accelerates the sampling process of text-to-video diffusion models, but can also leverage an additional high-quality image dataset to improve the frame quality of the generated videos.
<!-- **All training, inference, and evaluation code, as well as model checkpoints will be released in the coming two weeks. Please stay tuned!** -->

News
[09/2024] MCM was accepted to NeurIPS 2024!
[07/2024] Released the learnable head parameters at this box link.
[06/2024] Our MCM achieves strong performance (using 4 sampling steps) on the ChronoMagic-Bench! Check out the leaderboard here.
[06/2024] Released the training code, pre-trained checkpoint, Gradio demo, and Colab demo.
[06/2024] Released the paper and project page.
Contents
Getting started <a name="getting-started"></a>
Environment setup <a name="env-setup"></a>
Instead of installing diffusers, peft, and open_clip from their official repos, we use our modified versions specified in the requirements.txt file. This is particularly important for diffusers and open_clip, due to the former's currently limited support for loading video diffusion model LoRA weights, and the latter's distributed training dependency.
To set up the environment, run the following commands:
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118 # please modify the cuda version according to your env
pip install -r requirements.txt
pip install scipy==1.11.1
pip install https://github.com/podgorskiy/dnnlib/releases/download/0.0.1/dnnlib-0.0.1-py3-none-any.whl
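To verify the installation, a quick import check like the sketch below can catch missing or mismatched packages early (a minimal sketch; the printed versions depend on your environment):

```python
# Sanity check: confirm the pinned packages import and CUDA is visible.
import torch
import torchvision
import diffusers
import peft
import open_clip  # installed from our modified requirements.txt

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("diffusers:", diffusers.__version__)
print("peft:", peft.__version__)
```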
Data preparation <a name="data"></a>
Please prepare the video and optional image datasets in the webdataset format.
Specifically, please wrap the video/image files and their corresponding .json format metadata into .tar files. Here is an example structure of the video .tar file:
.
├── video_0.json
├── video_0.mp4
...
├── video_n.json
└── video_n.mp4
The .json files contain video/image captions in key-value pairs, for example: {"caption": "World map in gray - world map with animated circles and binary numbers"}.
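For reference, a shard in this layout can be assembled with Python's standard tarfile module. The sketch below is a minimal example; the directory, shard name, and captions are placeholders for your own data:

```python
# Minimal sketch: pack paired .mp4 files and .json captions into a
# webdataset-style .tar shard. Paths and shard name are placeholders.
import json
import tarfile
from pathlib import Path

video_dir = Path("videos")  # directory containing video_0.mp4, video_1.mp4, ...
captions = {
    "video_0": "World map in gray - world map with animated circles and binary numbers",
    # "video_1": "...",
}

with tarfile.open("shard_00000.tar", "w") as tar:
    for key, caption in captions.items():
        # Write the metadata next to the video under the same basename,
        # matching the video_i.json / video_i.mp4 layout shown above.
        json_path = video_dir / f"{key}.json"
        json_path.write_text(json.dumps({"caption": caption}))
        tar.add(video_dir / f"{key}.mp4", arcname=f"{key}.mp4")
        tar.add(json_path, arcname=f"{key}.json")
```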
We provide our generated anime, realistic, and 3D cartoon style image datasets here (coming soon). Due to the dataset agreements, we cannot publicly release the WebVid and LAION-aes datasets.
DINOv2 and CLIP checkpoint download <a name="download"></a>
We provide a script, scripts/download.py, to download the DINOv2 and CLIP checkpoints.
python scripts/download.py
Wandb integration <a name="wandb"></a>
Please input your wandb API key in utils/wandb.py to enable wandb logging.
If you do not use wandb, please remove wandb from the --report_to argument in the training command.
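For reference, wandb authentication typically boils down to something like the sketch below. This is only an illustration of the wandb API; the actual key placement for this repo is in utils/wandb.py:

```python
# Illustration of wandb authentication (see utils/wandb.py for where
# this repo actually reads the key). Do not commit real API keys.
import wandb

WANDB_API_KEY = "your-api-key-here"  # placeholder
wandb.login(key=WANDB_API_KEY)
```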
Training <a name="train"></a>
We leverage accelerate for distributed training, and we support two different base text-to-video diffusion models: ModelScopeT2V and AnimateDiff. For both models, we train LoRA weights instead of fine-tuning all parameters.
ModelScopeT2V
For ModelScopeT2V, our code supports both pure video diffusion distillation training and frame quality improvement training.
By default, the training script requires 8 GPUs, each with 80GB of GPU memory, to fit a batch size of 4. The minimal GPU memory requirement is 32GB for a batch size of 1. Please adjust the --train_batch_size argument accordingly for different GPU memory sizes.
Before running the scripts, please modify the data path in the environment variables defined at the top of each script.
Diffusion distillation
We provide the training script in scripts/modelscopet2v_distillation.sh:
bash scripts/modelscopet2v_distillation.sh
Frame quality improvement
We provide the training script in scripts/modelscopet2v_improvement.sh. Before running, please assign the IMAGE_DATA_PATH in the script.
bash scripts/modelscopet2v_improvement.sh
AnimateDiff
Due to the higher resolution requirement, training MCM with the AnimateDiff base model requires at least 70GB of GPU memory to fit a single batch.
We provide the diffusion distillation training script in scripts/animatediff_distillation.sh.
bash scripts/animatediff_distillation.sh
Inference <a name="infer"></a>
We provide our pre-trained checkpoint here, Gradio demo here, and Colab demo here. demo.py showcases how to run our MCM on a local machine.
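As a rough reference, 4-step sampling with the ModelScopeT2V-based LoRA looks like the sketch below. The weight_name path is an assumption, so please follow demo.py and the Hugging Face model card for the exact loading code:

```python
# Minimal sketch of 4-step sampling with the ModelScopeT2V-based MCM LoRA.
# The weight_name below is a placeholder; check demo.py / the HF model card
# for the exact file layout of the yhzhai/mcm checkpoint.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights(
    "yhzhai/mcm",
    weight_name="modelscopet2v-distillation/pytorch_lora_weights.safetensors",  # placeholder path
)
pipe.to("cuda")

video_frames = pipe(
    "A spaceship flying through a nebula",
    num_inference_steps=4,   # MCM enables few-step sampling
    num_frames=16,
    guidance_scale=1.0,      # distilled models typically skip CFG; adjust per demo.py
).frames
```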
Feel free to try out our MCM!
MCM weights <a name="weight"></a>
We provide our pre-trained checkpoint here.
For research/debugging purposes, we also provide intermediate parameters and states at this box link. The folder (~1.12GB) includes the model weights, discriminator weights, scheduler states, optimizer states, and learnable head weights.
Acknowledgement <a name="ack"></a>
Some of our implementations are borrowed from the great repos below.
Citation <a name="cite"></a>
@article{zhai2024motion,
  title={Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation},
  author={Zhai, Yuanhao and Lin, Kevin and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Chung-Ching and Doermann, David and Yuan, Junsong and Wang, Lijuan},
  year={2024},
  journal={arXiv preprint arXiv:2406.06890},
  website={https://yhzhai.github.io/mcm/},
}