Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
<a href='https://yhzhai.github.io/mcm/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2406.06890'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a> <a href='https://huggingface.co/yhzhai/mcm'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HF-checkpoint-yellow'></a> <a href='https://huggingface.co/spaces/yhzhai/mcm'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HF-demo-yellow'></a>
Yuanhao Zhai<sup>1</sup>, Kevin Lin<sup>2</sup>, Zhengyuan Yang<sup>2</sup>, Linjie Li<sup>2</sup>, Jianfeng Wang<sup>2</sup>, Chung-Ching Lin<sup>2</sup>, David Doermann<sup>1</sup>, Junsong Yuan<sup>1</sup>, Lijuan Wang<sup>2</sup>
<sup>1</sup>State University of New York at Buffalo | <sup>2</sup>Microsoft
NeurIPS 2024
TL;DR: Our motion consistency model not only accelerates the sampling process of text-to-video diffusion models, but can also leverage an additional high-quality image dataset to improve the frame quality of the generated videos.
<!-- **All training, inference, and evaluation code, as well as model checkpoints will be released in the coming two weeks. Please stay tuned!** -->

News
[09/2024] MCM was accepted to NeurIPS 2024!
[07/2024] Released the learnable head parameters at this box link.
[06/2024] Our MCM achieves strong performance (using 4 sampling steps) on the ChronoMagic-Bench! Check out the leaderboard here.
[06/2024] Released the training code, pre-trained checkpoint, Gradio demo, and Colab demo.
[06/2024] Released the paper and project page.
Contents
Getting started <a name="getting-started"></a>
Environment setup <a name="env-setup"></a>
Instead of installing diffusers, peft, and open_clip from their official repos, we use our modified versions specified in the requirements.txt file. This is particularly important for diffusers and open_clip, due to the former's currently limited support for loading video diffusion model LoRA weights, and the latter's distributed training dependency.
To set up the environment, run the following commands:
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118 # please modify the cuda version according to your env
pip install -r requirements.txt
pip install scipy==1.11.1
pip install https://github.com/podgorskiy/dnnlib/releases/download/0.0.1/dnnlib-0.0.1-py3-none-any.whl
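To verify the installation, a quick import check like the sketch below can catch missing or mismatched packages early (a minimal sketch; the printed versions depend on your environment):

```python
# Sanity check: confirm the pinned packages import and CUDA is visible.
import torch
import torchvision
import diffusers
import peft
import open_clip  # installed from our modified requirements.txt

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("diffusers:", diffusers.__version__)
print("peft:", peft.__version__)
```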
Data preparation <a name="data"></a>
Please prepare the video and optional image datasets in the webdataset format.
Specifically, please wrap the video/image files and their corresponding .json format metadata into .tar files. Here is an example structure of the video .tar file:
.
├── video_0.json
├── video_0.mp4
...
├── video_n.json
└── video_n.mp4
The .json files contain video/image captions in key-value pairs, for example: {"caption": "World map in gray - world map with animated circles and binary numbers"}.
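For reference, a shard in this layout can be assembled with Python's standard tarfile module. The sketch below is a minimal example; the directory, shard name, and captions are placeholders for your own data:

```python
# Minimal sketch: pack paired .mp4 files and .json captions into a
# webdataset-style .tar shard. Paths and shard name are placeholders.
import json
import tarfile
from pathlib import Path

video_dir = Path("videos")  # directory containing video_0.mp4, video_1.mp4, ...
captions = {
    "video_0": "World map in gray - world map with animated circles and binary numbers",
    # "video_1": "...",
}

with tarfile.open("shard_00000.tar", "w") as tar:
    for key, caption in captions.items():
        # Write the metadata next to the video under the same basename,
        # matching the video_i.json / video_i.mp4 layout shown above.
        json_path = video_dir / f"{key}.json"
        json_path.write_text(json.dumps({"caption": caption}))
        tar.add(video_dir / f"{key}.mp4", arcname=f"{key}.mp4")
        tar.add(json_path, arcname=f"{key}.json")
```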
We provide our generated anime, realistic, and 3D cartoon style image datasets here (coming soon). Due to the dataset agreements, we cannot publicly release the WebVid and LAION-aes datasets.
DINOv2 and CLIP checkpoint download <a name="download"></a>
We provide a script, scripts/download.py, to download the DINOv2 and CLIP checkpoints.
python scripts/download.py
Wandb integration <a name="wandb"></a>
Please input your wandb API key in utils/wandb.py to enable wandb logging.
If you do not use wandb, please remove wandb from the --report_to argument in the training command.
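For reference, wandb authentication typically boils down to something like the sketch below. This is only an illustration of the wandb API; the actual key placement for this repo is in utils/wandb.py:

```python
# Illustration of wandb authentication (see utils/wandb.py for where
# this repo actually reads the key). Do not commit real API keys.
import wandb

WANDB_API_KEY = "your-api-key-here"  # placeholder
wandb.login(key=WANDB_API_KEY)
```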
Training <a name="train"></a>
We leverage accelerate for distributed training, and we support two different base text-to-video diffusion models: ModelScopeT2V and AnimateDiff. For both models, we train LoRA weights instead of fine-tuning all parameters.
ModelScopeT2V
For ModelScopeT2V, our code supports both pure video diffusion distillation training and frame quality improvement training.
By default, the training script requires 8 GPUs, each with 80GB of GPU memory, to fit a batch size of 4. The minimal GPU memory requirement is 32GB for a batch size of 1. Please adjust the --train_batch_size argument accordingly for different GPU memory sizes.
Before running the scripts, please modify the data path in the environment variables defined at the top of each script.
Diffusion distillation
We provide the training script in scripts/modelscopet2v_distillation.sh:
bash scripts/modelscopet2v_distillation.sh
Frame quality improvement
We provide the training script in scripts/modelscopet2v_improvement.sh. Before running, please assign the IMAGE_DATA_PATH in the script.
bash scripts/modelscopet2v_improvement.sh
AnimateDiff
Due to the higher resolution requirement, training MCM with the AnimateDiff base model requires at least 70GB of GPU memory to fit a single batch.
We provide the diffusion distillation training script in scripts/animatediff_distillation.sh.
bash scripts/animatediff_distillation.sh
Inference <a name="infer"></a>
We provide our pre-trained checkpoint here, Gradio demo here, and Colab demo here. demo.py showcases how to run our MCM on a local machine.
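As a rough reference, 4-step sampling with the ModelScopeT2V-based LoRA looks like the sketch below. The weight_name path is an assumption, so please follow demo.py and the Hugging Face model card for the exact loading code:

```python
# Minimal sketch of 4-step sampling with the ModelScopeT2V-based MCM LoRA.
# The weight_name below is a placeholder; check demo.py / the HF model card
# for the exact file layout of the yhzhai/mcm checkpoint.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights(
    "yhzhai/mcm",
    weight_name="modelscopet2v-distillation/pytorch_lora_weights.safetensors",  # placeholder path
)
pipe.to("cuda")

video_frames = pipe(
    "A spaceship flying through a nebula",
    num_inference_steps=4,   # MCM enables few-step sampling
    num_frames=16,
    guidance_scale=1.0,      # distilled models typically skip CFG; adjust per demo.py
).frames
```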
Feel free to try out our MCM!
MCM weights <a name="weight"></a>
We provide our pre-trained checkpoint here.
For research/debugging purposes, we also provide intermediate parameters and states at this box link. The folder (~1.12GB) includes the model weights, discriminator weights, scheduler states, optimizer states, and learnable head weights.
Acknowledgement <a name="ack"></a>
Some of our implementations are borrowed from the great repos below.
Citation <a name="cite"></a>
@article{zhai2024motion,
  title={Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation},
  author={Zhai, Yuanhao and Lin, Kevin and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Chung-Ching and Doermann, David and Yuan, Junsong and Wang, Lijuan},
  year={2024},
  journal={arXiv preprint arXiv:2406.06890},
  website={https://yhzhai.github.io/mcm/},
}