

Vespa🐝: Video Diffusion State Space Models

This repo contains PyTorch model definitions, pre-trained weights, and training/sampling code for our paper, Video Diffusion State Space Models. Our model uses CLIP/T5 as the text encoder and a Mamba-based diffusion backbone. Its distinctive advantage lies in its reduced space complexity, which makes it well suited to processing long videos or high-resolution images without the need for window operations.
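
To make the complexity claim concrete, the following sketch compares how the cost of full self-attention and of a state space scan grows with the number of video tokens. It is back-of-the-envelope arithmetic only: the frame counts, the patch size of 2 (read from the "/2" suffix in the model name), and the state size of 16 are illustrative assumptions, not values taken from the paper or this codebase.

```python
def num_tokens(frames, image_size=64, patch_size=2):
    """Tokens in a clip when every frame is split into square patches."""
    patches_per_side = image_size // patch_size
    return frames * patches_per_side * patches_per_side


def attention_pairs(tokens):
    """Token pairs scored by one full self-attention map (quadratic)."""
    return tokens * tokens


def ssm_state_updates(tokens, state_size=16):
    """Per-channel state entries updated by a sequential scan (linear)."""
    return tokens * state_size


# Assumed clip lengths, chosen only to show the trend.
for frames in (1, 8, 16):
    n = num_tokens(frames)
    print(f"frames={frames:2d} tokens={n:6d} "
          f"attention={attention_pairs(n):12d} scan={ssm_state_updates(n):8d}")
```

Doubling the number of frames doubles the scan cost but quadruples the attention cost, which is why attention-based models tend to fall back on windowing for long sequences.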

The following samples were generated by the model with the prompt "sad".


1. Environments

2. Training

We provide a training script for VeSpa in train.py. This script can be used to train video diffusion state space models.

To launch VeSpa-M/2 (64x64) training in the raw space with N GPUs on one node:

torchrun --nnodes=1 --nproc_per_node=N train.py \
--model VeSpa-M/2 \
--model-type video \
--dataset-type ucf \
--data-path /path/to/data \
--anna-path /path/to/annate \
--image-size 64 \
--lr 1e-4

3. Evaluation

We include a sample.py script which generates samples from a VeSpa model. In addition, the test.py script supports evaluating other metrics, e.g., FLOPs and model parameters.

python sample.py \
--model VeSpa-M/2 \
--ckpt /path/to/model \
--image-size 64 \
--prompt sad 
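
As a rough illustration of the parameter-count metric mentioned above, a minimal sketch follows. It assumes the model is a standard torch.nn.Module, or any object whose parameters() yields items exposing numel() and requires_grad. Both count_parameters and format_millions are hypothetical helpers written for this example; test.py may compute the number differently.

```python
def count_parameters(model, trainable_only=False):
    """Total number of scalar parameters in a torch.nn.Module-like model."""
    return sum(
        p.numel()
        for p in model.parameters()
        # Keep frozen parameters unless only trainable ones are requested.
        if p.requires_grad or not trainable_only
    )


def format_millions(count):
    """Render a raw parameter count in millions, e.g. 1500000 -> '1.50M'."""
    return f"{count / 1e6:.2f}M"
```

For example, passing a torch.nn.Linear(4, 2) layer to count_parameters yields 10 (8 weights plus 2 biases).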

4. BibTeX

  title={Video Diffusion State Space Models},
  author={Zhengcong Fei and Mingyuan Fan and Yujun Liu and Changqian Yu and Junshi Huang},
  journal={arXiv preprint},

5. Acknowledgments

The codebase is based on the awesome DiS, DiT, mamba, U-ViT, and Vim repos.