Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation

Website | Paper | Code

Model preparation

  1. VidRD LDM model: Google Drive
  2. VidRD fine-tuned VAE: Google Drive
  3. Stable Diffusion 2.1: Hugging Face (a download sketch follows this list)
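
The Stable Diffusion weights can also be fetched programmatically with huggingface_hub; a minimal sketch, assuming the stabilityai/stable-diffusion-2-1-base repository (the Google Drive files above still need to be downloaded by hand):

from huggingface_hub import snapshot_download

# Download Stable Diffusion 2.1 (base) into the layout shown below.
snapshot_download(
    repo_id="stabilityai/stable-diffusion-2-1-base",
    local_dir="assets/stable-diffusion-2-1-base",
)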

Below is the expected layout of these model files; a small sanity check for this layout is sketched after the tree.

assets/
├── ModelT2V.pth
├── vae_finetuned/
│   ├── diffusion_pytorch_model.bin
│   └── config.json
└── stable-diffusion-2-1-base/
    ├── scheduler/...
    ├── text_encoder/...
    ├── tokenizer/...
    ├── unet/...
    ├── vae/...
    ├── ...
    └── README.md
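
Before running inference, it may help to verify that everything is where the code expects it. A minimal sanity-check sketch, using only the paths from the tree above (this script is illustrative and not part of the repository):

from pathlib import Path

# Paths taken from the directory tree above.
expected = [
    "assets/ModelT2V.pth",
    "assets/vae_finetuned/diffusion_pytorch_model.bin",
    "assets/vae_finetuned/config.json",
    "assets/stable-diffusion-2-1-base/unet",
    "assets/stable-diffusion-2-1-base/vae",
]

missing = [p for p in expected if not Path(p).exists()]
if missing:
    raise FileNotFoundError(f"Missing model files: {missing}")
print("All expected model files are in place.")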

Environment setup

Python >= 3.10 is required.

pip install -r requirements.txt
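
For example, using a virtual environment (the environment name .venv is arbitrary):

python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt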

Model inference

Configuration for model inference, including the text prompts for video generation, is defined in configs/example.yaml.

python main.py --config-name="example" \
  ++model.ckpt_path="assets/ModelT2V.pth" \
  ++model.temporal_vae_path="assets/vae_finetuned/" \
  ++model.pretrained_model_path="assets/stable-diffusion-2-1-base/"
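
The ++ arguments are Hydra-style overrides, so the same values can instead be set in the config file itself. A hypothetical sketch of configs/example.yaml (only the model.* keys come from the command above; the prompt field name is an assumption):

model:
  ckpt_path: assets/ModelT2V.pth
  temporal_vae_path: assets/vae_finetuned/
  pretrained_model_path: assets/stable-diffusion-2-1-base/
# Assumption: the actual key holding the text prompts may be named differently.
prompts:
  - "a panda eating bamboo in a sunny forest"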

BibTeX

@article{reuse2023,
  title     = {Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation},
  journal   = {arXiv preprint arXiv:2309.03549},
  year      = {2023}
}