# VideoBooth
<!-- [![arXiv](https://img.shields.io/badge/arXiv-2311.99999-b31b1b.svg)](https://arxiv.org/abs/2311.99999) -->

This repository will contain the implementation of the following paper:
**VideoBooth: Diffusion-based Video Generation with Image Prompts**<br>
Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu<br>
From MMLab@NTU affiliated with S-Lab, Nanyang Technological University and Shanghai AI Laboratory.
## Overview
Our VideoBooth generates videos with the subjects specified in the image prompts.
## Installation
- Clone the repository.

  ```bash
  git clone https://github.com/Vchitect/VideoBooth.git
  cd VideoBooth
  ```
- Install the environment.

  ```bash
  conda env create -f environment.yml
  conda activate videobooth
  ```
- Download the pretrained models (Stable Diffusion v1.4, VideoBooth) and put them under the folder `./pretrained_models/`.
## Inference
Here, we provide one example of how to perform inference.

```bash
python sample_scripts/sample.py --config sample_scripts/configs/panda.yaml
```
If you want to use your own image, you need to segment the object first. We use Grounded-SAM to segment the subject from images.
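If you do not want to set up the full Grounded-SAM pipeline, the following is a minimal sketch of subject segmentation with Segment Anything alone, assuming you supply the subject's bounding box yourself. The checkpoint path, image path, and box coordinates are placeholders, not values used by this repo.

```python
# Minimal sketch (not the repo's pipeline): cut out the subject with Segment Anything,
# given a manually specified bounding box. All paths and coordinates are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

image = cv2.cvtColor(cv2.imread("my_subject.jpg"), cv2.COLOR_BGR2RGB)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# Box is (x1, y1, x2, y2) in pixels; Grounded-SAM would obtain it from a text query instead.
box = np.array([100, 80, 480, 420])
masks, _, _ = predictor.predict(box=box, multimask_output=False)

# Keep only the subject pixels; black out the background and save the result.
subject = image * masks[0][..., None]
cv2.imwrite("my_subject_segmented.png", cv2.cvtColor(subject.astype(np.uint8), cv2.COLOR_RGB2BGR))
```

The saved, background-free subject image can then serve as the image prompt.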
## Training
VideoBooth is trained in a coarse-to-fine manner.
### Stage 1: Coarse Stage Training
```bash
srun --mpi=pmi2 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29125 train_stage1.py \
    --model TAVU \
    --num-frames 16 \
    --dataset WebVideoImageStage1 \
    --frame-interval 4 \
    --ckpt-every 1000 \
    --clip-max-norm 0.1 \
    --global-batch-size 16 \
    --reg-text-weight 0 \
    --results-dir ./results \
    --pretrained-t2v-model path-to-t2v-model \
    --global-mapper-path path-to-elite-global-model
```
### Stage 2: Fine Stage Training
```bash
srun --mpi=pmi2 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29125 train_stage2.py \
    --model TAVU \
    --num-frames 16 \
    --dataset WebVideoImageStage2 \
    --frame-interval 4 \
    --ckpt-every 1000 \
    --clip-max-norm 0.1 \
    --global-batch-size 16 \
    --reg-text-weight 0 \
    --results-dir ./results \
    --pretrained-t2v-model path-to-t2v-model \
    --global-mapper-path path-to-stage1-model
```
## Dataset Preparation
You can download our proposed dataset from HuggingFace.
```bash
# merge the split zip files
zip -F webvid_parsing_2M_split.zip --out single-archive.zip

# unzip, then point path-to-webvid-parsing to this path
unzip single-archive.zip

# unzip, then point path-to-videobooth-subset to this path
unzip webvid_parsing_videobooth_subset.zip
```
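Before unzipping, it can be worth checking that the merged archive is intact. Below is an optional sketch using Python's standard `zipfile` module; the file name simply matches the merge command above.

```python
# Optional sanity check (not part of the repo): verify the merged archive is readable
# before unzipping it.
import zipfile

with zipfile.ZipFile("single-archive.zip") as zf:
    bad = zf.testzip()  # returns the name of the first corrupt member, or None
    print(f"{len(zf.namelist())} files in archive; first corrupt member: {bad}")
```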
## Citation
If you find our repo useful for your research, please consider citing our paper:
```bibtex
@article{jiang2023videobooth,
  author = {Jiang, Yuming and Wu, Tianxing and Yang, Shuai and Si, Chenyang and Lin, Dahua and Qiao, Yu and Loy, Chen Change and Liu, Ziwei},
  title  = {VideoBooth: Diffusion-based Video Generation with Image Prompts},
  year   = {2023}
}
```