control-a-video

<!-- <img src="basketball.gif" width="256"> -->

Official Implementation of "Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models"

Similar to ControlNet, we obtain the condition maps from another video. Three kinds of control maps are supported at this time:

| depth control | canny control | hed control |
| --- | --- | --- |
| <img src="videos/depth_a_bear_walking_through_stars.gif" width="200"><br> a bear walking through stars, artstation | <img src="videos/canny_a_dog_comicbook.gif" width="200"><br> a dog, comicbook style | <img src="videos/hed_a_person_riding_a_horse_jumping_over_an_obstacle_watercolor_style.gif" width="200"><br> person riding horse, watercolor |
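For intuition, the control maps are per-frame condition images extracted from the input video (the repo derives them automatically from --input_video). Below is a minimal sketch of the idea for the canny case, assuming OpenCV is available; the helper name video_to_canny_maps is ours, not part of this repository:

import cv2

def video_to_canny_maps(video_path, low=100, high=200):
    """Return one canny edge map per frame of the input video (illustrative only)."""
    cap = cv2.VideoCapture(video_path)
    maps = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        maps.append(cv2.Canny(gray, low, high))
    cap.release()
    return maps

# e.g. canny_maps = video_to_canny_maps("bear.mp4")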

Setup

The model has been tested with torch 1.13.1+cu117. To install the dependencies, simply run:

pip3 install -r requirements.txt
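
Optionally, you can check that the installed torch build matches the tested version:

python3 -c "import torch; print(torch.__version__, torch.version.cuda)"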

Usage

1. Quick Use

We provide a demo for quick testing in this repo; simply run:

python3 inference.py --prompt "a bear walking through stars, artstation" --input_video bear.mp4 --control_mode depth 

Args:

- --prompt: text prompt describing the generated video.
- --input_video: path to the source video from which the control maps are extracted.
- --control_mode: type of control map; one of depth, canny, or hed.

If the automatic download does not work, the model weights can be downloaded manually from: depth_control_model, canny_control_model, hed_control_model.
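
If the weights are hosted on the Hugging Face Hub (which the automatic download suggests), a manual fetch could look like the sketch below; the repo id is a placeholder, substitute the repository behind the links above:

from huggingface_hub import snapshot_download

# "<control_model_repo_id>" is a placeholder, not a real repository name.
local_dir = snapshot_download(repo_id="<control_model_repo_id>")
print(local_dir)  # path of the downloaded weights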

2. Auto-Regressive Generation

Our model first generates the first frame. Once we have the first frame, we generate the subsequent frames conditioned on it, which allows the model to produce longer videos auto-regressively. (This operation is still experimental and may collapse after 3 or 4 iterations.)

python3 inference.py --prompt "a bear walking through stars, artstation" --input_video bear.mp4 --control_mode depth --num_sample_frames 16 --each_sample_frame 8

Note that num_sample_frames should be a multiple of each_sample_frame.
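
To make this constraint concrete, here is a conceptual sketch of how the target frame count could be split into chunks; chunk_frames is our own illustrative helper, not code from inference.py:

def chunk_frames(num_sample_frames, each_sample_frame):
    """Illustrative only: split the target frame count into auto-regressive chunks."""
    assert num_sample_frames % each_sample_frame == 0, \
        "num_sample_frames must be a multiple of each_sample_frame"
    chunks = []
    for start in range(0, num_sample_frames, each_sample_frame):
        # In the real pipeline, each chunk after the first is generated conditioned
        # on the last frame of the previous chunk; here we only compute index ranges.
        chunks.append(list(range(start, start + each_sample_frame)))
    return chunks

# chunk_frames(16, 8) -> [[0, 1, ..., 7], [8, 9, ..., 15]]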

Replace the 2D Model (Experimental)

Since we freeze the 2D model, you can replace it with any other model based on stable-diffusion-v1-5 to generate custom-style videos:

import os
import torch
# Reload the 2D UNet weights from a custom stable-diffusion-v1-5-based model directory.
state_dict_path = os.path.join(pipeline_model_path, "unet", "diffusion_pytorch_model.bin")
state_dict = torch.load(state_dict_path, map_location="cpu")
video_controlnet_pipe.unet.load_2d_state_dict(state_dict=state_dict)    # reload 2d model.

Citation

@misc{chen2023controlavideo,
    title={Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models},
    author={Weifeng Chen and Jie Wu and Pan Xie and Hefeng Wu and Jiashi Li and Xin Xia and Xuefeng Xiao and Liang Lin},
    year={2023},
    eprint={2305.13840},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgement

This repository borrows heavily from Diffusers, ControlNet, and Tune-A-Video; thanks for open-sourcing! This work was done at ByteDance; thanks to our collaborators!

Future Plan