CameraCtrl

This repository is the official implementation of CameraCtrl.

This main branch contains the code and models for CameraCtrl implemented on AnimateDiffV3. For the code and models of CameraCtrl with Stable Video Diffusion, please refer to the svd branch.

CameraCtrl: Enabling Camera Control for Video Diffusion Models <br> Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, Ceyuan Yang<br>

[Paper] [Project Page] [Weights] [HF Demo]

Todo List

Configurations

Environment

conda env create -f environment.yaml
conda activate cameractrl

Dataset

- RealEstate10k
  - annotations
    - test.json
    - train.json
    - validation.json
  - pose_files
    - 0000cc6d8b108390.txt
    - 00028da87cc5a4c4.txt
    - 0002b126b0a8a685.txt
    - 0003a9bce989e532.txt
    - 000465ebe46a98d2.txt
    - ...
  - video_clips
    - 00ccbtp2aSQ
    - 00rMZpGSeOI
    - 01bTY_glskw
    - 01PJ3skCZPo
    - 01uaDoluhzo
    - ...
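
The layout above can be sanity-checked before training with a short script. This is a minimal sketch: the helper name `check_realestate10k_layout` is hypothetical and not part of this repository, and it only looks for the directory and annotation file names listed in the tree.

```python
from pathlib import Path

# Names taken from the directory tree above.
EXPECTED_DIRS = ("annotations", "pose_files", "video_clips")
EXPECTED_ANNOTATIONS = ("test.json", "train.json", "validation.json")


def check_realestate10k_layout(root):
    """Return the expected paths that are missing under `root`.

    Hypothetical helper (not part of the repository). An empty list
    means every directory and annotation file in the tree was found.
    """
    root = Path(root)
    # Top-level folders that should exist.
    missing = [str(root / d) for d in EXPECTED_DIRS if not (root / d).is_dir()]
    # Annotation files that should exist inside annotations/.
    missing += [
        str(root / "annotations" / name)
        for name in EXPECTED_ANNOTATIONS
        if not (root / "annotations" / name).is_file()
    ]
    return missing
```

Calling `check_realestate10k_layout("RealEstate10k")` on a complete dataset folder returns an empty list; otherwise it returns the paths that still need to be created.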

Inferences

Prepare Models

Prepare camera trajectory & prompts

Inference

python -m torch.distributed.launch --nproc_per_node=8 --master_port=25000 inference.py \
      --out_root ${OUTPUT_PATH} \
      --ori_model_path ${SD1.5_PATH} \
      --unet_subfolder ${SUBFOLDER_NAME} \
      --motion_module_ckpt ${ADV3_MM_CKPT} \
      --pose_adaptor_ckpt ${CAMERACTRL_CKPT} \
      --model_config configs/train_cameractrl/adv3_256_384_cameractrl_relora.yaml \
      --visualization_captions assets/cameractrl_prompts.json \
      --use_specific_seeds \
      --trajectory_file assets/pose_files/0f47577ab3441480.txt \
      --n_procs 8

where

The example above generates videos in the domain of the original T2V model. The inference.py script also supports generating videos in other domains, either with an image LoRA (args.image_lora_rank and args.image_lora_ckpt), such as the RealEstate10K LoRA, or with a personalized base model (args.personalized_base_model), such as Realistic Vision. Please refer to the code for details.
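
As an illustration of how these optional settings might be added to the inference command, here is a minimal sketch. The helper `build_inference_args` is hypothetical, and the flag spellings `--image_lora_rank`, `--image_lora_ckpt`, and `--personalized_base_model` are assumed from the attribute names mentioned above; verify them, and their defaults, against inference.py before relying on them.

```python
def build_inference_args(base_args, image_lora_rank=None, image_lora_ckpt=None,
                         personalized_base_model=None):
    """Append optional domain-specific flags to a list of inference.py arguments.

    Hypothetical helper. The flag names are ASSUMED to mirror
    args.image_lora_rank, args.image_lora_ckpt and
    args.personalized_base_model; check inference.py for the real spellings.
    """
    args = list(base_args)  # copy so the caller's list is not modified
    if image_lora_rank is not None:
        args += ["--image_lora_rank", str(image_lora_rank)]
    if image_lora_ckpt is not None:
        args += ["--image_lora_ckpt", str(image_lora_ckpt)]
    if personalized_base_model is not None:
        args += ["--personalized_base_model", str(personalized_base_model)]
    return args
```

With no optional values the argument list is returned unchanged, which corresponds to the original T2V domain. Passing, for example, `image_lora_ckpt="path/to/lora.ckpt"` (an illustrative path) appends the matching flag and value to the end of the list.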

Results

<table> <tr> <th width=13.3% style="text-align:center">Camera Trajectory</th> <th width=20% style="text-align:center">Video</th> <th width=13.3% style="text-align:center">Camera Trajectory</th> <th width=20% style="text-align:center">Video</th> <th width=13.3% style="text-align:center">Camera Trajectory</th> <th width=20% style="text-align:center">Video</th> </tr> <tr> <td width=13.3% ><img src="assets/images/horse_1.png" alt="horse1_traj" width="90%"></td> <td width=20%><img src="assets/gifs/horse_1.gif" alt="horse1_vid" width="90%" ></td> <td width=13.3%><img src="assets/images/horse_2.png" alt="horse2_traj" width="90%"></td> <td width=20%><img src="assets/gifs/horse_2.gif" alt="horse2_vid" width="90%" ></td> <td width=13.3%><img src="assets/images/horse_3.png" alt="horse3_traj" width="90%"></td> <td width=20%><img src="assets/gifs/horse_3.gif" alt="horse3_vid" width="90%"></td> </tr> <tr> <td width=13.3%><img src="assets/images/horse_4.png" alt="horse4_traj" width="90%"></td> <td width=20%><img src="assets/gifs/horse_4.gif" alt="horse4_vid" width="90%" ></td> <td width=13.3%><img src="assets/images/horse_5.png" alt="horse5_traj" width="90%"></td> <td width=20%><img src="assets/gifs/horse_5.gif" alt="horse5_vid" width="90%"></td> <td width=13.3%><img src="assets/images/horse_6.png" alt="horse6_traj" width="90%"></td> <td width=20%><img src="assets/gifs/horse_6.gif" alt="horse6_vid" width="90%"></td> </tr> </table> <table> <tr> <th width=11.7% style="text-align:center">Generator</th> <th width=11.7% style="text-align:center">Camera Trajectory</th> <th width=17.6% style="text-align:center">Video</th> <th width=11.7% style="text-align:center">Camera Trajectory</th> <th width=17.6% style="text-align:center">Video</th> <th width=11.7% style="text-align:center">Camera Trajectory</th> <th width=17.6% style="text-align:center">Video</th> </tr> <tr> <td width=11.7% style="text-align:center" width="90%">SD1.5</td> <td width=11.7%><img src="assets/images/dd1.png" 
alt="dd1_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/0aa284f8166e19e4_A fish is swimming in the aquarium tank.gif" alt="dd1_vid" width="90%"></td> <td width=11.7%><img src="assets/images/dd2.png" alt="dd2_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/sunflowers_3b9420585a1e66fc.gif" alt="dd2_vid" width="90%"></td> <td width=11.7%><img src="assets/images/dd3.png" alt="dd3_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/massive, multi-tiered elven palace adorned with flowing waterfalls, its cascades forming staircases between ethereal realms_2f25826f0d0ef09a.gif" alt="dd3_vid" width="90%"></td> </tr> <tr> <td width=11.7% style="text-align:center" width="90%">SD1.5 + RealEstate LoRA </td> <td width=11.7%><img src="assets/images/dd4.png" alt="dd4_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/a_kitchen_with_wooden_cabinets_and_a_black_stove_0bf152ef84195293.gif" alt="dd4_vid" width="90%"></td> <td width=11.7%><img src="assets/images/dd5.png" alt="dd5_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/a_living_room_with_leather_couches_and_a_fireplace_2cc5f95fbe24ffe5.gif" alt="dd5_vid" width="90%"></td> <td width=11.7%><img src="assets/images/dd6.png" alt="dd6_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/an_empty_room_with_a_desk_and_chair_in_it_0f47577ab3441480.gif" alt="dd6_vid" width="90%"></td> </tr> <tr> <td width=11.7% style="text-align:center" width="90%">Realistic Vision</td> <td width=11.7%><img src="assets/images/dd7.png" alt="dd7_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/photo of coastline, rocks, storm weather, wind, waves, lightning, soft lighting_ 9d022c4ec370112a.gif" alt="dd7_vid" width="90%"></td> <td width=11.7%><img src="assets/images/dd8.png" alt="dd8_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/close up photo of a rabbit, forest, haze, halation, bloom, dramatic atmosphere, centred_3f79dc32d575bcdc.gif" alt="dd8_vid" 
width="90%"></td> <td width=11.7%><img src="assets/images/dd9.png" alt="dd9_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/realestate_horizontal_uniform.gif" alt="dd9_vid" width="90%"></td> </tr> <tr> <td width=11.7% style="text-align:center" width="90%">ToonYou</td> <td width=11.7%><img src="assets/images/dd10.png" alt="dd10_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/toonyou_ 62feb0ed164ebcbe.gif" alt="dd10_vid" width="90%"></td> <td width=11.7%><img src="assets/images/dd11.png" alt="dd11_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/0f47577ab3441480_mkvd, 1girl, turtleneck sweater, sweater yellow, happy, looking at viewer_.gif" alt="dd11_vid" width="90%"></td> <td width=11.7%><img src="assets/images/dd12.png" alt="dd12_traj" width="90%"></td> <td width=17.6%><img src="assets/gifs/closeup face photo of man in black clothes, night city street, bokeh, fireworks in background_4e012c05fdf8f9b3.gif" alt="dd12_vid" width="90%"></td> </tr> </table>

Each image paired with a video shows the camera trajectory. Each small tetrahedron in the image represents the position and orientation of the camera for one video frame: its apex marks the camera location, and its base represents the imaging plane of the camera. The red arrows indicate the movement of the camera position, and the camera rotation can be read from the orientation of the tetrahedrons.
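
To make the link between a pose and the plotted camera location concrete, the sketch below recovers the camera center from one line of a pose file. It rests on an assumption this README does not spell out: that pose files follow the RealEstate10K convention, where each line after the first holds a timestamp, four intrinsics, two unused values, and a row-major 3x4 world-to-camera matrix [R | t]. Under that assumption the camera center in world coordinates is C = -R^T t. The function name is hypothetical.

```python
def camera_center_from_pose_line(line):
    """Return the camera center (x, y, z) in world coordinates.

    ASSUMES 19 whitespace-separated values per line: timestamp,
    fx, fy, cx, cy, two unused values, then a row-major 3x4
    world-to-camera matrix [R | t]. Hypothetical helper.
    """
    values = line.split()
    if len(values) != 19:
        raise ValueError(f"expected 19 values, got {len(values)}")
    m = [float(v) for v in values[7:]]      # the 12 matrix entries
    rows = [m[0:4], m[4:8], m[8:12]]        # three rows of [R | t]
    R = [row[:3] for row in rows]           # 3x3 rotation
    t = [row[3] for row in rows]            # translation column
    # C = -R^T t: dot column j of R with t, then negate.
    return tuple(-sum(R[i][j] * t[i] for i in range(3)) for j in range(3))
```

Applying this to every frame line of a trajectory file yields the sequence of camera locations, which is what the tetrahedron apexes and red arrows in the images depict.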

Training

Step1 (RealEstate10K image LoRA)

In the config configs/train_image_lora/realestate_lora.yaml, update the following paths to the data and the pretrained model:

pretrained_model_path: "[replace with SD1.5 root path]"
train_data:
  root_path: "[replace with RealEstate10K root path]"

Other training parameters (lr, epochs, validation settings, etc.) are also included in the config files.
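
Before launching a run, it can help to confirm that no bracketed placeholder was left in an edited config. The sketch below is one way to do that; `find_unreplaced_placeholders` is a hypothetical helper, not part of this repository, and it scans the raw text rather than parsing YAML.

```python
import re

# Matches bracketed placeholders such as "[replace with SD1.5 root path]",
# regardless of capitalization.
PLACEHOLDER = re.compile(r"\[replace\b[^\]]*\]", re.IGNORECASE)


def find_unreplaced_placeholders(config_text):
    """Return (line_number, placeholder) pairs still present in a config.

    Hypothetical helper; it inspects raw text and does not parse YAML.
    """
    found = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            found.append((number, match.group(0)))
    return found
```

Reading a config file with `open(path).read()` and passing the text to this function returns an empty list once every path has been filled in.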

Then launch the image LoRA training with Slurm:

./slurm_run.sh ${PARTITION} image_lora 8 configs/train_image_lora/realestate_lora.yaml train_image_lora.py

or with PyTorch:

./dist_run.sh configs/train_image_lora/realestate_lora.yaml 8 train_image_lora.py

We provide our pretrained RealEstate10K LoRA checkpoint on Hugging Face.

Step2 (Camera control model)

In the config configs/train_cameractrl/adv3_256_384_cameractrl_relora.yaml, update the following paths to the data and the pretrained models:

pretrained_model_path: "[replace with SD1.5 root path]"
train_data:
  root_path: "[replace with RealEstate10K root path]"
validation_data:
  root_path: "[replace with RealEstate10K root path]"
lora_ckpt: "[replace with RealEstate10K image LoRA ckpt]"
motion_module_ckpt: "[replace with ADV3 motion module]"

Other training parameters (lr, epochs, validation settings, etc.) are also included in the config files.

Then launch the camera control model training with Slurm:

./slurm_run.sh ${PARTITION} cameractrl 8 configs/train_cameractrl/adv3_256_384_cameractrl_relora.yaml train_camera_control.py

or with PyTorch:

./dist_run.sh configs/train_cameractrl/adv3_256_384_cameractrl_relora.yaml 8 train_camera_control.py

Disclaimer

This project is released for academic use. We disclaim responsibility for user-generated content. Users are solely liable for their actions. The project contributors are not legally affiliated with, nor accountable for, users' behaviors. Use the generative model responsibly, adhering to ethical and legal standards.

Acknowledgement

We thank AnimateDiff for their amazing code and models.

BibTeX

@article{he2024cameractrl,
      title={CameraCtrl: Enabling Camera Control for Text-to-Video Generation}, 
      author={Hao He and Yinghao Xu and Yuwei Guo and Gordon Wetzstein and Bo Dai and Hongsheng Li and Ceyuan Yang},
      journal={arXiv preprint arXiv:2404.02101},
      year={2024}
}