CamI2V: Camera-Controlled Image-to-Video Diffusion Model

Official repository of the paper "CamI2V: Camera-Controlled Image-to-Video Diffusion Model".

Project page: https://zgctroy.github.io/CamI2V/

Abstract: Recent advancements have integrated camera pose as a user-friendly and physics-informed condition in video diffusion models, enabling precise camera control. In this paper, we identify one of the key challenges as effectively modeling noisy cross-frame interactions to enhance geometry consistency and camera controllability. We innovatively associate the quality of a condition with its ability to reduce uncertainty and interpret noisy cross-frame features as a form of noisy condition. Recognizing that noisy conditions provide deterministic information while also introducing randomness and potential misguidance due to added noise, we propose applying epipolar attention to only aggregate features along corresponding epipolar lines, thereby accessing an optimal amount of noisy conditions. Additionally, we address scenarios where epipolar lines disappear, commonly caused by rapid camera movements, dynamic objects, or occlusions, ensuring robust performance in diverse environments. Furthermore, we develop a more robust and reproducible evaluation pipeline to address the inaccuracies and instabilities of existing camera control metrics. Our method achieves a 25.64% improvement in camera controllability on the RealEstate10K dataset without compromising dynamics or generation quality and demonstrates strong generalization to out-of-domain images. Training and inference require only 24GB and 12GB of memory, respectively, for 16-frame sequences at 256×256 resolution. We will release all checkpoints, along with training and evaluation code. Dynamic videos are available for viewing on our project page.
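To make the core idea concrete, below is a minimal sketch (an illustration under assumed names and a pixel-distance threshold, not the repository's implementation) of building a boolean epipolar attention mask between two frames from shared intrinsics `K` and a relative pose `(R, t)`: a query pixel in one frame is allowed to attend only to key pixels lying near its epipolar line in the other frame.

```python
# Illustrative sketch (not the repo's code): build an epipolar attention mask
# between two frames, given shared intrinsics K and the relative pose (R, t)
# mapping frame-a camera coordinates to frame-b camera coordinates.
import torch

def skew(t: torch.Tensor) -> torch.Tensor:
    """Cross-product (skew-symmetric) matrix [t]_x of a 3-vector."""
    tx, ty, tz = t.tolist()
    return torch.tensor([[0.0, -tz, ty],
                         [tz, 0.0, -tx],
                         [-ty, tx, 0.0]])

def epipolar_mask(K: torch.Tensor, R: torch.Tensor, t: torch.Tensor,
                  h: int, w: int, threshold: float = 1.0) -> torch.Tensor:
    """(h*w, h*w) boolean mask: a query pixel in frame a may attend to a key pixel
    in frame b only if the key lies within `threshold` pixels of its epipolar line."""
    K, R, t = K.float(), R.float(), t.float()
    K_inv = torch.linalg.inv(K)
    F = K_inv.T @ skew(t) @ R @ K_inv                        # fundamental matrix
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs.flatten(), ys.flatten(),
                       torch.ones(h * w)], dim=0)            # homogeneous pixels, 3 x N
    lines = F @ pix                                          # epipolar lines in frame b, 3 x N
    a, b, c = lines[0], lines[1], lines[2]
    # distance of every key pixel (columns) to every query's epipolar line (rows)
    dist = (a[:, None] * pix[0][None] + b[:, None] * pix[1][None] + c[:, None]).abs()
    dist = dist / torch.sqrt(a ** 2 + b ** 2).clamp(min=1e-6)[:, None]
    return dist < threshold
```

In the model, such a mask would restrict cross-frame attention so that each query aggregates noisy features only from tokens along its epipolar line; the degenerate cases mentioned above (vanishing epipolar lines under fast camera motion, dynamic objects, or occlusion) are handled separately.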

:star2: News and Todo List

!!!!! The code is not yet complete and cleaned up. The environment installer, training scripts, evaluation code, and Gradio demo are on the way. In addition, the camera control methods are currently implemented by injecting code into lvdm, which is not easy for Python beginners to follow. We will refactor the code over the next several weeks. !!!!!

:chart_with_upwards_trend: Performance

| Method | RotErr $\downarrow$ | TransErr $\downarrow$ | CamMC $\downarrow$ | FVD $\downarrow$<br>(VideoGPT) | FVD $\downarrow$<br>(StyleGAN) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| DynamiCrafter | 3.3415 | 9.8024 | 11.625 | 106.02 | 92.196 |
| + MotionCtrl | 0.8636 | 2.5068 | 2.9536 | 70.820 | 60.363 |
| + Plucker Embedding<br>(Baseline, CameraCtrl) | 0.7098 | 1.8877 | 2.2557 | 66.077 | 55.889 |
| + Plucker Embedding<br>+ Epipolar Attention Only on Reference Frame<br>(CamCo-like) | 0.5738 | 1.6014 | 1.8851 | 66.439 | 56.778 |
| + Plucker Embedding<br>+ Epipolar Attention<br>(Our CamI2V) | 0.4758 | 1.4955 | 1.7153 | 66.090 | 55.701 |
| + Plucker Embedding<br>+ 3D Full Attention | 0.6299 | 1.8215 | 2.1315 | 71.026 | 60.00 |

Inference Speed and GPU Memory

Measured at 256x256 resolution, 16 frames, and 25 sampling steps.

| Method | # Parameters | GPU Memory | Generation Time<br>(RTX 3090) |
| :--- | :---: | :---: | :---: |
| DynamiCrafter | 1.4 B | 11.14 GiB | 8.14 s |
| + MotionCtrl | + 63.4 M | 11.18 GiB | 8.27 s |
| + Plucker Embedding<br>(Baseline, CameraCtrl) | + 211 M | 11.56 GiB | 8.38 s |
| + Plucker Embedding<br>+ Epipolar Attention<br>(Our CamI2V) | + 261 M | 11.67 GiB | 10.3 s |
<!-- ## :camera: Visualization ### 1024x576 zoom in + zoom out ![](https://github.com/user-attachments/assets/1405ee33-8404-40c9-b530-398c9aab88a5) ### 512x320 ![](https://github.com/user-attachments/assets/1c45d326-7dca-406b-a6e7-b46df90fceb1) ![](https://github.com/user-attachments/assets/a2176d29-d305-4a16-9ed3-c01440f5fc9a) ![](https://github.com/user-attachments/assets/a766dbb2-9a7c-4d0d-a991-87c6534be316) Also see 512 resolution part of [https://zgctroy.github.io/CamI2V/](https://zgctroy.github.io/CamI2V/) ### 256x256 See 256 resolution part of [https://zgctroy.github.io/CamI2V/](https://zgctroy.github.io/CamI2V/) -->

:gear: Environment

Quick Start

conda create -n cami2v python=3.10
conda activate cami2v

conda install -y pytorch==2.4.1 torchvision==0.19.1 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -y xformers -c xformers
pip install -r requirements.txt
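
If the environment is set up correctly, a quick check like the following (an optional sketch, not part of the repo) should print the installed versions and confirm CUDA is visible:

```python
# Optional sanity check: confirm PyTorch, CUDA, and xFormers are importable.
import torch
import xformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("xformers:", xformers.__version__)
```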

:dizzy: Inference

Download Checkpoints

We currently release checkpoints of DynamiCrafter-based CamI2V, CameraCtrl, and MotionCtrl at 256x256 resolution on Hugging Face. Higher-resolution CamI2V checkpoints are on the way, please stay tuned!

Download the checkpoints above and put them under the `ckpts` folder. Please edit `ckpt_path` in `models.json` if your model paths differ.
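
The exact schema of `models.json` is defined in this repo; as a rough illustration only (assuming each entry stores a `ckpt_path` field), a few lines of Python can verify that the configured checkpoint files exist before launching the demo:

```python
# Hypothetical helper: check every ckpt_path listed in models.json.
# Assumes models.json maps model names to objects containing a "ckpt_path"
# key; adjust to the actual schema used by this repo.
import json
from pathlib import Path

with open("models.json") as f:
    models = json.load(f)

for name, cfg in models.items():
    path = Path(cfg["ckpt_path"])
    status = "ok" if path.is_file() else "MISSING"
    print(f"{name}: {path} [{status}]")
```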

Run Gradio Demo

python cami2v_gradio_app.py

Gradio may struggle to establish a network connection; if so, please retry with `--use_host_ip`.

:rocket: Training

Prepare Dataset

Please follow the instructions in the `datasets` folder of this repo to download the RealEstate10K dataset and pre-process the necessary items such as `video_clips` and `valid_metadata`.
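
For orientation, RealEstate10K stores per-frame camera parameters as plain-text lines; the parser below is an illustrative sketch assuming the commonly used 19-column layout (timestamp in microseconds, four intrinsics normalized by image size, two unused fields, then a row-major 3x4 world-to-camera matrix). The scripts in the `datasets` folder remain the authoritative reference.

```python
# Hedged sketch of parsing one RealEstate10K camera line; the column layout is
# assumed as described above, verify against the dataset scripts in this repo.
import numpy as np

def parse_camera_line(line: str, width: int, height: int):
    values = [float(v) for v in line.split()]
    timestamp_us = int(values[0])
    fx, fy, cx, cy = values[1:5]                      # intrinsics, normalized to [0, 1]
    K = np.array([[fx * width, 0.0, cx * width],
                  [0.0, fy * height, cy * height],
                  [0.0, 0.0, 1.0]])
    w2c = np.array(values[7:19]).reshape(3, 4)        # [R | t], world-to-camera
    return timestamp_us, K, w2c
```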

Download Pretrained Models

Download the pretrained weights of the base model DynamiCrafter and put them under the `pretrained_models` folder:

─┬─ pretrained_models\
 └─┬─ DynamiCrafter\
   └─── model.ckpt

Launch

Start training by passing a config YAML to the `--base` argument of `main/trainer.py`. Example training configs are provided in the `configs` folder.

torchrun --standalone --nproc_per_node 8 main/trainer.py --train \
    --logdir $(pwd)/logs \
    --base configs/<YOUR_CONFIG_NAME>.yaml \
    --name <YOUR_LOG_NAME>

:wrench: Evaluation

We calculate RotErr, TransErr, CamMC, and FVD to evaluate camera controllability and visual quality. Code and an installation guide for the requirements, including COLMAP and GLOMAP, are provided in the `evaluation` folder. Support for VBench is also planned in the coming months.
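
For intuition, the sketch below shows per-frame formulations of the three camera metrics in the form commonly used by recent camera-control work (geodesic rotation angle, translation distance, and distance between the full pose matrices). It is only an approximation; the exact normalization, units, and COLMAP/GLOMAP pose-recovery steps live in the `evaluation` folder.

```python
# Hedged sketch of RotErr / TransErr / CamMC; the evaluation folder in this
# repo is the authoritative implementation (units and scaling may differ).
import numpy as np

def camera_errors(poses_gen: np.ndarray, poses_gt: np.ndarray):
    """poses_*: (N, 3, 4) per-frame relative camera matrices [R | t]."""
    rot_err = trans_err = cam_mc = 0.0
    for P_gen, P_gt in zip(poses_gen, poses_gt):
        R_gen, t_gen = P_gen[:, :3], P_gen[:, 3]
        R_gt, t_gt = P_gt[:, :3], P_gt[:, 3]
        cos = (np.trace(R_gen @ R_gt.T) - 1.0) / 2.0
        rot_err += np.arccos(np.clip(cos, -1.0, 1.0))  # geodesic rotation angle (radians)
        trans_err += np.linalg.norm(t_gen - t_gt)      # translation difference
        cam_mc += np.linalg.norm(P_gen - P_gt)         # difference of full [R | t] matrices
    return rot_err, trans_err, cam_mc
```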

:hugs: Related Repo

CameraCtrl: https://github.com/hehao13/CameraCtrl

MotionCtrl: https://github.com/TencentARC/MotionCtrl

DynamiCrafter: https://github.com/Doubiiu/DynamiCrafter

:spiral_notepad: Citation

@article{zheng2024cami2v,
  title={CamI2V: Camera-Controlled Image-to-Video Diffusion Model},
  author={Zheng, Guangcong and Li, Teng and Jiang, Rui and Lu, Yehao and Wu, Tao and Li, Xi},
  journal={arXiv preprint arXiv:2410.15957},
  year={2024}
}