# CamI2V: Camera-Controlled Image-to-Video Diffusion Model

Official repository for the paper "CamI2V: Camera-Controlled Image-to-Video Diffusion Model".

Project page: https://zgctroy.github.io/CamI2V/
**Abstract:** Recently, camera pose, as a user-friendly and physics-related condition, has been introduced into text-to-video diffusion models for camera control. However, existing methods simply inject camera conditions through a side input. These approaches neglect the inherent physical knowledge of camera pose, resulting in imprecise camera control, inconsistencies, and poor interpretability. In this paper, we emphasize the necessity of integrating explicit physical constraints into model design. Epipolar attention is proposed for modeling all cross-frame relationships from a novel perspective of noised condition. This ensures that features are aggregated from corresponding epipolar lines in all noised frames, overcoming the limitations of current attention mechanisms in tracking displaced features across frames, especially when features move significantly with the camera and become obscured by noise. Additionally, we introduce register tokens to handle cases without intersections between frames, commonly caused by rapid camera movements, dynamic objects, or occlusions. To support image-to-video, we propose the multiple guidance scale to allow for precise control over image, text, and camera conditions, respectively. Furthermore, we establish a more robust and reproducible evaluation pipeline to address the inaccuracy and instability of existing camera control measurements. We achieve a 25.5% improvement in camera controllability on RealEstate10K while maintaining strong generalization to out-of-domain images. With optimization, only 24 GB and 12 GB of VRAM are required for training and inference, respectively. We plan to release checkpoints, along with training and evaluation code.
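To make the epipolar attention idea concrete, below is a minimal NumPy sketch of how an epipolar attention mask could be constructed from a relative camera pose. The function names, the pixel-distance threshold, and the masking strategy are illustrative assumptions, not the paper's exact implementation; the register tokens mentioned in the abstract would correspond to extra always-attendable columns appended to such a mask.

```python
import numpy as np

def skew(t):
    """Cross-product (skew-symmetric) matrix of a 3-vector t."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K, R, t):
    """F maps a pixel in the source frame to its epipolar line in the target frame:
    F = K^{-T} [t]_x R K^{-1} for relative pose (R, t) and shared intrinsics K."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv

def epipolar_attention_mask(K, R, t, H, W, threshold=1.0):
    """Boolean mask of shape (H*W, H*W): query pixel i may attend to key pixel j
    iff j lies within `threshold` pixels of i's epipolar line in the other frame."""
    F = fundamental_matrix(K, R, t)
    ys, xs = np.mgrid[0:H, 0:W]
    # homogeneous pixel coordinates, column-stacked: index = y * W + x
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)], axis=0)  # (3, N)
    lines = F @ pts                        # epipolar line l_i = F x_i, shape (3, N)
    # point-line distance |l_i . x_j| / ||(a_i, b_i)|| for every (query i, key j) pair
    num = np.abs(lines.T @ pts)            # (N, N)
    denom = np.linalg.norm(lines[:2], axis=0, keepdims=True).T + 1e-8  # (N, 1)
    return (num / denom) < threshold
```

For a pure sideways translation (R = I, t along x), each pixel's epipolar line is its own image row, so the mask restricts attention to same-row keys; in practice this mask would replace the dense cross-frame attention pattern inside the temporal attention layers.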
## News and ToDo List

- 2024-11-16: **Note:** The code is not yet complete or fully cleaned up. Evaluation code, an environment installer, bash scripts, and a Gradio demo are on the way. In addition, camera control methods are currently implemented via code injection into LVDM, which may be hard for Python beginners to follow; we plan to restructure the code within about three weeks.
- 2024-11-16: Released most of the code, including implementations of MotionCtrl, CameraCtrl, and CamI2V, together with training, inference, and test code.
- 2024-10-14: Checkpoints plus training and evaluation code to be released within a month.
## Performance

Evaluated at 256x256 resolution, 25 sampling steps, 16 frames, on an RTX 3090:
| Method $(c_\text{txt,img}=7.5,\ c_\text{cam}=1.0)$ | Parameters | Generation Time $\downarrow$ | RotErr $\downarrow$ | TransErr $\downarrow$ | CamMC $\downarrow$ | FVD (VideoGPT) $\downarrow$ | FVD (StyleGAN) $\downarrow$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DynamiCrafter | 1.4 B | 8.14 s | 3.3772 | 9.7700 | 11.544 | 117.785 | 103.510 |
| DynamiCrafter + MotionCtrl | + 63.4 M | 8.27 s | 0.9771 | 2.4435 | 3.0235 | 68.545 | 61.027 |
| DynamiCrafter + CameraCtrl | + 211 M | 8.38 s | 0.6984 | 1.8658 | 2.2445 | 68.422 | 60.235 |
| DynamiCrafter + CamI2V | + 261 M | 10.3 s | 0.4257 | 1.4226 | 1.6277 | 63.940 | 54.897 |
| DynamiCrafter + CamI2V (only Plücker, no epipolar) | | | 0.7624 | 2.0397 | 2.4542 | 66.237 | 58.179 |
| DynamiCrafter + CamI2V (no Plücker, only epipolar) | | | 1.5905 | 5.2980 | 6.2457 | 87.248 | 77.236 |
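The separate guidance scales in the table header ($c_\text{txt,img}=7.5$, $c_\text{cam}=1.0$) correspond to the multiple guidance scale described in the abstract. A minimal sketch of one common way to compose the three noise predictions in nested classifier-free-guidance style follows; the nesting order (camera on top of text+image) is an assumption, not necessarily the paper's exact formulation:

```python
def multi_guidance(eps_uncond, eps_txt_img, eps_full, s_txt_img=7.5, s_cam=1.0):
    """Nested classifier-free guidance with separate scales.

    eps_uncond  -- noise prediction with no conditions
    eps_txt_img -- prediction conditioned on text + image
    eps_full    -- prediction conditioned on text + image + camera
    Works on plain floats, NumPy arrays, or torch tensors alike.
    """
    return (eps_uncond
            + s_txt_img * (eps_txt_img - eps_uncond)
            + s_cam * (eps_full - eps_txt_img))
```

With $c_\text{cam}=1$ the camera term reduces to using the fully conditioned prediction on top of amplified text/image guidance; raising $c_\text{cam}$ would push the sample harder toward the requested camera trajectory, independently of the text/image scale.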
## Visualization
### 1024x576

zoom in + zoom out

### 512x320

Also see the 512-resolution section of https://zgctroy.github.io/CamI2V/.

### 256x256

See the 256-resolution section of https://zgctroy.github.io/CamI2V/.
## Related Repos

- CameraCtrl: https://github.com/hehao13/CameraCtrl
- MotionCtrl: https://github.com/TencentARC/MotionCtrl/tree/animatediff
## Citation

```bibtex
@inproceedings{anonymous2025camiv,
    title={CamI2V: Camera-Controlled Image-to-Video Diffusion Model},
    author={Anonymous},
    booktitle={Submitted to The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=dIZB7jeSUv},
    note={under review}
}
```