Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
<div align="center"><a href='https://arxiv.org/abs/2306.00943'><img src='https://img.shields.io/badge/arXiv-2306.00943-b31b1b.svg'></a> <a href='https://doubiiu.github.io/projects/Make-Your-Video/'><img src='https://img.shields.io/badge/Project-Video-Green'></a>
Jinbo Xing, Menghan Xia*, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, <br>Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong <br><br> (* corresponding author)
From CUHK and Tencent AI Lab.
IEEE TVCG 2024
</div>

🔆 Introduction
Make-Your-Video is a customized video generation model with both text and motion structure (depth) control. It inherits rich visual concepts from an image LDM and supports longer video inference.
🤗 Applications
Real-life scene to video
<table class="center">
<tr style="font-weight: bolder;text-align:center;"> <td>Real-life scene</td> <td>Ours</td> <td>Text2Video-zero+CtrlNet</td> <td>LVDM<sub>Ext</sub>+Adapter</td> </tr>
<tr> <td> <img src=assets/real-life_GIF/dam_input.gif width="170"> </td> <td> <img src=assets/real-life_GIF/dam_ours.gif width="170"> </td> <td> <img src=assets/real-life_GIF/dam_t2vzero.gif width="170"> </td> <td> <img src=assets/real-life_GIF/dam_lvdm.gif width="170"> </td> </tr>
<tr><td colspan="4">"A dam discharging water"</td></tr>
<tr> <td> <img src=assets/real-life_GIF/rocket_input.gif width="170"> </td> <td> <img src=assets/real-life_GIF/rocket_ours.gif width="170"> </td> <td> <img src=assets/real-life_GIF/rocket_t2vzero.gif width="170"> </td> <td> <img src=assets/real-life_GIF/rocket_lvdm.gif width="170"> </td> </tr>
<tr><td colspan="4">"A futuristic rocket ship on a launchpad, with sleek design, glowing lights"</td></tr>
</table>

3D scene modeling to video
<table class="center">
<tr style="font-weight: bolder;text-align:center;"> <td>Real-life scene</td> <td>Ours</td> <td>Text2Video-zero+CtrlNet</td> <td>LVDM<sub>Ext</sub>+Adapter</td> </tr>
<tr> <td> <img src=assets/3dmodeling_GIF/train_input.gif width="170"> </td> <td> <img src=assets/3dmodeling_GIF/train_ours.gif width="170"> </td> <td> <img src=assets/3dmodeling_GIF/train_t2vzero.gif width="170"> </td> <td> <img src=assets/3dmodeling_GIF/train_lvdm.gif width="170"> </td> </tr>
<tr><td colspan="4">"A train on the rail, 2D cartoon style"</td></tr>
<tr> <td> <img src=assets/3dmodeling_GIF/book_input.gif width="170"> </td> <td> <img src=assets/3dmodeling_GIF/book_ours.gif width="170"> </td> <td> <img src=assets/3dmodeling_GIF/book_t2vzero.gif width="170"> </td> <td> <img src=assets/3dmodeling_GIF/book_lvdm.gif width="170"> </td> </tr>
<tr><td colspan="4">"A Van Gogh style painting on drawing board in park, some books on the picnic blanket, photorealistic"</td></tr>
<tr> <td> <img src=assets/3dmodeling_GIF/mountain_input.gif width="170"> </td> <td> <img src=assets/3dmodeling_GIF/mountain_ours.gif width="170"> </td> <td> <img src=assets/3dmodeling_GIF/mountain_t2vzero.gif width="170"> </td> <td> <img src=assets/3dmodeling_GIF/mountain_lvdm.gif width="170"> </td> </tr>
<tr><td colspan="4">"A Chinese ink wash landscape painting"</td></tr>
</table>

Video re-rendering
<table class="center">
<tr style="font-weight: bolder; text-align:center;"> <td>Original video</td> <td>Ours</td> <td>SD-Depth</td> <td>Text2Video-zero+CtrlNet</td> <td>LVDM<sub>Ext</sub>+Adapter</td> <td>Tune-A-Video</td> </tr>
<tr> <td> <img src=assets/video-rerendering_GIF/bear_input.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/bear_ours.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/bear_sddepth.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/bear_t2vzero.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/bear_lvdm.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/bear_tav.gif width="170"> </td> </tr>
<tr><td colspan="6">"A tiger walks in the forest, photorealistic"</td></tr>
<tr> <td> <img src=assets/video-rerendering_GIF/boat_input.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/boat_ours.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/boat_sddepth.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/boat_t2vzero.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/boat_lvdm.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/boat_tav.gif width="170"> </td> </tr>
<tr><td colspan="6">"An origami boat moving on the sea"</td></tr>
<tr> <td> <img src=assets/video-rerendering_GIF/camel_input.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/camel_ours.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/camel_sddepth.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/camel_t2vzero.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/camel_lvdm.gif width="170"> </td> <td> <img src=assets/video-rerendering_GIF/camel_tav.gif width="170"> </td> </tr>
<tr><td colspan="6">"A camel walking on the snow field, Miyazaki Hayao anime style"</td></tr>
</table>

🌟 Method Overview
📝 Changelog
- [2023.11.30]: 🔥🔥 Release the main model.
- [2023.06.01]: 🔥🔥 Create this repo and launch the project webpage.
🧰 Models
| Model | Resolution | Checkpoint |
|---|---|---|
| MakeYourVideo256 | 256x256 | Hugging Face |
Inference for a single input takes approximately 13 seconds and requires a peak GPU memory of 20 GB on a single NVIDIA A100 (40G) GPU.
⚙️ Setup
Install Environment via Anaconda (Recommended)
conda create -n makeyourvideo python=3.8.5
conda activate makeyourvideo
pip install -r requirements.txt
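After activating the environment, it can help to confirm that the interpreter is the Python 3.8 line requested in the `conda create` command above. The snippet below is a minimal sketch for that purpose; the helper name `python_matches` and the constant `EXPECTED` are hypothetical and not part of this repository.

```python
import sys

# (major, minor) taken from the conda command above: python=3.8.5
EXPECTED = (3, 8)


def python_matches(version_info=None, expected=EXPECTED):
    """Return True if the (major, minor) part of version_info equals expected.

    When version_info is None, the running interpreter's version is used.
    """
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) == tuple(expected)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_matches():
        print(f"Python {current} matches the expected 3.8 environment")
    else:
        print(f"Python {current} found; the setup above expects 3.8.x")
```

Only the major and minor versions are compared, on the assumption that any 3.8.x patch release behaves the same for this setup.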
💫 Inference
1. Command line
- Download the pre-trained depth estimation model from Hugging Face, and put the `dpt_hybrid-midas-501f0c75.pt` in `checkpoints/depth/dpt_hybrid-midas-501f0c75.pt`.
- Download pretrained models via Hugging Face, and put the `model.ckpt` in `checkpoints/makeyourvideo_256_v1/model.ckpt`.
- Input the following commands in terminal.
sh scripts/run.sh
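The steps above depend on two checkpoint files sitting at fixed paths, so a small pre-flight check can save a failed run. The following is a minimal sketch, assuming it is executed from the repository root; the function name `missing_checkpoints` is hypothetical and not part of this repository, while the two paths are the ones listed above.

```python
from pathlib import Path

# Checkpoint locations from the inference steps, relative to the repository root.
EXPECTED_CHECKPOINTS = [
    "checkpoints/depth/dpt_hybrid-midas-501f0c75.pt",
    "checkpoints/makeyourvideo_256_v1/model.ckpt",
]


def missing_checkpoints(repo_root="."):
    """Return the expected checkpoint paths that are not files under repo_root."""
    root = Path(repo_root)
    return [path for path in EXPECTED_CHECKPOINTS if not (root / path).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        for path in missing:
            print(f"missing: {path}")
    else:
        print("All checkpoints found; ready to run scripts/run.sh")
```

The check only verifies that the files exist; it does not validate their contents or sizes.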
👨👩👧👦 Other Interesting Open-source Projects
VideoCrafter1: Framework for high-quality video generation.
DynamiCrafter: Open-domain image animation method using video diffusion priors.
Play with these projects in the same conda environment!
😉 Citation
@article{xing2023make,
title={Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance},
author={Xing, Jinbo and Xia, Menghan and Liu, Yuxin and Zhang, Yuechen and Zhang, Yong and He, Yingqing and Liu, Hanyuan and Chen, Haoxin and Cun, Xiaodong and Wang, Xintao and others},
journal={arXiv preprint arXiv:2306.00943},
year={2023}
}
📢 Disclaimer
We develop this repository for RESEARCH purposes, so it can only be used for personal/research/non-commercial purposes.
🌞 Acknowledgement
We gratefully acknowledge the Visual Geometry Group of the University of Oxford for collecting the WebVid-10M dataset, and we follow the corresponding terms of access.