CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
<a href='https://cococozibojia.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/pdf/2403.12035'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a>
Bojia Zi<sup>1</sup>, Shihao Zhao<sup>2</sup>, Xianbiao Qi<sup>*5</sup>, Jianan Wang<sup>4</sup>, Yukai Shi<sup>3</sup>, Qianyu Chen<sup>1</sup>, Bin Liang<sup>1</sup>, Rong Xiao<sup>5</sup>, Kam-Fai Wong<sup>1</sup>, Lei Zhang<sup>4</sup>
<sup>*</sup> denotes the corresponding author.
This is the inference code for our paper CoCoCo.
<p align="center"> <img src="https://github.com/zibojia/COCOCO/blob/main/__asset__/COCOCO.PNG" alt="COCOCO" style="width: 100%;"/> </p> <table> <tr> <td><img src="__asset__/sea_org.gif"></td> <td><img src="__asset__/sea1.gif"></td> <td><img src="__asset__/sea2.gif"></td> </tr> <tr> <td> Orginal </td> <td> The ocean, the waves ... </td> <td> The ocean, the waves ... </td> </tr> </table> <table> <tr> <td><img src="__asset__/river_org.gif"></td> <td><img src="__asset__/river1.gif"></td> <td><img src="__asset__/river2.gif"></td> </tr> <tr> <td> Orginal </td> <td> The river with ice ... </td> <td> The river with ice ... </td> </tr> </table> <table> <tr> <td><img src="__asset__/sky_org.gif"></td> <td><img src="__asset__/sky1.gif"></td> <td><img src="__asset__/sky2.gif"></td> </tr> <tr> <td> Orginal </td> <td> Meteor streaking in the sky ... </td> <td> Meteor streaking in the sky ... </td> </tr> </table>Table of Contents <!-- omit in toc -->
Features
- Consistent text-guided video inpainting
  - By using damped attention, CoCoCo produces temporally consistent inpainted content.
- Higher text controllability
  - The inpainted content follows the text prompt more faithfully.
- Personalized video inpainting
  - We develop a training-free method to implement personalized video inpainting by leveraging personalized T2Is.
- Gradio Demo using SAM2
  - We use SAM2 to build a Video Inpaint Anything demo.
- Infinite Video Inpainting
  - By using a sliding window, you can inpaint videos of arbitrary length.
- Controllable Video Inpainting
  - By composing with ControlNet, we can inpaint controllable content in the given masked region.
- More inpainting tricks will be released soon...
Installation
Step 1. Installation Checklist
Before installing the dependencies, check the following requirements to avoid installation failures.
- You have a GPU with at least 24 GB of memory.
- Your CUDA (nvcc) version is 12.0 or higher.
- Your PyTorch version is 2.4 or higher.
- Your gcc version is 9.4 or higher.
- Your diffusers version is 0.11.1.
- Your gradio version is 3.40.0.
Step 2. Install the requirements
Once your environment meets these requirements, install the dependencies with pip.
# Install the CoCoCo dependencies
pip3 install -r requirements.txt
# Compile and install SAM2
pip3 install -e .
If everything installs without errors, you can proceed to the next steps.
Usage
1. Download pretrained models.
Note that our method requires both the Stable Diffusion 1.5 inpainting weights and the CoCoCo weights.
- The pretrained image inpainting model (Stable Diffusion Inpainting).
- The CoCoCo checkpoints.
- Warning: runwayml has removed their models and weights, so the image inpainting model must be downloaded from another source.
- After downloading, put the two models in two separate folders: the image inpainting folder should contain scheduler, tokenizer, text_encoder, vae, and unet; the CoCoCo folder should contain model_0.pth to model_3.pth.
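For reference, a possible layout, assuming the example folder names ./stable-diffusion-v1-5-inpainting and ./cococo_weights used in the commands below:

```
stable-diffusion-v1-5-inpainting/
├── scheduler/
├── tokenizer/
├── text_encoder/
├── vae/
└── unet/
cococo_weights/
├── model_0.pth
├── model_1.pth
├── model_2.pth
└── model_3.pth
```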
2. Prepare the mask
You can obtain masks with GroundingDINO or Track-Anything, or draw them yourself.
We also release a Gradio demo that uses SAM2 to implement Video Inpainting Anything. Try our demo!
<p align="center"> <img src="https://github.com/zibojia/COCOCO/blob/main/__asset__/DEMO.PNG" alt="DEMO" style="width: 95%;"/> </p>3. Run our validation script.
By running the command below, you can get the video inpainting results.
python3 valid_code_release.py --config ./configs/code_release.yaml \
--prompt "Trees. Snow mountains. best quality." \
--negative_prompt "worst quality. bad quality." \
--guidance_scale 10 \ # the CFG scale; higher means stronger text controllability
--video_path ./images/ \ # the path that stores the video and masks, as images.npy and masks.npy
--model_path [cococo_folder_name] \ # the path to the CoCoCo weights, e.g. ./cococo_weights
--pretrain_model_path [sd_folder_name] \ # the path that stores the pretrained stable inpainting model, e.g. ./stable-diffusion-v1-5-inpainting
--sub_folder unet # the subfolder of the pretrained stable inpainting model that holds the unet checkpoint
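If you prepare the inputs yourself, `images.npy` and `masks.npy` are plain NumPy arrays. Below is a minimal sketch; the helper name `pack_inputs`, the directory names, and the assumed shapes (frames as `(T, H, W, 3)` uint8, masks as `(T, H, W)` with 1 marking the region to inpaint) are illustrative assumptions, so check them against your own data.

```python
# Sketch: pack video frames and per-frame masks into images.npy / masks.npy (assumed layout).
import os
import numpy as np
from PIL import Image

def pack_inputs(frame_dir: str, mask_dir: str, out_dir: str = "./images") -> None:
    # Load frames and binary masks in sorted filename order so they stay aligned.
    frames = [np.array(Image.open(os.path.join(frame_dir, f)).convert("RGB"))
              for f in sorted(os.listdir(frame_dir))]
    masks = [np.array(Image.open(os.path.join(mask_dir, f)).convert("L")) > 127
             for f in sorted(os.listdir(mask_dir))]
    os.makedirs(out_dir, exist_ok=True)
    np.save(os.path.join(out_dir, "images.npy"), np.stack(frames).astype(np.uint8))
    np.save(os.path.join(out_dir, "masks.npy"), np.stack(masks).astype(np.uint8))

pack_inputs("./my_frames", "./my_masks")  # hypothetical input folders
```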
4. Personalized Video Inpainting (Optional)
We provide a method that lets users compose their own personalized video inpainting model from personalized T2Is WITHOUT TRAINING. There are three steps in total:
- Convert the open-source model to PyTorch weights.
- Transform the personalized image diffusion model into a personalized inpainting diffusion model: compute the weight difference between the personalized image diffusion model and SD1.5, and add it to the inpainting model (see the sketch after this list). Surprisingly, this yields a personalized image inpainting model, and it works well :)
- Add the weights of the personalized inpainting model to our CoCoCo.
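A minimal sketch of the weight arithmetic behind step 2. The sign convention (personalized T2I minus SD1.5 base, added to the inpainting weights), the function name, and the file names are assumptions for illustration; the released conversion scripts handle the details such as key renaming.

```python
# Sketch: task-vector style weight merging for step 2 (assumed sign convention).
import torch

def merge_personalized_inpainting(base_t2i, personalized_t2i, inpainting, alpha: float = 1.0):
    """Return inpainting weights shifted by the personalization delta."""
    merged = {}
    for key, w_inpaint in inpainting.items():
        if key in base_t2i and key in personalized_t2i and w_inpaint.shape == base_t2i[key].shape:
            delta = personalized_t2i[key] - base_t2i[key]   # personalization "task vector"
            merged[key] = w_inpaint + alpha * delta
        else:
            merged[key] = w_inpaint  # keys unique to the inpainting model stay untouched
    return merged

# Hypothetical file names for the three state dicts.
base = torch.load("sd15_unet.bin", map_location="cpu")
personalized = torch.load("personalized_unet.bin", map_location="cpu")
inpaint = torch.load("sd15_inpainting_unet.bin", map_location="cpu")
torch.save(merge_personalized_inpainting(base, personalized, inpaint),
           "personalized_inpainting_unet.bin")
```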
Convert safetensors to PyTorch weights
- For models whose keys differ from Stable Diffusion's, use the following script to process the open-source T2I model.
  For example, epiCRealism uses key names such as
  model.diffusion_model.input_blocks.1.1.norm.bias model.diffusion_model.input_blocks.1.1.norm.weight
  which differ from the Stable Diffusion keys. Therefore, we developed a tool to convert this type of model into a weight delta.
cd task_vector; python3 convert.py \
--tensor_path [safetensor_path] \ # the safetensor path
--unet_path [unet_path] \ # the path to SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
--text_encoder_path [text_encoder_path] \ # the text encoder path, e.g. stable-diffusion-v1-5-inpainting/text_encoder/pytorch_model.bin
--vae_path [vae_path] \ # the vae path, e.g. stable-diffusion-v1-5-inpainting/vae/diffusion_pytorch_model.bin
--source_path ./resources \ # the path with the preliminary files, e.g. ./resources
--target_path ./resources \ # the path with the preliminary files, e.g. ./resources
--target_prefix [prefix] # the converted filename prefix
- For models that use the same keys and are trained with LoRA, for example the Ghibli LoRA, whose keys look like
  lora_unet_up_blocks_3_resnets_0_conv1.lora_down.weight lora_unet_up_blocks_3_resnets_0_conv1.lora_up.weight
  use the following script:
python3 convert_lora.py \
--tensor_path [tensor_path] \ # the safetensor path
--unet_path [unet_path] \ # the path to SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
--text_encoder_path [text_encoder_path] \ # the text encoder path, e.g. stable-diffusion-v1-5-inpainting/text_encoder/pytorch_model.bin
--vae_path [vae_path] \ # the vae path, e.g. stable-diffusion-v1-5-inpainting/vae/diffusion_pytorch_model.bin
--regulation_path ./lora.json \ # use this path by default; please don't change it
--target_prefix [target_prefix] # the converted filename prefix
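If you are unsure which of the two cases applies to your checkpoint, you can inspect its keys directly. A small sketch using the safetensors library; the file name is a placeholder:

```python
# Sketch: list a few keys of a .safetensors checkpoint to decide which conversion script to use.
from safetensors.torch import load_file

state_dict = load_file("downloaded_model.safetensors")  # placeholder path
for key in list(state_dict.keys())[:10]:
    print(key)

# Keys like "model.diffusion_model...."            -> full-weight model, use convert.py
# Keys like "lora_unet_..._conv1.lora_up.weight"   -> LoRA checkpoint, use convert_lora.py
```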
Add the converted PyTorch weights to CoCoCo to create personalized video inpainting
- You can use a customized T2I model or LoRA to generate visual content in the masked regions:
python3 valid_code_release_with_T2I_LoRA.py \
--config ./configs/code_release.yaml \
--guidance_scale 10 \ # keep the default
--video_path ./images \ # the path that stores the videos as images.npy
--masks_path ./images \ # the path that stores the masks as masks.npy
--model_path [model_path] \ # the path that stores the CoCoCo weights
--pretrain_model_path [pretrain_model_path] \ # the path that stores SD1.5 Inpainting, e.g. ./stable-diffusion-v1-5-inpainting
--sub_folder unet \ # the subfolder of the pretrained stable inpainting model that holds the unet checkpoint
--unet_lora_path [unet_lora_path] \ # the LoRA weights for the unet
--beta_unet 0.75 \ # the hyper-parameter $beta$ for the unet LoRA weights
--text_lora_path [text_lora_path] \ # the LoRA weights for the text_encoder
--beta_text 0.75 \ # the hyper-parameter $beta$ for the text encoder LoRA weights
--vae_lora_path [vae_lora_path] \ # the LoRA weights for the vae
--beta_vae 0.75 \ # the hyper-parameter $beta$ for the vae LoRA weights
--unet_model_path [unet_model_path] \ # the path to SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
--text_model_path [text_model_path] \ # the text encoder path, e.g. stable-diffusion-v1-5-inpainting/text_encoder/pytorch_model.bin
--vae_model_path [vae_model_path] \ # the vae path, e.g. stable-diffusion-v1-5-inpainting/vae/diffusion_pytorch_model.bin
--prompt [prompt] \
--negative_prompt [negative_prompt]
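For intuition, the $beta$ flags scale how strongly each LoRA delta is applied. Below is a minimal sketch of the common LoRA merge rule, which we assume is what these flags control; the names are illustrative, not the script's internals, and some LoRA checkpoints use an extra alpha/rank scaling factor not shown here.

```python
# Sketch: merging a LoRA pair into a base weight with strength beta (assumed convention).
import torch

def merge_lora_weight(base: torch.Tensor, lora_down: torch.Tensor,
                      lora_up: torch.Tensor, beta: float = 0.75) -> torch.Tensor:
    """W' = W + beta * (up @ down); beta=0 keeps the base model, larger beta follows the LoRA more."""
    delta = lora_up @ lora_down                 # (out, rank) @ (rank, in) -> (out, in)
    return base + beta * delta.reshape(base.shape)

base = torch.randn(320, 320)                    # toy linear weight
down = torch.randn(4, 320) * 0.01               # rank-4 LoRA factors
up = torch.randn(320, 4) * 0.01
merged = merge_lora_weight(base, down, up, beta=0.75)
```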
5. COCOCO INFERENCE with SAM2
- Try our demo with the original CoCoCo model:
CUDA_VISIBLE_DEVICES=0,1 python3 app.py \
--config ./configs/code_release.yaml \
--model_path [model_path] \ # the path to the CoCoCo weights
--pretrain_model_path [pretrain_model_path] \ # the image inpainting pretrained model path, e.g. ./stable-diffusion-v1-5-inpainting
--sub_folder [sub_folder] # the default is unet
- Try our demo with LoRA and a converted checkpoint:
  - Using our conversion code, we obtained several personalized image inpainting models and LoRAs, which you can download from the links below:
  - Run the Gradio demo with LoRA:
CUDA_VISIBLE_DEVICES=0,1 python3 app_with_T2I_LoRA.py \
--config ./configs/code_release.yaml \
--unet_lora_path [unet_lora_path] \ # the LoRA weights for the unet
--text_lora_path [text_lora_path] \ # the LoRA weights for the text_encoder
--vae_lora_path [vae_lora_path] \ # the LoRA weights for the vae
--beta_unet 0.75 \ # the hyper-parameter $beta$ for the unet LoRA weights
--beta_text 0.75 \ # the hyper-parameter $beta$ for the text_encoder LoRA weights
--beta_vae 0.75 \ # the hyper-parameter $beta$ for the vae LoRA weights
--text_model_path [text_model_path] \ # the text encoder path, e.g. stable-diffusion-v1-5-inpainting/text_encoder/pytorch_model.bin
--unet_model_path [unet_model_path] \ # the path to SD1.5 unet weights, e.g. stable-diffusion-v1-5-inpainting/unet/diffusion_pytorch_model.bin
--vae_model_path [vae_model_path] \ # the vae path, e.g. stable-diffusion-v1-5-inpainting/vae/diffusion_pytorch_model.bin
--model_path [model_path] \ # the CoCoCo weights
--pretrain_model_path [pretrain_model_path] \ # the image inpainting pretrained model path, e.g. ./stable-diffusion-v1-5-inpainting
--sub_folder [sub_folder] # the default is unet
TO DO
[1]. We will use a larger dataset with high-quality videos to produce a more powerful video inpainting model soon.
[2]. The training code is under preparation.
Citation
@article{Zi2024CoCoCo,
title={CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility},
author={Bojia Zi and Shihao Zhao and Xianbiao Qi and Jianan Wang and Yukai Shi and Qianyu Chen and Bin Liang and Kam-Fai Wong and Lei Zhang},
journal={ArXiv},
year={2024},
volume={abs/2403.12035},
url={https://arxiv.org/abs/2403.12035}
}
Acknowledgement
This code is based on AnimateDiff, Segment-Anything-2, and ProPainter.