Home

Awesome

<div align="center">

StableV2V: Stablizing Shape Consistency in Video-to-Video Editing

Chang Liu, Rui Li, Kaidong Zhang, Yunwei Lan, Dong Liu

[Paper] / [Project] / [Models (Huggingface)] / [DAVIS-Edit (HuggingFace)] / [Models (wisemodel)] / [DAVIS-Edit (wisemodel)] / [Models (ModelScope)] / [DAVIS-Edit (ModelScope)]

</div> <!-- omit in toc -->

Table of Contents

If you have any questions about this work, please feel free to start a new issue or propose a PR.

<!-- omit in toc -->

Overview of StableV2V

StableV2V presents a novel paradigm to perform video editing in a shape-consistent manner, especially handling the editing scenarios when user prompts cause significant shape changes to the edited contents. Besides, StableV2V shows superior flexibility in handling a wide series of down-stream applications, considering various user prompts from different modalities.

<div align="center"> <video width="500" src="https://alonzoleeeooo.github.io/assets/github-teasor-comparison.mp4" autoplay loop muted></video> <video width="500" src="https://alonzoleeeooo.github.io/assets/github-teasor-applications.mp4" autoplay loop muted></video> </div>

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

News

<!-- omit in toc -->

To-Do List

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

Code Structure

StableV2V
├── LICENSE
├── README.md
├── assets
├── datasets                       <----- Code of datasets for training of the depth refinement network
├── models                         <----- Code of model definitions in different components
├── runners                        <----- Code of engines to run different components
├── inference.py                   <----- Script to inference StableV2V
├── train_completion_net.py        <----- Script to train the shape-guided depth completion network
└── utils                          <----- Code of toolkit functions

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

Prerequisites

<!-- omit in toc -->

1. Install the Dependencies

We offer an one-click command line to install all the dependencies that the code requires. First, create the virtual environment with conda:

conda create -n stablev2v python=3.10

Then, you can execute the following lines to install the dependencies with pip:

bash install_pip.sh

You can also install the dependencies with conda, following the command line below:

bash install_conda.sh

Then, you are ready to go with conda activate stablev2v.

<!-- omit in toc -->

2. Pre-trained Model Weights

Before you start the inference process, you need to prepare the model weights that StableV2V requires.

<details> <summary> We uploaded all model weights that `StableV2V` requires to our HuggingFace repo. Besides, you can also get access to them in their official releases, where we provide the corresponding details in the following table. </summary>
ModelComponentLink
Paint-by-ExamplePFEFantasy-Studio/Paint-by-Example
InstructPix2PixPFEtimbrooks/instruct-pix2pix
SD InpaintPFEbotp/stable-diffusion-v1-5-inpainting
ControlNet + SD InpaintPFEControlNet models at lllyasviel
AnyDoorPFExichenhku/AnyDoor
RAFTISAGoogle Drive
MiDaSISALink
U2-NetISALink
Depth Refinement NetworkISALink
SD v1.5CIGstable-diffusion-v1-5/stable-diffusion-v1-5
ControlNet (depth)CIGlllyasviel/control_v11f1p_sd15_depth
Ctrl-AdapterCIGhanlincs/Ctrl-Adapter (i2vgenxl_depth)
I2VGen-XLCIGali-vilab/i2vgen-xl
</details>

Once you downloaded all the model weights, put them in the checkpoints folder.

[!NOTE] If your network environment can get access to HuggingFace, you can directly use the HuggingFace repo ID to download the models. Otherwise we highly recommend you to prepare the model weights locally.

Specfically, make sure you modify the configuration file of AnyDoor at models/anydoor/configs/anydoor.yaml with the path of DINO-v2 pre-trained weights:

(at line 83)
cond_stage_config:
  target: models.anydoor.ldm.modules.encoders.modules.FrozenDinoV2Encoder
  weight: /path/to/dinov2_vitg14_pretrain.pth

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

Inference of StableV2V (Command Lines)

You may refer to the following command line to run StableV2V:

python inference.py --raft-checkpoint-path checkpoints/raft-things.pth --midas-checkpoint-path checkpoints/dpt_swin2_large_384.pt --u2net-checkpoint-path checkpoints/u2net.pth  --stable-diffusion-checkpoint-path stable-diffusion-v1-5/stable-diffusion-v1-5 --controlnet-checkpoint-path lllyasviel/control_v11f1p_sd15_depth --i2vgenxl-checkpoint-path ali-vilab/i2vgen-xl --ctrl-adapter-checkpoint-path hanlincs/Ctrl-Adapter --completion-net-checkpoint-path checkpoints/depth-refinement/50000.ckpt --image-editor-type paint-by-example --image-editor-checkpoint-path /path/to/image/editor --source-video-frames examples/frames/bear --external-guidance examples/reference-images/raccoon.jpg --prompt "a raccoon" --outdir results
<details><summary> For detailed illustrations of the arguments, please refer to the table below. </summary>
ArgumentDefault SettingRequired or NotExplanation
Model arguments---
--image-editor-type-YesArgument to define the image editor type.
--image-editor-checkpoint-path-YesPath of model weights for the image editor, required by PFE.
--raft-checkpoint-pathcheckpoints/raft-things.pthYesPath of model weights for RAFT, required by ISA.
--midas-checkpoint-pathcheckpoints/dpt_swin2_large_382.ptYesPath of model weights for MiDaS, required by ISA.
--u2net-checkpoint-pathcheckpoints/u2net.pthYesPath of model weights for U2-Net, required by ISA to obtain the segmentation masks of video frames (will be replaced by SAM in near future)
--stable-diffusion-checkpoint-pathstable-diffusion-v1-5/stable-diffusion-v1-5YesPath of model weights for SD v1.5, required by CIG.
--controlnet-checkpoint-pathlllyasviel/control_v11f1p_sd15_depthYesPath of model weights for ControlNet (depth) required by CIG.
--ctrl-adapter-checkpoint-pathhanlincs/Ctrl-AdapterYesPath of model weights for Ctrl-Adapter, required by CIG.
--i2vgenxl-checkpoint-pathali-vilab/i2vgen-xlYesPath of model weights for I2VGen-XL, required by CIG.
--completion-checkpoint-pathcheckpoints/depth-refinement/50000.ckptYesPath of model weights for I2VGen-XL, required by CIG.
Input Arguments---
--source-video-frames-YesPath of input video frames.
--prompt-YesText prompt of the edited video.
--external-guidance-YesExternal inputs for the image editors if you use Paint-by-Example, InstructPix2Pix, and AnyDoor.
--outdirresultsYesPath of output directory.
--edited-first-frame-NoPath of customized first edited frame, where the image editor will not be used if this argument is configured.
--input-condition-NoPath of cusromzied depth maps. We directly extract depth maps from the source video frames with MiDaS if this argument is not configured
--input-condition-NoPath of cusromzied depth maps. We directly extract depth maps from the source video frames with MiDaS if this argument is not configured.
--reference-masks-NoPath of segmentation masks of the reference image, required by AnyDoor. We will automatically extract segmentation mask from the reference image if this argument is not configured.
--image-guidance-scale1.0NoHyper-parameter required by InstructPix2Pix.
--kernel-size9NoKernel size of the binary dilation operation, to make sure that the pasting processes cover the regions of edited contents.
--dilation-iteration1NoIteration for binary dilation operation.
--guidance-scale9.0NoClassifier-free guidance scale.
--mixed-precisionbf16NoPrecision of models in StableV2V.
--n-sample-frames16NoNumber of video frames of the edited video.
--seed42NoRandom seed.
</details>

[!NOTE] Some specific points that you may pay additional attentions to while inferencing:

  1. By configuring --image-editor-checkpoint-path, the path will be automatically delievered to the corresponding editor according to your --image-editor-type. So please do not be worried about some extra arguments in the codebase.
  2. If you are using Paint-by-Example, InstructPix2Pix, AnyDoor, you are required to configure the --external-guidance argument, which corresponds to reference image and user instruction accordingly.
  3. Our method does not currently support xformers, which might cause artifacts in the produced results. Such issue might be fixed in the future if possible.
<!-- omit in toc -->

Performing Sketch-based Editing with StableV2V

<details> <summary> So far, we have not found an efficient way to perform the sketch-based editing within one command line, thus we showcase our way in doing so for reference, where the procedures are shown below. </summary>

1. Prepare the Hand-drawn Sketches through External Devices

To obtain the human-drawn sketches, you need to manually draw them on external devices such as a tablet, and then export the result for later uses. Particularly, we obtain the hand-drawn sketches on the iPad application Sketchbook. An example hand-drawn sketch might look like this:

<div align="center"> <img src="https://alonzoleeeooo.github.io/assets/bear-input-frame.png" width="400"/> <img src="https://alonzoleeeooo.github.io/assets/bear-elephant-sketch.png" width="400"/> </div>

2. Use ControlNet (scribble) to Generate the First Edited Frame

Once you obtain the hand-drawn sketch, the next step is to get the first edited frame. In doing so, we use ControlNet (scribble), where you need to prepare the model weights of ControlNet (scribble) and SD Inpaint in advance. Suppose we put the previously hand-drawn sketches at inputs/hand-drawn-sketches, you can execute the following command line by running ControlNet (scribble):

python scripts/inference_controlnet_inpaint.py --controlnet-checkpoint-path lllyasviel/control_v11p_sd15_scribble --stable-diffusion-checkpoint-path botp/stable-diffusion-v1-5-inpainting --prompt "an elephant" --input-mask inputs/masks/bear.png --controlnet-guidance inputs/hand-drawn-sketches/bear-elephant-sketch.png --outdir results/sketch-guided-result.png

The result might seem like:

<div align="center"> <img src="https://alonzoleeeooo.github.io/assets/bear-elephant-edited-first-frame.png" width="400"/> </div>

3. Use the First Edited Frame to Generate the Edited Video

Finally, you are ready to generate the entire edited video. We offer an example command line as follows:

python inference.py --raft-checkpoint-path checkpoints/raft-things.pth --midas-checkpoint-path checkpoints/dpt_swin2_large_384.pt --u2net-checkpoint-path checkpoints/u2net.pth  --stable-diffusion-checkpoint-path stable-diffusion-v1-5/stable-diffusion-v1-5 --controlnet-checkpoint-path lllyasviel/control_v11f1p_sd15_depth --i2vgenxl-checkpoint-path ali-vilab/i2vgen-xl --ctrl-adapter-checkpoint-path hanlincs/Ctrl-Adapter --completion-net-checkpoint-path checkpoints/depth-refinement/50000.ckpt --source-video-frames examples/frames/bear --edited-first-frame inputs/edited-first-frames/bear-elephant.png --prompt "an elephant walking on the rocks in a zoo" --outdir results 

By configuring the --edited-first-frame, the codebase will automatically skip the first-frame editing process, where we visualize the source video and the edited video below:

<div align="center"> <img width="400" src="https://alonzoleeeooo.github.io/assets/bear-elephant-source-video.gif"/> <img width="400" src="https://alonzoleeeooo.github.io/assets/bear-elephant-edited-video.gif"/> </div> </details>

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

Performing Video Inpainting with StableV2V

<details> <summary> The application of video inpainting has similar problem to that of sketch-based editing, we have not found integrated solution so far. Thus, we showcase how we perform such application in the following contents for potential reference. </summary>

1. (Optional) Dilate the Input Mask

Before you inpaint the first video frame, we recommend you dilate the annotated segmentation mask (if any) using the following script:

python scripts/run_dilate_mask.py --input-folder inputs/masks/car-turn.png --output-folder inputs/dilated-masks --kernel-size 15 --iterations 1

The original (left) and dilated (right) masks might look like:

<div align="center"> <img src="inputs/masks/car-turn.png" width="400"/> <img src="inputs/dilated-masks/car-turn.png" width="400"/> </div>

2. Use IOPaint to Generated the First Inpainted Frame

We recommend you to use the library IOPaint for convenient use. To install it, you can simply run:

pip install iopaint

Then, you are able to execute LaMa through the library:

iopaint run --model=lama --image inputs/frames/car-turn/00000.jpg --mask inputs/dilated-masks/car-turn.png --output inputs/edited-first-frames/

The original and inpainted first frames might look like:

<div align="center"> <img src="https://alonzoleeeooo.github.io/assets/car-turn-frame.jpg" width="400"/> <img src="https://alonzoleeeooo.github.io/assets/car-turn-inpainted.png" width="400"/> </div>

3. Use the First Edited Frame to Generate the Edited Video

Finally, you are ready to generate the entire edited video. We offer an example command line as follows:

python inference.py --raft-checkpoint-path checkpoints/raft-things.pth --midas-checkpoint-path checkpoints/dpt_swin2_large_384.pt --u2net-checkpoint-path checkpoints/u2net.pth  --stable-diffusion-checkpoint-path stable-diffusion-v1-5/stable-diffusion-v1-5 --controlnet-checkpoint-path lllyasviel/control_v11f1p_sd15_depth --i2vgenxl-checkpoint-path ali-vilab/i2vgen-xl --ctrl-adapter-checkpoint-path hanlincs/Ctrl-Adapter --completion-net-checkpoint-path checkpoints/depth-refinement/50000.ckpt --source-video-frames examples/frames/car-turn --edited-first-frame inputs/edited-first-frame/car-turn-inpainted.png --prompt "an elephant walking on the rocks in a zoo" --outdir results 

By configuring the --edited-first-frame, the codebase will automatically skip the first-frame editing process, where we visualize the source video and the edited video below:

<div align="center"> <img width="400" src="https://alonzoleeeooo.github.io/assets/car-turn-source-video.gif"/> <img width="400" src="https://alonzoleeeooo.github.io/assets/car-turn-edited-video.gif"/> </div> </details>

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

Inference of StableV2V (Gradio Demo)

We also offer a gradio demo to try StableV2V through interactive UI. Before you go, we recommend you to follow the instructions in this section to prepare all the required model weights locally (in the ./checkpoints folder). Then, feel free to test it out by simply running:

python app.py

In the following figure, we illustrate the functions of different modules in our Gradio demo:

<div align="center"> <img width="800" src="https://alonzoleeeooo.github.io/assets/gradio-demo.jpg"/> </div>

Please do not hesitate to start a new issue or propose a PR if have any further questions about the demo.

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

Details of DAVIS-Edit

<details> <summary> We illustrate more details of the curated testing benchmark `DAVIS-Edit` below, where you can get access to the dataset in our HuggingFace repo. </summary>

Data Structure

We construct DAVIS-Edit following the same data structure as the one of DAVIS, as is shown below:

DAVIS-Edit
├── Annotations                                 <----- Official annotated masks of DAVIS
  ├── bear
  ├── blackswan
  ├── ...
  └── train
├── JPEGImages                                  <----- Official video frames of DAVIS
  ├── bear
  ├── blackswan
  ├── ...
  └── train
  ├── ReferenceImages                           <----- Annotated reference images for image-based editing on DAVIS-Edit
  ├── bear.png
  ├── blackswan.png
  ├── ...
  └── train.png
├── .gitattributes
├── README.md
├── edited_video_caption_dict_image.json        <----- Annotated text descriptions for image-based editing on DAVIS-Edit
└── edited_video_caption_dict_text.json         <----- Annotated text descriptions for text-based editing on DAVIS-Edit

Specifically, edited_video_caption_dict_image.json and edited_video_caption_dict_text.json are constructed as Python dictionary, with its keys as the names of video folders in JPEGImages. For example in edited_video_caption_dict_text.json:

{
  "bear": {
    "original": "a bear walking on rocks in a zoo",
    "similar": "A panda walking on rocks in a zoo",
    "changing": "A rabbit walking on rocks in a zoo"
  },
# ...
}

The annotations of reference images contain two sub-folders, i.e., similar and changing, corresponding to the annotations for DAVIS-Edit-S and DAVIS-Edit-C, respectively, where the structure are constructed in the same folder name as that in JPEGImages.

How to use DAVIS-Edit?

We highly recommend you to index different elements in DAVIS-Edit through the annotation files. Particularly, you may refer to the script below:

import os
import json
from tqdm import tqdm
from PIL import Image

# TODO: Modify the configurations here to your local paths
frame_root = 'JPEGImages'
mask_root = 'Annotations'
reference_image_root = 'ReferenceImages/similar'            # Or 'ReferenceImages/changing'
annotation_file_path = 'edited_video_caption_dict_text.json'

# Load the annotation file
with open(annotation_file_path, 'r') as f:
  annotations = json.load(f)

# Iterate all data samples in DAVIS-Edit
for video_name in tqdm(annotations.keys()):

  # Load text prompts
  original_prompt = annotations[video_name]['original']
  similar_prompt = annotations[video_name]['similar']
  changing_prompt = annotations[video_name]['changing']

  # Load reference images
  reference_image = Image.open(os.path.join(reference_image_root, video_name + '.png'))

  # Load video frames
  video_frames = []
  for path in sorted(os.listdir(os.path.join(frame_root, video_name))):
    if path != 'Thumbs.db' and path != '.DS_store':
      video_frames.append(Image.open(os.path.join(frame_root, path)))

  # Load masks
  masks = []
  for path in sorted(os.listdir(os.path.join(mask_root, video_name))):
    if path != 'Thumbs.db' and path != '.DS_store':
      masks.append(Image.open(os.path.join(frame_root, path)))

# (add further operations that you expect in the lines below)
</details>

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

Training of the Shape-guided Depth Refinement Network

<details> <summary> We have open-sourced the pre-trained model weights of the proposed shape-guided depth refinement network in our HuggingFace repo, where you are free to use `StableV2V` with it. Meanwhile, we offer the specific procedures to perform cutomized training for the refinement network, with details listed below.</summary>

1. Download the YouTube-VOS Dataset

We use YouTube-VOS to conduct the training process of our shape-guided depth refinement network. Before you start the training process, you need to first download its source videos and annotations from this link. Once downloaded, the data follows the structures below:

youtube-vos
├── JPEGImages                     <----- Path of source video frames
├── Annotations                    <----- Path of segmentation masks
└── meta.json                      <----- Annotation file for the segmentation masks

2. Use MiDaS to Annotate the Depth Maps

Once the video frames are ready, the next step is to annotate their corresponding depth maps. Specifically, make sure you download the MiDaS model weights from this link. Then, you can execute the following command lines with our automatic script:

python scripts/extract_youtube_vos_depths.py --midas-path checkpoints/dpt_swin2_large_384.pt --dataset-path data/youtube-vos/JPEGImages --outdir data/youtube-vos/DepthMaps

3. Use U2-Net to Annotate the First-frame Shape Masks

Our depth refinement network uses an additional network channel to take the first-frame shape mask as guidance, thus you need to annotate them for the YouTube-VOS dataset. First, make sure you download the U2-Net model weights from this link. Then, you execute the following command lines with our automatic script:

python scripts/extract_youtube_vos_shapes.py --video-root data/youtube-vos/JPEGImages --model-dir checkpoints/u2net.pth --outdir data/youtube-vos/FirstFrameMasks

4. Train the Model

Finally, you are ready to execute the training process with the following command line:

python train_completion_net.py --video-path data/youtube-vos/JPEGImages --shape-path data/youtube-vos/FirstFrameMasks --max-train-steps 50000 --outdir results/shape-guided-depth-refinement --checkpoint-freq 2000 --validation-freq 200

The trained model weights will be saved at results/checkpoints, and the visualizations of intermediate results can be checked via tensorboard, with the logs saved at results/tensorboard.

</details>

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

Results

Please refer to our project page for more results and comparisons performed by StableV2V.

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

Citation

If you find this work helpful to your research, or use our testing benchmark DAVIS-Edit, please cite our paper:

@misc{liu-etal-2024-stablev2v,
      title={StableV2V: Stablizing Shape Consistency in Video-to-Video Editing}, 
      author={Chang Liu and Rui Li and Kaidong Zhang and Yunwei Lan and Dong Liu},
      year={2024},
      eprint={2411.11045},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

Star History

<p align="center"> <a href="hhttps://api.star-history.com/svg?repos=alonzoleeeooo/StableV2V&type=Date" target="_blank"> <img width="550" src="https://api.star-history.com/svg?repos=alonzoleeeooo/StableV2V&type=Date" alt="Star History Chart"> </a> </p>

<u><small><🎯Back to Table of Contents></small></u>

<!-- omit in toc -->

Acknowledgements

This repo is heavily modified based on Diffusers, Ctrl-Adapter, AnyDoor, and RAFT. We sincerely thanks the authors for their fantasitic implementations.

<u><small><🎯Back to Table of Contents></small></u>