# VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs (ICCV 2023)

<img src="assets/teaser.jpg" width="1000"/>
<div align="justify"> <b>Abstract</b>: We propose VidStyleODE, a spatiotemporally continuous disentangled video representation based upon StyleGAN and Neural-ODEs. Effective traversal of the latent space learned by Generative Adversarial Networks (GANs) has been the basis for recent breakthroughs in image editing. However, the applicability of such advancements to the video domain has been hindered by the difficulty of representing and controlling videos in the latent space of GANs. In particular, videos are composed of content (i.e., appearance) and complex motion components that require a special mechanism to disentangle and control. To achieve this, VidStyleODE encodes the video content in a pre-trained StyleGAN W+ space and benefits from a latent ODE component to summarize the spatiotemporal dynamics of the input video. Our novel continuous video generation process then combines the two to generate high-quality and temporally consistent videos with varying frame rates. We show that our proposed method enables a variety of applications on real videos: text-guided appearance manipulation, motion manipulation, image animation, and video interpolation and extrapolation. For more details, please visit our <a href='https://cyberiada.github.io/VidStyleODE/'>project webpage</a> or read our <a href='https://arxiv.org/abs/2304.06020'>paper</a>. </div>
## Environment Setup
- Initialize and activate a new conda environment by running
```bash
conda create -n vidstyleode python=3.10
conda activate vidstyleode
```
- Install the requirements by running
```bash
pip install -r requirements.txt
```
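Optionally, you can verify that the environment sees PyTorch and your GPU before moving on. This quick check is not part of the repository and only assumes that `requirements.txt` installs PyTorch:
```python
# Optional environment check (assumes requirements.txt installs PyTorch).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```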
## Dataset Preparation
### Downloading and Arranging Training Datasets
Please refer to the RAVDESS and Fashion Dataset official websites for instructions on downloading the datasets used in the paper. You may also experiment with your own dataset. The datasets should be arranged with the following structure:
```
Folder1
    Video_1.mp4
    Video_2.mp4
    ...
Folder2
    Video_1.mp4
    Video_2.mp4
    ...
```
It is recommended to extract the video frames for easier training. To extract the frames, please run the following command:
```bash
python scripts/extract_video_frames.py \
    --source_directory <path-to-video-directory> \
    --target_directory <path-to-output-target-directory>
```
The output folder will have the following structure:
```
Folder1_1
    000.png
    001.png
    ...
Folder1_2
    000.png
    001.png
    ...
```
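If you want to see what this step amounts to, the sketch below is a minimal OpenCV-based stand-in for the frame-extraction step, assuming the `<Folder>_<index>` output naming shown above; it is illustrative only, so prefer the provided `scripts/extract_video_frames.py`.
```python
# Minimal illustrative frame extractor (a stand-in, not the repository's script).
# Assumes OpenCV is installed: pip install opencv-python
from pathlib import Path
import cv2

def extract_frames(source_directory: str, target_directory: str) -> None:
    source, target = Path(source_directory), Path(target_directory)
    for folder in sorted(p for p in source.iterdir() if p.is_dir()):
        for idx, video_path in enumerate(sorted(folder.glob("*.mp4")), start=1):
            # One output folder per video, e.g. Folder1_1, Folder1_2, ... (naming assumed)
            out_dir = target / f"{folder.name}_{idx}"
            out_dir.mkdir(parents=True, exist_ok=True)
            cap = cv2.VideoCapture(str(video_path))
            frame_id = 0
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                cv2.imwrite(str(out_dir / f"{frame_id:03d}.png"), frame)
                frame_id += 1
            cap.release()
```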
### Setup StyleGAN2 Generator
- Our method relies on a pretrained StyleGAN2 generator. Please download your pretrained generator checkpoint and provide its path in the training configuration file.
- For face videos (RAVDESS), we relied on the rosinality pretrained checkpoint. A converted checkpoint can be accessed from the StyleCLIP official repository, which can be downloaded from here.
- For full-body videos (Fashion Dataset), we relied on the pretrained checkpoint provided by StyleGAN-Human.
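If you are unsure which entry of a downloaded checkpoint actually holds the generator weights (rosinality-style checkpoints typically bundle several state dicts), a quick optional inspection like the following can help before you point the config at it:
```python
# Optional: inspect a downloaded StyleGAN2 checkpoint before referencing it in the config.
import torch

# On newer PyTorch versions you may need weights_only=False for older pickled checkpoints.
ckpt = torch.load("<path-to-stylegan2-checkpoint>", map_location="cpu")
if isinstance(ckpt, dict):
    # rosinality-style checkpoints usually expose keys such as 'g', 'd', 'g_ema'
    print("Top-level keys:", list(ckpt.keys()))
else:
    print("Loaded object of type:", type(ckpt))
```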
### Setup StyleGAN2 Inversion
- For memory efficiency and to reduce computation during training, we precompute the StyleGAN W+ embedding vectors.
- Frame Preprocessing: It is important to center-align your video frames before applying inversion. This is because StyleGAN generators usually generate aligned frames, and image inversion pipelines typically center-align images before applying the StyleGAN inversion. If your videos are not center-aligned, please replace your video frames with their aligned versions.
- We rely on the official checkpoint of the pSp Inversion for our experiments on face videos (RAVDESS), and on the official checkpoint from StyleGAN-Human for our experiments on full-body videos (Fashion Dataset).
- Please refer to their official repositories for instructions on extracting the StyleGAN2 W+ embeddings. An embedding vector is typically of the shape `1 x 18 x hidden_dims`.
- The embeddings should be saved as `.pt` files and arranged in a structure similar to the video frames:
```
Folder1_1
    000.pt
    001.pt
    ...
Folder1_2
    000.pt
    001.pt
    ...
```
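For reference, a minimal sketch of producing this layout is shown below. The `encode_frame` callable is a placeholder for whichever inversion pipeline you use (e.g. pSp or the StyleGAN-Human encoder) and is not provided here; only the per-frame `.pt` layout and the `1 x 18 x hidden_dims` shape come from the instructions above.
```python
# Sketch: save one W+ code per frame, mirroring the frame folder structure.
# `encode_frame` is a placeholder for your inversion pipeline; it is not defined here.
from pathlib import Path
import torch

def save_wplus_codes(frames_root: str, inversion_root: str, encode_frame) -> None:
    frames_root, inversion_root = Path(frames_root), Path(inversion_root)
    for folder in sorted(p for p in frames_root.iterdir() if p.is_dir()):
        out_dir = inversion_root / folder.name
        out_dir.mkdir(parents=True, exist_ok=True)
        for frame_path in sorted(folder.glob("*.png")):
            w_plus = encode_frame(frame_path)  # expected shape: 1 x 18 x hidden_dims
            torch.save(w_plus.cpu(), out_dir / f"{frame_path.stem}.pt")  # 000.pt, 001.pt, ...
```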
### (Optional) Setup Textual Descriptions
To enable style editing, you need to provide a textual description for each training video. Please store these descriptions in a file named `text_descriptions.txt` within the corresponding video frames folder. For example:
```
Folder1_1
    000.pt
    001.pt
    ...
    text_descriptions.txt
```
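Before moving on to training, a quick consistency check over the layout described above can save time. The helper below is not part of the repository; it assumes frames and inversions live in per-video folders with the same names and that `text_descriptions.txt` sits in the frames folder, as described above.
```python
# Optional layout check (not part of the repository).
from pathlib import Path

def check_layout(img_root: str, inversion_root: str) -> None:
    img_root, inversion_root = Path(img_root), Path(inversion_root)
    for folder in sorted(p for p in img_root.iterdir() if p.is_dir()):
        frames = sorted(f.stem for f in folder.glob("*.png"))
        codes = sorted(f.stem for f in (inversion_root / folder.name).glob("*.pt"))
        if frames != codes:
            print(f"[warn] {folder.name}: frames and W+ .pt files do not match")
        if not (folder / "text_descriptions.txt").exists():
            # Optional: only needed if you want text-guided style editing.
            print(f"[info] {folder.name}: no text_descriptions.txt")
```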
### Training and Validation Split
- Prepare a `.txt` file containing the video folder names for training and validation.
- Our splits for RAVDESS and the Fashion Dataset are provided under the data folder.
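If you build splits for your own dataset, each line of these files is simply a video folder name. The helper below is hypothetical; match the exact naming used in the provided split files under the data folder.
```python
# Hypothetical helper for writing custom train/validation split files.
import random
from pathlib import Path

def write_splits(img_root: str, train_txt: str, val_txt: str, val_fraction: float = 0.1) -> None:
    folders = sorted(p.name for p in Path(img_root).iterdir() if p.is_dir())
    random.Random(0).shuffle(folders)  # fixed seed for a reproducible split
    n_val = max(1, int(len(folders) * val_fraction))
    Path(val_txt).write_text("\n".join(folders[:n_val]) + "\n")
    Path(train_txt).write_text("\n".join(folders[n_val:]) + "\n")
```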
## Training
- Prepare a `.yaml` configuration file where you need to specify the video frames directory under `img_root`, the W+ inversion folder under `inversion_root`, and the training and validation `.txt` files under `video_list` (see the illustrative config fragment at the end of this section).
- Our config files for the RAVDESS and Fashion Dataset are provided under the configs folder.
- To start the training, run the following command:
```bash
python main.py --name <tag-for-your-experiment> \
    --base <path-to-config-file>
```
- To resume the training, run the following command:
```bash
python main.py --name <tag-for-your-experiment> \
    --base <path-to-config-file> \
    --resume <path-to-log-directory-or-checkpoint>
```
By default, the training checkpoints and figures will be logged under the `logs` folder as well as to wandb. Therefore, please log in to wandb by running
```bash
wandb login
```
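For orientation, the fragment below sketches where the three data paths mentioned above could live in such a config. Only the key names `img_root`, `inversion_root`, and `video_list` come from this README; the surrounding structure is assumed, so please mirror the provided files under the configs folder for the actual schema.
```yaml
# Illustrative fragment only -- see the provided configs for the real schema.
data:
  img_root: <path-to-extracted-frames-root>        # per-video folders of frames
  inversion_root: <path-to-wplus-inversions-root>  # matching per-video folders of .pt W+ codes
  video_list: <path-to-train-or-val-split-txt>     # split file with video folder names
```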
## Applications
### Image Animation
To generate image animation results using the motion from a driving video, please run the following script:
```bash
# --spv: driving videos will be chosen randomly
python scripts/image_animation.py \
    --model_dir <log-dir-to-pretrained-model> \
    --n_samples <number-of-samples-to-generate> \
    --output_dir <path-to-save-dir> \
    --n_frames <num-of-frames-to-generate-per-video> \
    --spv <num-of-driving-videos-per-sample> \
    --video_list <txt-file-of-possible-target-videos> \
    --img_root <path-to-videos-root-dir> \
    --inversion_root <path-to-frames-inversion-root-dir>
```
### Appearance Manipulation
Instructions will be added later.
### Frame Interpolation
Instructions will be added later.
### Frame Extrapolation
Instructions will be added later.
## Citation
If you find this paper useful in your research, please consider citing:
```bibtex
@misc{ali2023vidstyleodedisentangledvideoediting,
      title={VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs},
      author={Moayed Haji Ali and Andrew Bond and Tolga Birdal and Duygu Ceylan and Levent Karacan and Erkut Erdem and Aykut Erdem},
      year={2023},
      eprint={2304.06020},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2304.06020},
}
```