Awesome
FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting
By Rui Liu, Hanming Deng, Yangyi Huang, Xiaoyu Shi, Lewei Lu, Wenxiu Sun, Xiaogang Wang, Jifeng Dai, Hongsheng Li.
This repo is the official Pytorch implementation of FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting.
Introduction
<img src='imgs/pipeline.png' width="900px">Usage
Prerequisites
- Python >= 3.6
- Pytorch >= 1.0 and corresponding torchvision (https://pytorch.org/)
Install
- Clone this repo:
git clone https://github.com/ruiliu-ai/FuseFormer.git
- Install other packages:
cd FuseFormer
pip install -r requirements.txt
Training
Dataset preparation
Download datasets (YouTube-VOS and DAVIS) into the data folder.
mkdir data
Note: We use YouTube Video Object Segmentation dataset 2019 version.
Training script
python train.py -c configs/youtube-vos.json
Test
Download pre-trained model into checkpoints folder.
mkdir checkpoints
Test script
python test.py -c checkpoints/fuseformer.pth -v data/DAVIS/JPEGImages/blackswan -m data/DAVIS/Annotations/blackswan
Evaluation
You can follow free-form mask generation scheme for synthesizing random masks.
Or just download our prepared stationary masks and unzip it to data folder.
mv random_mask_stationary_w432_h240 data/
mv random_mask_stationary_youtube_w432_h240 data/
Then you need to download pre-trained model for evaluating VFID.
mv i3d_rgb_imagenet.pt checkpoints/
Evaluation script
python evaluate.py --model fuseformer --ckpt checkpoints/fuseformer.pth --dataset davis --width 432 --height 240
python evaluate.py --model fuseformer --ckpt checkpoints/fuseformer.pth --dataset youtubevos --width 432 --height 240
For evaluating warping error, please refer to https://github.com/phoenix104104/fast_blind_video_consistency
Citing FuseFormer
If you find FuseFormer useful in your research, please consider citing:
@InProceedings{Liu_2021_FuseFormer,
title={FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting},
author={Liu, Rui and Deng, Hanming and Huang, Yangyi and Shi, Xiaoyu and Lu, Lewei and Sun, Wenxiu and Wang, Xiaogang and Dai, Jifeng and Li, Hongsheng},
booktitle = {International Conference on Computer Vision (ICCV)},
year={2021}
}
Acknowledement
This code borrows heavily from the video inpainting framework spatial-temporal transformer net.