[CVPR'23] Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
A PyTorch implementation of TVC
Paper | Project | Slide | Video
<img src='_imgs/intro.jpg' width='60%' />

Overview
TVC is an implementation of <br> "Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation" <br> Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, and Sean Bell <br> in Conference on Computer Vision and Pattern Recognition (CVPR) 2023
<img src='_imgs/mmvg.jpg' width='80%' />

To model video together with language, we propose a temporal-aware VQGAN that represents each frame as visual tokens, placing it in the same discrete space as the words. We present an effective masking strategy that masks different parts of the video for video completion learning. The missing fragments are replaced by the unique [SPAN] tokens, and visual guidance is taken from diverse time points. The multimodal encoder consumes the text and the partially missing video, and the decoder learns to produce the complete video from arbitrary guided frames. By varying the masking conditions, MMVG (Multimodal Masked Video Generation) learns to utilize the [SPAN] token and unifies all TVC tasks during training.
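The masking idea above can be illustrated with a minimal sketch in plain Python. This is not the code from this repository: the function name `mask_video`, the string placeholder `SPAN`, and the choice to collapse each contiguous run of missing frames into a single [SPAN] marker are all assumptions made for illustration. Frames are toy lists of integer token ids standing in for VQGAN visual tokens.

```python
from typing import List, Sequence, Union

# Hypothetical placeholder; a real model would reserve a token id for [SPAN].
SPAN = "[SPAN]"

Frame = List[int]  # one frame = a list of visual token ids (toy stand-in)


def mask_video(frames: Sequence[Frame], keep: Sequence[int]) -> List[Union[Frame, str]]:
    """Keep the frames whose indices are in `keep` and replace every
    contiguous run of missing frames with one SPAN marker (an assumption)."""
    keep_set = set(keep)
    out: List[Union[Frame, str]] = []
    for idx, frame in enumerate(frames):
        if idx in keep_set:
            out.append(list(frame))      # guided frame stays visible
        elif not out or out[-1] != SPAN:
            out.append(SPAN)             # start of a new missing fragment
        # otherwise: still inside the same missing fragment, emit nothing
    return out


# Five toy frames with two tokens each.
video = [[i, i + 100] for i in range(5)]

prediction = mask_video(video, keep=[0])     # guided by the first frame
rewind = mask_video(video, keep=[4])         # guided by the last frame
infilling = mask_video(video, keep=[0, 4])   # guided by first and last frames
arbitrary = mask_video(video, keep=[1, 3])   # guided by arbitrary time points

print(prediction)
print(rewind)
print(infilling)
print(arbitrary)
```

Changing only the `keep` indices switches between completion settings, which is the sense in which varying the masking conditions lets one model cover them all. The encoder input would be the text tokens followed by such a partially masked sequence, and the decoder target would be the full `video`.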
Requirements
This code is implemented under Python 3.9, Torch 1.11, Torchvision 0.12, TorchMetrics 0.6, and Lightning 1.3. <br>
Since there is no obvious performance gap, we simplify the implementation and adopt VideoGPT in our MMVG.
Usage
Dataset
show_data.ipynb
Inference
inference.ipynb
Citation
@inproceedings{fu2023tvc,
author = {Tsu-Jui Fu and Licheng Yu and Ning Zhang and Cheng-Yang Fu and Jong-Chyi Su and William Yang Wang and Sean Bell},
title = {{Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation}},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023}
}