
[CVPR'23] Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

A PyTorch implementation of TVC

Overview

TVC is an implementation of <br> "Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation" <br> Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, and Sean Bell <br> in Conference on Computer Vision and Pattern Recognition (CVPR) 2023

To model video together with language, we propose a temporal-aware VQGAN that represents each frame as visual tokens, converting it into the same discrete space as words. We present an effective masking strategy that masks different parts of the video for video-completion learning. The missing fragments are replaced by unique [SPAN] tokens, and we consider visual guidance from diverse time points. The multimodal encoder consumes the text and the partially missing video, and the decoder learns to produce the complete video from arbitrary guiding frames. By varying the masking conditions, MMVG (Multimodal Masked Video Generation) learns to utilize the [SPAN] token and unifies all TVC tasks during training.
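The masking step described above can be illustrated with a small, library-free sketch. It is not the repository's code: the helper name `mask_video`, the toy token ids, and the choice to collapse each contiguous run of missing frames into a single [SPAN] marker are all assumptions made for illustration.

```python
from typing import List, Sequence, Union

SPAN = "[SPAN]"  # marker name taken from the description above


def mask_video(frames: Sequence[List[int]],
               keep: Sequence[bool]) -> List[Union[int, str]]:
    """Flatten the kept frames into one token sequence.

    Assumption of this sketch: every contiguous run of missing frames
    is replaced by exactly one [SPAN] marker.
    """
    if len(frames) != len(keep):
        raise ValueError("frames and keep must have the same length")
    out: List[Union[int, str]] = []
    in_gap = False
    for tokens, kept in zip(frames, keep):
        if kept:
            out.extend(tokens)   # visible frame: keep its visual tokens
            in_gap = False
        elif not in_gap:
            out.append(SPAN)     # first missing frame of a run: emit one marker
            in_gap = True
    return out


# Four toy frames with two visual tokens each (real frames have many more).
video = [[11, 12], [21, 22], [31, 32], [41, 42]]

# Hypothetical masking conditions: guidance from different time points.
conditions = {
    "first frame only": [True, False, False, False],
    "last frame only": [False, False, False, True],
    "first and last frames": [True, False, False, True],
}

for name, keep in conditions.items():
    print(name, mask_video(video, keep))
```

Varying the `keep` pattern is what changes the completion task here; the encoder input always has the same form, with visible tokens interleaved with [SPAN] markers.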

Requirements

This code is implemented under Python 3.9, PyTorch 1.11, Torchvision 0.12, TorchMetrics 0.6, and Lightning 1.3.
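To confirm that an environment matches the versions listed above, a small check along the following lines may help. It is a sketch only: the distribution names (in particular `pytorch-lightning` for Lightning 1.3) are assumptions, and `find_mismatches` is a hypothetical helper, not part of this repository.

```python
from importlib import metadata
from typing import Callable, Dict

# Assumed PyPI distribution names mapped to the major.minor versions listed above.
REQUIRED: Dict[str, str] = {
    "torch": "1.11",
    "torchvision": "0.12",
    "torchmetrics": "0.6",
    "pytorch-lightning": "1.3",
}


def find_mismatches(required: Dict[str, str],
                    get_version: Callable[[str], str] = metadata.version
                    ) -> Dict[str, str]:
    """Return {package: installed version or 'not installed'} for each mismatch."""
    problems: Dict[str, str] = {}
    for name, wanted in required.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Compare major.minor only, so 1.11.0 and 1.11.0+cu113 both match 1.11.
        if installed.split(".")[:2] != wanted.split("."):
            problems[name] = installed
    return problems


if __name__ == "__main__":
    for name, found in find_mismatches(REQUIRED).items():
        print(f"{name}: expected {REQUIRED[name]}.x, found {found}")
```

Running the script prints nothing when every package matches.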

Since there is no obvious performance gap, we simplify the implementation and adopt VideoGPT in our MMVG.

Usage

Dataset

Put the dataset in ./_data.

show_data.ipynb

Inference

Put the checkpoint in ./_ckpt.

inference.ipynb

Citation

@inproceedings{fu2023tvc, 
  author = {Tsu-Jui Fu and Licheng Yu and Ning Zhang and Cheng-Yang Fu and Jong-Chyi Su and William Yang Wang and Sean Bell}, 
  title = {{Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation}}, 
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2023}
}

Acknowledgement

This code is based on Taming and TATS.