MAGVIT: Masked Generative Video Transformer

[Paper] | [Project Page] | [Colab]

Official code and models for the CVPR 2023 paper:

MAGVIT: Masked Generative Video Transformer
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
CVPR 2023

Summary

We introduce MAGVIT, a single model that tackles a variety of video synthesis tasks, and demonstrate its quality, efficiency, and flexibility.

If you find this code useful in your research, please cite:

@inproceedings{yu2023magvit,
  title={{MAGVIT}: Masked generative video transformer},
  author={Yu, Lijun and Cheng, Yong and Sohn, Kihyuk and Lezama, Jos{\'e} and Zhang, Han and Chang, Huiwen and Hauptmann, Alexander G and Yang, Ming-Hsuan and Hao, Yuan and Essa, Irfan and Jiang, Lu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}

Disclaimers

Please note that this is not an officially supported Google product.

Checkpoints are based on training with publicly available datasets. Some datasets contain limitations, including non-commercial use limitations. Please review terms and conditions made available by third parties before using models and datasets provided.

Installation

A conda environment file is provided for running on GPUs. CUDA 11 and cuDNN 8.6 are required for JAX. This VM Image has been tested.

conda env create -f environment.yaml
conda activate magvit

Alternatively, you can install the dependencies via

pip install -r requirements.txt
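
After either install path, it can be worth confirming that JAX actually sees a GPU before loading any model. The following is a minimal sketch, not part of this repository: the helper name `describe_backend` is hypothetical, while `jax.devices()` and `jax.default_backend()` are standard JAX calls.

```python
def describe_backend():
    """Return a one-line summary of the active JAX backend.

    Hypothetical helper for illustration; it is not defined in this repo.
    """
    try:
        import jax
    except ImportError:
        # JAX is missing, so the environment was probably not activated.
        return "jax is not installed; activate the magvit environment first"
    devices = [str(d) for d in jax.devices()]
    return f"backend={jax.default_backend()} devices={devices}"


if __name__ == "__main__":
    print(describe_backend())
```

If the reported backend is `cpu` on a GPU machine, the installed CUDA or cuDNN version is a likely place to look first.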

Pretrained models

For the release of pretrained model weights, please see this note.

MAGVIT 3D-VQ models

| Model | Size | Input | Output | Codebook size | Dataset |
| --- | --- | --- | --- | --- | --- |
| 3D-VQ | B | 16 frames x 64x64 | 4x16x16 | 1024 | BAIR Robot Pushing |
| 3D-VQ | L | 16 frames x 64x64 | 4x16x16 | 1024 | BAIR Robot Pushing |
| 3D-VQ | B | 16 frames x 128x128 | 4x16x16 | 1024 | UCF-101 |
| 3D-VQ | L | 16 frames x 128x128 | 4x16x16 | 1024 | UCF-101 |
| 3D-VQ | B | 16 frames x 128x128 | 4x16x16 | 1024 | Kinetics-600 |
| 3D-VQ | L | 16 frames x 128x128 | 4x16x16 | 1024 | Kinetics-600 |
| 3D-VQ | B | 16 frames x 128x128 | 4x16x16 | 1024 | Something-Something-v2 |
| 3D-VQ | L | 16 frames x 128x128 | 4x16x16 | 1024 | Something-Something-v2 |
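
The Input and Output columns imply a fixed compression ratio per dataset. The sketch below only does arithmetic on the shapes listed in the table; the function names are hypothetical and nothing here calls the actual tokenizer.

```python
from math import log2, prod

# Shapes copied from the table: (frames, height, width) -> (t, h, w) token grid.
SHAPES = {
    "BAIR Robot Pushing": ((16, 64, 64), (4, 16, 16)),
    "UCF-101": ((16, 128, 128), (4, 16, 16)),
    "Kinetics-600": ((16, 128, 128), (4, 16, 16)),
    "Something-Something-v2": ((16, 128, 128), (4, 16, 16)),
}
CODEBOOK_SIZE = 1024  # from the "Codebook size" column


def compression(input_shape, token_shape):
    """Per-axis downsampling factor between a clip and its token grid."""
    return tuple(i // t for i, t in zip(input_shape, token_shape))


def tokens_per_clip(token_shape):
    """Number of discrete tokens that represent one clip."""
    return prod(token_shape)


def bits_per_clip(token_shape, codebook_size=CODEBOOK_SIZE):
    """Bits needed to store one tokenized clip with a fixed-length code."""
    return tokens_per_clip(token_shape) * log2(codebook_size)


if __name__ == "__main__":
    for name, (inp, out) in SHAPES.items():
        print(name, compression(inp, out), tokens_per_clip(out), bits_per_clip(out))
```

For the 128x128 datasets this gives a 4x temporal and 8x spatial reduction; for BAIR at 64x64 it is 4x on every axis. Each clip becomes 4 x 16 x 16 = 1024 tokens, a number that happens to equal the codebook size but is unrelated to it.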

MAGVIT transformers

Each transformer model must be used with its corresponding 3D-VQ tokenizer of the same dataset and model size.

| Model | Task | Size | Dataset | FVD |
| --- | --- | --- | --- | --- |
| Transformer | Class-conditional | B | UCF-101 | 159 |
| Transformer | Class-conditional | L | UCF-101 | 76 |
| Transformer | Frame prediction | B | BAIR Robot Pushing | 76 (48) |
| Transformer | Frame prediction | L | BAIR Robot Pushing | 62 (31) |
| Transformer | Frame prediction (5) | B | Kinetics-600 | 24.5 |
| Transformer | Frame prediction (5) | L | Kinetics-600 | 9.9 |
| Transformer | Multi-task-8 | B | BAIR Robot Pushing | 32.8 |
| Transformer | Multi-task-8 | L | BAIR Robot Pushing | 22.8 |
| Transformer | Multi-task-10 | B | Something-Something-v2 | 43.4 |
| Transformer | Multi-task-10 | L | Something-Something-v2 | 27.3 |
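
The pairing rule stated above the table (same dataset and same model size for tokenizer and transformer) can be checked up front before loading weights. This is a minimal sketch under assumptions: `Checkpoint` and `is_compatible` are hypothetical names for illustration and do not correspond to classes or functions in this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical descriptor of a released model (not a class from this repo)."""
    kind: str     # "3D-VQ" or "Transformer"
    size: str     # "B" or "L"
    dataset: str  # e.g. "UCF-101"


def is_compatible(tokenizer: Checkpoint, transformer: Checkpoint) -> bool:
    """True when the tokenizer and transformer share dataset and model size."""
    return (
        tokenizer.kind == "3D-VQ"
        and transformer.kind == "Transformer"
        and tokenizer.size == transformer.size
        and tokenizer.dataset == transformer.dataset
    )


if __name__ == "__main__":
    vq = Checkpoint("3D-VQ", "L", "UCF-101")
    good = Checkpoint("Transformer", "L", "UCF-101")
    bad = Checkpoint("Transformer", "B", "UCF-101")
    print(is_compatible(vq, good), is_compatible(vq, bad))
```

A check like this fails fast on a mismatched pair instead of producing degraded samples from a transformer that was trained against a different token vocabulary.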