Official PyTorch Implementation of VideoMAE (NeurIPS 2022 Spotlight).
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training<br> Zhan Tong, Yibing Song, Jue Wang, Limin Wang<br>Nanjing University, Tencent AI Lab
📰 News
[2023.4.18] Everyone can download Kinetics-400, which is used in VideoMAE, from this link.<br>
[2023.4.18] Code and pre-trained models of VideoMAE V2 have been released! Check and enjoy this repo!<br>
[2023.4.17] We propose EVAD, an end-to-end Video Action Detection framework.<br>
[2023.2.28] Our VideoMAE V2 is accepted by CVPR 2023! 🎉<br>
[2023.1.16] Code and pre-trained models for Action Detection in VideoMAE are available! <br>
[2022.12.27] Everyone can download extracted VideoMAE features of THUMOS, ActivityNet, HACS and FineAction from InternVideo.<br>
[2022.11.20] VideoMAE is integrated into Hugging Face Spaces, supported by @Sayak Paul.<br>
[2022.10.25] VideoMAE is integrated into MMAction2, and the results on Kinetics-400 can be reproduced successfully. <br>
[2022.10.20] The pre-trained models and scripts of ViT-S and ViT-H are available! <br>
[2022.10.19] The pre-trained models and scripts on UCF101 are available! <br>
[2022.9.15] VideoMAE is accepted by NeurIPS 2022 as a spotlight presentation! 🎉 <br>
[2022.8.8] VideoMAE is now integrated into the official 🤗HuggingFace Transformers (see the usage sketch after this news list)! <br>
[2022.7.7] We have added new results on the downstream AVA 2.2 benchmark. Please refer to our paper for details. <br>
[2022.4.24] Code and pre-trained models are available now! <br>
[2022.3.24] Code and pre-trained models will be released here. Please watch this repository for the latest updates.
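Since VideoMAE is available in 🤗HuggingFace Transformers (see the 2022.8.8 item above), a fine-tuned checkpoint can be loaded in a few lines. The sketch below is illustrative rather than official: it assumes the transformers package is installed and uses the public checkpoint MCG-NJU/videomae-base-finetuned-kinetics, with random frames standing in for a decoded 16-frame clip.

```python
# Minimal sketch of video classification through the 🤗 Transformers integration.
# The checkpoint name and the random frames are illustrative, not part of this repo.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

# 16 RGB frames (H, W, C); replace with frames sampled from a real video.
video = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted])
```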
✨ Highlights
🔥 Masked Video Modeling for Video Pre-Training
VideoMAE performs the task of masked video modeling for video pre-training. We propose an extremely high masking ratio (90%-95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training.
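To make the tube masking strategy concrete, here is a minimal NumPy sketch (an illustration under the paper's default settings of 224x224 input, 16x16 patches, 16 frames and tubelet size 2, not the repo's exact implementation): one random spatial mask is sampled and shared by every temporal tubelet, so each masked position becomes a tube that runs through the whole clip.

```python
# Rough sketch of tube masking; numbers follow the assumed defaults above.
import numpy as np

def tube_mask(num_tubelets=8, spatial_positions=14 * 14, mask_ratio=0.9, seed=None):
    rng = np.random.default_rng(seed)
    num_masked = int(mask_ratio * spatial_positions)
    spatial = np.zeros(spatial_positions, dtype=bool)
    spatial[rng.choice(spatial_positions, num_masked, replace=False)] = True
    # The same spatial mask is repeated for every tubelet, so masked positions
    # form "tubes" that extend through the whole clip.
    return np.tile(spatial, num_tubelets)

mask = tube_mask(seed=0)
print(f"{mask.sum()} / {mask.size} space-time tokens masked")  # 1408 / 1568, ~90%
```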
⚡️ A Simple, Efficient and Strong Baseline in SSVP
VideoMAE uses a simple masked autoencoder with a plain ViT backbone for self-supervised video learning. Thanks to the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than that of contrastive learning methods (3.2x speedup). VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training.
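To see where the efficiency comes from, here is a back-of-the-envelope token count (an illustrative sketch under the same default settings as above): with a 90% tube mask, the encoder only processes roughly 10% of the 1,568 space-time tokens, which is the main source of the pre-training speedup.

```python
# Token-count arithmetic under the assumed defaults
# (224x224 input, 16x16 patches, 16 frames, tubelet size 2); illustrative only.
frames, tubelet, patch, size, mask_ratio = 16, 2, 16, 224, 0.9

spatial = (size // patch) ** 2                 # 14 * 14 = 196 patches per tubelet
tubelets = frames // tubelet                   # 8 temporal tubelets
total = spatial * tubelets                     # 1568 space-time tokens
masked = int(mask_ratio * spatial) * tubelets  # 1408 tokens dropped by tube masking
visible = total - masked                       # 160 tokens fed to the encoder (~10%)

print(total, masked, visible)
```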
😮 High performance, but NO extra data required
VideoMAE works well for video datasets of different scales and achieves 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51. To the best of our knowledge, VideoMAE is the first to achieve state-of-the-art performance on these four popular benchmarks with vanilla ViT backbones, without requiring any extra data or pre-trained models.
🚀 Main Results
✨ Something-Something V2
Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
---|---|---|---|---|---|---|
VideoMAE | no | ViT-S | 224x224 | 16x2x3 | 66.8 | 90.3 |
VideoMAE | no | ViT-B | 224x224 | 16x2x3 | 70.8 | 92.4 |
VideoMAE | no | ViT-L | 224x224 | 16x2x3 | 74.3 | 94.6 |
VideoMAE | no | ViT-L | 224x224 | 32x1x3 | 75.4 | 95.2 |
✨ Kinetics-400
Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
---|---|---|---|---|---|---|
VideoMAE | no | ViT-S | 224x224 | 16x5x3 | 79.0 | 93.8 |
VideoMAE | no | ViT-B | 224x224 | 16x5x3 | 81.5 | 95.1 |
VideoMAE | no | ViT-L | 224x224 | 16x5x3 | 85.2 | 96.8 |
VideoMAE | no | ViT-H | 224x224 | 16x5x3 | 86.6 | 97.1 |
VideoMAE | no | ViT-L | 320x320 | 32x4x3 | 86.1 | 97.3 |
VideoMAE | no | ViT-H | 320x320 | 32x4x3 | 87.4 | 97.6 |
✨ AVA 2.2
Please check the code and checkpoints in VideoMAE-Action-Detection.
Method | Extra Data | Extra Label | Backbone | #Frame x Sample Rate | mAP |
---|---|---|---|---|---|
VideoMAE | Kinetics-400 | ✗ | ViT-S | 16x4 | 22.5 |
VideoMAE | Kinetics-400 | ✓ | ViT-S | 16x4 | 28.4 |
VideoMAE | Kinetics-400 | ✗ | ViT-B | 16x4 | 26.7 |
VideoMAE | Kinetics-400 | ✓ | ViT-B | 16x4 | 31.8 |
VideoMAE | Kinetics-400 | ✗ | ViT-L | 16x4 | 34.3 |
VideoMAE | Kinetics-400 | ✓ | ViT-L | 16x4 | 37.0 |
VideoMAE | Kinetics-400 | ✗ | ViT-H | 16x4 | 36.5 |
VideoMAE | Kinetics-400 | ✓ | ViT-H | 16x4 | 39.5 |
VideoMAE | Kinetics-700 | ✗ | ViT-L | 16x4 | 36.1 |
VideoMAE | Kinetics-700 | ✓ | ViT-L | 16x4 | 39.3 |
✨ UCF101 & HMDB51
Method | Extra Data | Backbone | UCF101 (Top-1) | HMDB51 (Top-1) |
---|---|---|---|---|
VideoMAE | no | ViT-B | 91.3 | 62.6 |
VideoMAE | Kinetics-400 | ViT-B | 96.1 | 73.3 |
🔨 Installation
Please follow the instructions in INSTALL.md.
➡️ Data Preparation
Please follow the instructions in DATASET.md for data preparation.
🔄 Pre-training
The pre-training instruction is in PRETRAIN.md.
⤴️ Fine-tuning with pre-trained models
The fine-tuning instruction is in FINETUNE.md.
📍 Model Zoo
We provide pre-trained and fine-tuned models in MODEL_ZOO.md.
👀 Visualization
We provide the script for visualization in vis.sh. A Colab notebook for better visualization is coming soon.
☎️ Contact
Zhan Tong: tongzhan@smail.nju.edu.cn
👍 Acknowledgements
Thanks to Ziteng Gao, Lei Chen, Chongjian Ge, and Zhiyu Zhao for their kind support.<br> This project is built upon MAE-pytorch and BEiT. Thanks to the contributors of these great codebases.
🔒 License
The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file. Portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license. BEiT is licensed under the MIT license.
✏️ Citation
If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:
@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

@article{videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  journal={arXiv preprint arXiv:2203.12602},
  year={2022}
}