# [CVPR'23] AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

:book: Paper: CVPR'23 and arXiv

Our paper (AdaMAE) has been accepted for presentation at CVPR'23.

## :bulb: Contributions

## Method

*(Figure: mask-vis-1 — AdaMAE method overview.)*
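At its core, AdaMAE replaces random masking with sampling: a lightweight network predicts a categorical distribution over the video tokens (the CAT panels in the visualizations below), and the visible tokens are drawn from it without replacement. Below is a minimal PyTorch sketch of that sampling step; the helper name `adaptive_mask` and its signature are illustrative, not the repo's API.

```python
import torch

def adaptive_mask(probs: torch.Tensor, mask_ratio: float = 0.95):
    """Sample visible-token indices from a categorical distribution.

    probs: (B, N) per-token selection probabilities from the sampling network.
    Returns the visible indices and a boolean mask over the N tokens
    (True = masked, i.e., to be reconstructed).
    """
    B, N = probs.shape
    n_visible = int(N * (1.0 - mask_ratio))
    # Sampling without replacement: tokens in high-information regions get
    # higher probability and are therefore more likely to stay visible.
    visible_idx = torch.multinomial(probs, n_visible, replacement=False)
    mask = torch.ones(B, N, dtype=torch.bool, device=probs.device)
    mask.scatter_(1, visible_idx, False)
    return visible_idx, mask
```

Note the high default ratio in the sketch: the paper reports that adaptive sampling allows masking ratios as high as 95%.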

Adaptive mask visualizations on SSv2 (samples from the 50th pre-training epoch):

Columns in each of the two side-by-side examples, left to right: Video, Prediction (Pred.), Error, Categorical distribution (CAT), and Mask.
<p float="left"> <img src="figs/ssv2-mask-vis-1.gif" width="410" /> <img src="figs/ssv2-mask-vis-2.gif" width="410" /> </p>
<p float="left"> <img src="figs/ssv2-mask-vis-3.gif" width="410" /> <img src="figs/ssv2-mask-vis-4.gif" width="410" /> </p>
<p float="left"> <img src="figs/ssv2-mask-vis-5.gif" width="410" /> <img src="figs/ssv2-mask-vis-6.gif" width="410" /> </p>
<p float="left"> <img src="figs/ssv2-mask-vis-7.gif" width="410" /> <img src="figs/ssv2-mask-vis-8.gif" width="410" /> </p>
<p float="left"> <img src="figs/ssv2-mask-vis-9.gif" width="410" /> <img src="figs/ssv2-mask-vis-10.gif" width="410" /> </p>
<p float="left"> <img src="figs/ssv2-mask-vis-11.gif" width="410" /> <img src="figs/ssv2-mask-vis-12.gif" width="410" /> </p>

Adaptive mask visualizations on K400 (samples from the 50th pre-training epoch):

Columns as above: Video, Pred., Error, CAT, and Mask.
<p float="left"> <img src="figs/k400-mask-vis-1.gif" width="410" /> <img src="figs/k400-mask-vis-2.gif" width="410" /> </p>
<p float="left"> <img src="figs/k400-mask-vis-3.gif" width="410" /> <img src="figs/k400-mask-vis-4.gif" width="410" /> </p>
<p float="left"> <img src="figs/k400-mask-vis-5.gif" width="410" /> <img src="figs/k400-mask-vis-6.gif" width="410" /> </p>
<p float="left"> <img src="figs/k400-mask-vis-7.gif" width="410" /> <img src="figs/k400-mask-vis-8.gif" width="410" /> </p>
<p float="left"> <img src="figs/k400-mask-vis-9.gif" width="410" /> <img src="figs/k400-mask-vis-10.gif" width="410" /> </p>
<p float="left"> <img src="figs/k400-mask-vis-11.gif" width="410" /> <img src="figs/k400-mask-vis-12.gif" width="410" /> </p>

## A comparison with existing masking strategies

Comparison of our adaptive masking with existing random patch, tube, and frame masking at a masking ratio of 80%. Our adaptive masking selects more tokens from regions with high spatiotemporal information and only a small number of tokens from the background.

*(Figure: mask-type-comp — visual comparison of the four masking strategies.)*
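For contrast with the adaptive sampler sketched under Method, here are minimal versions of the three baseline schemes over a T×H×W token grid (hypothetical helpers, not the repo's implementation):

```python
import torch

def random_patch_mask(t: int, h: int, w: int, ratio: float) -> torch.Tensor:
    # Mask a uniformly random subset of all T*H*W tokens.
    n = t * h * w
    mask = torch.zeros(n, dtype=torch.bool)
    mask[torch.randperm(n)[: int(n * ratio)]] = True
    return mask.view(t, h, w)

def tube_mask(t: int, h: int, w: int, ratio: float) -> torch.Tensor:
    # Mask the same random spatial locations in every frame ("tubes").
    spatial = random_patch_mask(1, h, w, ratio)[0]
    return spatial.unsqueeze(0).expand(t, h, w).clone()

def frame_mask(t: int, h: int, w: int, ratio: float) -> torch.Tensor:
    # Mask whole frames at random.
    mask = torch.zeros(t, h, w, dtype=torch.bool)
    mask[torch.randperm(t)[: int(t * ratio)]] = True
    return mask
```

All three spread the mask uniformly (in space, space-time, or time), whereas the adaptive mask concentrates the visible tokens on high-information regions.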

## Ablation experiments on the SSv2 dataset

We use ViT-Base as the backbone for all experiments. MHA $(D=2, d=384)$ denotes our adaptive token sampling network with a depth of two and an embedding dimension of $384$. All pre-trained models are evaluated following the protocol described in Sec. 4 of the paper. The default configuration of AdaMAE is highlighted in gray. GPU memory consumption is reported for a batch size of 16 on a single GPU.

*(Figure: ssv2-ablations — ablation results table.)*
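As a reading aid for the MHA $(D=2, d=384)$ rows, the sketch below shows one plausible shape for such a sampling head: `depth` pre-norm residual multi-head-attention blocks over the token embeddings, followed by a linear layer that produces one selection logit per token. Class and parameter names are illustrative; see the repo for the actual module.

```python
import torch
import torch.nn as nn

class TokenSamplingNetwork(nn.Module):
    """Sketch of an MHA (D=2, d=384) adaptive token sampling head."""

    def __init__(self, dim: int = 384, depth: int = 2, num_heads: int = 6):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(depth)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])
        self.to_logit = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) tube embeddings of the input video.
        x = tokens
        for attn, norm in zip(self.blocks, self.norms):
            y = norm(x)
            x = x + attn(y, y, y, need_weights=False)[0]  # pre-norm residual MHA
        logits = self.to_logit(x).squeeze(-1)             # (B, N)
        return logits.softmax(dim=-1)                     # categorical over tokens
```

With `dim=384`, `depth=2`, and `num_heads=6`, this matches the MHA $(D=2, d=384)$ configuration in the table; its output feeds the sampling step sketched under Method.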

## Pre-training AdaMAE & fine-tuning
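The training scripts live in this repo; conceptually, one pre-training step looks like the sketch below. This is illustrative only: `sampler`, `encoder`, and `decoder` stand in for the actual modules, and the sampling loss is written as a REINFORCE-style surrogate (the paper trains the sampling network through the non-differentiable token selection by maximizing the expected reconstruction error; consult it for the exact objective).

```python
import torch
import torch.nn.functional as F

def pretrain_step(tokens, sampler, encoder, decoder, mask_ratio=0.95):
    """One conceptual AdaMAE pre-training step (illustrative names/signatures).

    tokens: (B, N, D) tube embeddings of a video clip.
    """
    B, N, D = tokens.shape
    probs = sampler(tokens)                                   # (B, N) categorical
    n_visible = int(N * (1.0 - mask_ratio))
    visible_idx = torch.multinomial(probs, n_visible, replacement=False)

    visible = torch.gather(tokens, 1, visible_idx.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                                 # encode visible tokens only
    pred = decoder(latent, visible_idx, N)                    # predictions for all N tokens

    # Per-token reconstruction error; the MAE loss is taken over masked tokens.
    err = F.mse_loss(pred, tokens, reduction="none").mean(dim=-1)   # (B, N)
    mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, visible_idx, False)
    recon_loss = err[mask].mean()

    # Token selection is non-differentiable, so the sampling network is
    # updated with the detached per-token errors as its learning signal.
    log_p = probs.clamp_min(1e-8).log()
    sampling_loss = -(log_p * err.detach())[mask].mean()
    return recon_loss, sampling_loss
```

The two losses are combined for the update (the repo may weight them differently); for fine-tuning, the decoder and sampler are discarded and the pre-trained encoder is trained with a classification head, as in VideoMAE.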

## Pre-trained model weights

## Acknowledgement

Our AdaMAE codebase builds on the implementation of the VideoMAE paper. We thank the authors of VideoMAE for making their code publicly available.

## Citation

```bibtex
@InProceedings{Bandara_2023_CVPR,
    author    = {Bandara, Wele Gedara Chaminda and Patel, Naman and Gholami, Ali and Nikkhah, Mehdi and Agrawal, Motilal and Patel, Vishal M.},
    title     = {AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning With Masked Autoencoders},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {14507-14517}
}
```