<div align="center"> <h1>UM-MAE</h1> <h3>Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality</h3>

Xiang Li, Wenhai Wang, Lingfeng Yang, Jian Yang

</div>

ImageNet Pretrain: See PRETRAIN.md.
ImageNet Finetune: See FINETUNE.md.
Object Detection: See DETECTION.md.
Semantic Segmentation: See SEGMENTATION.md.
Visualization: See Colab notebook.

@article{Li2022ummae,
  author  = {Li, Xiang and Wang, Wenhai and Yang, Lingfeng and Yang, Jian},
  journal = {arXiv:2205.10063},
  title   = {Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality},
  year    = {2022},
}

Updates

30/May/2022: Visualization code/demo is available in the Colab notebook.

26/May/2022: A Chinese-language blog post about this paper is available on Zhihu.

23/May/2022: The preprint is publicly available on arXiv.

Motivation

(a) In MAE, the global window of the vanilla ViT can receive an arbitrary subset of image patches by skipping a random 75% of the total, whereas (b) skipping these 75% of patches is unacceptable for Pyramid-based ViTs, as patch elements are not equivalent across the local windows. (c) A straightforward solution is to feed mask tokens to the encoder (e.g., SimMIM), at the cost of slower training. (d) Our Uniform Masking (UM) approach (comprising Uniform Sampling and Secondary Masking) enables efficient MAE-style pre-training for Pyramid-based ViTs while keeping their fine-tuning accuracy competitive.

<p align="center"> <img src="https://github.com/implus/UM-MAE/blob/main/figs/pipeline_cropped.png" width="480"> </p>

Introduction

UM-MAE is an efficient and general technique that supports MAE-style MIM Pre-training for popular Pyramid-based Vision Transformers (e.g., PVT, Swin).

Main Results on ImageNet-1K

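| Models | Pre-train Method | Sampling Strategy | Secondary Mask Ratio | Encoder Ratio | Pretrain Epochs | Pretrain Hours | FT acc@1 (%) | FT weight/log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B | MAE | RS | -- | 25% | 200 | todo | 82.88 | weight/log |
| ViT-B | MAE | UM | 25% | 25% | 200 | todo | 82.88 | weight/log |
| PVT-S | SimMIM | RS | -- | 100% | 200 | 38.0 | 79.28 | weight/log |
| PVT-S | UM-MAE | UM | 25% | 25% | 200 | 21.3 | 79.31 | weight/log |
| Swin-T | SimMIM | RS | -- | 100% | 200 | 49.3 | 82.20 | weight/log |
| Swin-T | UM-MAE | UM | 25% | 25% | 200 | 25.0 | 82.04 | weight/log |
| Swin-L | SimMIM | RS | -- | 100% | 800 | -- | 85.4 | link |
| Swin-L | UM-MAE | UM | 25% | 25% | 800 | todo | 85.2 | weight/log |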

RS: Random Sampling; UM: Uniform Masking, consisting of Uniform Sampling and Secondary Masking
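As a quick reading aid, the pre-training speedup of UM-MAE over SimMIM can be computed from the Pretrain Hours reported above for the two models where both values are given. The snippet only restates those table figures; the variable names are chosen for this example.

```python
# Pretrain hours copied from the results table (200-epoch runs).
pretrain_hours = {
    "PVT-S": {"SimMIM": 38.0, "UM-MAE": 21.3},
    "Swin-T": {"SimMIM": 49.3, "UM-MAE": 25.0},
}

# Speedup = SimMIM hours divided by UM-MAE hours.
speedups = {
    model: hours["SimMIM"] / hours["UM-MAE"]
    for model, hours in pretrain_hours.items()
}

for model, speedup in speedups.items():
    print(f"{model}: {speedup:.2f}x")
# PVT-S: 1.78x
# Swin-T: 1.97x
```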

Acknowledgement

The pretraining and finetuning of our project are based on DeiT, MAE and SimMIM. The object detection and semantic segmentation parts are based on MMDetection and MMSegmentation respectively. Thanks for their wonderful work.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.