SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners

This is the official PyTorch/GPU implementation of the paper SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners.

TL;DR

Supervised MAE (SupMAE) extends MAE by adding a supervised classification branch. SupMAE is efficient: it achieves performance comparable to MAE with only 30% of the compute. It is also more robust on ImageNet variants and outperforms both MAE and standard supervised pre-training counterparts in transfer learning.

:one: SupMAE is more training-efficient

(Figure: SupMAE performance comparison)

:two: SupMAE is more robust

| dataset | MAE | DeiT | SupMAE (ours) |
|:---|:---:|:---:|:---:|
| IN-Corruption ↓ | 51.7 | 47.4 | 48.1 |
| IN-Adversarial | 35.9 | 27.9 | 35.5 |
| IN-Rendition | 48.3 | 45.3 | 51.0 |
| IN-Sketch | 34.5 | 32.0 | 36.0 |
| Score | 41.8 | 39.5 | 43.6 |

Note: the score is the average across the four variants (for IN-Corruption, where lower is better, we use 100 - error).
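
The Score row can be reproduced directly from the table above; here is a quick sanity check in Python:

```python
# Average over the four variants, converting the IN-Corruption error
# into an accuracy-style number via 100 - error (lower error is better).
def robustness_score(in_c_error, in_a, in_r, in_sk):
    return ((100 - in_c_error) + in_a + in_r + in_sk) / 4

print(robustness_score(51.7, 35.9, 48.3, 34.5))  # MAE    -> ~41.8
print(robustness_score(47.4, 27.9, 45.3, 32.0))  # DeiT   -> ~39.5
print(robustness_score(48.1, 35.5, 51.0, 36.0))  # SupMAE -> ~43.6
```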

:three: SupMAE learns more transferable features

Few-shot learning on 20 classification datasets
| Checkpoint | Method | Supervision | 5-shot | 20-shot | 50-shot |
|:---|:---|:---|:---:|:---:|:---:|
| Linear Probe | MAE | Self-Sup. | 33.37 ± 1.98 | 48.03 ± 2.70 | 58.26 ± 0.84 |
| Linear Probe | MoCo-v3 | Self-Sup. | 50.17 ± 3.43 | 61.99 ± 2.51 | 69.71 ± 1.03 |
| Linear Probe | SupMAE (ours) | Sup. | 47.97 ± 0.44 | 60.86 ± 0.31 | 66.68 ± 0.47 |
| Fine-tune | MAE | Self-Sup. | 36.10 ± 3.25 | 54.13 ± 3.86 | 65.86 ± 2.42 |
| Fine-tune | MoCo-v3 | Self-Sup. | 39.30 ± 3.84 | 58.75 ± 5.55 | 70.33 ± 1.64 |
| Fine-tune | SupMAE (ours) | Sup. | 46.76 ± 0.12 | 64.61 ± 0.82 | 71.71 ± 0.66 |

Note: we use the Elevater_Toolkit_IC (highly recommended)!

Semantic segmentation on ADE20K

| method | mIoU | aAcc | mAcc |
|:---|:---:|:---:|:---:|
| Naive supervised | 47.4 | - | - |
| MAE | 48.6 | 82.8 | 59.4 |
| SupMAE (ours) | 49.0 | 82.7 | 60.2 |

Note: we use mmsegmentation.

Abstract

Recently, self-supervised Masked Autoencoders (MAE) have attracted unprecedented attention for their impressive representation learning ability. However, the pretext task, Masked Image Modeling (MIM), reconstructs the missing local patches, lacking a global understanding of the image. This paper extends MAE to a fully supervised setting by adding a supervised classification branch, thereby enabling MAE to effectively learn global features from golden labels. The proposed Supervised MAE (SupMAE) exploits only a visible subset of image patches for classification, unlike standard supervised pre-training, where all image patches are used. Through experiments, we demonstrate that SupMAE is not only more training-efficient but also learns more robust and transferable features.
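
For readers who want the gist in code, below is a minimal, hypothetical PyTorch sketch of the idea, not the official implementation. The encoder/decoder interface mirrors the MAE codebase this repo builds on; the class name `SupMAESketch`, the mean-pooled classification head, and the loss weighting `cls_weight` are assumptions (see the paper for the exact head design and weighting).

```python
import torch.nn as nn
import torch.nn.functional as F

class SupMAESketch(nn.Module):
    """Hypothetical sketch: an MAE-style encoder/decoder pair plus a
    supervised classification branch over the visible-patch features."""

    def __init__(self, encoder, decoder, embed_dim=768, num_classes=1000,
                 cls_weight=1.0):
        super().__init__()
        self.encoder = encoder        # ViT encoder, sees visible patches only
        self.decoder = decoder        # lightweight decoder for reconstruction
        self.cls_head = nn.Linear(embed_dim, num_classes)
        self.cls_weight = cls_weight  # loss weighting is an assumption

    def forward(self, imgs, target_patches, labels, mask_ratio=0.75):
        # The encoder is assumed to return (latent, mask, ids_restore),
        # mirroring MAE; `mask` is 1 on masked patches, 0 on visible ones.
        latent, mask, ids_restore = self.encoder(imgs, mask_ratio)

        # MIM branch: per-patch MSE, averaged over masked patches only.
        pred = self.decoder(latent, ids_restore)
        loss_recon = ((pred - target_patches) ** 2).mean(dim=-1)
        loss_recon = (loss_recon * mask).sum() / mask.sum()

        # Supervised branch: classify from the visible tokens (mean pool),
        # so classification never sees the masked-out patches.
        logits = self.cls_head(latent.mean(dim=1))
        loss_cls = F.cross_entropy(logits, labels)

        return loss_recon + self.cls_weight * loss_cls
```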

Catalog

Pre-trained checkpoints & logs

Due to computation constraints, we only test the ViT-B/16 model.

|  | Pre-training | Fine-tuning |
|:---|:---:|:---:|
| checkpoint | ckpt <br /> md5: <tt>d83c8a</tt> | ckpt <br /> md5: <tt>1fb748</tt> |
| logs | log | log |
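
To sanity-check a download against the md5 prefixes listed above, a small Python helper works (the checkpoint file name below is hypothetical):

```python
import hashlib

def md5_prefix(path, length=6):
    """Compute the first `length` hex digits of a file's md5."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()[:length]

# File name is a placeholder; use whatever name you saved the ckpt under.
assert md5_prefix("supmae_pretrain_vit_base.pth") == "d83c8a"
```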

Pre-training

The pre-training instruction is in PRETRAIN.md.

Fine-tuning

The fine-tuning instruction is in FINETUNE.md.

Citation

If you find this repository helpful, please consider citing our work:

    @article{liang2022supmae,
      title={SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners},
      author={Liang, Feng and Li, Yangguang and Marculescu, Diana},
      journal={arXiv preprint arXiv:2205.14540},
      year={2022}
    }

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.