<div align="center"> <h3>[NeurIPS 2022] MCMAE: Masked Convolution Meets Masked Autoencoders</h3>

Peng Gao<sup>1</sup>, Teli Ma<sup>1</sup>, Hongsheng Li<sup>2</sup>, Ziyi Lin<sup>2</sup>, Jifeng Dai<sup>3</sup>, Yu Qiao<sup>1</sup>

<sup>1</sup> Shanghai AI Laboratory, <sup>2</sup> MMLab, CUHK, <sup>3</sup> SenseTime Research.

</div>

* We have changed the project name from ConvMAE to MCMAE.

This repo is the official implementation of MCMAE: Masked Convolution Meets Masked Autoencoders. It currently includes code and models for the following tasks:

* ImageNet Pretrain: See PRETRAIN.md.
* ImageNet Finetune: See FINETUNE.md.
* Object Detection: See DETECTION.md.
* Semantic Segmentation: See SEGMENTATION.md.
* Video Classification: See VideoConvMAE.

Updates

14/Mar/2023

MR-MCMAE (a.k.a. ConvMAE-v2) paper released: Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking.

15/Sep/2022

Paper accepted at NeurIPS 2022.

9/Sep/2022

ConvMAE-v2 pretrained checkpoints are released.

21/Aug/2022

Official-ConvMAE-Det, which follows the official ViTDet codebase, is released.

08/Jun/2022

🚀FastConvMAE🚀: significantly accelerates pretraining (from ~4000 single-GPU hours to ~200 single-GPU hours). The code will be released at FastConvMAE.

27/May/2022

  1. The code for ImageNet-1K pretraining is provided.
  2. The code and models for semantic segmentation are provided.

20/May/2022

Update results on video classification.

16/May/2022

The code and models for COCO object detection and instance segmentation are available.

11/May/2022

  1. Pretrained ConvMAE models on ImageNet-1K are released.
  2. The code and models for ImageNet-1K finetuning and linear probing are provided.

08/May/2022

The preprint is available on arXiv.

Introduction

The ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer encoder can learn more discriminative representations via the masked auto-encoding scheme.
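
For intuition, the snippet below is a minimal, hypothetical PyTorch sketch of the multi-scale block-wise masking idea, not the repo's actual implementation: a random mask is sampled once at the coarsest (transformer-stage) resolution and upsampled to the earlier convolutional stages so that every stage hides the same image regions. The grid sizes (14/28/56 for a 224x224 input), the 75% mask ratio, and the `generate_multiscale_masks` helper are illustrative assumptions.

```python
# Simplified, hypothetical sketch of multi-scale block-wise masking
# (illustrative only; see the official code for the actual implementation).
import torch
import torch.nn.functional as F

def generate_multiscale_masks(batch_size, final_tokens=196, mask_ratio=0.75):
    """Sample a random mask on the coarse 14x14 grid (stage 3) and upsample it
    to the 28x28 (stage 2) and 56x56 (stage 1) convolutional feature maps,
    so all stages hide the same image regions."""
    side = int(final_tokens ** 0.5)          # 14 for a 224x224 input with 16x downsampling
    num_masked = int(mask_ratio * final_tokens)

    # 1 = masked, 0 = visible; sampled independently per image
    noise = torch.rand(batch_size, final_tokens)
    ids = noise.argsort(dim=1)
    mask = torch.zeros(batch_size, final_tokens)
    mask.scatter_(1, ids[:, :num_masked], 1.0)

    mask_s3 = mask.reshape(batch_size, 1, side, side)                 # 14 x 14
    mask_s2 = F.interpolate(mask_s3, scale_factor=2, mode="nearest")  # 28 x 28
    mask_s1 = F.interpolate(mask_s3, scale_factor=4, mode="nearest")  # 56 x 56
    return mask_s1, mask_s2, mask_s3

# During pretraining the convolutional stages multiply their features by
# (1 - mask) (masked convolution), so masked regions do not leak into the
# visible tokens processed by the transformer encoder and MAE-style decoder.
m1, m2, m3 = generate_multiscale_masks(batch_size=2)
print(m1.shape, m2.shape, m3.shape)
```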


Pretrain on ImageNet-1K

The following table provides pretrained checkpoints and logs used in the paper.

| Model | Pretrained checkpoints | Logs |
| --- | --- | --- |
| ConvMAE-Base | download | download |

The following results are for ConvMAE-v2 (pretrained for 200 epochs on ImageNet-1k).

| Model | Pretrained checkpoints | Fine-tuning acc. on ImageNet-1K (%) |
| --- | --- | --- |
| ConvMAE-v2-Small | download | 83.6 |
| ConvMAE-v2-Base | download | 85.7 |
| ConvMAE-v2-Large | download | 86.8 |
| ConvMAE-v2-Huge | download | 88.0 |
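
As a quick, hypothetical sanity check of a downloaded checkpoint before fine-tuning (the file name is a placeholder, and the `'model'` key follows the usual MAE-style checkpoint layout, which is an assumption here; FINETUNE.md documents the supported workflow):

```python
import torch

# Hypothetical inspection of a downloaded checkpoint; the path is a placeholder.
ckpt = torch.load("convmae_base_pretrained.pth", map_location="cpu")

# MAE-style checkpoints usually nest the weights under a 'model' key (assumption).
state_dict = ckpt.get("model", ckpt)
print(f"{len(state_dict)} parameter tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))

# The encoder weights can then be transferred into a fine-tuning model via
# model.load_state_dict(state_dict, strict=False); see FINETUNE.md for details.
```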

Main Results on ImageNet-1K

| Models | #Params (M) | Supervision | Encoder Ratio | Pretrain Epochs | FT acc@1 (%) | LIN acc@1 (%) | FT logs/weights | LIN logs/weights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEiT | 88 | DALLE | 100% | 300 | 83.0 | 37.6 | - | - |
| MAE | 88 | RGB | 25% | 1600 | 83.6 | 67.8 | - | - |
| SimMIM | 88 | RGB | 100% | 800 | 84.0 | 56.7 | - | - |
| MaskFeat | 88 | HOG | 100% | 300 | 83.6 | N/A | - | - |
| data2vec | 88 | RGB | 100% | 800 | 84.2 | N/A | - | - |
| ConvMAE-B | 88 | RGB | 25% | 1600 | 85.0 | 70.9 | log/weight | |

Main Results on COCO

Mask R-CNN

| Models | Pretrain | Pretrain Epochs | Finetune Epochs | #Params (M) | FLOPs (T) | box AP | mask AP | logs/weights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-B | IN21K w/ labels | 90 | 36 | 109 | 0.7 | 51.4 | 45.4 | - |
| Swin-L | IN21K w/ labels | 90 | 36 | 218 | 1.1 | 52.4 | 46.2 | - |
| MViTv2-B | IN21K w/ labels | 90 | 36 | 73 | 0.6 | 53.1 | 47.4 | - |
| MViTv2-L | IN21K w/ labels | 90 | 36 | 239 | 1.3 | 53.6 | 47.5 | - |
| Benchmarking-ViT-B | IN1K w/o labels | 1600 | 100 | 118 | 0.9 | 50.4 | 44.9 | - |
| Benchmarking-ViT-L | IN1K w/o labels | 1600 | 100 | 340 | 1.9 | 53.3 | 47.2 | - |
| ViTDet | IN1K w/o labels | 1600 | 100 | 111 | 0.8 | 51.2 | 45.5 | - |
| MIMDet-ViT-B | IN1K w/o labels | 1600 | 36 | 127 | 1.1 | 51.5 | 46.0 | - |
| MIMDet-ViT-L | IN1K w/o labels | 1600 | 36 | 345 | 2.6 | 53.3 | 47.5 | - |
| ConvMAE-B | IN1K w/o labels | 1600 | 25 | 104 | 0.9 | 53.2 | 47.1 | log/weight |

Main Results on ADE20K

UperNet

| Models | Pretrain | Pretrain Epochs | Finetune Iters | #Params (M) | FLOPs (T) | mIoU | logs/weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeiT-B | IN1K w/ labels | 300 | 16K | 163 | 0.6 | 45.6 | - |
| Swin-B | IN1K w/ labels | 300 | 16K | 121 | 0.3 | 48.1 | - |
| MoCo V3 | IN1K | 300 | 16K | 163 | 0.6 | 47.3 | - |
| DINO | IN1K | 400 | 16K | 163 | 0.6 | 47.2 | - |
| BEiT | IN1K+DALLE | 1600 | 16K | 163 | 0.6 | 47.1 | - |
| PeCo | IN1K | 300 | 16K | 163 | 0.6 | 46.7 | - |
| CAE | IN1K+DALLE | 800 | 16K | 163 | 0.6 | 48.8 | - |
| MAE | IN1K | 1600 | 16K | 163 | 0.6 | 48.1 | - |
| ConvMAE-B | IN1K | 1600 | 16K | 153 | 0.6 | 51.7 | log/weight |

Main Results on Kinetics-400

| Models | Pretrain Epochs | Finetune Epochs | #Params (M) | Top-1 | Top-5 | logs/weights |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMAE-B | 200 | 100 | 87 | 77.8 | | |
| VideoMAE-B | 800 | 100 | 87 | 79.4 | | |
| VideoMAE-B | 1600 | 100 | 87 | 79.8 | | |
| VideoMAE-B | 1600 | 100 (w/ Repeated Aug) | 87 | 80.7 | 94.7 | |
| SpatioTemporalLearner-B | 800 | 150 (w/ Repeated Aug) | 87 | 81.3 | 94.9 | |
| VideoConvMAE-B | 200 | 100 | 86 | 80.1 | 94.3 | Soon |
| VideoConvMAE-B | 800 | 100 | 86 | 81.7 | 95.1 | Soon |
| VideoConvMAE-B-MSD | 800 | 100 | 86 | 82.7 | 95.5 | Soon |

Main Results on Something-Something V2

| Models | Pretrain Epochs | Finetune Epochs | #Params (M) | Top-1 | Top-5 | logs/weights |
| --- | --- | --- | --- | --- | --- | --- |
| VideoMAE-B | 200 | 40 | 87 | 66.1 | | |
| VideoMAE-B | 800 | 40 | 87 | 69.3 | | |
| VideoMAE-B | 2400 | 40 | 87 | 70.3 | | |
| VideoConvMAE-B | 200 | 40 | 86 | 67.7 | 91.2 | Soon |
| VideoConvMAE-B | 800 | 40 | 86 | 69.9 | 92.4 | Soon |
| VideoConvMAE-B-MSD | 800 | 40 | 86 | 70.7 | 93.0 | Soon |

Getting Started

Prerequisites

Training and evaluation

Visualization


Acknowledgement

The pretraining and finetuning code of this project is based on DeiT and MAE. The object detection and semantic segmentation parts are based on MIMDet and MMSegmentation, respectively. Thanks for their wonderful work.

License

ConvMAE is released under the MIT License.

Citation

```bibtex
@article{gao2022convmae,
  title={ConvMAE: Masked Convolution Meets Masked Autoencoders},
  author={Gao, Peng and Ma, Teli and Li, Hongsheng and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.03892},
  year={2022}
}
```