# MAE for Self-supervised ViT
## Introduction
This is an unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-supervised ViT. This repo is mainly based on moco-v3, pytorch-image-models, BEiT and MAE-pytorch.
<img src="figures/mae.png" alt="image-mae" style="zoom: 33%;" />

## TODO
- visualization of reconstructed images
- linear probing
- k-NN classification
- more results
- more datasets
- transfer learning for detection and segmentation
- multi-node training
- ...
## Main Results
We support two representations (repre.) for classification: GAP (Global Average Pooling) and Cls-token. According to the paper, MAE works similarly well with both. In Cls-token mode, the class token is trained in the MAE encoder; a minimal sketch of the difference is shown below.

For k-NN evaluation, we use k=10 by default.
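For reference, here is a minimal sketch of how the two representations are typically read out from the ViT tokens, assuming the class token sits at index 0 (our illustration, not the repo's code):

```python
import torch

# Illustrative only: extract the classification feature from ViT token outputs of
# shape (batch, 1 + num_patches, dim), where index 0 is the class token.
def classification_feature(tokens: torch.Tensor, repre: str = "GAP") -> torch.Tensor:
    if repre == "GAP":
        return tokens[:, 1:, :].mean(dim=1)  # global average pooling over patch tokens
    if repre == "cls":
        return tokens[:, 0, :]               # use the class token
    raise ValueError(f"unknown representation: {repre}")
```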
### ViT-Small
pretrain epoch | repre. | ft. top1 | lin. | k-NN | config | weight | log |
---|---|---|---|---|---|---|---|
100 | GAP | 76.58% | 34.65% | 19.7% | pretrain finetune | pretrain finetune | pretrain finetune |
100 | Cls-token | 75.77% | 38.95% | 23.7% | pretrain finetune | pretrain finetune | pretrain finetune |
200 | GAP | 76.86% | 36.46% | 19.8% | pretrain finetune | pretrain finetune | pretrain finetune |
400 | GAP | 77.56% / 80.02% / 80.89% | 36.98% | 20.8% | pretrain finetune | pretrain finetune | pretrain finetune |
800 | GAP | 77.93% / 80.87% / 81.11% | 36.88% | 20.7% | pretrain finetune | pretrain finetune | pretrain finetune |
1600 | GAP | - | - | - | pretrain finetune | pretrain finetune | pretrain finetune |
- We fine-tune models for 50 epochs by default. For 400- and 800-epoch pre-training, we report 50 / 100 / 150 fine-tuning epochs (the provided logs and weights are for the 50-epoch setting).
- BaiduNetdisk (2lt1)
### ViT-Base
pretrain epoch | repre. | ft. top1 | k-NN | config | weight | log |
---|---|---|---|---|---|---|
400 | GAP | 83.08% | 28.9% | pretrain finetune | pretrain finetune | pretrain finetune |
- Following the paper, we fine-tune models for 100 epochs by default.
- BaiduNetdisk (k2ef)
### ViT-Large
pretrain epoch | repre. | ft. top1 | lin. | k-NN | config | weight | log |
---|---|---|---|---|---|---|---|
100 | GAP | 83.51% | 58.90% | 33.08% | pretrain finetune | pretrain finetune | pretrain finetune |
- Following the paper, we fine-tune models for 50 epochs by default.
- BaiduNetdisk (825g)
## Usage
### Preparation
The code has been tested with CUDA 11.4 and PyTorch 1.8.2.

Notes:
- The batch size specified by `-b` is the batch size per GPU.
- The learning rate specified by `--lr` is the base lr (corresponding to a total batch size of 256) and is adjusted by the linear lr scaling rule (see the sketch after these notes).
- Only multi-GPU DistributedDataParallel training is supported in this repo; single-GPU and DataParallel training are not supported.
- We support cls-token (token) and global average pooling (GAP) for classification. Make sure the representation used for pre-training matches the one used for fine-tuning / linear probing. In cls-token mode, the class token is trained in the encoder during pre-training.
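As a rough sketch of the linear lr scaling rule mentioned above (the function and variable names are ours, not taken from this repo):

```python
# Linear lr scaling rule: the effective lr grows linearly with the total batch size,
# relative to a reference batch size of 256. Illustrative helper, not the repo's code.
def scale_lr(base_lr: float, batch_size_per_gpu: int, num_gpus: int) -> float:
    total_batch_size = batch_size_per_gpu * num_gpus
    return base_lr * total_batch_size / 256

# Example: 8 GPUs x 512 images per GPU gives a total batch size of 4096,
# so a base lr of 1.5e-4 becomes 1.5e-4 * 4096 / 256 = 2.4e-3.
print(scale_lr(1.5e-4, 512, 8))
```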
### Self-supervised Pre-Training
Below are examples of MAE pre-training.
ViT-Small with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, batch-size 4096, GAP.
```sh
sh run_pretrain.sh \
    --config cfgs/pretrain/Vit-S_100E_GAP.yaml \
    --data_path /path/to/train/data
```
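For intuition, below is a minimal sketch of the MAE-style random masking that drives pre-training: a large fraction of patches is dropped and only the visible ones are fed to the encoder. This is our own simplification; the repo's implementation may differ in details such as the default mask ratio.

```python
import torch

# Minimal sketch of MAE random masking (illustrative, not this repo's exact code).
def random_masking(x: torch.Tensor, mask_ratio: float = 0.75):
    """x: (batch, num_patches, dim) patch embeddings."""
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)          # ascending: lowest scores are kept
    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=x.device)    # 0 = visible, 1 = masked
    mask.scatter_(1, ids_keep, 0.0)
    return x_visible, mask
```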
### End-to-End Fine-tuning
ViT-Small with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, 50 epochs, batch-size 4096, GAP.
```sh
sh run_finetune.sh \
    --config cfgs/finetune/ViT-S_50E_GAP.yaml \
    --data_path /path/to/data \
    --finetune /path/to/pretrain/model
```
### Linear Classification
Following the paper, we support two training modes: SGD with batch-size 4096 and LARS with batch-size 16384.
ViT-Small with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, 50 epochs, SGD + batch-size 4096, GAP.
```sh
sh run_lincls.sh \
    --config cfgs/lincls/ViT-S_SGD_GAP.yaml \
    --data_path /path/to/data \
    --finetune /path/to/pretrain/model
```
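As a rough sketch of what linear probing means here, assuming a frozen pre-trained encoder and a trainable linear head (our illustration; the optimizer settings and any LARS implementation used by `run_lincls.sh` may differ):

```python
import torch
import torch.nn as nn

# Minimal sketch of a linear probe: freeze the encoder, train only a linear head.
def build_linear_probe(encoder: nn.Module, feat_dim: int, num_classes: int = 1000):
    for p in encoder.parameters():
        p.requires_grad = False      # the pre-trained backbone stays frozen
    encoder.eval()
    head = nn.Sequential(
        # Extra BatchNorm without affine parameters, as used for linear probing in the MAE paper.
        nn.BatchNorm1d(feat_dim, affine=False),
        nn.Linear(feat_dim, num_classes),
    )
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0)
    return head, optimizer
```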
### k-NN Evaluation of the Pretrained Model
ViT-Small with 1-node (8-GPU, NVIDIA GeForce RTX 3090), GAP.
```sh
sh run_knn.sh \
    --config cfgs/finetune/ViT-S_50E_GAP.yaml \
    --data_path /path/to/data \
    --finetune /path/to/pretrain/model \
    --save_path /path/to/save/result
```
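For reference, a minimal sketch of this kind of k-NN evaluation, assuming features have already been extracted for the train and test sets (our illustration; the exact feature, distance, and voting scheme used by `run_knn.sh` may differ):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of k-NN classification on frozen features (k=10, cosine similarity).
@torch.no_grad()
def knn_classify(train_feats, train_labels, test_feats, k: int = 10, num_classes: int = 1000):
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.t()        # (num_test, num_train) cosine similarities
    _, topk_idx = sims.topk(k, dim=1)
    topk_labels = train_labels[topk_idx]       # (num_test, k) labels of the nearest neighbors
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, torch.ones_like(topk_labels, dtype=votes.dtype))
    return votes.argmax(dim=1)                 # majority vote among the k neighbors
```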
### Visualization of Reconstruction
ViT-Base pretrained for 400 epochs.
```sh
python tools/run_mae_vis.py \
    --config cfgs/pretrain/ViT-B_400E_Norm_GAP.yaml \
    --save_path output/restruct/ \
    --model_path /path/to/pretrain/model \
    --img_path /path/to/image
```
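Below is a minimal sketch of how a reconstruction visualization is typically composed from MAE outputs: predicted patches fill the masked positions while visible patches keep their original pixels (our illustration; `tools/run_mae_vis.py` may compose the image differently):

```python
import torch

# Paste predicted patches into masked positions, keep original pixels elsewhere.
def compose_visualization(orig_patches: torch.Tensor,
                          pred_patches: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """orig_patches, pred_patches: (batch, num_patches, patch_dim); mask is 1 for masked patches."""
    mask = mask.unsqueeze(-1).to(orig_patches.dtype)
    return orig_patches * (1 - mask) + pred_patches * mask
```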
### Visualization of Attention
ViT-Small with cls-token, pretrained for 100 epochs.
```sh
python tools/vit_explain.py \
    --config cfgs/finetune/ViT-S_50E_CLS-Token.yaml \
    --finetune /path/to/pretrain/model \
    --image_path /path/to/image \
    --head_fusion max \
    --discard_ratio 0.9
```
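For context, `--head_fusion` and `--discard_ratio` are attention-rollout parameters: per-layer attention heads are fused, the weakest connections are discarded, and the layer-wise maps are multiplied together. A minimal sketch under those assumptions (our illustration; `tools/vit_explain.py` may differ in detail):

```python
import torch

# Minimal sketch of attention rollout with head fusion and a discard ratio.
def attention_rollout(attentions, head_fusion: str = "max", discard_ratio: float = 0.9):
    """attentions: list of per-layer attention maps, each of shape (heads, tokens, tokens)."""
    result = torch.eye(attentions[0].size(-1))
    for attn in attentions:
        if head_fusion == "max":
            fused = attn.max(dim=0).values
        elif head_fusion == "mean":
            fused = attn.mean(dim=0)
        else:  # "min"
            fused = attn.min(dim=0).values
        fused = fused.clone()
        flat = fused.view(-1)                    # view, so in-place edits affect `fused`
        _, idx = flat.topk(int(flat.numel() * discard_ratio), largest=False)
        flat[idx[idx != 0]] = 0                  # drop the weakest connections, keep element 0
        # Account for residual connections, renormalize rows, and accumulate across layers.
        fused = 0.5 * fused + 0.5 * torch.eye(fused.size(-1))
        fused = fused / fused.sum(dim=-1, keepdim=True)
        result = fused @ result
    return result[0, 1:]  # attention flowing from the class token to each patch token
```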
## License
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
## Citation
If you use the code of this repo, please cite the original paper and this repo:
```
@Article{he2021mae,
  author  = {Kaiming He* and Xinlei Chen* and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  journal = {arXiv preprint arXiv:2111.06377},
  year    = {2021},
}

@misc{yang2021maepriv,
  author       = {Lu Yang* and Pu Cao* and Yang Nie and Qing Song},
  title        = {MAE-priv},
  howpublished = {\url{https://github.com/BUPT-PRIV/MAE-priv}},
  year         = {2021},
}
```