BootMAE (ECCV 2022)

This repo is the official implementation of "Bootstrapped Masked Autoencoders for Vision BERT Pretraining".

Introduction

We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoders (MAE) with two core designs:

  1. a momentum encoder that provides online features as extra BERT prediction targets (see the sketch after this list);
  2. a target-aware decoder that reduces the pressure on the encoder to memorize target-specific information during BERT pretraining.
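The momentum encoder is an exponential moving average (EMA) of the online encoder (cf. --model_ema_decay 0.999 in the pretraining script below). A minimal PyTorch sketch of such an EMA update, with encoder and momentum_encoder as hypothetical stand-ins for the actual BootMAE modules:

import torch

@torch.no_grad()
def update_momentum_encoder(encoder, momentum_encoder, decay=0.999):
    # EMA update: momentum weights drift slowly toward the online weights
    for p, mp in zip(encoder.parameters(), momentum_encoder.parameters()):
        mp.mul_(decay).add_(p, alpha=1.0 - decay)

The momentum encoder's features then serve as the extra prediction target mentioned in point 1.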

[Figure: BootMAE pipeline]

Requirements

The code requires timm==0.3.4, pytorch>=1.7, opencv, ... To install the dependencies, run:

bash setup.sh
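A quick sanity check of the installed environment (checking only the versions listed above):

from packaging import version
import cv2      # provided by opencv-python
import timm
import torch

assert timm.__version__ == "0.3.4"
assert version.parse(torch.__version__) >= version.parse("1.7")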

Results

| Model | Pretrain Epochs | Pretrain Model | Linear acc@1 | Finetune Model | Finetune acc@1 |
| --- | --- | --- | --- | --- | --- |
| ViT-B | 800 | model | 66.1 | model | 84.2 |
| ViT-L | 800 | model | 77.1 | model | 85.9 |
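The finetuning command below passes --model_key model, which suggests the released checkpoints store their weights under a "model" key. A hedged loading sketch (the filename is hypothetical):

import torch

ckpt = torch.load("bootmae_base_pretrain.pth", map_location="cpu")
state_dict = ckpt["model"]     # assumed key, matching --model_key model
print(sorted(state_dict)[:5])  # peek at a few parameter names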

See Segmentation for segmentation results and configs.

Pretraining

The BootMAE-base model can be pretrained on ImageNet-1k with 16 V100-32GB GPUs:

MODEL=bootmae_base   # assumed to match the model name in the finetuning script
OUTPUT_DIR=/path/to/save/your_model
DATA_PATH=/path/to/imagenet

# launcher assumed to mirror the finetuning command below; adjust nproc_per_node / nnodes for your 16-GPU setup
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=16 run_pretraining.py \
    --data_path ${DATA_PATH} \
    --output_dir ${OUTPUT_DIR} \
    --model ${MODEL} \
    --model_ema --model_ema_decay 0.999 --model_ema_dynamic \
    --batch_size 256 --lr 1.5e-4 --min_lr 1e-4 \
    --epochs 801 --warmup_epochs 40 --update_freq 1 \
    --mask_num 147 --feature_weight 1 --weight_mask 
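For reference, --mask_num 147 corresponds to masking 147 of the 196 patches that a ViT with patch size 16 (assumed here) produces from a 224x224 image, i.e. the same 75% mask ratio MAE uses:

num_patches = (224 // 16) ** 2  # 14 x 14 = 196 patches at 224x224 with 16x16 patches
mask_ratio = 147 / num_patches  # 0.75: 147 patches masked, 49 left visible
print(f"{mask_ratio:.2f}")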

See scripts/pretrain for more configs.

Finetuning

To finetune BootMAE-base on ImageNet-1K:

MODEL=bootmae_base
OUTPUT_DIR=/path/to/save/your_model
DATA_PATH=/path/to/imagenet
FINE=/path/to/your_pretrain_model

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
    --model ${MODEL} --data_path $DATA_PATH \
    --input_size 224 \
    --finetune ${FINE} \
    --num_workers 8 \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 256 --lr 5e-3 --update_freq 1 \
    --warmup_epochs 20 --epochs 100 \
    --layer_decay 0.6 --backbone_decay 1 \
    --drop_path 0.1 \
    --abs_pos_emb --disable_rel_pos_bias \
    --weight_decay 0.05 --mixup 0.8 --cutmix 1.0 \
    --nb_classes 1000 --model_key model \
    --enable_deepspeed \
    --model_ema --model_ema_decay 0.9998

See scripts/finetune for more configs.
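The --layer_decay 0.6 flag applies BEiT-style layer-wise learning-rate decay: blocks closer to the input get geometrically smaller learning rates than the head. A minimal sketch of the scaling rule (the exact indexing in the repo may differ):

def layerwise_lr(base_lr, layer_id, num_layers=12, decay=0.6):
    # layer_id 0 = patch embedding, num_layers = the block right below the head
    return base_lr * decay ** (num_layers - layer_id)

for layer_id in (0, 6, 12):
    print(layer_id, layerwise_lr(5e-3, layer_id))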

Linear Probing

To evaluate the linear probing accuracy of BootMAE-base on ImageNet-1K with 8 GPUs:

OUTPUT_DIR=/path/to/save/your_model
DATA_PATH=/path/to/imagenet
FINETUNE=/path/to/your_pretrain_model

LAYER=9

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 \
        main_linprobe.py \
        --batch_size 1024 --accum_iter 2 \
        --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} \
        --model base_patch16_224 --depth ${LAYER} \
        --finetune ${FINETUNE} \
        --global_pool \
        --epochs 90 \
        --blr 0.1 \
        --weight_decay 0.0 \
        --dist_eval 

See scripts/linear for more configs.
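main_linprobe.py is adapted from MAE (see Acknowledgments), where --blr is a base learning rate scaled by the total batch size as lr = blr * total_batch_size / 256; assuming the same convention here:

total_batch = 1024 * 2 * 8    # batch_size * accum_iter * 8 GPUs = 16384
lr = 0.1 * total_batch / 256  # effective lr = 6.4 under MAE's scaling rule
print(lr)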

Acknowledgments

This repository is modified from BEiT and built on the timm library, the DeiT repository, and the DINO repository. The linear probing part is adapted from MAE.

Citation

If you use this code for your research, please cite our paper:

@article{dong2022bootstrapped,
  title={Bootstrapped Masked Autoencoders for Vision BERT Pretraining},
  author={Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
  journal={arXiv preprint arXiv:2207.07116},
  year={2022}
}