<div align="center"> <h2><a href="https://arxiv.org/abs/2205.13943">Architecture-Agnostic Masked Image Modeling - From ViT back to CNN</a></h2>

Siyuan Li<sup>*,1,2</sup>, Di Wu<sup>*,1,2</sup>, Fang Wu<sup>1,3</sup>, Zelin Zang<sup>1,2</sup>, Stan Z. Li<sup>†,1</sup>

<sup>1</sup>Westlake University, <sup>2</sup>Zhejiang University, <sup>3</sup>Tsinghua University

</div> <p align="center"> <a href="https://arxiv.org/abs/2205.13943" alt="arXiv"> <img src="https://img.shields.io/badge/arXiv-2205.13943-b31b1b.svg?style=flat" /></a> <a href="https://github.com/Westlake-AI/A2MIM/blob/main/LICENSE" alt="license"> <img src="https://img.shields.io/badge/license-Apache--2.0-%23B7A800" /></a> </p> <p align="center"> <img src="https://user-images.githubusercontent.com/44519745/234438993-b5a145ab-d345-46ae-9267-25f68379bb62.png" width=100% height=100% class="center"> </p>

Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers (ViT). Its underlying idea is simple: a portion of the input image is randomly masked out and then reconstructed via the pre-text task. However, why MIM works well is not well explained, and previous studies insist that MIM primarily works for the Transformer family but is incompatible with CNNs. In this paper, we first study interactions among patches to understand what knowledge is learned and how it is acquired via the MIM task. We observe that MIM essentially teaches the model to learn better middle-level interactions among patches and extract more generalized features. Based on this observation, we propose an Architecture-Agnostic Masked Image Modeling framework (A2MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A2MIM learns better representations and endows the backbone model with a stronger capability to transfer to various downstream tasks for both Transformers and CNNs.
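The masking-and-reconstruction idea described above can be illustrated with a minimal NumPy sketch. It is not the A2MIM implementation: the function names, the patch size of 32, the mask ratio of 0.6, and the L1 loss restricted to masked pixels are all assumptions chosen for illustration.

```python
import numpy as np


def random_patch_mask(height, width, patch_size=32, mask_ratio=0.6, seed=0):
    """Return a boolean (H/p, W/p) grid; True marks a masked patch.

    patch_size and mask_ratio are illustrative assumptions.
    """
    assert height % patch_size == 0 and width % patch_size == 0
    rng = np.random.default_rng(seed)
    grid_h, grid_w = height // patch_size, width // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    flat = np.zeros(num_patches, dtype=bool)
    flat[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return flat.reshape(grid_h, grid_w)


def to_pixel_mask(patch_mask, patch_size=32):
    """Upsample the patch-level mask to pixel resolution (H, W)."""
    return patch_mask.repeat(patch_size, axis=0).repeat(patch_size, axis=1)


def apply_mask(image, patch_mask, patch_size=32, fill_value=0.0):
    """Replace masked patches of an (H, W, C) image with fill_value."""
    masked = image.copy()
    masked[to_pixel_mask(patch_mask, patch_size)] = fill_value
    return masked


def masked_l1_loss(prediction, target, patch_mask, patch_size=32):
    """Mean absolute error computed over masked pixels only."""
    pixel_mask = to_pixel_mask(patch_mask, patch_size)
    return float(np.abs(prediction - target)[pixel_mask].mean())


image = np.random.default_rng(1).random((224, 224, 3))
mask = random_patch_mask(224, 224)
masked_image = apply_mask(image, mask)
loss = masked_l1_loss(masked_image, image, mask)
```

In a real pre-training loop, `prediction` would be the output of the backbone and decoder rather than the masked input itself.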

<details> <summary>Table of Contents</summary> <ol> <li><a href="#catalog">Catalog</a></li> <li><a href="#license">License</a></li> <li><a href="#acknowledgement">Acknowledgement</a></li> <li><a href="#citation">Citation</a></li> </ol> </details>

Catalog

We have released implementations of A2MIM based on OpenMixup. In the future, we plan to add A2MIM implementations to MMPretrain. Pre-trained and fine-tuned models are released on GitHub / Baidu Cloud.

Pre-training on ImageNet

1. Installation

Please refer to INSTALL.md for installation instructions.

2. Pre-training and fine-tuning

We provide a script for multi-GPU pre-training with a specified CONFIG_FILE:

bash tools/dist_train.sh ${CONFIG_FILE} ${GPUS} [optional arguments]

For example, you can run the script below to pre-train ResNet-50 with A2MIM on ImageNet with 8 GPUs:

PORT=29500 bash tools/dist_train.sh configs/openmixup/pretrain/a2mim/imagenet/r50_l3_sz224_init_8xb256_cos_ep300.py 8
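The config file name appears to encode the main training settings. The sketch below reads them back, assuming an OpenMMLab-style naming convention in which `sz224` is the input size, `8xb256` means 8 GPUs with 256 samples per GPU, and `ep300` is the number of epochs. Both the convention and the helper `parse_config_name` are assumptions for illustration and are not part of OpenMixup.

```python
import os
import re


def parse_config_name(config_path):
    """Extract settings from a config file name (assumed naming convention)."""
    name = os.path.splitext(os.path.basename(config_path))[0]
    info = {}

    # "<gpus>xb<batch_per_gpu>", e.g. 8xb256
    batch = re.search(r"(?:^|_)(\d+)xb(\d+)(?:_|$)", name)
    if batch:
        info["gpus"] = int(batch.group(1))
        info["batch_per_gpu"] = int(batch.group(2))
        info["total_batch"] = info["gpus"] * info["batch_per_gpu"]

    # "sz<input_size>", e.g. sz224
    size = re.search(r"(?:^|_)sz(\d+)(?:_|$)", name)
    if size:
        info["input_size"] = int(size.group(1))

    # "ep<epochs>", e.g. ep300
    epochs = re.search(r"(?:^|_)ep(\d+)(?:_|$)", name)
    if epochs:
        info["epochs"] = int(epochs.group(1))

    return info


pretrain_info = parse_config_name(
    "configs/openmixup/pretrain/a2mim/imagenet/"
    "r50_l3_sz224_init_8xb256_cos_ep300.py"
)
```

Under this reading, the example command trains with a total batch size of 8 × 256 = 2048 for 300 epochs, which matches the `8` passed as `${GPUS}`.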

After pre-training, you can extract the backbone weights and then fine-tune and evaluate the model with the corresponding scripts:

python tools/model_converters/extract_backbone_weights.py work_dirs/openmixup/pretrain/a2mim/imagenet/r50_l3_sz224_init_8xb256_cos_ep300/latest.pth ${PATH_TO_CHECKPOINT}
PORT=29500 bash tools/dist_train_ft_8gpu.sh configs/openmixup/finetune/imagenet/r50_rsb_a3_ft_sz160_4xb512_cos_fp16_ep100.py ${PATH_TO_CHECKPOINT}
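In the commands above, the checkpoint directory mirrors the config path, with `configs/` replaced by `work_dirs/` and the `.py` suffix dropped. The helper below, `default_checkpoint_path`, is a hypothetical convenience function that reproduces this mapping. It assumes the layout seen in the example and does not cover runs that override the work directory.

```python
from pathlib import PurePosixPath


def default_checkpoint_path(config_path, filename="latest.pth"):
    """Map configs/<...>/<name>.py to work_dirs/<...>/<name>/<filename>.

    Assumes the layout used in the example commands above.
    """
    config = PurePosixPath(config_path)
    parts = config.parts
    if not parts or parts[0] != "configs":
        raise ValueError("expected a path starting with 'configs/'")
    return str(PurePosixPath("work_dirs", *parts[1:-1], config.stem, filename))


checkpoint = default_checkpoint_path(
    "configs/openmixup/pretrain/a2mim/imagenet/"
    "r50_l3_sz224_init_8xb256_cos_ep300.py"
)
```

The resulting path is the first argument passed to `extract_backbone_weights.py` in the example.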

3. Implementation Details

Results and Models

We summarize the pre-training (800 or 300 epochs) and fine-tuning (100 or 300 epochs) results of A2MIM and the baselines on ImageNet-1K. Values in the last three columns are fine-tuning Top-1 accuracy (%).

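| Methods | # Params. | Supervision | SimMIM | A2MIM |
|---|:---:|:---:|:---:|:---:|
| Target | (M) | Label | RGB | RGB |
| ViT-S | 48.8 | 79.9 | 81.7 | 82.1 |
| ViT-B | 86.7 | 81.8 | 83.8 | 84.2 |
| ViT-L | 304.6 | 82.6 | 85.6 | 86.1 |
| ResNet-50 | 25.6 | 79.8 | 79.9 | 80.4 |
| ResNet-101 | 44.5 | 81.3 | 81.3 | 81.9 |
| ResNet-152 | 60.2 | 81.8 | 81.9 | 82.5 |
| ResNet-200 | 64.7 | 82.1 | 82.2 | 83.0 |
| ConvNeXt-S | 50.2 | 83.1 | 83.2 | 83.7 |
| ConvNeXt-B | 88.6 | 83.5 | 83.6 | 84.1 |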
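To make the comparison in the table easier to read, the short sketch below recomputes the margin of A2MIM over SimMIM per backbone. The numbers are copied from the SimMIM and A2MIM columns above; the variable names are illustrative only.

```python
# Fine-tuning Top-1 accuracy (%) copied from the summary table:
# backbone -> (SimMIM, A2MIM)
top1 = {
    "ViT-S": (81.7, 82.1),
    "ViT-B": (83.8, 84.2),
    "ViT-L": (85.6, 86.1),
    "ResNet-50": (79.9, 80.4),
    "ResNet-101": (81.3, 81.9),
    "ResNet-152": (81.9, 82.5),
    "ResNet-200": (82.2, 83.0),
    "ConvNeXt-S": (83.2, 83.7),
    "ConvNeXt-B": (83.6, 84.1),
}

# Round to one decimal to avoid floating-point noise in the differences.
gains = {
    name: round(a2mim - simmim, 1) for name, (simmim, a2mim) in top1.items()
}
mean_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name:<12} +{gain:.1f}")
print(f"mean gain    +{mean_gain:.2f}")
```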

Config files, models, logs, and visualization of reconstructions are provided as follows. These files can also be downloaded from a2mim-in1k-weights, OpenMixup-a2mim-in1k-weights or Baidu Cloud: A2MIM (3q5i).

<details open> <summary>ViT-S/B/L on ImageNet-1K.</summary>
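
| Method | Backbone | PT Epoch | FT Top-1 | Pre-training | Fine-tuning | Results |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| SimMIM | ViT-Small | 800 | 81.7 | config \| ckpt \| vis | config | ckpt \| log |
| A2MIM | ViT-Small | 800 | 82.1 | config \| ckpt \| vis | config | ckpt \| log |
| SimMIM | ViT-Base | 800 | 83.8 | config \| ckpt \| vis | config | ckpt \| log |
| A2MIM | ViT-Base | 800 | 84.3 | config \| ckpt \| vis | config | ckpt \| log |
| SimMIM | ViT-Large | 800 | 85.6 | config \| ckpt | config | log |
| A2MIM | ViT-Large | 800 | 86.1 | config \| ckpt \| vis | config | log |
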
</details> <details> <summary>ResNet-50/101/152/200 on ImageNet-1K.</summary>

| Method | Backbone | PT Epoch | FT (A2) Top-1 | Pre-training | Fine-tuning | Results |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| SimMIM | ResNet-50 | 300 | 79.9 | config \| ckpt \| vis | RSB A2 | - |
| A2MIM | ResNet-50 | 100 | 78.8 | config \| ckpt \| vis | RSB A3 | ckpt \| log |
| A2MIM | ResNet-50 | 300 | 80.4 | config \| ckpt \| vis | RSB A2 | ckpt \| log |
| SimMIM | ResNet-101 | 300 | 81.3 | config \| ckpt | RSB A2 | ckpt (A3) \| log (A3) |
| A2MIM | ResNet-101 | 300 | 81.9 | config \| ckpt (300ep) \| ckpt (800ep) | RSB A2 | ckpt (A2) \| log (A2) |
| SimMIM | ResNet-152 | 300 | 81.9 | config \| ckpt | RSB A2 | log (A3) |
| A2MIM | ResNet-152 | 300 | 82.5 | config \| ckpt (300ep) \| ckpt (800ep) | RSB A2 | ckpt (A2) \| log (A2) |
| SimMIM | ResNet-200 | 300 | 82.2 | config \| ckpt \| vis | RSB A2 | ckpt \| log |
| A2MIM | ResNet-200 | 300 | 83.0 | config \| ckpt \| vis | RSB A2 | ckpt \| log |

</details> <details> <summary>ConvNeXt-S/B on ImageNet-1K.</summary>
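
| Method | Backbone | PT Epoch | FT (A2) Top-1 | Pre-training | Fine-tuning | Results |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| SimMIM | ConvNeXt-S | 300 | 83.2 | config \| ckpt \| vis | RSB A2 | - |
| A2MIM | ConvNeXt-S | 300 | 83.7 | config \| ckpt \| vis | RSB A2 | ckpt \| log |
| SimMIM | ConvNeXt-B | 300 | 83.6 | config \| ckpt | RSB A2 | ckpt \| log |
| A2MIM | ConvNeXt-B | 300 | 84.1 | config \| ckpt | RSB A2 | ckpt (A2) \| ckpt (A3) \| log (A2) \| log (A3) |
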
</details>

4. Empirical Studies

Following RepBottleneck, we interpret how masked image modeling works in terms of the representation bottleneck, for both ViTs and CNNs. As shown in Figures 1, 5, A1, and A2 of A2MIM and in the following figures, we visualize the multi-order interaction strengths with representation_bottleneck. Following How ViT works, we also provide an analysis from the frequency perspective in Figures A3 and A4 of A2MIM, based on fourier_analysis.
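As a rough illustration of what a multi-order interaction measures, the sketch below estimates the order-m interaction between two patches i and j by Monte Carlo sampling: the average of f(S ∪ {i, j}) − f(S ∪ {i}) − f(S ∪ {j}) + f(S) over random contexts S of size m drawn from the remaining patches. This is a toy version written for this README, not the representation_bottleneck code. Here `f` is any Python function of a set of visible patch indices, whereas in practice it would be a model output on an image in which only the patches in S are kept.

```python
import random


def delta_f(f, i, j, context):
    """Marginal interaction of patches i and j given the context set."""
    s = frozenset(context)
    return f(s | {i, j}) - f(s | {i}) - f(s | {j}) + f(s)


def multi_order_interaction(f, num_patches, i, j, order,
                            num_samples=200, seed=0):
    """Monte Carlo estimate of the order-`order` interaction between i and j.

    Contexts are sampled uniformly among subsets of size `order` that
    exclude i and j; num_samples and seed are illustrative defaults.
    """
    others = [k for k in range(num_patches) if k not in (i, j)]
    if not 0 <= order <= len(others):
        raise ValueError("order must be between 0 and num_patches - 2")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += delta_f(f, i, j, rng.sample(others, order))
    return total / num_samples


# Toy set functions standing in for a model output.
def additive(visible):
    return float(len(visible))  # no interaction between patches


def pairwise(visible):
    return 1.0 if {0, 1} <= visible else 0.0  # patches 0 and 1 interact


low_order = multi_order_interaction(pairwise, 16, 0, 1, order=1)
mid_order = multi_order_interaction(pairwise, 16, 0, 1, order=7)
no_interaction = multi_order_interaction(additive, 16, 0, 1, order=7)
```

Plotting the absolute value of this quantity against the ratio m / n, averaged over patch pairs and images, gives curves of the kind shown in the interaction-strength figures.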

<p align="center"> <img src="https://github.com/Westlake-AI/A2MIM/assets/44519745/1b5470b3-51f9-4585-9ff2-eeec34cef766" width=100% height=100% class="center"> </p>
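For the frequency perspective, a minimal NumPy sketch of one common measurement is given below: the log amplitude of the 2D Fourier transform of a feature map, and the gap between its highest-frequency and zero-frequency components. It is only a sketch under stated assumptions, not the fourier_analysis script. The function names, the (C, H, W) layout, and the use of a single corner frequency are illustrative choices.

```python
import numpy as np


def log_amplitude_spectrum(feature_map, eps=1e-12):
    """Channel-averaged log amplitude of a (C, H, W) feature map.

    The zero-frequency component is shifted to index (H // 2, W // 2).
    """
    spectrum = np.fft.fft2(feature_map, axes=(-2, -1))
    spectrum = np.fft.fftshift(spectrum, axes=(-2, -1))
    return np.log(np.abs(spectrum) + eps).mean(axis=0)


def high_minus_low_log_amplitude(feature_map):
    """Log amplitude at the highest frequency minus that at zero frequency.

    After fftshift, index (0, 0) holds the largest-magnitude frequency
    along both axes. More negative values mean the map is dominated by
    low-frequency content.
    """
    log_amp = log_amplitude_spectrum(feature_map)
    height, width = log_amp.shape
    return float(log_amp[0, 0] - log_amp[height // 2, width // 2])


# Two synthetic single-channel maps: a constant map (purely low frequency)
# and a checkerboard (purely high frequency).
rows, cols = np.indices((8, 8))
smooth = np.ones((1, 8, 8))
checkerboard = ((-1.0) ** (rows + cols))[None]

smooth_gap = high_minus_low_log_amplitude(smooth)
checker_gap = high_minus_low_log_amplitude(checkerboard)
```

Applied to intermediate features of pre-trained backbones, a measure of this kind indicates how much high-frequency information each layer retains.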

License

This project is released under the Apache 2.0 license.

Acknowledgement

Our implementation is mainly based on the following codebases. We gratefully thank the authors for their wonderful work.

Citation

If you find this repository helpful, please consider citing our paper:

@inproceedings{icml2023a2mim,
  title={Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN},
  author={Li, Siyuan and Wu, Di and Wu, Fang and Zang, Zelin and Li, Stan Z.},
  booktitle={International Conference on Machine Learning},
  year={2023},
}
<p align="right">(<a href="#top">back to top</a>)</p>