MetaFormer Baselines for Vision (TPAMI 2024)

<p align="left"> <a href="https://arxiv.org/abs/2210.13452" alt="arXiv"> <img src="https://img.shields.io/badge/arXiv-2210.13452-b31b1b.svg?style=flat" /></a> <a href="https://colab.research.google.com/drive/1raon_oZRnUBXb9ZYcMY3Au_r-3l4eP1I?usp=sharing" alt="Colab"> <img src="https://colab.research.google.com/assets/colab-badge.svg" /></a> </p>

This is a PyTorch implementation of several MetaFormer baselines, including IdentityFormer, RandFormer, ConvFormer and CAFormer, proposed by our paper "MetaFormer Baselines for Vision".

Figure 1: Performance of MetaFormer baselines and other state-of-the-art models on ImageNet-1K at 224x224 resolution. The architectures of our proposed models are shown in Figure 2. (a) IdentityFormer/RandFormer achieve over 80%/81% accuracy, indicating that MetaFormer has a solid lower bound of performance and works well with arbitrary token mixers. The accuracy of the well-trained ResNet-50 is from "ResNet strikes back". (b) Without novel token mixers, the pure CNN-based ConvFormer outperforms ConvNeXt, while CAFormer sets a new record of 85.5% accuracy on ImageNet-1K at 224x224 resolution under normal supervised training, without external data or distillation.

Figure 2: (a-d) Overall frameworks of IdentityFormer, RandFormer, ConvFormer and CAFormer. Similar to ResNet, the models adopt a hierarchical architecture of 4 stages, and stage $i$ has $L_i$ blocks with feature dimension $D_i$. Each downsampling module is implemented by a single convolution layer. The first downsampling uses a kernel size of 7 and a stride of 4, while the last three use a kernel size of 3 and a stride of 2. (e-h) Architectures of the IdentityFormer, RandFormer, ConvFormer and Transformer blocks, whose token mixers are identity mapping, global random mixing, separable depthwise convolutions, and vanilla self-attention, respectively.

Comparison

News

Models of MetaFormer baselines are now integrated in timm by Fredo Guan and Ross Wightman. Many thanks!
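
Since the models are registered in timm, they can be listed and instantiated directly with the timm API. A minimal sketch (assuming a timm release recent enough to include the MetaFormer baselines, which may be newer than the version pinned below for training):

```python
import timm

# List the MetaFormer baselines provided by the installed timm release.
print(timm.list_models('caformer*'))
print(timm.list_models('convformer*'))

# Create one of them with pretrained ImageNet-1K weights (downloaded on first use).
model = timm.create_model('caformer_s18', pretrained=True)
```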

Requirements

torch>=1.7.0; torchvision>=0.8.0; pyyaml; timm (pip install timm==0.6.11)

Data preparation: ImageNet with the following folder structure; you can extract ImageNet with this script. A quick sanity check of the layout is sketched after the tree.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
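
One way to verify the layout is to point `torchvision.datasets.ImageFolder` at each split; a minimal sketch (the `/path/to/imagenet` path is a placeholder):

```python
from torchvision import datasets

# Each split should expose 1000 class folders (n01440764, ...).
for split in ('train', 'val'):
    ds = datasets.ImageFolder(f'/path/to/imagenet/{split}')
    print(split, len(ds.classes), 'classes,', len(ds), 'images')
```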

MetaFormer baselines

Models with common token mixers trained on ImageNet-1K

| Model | Resolution | Params | MACs | Top-1 Acc | Download |
| :--- | :---: | :---: | :---: | :---: | :---: |
| caformer_s18 | 224 | 26M | 4.1G | 83.6 | here |
| caformer_s18_384 | 384 | 26M | 13.4G | 85.0 | here |
| caformer_s36 | 224 | 39M | 8.0G | 84.5 | here |
| caformer_s36_384 | 384 | 39M | 26.0G | 85.7 | here |
| caformer_m36 | 224 | 56M | 13.2G | 85.2 | here |
| caformer_m36_384 | 384 | 56M | 42.0G | 86.2 | here |
| caformer_b36 | 224 | 99M | 23.2G | 85.5* | here |
| caformer_b36_384 | 384 | 99M | 72.2G | 86.4 | here |
| convformer_s18 | 224 | 27M | 3.9G | 83.0 | here |
| convformer_s18_384 | 384 | 27M | 11.6G | 84.4 | here |
| convformer_s36 | 224 | 40M | 7.6G | 84.1 | here |
| convformer_s36_384 | 384 | 40M | 22.4G | 85.4 | here |
| convformer_m36 | 224 | 57M | 12.8G | 84.5 | here |
| convformer_m36_384 | 384 | 57M | 37.7G | 85.6 | here |
| convformer_b36 | 224 | 100M | 22.6G | 84.8 | here |
| convformer_b36_384 | 384 | 100M | 66.5G | 85.7 | here |

:astonished: :astonished: * To the best of our knowledge, the model sets a new record on ImageNet-1K with the accuracy of 85.5% at 224x224 resolution under normal supervised setting (without external data or distillation).

Models with common token mixers pretrained on ImageNet-21K and finetuned on ImageNet-1K

| Model | Resolution | Params | MACs | Top-1 Acc | Download |
| :--- | :---: | :---: | :---: | :---: | :---: |
| caformer_s18_in21ft1k | 224 | 26M | 4.1G | 84.1 | here |
| caformer_s18_384_in21ft1k | 384 | 26M | 13.4G | 85.4 | here |
| caformer_s36_in21ft1k | 224 | 39M | 8.0G | 85.8 | here |
| caformer_s36_384_in21ft1k | 384 | 39M | 26.0G | 86.9 | here |
| caformer_m36_in21ft1k | 224 | 56M | 13.2G | 86.6 | here |
| caformer_m36_384_in21ft1k | 384 | 56M | 42.0G | 87.5 | here |
| caformer_b36_in21ft1k | 224 | 99M | 23.2G | 87.4 | here |
| caformer_b36_384_in21ft1k | 384 | 99M | 72.2G | 88.1 | here |
| convformer_s18_in21ft1k | 224 | 27M | 3.9G | 83.7 | here |
| convformer_s18_384_in21ft1k | 384 | 27M | 11.6G | 85.0 | here |
| convformer_s36_in21ft1k | 224 | 40M | 7.6G | 85.4 | here |
| convformer_s36_384_in21ft1k | 384 | 40M | 22.4G | 86.4 | here |
| convformer_m36_in21ft1k | 224 | 57M | 12.8G | 86.1 | here |
| convformer_m36_384_in21ft1k | 384 | 57M | 37.7G | 86.9 | here |
| convformer_b36_in21ft1k | 224 | 100M | 22.6G | 87.0 | here |
| convformer_b36_384_in21ft1k | 384 | 100M | 66.5G | 87.6 | here |

Models with common token mixers pretrained on ImageNet-21K

| Model | Resolution | Download |
| :--- | :---: | :---: |
| caformer_s18_in21k | 224 | here |
| caformer_s36_in21k | 224 | here |
| caformer_m36_in21k | 224 | here |
| caformer_b36_in21k | 224 | here |
| convformer_s18_in21k | 224 | here |
| convformer_s36_in21k | 224 | here |
| convformer_m36_in21k | 224 | here |
| convformer_b36_in21k | 224 | here |

Models with basic token mixers trained on ImageNet-1K

| Model | Resolution | Params | MACs | Top-1 Acc | Download |
| :--- | :---: | :---: | :---: | :---: | :---: |
| identityformer_s12 | 224 | 11.9M | 1.8G | 74.6 | here |
| identityformer_s24 | 224 | 21.3M | 3.4G | 78.2 | here |
| identityformer_s36 | 224 | 30.8M | 5.0G | 79.3 | here |
| identityformer_m36 | 224 | 56.1M | 8.8G | 80.0 | here |
| identityformer_m48 | 224 | 73.3M | 11.5G | 80.4 | here |
| randformer_s12 | 224 | 11.9 + <ins>0.2</ins>M | 1.9G | 76.6 | here |
| randformer_s24 | 224 | 21.3 + <ins>0.5</ins>M | 3.5G | 78.2 | here |
| randformer_s36 | 224 | 30.8 + <ins>0.7</ins>M | 5.2G | 79.5 | here |
| randformer_m36 | 224 | 56.1 + <ins>0.7</ins>M | 9.0G | 81.2 | here |
| randformer_m48 | 224 | 73.3 + <ins>0.9</ins>M | 11.9G | 81.4 | here |
| poolformerv2_s12 | 224 | 11.9M | 1.8G | 78.0 | here |
| poolformerv2_s24 | 224 | 21.3M | 3.4G | 80.7 | here |
| poolformerv2_s36 | 224 | 30.8M | 5.0G | 81.6 | here |
| poolformerv2_m36 | 224 | 56.1M | 8.8G | 82.2 | here |
| poolformerv2_m48 | 224 | 73.3M | 11.5G | 82.6 | here |

The underlined numbers denote the numbers of parameters that are frozen after random initialization.
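
For context, RandFormer's token-mixing matrices are randomly initialized and then kept fixed throughout training. Below is a minimal sketch of what "frozen after random initialization" means in PyTorch; it is illustrative only and not guaranteed to match the repository's exact implementation:

```python
import torch
import torch.nn as nn

class RandomMixing(nn.Module):
    """Token mixer whose mixing matrix over N tokens is random and never trained."""
    def __init__(self, num_tokens=196):
        super().__init__()
        # Randomly initialized, then frozen: requires_grad=False means these
        # parameters never receive gradient updates (the underlined counts above).
        self.random_matrix = nn.Parameter(
            torch.softmax(torch.rand(num_tokens, num_tokens), dim=-1),
            requires_grad=False)

    def forward(self, x):                  # x: [B, H, W, C] with H * W == num_tokens
        B, H, W, C = x.shape
        x = x.reshape(B, H * W, C)
        x = torch.einsum('mn,bnc->bmc', self.random_matrix, x)
        return x.reshape(B, H, W, C)
```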

The checkpoints can also be found in Baidu Disk.

Usage

We also provide a Colab notebook that walks through the steps to perform inference with MetaFormer baselines: Colab
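
The notebook boils down to standard timm inference. A minimal sketch (the image path and the use of `caformer_s18` are placeholders, and a timm release that includes the MetaFormer baselines is assumed):

```python
import timm
import torch
from PIL import Image

model = timm.create_model('caformer_s18', pretrained=True).eval()

# Build the preprocessing pipeline that matches the model's pretrained config.
config = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**config)

img = Image.open('example.jpg').convert('RGB')   # placeholder image path
with torch.no_grad():
    probs = model(transform(img).unsqueeze(0)).softmax(dim=-1)
print(probs.topk(5))
```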

Validation

To evaluate our CAFormer-S18 models, run:

MODEL=caformer_s18
python3 validate.py /path/to/imagenet  --model $MODEL -b 128 \
  --checkpoint /path/to/checkpoint 

Train

We use a batch size of 4096 by default and show how to train models with 8 GPUs. For multi-node training, adjust --grad-accum-steps according to your situation.
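
For example, with the default settings below each GPU processes 4096 / (8 GPUs × 4 gradient-accumulation steps) = 128 images per forward pass, and gradients are accumulated over 4 such passes before each optimizer update, preserving the effective batch size of 4096.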

DATA_PATH=/path/to/imagenet
CODE_PATH=/path/to/code/metaformer # modify code path here


ALL_BATCH_SIZE=4096
NUM_GPU=8
GRAD_ACCUM_STEPS=4 # Adjust according to your GPU numbers and memory size.
let BATCH_SIZE=ALL_BATCH_SIZE/NUM_GPU/GRAD_ACCUM_STEPS


cd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \
--model convformer_s18 --opt adamw --lr 4e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path 0.2 --head-dropout 0.0

Training (fine-tuning) scripts of other models are shown in scripts.

Acknowledgment

Weihao Yu would like to thank TRC program and GCP research credits for the support of partial computational resources. Our implementation is based on the wonderful pytorch-image-models codebase.

Bibtex

@article{yu2024metaformer,
  author={Yu, Weihao and Si, Chenyang and Zhou, Pan and Luo, Mi and Zhou, Yichen and Feng, Jiashi and Yan, Shuicheng and Wang, Xinchao},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={MetaFormer Baselines for Vision}, 
  year={2024},
  volume={46},
  number={2},
  pages={896-912},
  doi={10.1109/TPAMI.2023.3329173}
}