Demystify Transformers & Convolutions in Modern Image Deep Networks

This repository provides the codebase for our paper "Demystify Transformers & Convolutions in Modern Image Deep Networks". In the paper, we develop a unified architecture for different spatial feature aggregation paradigms and present comparisons and analyses of these "spatial token mixers" (STMs).

Figure 1

The Purpose of This Project

Recently, a series of transformer-based vision backbones with novel spatial feature aggregation paradigms (spatial token mixers, STMs) have been proposed and report remarkable performance. However, network engineering techniques can also improve performance significantly, and some works argue that a simple STM can attain competitive results with a proper overall design. Hence, we aim to identify the real differences and performance gains among STMs under a unified and optimized overall design (architecture and training recipe). To this end, we elaborate a unified architecture into which a series of STMs are fitted for comparison and analysis.

Currently Supported STMs

- Local (halo) attention: U-HaloNet
- Spatial reduction (global) attention: U-PVT
- Shifted-window attention: U-Swin Transformer
- Depth-wise convolution: U-ConvNeXt
- Deformable convolution v3: U-InternImage

Updates

Highlights

Usage

Requirements and Data Preparation

Installation

We suggest a Python 3.8 environment with the following packages:
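The exact package list is not reproduced here; as a rough sketch, assuming a typical PyTorch/timm-based setup (the package names and versions below are assumptions — check the repository's requirements file for the authoritative list), the environment might be created as:

```shell
# Hypothetical setup sketch; verify package versions against the repo.
conda create -n stm python=3.8 -y
conda activate stm
pip install torch torchvision timm
```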

If you want to use U-InternImage, please compile and install the MultiScaleDeformableAttention operator:

cd ./classification/ops
sh make.sh

Data Preparation

Prepare ImageNet with the following folder structure; you can extract ImageNet into this layout with this script.

│imagenet/
├──meta/
|  ├──train.txt
│  ├──val.txt
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
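As a quick sanity check before training, a small helper (hypothetical, not part of this repository) can verify that the expected directories and metadata files exist:

```python
from pathlib import Path
from typing import List

def check_imagenet_layout(root: str) -> List[str]:
    """Return the expected ImageNet sub-paths that are missing under `root`."""
    expected = ["meta/train.txt", "meta/val.txt", "train", "val"]
    root_path = Path(root)
    return [p for p in expected if not (root_path / p).exists()]

# Example: report anything missing before launching training.
missing = check_imagenet_layout("/path/to/imagenet")
if missing:
    print("Missing:", missing)
```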

Using Pretrained Models

You can refer to the models folder to create any stand-alone model, and then load it with the pre-trained weights. All the pre-trained weights can be found here.
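A loading sketch is given below. The constructor import and checkpoint filename are placeholders (see the models folder for the actual names); the key-remapping helper is a common pattern for checkpoints saved under DistributedDataParallel, not something this repository necessarily requires:

```python
from typing import Dict

def strip_prefix(state_dict: Dict[str, object], prefix: str = "module.") -> Dict[str, object]:
    """Remove a DDP-style key prefix from a checkpoint state dict, if present."""
    return {k[len(prefix):] if k.startswith(prefix) else k: v
            for k, v in state_dict.items()}

# Sketch of loading a pre-trained model (names are illustrative):
# import torch
# from models import unified_swin_tiny   # hypothetical import path
# model = unified_swin_tiny()
# ckpt = torch.load("unified_swin_tiny.pth", map_location="cpu")
# model.load_state_dict(strip_prefix(ckpt.get("model", ckpt)))
```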

All models

| Scale | U-HaloNet | U-PVT | U-Swin Transformer | U-ConvNeXt | U-InternImage |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Micro (~4.5M) | 75.8% \| Download | 72.8% \| Download | 74.4% \| Download | 75.1% \| Download | 75.3% \| Download |
| Tiny (~30M) | 83.0% \| Download | 82.1% \| Download | 82.3% \| Download | 82.2% \| Download | 83.3% \| Download |
| Small (~50M) | 84.0% \| Download | 83.2% \| Download | 83.3% \| Download | 83.1% \| Download | 84.1% \| Download |
| Base (~100M) | 84.6% \| Download | 83.4% \| Download | 83.7% \| Download | 83.7% \| Download | 84.5% \| Download |

Evaluation of Classification Models

You can use the shell scripts in shell/eval to evaluate the models. The provided scripts work with Slurm; if you run on a Slurm cluster, please modify the virtual partition and checkpoint path accordingly. For example, to evaluate U-HaloNet-Tiny on ImageNet-1k, use the following command:

cd ./classification
sh ./shell/eval/eval.sh $MODEL_NAME$

The $MODEL_NAME$ values for the different models are listed below:

| Scale | U-HaloNet | U-PVT | U-Swin Transformer | U-ConvNeXt | U-InternImage |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Micro (~4.5M) | unified_halo_micro | unified_pvt_micro | unified_swin_micro | unified_convnext_micro | unified_dcn_v3_micro |
| Tiny (~30M) | unified_halo_tiny | unified_pvt_tiny | unified_swin_tiny | unified_convnext_tiny | unified_dcn_v3_tiny |
| Small (~50M) | unified_halo_small | unified_pvt_small | unified_swin_small | unified_convnext_small | unified_dcn_v3_small |
| Base (~100M) | unified_halo_base | unified_pvt_base | unified_swin_base | unified_convnext_base | unified_dcn_v3_base |

Training Classification Models

Currently, this repository only supports ImageNet-1k training; ImageNet-21k training will be added soon. You can use the shell scripts in shell/1k_pretrain to reproduce our results. For example, to train U-HaloNet-Tiny, use the following command:

cd ./classification
sh ./shell/1k_pretrain/transformer.sh $MODEL_NAME$

Remember to modify the output directory and the virtual partition. The scripts also work with Slurm; alternatively, you can launch training with PyTorch's official DDP mechanism after some modifications (refer to ConvNeXt for details).
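For non-Slurm environments, a single-node DDP launch might look like the sketch below. The script name and flags are illustrative assumptions, not the repository's actual interface; adapt them to the training entry point used by the shell scripts:

```shell
# Hypothetical single-node launch; script name and flags are illustrative.
cd ./classification
torchrun --nproc_per_node=8 main.py \
    --model unified_halo_tiny \
    --data-path /path/to/imagenet \
    --output /path/to/output
```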

Training and Evaluation on Object Detection

Please refer to this guide to train and evaluate the models on object detection.

Bibtex

If you find our work or models useful, please consider citing our paper as follows:

@article{hu2022demystify,
  title={Demystify Transformers \& Convolutions in Modern Image Deep Networks},
  author={Hu, Xiaowei and Shi, Min and Wang, Weiyun and Wu, Sitong and Xing, Linjie and Wang, Wenhai and Zhu, Xizhou and Lu, Lewei and Zhou, Jie and Wang, Xiaogang and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2211.05781},
  year={2022}
}

Acknowledgment