Scale-Aware Modulation Meet Transformer

This repo is the official implementation of "Scale-Aware Modulation Meet Transformer".

Introduction

SMT is a promising new generic backbone for efficient visual modeling. It is a hybrid ConvNet and vision Transformer backbone that can effectively simulate the transition from local to global dependencies as the network goes deeper, resulting in superior performance over both ConvNets and Transformers.

Main Results on ImageNet with Pretrained Models

ImageNet-1K and ImageNet-22K Pretrained SMT Models

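| name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | 22K model | 1K model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SMT-T | ImageNet-1K | 224x224 | 82.2 | 96.0 | 12M | 2.4G | - | github/config/ |
| SMT-S | ImageNet-1K | 224x224 | 83.7 | 96.5 | 21M | 4.7G | - | github/config |
| SMT-B | ImageNet-1K | 224x224 | 84.3 | 96.9 | 32M | 7.7G | - | github/config |
| SMT-L | ImageNet-22K | 224x224 | 87.1 | 98.1 | 81M | 17.6G | github/config | github/config |
| SMT-L | ImageNet-22K | 384x384 | 88.1 | 98.4 | 81M | 51.6G | github/config | github/config |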

Main Results on Downstream Tasks

COCO Object Detection (2017 val)

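| Backbone | Method | pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SMT-S | Mask R-CNN | ImageNet-1K | 3x | 49.0 | 43.4 | 40M | 265G |
| SMT-B | Mask R-CNN | ImageNet-1K | 3x | 49.8 | 44.0 | 52M | 328G |
| SMT-S | Cascade Mask R-CNN | ImageNet-1K | 3x | 51.9 | 44.7 | 78M | 744G |
| SMT-S | RetinaNet | ImageNet-1K | 3x | 47.3 | - | 30M | 247G |
| SMT-S | Sparse R-CNN | ImageNet-1K | 3x | 50.2 | - | 102M | 171G |
| SMT-S | ATSS | ImageNet-1K | 3x | 49.9 | - | 28M | 214G |
| SMT-S | DINO | ImageNet-1K | 4scale | 54.0 | - | 40M | 309G |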

ADE20K Semantic Segmentation (val)

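| Backbone | Method | pretrain | Crop Size | Lr Schd | mIoU (ss) | mIoU (ms) | #params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SMT-S | UperNet | ImageNet-1K | 512x512 | 160K | 49.2 | 50.2 | 50M | 935G |
| SMT-B | UperNet | ImageNet-1K | 512x512 | 160K | 49.6 | 50.6 | 62M | 1004G |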

Getting Started

git clone https://github.com/Afeng-x/SMT.git
cd SMT
conda create -n smt python=3.8 -y
conda activate smt

Install PyTorch>=1.10.0 with CUDA>=10.2:

pip3 install torch==1.10 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu113
pip install timm==0.4.12
pip install opencv-python==4.4.0.46 termcolor==1.1.0 yacs==0.1.8 pyyaml scipy ptflops thop
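
After installation, it can help to confirm that the pinned packages actually resolved to the expected versions. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `check_environment` are hypothetical, the pins are copied from the commands above, and `torch` is left unpinned because CUDA wheels typically carry a local version suffix.

```python
from importlib import metadata

# Pins copied from the install commands above; None means "any version".
EXPECTED = {
    "torch": None,
    "timm": "0.4.12",
    "opencv-python": "4.4.0.46",
    "termcolor": "1.1.0",
    "yacs": "0.1.8",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED):
    """Map each package to 'ok', 'missing', or a mismatch description."""
    report = {}
    for package, pinned in expected.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif pinned is not None and found != pinned:
            report[package] = f"found {found}, expected {pinned}"
        else:
            report[package] = "ok"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

Run it inside the activated `smt` environment; any line other than `ok` points at a package to reinstall.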

Evaluation

To evaluate a pre-trained SMT on ImageNet val, run:

python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval \
--cfg configs/smt/smt_base_224.yaml --resume /path/to/ckpt.pth \
--data-path /path/to/imagenet-1k
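
The evaluation reports top-1 and top-5 accuracy (the acc@1 and acc@5 columns in the ImageNet table). As an illustration of what those numbers mean, here is a small pure-Python sketch of the top-k metric. It is not the repository's evaluation code; the function name `topk_accuracy` and the toy scores are made up for this example.

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return 100.0 * hits / len(labels)


# Toy example: 4 samples, 6 classes (values are made up for illustration).
scores = [
    [0.1, 0.7, 0.05, 0.05, 0.05, 0.05],  # true class 1: top-1 hit
    [0.3, 0.2, 0.25, 0.1, 0.1, 0.05],    # true class 2: top-5 hit only
    [0.05, 0.1, 0.15, 0.2, 0.2, 0.3],    # true class 0: miss even at k=5
    [0.0, 0.0, 0.0, 0.9, 0.1, 0.0],      # true class 3: top-1 hit
]
labels = [1, 2, 0, 3]

print(topk_accuracy(scores, labels, k=1))  # 50.0
print(topk_accuracy(scores, labels, k=5))  # 75.0
```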

Training from scratch on ImageNet-1K

To train an SMT model on ImageNet-1K from scratch, run:

python -m torch.distributed.launch --master_port 4444 --nproc_per_node 8 main.py \
--cfg configs/smt/smt_tiny_224.yaml \
--data-path /path/to/imagenet-1k --batch-size 128

Pre-training on ImageNet-22K

For example, to pre-train an SMT-Large model on ImageNet-22K:

python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345  main.py \
--cfg configs/smt/smt_large_224_22k.yaml --data-path /path/to/imagenet-22k \
--batch-size 128 --accumulation-steps 4 
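
When adapting these commands to a different number of GPUs, it is useful to keep track of the effective batch size per optimizer step. The sketch below works through the arithmetic under one assumption that should be checked against `main.py`: that `--batch-size` is the per-process (per-GPU) batch size, as in the Swin Transformer code base this repository builds on. The function name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(num_gpus, per_gpu_batch, accumulation_steps=1):
    """Samples contributing to each optimizer step across all processes.

    Assumes `per_gpu_batch` is the per-process batch size (an assumption
    about how --batch-size is interpreted, not confirmed by this README).
    """
    return num_gpus * per_gpu_batch * accumulation_steps


# ImageNet-22K pre-training: 8 processes, --batch-size 128, --accumulation-steps 4
print(effective_batch_size(8, 128, 4))  # 4096
# ImageNet-1K from scratch: 8 processes, --batch-size 128, no accumulation
print(effective_batch_size(8, 128))     # 1024
```

Under that assumption, halving the number of GPUs while doubling `--accumulation-steps` keeps the effective batch size unchanged.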

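Fine-tuning

For example, to fine-tune an ImageNet-22K pre-trained SMT-Large model on ImageNet-1K at 384x384 resolution:
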
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345  main.py \
--cfg configs/smt/smt_large_384_22kto1k_finetune.yaml \
--pretrained /path/to/pretrain_ckpt.pth --data-path /path/to/imagenet-1k \
--batch-size 64 [--use-checkpoint]

Throughput

To measure the throughput, run:

python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345  main.py \
--cfg <config-file> --data-path <imagenet-path> --batch-size 64 --throughput --disable_amp
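
For readers who want to see what a throughput number represents, the following is a generic timing sketch, not the repository's `--throughput` implementation. The names `measure_throughput` and `fake_forward` are hypothetical, and the sleep call stands in for a real forward pass. With a PyTorch model on a GPU, you would also call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels run asynchronously.

```python
import time


def measure_throughput(step, batch_size, warmup=10, iters=30):
    """Return samples/second for `step`, a zero-argument callable that processes one batch."""
    for _ in range(warmup):
        step()  # warm-up runs are excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        step()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


# Stand-in for a forward pass (hypothetical workload, not the SMT model).
def fake_forward():
    time.sleep(0.002)


print(f"{measure_throughput(fake_forward, batch_size=64):.0f} images/s")
```

Warm-up iterations matter in practice because the first few batches often include one-off costs such as memory allocation and kernel selection.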

Citation

@misc{lin2023scaleaware,
      title={Scale-Aware Modulation Meet Transformer}, 
      author={Weifeng Lin and Ziheng Wu and Jiayu Chen and Jun Huang and Lianwen Jin},
      year={2023},
      eprint={2307.08579},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

This repository is built on top of the timm library and the official Swin Transformer repository. For object detection, we use mmdetection and adopt the pipeline configuration from Swin-Transformer-Object-Detection. We also use detrex to implement the DINO method. For semantic segmentation, we use mmsegmentation and follow the pipeline setup from Swin-Transformer-Semantic-Segmentation.