# Scale-Aware Modulation Meet Transformer
This repo is the official implementation of ["Scale-Aware Modulation Meet Transformer"](https://arxiv.org/abs/2307.08579) (ICCV 2023).
## 📣 Announcement
- **18 Jul, 2023**: The paper is available on [arXiv](https://arxiv.org/abs/2307.08579).
- **16 Jul, 2023**: The detection and segmentation code is now open source and available!
- **14 Jul, 2023**: SMT is accepted to ICCV 2023!
## Introduction
SMT serves as a promising new generic backbone for efficient visual modeling. It is a hybrid ConvNet and vision Transformer backbone that effectively models the transition from local to global dependencies as the network goes deeper, resulting in superior performance over both ConvNets and Transformers.
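At the heart of this design is scale-aware modulation: depthwise convolutions with different kernel sizes extract context at multiple scales, and the aggregated context modulates a value projection elementwise. The sketch below illustrates only this core idea; the layer names, kernel sizes, and aggregation scheme are simplified stand-ins, and the actual block lives in `models/smt.py`.

```python
import torch
import torch.nn as nn

class MultiScaleModulation(nn.Module):
    """Conceptual sketch of scale-aware modulation (illustrative, not the repo's block):
    per-head depthwise convs with growing kernel sizes capture local-to-global context,
    and the aggregated context modulates a value projection by elementwise product."""
    def __init__(self, dim, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        assert dim % len(kernel_sizes) == 0
        head_dim = dim // len(kernel_sizes)
        # one depthwise conv per head, each with a different receptive field
        self.dw_convs = nn.ModuleList([
            nn.Conv2d(head_dim, head_dim, k, padding=k // 2, groups=head_dim)
            for k in kernel_sizes
        ])
        self.v = nn.Conv2d(dim, dim, 1)     # value projection
        self.agg = nn.Conv2d(dim, dim, 1)   # aggregate the multi-scale heads
        self.proj = nn.Conv2d(dim, dim, 1)  # output projection

    def forward(self, x):                   # x: (B, C, H, W)
        heads = torch.chunk(x, len(self.dw_convs), dim=1)
        ctx = torch.cat([conv(h) for conv, h in zip(self.dw_convs, heads)], dim=1)
        return self.proj(self.agg(ctx) * self.v(x))  # modulation: context * value

# quick shape check
y = MultiScaleModulation(dim=64)(torch.randn(2, 64, 56, 56))  # -> (2, 64, 56, 56)
```

Stacking such convolution-modulated blocks in the early stages and attention blocks in the later stages is what produces the local-to-global transition described above; see the paper for the actual block design.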
## Main Results on ImageNet with Pretrained Models
### ImageNet-1K and ImageNet-22K Pretrained SMT Models
name  | pretrain     | resolution | acc@1 | acc@5 | #params | FLOPs | 22K model     | 1K model      |
------|--------------|------------|-------|-------|---------|-------|---------------|---------------|
SMT-T | ImageNet-1K  | 224x224    | 82.2  | 96.0  | 12M     | 2.4G  | -             | github/config |
SMT-S | ImageNet-1K  | 224x224    | 83.7  | 96.5  | 21M     | 4.7G  | -             | github/config |
SMT-B | ImageNet-1K  | 224x224    | 84.3  | 96.9  | 32M     | 7.7G  | -             | github/config |
SMT-L | ImageNet-22K | 224x224    | 87.1  | 98.1  | 81M     | 17.6G | github/config | github/config |
SMT-L | ImageNet-22K | 384x384    | 88.1  | 98.4  | 81M     | 51.6G | github/config | github/config |
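To use one of these checkpoints outside the training scripts, loading the weights manually is straightforward. A minimal sketch, assuming a hypothetical `smt_b` constructor (check `models/smt.py` for the real entry points) and a locally downloaded checkpoint file:

```python
import torch
from models.smt import smt_b  # hypothetical constructor name; see models/smt.py for the real ones

model = smt_b()
ckpt = torch.load("smt_base_224.pth", map_location="cpu")  # illustrative path to a downloaded checkpoint
# Swin-style training checkpoints typically nest the weights under a "model" key
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
model.load_state_dict(state_dict)
model.eval()
```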
## Main Results on Downstream Tasks
### COCO Object Detection (2017 val)
Backbone | Method             | pretrain    | Lr Schd | box mAP | mask mAP | #params | FLOPs |
---------|--------------------|-------------|---------|---------|----------|---------|-------|
SMT-S    | Mask R-CNN         | ImageNet-1K | 3x      | 49.0    | 43.4     | 40M     | 265G  |
SMT-B    | Mask R-CNN         | ImageNet-1K | 3x      | 49.8    | 44.0     | 52M     | 328G  |
SMT-S    | Cascade Mask R-CNN | ImageNet-1K | 3x      | 51.9    | 44.7     | 78M     | 744G  |
SMT-S    | RetinaNet          | ImageNet-1K | 3x      | 47.3    | -        | 30M     | 247G  |
SMT-S    | Sparse R-CNN       | ImageNet-1K | 3x      | 50.2    | -        | 102M    | 171G  |
SMT-S    | ATSS               | ImageNet-1K | 3x      | 49.9    | -        | 28M     | 214G  |
SMT-S    | DINO               | ImageNet-1K | 4scale  | 54.0    | -        | 40M     | 309G  |
### ADE20K Semantic Segmentation (val)
Backbone | Method  | pretrain    | Crop Size | Lr Schd | mIoU (ss) | mIoU (ms) | #params | FLOPs |
---------|---------|-------------|-----------|---------|-----------|-----------|---------|-------|
SMT-S    | UperNet | ImageNet-1K | 512x512   | 160K    | 49.2      | 50.2      | 50M     | 935G  |
SMT-B    | UperNet | ImageNet-1K | 512x512   | 160K    | 49.6      | 50.6      | 62M     | 1004G |
## Getting Started
- Clone this repo:

```bash
git clone https://github.com/Afeng-x/SMT.git
cd SMT
```
- Create a conda virtual environment and activate it:

```bash
conda create -n smt python=3.8 -y
conda activate smt
```
- Install `PyTorch>=1.10.0` with `CUDA>=10.2`:

```bash
pip3 install torch==1.10 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu113
```
- Install `timm==0.4.12`:

```bash
pip install timm==0.4.12
```
- Install other requirements:

```bash
pip install opencv-python==4.4.0.46 termcolor==1.1.0 yacs==0.1.8 pyyaml scipy ptflops thop
```
## Evaluation
To evaluate a pre-trained `SMT` on the ImageNet-1K validation set, run:

```bash
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval \
--cfg configs/smt/smt_base_224.yaml --resume /path/to/ckpt.pth \
--data-path /path/to/imagenet-1k
```
## Training from scratch on ImageNet-1K
To train an `SMT` on ImageNet-1K from scratch, run:

```bash
python -m torch.distributed.launch --master_port 4444 --nproc_per_node 8 main.py \
--cfg configs/smt/smt_tiny_224.yaml \
--data-path /path/to/imagenet-1k --batch-size 128
```
## Pre-training on ImageNet-22K
For example, to pre-train an `SMT-Large` model on ImageNet-22K:

```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py \
--cfg configs/smt/smt_large_224_22k.yaml --data-path /path/to/imagenet-22k \
--batch-size 128 --accumulation-steps 4
```

With `--accumulation-steps 4`, gradients are accumulated over 4 steps, so the effective global batch size is 8 GPUs × 128 × 4 = 4096.
## Fine-tuning

For example, to fine-tune an ImageNet-22K pre-trained `SMT-Large` model on ImageNet-1K at 384x384 resolution (the optional `--use-checkpoint` flag enables gradient checkpointing to save GPU memory):
```bash
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py \
--cfg configs/smt/smt_large_384_22kto1k_finetune.yaml \
--pretrained /path/to/pretrain_ckpt.pth --data-path /path/to/imagenet-1k \
--batch-size 64 [--use-checkpoint]
```
## Throughput
To measure the throughput, run:
```bash
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py \
--cfg <config-file> --data-path <imagenet-path> --batch-size 64 --throughput --disable_amp
```
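For a rough single-GPU estimate without the full data pipeline, a minimal timing loop along the following lines also works; the batch size, image size, and iteration counts here are illustrative, and a CUDA device is assumed:

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=64, img_size=224, iters=30):
    """Rough images/second for any nn.Module on a single CUDA device."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    for _ in range(10):           # warm-up iterations
        model(x)
    torch.cuda.synchronize()      # make sure all queued kernels have finished
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters * batch_size / (time.time() - start)
```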
## Citation
```bibtex
@misc{lin2023scaleaware,
    title={Scale-Aware Modulation Meet Transformer},
    author={Weifeng Lin and Ziheng Wu and Jiayu Chen and Jun Huang and Lianwen Jin},
    year={2023},
    eprint={2307.08579},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```
## Acknowledgement
This repository is built on top of the timm library and the official Swin Transformer repository. For object detection, we use mmdetection and adopt the pipeline configuration from Swin-Transformer-Object-Detection; we also incorporate detrex for the DINO method. For semantic segmentation, we employ mmsegmentation and follow the pipeline setup of Swin-Transformer-Semantic-Segmentation.