SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

Abdelrahman Shaker<sup>*1</sup>, Muhammad Maaz<sup>1</sup>, Hanoona Rasheed<sup>1</sup>, Salman Khan<sup>1</sup>, Ming-Hsuan Yang<sup>2,3</sup> and Fahad Shahbaz Khan<sup>1,4</sup>

Mohamed Bin Zayed University of Artificial Intelligence<sup>1</sup>, University of California Merced<sup>2</sup>, Google Research<sup>3</sup>, Linköping University<sup>4</sup>

<!-- [![Website](https://img.shields.io/badge/Project-Website-87CEEB)](site_url) -->

paper

<!-- [![video](https://img.shields.io/badge/Video-Presentation-F9D371)](youtube_link) --> <!-- [![slides](https://img.shields.io/badge/Presentation-Slides-B762C1)](presentation) -->

:rocket: News

<hr />
<p align="center">
  <img src="images/Swiftformer_performance.png" width=60%>
  <br>
  Comparison of our SwiftFormer models with state-of-the-art methods on ImageNet-1K. The latency is measured on an iPhone 14 Neural Engine (iOS 16).
</p>

<p align="center">
  <img src="images/attentions_comparison.png" width=99%>
  <br>
</p>
<p align="left">
  Comparison of different self-attention modules. (a) is typical self-attention. (b) is transpose self-attention, where the self-attention operation is applied across the channel feature dimension (d×d) instead of the spatial dimension (n×n). (c) is the separable self-attention of MobileViT-v2, which uses element-wise operations to compute the context vector from the interactions of the Q and K matrices; the context vector is then multiplied by the V matrix to produce the final output. (d) is our proposed efficient additive self-attention. Here, the query matrix is multiplied by learnable weights and pooled to produce the global queries. The K matrix is then element-wise multiplied by the broadcasted global queries, resulting in the global context representation.
</p>

<details>
<summary>
<font size="+1">Abstract</font>
</summary>
Self-attention has become a de facto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its usage at all stages of the network. Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on an iPhone 14, which is more accurate and 2× faster than MobileViT-v2.
</details>
<br>
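To make the description in (d) concrete, below is a minimal PyTorch sketch of the efficient additive attention idea. It is an illustration only, not the repository's exact implementation: the layer names, normalization choices, and the use of softmax for the pooling weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientAdditiveAttentionSketch(nn.Module):
    """Illustrative sketch of efficient additive attention (not the repo's code)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.to_query = nn.Linear(dim, dim)
        self.to_key = nn.Linear(dim, dim)
        self.w_g = nn.Parameter(torch.randn(dim, 1))  # learnable weights applied to Q
        self.scale = dim ** -0.5
        self.proj = nn.Linear(dim, dim)  # linear layer replacing the key-value interaction
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q = F.normalize(self.to_query(x), dim=-1)
        k = F.normalize(self.to_key(x), dim=-1)

        # Score each query token with the learnable weights, then pool the
        # query matrix into a single global query vector.
        alpha = F.softmax((q @ self.w_g) * self.scale, dim=1)  # (B, N, 1)
        global_q = torch.sum(alpha * q, dim=1, keepdim=True)   # (B, 1, dim)

        # Element-wise interaction of K with the broadcast global query gives
        # the global context; the cost is linear in the number of tokens N.
        context = self.proj(k * global_q) + q                  # (B, N, dim)
        return self.out(context)

# Quick shape check.
x = torch.randn(2, 196, 256)
print(EfficientAdditiveAttentionSketch(256)(x).shape)  # torch.Size([2, 196, 256])
```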

Classification on ImageNet-1K

Models

| Model | Top-1 accuracy | #Params | GMACs | Latency | Ckpt | CoreML |
|---|---|---|---|---|---|---|
| SwiftFormer-XS | 75.7% | 3.5M | 0.6G | 0.7 ms | XS | XS |
| SwiftFormer-S | 78.5% | 6.1M | 1.0G | 0.8 ms | S | S |
| SwiftFormer-L1 | 80.9% | 12.1M | 1.6G | 1.1 ms | L1 | L1 |
| SwiftFormer-L3 | 83.0% | 28.5M | 4.0G | 1.9 ms | L3 | L3 |
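For reference, here is a hedged sketch of loading one of the checkpoints above with the model definitions in this repository. The import path and constructor name are assumptions based on the repository layout, and the checkpoint dictionary key may differ; dist_test.sh remains the authoritative evaluation path.

```python
import torch

from models import SwiftFormer_XS  # assumed entry point; check the repo's models module

model = SwiftFormer_XS()
ckpt = torch.load("weights/SwiftFormer_XS_ckpt.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # handle both wrapped and raw state dicts
model.load_state_dict(state_dict)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # expected: torch.Size([1, 1000])
```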

Detection and Segmentation Qualitative Results

<p align="center"> <img src="images/detection_seg.png" width=100%> <br> </p> <p align="center"> <img src="images/semantic_seg.png" width=100%> <br> </p>

Latency Measurement

The latency reported for SwiftFormer on iPhone 14 (iOS 16) is measured with the benchmark tool in Xcode 14.
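The Core ML models in the table above can be produced from a traced PyTorch model with coremltools (pinned to 5.2.0 in the prerequisites below). The sketch here is generic and makes assumptions: the model entry point is taken from the repository layout, and input preprocessing is omitted, so this is not the project's actual export script.

```python
import coremltools as ct
import torch

from models import SwiftFormer_XS  # assumed entry point in this repository

model = SwiftFormer_XS()
model.eval()

example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

# Convert the traced model; a real export would also bake in the ImageNet
# normalization (e.g. via ct.ImageType scale/bias) to match training.
mlmodel = ct.convert(traced, inputs=[ct.TensorType(name="input", shape=(1, 3, 224, 224))])
mlmodel.save("SwiftFormer_XS.mlmodel")
```

The saved model can then be profiled on-device with Xcode's Core ML performance report.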

SwiftFormer meets Android

Community-driven results on the Samsung Galaxy S23 Ultra (Qualcomm Snapdragon 8 Gen 2):

  1. Export & profiler results of SwiftFormer_L1:

     | QNN version | 2.16 | 2.17 | 2.18 |
     |---|---|---|---|
     | Latency (ms) | 2.63 | 2.26 | 2.43 |

  2. Export & profiler results of the SwiftFormerEncoder block:

     | QNN version | 2.16 | 2.17 | 2.18 |
     |---|---|---|---|
     | Latency (ms) | 2.17 | 1.69 | 1.7 |

     Refer to the script above for details of the input & block parameters.

Interested in reproducing the results above?

Refer to Issue #14 for details about exporting & profiling.
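Issue #14 documents the exact export and profiling flow. As a generic starting point only, a plain ONNX export like the sketch below is a common first step before converting and profiling with Qualcomm's QNN tooling; the entry point and export settings here are assumptions.

```python
import torch

from models import SwiftFormer_L1  # assumed entry point in this repository

model = SwiftFormer_L1()
model.eval()

dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy,
    "SwiftFormer_L1.onnx",
    input_names=["input"],
    output_names=["logits"],
    opset_version=13,
)
```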

ImageNet

Prerequisites

A conda virtual environment is recommended.

```shell
conda create --name=swiftformer python=3.9
conda activate swiftformer

pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
pip install timm
pip install coremltools==5.2.0
```
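Optionally, a quick import check (not part of the repository) confirms the pinned packages load correctly:

```python
import coremltools
import timm
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("timm:", timm.__version__)
print("coremltools:", coremltools.__version__)
print("CUDA available:", torch.cuda.is_available())
```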

Data preparation

Download and extract the ImageNet train and val images from http://image-net.org. The training and validation data are expected to be in the train and val folders, respectively:

```
|-- /path/to/imagenet/
    |-- train
    |-- val
```
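A quick, optional sanity check of this layout using torchvision's ImageFolder (not part of the repository; the path below is a placeholder):

```python
from torchvision import datasets

data_path = "/path/to/imagenet"  # adjust to your local path

train_set = datasets.ImageFolder(f"{data_path}/train")
val_set = datasets.ImageFolder(f"{data_path}/val")

# Expect 1000 classes, ~1.28M training images, and 50,000 validation images.
print(len(train_set.classes), len(train_set), len(val_set))
```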

Single machine multi-GPU training

We provide a training script for all models in dist_train.sh, using PyTorch distributed data parallel (DDP).

To train SwiftFormer models on an 8-GPU machine:

```shell
sh dist_train.sh /path/to/imagenet 8
```

Note: specify in the script which model command you want to run. To reproduce the results of the paper, use a 16-GPU machine with a batch size of 128 or an 8-GPU machine with a batch size of 256 (the same effective total batch size either way). AutoAugment, CutMix, and MixUp are disabled for SwiftFormer-XS; CutMix and MixUp are disabled for SwiftFormer-S.
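As an illustration of the note above (not taken from the training script; the recipe string and alpha values are assumptions), this is how such per-model augmentation settings are typically expressed with timm:

```python
from timm.data import Mixup, create_transform

# SwiftFormer-XS: AutoAugment disabled, and no MixUp/CutMix object is constructed.
xs_train_tf = create_transform(input_size=224, is_training=True, auto_augment=None)

# Larger variants (e.g. L1/L3): a RandAugment policy plus MixUp/CutMix. The
# values below are common defaults, not values read from this repository.
l1_train_tf = create_transform(
    input_size=224, is_training=True, auto_augment="rand-m9-mstd0.5-inc1"
)
l1_mixup = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, label_smoothing=0.1, num_classes=1000)
```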

Multi-node training

On a Slurm-managed cluster, multi-node training can be launched as

```shell
sbatch slurm_train.sh /path/to/imagenet SwiftFormer_XS
```

Note: specify the Slurm-specific parameters in the slurm_train.sh script.

Testing

We provide an example test script, dist_test.sh, using PyTorch distributed data parallel (DDP). For example, to test SwiftFormer-XS on an 8-GPU machine:

```shell
sh dist_test.sh SwiftFormer_XS 8 weights/SwiftFormer_XS_ckpt.pth
```

Citation

If you use our work, please consider citing us:

```bibtex
@InProceedings{Shaker_2023_ICCV,
    author    = {Shaker, Abdelrahman and Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Yang, Ming-Hsuan and Khan, Fahad Shahbaz},
    title     = {SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2023},
}
```

Contact

If you have any questions, please create an issue on this repository or contact us at abdelrahman.youssief@mbzuai.ac.ae.

Acknowledgement

Our codebase is built on the LeViT and EfficientFormer repositories. We thank the authors for their open-source implementations.

Our Related Works