Vision Transformer Attention Benchmark

This repo is a collection of attention mechanisms in vision Transformers. Besides the re-implementations, it provides a benchmark of model parameters, FLOPs, and CPU/GPU throughput.

Requirements

Testing Environment

Setting

Testing

For example, to test HiLo attention:

cd attentions/
python hilo.py

By default, the script tests the model on both CPU and GPU. FLOPs are measured with fvcore. Edit the source file as needed; a minimal sketch of the measurement procedure is shown after the example output below.

Outputs:

Number of Params: 2.2 M
FLOPs = 298.3 M
throughput averaged with 30 times
batch_size 64 throughput on CPU 1029
throughput averaged with 30 times
batch_size 64 throughput on GPU 5104
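
For reference, the sketch below shows one way such numbers can be produced: count parameters, measure FLOPs with fvcore, and average throughput over repeated forward passes. It uses a stand-in torch.nn.MultiheadAttention layer and assumed generic settings (batch size 64, 196 tokens, embedding dimension 384), so its output will not match the table or the scripts in attentions/; treat it as a minimal sketch, not the repo's actual benchmarking code.

```python
import time

import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis


class SelfAttention(nn.Module):
    """Stand-in layer; swap in any attention module from attentions/ instead."""

    def __init__(self, dim=384, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # Plain multi-head self-attention over (B, N, C) token inputs.
        return self.attn(x, x, x, need_weights=False)[0]


@torch.no_grad()
def throughput(model, x, runs=30, warmup=10):
    """Average images per second over `runs` forward passes."""
    model.eval()
    for _ in range(warmup):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return runs * x.shape[0] / (time.time() - start)


if __name__ == "__main__":
    model = SelfAttention()
    # Batch of 64; 14x14 = 196 tokens, i.e. a 1/16-scale feature map of a 224x224 image.
    x = torch.randn(64, 196, 384)

    print(f"Number of Params: {sum(p.numel() for p in model.parameters()) / 1e6:.1f} M")
    # fvcore traces the forward pass; it may warn about ops it cannot count.
    print(f"FLOPs = {FlopCountAnalysis(model, x).total() / 1e6:.1f} M")

    print(f"batch_size 64 throughput on CPU {throughput(model, x):.0f}")
    if torch.cuda.is_available():
        print(f"batch_size 64 throughput on GPU {throughput(model.cuda(), x.cuda()):.0f}")
```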

Supported Attentions

Single Attention Layer Benchmark

| Name | Params (M) | FLOPs (M) | CPU Speed | GPU Speed | Demo |
| --- | --- | --- | --- | --- | --- |
| MSA | 2.36 | 521.43 | 505 | 4403 | msa.py |
| Cross Window | 2.37 | 493.28 | 325 | 4334 | cross_window.py |
| DAT | 2.38 | 528.69 | 223 | 3074 | dat.py |
| Performer | 2.36 | 617.24 | 181 | 3180 | performer.py |
| Linformer | 2.46 | 616.56 | 518 | 4578 | linformer.py |
| SRA | 4.72 | 419.56 | 710 | 4810 | sra.py |
| Local Window | 2.36 | 477.17 | 631 | 4537 | shifted_window.py |
| Shifted Window | 2.36 | 477.17 | 374 | 4351 | shifted_window.py |
| Focal | 2.44 | 526.85 | 146 | 2842 | focal.py |
| XCA | 2.36 | 481.69 | 583 | 4659 | xca.py |
| QuadTree | 5.33 | 613.25 | 72 | 3978 | quadtree.py |
| VAN | 1.83 | 357.96 | 59 | 4213 | van.py |
| HorNet | 2.23 | 436.51 | 132 | 3996 | hornet.py |
| HiLo | 2.20 | 298.30 | 1029 | 5104 | hilo.py |

Note: each method has its own hyperparameters. For a fair comparison on 1/16-scale feature maps, all methods in the table above adopt their default 1/16-scale settings, as provided in their released code repositories. For example, on 1/16-scale feature maps, HiLo in LITv2 adopts a window size of 2 and an alpha of 0.9. Future work will consider more scales and memory benchmarking.
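
As a purely hypothetical illustration of the 1/16-scale setting above (a 224x224 input downsampled by 16 gives a 14x14 grid, i.e. 196 tokens), HiLo might be instantiated as follows. The constructor and forward signatures shown here are assumptions loosely based on the LITv2 release; check attentions/hilo.py for the actual interface before relying on them.

```python
import torch
from hilo import HiLo  # run from inside attentions/; the import path is an assumption

dim, H, W = 384, 14, 14          # 1/16 scale of a 224x224 image -> 14x14 token grid
x = torch.randn(64, H * W, dim)  # (batch, tokens, channels)

# window_size=2 and alpha=0.9 follow the note above; the remaining arguments, and
# whether forward() also takes the spatial size (H, W), are assumptions -- verify
# against the source in attentions/hilo.py.
attn = HiLo(dim, num_heads=8, window_size=2, alpha=0.9)
out = attn(x, H, W)
print(out.shape)                 # expected: torch.Size([64, 196, 384])
```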

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.