AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Lingchen Meng*<sup>1</sup>, Hengduo Li*<sup>2</sup>, Bor-Chun Chen<sup>3</sup>, Shiyi Lan<sup>2</sup>, Zuxuan Wu<sup>1</sup>, Yu-Gang Jiang<sup>1</sup>, Ser-Nam Lim<sup>3</sup><br> <sup>1</sup>Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, <sup>2</sup>University of Maryland, <sup>3</sup>Meta AI<br> * Equal contribution

This repository is an official implementation of AdaViT.
Our code is based on pytorch-image-models and T2T-ViT.

Abstract

Built on top of self-attention mechanisms, vision transformers have demonstrated remarkable performance on a variety of tasks recently. While achieving excellent performance, they still require relatively intensive computational cost that scales up drastically as the numbers of patches, self-attention heads and transformer blocks increase. In this paper, we argue that due to the large variations among images, their need for modeling long-range dependencies between patches differs. To this end, we introduce AdaViT, an adaptive computation framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use throughout the backbone on a per-input basis, aiming to improve inference efficiency of vision transformers with a minimal drop of accuracy for image recognition. Optimized jointly with a transformer backbone in an end-to-end manner, a light-weight decision network is attached to the backbone to produce decisions on-the-fly. Extensive experiments on ImageNet demonstrate that our method obtains more than $2\times$ improvement on efficiency compared to state-of-the-art vision transformers with only $0.8\%$ drop of accuracy, achieving good efficiency/accuracy trade-offs conditioned on different computational budgets. We further conduct quantitative and qualitative analysis on learned usage policies and provide more insights on the redundancy in vision transformers.

<img src="assets/adavit_approach.png">
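The decision network is only described at a high level above. For readers who want a concrete picture, the snippet below is a minimal PyTorch sketch of the idea, not the module used in this repository: the class name `DecisionNetwork`, the pooled-summary design, and the straight-through Gumbel-Sigmoid relaxation are all illustrative assumptions. See the model definitions in this repository for the actual implementation.

```python
# Minimal sketch (NOT the repository's actual module) of a light-weight decision
# network that predicts per-input keep/skip decisions for a transformer block,
# its attention heads, and its patch tokens. Shapes and the straight-through
# Gumbel-Sigmoid relaxation are illustrative assumptions.
import torch
import torch.nn as nn


def gumbel_sigmoid(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable binary decisions via a straight-through Gumbel-Sigmoid."""
    g1 = -torch.empty_like(logits).exponential_().log()  # Gumbel(0, 1) sample
    g2 = -torch.empty_like(logits).exponential_().log()  # Gumbel(0, 1) sample
    soft = torch.sigmoid((logits + g1 - g2) / tau)
    hard = (soft > 0.5).float()
    return hard + soft - soft.detach()  # forward: hard 0/1, backward: soft gradient


class DecisionNetwork(nn.Module):
    """Predicts usage policies for one transformer block from its input tokens."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.block_head = nn.Linear(embed_dim, 1)         # keep / skip the whole block
        self.head_head = nn.Linear(embed_dim, num_heads)  # keep / skip each attention head
        self.token_head = nn.Linear(embed_dim, 1)         # keep / drop each patch token

    def forward(self, x: torch.Tensor):
        # x: (batch, num_tokens, embed_dim)
        summary = x.mean(dim=1)                                      # per-image summary
        block_mask = gumbel_sigmoid(self.block_head(summary))        # (batch, 1)
        head_mask = gumbel_sigmoid(self.head_head(summary))          # (batch, num_heads)
        token_mask = gumbel_sigmoid(self.token_head(x)).squeeze(-1)  # (batch, num_tokens)
        return block_mask, head_mask, token_mask


if __name__ == "__main__":
    x = torch.randn(2, 197, 384)  # DeiT-S-like token shape, purely illustrative
    policy = DecisionNetwork(embed_dim=384, num_heads=6)
    block_mask, head_mask, token_mask = policy(x)
    print(block_mask.shape, head_mask.shape, token_mask.shape)
```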

Model Zoo

Our model checkpoints are available via the download links in the table below.

| Model | Top-1 Acc | MACs | Download |
| --- | --- | --- | --- |
| Ada-T2T-ViT-19 | 81.1 | 3.9G | link |
| Ada-DeiT-S | 77.3 | 2.3G | link |

Evaluate our model

Download our AdaViT checkpoint with the T2T-ViT-19 backbone from Google Drive and run the command below. You should get a top-1 accuracy of about 81.1 at about 3.9 GFLOPs.

```bash
python3 ada_main.py /path/to/your/imagenet  \
  --model ada_step_t2t_vit_19_lnorm \
  --ada-head --ada-layer --ada-token-with-mlp \
  --flops-dict adavit_ckpt/t2t-19-h-l-tmlp_flops_dict.pth \
  --eval_checkpoint /path/to/your/checkpoint
```

To evaluate AdaViT with the DeiT-S backbone (77.3 top-1 accuracy at 2.3G MACs, per the table above), run:

```bash
python3 ada_main.py /path/to/your/imagenet  \
  --model ada_step_deit_small_patch16_224 \
  --ada-head --ada-layer --ada-token-with-mlp \
  --flops-dict adavit_ckpt/deit-s-h-l-tmlp_flops_dict.pth \
  --eval_checkpoint /path/to/your/checkpoint
```
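The `--flops-dict` argument points to a precomputed table of per-component compute costs used when measuring the adaptive forward pass. If you want a quick look at its contents, the snippet below assumes the `.pth` file is an ordinary torch-serialized Python dictionary; this is an assumption, so check the loading code in `ada_main.py` for the exact format.

```python
# Hypothetical helper for peeking at a precomputed FLOPs dictionary.
# Assumes the .pth file is a plain torch-serialized dict; the actual
# structure used by ada_main.py may differ.
import torch

flops_dict = torch.load("adavit_ckpt/t2t-19-h-l-tmlp_flops_dict.pth", map_location="cpu")
print(type(flops_dict))
if isinstance(flops_dict, dict):
    for key, value in list(flops_dict.items())[:5]:  # show a few entries
        print(key, value)
```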

Citation

```bibtex
@inproceedings{meng2022adavit,
  title={AdaViT: Adaptive Vision Transformers for Efficient Image Recognition},
  author={Meng, Lingchen and Li, Hengduo and Chen, Bor-Chun and Lan, Shiyi and Wu, Zuxuan and Jiang, Yu-Gang and Lim, Ser-Nam},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={12309--12318},
  year={2022}
}
```