Self-slimmed Vision Transformer (ECCV2022)

This repo is the official implementation of "Self-slimmed Vision Transformer".

Updates

07/20/2022

[Initial commits]:

  1. Code and models for LV-ViT are provided.

Introduction

SiT (Self-slimmed Vision Transformer) is introduced in our arXiv paper and serves as a generic self-slimmed learning method for vanilla vision transformers. Our concise TSM (Token Slimming Module) softly integrates redundant tokens into fewer informative ones. For stable and efficient training, we introduce a novel FRD (Feature Recalibration Distillation) framework to leverage structure knowledge, which can densely transfer token information in a flexible auto-encoder manner.
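To make the token-slimming idea concrete, here is a minimal PyTorch sketch, not the official TSM implementation: the module name, the linear scorer parameterization, and all shapes are illustrative assumptions. It learns an attention matrix that softly aggregates N input tokens into M < N output tokens, so no token is hard-pruned.

```python
import torch
import torch.nn as nn


class TokenSlimming(nn.Module):
    """Sketch of a soft token-slimming module (hypothetical, not the repo's TSM).

    Aggregates N input tokens into M informative tokens via a learned
    attention matrix, instead of discarding tokens outright.
    """

    def __init__(self, dim: int, num_out_tokens: int):
        super().__init__()
        # Assumed parameterization: each input token predicts its
        # contribution to each of the M output tokens.
        self.scorer = nn.Linear(dim, num_out_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) token sequence
        scores = self.scorer(x)        # (B, N, M)
        attn = scores.softmax(dim=1)   # normalize over the N input tokens
        attn = attn.transpose(1, 2)    # (B, M, N)
        return attn @ x                # (B, M, C) slimmed tokens


# Usage: slim 196 tokens down to 98 (a 2x token reduction).
tsm = TokenSlimming(dim=384, num_out_tokens=98)
tokens = torch.randn(2, 196, 384)
print(tsm(tokens).shape)  # torch.Size([2, 98, 384])
```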

Our SiT can speed up ViTs by 1.7x with a negligible accuracy drop, and even by 3.6x while maintaining 97% of their performance. Surprisingly, by simply arming LV-ViT with our SiT, we achieve new state-of-the-art performance on ImageNet, surpassing all recent CNNs and ViTs.

Main results on LV-ViT

We follow the settings of LeViT for inference speed evaluation; a rough benchmarking sketch is provided after the table below.

| Model | Teacher | Resolution | Top-1 (%) | #Param. | FLOPs | Ckpt | Shell |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SiT-T | LV-ViT-T | 224x224 | 80.1 | 15.9M | 1.0G | google | train.sh |
| SiT-XS | LV-ViT-S | 224x224 | 81.2 | 25.6M | 1.5G | google | train.sh |
| SiT-S | LV-ViT-S | 224x224 | 83.1 | 25.6M | 4.0G | google | train.sh |
| SiT-M | LV-ViT-M | 224x224 | 84.2 | 55.6M | 8.1G | google | train.sh |
| SiT-L | LV-ViT-L | 288x288 | 85.6 | 148.2M | 34.4G | google | train.sh |
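Throughput can be measured along the following lines. This is a hedged sketch, not the repo's benchmark script: the batch size, warm-up count, and run count are assumptions rather than LeViT's exact settings.

```python
import time
import torch


@torch.no_grad()
def measure_throughput(model, batch_size=256, resolution=224, runs=30):
    """Rough images/s estimate for a classification model (sketch only)."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)

    # Warm-up so lazy CUDA initialization does not skew the timing.
    for _ in range(10):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return batch_size * runs / (time.perf_counter() - start)


# Usage: print(measure_throughput(my_sit_model))  # `my_sit_model` built from this repo
```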

The LV-ViT teacher models are trained with token labeling, and their checkpoints are provided below; a loading sketch follows the table.

| Model | Resolution | Top-1 (%) | #Param. | FLOPs | Ckpt |
| --- | --- | --- | --- | --- | --- |
| LV-ViT-T | 224x224 | 81.8 | 15.7M | 3.5G | google |
| LV-ViT-S | 224x224 | 83.1 | 25.4M | 5.5G | google |
| LV-ViT-M | 224x224 | 84.0 | 55.2M | 11.9G | google |
| LV-ViT-L | 288x288 | 85.3 | 147M | 56.1G | google |
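Once downloaded, a checkpoint can be inspected and loaded with plain PyTorch. This is a minimal sketch: the filename and the `"model"` wrapper key are assumptions, and the actual model constructor comes from this repo.

```python
import torch

# Hypothetical filename -- use the file downloaded from the "Ckpt" column.
ckpt = torch.load("lvvit_s.pth", map_location="cpu")
# Some checkpoints store weights under a "model" key; fall back to the dict itself.
state_dict = ckpt.get("model", ckpt)

# Inspect a few parameter names and shapes before loading.
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))

# model.load_state_dict(state_dict)  # `model` built from this repo's code
```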

Cite SiT

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{zong2021self,
      title={Self-slimmed Vision Transformer}, 
      author={Zhuofan Zong and Kunchang Li and Guanglu Song and Yali Wang and Yu Qiao and Biao Leng and Yu Liu},
      year={2021},
      eprint={2111.12624},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

This project is released under the MIT license. Please see the LICENSE file for more information.