<div align=center> <img src="overview.png" width="500px" /> </div>

I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference

This repository contains the official implementation for the paper "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference". To the best of our knowledge, this is the first work on integer-only quantization for vision transformers.
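The key ingredient of integer-only inference is replacing floating-point requantization with dyadic arithmetic: a rescaling factor is approximated by an integer multiplier and a bit shift, so the whole pipeline can run on integer units. The snippet below is a minimal illustrative sketch of that general idea, not code from this repository; the helper names (`dyadic_approx`, `requantize`) and the bit widths are hypothetical choices made for the example.

```python
# Minimal sketch of dyadic requantization (illustrative only, not the repo's code).
# A floating-point scale s is approximated as b / 2**c, so that
# round(x * s) can be computed with an integer multiply and a bit shift.
import math
import numpy as np

def dyadic_approx(scale: float, mult_bits: int = 15):
    """Hypothetical helper: return (b, c) such that scale ~= b / 2**c.

    Assumes 0 < scale < 1, which is typical for requantization scales.
    """
    _, e = math.frexp(scale)          # scale = m * 2**e with 0.5 <= m < 1
    c = mult_bits - e                 # choose the shift so b uses ~mult_bits bits
    b = int(round(scale * (1 << c)))  # integer multiplier
    return b, c

def requantize(x_int: np.ndarray, scale: float) -> np.ndarray:
    """Integer-only rescaling: (x * b) >> c instead of round(x * scale)."""
    b, c = dyadic_approx(scale)
    return (x_int.astype(np.int64) * b) >> c  # arithmetic right shift

# Example: rescale INT32 accumulator values with a scale of 0.0123.
acc = np.array([1234, -5678, 91011], dtype=np.int64)
print(requantize(acc, 0.0123))                   # integer-only result
print(np.round(acc * 0.0123).astype(np.int64))   # floating-point reference
```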

Below are instructions for the PyTorch code to reproduce the accuracy results of quantization-aware training (QAT). TVM benchmark is the TVM deployment project for reproducing the latency results.

Installation

git clone https://github.com/zkkli/I-ViT.git
cd I-ViT

QAT Experiments

python quant_train.py [--model] [--data] [--epochs] [--lr]

optional arguments:
--model: Model architecture. Choices:
         deit_tiny, deit_small, deit_base, swin_tiny, swin_small, swin_base.
--data: Path to the ImageNet dataset.
--epochs: Number of training epochs. Recommended values: [30, 60, 90], default=90.
--lr: Learning rate. Recommended values: [2e-7, 5e-7, 1e-6, 2e-6], default=1e-6.

Example: quantization-aware training of DeiT-T:

python quant_train.py --model deit_tiny --data <YOUR_DATA_DIR> --epochs 30 --lr 5e-7

Results

Below are the Top-1 accuracy (%) results of the proposed I-ViT on the ImageNet dataset.

| Model  | FP32  | INT8 (I-ViT) | Diff. |
|--------|-------|--------------|-------|
| ViT-S  | 81.39 | 81.27        | -0.12 |
| ViT-B  | 84.53 | 84.76        | +0.23 |
| DeiT-T | 72.21 | 72.24        | +0.03 |
| DeiT-S | 79.85 | 80.12        | +0.27 |
| DeiT-B | 81.85 | 81.74        | -0.11 |
| Swin-T | 81.35 | 81.50        | +0.15 |
| Swin-S | 83.20 | 83.01        | -0.19 |

Citation

If you find this implementation useful for your work, please cite the following paper:

@inproceedings{li2023vit,
  title={I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference},
  author={Li, Zhikai and Gu, Qingyi},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={17065--17075},
  year={2023}
}