BinaryViT

This repository contains the training code of our work: "BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models".

When convolutional neural network (CNN) binarization methods, or other existing binarization methods, are applied directly to vision transformers (ViTs), the ViTs suffer a larger performance drop than CNNs on datasets with a large number of classes, such as ImageNet-1k. We therefore propose BinaryViT, which, inspired by the CNN architecture, incorporates operations from CNNs into a pure ViT architecture to enrich the representational capability of a binary ViT without introducing convolutions. These operations are an average pooling layer in place of a token pooling layer, a block that contains multiple average pooling branches, an affine transformation right before the addition of each main residual connection, and a pyramid structure. Experimental results on the ImageNet-1k dataset show the effectiveness of these operations, which allow a fully binary pure ViT model to be competitive with previous state-of-the-art (SOTA) binary CNN models.

An overview of our architectural modifications is illustrated below:

<div align=center> <img src="https://github.com/Phuoc-Hoan-Le/BinaryViT/blob/main/overview.png"/> </div>
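
To make two of the operations described above more concrete, here is a minimal, dependency-free sketch of average pooling over a grid of tokens and of an affine transformation applied right before a residual addition. It is an illustration only and is not taken from this repository: the function names `avg_pool_tokens` and `affine_residual`, the 2x2 window with stride 2, and the per-channel `scale`/`shift` parameters are all assumptions chosen for clarity.

```python
# Illustrative sketch only; names, window size, and parameter layout are
# assumptions, not the repository's actual implementation.

def avg_pool_tokens(grid, k=2):
    """Average-pool an H x W grid of token vectors with a k x k window, stride k.

    Each output token is the channel-wise mean of the k*k tokens in its window,
    so the spatial resolution shrinks by a factor of k per side (a pyramid step).
    """
    h, w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    pooled = []
    for i in range(0, h - k + 1, k):
        row = []
        for j in range(0, w - k + 1, k):
            window = [grid[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append([sum(tok[c] for tok in window) / (k * k) for c in range(dim)])
        pooled.append(row)
    return pooled


def affine_residual(x, branch_out, scale, shift):
    """Scale and shift the branch output per channel, then add the residual x."""
    return [xi + s * b + t for xi, b, s, t in zip(x, branch_out, scale, shift)]


# A 4 x 4 grid of 2-channel tokens: channel 0 is the token index, channel 1 is 1.0.
grid = [[[float(i * 4 + j), 1.0] for j in range(4)] for i in range(4)]
pooled = avg_pool_tokens(grid, k=2)          # 2 x 2 grid of tokens
out = affine_residual([1.0, 2.0], [1.0, -1.0], scale=[0.5, 0.5], shift=[0.0, 1.0])
```

In this sketch the pooled grid has half the resolution per side, which is how a pyramid structure reduces the number of tokens from stage to stage without using a convolution. The affine step lets each channel of a branch output be rescaled and shifted before it is merged back into the residual stream.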

Run

1. Requirements:

2. To run:

Citation

If you find our work or this code useful, please cite our paper:

@InProceedings{Le_2023_CVPR,
    author    = {Le, Phuoc-Hoan Charles and Li, Xinlin},
    title     = {BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2023},
    pages     = {4664-4673}
}