MaxViT: Multi-Axis Vision Transformer (ECCV 2022)

This repository hosts the official TensorFlow implementation of MaxViT models:

MaxViT: Multi-Axis Vision Transformer. ECCV 2022.
Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li
Google Research, University of Texas at Austin

Disclaimer: This is not an officially supported Google product.

MaxViT Models

MaxViT is a family of hybrid (CNN + ViT) image classification models that outperform both SoTA ConvNets and Transformers across the board in parameter and FLOPs efficiency. They also scale well to large datasets such as ImageNet-21K. Notably, because the grid attention used has linear complexity, MaxViT is able to "see" globally throughout the entire network, even in the earlier, high-resolution stages.
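To make the grid attention idea concrete, here is a minimal NumPy sketch of the two token groupings that multi-axis attention relies on: local windows (block partition) and a strided, image-spanning grid (grid partition). It is an illustration only, not the repository's TensorFlow code; the function names `block_partition`, `grid_partition`, and `attention_pairs` are hypothetical, and the toy 8x8 feature map with a group size of 2 is an assumption chosen for readability.

```python
import numpy as np


def block_partition(x, p):
    """Group an (H, W, C) map into non-overlapping p x p local windows.

    Returns an array of shape (num_windows, p * p, C).
    """
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p, c)


def grid_partition(x, g):
    """Group an (H, W, C) map into g x g sets of tokens strided across the image.

    Returns an array of shape (num_groups, g * g, C). Each group holds
    tokens spaced H/g rows and W/g columns apart, so it spans the whole map.
    """
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/g, W/g, g, g, C)
    return x.reshape(-1, g * g, c)


def attention_pairs(h, w, size):
    """Count query-key pairs when attention is restricted to groups of size*size tokens."""
    num_groups = (h * w) // (size * size)
    return num_groups * (size * size) ** 2  # equals h * w * size**2


# Toy feature map whose single channel stores each token's flat index.
x = np.arange(64).reshape(8, 8, 1)

local_tokens = block_partition(x, 2)[0, :, 0]   # neighbouring positions
global_tokens = grid_partition(x, 2)[0, :, 0]   # positions spread over the map

windowed_cost = attention_pairs(8, 8, 2)  # grows linearly with H * W
full_cost = (8 * 8) ** 2                  # grows quadratically with H * W
```

With a fixed group size, the number of attended pairs is `H * W * size**2`, which is linear in the number of tokens, whereas full self-attention needs `(H * W)**2` pairs. The grid grouping keeps that linear cost while mixing tokens from across the entire image.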

MaxViT meta-architecture:

<p align="center"> <img src = "./doc/maxvit_arch.png" width="80%"> </p>

Results on ImageNet-1k train and test:

<p align="center"> <img src = "./doc/imagenet_results.png" width="80%"> </p>

Results on ImageNet-21k and JFT pre-trained models:

<p align="center"> <img src = "./doc/i21k_jft_results.png" width="80%"> </p>

Colab Demo

We have released a Google Colab demo that walks through running MaxViT on images. Try it here: Open In Colab

Pretrained MaxViT Checkpoints

We have provided a list of results and checkpoints as follows:

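| Name | Resolution | Top1 Acc. | #Params | FLOPs | Model |
|------|------------|-----------|---------|-------|-------|
| MaxViT-T | 224x224 | 83.62% | 31M | 5.6B | ckpt |
| MaxViT-T | 384x384 | 85.24% | 31M | 17.7B | ckpt |
| MaxViT-T | 512x512 | 85.72% | 31M | 33.7B | ckpt |
| MaxViT-S | 224x224 | 84.45% | 69M | 11.7B | ckpt |
| MaxViT-S | 384x384 | 85.74% | 69M | 36.1B | ckpt |
| MaxViT-S | 512x512 | 86.19% | 69M | 67.6B | ckpt |
| MaxViT-B | 224x224 | 84.95% | 119M | 24.2B | ckpt |
| MaxViT-B | 384x384 | 86.34% | 119M | 74.2B | ckpt |
| MaxViT-B | 512x512 | 86.66% | 119M | 138.5B | ckpt |
| MaxViT-L | 224x224 | 85.17% | 212M | 43.9B | ckpt |
| MaxViT-L | 384x384 | 86.40% | 212M | 133.1B | ckpt |
| MaxViT-L | 512x512 | 86.70% | 212M | 245.4B | ckpt |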

Here is a list of ImageNet-21K pretrained and ImageNet-1K finetuned models:

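| Name | Resolution | Top1 Acc. | #Params | FLOPs | 21k model | 1k model |
|------|------------|-----------|---------|-------|-----------|----------|
| MaxViT-B | 224x224 | - | 119M | 24.2B | ckpt | - |
| MaxViT-B | 384x384 | - | 119M | 74.2B | - | ckpt |
| MaxViT-B | 512x512 | - | 119M | 138.5B | - | ckpt |
| MaxViT-L | 224x224 | - | 212M | 43.9B | ckpt | - |
| MaxViT-L | 384x384 | - | 212M | 133.1B | - | ckpt |
| MaxViT-L | 512x512 | - | 212M | 245.4B | - | ckpt |
| MaxViT-XL | 224x224 | - | 475M | 97.8B | ckpt | - |
| MaxViT-XL | 384x384 | - | 475M | 293.7B | - | ckpt |
| MaxViT-XL | 512x512 | - | 475M | 535.2B | - | ckpt |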

Citation

Should you find this repository useful, please consider citing:

@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}

Other Related Works

Acknowledgement: This repository is built on EfficientNets and CoAtNet.