MaxViT: Multi-Axis Vision Transformer (ECCV 2022)

This repository hosts the official TensorFlow implementation of MaxViT models:

MaxViT: Multi-Axis Vision Transformer. ECCV 2022.
Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li
Google Research, University of Texas at Austin

Disclaimer: This is not an officially supported Google product.

News:

May 2023: MaxViT is officially released in the TensorFlow Model Garden to support training!
Oct 12, 2022: Added the remaining ImageNet-1K and -21K checkpoints.
Oct 4, 2022: A list of updates
- Added MaxViTTiny and MaxViTSmall checkpoints.
- Added a Colab tutorial.
Sep 8, 2022: Our Google AI blog post covering both MaxViT and MAXIM is live.
Sep 7, 2022: @rwightman released a few small model weights in timm that achieve even better results than those reported in our paper. See more here.
Aug 26, 2022: Our MaxViT models have been implemented in timm (pytorch-image-models). Kudos to @rwightman!
July 21, 2022: Initial code release of MaxViT models; the paper was accepted to ECCV'22.
Apr 6, 2022: MaxViT has been implemented by @lucidrains in vit-pytorch :scream: :exploding_head:
Apr 4, 2022: Initial upload to arXiv.

MaxViT Models

MaxViT is a family of hybrid (CNN + ViT) image classification models that achieve better performance across the board, in both parameter and FLOPs efficiency, than both SoTA ConvNets and Transformers. They also scale well to large datasets such as ImageNet-21K. Notably, due to the linear complexity of the grid attention used, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages.
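To make the linear-complexity claim concrete, here is a minimal pure-Python sketch of how block (local window) attention and grid (sparse global) attention group tokens on a feature map. It is not the repository's TensorFlow code: the function names are hypothetical, and it assumes the height and width are divisible by the window or grid size. Each token attends only to the tokens in its own group, so the cost is the sum of squared group sizes.

```python
from collections import defaultdict


def block_groups(height, width, window):
    """Group token coordinates by non-overlapping local windows (block attention)."""
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            groups[(i // window, j // window)].append((i, j))
    return dict(groups)


def grid_groups(height, width, grid):
    """Group token coordinates on a sparse, dilated grid (grid attention).

    Each group holds grid * grid tokens spaced (height // grid, width // grid)
    apart, so every group spans the whole feature map.
    """
    stride_h, stride_w = height // grid, width // grid
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            groups[(i % stride_h, j % stride_w)].append((i, j))
    return dict(groups)


def attention_pairs(groups):
    """Count query-key pairs when attention is restricted to each group."""
    return sum(len(tokens) ** 2 for tokens in groups.values())


# Compare grid attention with full self-attention as the feature map grows.
for side in (8, 16, 32):
    grid_cost = attention_pairs(grid_groups(side, side, grid=4))
    full_cost = (side * side) ** 2
    print(f"{side}x{side}: grid pairs = {grid_cost}, full pairs = {full_cost}")
```

With a fixed grid size, the number of groups grows with the number of tokens while each group stays the same size, so the pair count grows linearly in the number of tokens. Full self-attention grows quadratically.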

MaxViT meta-architecture:

Results on ImageNet-1k train and test:

Results on ImageNet-21k and JFT pre-trained models:

Colab Demo

We have released a Google Colab demo with a tutorial on how to run MaxViT on images. Try it here.

Pretrained MaxViT Checkpoints

We have provided a list of results and checkpoints as follows:

Name	Resolution	Top1 Acc.	#Params	FLOPs	Model
MaxViT-T	224x224	83.62%	31M	5.6B	ckpt
MaxViT-T	384x384	85.24%	31M	17.7B	ckpt
MaxViT-T	512x512	85.72%	31M	33.7B	ckpt
MaxViT-S	224x224	84.45%	69M	11.7B	ckpt
MaxViT-S	384x384	85.74%	69M	36.1B	ckpt
MaxViT-S	512x512	86.19%	69M	67.6B	ckpt
MaxViT-B	224x224	84.95%	119M	24.2B	ckpt
MaxViT-B	384x384	86.34%	119M	74.2B	ckpt
MaxViT-B	512x512	86.66%	119M	138.5B	ckpt
MaxViT-L	224x224	85.17%	212M	43.9B	ckpt
MaxViT-L	384x384	86.40%	212M	133.1B	ckpt
MaxViT-L	512x512	86.70%	212M	245.4B	ckpt

Here is a list of ImageNet-21K pretrained and ImageNet-1K finetuned models:

Name	Resolution	Top1 Acc.	#Params	FLOPs	21k model	1k model
MaxViT-B	224x224	-	119M	24.2B	ckpt	-
MaxViT-B	384x384	-	119M	74.2B	-	ckpt
MaxViT-B	512x512	-	119M	138.5B	-	ckpt
MaxViT-L	224x224	-	212M	43.9B	ckpt	-
MaxViT-L	384x384	-	212M	133.1B	-	ckpt
MaxViT-L	512x512	-	212M	245.4B	-	ckpt
MaxViT-XL	224x224	-	475M	97.8B	ckpt	-
MaxViT-XL	384x384	-	475M	293.7B	-	ckpt
MaxViT-XL	512x512	-	475M	535.2B	-	ckpt

Citation

Should you find this repository useful, please consider citing:

@article{tu2022maxvit,
  title={MaxViT: Multi-Axis Vision Transformer},
  author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao},
  journal={ECCV},
  year={2022},
}

Other Related Works

MAXIM: Multi-Axis MLP for Image Processing, CVPR 2022. Paper | Code
CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers, CoRL 2022. Paper | Code
Improved Transformer for High-Resolution GANs, NeurIPS 2021. Paper | Code
CoAtNet: Marrying Convolution and Attention for All Data Sizes, NeurIPS 2021. Paper
EfficientNetV2: Smaller Models and Faster Training, ICML 2021. Paper | Code

Acknowledgement: This repository is built on the EfficientNets and CoAtNet repositories.