TransNeXt

Official PyTorch implementation of "TransNeXt: Robust Foveal Visual Perception for Vision Transformers" [CVPR 2024].

🤗 If you are interested in this project, don’t hesitate to give it a ⭐️!

Updates

2024.06.08 We have created an explanatory video for our paper. You can watch it on YouTube or BiliBili.

2024.04.20 We have released the complete training and inference code, pre-trained model weights, and training logs!

2024.02.26 Our paper has been accepted by CVPR 2024! 🎉

2023.11.28 We have submitted the preprint of our paper to arXiv.

2023.09.21 We have submitted our paper and the model code to OpenReview, where it is publicly accessible.

Current Progress

:heavy_check_mark: Release of model code and CUDA implementation for acceleration.

:heavy_check_mark: Release of comprehensive training and inference code.

:heavy_check_mark: Release of pretrained model weights and training logs.

Motivation and Highlights

<div align="center"> <img src="figures/multi_scale_inference.jpg" alt="multi_scale_inference" style="width: 60%;" /> </div>

Methods

Pixel-focused attention (Left) & aggregated attention (Right):

pixel-focused_attention

Convolutional GLU (First on the right):

Convolutional GLU

Results

Image Classification, Detection and Segmentation:

experiment_figure

Attention Visualization:

foveal_peripheral_vision

Model Zoo

Image Classification

Classification code & weights & configs & training logs are >>>here<<<.

ImageNet-1K 224x224 pre-trained models:

| Model | #Params | #FLOPs | IN-1K | IN-A | IN-C↓ | IN-R | Sketch | IN-V2 | Download | Config | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TransNeXt-Micro | 12.8M | 2.7G | 82.5 | 29.9 | 50.8 | 45.8 | 33.0 | 72.6 | model | config | log |
| TransNeXt-Tiny | 28.2M | 5.7G | 84.0 | 39.9 | 46.5 | 49.6 | 37.6 | 73.8 | model | config | log |
| TransNeXt-Small | 49.7M | 10.3G | 84.7 | 47.1 | 43.9 | 52.5 | 39.7 | 74.8 | model | config | log |
| TransNeXt-Base | 89.7M | 18.4G | 84.8 | 50.6 | 43.5 | 53.9 | 41.4 | 75.1 | model | config | log |

ImageNet-1K 384x384 fine-tuned models:

| Model | #Params | #FLOPs | IN-1K | IN-A | IN-R | Sketch | IN-V2 | Download | Config |
|---|---|---|---|---|---|---|---|---|---|
| TransNeXt-Small | 49.7M | 32.1G | 86.0 | 58.3 | 56.4 | 43.2 | 76.8 | model | config |
| TransNeXt-Base | 89.7M | 56.3G | 86.2 | 61.6 | 57.7 | 44.7 | 77.0 | model | config |

ImageNet-1K 256x256 pre-trained model fully utilizing aggregated attention at all stages:

(See Table 9 in Appendix D.6 for details.)

| Model | Token mixer | #Params | #FLOPs | IN-1K | Download | Config | Log |
|---|---|---|---|---|---|---|---|
| TransNeXt-Micro | A-A-A-A | 13.1M | 3.3G | 82.6 | model | config | log |

Object Detection

Object detection code & weights & configs & training logs are >>>here<<<.

COCO object detection and instance segmentation results using the Mask R-CNN method:

| Backbone | Pretrained Model | Lr Schd | box mAP | mask mAP | #Params | Download | Config | Log |
|---|---|---|---|---|---|---|---|---|
| TransNeXt-Tiny | ImageNet-1K | 1x | 49.9 | 44.6 | 47.9M | model | config | log |
| TransNeXt-Small | ImageNet-1K | 1x | 51.1 | 45.5 | 69.3M | model | config | log |
| TransNeXt-Base | ImageNet-1K | 1x | 51.7 | 45.9 | 109.2M | model | config | log |

COCO object detection results using the DINO method:

| Backbone | Pretrained Model | scales | epochs | box mAP | #Params | Download | Config | Log |
|---|---|---|---|---|---|---|---|---|
| TransNeXt-Tiny | ImageNet-1K | 4scale | 12 | 55.1 | 47.8M | model | config | log |
| TransNeXt-Tiny | ImageNet-1K | 5scale | 12 | 55.7 | 48.1M | model | config | log |
| TransNeXt-Small | ImageNet-1K | 5scale | 12 | 56.6 | 69.6M | model | config | log |
| TransNeXt-Base | ImageNet-1K | 5scale | 12 | 57.1 | 110M | model | config | log |

Semantic Segmentation

Semantic segmentation code & weights & configs & training logs are >>>here<<<.

ADE20K semantic segmentation results using the UPerNet method:

| Backbone | Pretrained Model | Crop Size | Lr Schd | mIoU | mIoU (ms+flip) | #Params | Download | Config | Log |
|---|---|---|---|---|---|---|---|---|---|
| TransNeXt-Tiny | ImageNet-1K | 512x512 | 160K | 51.1 | 51.5/51.7 | 59M | model | config | log |
| TransNeXt-Small | ImageNet-1K | 512x512 | 160K | 52.2 | 52.5/52.8 | 80M | model | config | log |
| TransNeXt-Base | ImageNet-1K | 512x512 | 160K | 53.0 | 53.5/53.7 | 121M | model | config | log |

ADE20K semantic segmentation results using the Mask2Former method:

| Backbone | Pretrained Model | Crop Size | Lr Schd | mIoU | #Params | Download | Config | Log |
|---|---|---|---|---|---|---|---|---|
| TransNeXt-Tiny | ImageNet-1K | 512x512 | 160K | 53.4 | 47.5M | model | config | log |
| TransNeXt-Small | ImageNet-1K | 512x512 | 160K | 54.1 | 69.0M | model | config | log |
| TransNeXt-Base | ImageNet-1K | 512x512 | 160K | 54.7 | 109M | model | config | log |

Installation

CUDA Implementation

Before installing the CUDA extension, please ensure that the CUDA toolkit version on your machine (check with `nvcc -V`) matches the CUDA version that PyTorch was built with.

```shell
cd swattention_extension
pip install .
```
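The two versions only need to agree at the major.minor level, since `torch.version.cuda` reports a string such as `"11.8"`. A minimal sketch of that comparison (the version strings below are placeholders — substitute the version printed by `nvcc -V` and the value of `torch.version.cuda` on your machine):

```python
def cuda_versions_compatible(toolkit: str, torch_cuda: str) -> bool:
    """Return True when the toolkit and PyTorch CUDA versions agree at major.minor."""
    major_minor = lambda v: tuple(int(x) for x in v.split(".")[:2])
    return major_minor(toolkit) == major_minor(torch_cuda)

# Placeholder values for illustration:
print(cuda_versions_compatible("11.8", "11.8"))  # True  -> safe to build the extension
print(cuda_versions_compatible("12.1", "11.8"))  # False -> reinstall PyTorch or the toolkit first
```

If the check fails, install a PyTorch build that matches your toolkit (or vice versa) before running `pip install .`.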

Acknowledgement

License

This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Citation

If you find our work helpful, please consider citing the following BibTeX entry. A star for this project would also be greatly appreciated.

```bibtex
@InProceedings{shi2023transnext,
  author    = {Dai Shi},
  title     = {TransNeXt: Robust Foveal Visual Perception for Vision Transformers},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {17773-17783}
}
```