UniFormer

<a href="https://huggingface.co/spaces/Andy1621/uniformer_light"> <img src="https://img.shields.io/badge/%F0%9F%A4%97-Open%20in%20Spaces-blue" alt="Open in Hugging Face Spaces"> </a> <a href="https://arxiv.org/abs/2201.09450"> <img src="https://img.shields.io/badge/cs.CV-2201.09450-b31b1b?logo=arxiv&logoColor=red" alt="arXiv 2201.09450"> </a> <a href="https://arxiv.org/abs/2201.04676"> <img src="https://img.shields.io/badge/cs.CV-2201.04676-b31b1b?logo=arxiv&logoColor=red" alt="arXiv 2201.04676"> </a>

💬 This repo is the official implementation of UniFormer: Unifying Convolution and Self-attention for Visual Recognition and UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning.

🤖 It currently includes code and models for the following tasks: image classification, video classification, object detection, semantic segmentation, and pose estimation.

🌟 Other popular repos:

⚠️ Note

For downstream tasks:

🔥 Updates

05/19/2023

The extension version has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 🎉🎉🎉. In the revised version, we explore a simple yet effective lightweight design, Hourglass UniFormer, and based on it we propose the efficient UniFormer-XS and UniFormer-XXS.
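The hourglass idea is to shrink tokens before the expensive global blocks and recover them afterwards. Below is a minimal PyTorch sketch of that shrinking-and-recovering pattern; the module, pooling factor, and shapes are illustrative assumptions, not the released UniFormer-XS/XXS code.

```python
import torch
import torch.nn as nn

class HourglassGlobalBlock(nn.Module):
    """Sketch of the hourglass idea: shrink tokens, run global attention, recover."""
    def __init__(self, dim, num_heads=4, shrink=2):
        super().__init__()
        self.shrink = shrink
        self.pool = nn.AvgPool2d(shrink)                            # token shrinking
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.up = nn.Upsample(scale_factor=shrink, mode="nearest")  # token recovering

    def forward(self, x):                                           # x: (B, C, H, W)
        B, C, H, W = x.shape
        small = self.pool(x)                                        # fewer tokens -> cheaper attention
        t = small.flatten(2).transpose(1, 2)                        # (B, N', C)
        t = self.attn(t, t, t)[0]
        small = t.transpose(1, 2).reshape(B, C, H // self.shrink, W // self.shrink)
        return x + self.up(small)                                   # fuse back at full resolution

print(HourglassGlobalBlock(64)(torch.randn(1, 64, 14, 14)).shape)   # torch.Size([1, 64, 14, 14])
```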

11/20/2022

We have released UniFormerV2, which arms pre-trained ViTs with efficient UniFormer designs. It saves a lot of training resources and achieves powerful performance on 8 popular benchmarks. Please have a try! 🎉🎉

10/26/2022

We have provided the code for video visualizations, please see video_classification/vis.

05/24/2022

  1. Some bugs in video recognition have been fixed in Nightcrawler. We have successfully adapted UniFormer for extremely dark video classification! 🎉🎉
  2. More demos for Detection and Segmentation are provided. 👍😄

03/06/2022

Some models with head_dim=64 have been released; they reduce memory cost for downstream tasks.
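As rough intuition for why this helps: with the channel width fixed, a larger head_dim means fewer heads, and the per-head NxN attention maps dominate activation memory at the large token counts used in detection and segmentation. A back-of-the-envelope sketch, where the 320-channel width, the stride-16 token count, and the head_dim=32 comparison point are illustrative assumptions:

```python
# Attention-map activations for one global block: heads * N^2 values.
def attn_map_bytes(channels, head_dim, n_tokens, bytes_per_val=4):
    heads = channels // head_dim
    return heads * n_tokens ** 2 * bytes_per_val

n_tokens = 50 * 80                 # e.g. an 800x1280 detection input at stride 16
for head_dim in (32, 64):
    gb = attn_map_bytes(channels=320, head_dim=head_dim, n_tokens=n_tokens) / 1e9
    print(f"head_dim={head_dim}: ~{gb:.2f} GB of attention maps per block (fp32)")
```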

02/09/2022

Some popular models and demos have been updated on Hugging Face.

02/03/2022

Integrated into Hugging Face Spaces using Gradio. Have fun!

01/21/2022

UniFormer for video has been accepted by ICLR 2022 (8868, top 3%)!

01/19/2022

  1. Pretrained models on ImageNet-1K with Token Labeling.
  2. Large resolution fine-tuning.

01/18/2022

  1. Code and models for COCO object detection.
  2. Code and models for ADE20K semantic segmentation.
  3. Code and models for COCO pose estimation.

01/13/2022

  1. Pretrained models on ImageNet-1K, Kinetics-400, Kinetics-600, Something-Something V1&V2.

  2. Code and models for image classification and video classification.

📖 Introduction

UniFormer (Unified transFormer) is introduced in our arXiv papers (arXiv:2201.09450 for visual recognition and arXiv:2201.04676 for spatiotemporal representation learning). It seamlessly integrates the merits of convolution and self-attention in a concise transformer format: we adopt local MHRA (Multi-Head Relation Aggregator) in shallow layers to largely reduce the computation burden, and global MHRA in deep layers to learn global token relations.
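A minimal PyTorch sketch of this layout, with a depthwise convolution standing in for local MHRA and standard self-attention for global MHRA (a simplified illustration under those assumptions, not the released blocks, which also include dynamic position encoding and FFNs):

```python
import torch
import torch.nn as nn

class LocalMHRA(nn.Module):
    """Shallow-stage block: cheap local token aggregation via a depthwise conv."""
    def __init__(self, dim, kernel=5):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)

    def forward(self, x):                       # x: (B, C, H, W)
        return x + self.dw(x)

class GlobalMHRA(nn.Module):
    """Deep-stage block: global token aggregation via self-attention."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, N, C) tokens
        q = self.norm(t)
        t = t + self.attn(q, q, q)[0]
        return t.transpose(1, 2).reshape(B, C, H, W)

# Local relation in shallow stages, global relation in deep stages.
net = nn.Sequential(LocalMHRA(64), LocalMHRA(64), GlobalMHRA(64), GlobalMHRA(64))
print(net(torch.randn(1, 64, 14, 14)).shape)    # torch.Size([1, 64, 14, 14])
```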

Without any extra training data, our UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it can simply achieve state-of-the-art performance in a broad range of downstream tasks. Our UniFormer obtains 82.9/84.8 top-1 accuracy on Kinetics-400/600, and 60.9/71.2 top-1 accuracy on Something-Something V1/V2 video classification tasks. It also achieves 53.8 box AP and 46.4 mask AP on COCO object detection task, 50.8 mIoU on ADE20K semantic segmentation task, and 77.4 AP on COCO pose estimation task. Moreover, we build an efficient UniFormer with a concise hourglass design of token shrinking and recovering, which achieves 2-4× higher throughput than the recent lightweight models.

<div align=center> <h3> General Framework </h3> </div> <div align="center"> <img src="figures/framework.png" width="80%"> </div> <div align=center> <h3> Efficient Framework </h3> </div> <div align="center"> <img src="figures/efficient_uniformer.png" width="80%"> </div> <div align=center> <h3> Different Downstream Tasks </h3> </div> <div align="center"> <img src="figures/dense_adaption.jpg" width="100%"> </div>

Main results on ImageNet-1K

Please see image_classification for more details.

More models with large resolution and token labeling will be released soon.

| Model | Pretrain | Resolution | Top-1 | #Param. | FLOPs |
| --- | --- | --- | --- | --- | --- |
| UniFormer-XXS | ImageNet-1K | 128x128 | 76.8 | 10.2M | 0.43G |
| UniFormer-XXS | ImageNet-1K | 160x160 | 79.1 | 10.2M | 0.67G |
| UniFormer-XXS | ImageNet-1K | 192x192 | 79.9 | 10.2M | 0.96G |
| UniFormer-XXS | ImageNet-1K | 224x224 | 80.6 | 10.2M | 1.3G |
| UniFormer-XS | ImageNet-1K | 192x192 | 81.5 | 16.5M | 1.4G |
| UniFormer-XS | ImageNet-1K | 224x224 | 82.0 | 16.5M | 2.0G |
| UniFormer-S | ImageNet-1K | 224x224 | 82.9 | 22M | 3.6G |
| UniFormer-S† | ImageNet-1K | 224x224 | 83.4 | 24M | 4.2G |
| UniFormer-B | ImageNet-1K | 224x224 | 83.9 | 50M | 8.3G |
| UniFormer-S+TL | ImageNet-1K | 224x224 | 83.4 | 22M | 3.6G |
| UniFormer-S†+TL | ImageNet-1K | 224x224 | 83.9 | 24M | 4.2G |
| UniFormer-B+TL | ImageNet-1K | 224x224 | 85.1 | 50M | 8.3G |
| UniFormer-L+TL | ImageNet-1K | 224x224 | 85.6 | 100M | 12.6G |
| UniFormer-S+TL | ImageNet-1K | 384x384 | 84.6 | 22M | 11.9G |
| UniFormer-S†+TL | ImageNet-1K | 384x384 | 84.9 | 24M | 13.7G |
| UniFormer-B+TL | ImageNet-1K | 384x384 | 86.0 | 50M | 27.2G |
| UniFormer-L+TL | ImageNet-1K | 384x384 | 86.3 | 100M | 39.2G |

Main results on Kinetics video classification

Please see video_classification for more details.

| Model | Pretrain | #Frame | Sampling Stride | FLOPs | K400 Top-1 | K600 Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| UniFormer-S | ImageNet-1K | 16x1x4 | 4 | 167G | 80.8 | 82.8 |
| UniFormer-S | ImageNet-1K | 16x1x4 | 8 | 167G | 80.8 | 82.7 |
| UniFormer-S | ImageNet-1K | 32x1x4 | 4 | 438G | 82.0 | - |
| UniFormer-B | ImageNet-1K | 16x1x4 | 4 | 387G | 82.0 | 84.0 |
| UniFormer-B | ImageNet-1K | 16x1x4 | 8 | 387G | 81.7 | 83.4 |
| UniFormer-B | ImageNet-1K | 32x1x4 | 4 | 1036G | 82.9 | 84.5* |

| Model | Pretrain | #Frame | Resolution | FLOPs | K400 Top-1 |
| --- | --- | --- | --- | --- | --- |
| UniFormer-XXS | ImageNet-1K | 4x1x1 | 128 | 1.0G | 63.2 |
| UniFormer-XXS | ImageNet-1K | 4x1x1 | 160 | 1.6G | 65.8 |
| UniFormer-XXS | ImageNet-1K | 8x1x1 | 128 | 2.0G | 68.3 |
| UniFormer-XXS | ImageNet-1K | 8x1x1 | 160 | 3.3G | 71.4 |
| UniFormer-XXS | ImageNet-1K | 16x1x1 | 128 | 4.2G | 73.3 |
| UniFormer-XXS | ImageNet-1K | 16x1x1 | 160 | 6.9G | 75.1 |
| UniFormer-XXS | ImageNet-1K | 32x1x1 | 160 | 15.4G | 77.9 |
| UniFormer-XS | ImageNet-1K | 32x1x1 | 192 | 34.2G | 78.6 |

#Frame = #input_frame x #crop x #clip (e.g., 16x1x4 denotes 16 input frames, 1 spatial crop, and 4 temporal clips).

* Since Kinetics-600 is too large to train on a single node (over a month with 8 A100 GPUs), we provide a model trained on multiple nodes (around 2 weeks with 32 V100 GPUs). The result is lower because the hyperparameters were not tuned.

* For UniFormer-XS and UniFormer-XXS, we use sparse sampling.
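In this notation, the crop and clip counts describe multi-view testing: each video is sampled into several clips (and optionally spatial crops), and the per-view predictions are averaged. A generic sketch of that aggregation, with a dummy classifier standing in for UniFormer (not the repo's exact evaluation code):

```python
import torch

@torch.no_grad()
def multi_view_predict(model, views):
    """Average softmax scores over all (crop, clip) views of one video."""
    scores = [model(v.unsqueeze(0)).softmax(dim=-1) for v in views]
    return torch.stack(scores).mean(dim=0)          # (1, num_classes)

# Toy usage: 4 clips x 1 crop of a 16-frame 160x160 video, dummy 400-way classifier.
dummy = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool3d(1), torch.nn.Flatten(), torch.nn.Linear(3, 400)
)
views = [torch.randn(3, 16, 160, 160) for _ in range(4)]
print(multi_view_predict(dummy, views).shape)       # torch.Size([1, 400])
```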

Main results on Something-Something video classification

Please see video_classification for more details.

| Model | Pretrain | #Frame | FLOPs | SSV1 Top-1 | SSV2 Top-1 |
| --- | --- | --- | --- | --- | --- |
| UniFormer-S | K400 | 16x3x1 | 125G | 57.2 | 67.7 |
| UniFormer-S | K600 | 16x3x1 | 125G | 57.6 | 69.4 |
| UniFormer-S | K400 | 32x3x1 | 329G | 58.8 | 69.0 |
| UniFormer-S | K600 | 32x3x1 | 329G | 59.9 | 70.4 |
| UniFormer-B | K400 | 16x3x1 | 290G | 59.1 | 70.4 |
| UniFormer-B | K600 | 16x3x1 | 290G | 58.8 | 70.2 |
| UniFormer-B | K400 | 32x3x1 | 777G | 60.9 | 71.1 |
| UniFormer-B | K600 | 32x3x1 | 777G | 61.0 | 71.2 |

#Frame = #input_frame x #crop x #clip

Main results on UCF101 and HMDB51 video classification

Please see video_classification for more details.

| Model | Pretrain | #Frame | Sampling Stride | FLOPs | UCF101 Top-1 | HMDB51 Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| UniFormer-S | K400 | 16x3x5 | 4 | 625G | 98.3 | 77.5 |

#Frame = #input_frame x #crop x #clip

* We only report results on the first split. For the results in our paper, we train and evaluate the model on all 3 training/validation splits and average the results.

Main results on COCO object detection

Please see object_detection for more details.

Mask R-CNN

| Backbone | Lr Schd | box mAP | mask mAP | #params | FLOPs |
| --- | --- | --- | --- | --- | --- |
| UniFormer-XXS | 1x | 42.8 | 39.2 | 29.4M | - |
| UniFormer-XS | 1x | 44.6 | 40.9 | 35.6M | - |
| UniFormer-S<sub>h14</sub> | 1x | 45.6 | 41.6 | 41M | 269G |
| UniFormer-S<sub>h14</sub> | 3x+MS | 48.2 | 43.4 | 41M | 269G |
| UniFormer-B<sub>h14</sub> | 1x | 47.4 | 43.1 | 69M | 399G |
| UniFormer-B<sub>h14</sub> | 3x+MS | 50.3 | 44.8 | 69M | 399G |

* The FLOPs are measured at resolution 800×1280.

Cascade Mask R-CNN

| Backbone | Lr Schd | box mAP | mask mAP | #params | FLOPs |
| --- | --- | --- | --- | --- | --- |
| UniFormer-S<sub>h14</sub> | 3x+MS | 52.1 | 45.2 | 79M | 747G |
| UniFormer-B<sub>h14</sub> | 3x+MS | 53.8 | 46.4 | 107M | 878G |

* The FLOPs are measured at resolution 800×1280.

Main results on ADE20K semantic segmentation

Please see semantic_segmentation for more details.

Semantic FPN

| Backbone | Lr Schd | mIoU | #params | FLOPs |
| --- | --- | --- | --- | --- |
| UniFormer-XXS | 80K | 42.3 | 13.5M | - |
| UniFormer-XS | 80K | 44.4 | 19.7M | - |
| UniFormer-S<sub>h14</sub> | 80K | 46.3 | 25M | 172G |
| UniFormer-B<sub>h14</sub> | 80K | 47.0 | 54M | 328G |
| UniFormer-S<sub>w32</sub> | 80K | 45.6 | 25M | 183G |
| UniFormer-S<sub>h32</sub> | 80K | 46.2 | 25M | 199G |
| UniFormer-S | 80K | 46.6 | 25M | 247G |
| UniFormer-B<sub>w32</sub> | 80K | 47.0 | 54M | 310G |
| UniFormer-B<sub>h32</sub> | 80K | 47.7 | 54M | 350G |
| UniFormer-B | 80K | 48.0 | 54M | 471G |

* The FLOPs are measured at resolution 512×2048.

UperNet

| Backbone | Lr Schd | mIoU | MS mIoU | #params | FLOPs |
| --- | --- | --- | --- | --- | --- |
| UniFormer-S<sub>h14</sub> | 160K | 46.9 | 48.0 | 52M | 947G |
| UniFormer-B<sub>h14</sub> | 160K | 48.9 | 50.0 | 80M | 1085G |
| UniFormer-S<sub>w32</sub> | 160K | 46.6 | 48.4 | 52M | 939G |
| UniFormer-S<sub>h32</sub> | 160K | 47.0 | 48.5 | 52M | 955G |
| UniFormer-S | 160K | 47.6 | 48.5 | 52M | 1004G |
| UniFormer-B<sub>w32</sub> | 160K | 49.1 | 50.6 | 80M | 1066G |
| UniFormer-B<sub>h32</sub> | 160K | 49.5 | 50.7 | 80M | 1106G |
| UniFormer-B | 160K | 50.0 | 50.8 | 80M | 1227G |

* The FLOPs are measured at resolution 512×2048.

Main results on COCO pose estimation

Please see pose_estimation for more details.

Top-Down

| Backbone | Input Size | AP | AP<sup>50</sup> | AP<sup>75</sup> | AR<sup>M</sup> | AR<sup>L</sup> | AR | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UniFormer-S | 256x192 | 74.0 | 90.3 | 82.2 | 66.8 | 76.7 | 79.5 | 4.7G |
| UniFormer-S | 384x288 | 75.9 | 90.6 | 83.4 | 68.6 | 79.0 | 81.4 | 11.1G |
| UniFormer-S | 448x320 | 76.2 | 90.6 | 83.2 | 68.6 | 79.4 | 81.4 | 14.8G |
| UniFormer-B | 256x192 | 75.0 | 90.6 | 83.0 | 67.8 | 77.7 | 80.4 | 9.2G |
| UniFormer-B | 384x288 | 76.7 | 90.8 | 84.0 | 69.3 | 79.7 | 81.4 | 14.8G |
| UniFormer-B | 448x320 | 77.4 | 91.1 | 84.4 | 70.2 | 80.6 | 82.5 | 29.6G |

⭐ Cite UniFormer

If you find this repository useful, please give it a star and use the following BibTeX entries for citation.

@misc{li2022uniformer_image,
      title={UniFormer: Unifying Convolution and Self-attention for Visual Recognition},
      author={Kunchang Li and Yali Wang and Junhao Zhang and Peng Gao and Guanglu Song and Yu Liu and Hongsheng Li and Yu Qiao},
      year={2022},
      eprint={2201.09450},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{li2022uniformer_video,
      title={UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning},
      author={Kunchang Li and Yali Wang and Peng Gao and Guanglu Song and Yu Liu and Hongsheng Li and Yu Qiao},
      year={2022},
      eprint={2201.04676},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

This project is released under the MIT license. Please see the LICENSE file for more information.

Contributors and Contact Information

UniFormer is maintained by Kunchang Li.

For help or issues using UniFormer, please submit a GitHub issue.

For other communications related to UniFormer, please contact Kunchang Li (kc.li@siat.ac.cn).