
HRFormer: High-Resolution Transformer for Dense Prediction, NeurIPS 2021

<img src='HRFormer-20-fps.gif' align="center" width=1024>

Introduction

This is the official implementation of the High-Resolution Transformer (HRFormer). HRFormer learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer, which produces low-resolution representations at high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet), together with local-window self-attention that performs self-attention over small non-overlapping image windows, to improve memory and computation efficiency. In addition, we introduce a convolution into the FFN to exchange information across the disconnected image windows. We demonstrate the effectiveness of HRFormer on human pose estimation and semantic segmentation tasks.

teaser
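
For concreteness, here is a minimal PyTorch sketch of the basic block described above: multi-head self-attention inside small non-overlapping windows, followed by an FFN whose two pointwise layers sandwich a 3x3 depth-wise convolution that exchanges information across window borders. Module names, sizes, and the omission of layer norms are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn


class LocalWindowAttention(nn.Module):
    """Multi-head self-attention over non-overlapping `window` x `window` patches."""

    def __init__(self, dim, num_heads=2, window=7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W); H, W assumed divisible by `window`
        B, C, H, W = x.shape
        w = self.window
        # Partition the feature map into (B * num_windows, w * w, C) token sets.
        t = x.view(B, C, H // w, w, W // w, w)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        t = t + self.attn(t, t, t, need_weights=False)[0]  # residual; norms omitted
        # Undo the window partition back to (B, C, H, W).
        t = t.view(B, H // w, W // w, w, w, C)
        return t.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)


class ConvFFN(nn.Module):
    """FFN with a 3x3 depth-wise conv so disconnected windows can communicate."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # depth-wise 3x3
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        return x + self.net(x)


# Smoke test on one HRFormer-like stream.
x = torch.randn(1, 32, 56, 56)
block = nn.Sequential(LocalWindowAttention(32, num_heads=2, window=7), ConvFFN(32))
print(block(x).shape)  # torch.Size([1, 32, 56, 56])
```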


Pose estimation

2D Human Pose Estimation

Results on COCO val2017, using a detector with a human AP of 56.4 on the COCO val2017 dataset.

| Backbone | Input Size | AP | AP<sup>50</sup> | AP<sup>75</sup> | AR<sup>M</sup> | AR<sup>L</sup> | AR | ckpt | log | script |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HRFormer-S | 256x192 | 74.0% | 90.2% | 81.2% | 70.4% | 80.7% | 79.4% | ckpt | log | script |
| HRFormer-S | 384x288 | 75.6% | 90.3% | 82.2% | 71.6% | 82.5% | 80.7% | ckpt | log | script |
| HRFormer-B | 256x192 | 75.6% | 90.8% | 82.8% | 71.7% | 82.6% | 80.8% | ckpt | log | script |
| HRFormer-B | 384x288 | 77.2% | 91.0% | 83.6% | 73.2% | 84.2% | 82.0% | ckpt | log | script |

Results on COCO test-dev, using a detector with a human AP of 56.4 on the COCO val2017 dataset.

| Backbone | Input Size | AP | AP<sup>50</sup> | AP<sup>75</sup> | AR<sup>M</sup> | AR<sup>L</sup> | AR | ckpt | log | script |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HRFormer-S | 384x288 | 74.5% | 92.3% | 82.1% | 70.7% | 80.6% | 79.8% | ckpt | log | script |
| HRFormer-B | 384x288 | 76.2% | 92.7% | 83.8% | 72.5% | 82.3% | 81.2% | ckpt | log | script |

The models are first pre-trained on the ImageNet-1K dataset and then fine-tuned on the COCO train2017 dataset.
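
These results follow the standard top-down protocol: the detector mentioned above proposes person boxes, each box is cropped and resized to the network input size, HRFormer predicts one heatmap per keypoint, and keypoints are decoded from the heatmap peaks. Below is a minimal sketch of that decoding step; the helper name and the simple argmax decoding are assumptions for illustration, not this repository's exact code.

```python
import torch


def keypoints_from_heatmaps(heatmaps, boxes):
    """heatmaps: (N, K, h, w) per-person keypoint heatmaps;
    boxes: (N, 4) person boxes as (x0, y0, x1, y1) in image coordinates."""
    N, K, h, w = heatmaps.shape
    flat = heatmaps.view(N, K, -1)
    idx = flat.argmax(dim=-1)                         # hottest pixel per keypoint
    ys = torch.div(idx, w, rounding_mode="floor").float()
    xs = (idx % w).float()
    x0, y0, x1, y1 = boxes.unbind(dim=-1)
    # Map heatmap coordinates back into the original image, box by box.
    xs = x0[:, None] + xs / w * (x1 - x0)[:, None]
    ys = y0[:, None] + ys / h * (y1 - y0)[:, None]
    scores = flat.max(dim=-1).values                  # peak value as confidence
    return torch.stack([xs, ys, scores], dim=-1)      # (N, K, 3)


# Dummy usage: 17 COCO keypoints, 64x48 heatmaps from 256x192 input crops.
heatmaps = torch.rand(2, 17, 64, 48)
boxes = torch.tensor([[10., 20., 110., 220.], [50., 60., 150., 260.]])
print(keypoints_from_heatmaps(heatmaps, boxes).shape)  # torch.Size([2, 17, 3])
```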

Semantic segmentation

Cityscapes

Performance on the Cityscapes dataset. The models are trained with an input size of 512x1024 and tested with an input size of 1024x2048.

| Methods | Backbone | Window Size | Train Set | Test Set | Iterations | Batch Size | OHEM | mIoU | mIoU (Multi-Scale) | Log | ckpt | script |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OCRNet | HRFormer-S | 7x7 | Train | Val | 80000 | 8 | Yes | 80.0 | 81.0 | log | ckpt | script |
| OCRNet | HRFormer-B | 7x7 | Train | Val | 80000 | 8 | Yes | 81.4 | 82.0 | log | ckpt | script |
| OCRNet | HRFormer-B | 15x15 | Train | Val | 80000 | 8 | Yes | 81.9 | 82.6 | log | ckpt | script |
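
The "mIoU (Multi-Scale)" column here and in the tables below refers to standard multi-scale and flip testing: logits are computed at several input scales, resized back to the full resolution, and averaged. A minimal sketch, assuming any segmentation `model` that returns per-class logits and an illustrative set of scales:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def multi_scale_logits(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5), flip=True):
    """Average class logits over several input scales (and horizontal flips)."""
    B, _, H, W = image.shape
    total = 0.0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        out = model(x)
        total = total + F.interpolate(out, size=(H, W), mode="bilinear", align_corners=False)
        if flip:
            out = model(torch.flip(x, dims=[3]))  # mirror the input...
            out = torch.flip(out, dims=[3])       # ...then un-mirror the logits
            total = total + F.interpolate(out, size=(H, W), mode="bilinear", align_corners=False)
    return total / (len(scales) * (2 if flip else 1))


# Dummy usage: a 1x1 conv stands in for OCRNet + HRFormer (19 Cityscapes classes).
model = torch.nn.Conv2d(3, 19, 1)
pred = multi_scale_logits(model, torch.randn(1, 3, 64, 128)).argmax(dim=1)
```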

PASCAL-Context

The models are trained with an input size of 520x520 and tested at the original image size.

| Methods | Backbone | Window Size | Train Set | Test Set | Iterations | Batch Size | OHEM | mIoU | mIoU (Multi-Scale) | Log | ckpt | script |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OCRNet | HRFormer-S | 7x7 | Train | Val | 60000 | 16 | Yes | 53.8 | 54.6 | log | ckpt | script |
| OCRNet | HRFormer-B | 7x7 | Train | Val | 60000 | 16 | Yes | 56.3 | 57.1 | log | ckpt | script |
| OCRNet | HRFormer-B | 15x15 | Train | Val | 60000 | 16 | Yes | 57.6 | 58.5 | log | ckpt | script |
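
The OHEM column marks pixel-wise online hard example mining: the cross-entropy loss is averaged only over "hard" pixels, those whose predicted probability for the ground-truth class falls below a threshold, while always keeping at least a minimum number of pixels. A minimal sketch with assumed defaults (the 0.7 threshold and `min_kept` value are common choices, not necessarily this repository's):

```python
import torch
import torch.nn.functional as F


def ohem_cross_entropy(logits, target, thresh=0.7, min_kept=100000, ignore_index=255):
    """logits: (B, C, H, W); target: (B, H, W) with `ignore_index` for void pixels."""
    loss = F.cross_entropy(logits, target, ignore_index=ignore_index,
                           reduction="none").view(-1)  # one loss value per pixel
    with torch.no_grad():
        prob = F.softmax(logits, dim=1)
        # Probability the model assigns to the ground-truth class at each pixel.
        gt = target.masked_fill(target == ignore_index, 0).unsqueeze(1)
        gt_prob = prob.gather(1, gt).view(-1)
        valid = target.view(-1) != ignore_index
        hard = valid & (gt_prob < thresh)
        if hard.sum() < min_kept:                   # too few hard pixels: fall back
            k = min(min_kept, int(valid.sum()))     # to the k highest-loss valid ones
            hard = torch.zeros_like(hard)
            hard[loss.masked_fill(~valid, 0.).topk(k).indices] = True
    kept = loss[hard]
    return kept.mean() if kept.numel() > 0 else logits.sum() * 0.


# Dummy usage on Cityscapes-like shapes (19 classes).
logits = torch.randn(2, 19, 64, 64, requires_grad=True)
target = torch.randint(0, 19, (2, 64, 64))
print(ohem_cross_entropy(logits, target, min_kept=1000).item())
```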

COCO-Stuff

The models are trained with an input size of 520x520 and tested at the original image size.

| Methods | Backbone | Window Size | Train Set | Test Set | Iterations | Batch Size | OHEM | mIoU | mIoU (Multi-Scale) | Log | ckpt | script |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OCRNet | HRFormer-S | 7x7 | Train | Val | 60000 | 16 | Yes | 37.9 | 38.9 | log | ckpt | script |
| OCRNet | HRFormer-B | 7x7 | Train | Val | 60000 | 16 | Yes | 41.6 | 42.5 | log | ckpt | script |
| OCRNet | HRFormer-B | 15x15 | Train | Val | 60000 | 16 | Yes | 42.4 | 43.3 | log | ckpt | script |

ADE20K

The models are trained with an input size of 520x520 and tested at the original image size. The results with window size 15x15 will be updated later.

| Methods | Backbone | Window Size | Train Set | Test Set | Iterations | Batch Size | OHEM | mIoU | mIoU (Multi-Scale) | Log | ckpt | script |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OCRNet | HRFormer-S | 7x7 | Train | Val | 150000 | 8 | Yes | 44.0 | 45.1 | log | ckpt | script |
| OCRNet | HRFormer-B | 7x7 | Train | Val | 150000 | 8 | Yes | 46.3 | 47.6 | log | ckpt | script |
| OCRNet | HRFormer-B | 13x13 | Train | Val | 150000 | 8 | Yes | 48.7 | 50.0 | log | ckpt | script |
| OCRNet | HRFormer-B | 15x15 | Train | Val | 150000 | 8 | Yes | - | - | - | - | - |

Classification

Results on ImageNet-1K

| Backbone | acc@1 | acc@5 | #params | FLOPs | ckpt | log | script |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HRFormer-T | 78.6% | 94.2% | 8.0M | 1.83G | ckpt | log | script |
| HRFormer-S | 81.2% | 95.6% | 13.5M | 3.56G | ckpt | log | script |
| HRFormer-B | 82.8% | 96.3% | 50.3M | 13.71G | ckpt | log | script |
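
For reference, acc@1 and acc@5 count a prediction as correct when the ground-truth label appears among the model's top-1 or top-5 scoring classes. A minimal sketch of this bookkeeping:

```python
import torch


def topk_accuracy(logits, target, ks=(1, 5)):
    """logits: (N, num_classes); target: (N,) integer class labels."""
    top = logits.topk(max(ks), dim=1).indices    # (N, max_k) predicted classes
    correct = top.eq(target.unsqueeze(1))        # (N, max_k) hit mask
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}


# Dummy usage with ImageNet-1K-sized outputs.
logits = torch.randn(8, 1000)
target = torch.randint(0, 1000, (8,))
print(topk_accuracy(logits, target))  # e.g. {1: 0.125, 5: 0.25}
```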

Citation

If you find this project useful in your research, please consider citing:

```bibtex
@inproceedings{YuanFHLZCW21,
  title={HRFormer: High-Resolution Transformer for Dense Prediction},
  author={Yuhui Yuan and Rao Fu and Lang Huang and Weihong Lin and Chao Zhang and Xilin Chen and Jingdong Wang},
  booktitle={NeurIPS},
  year={2021}
}
```

Acknowledgment

This project is developed based on Swin-Transformer, openseg.pytorch, and mmpose.

Development note: the `pose` sub-repository was imported with `git subtree` along these lines (the placeholder URL and branch are left as in the original):

```bash
git diff-index HEAD   # check that the working tree is clean first
git subtree add -P pose <url to sub-repo> <sub-repo branch>
```