HRViT

This repo is the official implementation of "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation".

Introduction

HRViT, introduced on arXiv, is a new vision transformer backbone design for semantic segmentation. It has a multi-branch high-resolution (HR) architecture with enhanced multi-scale representability. We balance the performance and efficiency of HRViT with various branch-block co-optimization techniques. Specifically, we explore heterogeneous branch designs, reduce the redundancy in linear layers, and augment the attention block with enhanced expressiveness.

HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes, surpassing state-of-the-art MiT and CSWin backbones with an average of +1.78 mIoU improvement, 28% parameter saving, and 21% FLOPs reduction.

Main Results on ImageNet

model	pretrain	resolution	acc@1	#params	FLOPs
HRViT-b1	ImageNet-1K	224x224	80.5	19.7M	2.7G
HRViT-b2	ImageNet-1K	224x224	82.3	32.5M	5.1G
HRViT-b3	ImageNet-1K	224x224	82.8	37.9M	5.7G

Main Results on Semantic Segmentation

ADE20K Semantic Segmentation (val)

Backbone	Method	pretrain	Crop Size	Lr Schd	mIoU	#Params	FLOPs
HRViT-b1	Segformer	ImageNet-1K	512x512	160K	45.88	8.2M	14.6G
HRViT-b2	Segformer	ImageNet-1K	512x512	160K	48.76	20.8M	28.0G
HRViT-b3	Segformer	ImageNet-1K	512x512	160K	50.20	28.7M	67.9G
HRViT-b1	UperNet	ImageNet-1K	512x512	160K	47.19	35.9M	219G
HRViT-b2	UperNet	ImageNet-1K	512x512	160K	49.10	49.7M	233G
HRViT-b3	UperNet	ImageNet-1K	512x512	160K	50.04	55.4M	236G

Cityscapes Semantic Segmentation (val)

Backbone	Method	pretrain	Crop Size	Lr Schd	mIoU	#Params	FLOPs
HRViT-b1	Segformer	ImageNet-1K	512x512	160K	81.63	8.1M	14.1G
HRViT-b2	Segformer	ImageNet-1K	512x512	160K	82.81	20.8M	27.4G
HRViT-b3	Segformer	ImageNet-1K	512x512	160K	83.16	28.6M	66.8G

Training code can be found in segmentation.

Requirements

HRViT requires timm==0.3.4, pytorch>=1.4, opencv, ... To install the requirements, run:

bash install_req.sh

Data preparation: arrange ImageNet-1K in the following folder structure. You can extract ImageNet with this script.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
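As a quick sanity check on the layout above, a minimal Python sketch could count the class folders and .JPEG files in each split. This is an illustration only, not a script shipped with this repo, and the function name `summarize_imagenet` is hypothetical.

```python
from pathlib import Path


def summarize_imagenet(root):
    """Count class folders and .JPEG files under <root>/train and <root>/val.

    Hypothetical helper for illustration; it only inspects the directory
    layout shown above and does not open or validate any image.
    """
    summary = {}
    for split in ("train", "val"):
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        # Each class is one sub-folder named by its WordNet ID, e.g. n01440764.
        class_dirs = sorted(d for d in split_dir.iterdir() if d.is_dir())
        n_images = sum(
            1
            for d in class_dirs
            for f in d.iterdir()
            if f.is_file() and f.suffix.upper() == ".JPEG"
        )
        summary[split] = {"classes": len(class_dirs), "images": n_images}
    return summary
```

Calling `summarize_imagenet("imagenet")` returns a dict with one entry per split. For a complete ImageNet-1K copy, both splits should report 1000 classes.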

Train

We provide three variants for training: HRViT-b1, HRViT-b2, and HRViT-b3. Training uses 4 nodes (machines) with 8 GPUs per node. On each machine, with NODE_RANK={0,1,2,3}, run the following command to train MODEL={HRViT_b1_224, HRViT_b2_224, HRViT_b3_224}:

bash train.sh 4 8 <NODE_RANK> --data <data path> --model <MODEL> -b 32 --lr 1e-3 --weight-decay .05 --amp --img-size 224 --warmup-epochs 20 --drop-path 0.1 --head-drop 0.1 --clip-grad 1 --sync-bn
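To make the per-node launch concrete, the sketch below prints the command for each NODE_RANK together with the resulting global batch size. It is an illustration rather than part of this repo: the helper names and the data path are placeholders, and it assumes that `-b` is the per-GPU batch size, as is common in timm-style training scripts.

```python
import shlex

NUM_NODES = 4       # nodes/machines, from the command above
GPUS_PER_NODE = 8   # GPUs per node, from the command above
BATCH_PER_GPU = 32  # value of -b; ASSUMED to be the per-GPU batch size


def build_train_command(node_rank, data_path, model):
    """Return the train.sh command line for one node (hypothetical helper)."""
    if not 0 <= node_rank < NUM_NODES:
        raise ValueError(f"node_rank must be in [0, {NUM_NODES - 1}]")
    args = [
        "bash", "train.sh", str(NUM_NODES), str(GPUS_PER_NODE), str(node_rank),
        "--data", data_path, "--model", model,
        "-b", str(BATCH_PER_GPU), "--lr", "1e-3", "--weight-decay", ".05",
        "--amp", "--img-size", "224", "--warmup-epochs", "20",
        "--drop-path", "0.1", "--head-drop", "0.1",
        "--clip-grad", "1", "--sync-bn",
    ]
    # Quote each argument so paths containing spaces survive the shell.
    return " ".join(shlex.quote(a) for a in args)


def global_batch_size():
    """Samples per optimizer step across all GPUs, under the assumption above."""
    return NUM_NODES * GPUS_PER_NODE * BATCH_PER_GPU


for rank in range(NUM_NODES):
    # "/path/to/imagenet" is a placeholder data path.
    print(build_train_command(rank, "/path/to/imagenet", "HRViT_b1_224"))
print("global batch size:", global_batch_size())
```

Under the per-GPU assumption, the global batch size is 4 x 8 x 32 = 1024. If you train with fewer GPUs, the learning rate may need to be adjusted accordingly.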

If GPU memory is insufficient, enable gradient checkpointing with '--with-cp'.

Cite HRViT

@misc{gu2021hrvit,
      title={Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation},
      author={Jiaqi Gu and Hyoukjun Kwon and Dilin Wang and Wei Ye and Meng Li and Yu-Hsin Chen and Liangzhen Lai and Vikas Chandra and David Z. Pan},
      year={2021},
      eprint={2111.01236},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

This repository is built using the timm library, the DeiT repository, the Swin Transformer repository, the CSWin repository, the MMSegmentation repository, and the MMCV repository.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Meta Open Source Code of Conduct

Contact Information

For help or issues using HRViT, please submit a GitHub issue.

For other communications related to HRViT, please contact Hyoukjun Kwon (hyoukjunkwon@fb.com) or Dilin Wang (wdilin@fb.com).

License Information

The majority of HRViT is licensed under CC-BY-NC; however, portions of the project are available under separate license terms:

timm is licensed under the Apache-2.0 license
DeiT is licensed under the Apache-2.0 license
Swin Transformer is licensed under the MIT license
CSWin Transformer is licensed under the MIT license
MMSegmentation is licensed under the Apache-2.0 license
MMCV is licensed under the Apache-2.0 license