HRViT

This repo is the official implementation of "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation".

Introduction

HRViT, introduced in our arXiv paper, is a new vision transformer backbone designed for semantic segmentation. It has a multi-branch high-resolution (HR) architecture with enhanced multi-scale representability. We balance the performance and efficiency of HRViT through several branch-block co-optimization techniques: we explore heterogeneous branch designs, reduce the redundancy in linear layers, and augment the attention block with enhanced expressiveness.

HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes, surpassing state-of-the-art MiT and CSWin backbones with an average of +1.78 mIoU improvement, 28% parameter saving, and 21% FLOPs reduction.


Main Results on ImageNet

| model | pretrain | resolution | acc@1 | #params | FLOPs |
| --- | --- | --- | --- | --- | --- |
| HRViT-b1 | ImageNet-1K | 224x224 | 80.5 | 19.7M | 2.7G |
| HRViT-b2 | ImageNet-1K | 224x224 | 82.3 | 32.5M | 5.1G |
| HRViT-b3 | ImageNet-1K | 224x224 | 82.8 | 37.9M | 5.7G |

Main Results on Semantic Segmentation

ADE20K Semantic Segmentation (val)

| Backbone | Method | pretrain | Crop Size | Lr Schd | mIoU | #Params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HRViT-b1 | Segformer | ImageNet-1K | 512x512 | 160K | 45.88 | 8.2M | 14.6G |
| HRViT-b2 | Segformer | ImageNet-1K | 512x512 | 160K | 48.76 | 20.8M | 28.0G |
| HRViT-b3 | Segformer | ImageNet-1K | 512x512 | 160K | 50.20 | 28.7M | 67.9G |
| HRViT-b1 | UperNet | ImageNet-1K | 512x512 | 160K | 47.19 | 35.9M | 219G |
| HRViT-b2 | UperNet | ImageNet-1K | 512x512 | 160K | 49.10 | 49.7M | 233G |
| HRViT-b3 | UperNet | ImageNet-1K | 512x512 | 160K | 50.04 | 55.4M | 236G |

Cityscapes Semantic Segmentation (val)

| Backbone | Method | pretrain | Crop Size | Lr Schd | mIoU | #Params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HRViT-b1 | Segformer | ImageNet-1K | 512x512 | 160K | 81.63 | 8.1M | 14.1G |
| HRViT-b2 | Segformer | ImageNet-1K | 512x512 | 160K | 82.81 | 20.8M | 27.4G |
| HRViT-b3 | Segformer | ImageNet-1K | 512x512 | 160K | 83.16 | 28.6M | 66.8G |

Training code can be found in the segmentation folder.

Requirements

Install the dependencies (timm==0.3.4, pytorch>=1.4, opencv, ...) by running:

```bash
bash install_req.sh
```

Data preparation: download ImageNet-1K and arrange it in the following folder structure. You can extract ImageNet with this script.

```
│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
```
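Before launching a long training run, it can help to sanity-check that the dataset matches this layout. The sketch below is a minimal example (not part of this repo) that counts class folders and JPEG images in each split:

```python
from pathlib import Path

def summarize_imagenet(root: str) -> dict:
    """Count class folders and JPEG images in each split of an
    ImageNet-style directory: root/{train,val}/<wnid>/*.JPEG."""
    stats = {}
    for split in ("train", "val"):
        split_dir = Path(root) / split
        classes = [d for d in split_dir.iterdir() if d.is_dir()]
        n_images = sum(len(list(d.glob("*.JPEG"))) for d in classes)
        stats[split] = {"classes": len(classes), "images": n_images}
    return stats
```

For a complete ImageNet-1K extraction you would expect 1000 class folders in each split, with roughly 1.28M training images and 50,000 validation images.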

Train

We train three variants: HRViT-b1, HRViT-b2, and HRViT-b3, on 4 nodes/machines with 8 GPUs per node. On the machine with NODE_RANK={0,1,2,3}, run the following command to train MODEL={HRViT_b1_224, HRViT_b2_224, HRViT_b3_224}:

```bash
bash train.sh 4 8 <NODE_RANK> --data <data path> --model <MODEL> -b 32 --lr 1e-3 --weight-decay .05 --amp --img-size 224 --warmup-epochs 20 --drop-path 0.1 --head-drop 0.1 --clip-grad 1 --sync-bn
```

If GPU memory is insufficient, enable gradient checkpointing with '--with-cp'.
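With -b 32 images per GPU on 4 nodes of 8 GPUs, the effective global batch size is 4 × 8 × 32 = 1024 images per optimizer step. The helper below is an illustrative sketch (not part of the repo); the linear learning-rate scaling rule it applies is a common heuristic for adapting the lr to a different GPU count, not something this README prescribes:

```python
def global_batch_size(nodes: int, gpus_per_node: int, per_gpu_batch: int) -> int:
    """Effective batch size seen by the optimizer in data-parallel training."""
    return nodes * gpus_per_node * per_gpu_batch

def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear LR scaling rule: lr grows in proportion to the global batch."""
    return base_lr * new_batch / base_batch

# The README's setup: 4 nodes x 8 GPUs x 32 images -> 1024 images per step.
print(global_batch_size(4, 8, 32))  # 1024
# Halving to 2 nodes while keeping -b 32 would suggest lr = 5e-4.
print(scaled_lr(1e-3, 1024, global_batch_size(2, 8, 32)))  # 0.0005
```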

Cite HRViT

```bibtex
@misc{gu2021hrvit,
      title={Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation},
      author={Jiaqi Gu and Hyoukjun Kwon and Dilin Wang and Wei Ye and Meng Li and Yu-Hsin Chen and Liangzhen Lai and Vikas Chandra and David Z. Pan},
      year={2021},
      eprint={2111.01236},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

Acknowledgement

This repository is built using the timm library, the DeiT repository, the Swin Transformer repository, the CSWin repository, the MMSegmentation repository, and the MMCV repository.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Meta Open Source Code of Conduct

Contact Information

For help or issues using HRViT, please submit a GitHub issue.

For other communications related to HRViT, please contact Hyoukjun Kwon (hyoukjunkwon@fb.com), Dilin Wang (wdilin@fb.com).

License Information

The majority of HRViT is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: