RankSeg: Adaptive Pixel Classification with Image Category Ranking for Segmentation, ECCV 2022

https://user-images.githubusercontent.com/4639578/193441675-eecf93b0-a869-497b-81c4-ab0b3d210104.mp4

News

2023.02.11 We release the code and checkpoints of Segmenter + RankSeg.

2022.10.08 ⛽⛽⛽ [MSRA-VC-Group] is hiring research interns to push the frontier of cutting-edge object detection and segmentation technology. ⛽⛽⛽ Contact: yuhui.yuan@microsoft.com

2022.08.20 We release the code and checkpoints of Mask2Former + RankSeg.

2022.07.19 We rename MLSeg to RankSeg to highlight the importance of our rank-oriented design.

2022.07.04 MLSeg has been accepted by ECCV 2022.

Introduction

Segmentation has traditionally been formulated as a complete-label pixel classification task: predict a class for each pixel from a fixed set of predefined semantic categories shared by all images or videos. Under this formulation, standard architectures run into difficulties in more realistic settings where the number of categories scales up (e.g., beyond 1k). Yet in a typical image or video only a few categories, i.e., a small subset of the complete label set, are present. Motivated by this observation, we propose to decompose segmentation into two sub-problems: (i) image-level or video-level multi-label classification and (ii) pixel-level rank-adaptive selected-label classification. Given an input image or video, our framework first performs multi-label classification over the complete label set, then sorts the labels by class confidence score and selects a small subset. A rank-adaptive pixel classifier then performs pixel-wise classification over only the selected labels, using a set of rank-oriented learnable temperature parameters to adjust the pixel classification scores. The approach is conceptually general and can improve various existing segmentation frameworks by simply adding a lightweight multi-label classification head and a rank-adaptive pixel classifier. We demonstrate its effectiveness with competitive results across four tasks: image semantic segmentation, image panoptic segmentation, video instance segmentation, and video semantic segmentation. In particular, with RankSeg, Mask2Former gains +0.8%/+0.7%/+0.7% on the ADE20K panoptic segmentation, YouTubeVIS 2019 video instance segmentation, and VSPW video semantic segmentation benchmarks, respectively.
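The two-stage pipeline described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in this repository: the function names `select_top_k` and `rank_adaptive_classify`, the toy scores, the fixed temperature values (learnable in the real model), and the choice to divide each logit by its rank's temperature are all hypothetical.

```python
def select_top_k(class_scores, k):
    """Return the indices of the k highest-scoring classes, best first."""
    order = sorted(range(len(class_scores)),
                   key=lambda c: class_scores[c], reverse=True)
    return order[:k]


def rank_adaptive_classify(pixel_logits, class_scores, temperatures):
    """Classify each pixel over only the selected labels.

    pixel_logits: one list per pixel, holding a logit for every class
                  in the complete label set.
    class_scores: image-level multi-label confidences, one per class.
    temperatures: one temperature per rank (assumed fixed here; they are
                  learnable in the method). Its length sets k.
    """
    k = len(temperatures)
    selected = select_top_k(class_scores, k)
    predictions = []
    for logits in pixel_logits:
        # Assumption: the logit of the class at rank r is divided by the
        # rank-r temperature before taking the argmax.
        scaled = [logits[c] / temperatures[r] for r, c in enumerate(selected)]
        best_rank = max(range(k), key=lambda r: scaled[r])
        predictions.append(selected[best_rank])
    return selected, predictions


# Toy example: 5 classes in the complete label set, 3 pixels, k = 2.
class_scores = [0.1, 0.9, 0.05, 0.7, 0.2]
temperatures = [1.0, 2.0]
pixel_logits = [
    [0.0, 2.0, 5.0, 1.0, 0.0],  # class 2 scores highest but is not selected
    [0.0, 1.0, 0.0, 3.0, 0.0],
    [0.0, 1.0, 0.0, 1.5, 0.0],  # the rank-2 temperature flips this pixel
]
selected, predictions = rank_adaptive_classify(
    pixel_logits, class_scores, temperatures)
print(selected)      # [1, 3]
print(predictions)   # [1, 3, 1]
```

In the toy run, the first pixel cannot be assigned class 2 because that class was filtered out at the image level, and the third pixel switches from class 3 to class 1 only because of the larger temperature applied to the second-ranked class.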

Image Semantic & Image Panoptic & Video Semantic & Video Instance Segmentation based on Mask2Former + RankSeg

See the MODEL_ZOO for Mask2Former.

Image Semantic Segmentation based on DeepLabV3/Segmenter/Swin/BEiT + RankSeg

RankSeg + DeepLabV3

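| Method | Dataset | Backbone | Crop Size | Lr schd | mIoU | mIoU(ms+flip) | config | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepLabV3 (Official) | COCO-Stuff | R101 | 512x512 | 20000 | 37.3 | 38.4 | - | - |
| DeepLabV3 + RankSeg | COCO-Stuff | R101 | 512x512 | 20000 | 38.4 | 39.8 | - | - |
| DeepLabV3 (Official) | ADE20K | R101 | 512x512 | 80000 | 44.1 | 45.2 | - | - |
| DeepLabV3 + RankSeg | ADE20K | R101 | 512x512 | 80000 | 45.5 | 46.6 | - | - |
| DeepLabV3 | COCO+LVIS | R101 | 512x512 | 160000 | 11.0 | - | - | - |
| DeepLabV3 + RankSeg | COCO+LVIS | R101 | 512x512 | 160000 | 12.8 | - | - | - |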

RankSeg + Segmenter

| Method | Dataset | Backbone | Crop Size | Lr schd | mIoU | mIoU(ms+flip) |
| --- | --- | --- | --- | --- | --- | --- |
| Segmenter | COCO-Stuff | ViT-B | 512x512 | 40000 | 41.9 | 43.8 |
| Segmenter + RankSeg | COCO-Stuff | ViT-B | 512x512 | 40000 | 44.9 | 46.2 |
| Segmenter | COCO-Stuff | ViT-B | 512x512 | 80000 | 43.4 | 45.2 |
| Segmenter + RankSeg | COCO-Stuff | ViT-B | 512x512 | 80000 | 45.7 | 46.7 |
| Segmenter | COCO-Stuff | ViT-L | 640x640 | 40000 | 45.5 | 47.1 |
| Segmenter + RankSeg | COCO-Stuff | ViT-B | 640x640 | 40000 | 46.7 | 47.9 |
| Segmenter | Pascal-Context60 | ViT-B | 480x480 | 80000 | 53.8 | 54.6 |
| Segmenter + RankSeg | Pascal-Context60 | ViT-B | 480x480 | 80000 | 54.7 | 55.4 |
| Segmenter | ADE20K | ViT-B | 512x512 | 160000 | 48.8 | 50.7 |
| Segmenter + RankSeg | ADE20K | ViT-B | 512x512 | 160000 | 49.7 | 51.4 |
| Segmenter | ADE20K | ViT-L | 640x640 | 160000 | 52.0 | 53.6 |
| Segmenter + RankSeg | ADE20K | ViT-L | 640x640 | 160000 | 52.6 | 54.4 |
| Segmenter | ADE20KFull | ViT-B | 512x512 | 160000 | 17.8 | - |
| Segmenter + RankSeg | ADE20KFull | ViT-B | 512x512 | 160000 | 18.8 | - |
| Segmenter | COCO+LVIS | ViT-B | 512x512 | 320000 | 19.4 | - |
| Segmenter + RankSeg | COCO+LVIS | ViT-B | 512x512 | 320000 | 21.3 | - |
| Segmenter | COCO+LVIS | ViT-B | 640x640 | 320000 | 23.7 | - |
| Segmenter + RankSeg | COCO+LVIS | ViT-B | 640x640 | 320000 | 24.6 | - |

RankSeg + Swin

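| Method | Dataset | Backbone | Crop Size | Lr schd | mIoU | mIoU(ms+flip) | config | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin | COCO-Stuff | Swin-B | 512x512 | 40000 | 45.7 | 47.2 | - | - |
| Swin + RankSeg | COCO-Stuff | Swin-B | 512x512 | 40000 | 46.6 | 47.9 | - | - |
| Swin (Official) | ADE20K | Swin-B | 512x512 | 160000 | 50.8 | 52.4 | - | - |
| Swin + RankSeg | ADE20K | Swin-B | 512x512 | 160000 | 51.4 | 53.0 | - | - |
| Swin | COCO+LVIS | Swin-B | 512x512 | 160000 | 20.3 | - | - | - |
| Swin + RankSeg | COCO+LVIS | Swin-B | 512x512 | 160000 | 20.8 | - | - | - |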

RankSeg + BEiT

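| Method | Dataset | Backbone | Crop Size | Lr schd | mIoU | mIoU(ms+flip) | config | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEiT (Official) | ADE20K | BEiT-L | 640x640 | 160000 | 56.7 | 57.0 | - | - |
| RankSeg + BEiT | ADE20K | BEiT-L | 640x640 | 160000 | 57.0 | 57.8 | - | - |
| BEiT (Official) | COCO-Stuff | BEiT-L | 640x640 | 160000 | 49.7 | 49.9 | - | - |
| RankSeg + BEiT | COCO-Stuff | BEiT-L | 640x640 | 160000 | 49.9 | 50.3 | - | - |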

Image Semantic & Panoptic Segmentation based on MaskFormer + RankSeg

Semantic Segmentation

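| Method | Dataset | Backbone | Crop Size | Lr schd | mIoU | mIoU(ms+flip) | config | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MaskFormer | ADE20K | Swin-B | 512x512 | 160000 | 52.7 | 53.9 | - | - |
| MaskFormer + RankSeg | ADE20K | Swin-B | 512x512 | 160000 | 53.9 | 55.1 | - | - |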

Panoptic Segmentation

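| Method | Dataset | Backbone | Crop Size | Lr schd | PQ | PQ-th | PQ-st | RQ | RQ-th | RQ-st | SQ | SQ-th | SQ-st | config | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MaskFormer | ADE20K | R50 | 640x640 | 720000 | 34.7 | 32.2 | 39.7 | 42.8 | 40.1 | 48.1 | 76.7 | 76.9 | 76.3 | - | - |
| MaskFormer + RankSeg | ADE20K | R50 | 640x640 | 720000 | 36.5 | 34.5 | 40.6 | 44.9 | 42.8 | 48.9 | 76.8 | 77.1 | 76.0 | - | - |
| MaskFormer + RankSeg + GT | ADE20K | R50 | 640x640 | 720000 | 44.3 | 39.7 | 53.5 | 54.5 | 49.5 | 64.6 | 79.6 | 78.6 | 81.7 | - | - |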

Citation

If you find this project useful in your research, please consider citing:

@article{he2022mlseg,
  title={MLSeg: Image and Video Segmentation as Multi-Label Classification and Selected-Label Pixel Classification},
  author={He, Haodi and Yuan, Yuhui and Yue, Xiangyu and Hu, Han},
  journal={arXiv preprint arXiv:2203.04187},
  year={2022}
}