
<div align="center"> <h1>WeakTr </h1> <h3>Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation</h3>

Lianghui Zhu<sup>1</sup> *, Yingyue Li<sup>1</sup> *, Jiemin Fang<sup>1</sup>, Yan Liu<sup>2</sup>, Hao Xin<sup>2</sup>, Wenyu Liu<sup>1</sup>, Xinggang Wang<sup>1 :email:</sup>

<sup>1</sup> School of EIC, Huazhong University of Science & Technology, <sup>2</sup> Ant Group

(*) equal contribution, (<sup>:email:</sup>) corresponding author.

[arXiv Preprint (arXiv 2304.01184)](https://arxiv.org/abs/2304.01184)

</div>

## Highlight

<div align="center">

PWC PWC PWC PWC

</div> <div align=center><img src="img/miou_compare.png" width="400px"></div>

## Introduction

This paper explores the properties of the plain Vision Transformer (ViT) for Weakly-supervised Semantic Segmentation (WSSS). The class activation map (CAM) is of critical importance for understanding a classification network and launching WSSS. We observe that different attention heads of ViT focus on different image areas. We therefore propose a novel weight-based method that estimates the importance of attention heads end-to-end and adaptively fuses the self-attention maps, yielding high-quality CAMs that tend to cover objects more completely.

### Step 1: End-to-End CAM Generation

<div align=center><img src="img/WeakTr.png" width="800px"></div>

In addition, we propose a ViT-based gradient clipping decoder for online retraining with the CAM results to complete the WSSS task. We name this plain-Transformer-based weakly-supervised learning framework WeakTr. It achieves state-of-the-art WSSS performance on standard benchmarks, i.e., 78.5% mIoU on the VOC12 val set and 51.1% mIoU on the COCO14 val set.

### Step 2: Online Retraining with Gradient Clipping Decoder

<div align=center><img src="img/clip_grad_decoder.png" width="400px"></div>

## News

## Getting Started

## Main results

### Step 1: End-to-End CAM Generation

| Dataset | Method | Backbone | Checkpoint   | CAM Label    | Train mIoU |
|---------|--------|----------|--------------|--------------|------------|
| VOC12   | WeakTr | DeiT-S   | Google Drive | Google Drive | 69.4%      |
| COCO14  | WeakTr | DeiT-S   | Google Drive | Google Drive | 42.6%      |

### Step 2: Online Retraining with Gradient Clipping Decoder

| Dataset | Method | Backbone | Checkpoint   | Val mIoU | Pseudo-mask  | Train mIoU |
|---------|--------|----------|--------------|----------|--------------|------------|
| VOC12   | WeakTr | DeiT-S   | Google Drive | 74.0%    | Google Drive | 76.5%      |
| VOC12   | WeakTr | DINOv2-S | Google Drive | 75.8%    | Google Drive | 78.1%      |
| VOC12   | WeakTr | ViT-S    | Google Drive | 78.4%    | Google Drive | 80.3%      |
| VOC12   | WeakTr | EVA-02-S | Google Drive | 78.5%    | Google Drive | 80.0%      |
| COCO14  | WeakTr | DeiT-S   | Google Drive | 46.9%    | Google Drive | 48.9%      |
| COCO14  | WeakTr | DINOv2-S | Google Drive | 48.9%    | Google Drive | 50.7%      |
| COCO14  | WeakTr | ViT-S    | Google Drive | 50.3%    | Google Drive | 51.3%      |
| COCO14  | WeakTr | EVA-02-S | Google Drive | 51.1%    | Google Drive | 52.2%      |

## Citation

If you find this repository or our work helpful in your research, please consider citing the paper and giving the repo a ⭐.

```bibtex
@article{zhu2023weaktr,
  title={WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation},
  author={Lianghui Zhu and Yingyue Li and Jiemin Fang and Yan Liu and Hao Xin and Wenyu Liu and Xinggang Wang},
  journal={arXiv preprint arXiv:2304.01184},
  year={2023},
}
```