
<p align="center"> <h1 align="center">Transformer-Based Visual Segmentation: A Survey</h1> <p align="center"> T-PAMI, 2024 <br /> <a href="https://lxtgh.github.io/"><strong>Xiangtai Li (Project Lead)</strong></a> · <a href="https://henghuiding.github.io/"><strong>Henghui Ding</strong></a> · <a href="https://yuanhaobo.me/"><strong>Haobo Yuan</strong></a> · <a href="http://zhangwenwei.cn/"><strong>Wenwei Zhang</strong></a> · <a href="https://sites.google.com/view/guangliangcheng"><strong>Guangliang Cheng</strong></a> <br /> <a href="https://oceanpang.github.io/"><strong>Jiangmiao Pang</strong></a> · <a href="https://chenkai.site/"><strong>Kai Chen</strong></a> · <a href="https://liuziwei7.github.io/"><strong>Ziwei Liu</strong></a> · <a href="https://www.mmlab-ntu.com/person/ccloy/"><strong>Chen Change Loy</strong></a> </p> <p align="center"> <a href='https://arxiv.org/abs/2304.09854'> <img src='https://img.shields.io/badge/Paper-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'> </a> <a href='https://www.mmlab-ntu.com/project/seg_survey/index.html' style='padding-left: 0.5rem;'> <img src='https://img.shields.io/badge/Project-Page-blue?style=flat&logo=Google%20chrome&logoColor=blue' alt='S-Lab Project Page'> </a> <a href='https://ieeexplore.ieee.org/abstract/document/10613466'> <img src='https://img.shields.io/badge/TPAMI-PDF-blue?style=flat&logo=IEEE&logoColor=green' alt='TPAMI PDF'> </a> </p> <br />

This repo records, tracks, and benchmarks recent transformer-based visual segmentation methods, as a supplement to our survey.
If you find any work missing or have suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo ASAP.

🔥News

[-] Accepted by T-PAMI-2024.

[-] Added several CVPR-24 works in this direction. 2024-03. You are welcome to add your CVPR works to our repo!

[-] The third version of the survey is on arXiv. More benchmarks and methods are included! 2023-12.

[-] The second draft is on arXiv. 2023-06.

🔥Highlight!!

[1], Previous transformer surveys divide the methods by task and setting. Unlike them, we revisit and group the existing transformer-based methods from a technical perspective.

[2], We survey the methods in two parts: one for the mainstream tasks based on a DETR-like meta-architecture, the other for related directions, organized by task.

[3], We further re-benchmark several representative works on image semantic segmentation and panoptic segmentation datasets.

[4], We also include the query-based detection transformers, because object queries unify both segmentation and detection tasks.

Introduction

We present the first detailed survey on transformer-based segmentation.


Summary of Contents

Methods: A Survey

Meta-Architecture

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2020 | ECCV | DETR | End-to-End Object Detection with Transformers | Code |
| 2021 | ICLR | Deformable DETR | Deformable DETR: Deformable Transformers for End-to-End Object Detection | Code |
| 2021 | CVPR | MaX-DeepLab | MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers | Code |
| 2021 | NeurIPS | MaskFormer | MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation | Code |
| 2021 | NeurIPS | K-Net | K-Net: Towards Unified Image Segmentation | Code |
| 2023 | CVPR | Lite-DETR | Lite DETR: An interleaved multi-scale encoder for efficient DETR | Code |

Strong Representation

Better ViTs Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2021 | CVPR | SETR | Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers | Code |
| 2021 | ICCV | MViT-V1 | Multiscale vision transformers | Code |
| 2022 | CVPR | MViT-V2 | MViTv2: Improved Multiscale Vision Transformers for Classification and Detection | Code |
| 2021 | NeurIPS | XCiT | XCiT: Cross-covariance image transformers | Code |
| 2021 | ICCV | Pyramid ViT | Pyramid vision transformer: A versatile backbone for dense prediction without convolutions | Code |
| 2021 | ICCV | CrossViT | CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification | Code |
| 2021 | ICCV | CoaT | Co-Scale Conv-Attentional Image Transformers | Code |
| 2022 | CVPR | MPViT | MPViT: Multi-Path Vision Transformer for Dense Prediction | Code |
| 2022 | NeurIPS | SegViT | SegViT: Semantic Segmentation with Plain Vision Transformers | Code |
| 2022 | arXiv | RSSeg | Representation Separation for Semantic Segmentation with Vision Transformers | N/A |

Hybrid CNNs/Transformers/MLPs

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2021 | ICCV | Swin | Swin Transformer: Hierarchical vision transformer using shifted windows | Code |
| 2022 | CVPR | Swin-v2 | Swin Transformer V2: Scaling Up Capacity and Resolution | Code |
| 2021 | NeurIPS | SegFormer | SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | Code |
| 2022 | CVPR | CMT | CMT: Convolutional Neural Networks Meet Vision Transformers | Code |
| 2021 | NeurIPS | Twins | Twins: Revisiting the Design of Spatial Attention in Vision Transformers | Code |
| 2021 | ICCV | CvT | CvT: Introducing Convolutions to Vision Transformers | Code |
| 2021 | NeurIPS | ViTAE | ViTAE: Vision transformer advanced by exploring intrinsic inductive bias | Code |
| 2022 | CVPR | ConvNeXt | A ConvNet for the 2020s | Code |
| 2022 | NeurIPS | SegNeXt | SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation | Code |
| 2022 | CVPR | PoolFormer | PoolFormer: MetaFormer Is Actually What You Need for Vision | Code |
| 2023 | ICLR | STM | Demystify Transformers & Convolutions in Modern Image Deep Networks | Code |

Self-Supervised Learning

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2021 | ICCV | MoCo v3 | An Empirical Study of Training Self-Supervised Vision Transformers | Code |
| 2022 | ICLR | BEiT | BEiT: BERT pre-training of image transformers | Code |
| 2022 | CVPR | MaskFeat | Masked Feature Prediction for Self-Supervised Visual Pre-Training | Code |
| 2022 | CVPR | MAE | Masked Autoencoders Are Scalable Vision Learners | Code |
| 2022 | NeurIPS | ConvMAE | MCMAE: Masked Convolution Meets Masked Autoencoders | Code |
| 2023 | ICLR | SparK | SparK: the first successful BERT/MAE-style pretraining on any convolutional networks | Code |
| 2022 | CVPR | FLIP | Scaling Language-Image Pre-training via Masking | Code |
| 2023 | arXiv | ConvNeXt V2 | ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | Code |

Interaction Design in Decoder

Improved Cross Attention Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2021 | CVPR | Sparse R-CNN | Sparse R-CNN: End-to-End Object Detection with Learnable Proposals | Code |
| 2022 | CVPR | AdaMixer | AdaMixer: A Fast-Converging Query-Based Object Detector | Code |
| 2021 | CVPR | MaX-DeepLab | MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers | Code |
| 2021 | NeurIPS | K-Net | K-Net: Towards Unified Image Segmentation | Code |
| 2022 | CVPR | Mask2Former | Masked-attention Mask Transformer for Universal Image Segmentation | Code |
| 2022 | ECCV | kMaX-DeepLab | k-means Mask Transformer | Code |
| 2021 | ICCV | QueryInst | Instances as queries | Code |
| 2021 | arXiv | ISTR | ISTR: End-to-End Instance Segmentation via Transformers | Code |
| 2021 | NeurIPS | SOLQ | SOLQ: Segmenting objects by learning queries | Code |
| 2022 | CVPR | Panoptic SegFormer | Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers | Code |
| 2022 | CVPR | CMT-DeepLab | CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation | N/A |
| 2022 | CVPR | SparseInst | Sparse Instance Activation for Real-Time Instance Segmentation | Code |
| 2022 | CVPR | SAM-DETR | Accelerating DETR Convergence via Semantic-Aligned Matching | Code |
| 2021 | ICCV | SMCA-DETR | Fast Convergence of DETR with Spatially Modulated Co-Attention | Code |
| 2021 | BMVC | ACT-DETR | End-to-End Object Detection with Adaptive Clustering Transformer | Code |
| 2021 | ICCV | Dynamic DETR | Dynamic DETR: End-to-End Object Detection with Dynamic Attention | N/A |
| 2022 | ICLR | Sparse DETR | Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity | Code |
| 2023 | CVPR | FastInst | FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation | Code |

Spatial-Temporal Cross Attention Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2021 | CVPR | VisTR | VisTR: End-to-End Video Instance Segmentation with Transformers | Code |
| 2021 | NeurIPS | IFC | Video instance segmentation using inter-frame communication transformers | Code |
| 2022 | CVPR | SlotVPS | Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation | N/A |
| 2022 | CVPR | TubeFormer-DeepLab | TubeFormer-DeepLab: Video Mask Transformer | N/A |
| 2022 | CVPR | Video K-Net | Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation | Code |
| 2022 | CVPR | TeViT | Temporally efficient vision transformer for video instance segmentation | Code |
| 2022 | ECCV | SeqFormer | SeqFormer: Sequential Transformer for Video Instance Segmentation | Code |
| 2022 | arXiv | Mask2Former-VIS | Mask2Former for Video Instance Segmentation | Code |
| 2022 | TPAMI | TransVOD | TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers | Code |
| 2022 | NeurIPS | VITA | VITA: Video Instance Segmentation via Object Token Association | Code |

Optimizing Object Query

Adding Position Information into Query

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2021 | ICCV | Conditional-DETR | Conditional DETR for Fast Training Convergence | Code |
| 2022 | arXiv | Conditional-DETR-v2 | Conditional DETR V2: Efficient detection transformer with box queries | Code |
| 2022 | AAAI | Anchor DETR | Anchor DETR: Query design for transformer-based detector | Code |
| 2022 | ICLR | DAB-DETR | DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR | Code |
| 2021 | arXiv | Efficient DETR | Efficient DETR: Improving end-to-end object detector with dense prior | N/A |

Adding Extra Supervision into Query

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2022 | ECCV | DE-DETR | Towards Data-Efficient Detection Transformers | Code |
| 2022 | CVPR | DN-DETR | DN-DETR: Accelerate DETR training by introducing query denoising | Code |
| 2023 | ICLR | DINO | DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | Code |
| 2023 | CVPR | MP-Former | MP-Former: Mask-piloted transformer for image segmentation | Code |
| 2023 | CVPR | Mask-DINO | Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | Code |
| 2022 | NeurIPS | N/A | Learning equivariant segmentation with instance-unique querying | Code |
| 2023 | CVPR | H-DETR | DETRs with Hybrid Matching | Code |
| 2023 | ICCV | Group-DETR | Group DETR: Fast DETR training with group-wise one-to-many assignment | N/A |
| 2023 | ICCV | Co-DETR | DETRs with collaborative hybrid assignments training | Code |

Using Query For Association

Query as Instance Association

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2022 | CVPR | TrackFormer | TrackFormer: Multi-Object Tracking with Transformer | Code |
| 2021 | arXiv | TransTrack | TransTrack: Multiple Object Tracking with Transformer | Code |
| 2022 | ECCV | MOTR | MOTR: End-to-End Multiple-Object Tracking with TRansformer | Code |
| 2022 | NeurIPS | MinVIS | MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training | Code |
| 2022 | ECCV | IDOL | In defense of online models for video instance segmentation | Code |
| 2022 | CVPR | Video K-Net | Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation | Code |
| 2023 | CVPR | GenVIS | A Generalized Framework for Video Instance Segmentation | Code |
| 2023 | ICCV | Tube-Link | Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation | Code |
| 2023 | ICCV | CTVIS | CTVIS: Consistent Training for Online Video Instance Segmentation | Code |
| 2023 | CVPR-W | Video-kMaX | Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation | N/A |

Query as Linking Multi-Tasks

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2022 | ECCV | Panoptic-PartFormer | Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation | Code |
| 2022 | ECCV | PolyphonicFormer | PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation | Code |
| 2022 | CVPR | PanopticDepth | PanopticDepth: A unified framework for depth-aware panoptic segmentation | Code |
| 2022 | ECCV | Fashionformer | Fashionformer: A simple, effective and unified baseline for human fashion segmentation and recognition | Code |
| 2022 | ECCV | InvPT | InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding | Code |
| 2023 | CVPR | UNINEXT | Universal Instance Perception as Object Discovery and Retrieval | Code |
| 2024 | CVPR | GLEE | GLEE: General Object Foundation Model for Images and Videos at Scale | Code |
| 2024 | CVPR | UniVS | UniVS: Unified and Universal Video Segmentation with Prompts as Queries | Code |
| 2024 | CVPR | OMG-Seg | OMG-Seg: Is One Model Good Enough For All Segmentation? | Code |

Conditional Query Generation

Conditional Query Fusion on Language Features

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2021 | ICCV | VLT | Vision-Language Transformer and Query Generation for Referring Segmentation | Code |
| 2022 | CVPR | LAVT | LAVT: Language-aware vision transformer for referring image segmentation | Code |
| 2022 | CVPR | ReSTR | ReSTR: Convolution-free referring image segmentation using transformers | N/A |
| 2022 | CVPR | CRIS | CRIS: CLIP-driven referring image segmentation | Code |
| 2022 | CVPR | MTTR | End-to-End Referring Video Object Segmentation with Multimodal Transformers | Code |
| 2022 | CVPR | LBDT | Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation | Code |
| 2022 | CVPR | ReferFormer | Language as queries for referring video object segmentation | Code |
| 2024 | CVPR | MaskGrounding | Mask Grounding for Referring Image Segmentation | Code |

Conditional Query Fusion on Cross Image Features

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2021 | NeurIPS | CyCTR | Few-Shot Segmentation via Cycle-Consistent Transformer | Code |
| 2022 | CVPR | MatteFormer | MatteFormer: Transformer-Based Image Matting via Prior-Tokens | Code |
| 2022 | ECCV | Segdeformer | A Transformer-based Decoder for Semantic Segmentation with Multi-level Context Mining | Code |
| 2022 | arXiv | StructToken | StructToken: Rethinking Semantic Segmentation with Structural Prior | N/A |
| 2022 | NeurIPS | MM-Former | Mask Matching Transformer for Few-Shot Segmentation | Code |
| 2022 | ECCV | AAFormer | Adaptive Agent Transformer for Few-shot Segmentation | N/A |
| 2023 | arXiv | ReferenceTwice | Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation | Code |

Tuning Foundation Models

Vision Adapter

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2022 | CVPR | CoCoOp | Conditional Prompt Learning for Vision-Language Models | Code |
| 2022 | ECCV | Tip-Adapter | Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification | Code |
| 2022 | ECCV | EVL | Frozen CLIP Models are Efficient Video Learners | Code |
| 2023 | ICLR | ViT-Adapter | Vision Transformer Adapter for Dense Predictions | Code |
| 2022 | CVPR | DenseCLIP | DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting | Code |
| 2022 | CVPR | CLIPSeg | Image Segmentation Using Text and Image Prompts | Code |
| 2023 | CVPR | OneFormer | OneFormer: One Transformer to Rule Universal Image Segmentation | Code |

Open Vocabulary Learning

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2021 | CVPR | OVR-CNN | Open-Vocabulary Object Detection Using Captions | Code |
| 2022 | ICLR | ViLD | Open-vocabulary Object Detection via Vision and Language Knowledge Distillation | Code |
| 2022 | ECCV | Detic | Detecting Twenty-thousand Classes using Image-level Supervision | Code |
| 2022 | ECCV | OV-DETR | Open-Vocabulary DETR with Conditional Matching | Code |
| 2023 | ICLR | F-VLM | F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models | Code |
| 2022 | ECCV | MViT | Class-agnostic Object Detection with Multi-modal Transformer | Code |
| 2022 | ECCV | OpenSeg | Scaling Open-Vocabulary Image Segmentation with Image-Level Labels | Code |
| 2022 | ICLR | LSeg | Language-driven Semantic Segmentation | Code |
| 2022 | ECCV | SimSeg | A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model | Code |
| 2022 | ECCV | DenseCLIP | Extract Free Dense Labels from CLIP | Code |
| 2021 | ICCV | UVO | Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation | Project |
| 2023 | arXiv | CGG | Betrayed-by-Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation | Code |
| 2022 | TPAMI | ES | Open-World Entity Segmentation | Code |
| 2022 | CVPR | OW-DETR | OW-DETR: Open-world Detection Transformer | Code |
| 2023 | CVPR | PROB | PROB: Probabilistic Objectness for Open World Object Detection | Code |

Related Domains and Beyond

Point Cloud Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2021 | ICCV | Point Transformer | Point Transformer | N/A |
| 2021 | CVM | PCT | PCT: Point cloud transformer | Code |
| 2022 | CVPR | Stratified Transformer | Stratified Transformer for 3D Point Cloud Segmentation | Code |
| 2022 | CVPR | Point-BERT | Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling | Code |
| 2022 | ECCV | Point-MAE | Masked Autoencoders for Point Cloud Self-supervised Learning | Code |
| 2022 | NeurIPS | Point-M2AE | Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training | Code |
| 2022 | ICRA | Mask3D | Mask3D for 3D Semantic Instance Segmentation | Code |
| 2023 | AAAI | SPFormer | Superpoint Transformer for 3D Scene Instance Segmentation | Code |
| 2023 | AAAI | PUPS | PUPS: Point Cloud Unified Panoptic Segmentation | N/A |

Domain-aware Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2022 | CVPR | DAFormer | DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation | Code |
| 2022 | ECCV | HRDA | HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation | Code |
| 2023 | CVPR | MIC | MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation | Code |
| 2021 | ACM MM | SFA | Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers | Code |
| 2023 | CVPR | DA-DETR | DA-DETR: Domain Adaptive Detection Transformer with Information Fusion | N/A |
| 2022 | ECCV | MTTrans | MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer | Code |
| 2022 | arXiv | Sentence-Seg | The devil is in the labels: Semantic segmentation from sentences | N/A |
| 2023 | ICLR | LMSeg | LMSeg: Language-guided Multi-dataset Segmentation | N/A |
| 2022 | CVPR | UniDet | Simple multi-dataset detection | Code |
| 2023 | CVPR | Detection Hub | Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding | N/A |
| 2022 | CVPR | WD2 | Unifying Panoptic Segmentation for Autonomous Driving | Data |
| 2023 | arXiv | TarViS | TarViS: A Unified Approach for Target-based Video Segmentation | Code |

Label and Model Efficient Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2022 | CVPR | MCTformer | Multi-class Token Transformer for Weakly Supervised Semantic Segmentation | Code |
| 2020 | CVPR | PCM | Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation | Code |
| 2022 | ECCV | ViT-PCM | Max Pooling with Vision Transformers reconciles class and shape in weakly supervised semantic segmentation | Code |
| 2021 | ICCV | DINO | Emerging Properties in Self-Supervised Vision Transformers | Code |
| 2021 | BMVC | LOST | Localizing Objects with Self-Supervised Transformers and no Labels | Code |
| 2022 | ICLR | STEGO | Unsupervised Semantic Segmentation by Distilling Feature Correspondences | Code |
| 2022 | NeurIPS | ReCo | ReCo: Retrieve and Co-segment for Zero-shot Transfer | Code |
| 2022 | arXiv | MaskDistill | Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation | N/A |
| 2022 | CVPR | FreeSOLO | FreeSOLO: Learning to Segment Objects without Annotations | Code |
| 2023 | CVPR | CutLER | Cut and Learn for Unsupervised Object Detection and Instance Segmentation | Code |
| 2022 | CVPR | TokenCut | Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut | Code |
| 2022 | ICLR | MobileViT | MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer | Code |
| 2023 | arXiv | EMO | Rethinking Mobile Block for Efficient Neural Models | Code |
| 2022 | CVPR | TopFormer | TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation | Code |
| 2023 | ICLR | SeaFormer | SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation | Code |

Class Agnostic Segmentation and Tracking

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2022 | CVPR | Transfiner | Mask Transfiner for High-Quality Instance Segmentation | Code |
| 2022 | ECCV | VMT | Video Mask Transfiner for High-Quality Video Instance Segmentation | Code |
| 2022 | arXiv | SimpleClick | SimpleClick: Interactive Image Segmentation with Simple Vision Transformers | Code |
| 2023 | ICLR | PatchDCT | PatchDCT: Patch Refinement for High Quality Instance Segmentation | Code |
| 2019 | ICCV | STM | Video Object Segmentation using Space-Time Memory Networks | Code |
| 2021 | NeurIPS | AOT | Associating Objects with Transformers for Video Object Segmentation | Code |
| 2021 | NeurIPS | STCN | Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation | Code |
| 2022 | ECCV | XMem | XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model | Code |
| 2022 | CVPR | PCVOS | Per-Clip Video Object Segmentation | Code |
| 2023 | CVPR | N/A | Look Before You Match: Instance Understanding Matters in Video Object Segmentation | N/A |

Medical Image Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|------|-------|---------|-------------|--------------|
| 2020 | BIBM | CellDETR | Attention-Based Transformers for Instance Segmentation of Cells in Microstructures | Code |
| 2021 | arXiv | TransUNet | TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation | Code |
| 2022 | ECCV Workshop | Swin-Unet | Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation | Code |
| 2021 | MICCAI | TransFuse | TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation | Code |
| 2022 | WACV | UNETR | UNETR: Transformers for 3D Medical Image Segmentation | Code |

Acknowledgement

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{li2023transformer,
    author={Li, Xiangtai and Ding, Henghui and Zhang, Wenwei and Yuan, Haobo and Cheng, Guangliang and Pang, Jiangmiao and Chen, Kai and Liu, Ziwei and Loy, Chen Change},
    title={Transformer-Based Visual Segmentation: A Survey},
    journal={T-PAMI},
    year={2024}
}

Contact

xiangtai94@gmail.com (main)
lxtpku@pku.edu.cn

Related Repos for Segmentation and Detection

Attention Model Repo by Min-Hung (Steve) Chen.

Detection Transformer Repo by IDEA.

Open Vocabulary Learning Repo by PKU and NTU.