Ultimate-Awesome-Transformer-Attention
This repo contains a comprehensive paper list of Vision Transformer & Attention, including papers, code, and related websites. <br> This list is maintained by Min-Hung Chen. (Actively updated)
If you find any missing papers, feel free to create pull requests, open issues, or email me. <br> Contributions in any form to make this list more comprehensive are welcome.
If you find this repository useful, please consider citing and ★STARing this list. <br> Feel free to share this list with others!
[Update: January, 2024] Added all the related papers from NeurIPS 2023! <br>
[Update: December, 2023] Added all the related papers from ICCV 2023! <br>
[Update: September, 2023] Split the multi-modal paper list to README_multimodal.md <br>
[Update: June, 2023] Added all the related papers from ICML 2023! <br>
[Update: June, 2023] Added all the related papers from CVPR 2023! <br>
[Update: February, 2023] Added all the related papers from ICLR 2023! <br>
[Update: December, 2022] Added attention-free papers from Networks Beyond Attention (GitHub) made by Jianwei Yang <br>
[Update: November, 2022] Added all the related papers from NeurIPS 2022! <br>
[Update: October, 2022] Split the 2nd half of the paper list to README_2.md <br>
[Update: October, 2022] Added all the related papers from ECCV 2022! <br>
[Update: September, 2022] Added the Transformer tutorial slides made by Lucas Beyer! <br>
[Update: June, 2022] Added all the related papers from CVPR 2022!
Overview
- Citation
- Survey
- Image Classification / Backbone
- Detection
- Segmentation
- Video (High-level)
- References
------ (The following papers are moved to README_multimodal.md) ------
------ (The following papers are moved to README_2.md) ------
- Other High-level Vision Tasks
- Transfer / X-Supervised / X-Shot / Continual Learning
- Low-level Vision Tasks
- Reinforcement Learning
- Medical
- Other Tasks
- Attention Mechanisms in Vision/NLP
Citation
If you find this repository useful, please consider citing this list:
@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}
Survey
- "A Survey on Multimodal Large Language Models for Autonomous Driving", WACVW, 2024 (Purdue). [Paper][GitHub]
- "Efficient Multimodal Large Language Models: A Survey", arXiv, 2024 (Tencent). [Paper][GitHub]
- "From Sora What We Can See: A Survey of Text-to-Video Generation", arXiv, 2024 (Newcastle University, UK). [Paper][GitHub]
- "When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models", arXiv, 2024 (Oxford). [Paper][GitHub]
- "Foundation Models for Video Understanding: A Survey", arXiv, 2024 (Aalborg University, Denmark). [Paper][GitHub]
- "Vision Mamba: A Comprehensive Survey and Taxonomy", arXiv, 2024 (Chongqing University). [Paper][GitHub]
- "Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", arXiv, 2024 (GigaAI, China). [Paper][GitHub]
- "Video Diffusion Models: A Survey", arXiv, 2024 (Bielefeld University, Germany). [Paper][GitHub]
- "Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras", arXiv, 2024 (Lehigh + UPenn). [Paper]
- "Hallucination of Multimodal Large Language Models: A Survey", arXiv, 2024 (NUS). [Paper][GitHub]
- "A Survey on Vision Mamba: Models, Applications and Challenges", arXiv, 2024 (HKUST). [Paper][GitHub]
- "State Space Model for New-Generation Network Alternative to Transformers: A Survey", arXiv, 2024 (Anhui University). [Paper][GitHub]
- "Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions", arXiv, 2024 (IIT Patna). [Paper]
- "From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models", arXiv, 2024 (UIUC). [Paper][GitHub]
- "Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey", arXiv, 2024 (Northeastern). [Paper]
- "Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation", arXiv, 2024 (Kyung Hee University). [Paper]
- "Controllable Generation with Text-to-Image Diffusion Models: A Survey", arXiv, 2024 (Beijing University of Posts and Telecommunications). [Paper][GitHub]
- "Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models", arXiv, 2024 (Lehigh University, Pennsylvania). [Paper][GitHub]
- "Large Multimodal Agents: A Survey", arXiv, 2024 (CUHK). [Paper][GitHub]
- "Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey", arXiv, 2024 (BIGAI). [Paper][GitHub]
- "Vision-Language Navigation with Embodied Intelligence: A Survey", arXiv, 2024 (Qufu Normal University, China). [Paper]
- "The (R)Evolution of Multimodal Large Language Models: A Survey", arXiv, 2024 (University of Modena and Reggio Emilia (UniMoRE), Italy). [Paper]
- "Masked Modeling for Self-supervised Representation Learning on Vision and Beyond", arXiv, 2024 (Westlake University, China). [Paper][GitHub]
- "Transformer for Object Re-Identification: A Survey", arXiv, 2024 (Wuhan University). [Paper]
- "Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities", arXiv, 2024 (Huawei). [Paper][GtiHub]
- "MM-LLMs: Recent Advances in MultiModal Large Language Models", arXiv, 2024 (Tencent). [Paper]
- "From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities", arXiv, 2024 (Shanghai AI Lab). [Paper]
- "A Survey on Hallucination in Large Vision-Language Models", arXiv, 2024 (Huawei). [Paper]
- "A Survey for Foundation Models in Autonomous Driving", arXiv, 2024 (Motional, Massachusetts). [Paper]
- "A Survey on Transformer Compression", arXiv, 2024 (Huawei). [Paper]
- "Vision + Language Applications: A Survey", CVPRW, 2023 (Ritsumeikan University, Japan). [Paper][GitHub]
- "Multimodal Learning With Transformers: A Survey", TPAMI, 2023 (Tsinghua & Oxford). [Paper]
- "A Survey of Visual Transformers", TNNLS, 2023 (CAS). [Paper][GitHub]
- "Video Understanding with Large Language Models: A Survey", arXiv, 2023 (University of Rochester). [Paper][GitHub]
- "Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey", arXiv, 2023 (NTU, Singapore). [Paper]
- "A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook", arXiv, 2023 (Huawei). [Paper][GitHub]
- "A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise", arXiv, 2023 (Tencent). [Paper]GitHub]
- "Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey", arXiv, 2023 (JHU). [Paper]
- "Explainability of Vision Transformers: A Comprehensive Review and New Perspectives", arXiv, 2023 (Institute for Research in Fundamental Sciences (IPM), Iran). [Paper]
- "Vision-Language Instruction Tuning: A Review and Analysis", arXiv, 2023 (Tencent). [Paper][GitHub (in construction)]
- "Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability", arXiv, 2023 (York University). [Paper]
- "Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey", arXiv, 2023 (valeo.ai, France). [Paper][GitHub]
- "A Survey on Video Diffusion Models", arXiv, 2023 (Fudan). [Paper][GitHub]
- "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", arXiv, 2023 (Microsoft). [Paper]
- "Multimodal Foundation Models: From Specialists to General-Purpose Assistants", arXiv, 2023 (Microsoft). [Paper]
- "Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art", arXiv, 2023 (University of Western Australia). [Paper]
- "RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model", arXiv, 2023 (University of Sydney). [Paper]
- "A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking", arXiv, 2023 (The University of Sydney). [Paper]
- "From CNN to Transformer: A Review of Medical Image Segmentation Models", arXiv, 2023 (UESTC). [Paper]
- "Foundational Models Defining a New Era in Vision: A Survey and Outlook", arXiv, 2023 (MBZUAI). [Paper][GitHub]
- "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models", arXiv, 2023 (Oxford). [Paper]
- "Robust Visual Question Answering: Datasets, Methods, and Future Challenges", arXiv, 2023 (Xi'an Jiaotong University). [Paper]
- "A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future", arXiv, 2023 (HKUST). [Paper]
- "Transformers in Reinforcement Learning: A Survey", arXiv, 2023 (Mila). [Paper]
- "Vision Language Transformers: A Survey", arXiv, 2023 (Boise State University, Idaho). [Paper]
- "Towards Open Vocabulary Learning: A Survey", arXiv, 2023 (Peking). [Paper][GitHub]
- "Large Multimodal Models: Notes on CVPR 2023 Tutorial", arXiv, 2023 (Microsoft). [Paper]
- "A Survey on Multimodal Large Language Models", arXiv, 2023 (USTC). [Paper][GitHub]
- "2D Object Detection with Transformers: A Review", arXiv, 2023 (German Research Center for Artificial Intelligence, Germany). [Paper]
- "Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature", arXiv, 2023 (Eldorado’s Institute of Technology, Brazil). [Paper]
- "Vision-Language Models in Remote Sensing: Current Progress and Future Trends", arXiv, 2023 (NYU). [Paper]
- "Visual Tuning", arXiv, 2023 (The Hong Kong Polytechnic University). [Paper]
- "Self-supervised Learning for Pre-Training 3D Point Clouds: A Survey", arXiv, 2023 (Fudan University). [Paper]
- "Semantic Segmentation using Vision Transformers: A survey", arXiv, 2023 (University of Peradeniya, Sri Lanka). [Paper]
- "A Review of Deep Learning for Video Captioning", arXiv, 2023 (Deakin University, Australia). [Paper]
- "Transformer-Based Visual Segmentation: A Survey", arXiv, 2023 (NTU, Singapore). [Paper][GitHub]
- "Vision-Language Models for Vision Tasks: A Survey", arXiv, 2023 (?). [Paper][GitHub (in construction)]
- "Text-to-image Diffusion Model in Generative AI: A Survey", arXiv, 2023 (KAIST). [Paper]
- "Foundation Models for Decision Making: Problems, Methods, and Opportunities", arXiv, 2023 (Berkeley + Google). [Paper]
- "Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review", arXiv, 2023 (RWTH Aachen University, Germany). [Paper][GitHub]
- "Efficiency 360: Efficient Vision Transformers", arXiv, 2023 (IBM). [Paper][GitHub]
- "Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey", arXiv, 2023 (Indian Institute of Information Technology). [Paper]
- "Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey", arXiv, 2023 (Pengcheng Laboratory). [Paper][GitHub]
- "A Survey on Visual Transformer", TPAMI, 2022 (Huawei). [Paper]
- "Attention mechanisms in computer vision: A survey", Computational Visual Media, 2022 (Tsinghua University, China). [Paper][Springer][Github]
- "A Comprehensive Study of Vision Transformers on Dense Prediction Tasks", VISAP, 2022 (NavInfo Europe, Netherlands). [Paper]
- "Vision-and-Language Pretrained Models: A Survey", IJCAI, 2022 (The University of Sydney). [Paper]
- "Vision Transformers in Medical Imaging: A Review", arXiv, 2022 (Covenant University, Nigeria). [Paper]
- "A Comprehensive Survey of Transformers for Computer Vision", arXiv, 2022 (Sejong University). [Paper]
- "Vision-Language Pre-training: Basics, Recent Advances, and Future Trends", arXiv, 2022 (Microsoft). [Paper]
- "Vision+X: A Survey on Multimodal Learning in the Light of Data", arXiv, 2022 (Illinois Institute of Technology, Chicago). [Paper]
- "Vision Transformers for Action Recognition: A Survey", arXiv, 2022 (Charles Sturt University, Australia). [Paper]
- "VLP: A Survey on Vision-Language Pre-training", arXiv, 2022 (CAS). [Paper]
- "Transformers in Remote Sensing: A Survey", arXiv, 2022 (MBZUAI). [Paper][Github]
- "Medical image analysis based on transformer: A Review", arXiv, 2022 (NUS, Singapore). [Paper]
- "3D Vision with Transformers: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
- "Vision Transformers: State of the Art and Research Challenges", arXiv, 2022 (NYCU). [Paper]
- "Transformers in Medical Imaging: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
- "Multimodal Learning with Transformers: A Survey", arXiv, 2022 (Oxford). [Paper]
- "Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives", arXiv, 2022 (CAS). [Paper]
- "Transformers in 3D Point Clouds: A Survey", arXiv, 2022 (University of Waterloo). [Paper]
- "A survey on attention mechanisms for medical applications: are we moving towards better algorithms?", arXiv, 2022 (INESC TEC and University of Porto, Portugal). [Paper]
- "Efficient Transformers: A Survey", arXiv, 2022 (Google). [Paper]
- "Are we ready for a new paradigm shift? A Survey on Visual Deep MLP", arXiv, 2022 (Tsinghua). [Paper]
- "Vision Transformers in Medical Computer Vision - A Contemplative Retrospection", arXiv, 2022 (National University of Sciences and Technology (NUST), Pakistan). [Paper]
- "Video Transformers: A Survey", arXiv, 2022 (Universitat de Barcelona, Spain). [Paper]
- "Transformers in Medical Image Analysis: A Review", arXiv, 2022 (Nanjing University). [Paper]
- "Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work", arXiv, 2022 (?). [Paper]
- "Transformers Meet Visual Learning Understanding: A Comprehensive Review", arXiv, 2022 (Xidian University). [Paper]
- "Image Captioning In the Transformer Age", arXiv, 2022 (Alibaba). [Paper][GitHub]
- "Visual Attention Methods in Deep Learning: An In-Depth Survey", arXiv, 2022 (Fayoum University, Egypt). [Paper]
- "Transformers in Vision: A Survey", ACM Computing Surveys, 2021 (MBZUAI). [Paper]
- "Survey: Transformer based Video-Language Pre-training", arXiv, 2021 (Renmin University of China). [Paper]
- "A Survey of Transformers", arXiv, 2021 (Fudan). [Paper]
- "Attention mechanisms and deep learning for machine vision: A survey of the state of the art", arXiv, 2021 (University of Kashmir, India). [Paper]
Image Classification / Backbone
Replace Conv w/ Attention
Pure Attention
- LR-Net: "Local Relation Networks for Image Recognition", ICCV, 2019 (Microsoft). [Paper][PyTorch (gan3sh500)]
- SASA: "Stand-Alone Self-Attention in Vision Models", NeurIPS, 2019 (Google). [Paper][PyTorch-1 (leaderj1001)][PyTorch-2 (MerHS)]
- Axial-Transformer: "Axial Attention in Multidimensional Transformers", arXiv, 2019 (Google). [Paper][PyTorch (lucidrains)]
- SAN: "Exploring Self-attention for Image Recognition", CVPR, 2020 (CUHK + Intel). [Paper][PyTorch]
- Axial-DeepLab: "Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation", ECCV, 2020 (Google). [Paper][PyTorch]
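The unifying idea in the papers above is replacing convolution entirely with (local) self-attention over pixels. As a rough illustration only, here is a minimal single-head sketch of SASA-style stand-alone local self-attention; the single head, stride 1, missing relative position encoding, and the `window` size are all simplifications, not any paper's exact design:

```python
# Minimal single-head local self-attention over a 2D feature map,
# in the spirit of SASA. Simplified: one head, stride 1, no relative
# position encoding; `window` is an illustrative neighborhood size.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    def __init__(self, dim, window=7):
        super().__init__()
        self.window = window
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)

    def forward(self, x):                               # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)
        pad = self.window // 2
        # Gather a (window x window) neighborhood of keys/values per pixel.
        k = F.unfold(k, self.window, padding=pad).view(B, C, -1, H * W)
        v = F.unfold(v, self.window, padding=pad).view(B, C, -1, H * W)
        q = q.view(B, C, 1, H * W)
        attn = (q * k).sum(dim=1, keepdim=True) / C ** 0.5  # (B, 1, w*w, H*W)
        attn = attn.softmax(dim=2)                          # over the window
        out = (attn * v).sum(dim=2)                         # (B, C, H*W)
        return out.view(B, C, H, W)

x = torch.randn(1, 32, 14, 14)
print(LocalSelfAttention2d(32)(x).shape)  # torch.Size([1, 32, 14, 14])
```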
Conv-stem + Attention
- GSA-Net: "Global Self-Attention Networks for Image Recognition", arXiv, 2020 (Google). [Paper][PyTorch (lucidrains)]
- HaloNet: "Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021 (Google). [Paper][PyTorch (lucidrains)]
- CoTNet: "Contextual Transformer Networks for Visual Recognition", CVPRW, 2021 (JD). [Paper][PyTorch]
- HAT-Net: "Vision Transformers with Hierarchical Attention", arXiv, 2022 (ETHZ). [Paper][PyTorch (in construction)]
Conv + Attention
- AA: "Attention Augmented Convolutional Networks", ICCV, 2019 (Google). [Paper][PyTorch (leaderj1001)][Tensorflow (titu1994)]
- GCNet: "Global Context Networks", ICCVW, 2019 (& TPAMI 2020) (Microsoft). [Paper][PyTorch]
- LambdaNetworks: "LambdaNetworks: Modeling long-range Interactions without Attention", ICLR, 2021 (Google). [Paper][PyTorch-1 (lucidrains)][PyTorch-2 (leaderj1001)]
- BoTNet: "Bottleneck Transformers for Visual Recognition", CVPR, 2021 (Google). [Paper][PyTorch-1 (lucidrains)][PyTorch-2 (leaderj1001)]
- GCT: "Gaussian Context Transformer", CVPR, 2021 (Zhejiang University). [Paper]
- CoAtNet: "CoAtNet: Marrying Convolution and Attention for All Data Sizes", NeurIPS, 2021 (Google). [Paper]
- ACmix: "On the Integration of Self-Attention and Convolution", CVPR, 2022 (Tsinghua). [Paper][PyTorch]
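Rather than replacing convolution, the papers above mix convolutional and attention branches within a block. Below is a hedged sketch of the AA-style recipe, concatenating convolutional features with global self-attention features along the channel axis; `attn_ch` and `heads` are illustrative values, not the paper's settings:

```python
# Sketch of an attention-augmented convolution (AA-style): convolutional
# and global self-attention features are concatenated channel-wise.
# `attn_ch` and `heads` are illustrative, not the paper's settings.
import torch
import torch.nn as nn

class AugmentedConv(nn.Module):
    def __init__(self, in_ch, out_ch, attn_ch=16, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch - attn_ch, 3, padding=1)
        self.proj = nn.Conv2d(in_ch, attn_ch, 1)       # to attention width
        self.attn = nn.MultiheadAttention(attn_ch, heads, batch_first=True)

    def forward(self, x):                               # x: (B, C, H, W)
        B, _, H, W = x.shape
        conv_out = self.conv(x)
        t = self.proj(x).flatten(2).transpose(1, 2)     # (B, H*W, attn_ch)
        attn_out, _ = self.attn(t, t, t)                # global self-attention
        attn_out = attn_out.transpose(1, 2).reshape(B, -1, H, W)
        return torch.cat([conv_out, attn_out], dim=1)   # (B, out_ch, H, W)

x = torch.randn(2, 32, 16, 16)
print(AugmentedConv(32, 64)(x).shape)  # torch.Size([2, 64, 16, 16])
```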
Vision Transformer
General Vision Transformer
- ViT: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR, 2021 (Google). [Paper][Tensorflow][PyTorch (lucidrains)][JAX (conceptofmind)]
- Perceiver: "Perceiver: General Perception with Iterative Attention", ICML, 2021 (DeepMind). [Paper][PyTorch (lucidrains)]
- PiT: "Rethinking Spatial Dimensions of Vision Transformers", ICCV, 2021 (NAVER). [Paper][PyTorch]
- VT: "Visual Transformers: Where Do Transformers Really Belong in Vision Models?", ICCV, 2021 (Facebook). [Paper][PyTorch (tahmid0007)]
- PVT: "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
- iRPE: "Rethinking and Improving Relative Position Encoding for Vision Transformer", ICCV, 2021 (Microsoft). [Paper][PyTorch]
- CaiT: "Going deeper with Image Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
- Swin-Transformer: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV, 2021 (Microsoft). [Paper][PyTorch][PyTorch (berniwal)]
- T2T-ViT: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet", ICCV, 2021 (Yitu). [Paper][PyTorch]
- FFNBN: "Leveraging Batch Normalization for Vision Transformers", ICCVW, 2021 (Microsoft). [Paper]
- DPT: "DPT: Deformable Patch-based Transformer for Visual Recognition", ACMMM, 2021 (CAS). [Paper][PyTorch]
- Focal: "Focal Attention for Long-Range Interactions in Vision Transformers", NeurIPS, 2021 (Microsoft). [Paper][PyTorch]
- XCiT: "XCiT: Cross-Covariance Image Transformers", NeurIPS, 2021 (Facebook). [Paper]
- Twins: "Twins: Revisiting Spatial Attention Design in Vision Transformers", NeurIPS, 2021 (Meituan). [Paper][PyTorch]
- ARM: "Blending Anti-Aliasing into Vision Transformer", NeurIPS, 2021 (Amazon). [Paper][GitHub (in construction)]
- DVT: "Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch]
- Aug-S: "Augmented Shortcuts for Vision Transformers", NeurIPS, 2021 (Huawei). [Paper]
- TNT: "Transformer in Transformer", NeurIPS, 2021 (Huawei). [Paper][PyTorch][PyTorch (lucidrains)]
- ViTAE: "ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias", NeurIPS, 2021 (The University of Sydney). [Paper][PyTorch]
- DeepViT: "DeepViT: Towards Deeper Vision Transformer", arXiv, 2021 (NUS + ByteDance). [Paper][Code]
- So-ViT: "So-ViT: Mind Visual Tokens for Vision Transformer", arXiv, 2021 (Dalian University of Technology). [Paper][PyTorch]
- LV-ViT: "All Tokens Matter: Token Labeling for Training Better Vision Transformers", NeurIPS, 2021 (ByteDance). [Paper][PyTorch]
- NesT: "Aggregating Nested Transformers", arXiv, 2021 (Google). [Paper][Tensorflow]
- KVT: "KVT: k-NN Attention for Boosting Vision Transformers", arXiv, 2021 (Alibaba). [Paper]
- Refined-ViT: "Refiner: Refining Self-attention for Vision Transformers", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
- Shuffle-Transformer: "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer", arXiv, 2021 (Tencent). [Paper]
- CAT: "CAT: Cross Attention in Vision Transformer", arXiv, 2021 (KuaiShou). [Paper][PyTorch]
- V-MoE: "Scaling Vision with Sparse Mixture of Experts", arXiv, 2021 (Google). [Paper]
- P2T: "P2T: Pyramid Pooling Transformer for Scene Understanding", arXiv, 2021 (Nankai University). [Paper]
- PVTv2: "PVTv2: Improved Baselines with Pyramid Vision Transformer", arXiv, 2021 (Nanjing University). [Paper][PyTorch]
- LG-Transformer: "Local-to-Global Self-Attention in Vision Transformers", arXiv, 2021 (IIAI, UAE). [Paper]
- ViP: "Visual Parser: Representing Part-whole Hierarchies with Transformers", arXiv, 2021 (Oxford). [Paper]
- Scaled-ReLU: "Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022 (Alibaba). [Paper]
- LIT: "Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022 (Monash University). [Paper][PyTorch]
- DTN: "Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022 (Tencent). [Paper][PyTorch (in construction)]
- RegionViT: "RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022 (MIT-IBM Watson). [Paper][PyTorch]
- CrossFormer: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022 (Zhejiang University). [Paper][PyTorch]
- ?: "Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022 (UT Austin). [Paper]
- ViT-G: "Scaling Vision Transformers", CVPR, 2022 (Google). [Paper]
- CSWin: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- MPViT: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022 (KAIST). [Paper][PyTorch]
- Diverse-ViT: "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022 (UT Austin). [Paper][PyTorch]
- DW-ViT: "Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022 (Dark Matter AI, China). [Paper][PyTorch (in construction)]
- MixFormer: "MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022 (Baidu). [Paper][Paddle]
- DAT: "Vision Transformer with Deformable Attention", CVPR, 2022 (Tsinghua). [Paper][PyTorch]
- Swin-Transformer-V2: "Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- MSG-Transformer: "MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022 (Huazhong University of Science & Technology). [Paper][PyTorch]
- NomMer: "NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022 (Tencent). [Paper][PyTorch]
- Shunted: "Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022 (NUS). [Paper][PyTorch]
- PyramidTNT: "PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022 (Huawei). [Paper][PyTorch]
- X-ViT: "X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022 (Kakao). [Paper]
- ReMixer: "ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022 (KAIST). [Paper][PyTorch]
- UN: "Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022 (Hikvision). [Paper][Code (in construction)]
- Wave-ViT: "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022 (JD). [Paper][PyTorch]
- DaViT: "DaViT: Dual Attention Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
- ScalableViT: "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022 (ByteDance). [Paper]
- MaxViT: "MaxViT: Multi-Axis Vision Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
- VSA: "VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022 (The University of Sydney). [Paper][PyTorch]
- ?: "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning", NeurIPS, 2022 (Microsoft). [Paper]
- Ortho: "Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization", NeurIPS, 2022 (CAS). [Paper]
- PerViT: "Peripheral Vision Transformer", NeurIPS, 2022 (POSTECH). [Paper]
- LITv2: "Fast Vision Transformers with HiLo Attention", NeurIPS, 2022 (Monash University). [Paper][PyTorch]
- BViT: "BViT: Broad Attention based Vision Transformer", arXiv, 2022 (CAS). [Paper]
- O-ViT: "O-ViT: Orthogonal Vision Transformer", arXiv, 2022 (East China Normal University). [Paper]
- MOA-Transformer: "Aggregating Global Features into Local Vision Transformer", arXiv, 2022 (University of Kansas). [Paper][PyTorch]
- BOAT: "BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022 (Baidu + HKU). [Paper]
- ViTAEv2: "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022 (The University of Sydney). [Paper]
- HiP: "Hierarchical Perceiver", arXiv, 2022 (DeepMind). [Paper]
- PatchMerger: "Learning to Merge Tokens in Vision Transformers", arXiv, 2022 (Google). [Paper]
- DGT: "Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022 (Baidu). [Paper]
- NAT: "Neighborhood Attention Transformer", arXiv, 2022 (Oregon). [Paper][PyTorch]
- ASF-former: "Adaptive Split-Fusion Transformer", arXiv, 2022 (Fudan). [Paper][PyTorch (in construction)]
- SP-ViT: "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022 (Alibaba). [Paper]
- EATFormer: "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022 (Zhejiang University). [Paper]
- LinGlo: "Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022 (TCL Research Wuhan). [Paper]
- Dual-ViT: "Dual Vision Transformer", arXiv, 2022 (JD). [Paper][PyTorch]
- MMA: "Multi-manifold Attention for Vision Transformers", arXiv, 2022 (Centre for Research and Technology Hellas, Greece). [Paper]
- MAFormer: "MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022 (Baidu). [Paper]
- AEWin: "Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022 (Southwest Jiaotong University). [Paper]
- GrafT: "Grafting Vision Transformers", arXiv, 2022 (Stony Brook). [Paper]
- ?: "Rethinking Hierarchicies in Pre-trained Plain Vision Transformer", arXiv, 2022 (The University of Sydney). [Paper]
- LTH-ViT: "The Lottery Ticket Hypothesis for Vision Transformers", arXiv, 2022 (Northeastern University, China). [Paper]
- TT: "Token Transformer: Can class token help window-based transformer build better long-range interactions?", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
- INTERN: "INTERN: A New Learning Paradigm Towards General Vision", arXiv, 2022 (Shanghai AI Lab). [Paper][Website]
- GGeM: "Group Generalized Mean Pooling for Vision Transformer", arXiv, 2022 (NAVER). [Paper]
- GPViT: "GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation", ICLR, 2023 (University of Edinburgh, Scotland + UCSD). [Paper][PyTorch]
- CPVT: "Conditional Positional Encodings for Vision Transformers", ICLR, 2023 (Meituan). [Paper][Code (in construction)]
- LipsFormer: "LipsFormer: Introducing Lipschitz Continuity to Vision Transformers", ICLR, 2023 (IDEA, China). [Paper][Code (in construction)]
- BiFormer: "BiFormer: Vision Transformer with Bi-Level Routing Attention", CVPR, 2023 (CUHK). [Paper][PyTorch]
- AbSViT: "Top-Down Visual Attention from Analysis by Synthesis", CVPR, 2023 (Berkeley). [Paper][PyTorch][Website]
- DependencyViT: "Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention", CVPR, 2023 (MIT). [Paper][Code (in construction)]
- ResFormer: "ResFormer: Scaling ViTs with Multi-Resolution Training", CVPR, 2023 (Fudan). [Paper][PyTorch (in construction)]
- SViT: "Vision Transformer with Super Token Sampling", CVPR, 2023 (CAS). [Paper]
- PaCa-ViT: "PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers", CVPR, 2023 (NC State). [Paper][PyTorch]
- GC-ViT: "Global Context Vision Transformers", ICML, 2023 (NVIDIA). [Paper][PyTorch]
- MAGNETO: "MAGNETO: A Foundation Transformer", ICML, 2023 (Microsoft). [Paper]
- Fcaformer: "Fcaformer: Forward Cross Attention in Hybrid Vision Transformer", ICCV, 2023 (Intellifusion, China). [Paper][PyTorch]
- SMT: "Scale-Aware Modulation Meet Transformer", ICCV, 2023 (Alibaba). [Paper][PyTorch]
- FLatten-Transformer: "FLatten Transformer: Vision Transformer using Focused Linear Attention", ICCV, 2023 (Tsinghua). [Paper][PyTorch]
- Path-Ensemble: "Revisiting Vision Transformer from the View of Path Ensemble", ICCV, 2023 (Alibaba). [Paper]
- SG-Former: "SG-Former: Self-guided Transformer with Evolving Token Reallocation", ICCV, 2023 (NUS). [Paper][PyTorch]
- SimPool: "Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?", ICCV, 2023 (National Technical University of Athens). [Paper]
- LaPE: "LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization", ICCV, 2023 (Peking). [Paper][PyTorch]
- CB: "Scratching Visual Transformer's Back with Uniform Attention", ICCV, 2023 (NAVER). [Paper]
- STL: "Fully Attentional Networks with Self-emerging Token Labeling", ICCV, 2023 (NVIDIA). [Paper][PyTorch]
- ClusterFormer: "ClusterFormer: Clustering As A Universal Visual Learner", NeurIPS, 2023 (Rochester Institute of Technology (RIT)). [Paper]
- SVT: "Scattering Vision Transformer: Spectral Mixing Matters", NeurIPS, 2023 (Microsoft). [Paper][PyTorch][Website]
- CrossFormer++: "CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention", arXiv, 2023 (Zhejiang University). [Paper][PyTorch]
- QFormer: "Vision Transformer with Quadrangle Attention", arXiv, 2023 (The University of Sydney). [Paper][Code (in construction)]
- ViT-Calibrator: "ViT-Calibrator: Decision Stream Calibration for Vision Transformer", arXiv, 2023 (Zhejiang University). [Paper]
- SpectFormer: "SpectFormer: Frequency and Attention is what you need in a Vision Transformer", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
- UniNeXt: "UniNeXt: Exploring A Unified Architecture for Vision Recognition", arXiv, 2023 (Alibaba). [Paper]
- CageViT: "CageViT: Convolutional Activation Guided Efficient Vision Transformer", arXiv, 2023 (Southern University of Science and Technology). [Paper]
- ?: "Making Vision Transformers Truly Shift-Equivariant", arXiv, 2023 (UIUC). [Paper]
- 2-D-SSM: "2-D SSM: A General Spatial Layer for Visual Transformers", arXiv, 2023 (Tel Aviv). [Paper][PyTorch]
- NaViT: "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution", NeurIPS, 2023 (DeepMind). [Paper]
- DAT++: "DAT++: Spatially Dynamic Vision Transformer with Deformable Attention", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
- ?: "Replacing softmax with ReLU in Vision Transformers", arXiv, 2023 (DeepMind). [Paper]
- RMT: "RMT: Retentive Networks Meet Vision Transformers", arXiv, 2023 (CAS). [Paper]
- reg: "Vision Transformers Need Registers", arXiv, 2023 (Meta). [Paper]
- ChannelViT: "Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words", arXiv, 2023 (Insitro, CA). [Paper]
- EViT: "EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention", arXiv, 2023 (Nankai University). [Paper]
- ViR: "ViR: Vision Retention Networks", arXiv, 2023 (NVIDIA). [Paper]
- abs-win: "Window Attention is Bugged: How not to Interpolate Position Embeddings", arXiv, 2023 (Meta). [Paper]
- FMViT: "FMViT: A multiple-frequency mixing Vision Transformer", arXiv, 2023 (Alibaba). [Paper][Code (in construction)]
- GroupMixFormer: "Advancing Vision Transformers with Group-Mix Attention", arXiv, 2023 (HKU). [Paper][PyTorch]
- PGT: "Perceptual Group Tokenizer: Building Perception with Iterative Grouping", arXiv, 2023 (DeepMind). [Paper]
- SCHEME: "SCHEME: Scalable Channel Mixer for Vision Transformers", arXiv, 2023 (UCSD). [Paper]
- Agent-Attention: "Agent Attention: On the Integration of Softmax and Linear Attention", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
- ViTamin: "ViTamin: Designing Scalable Vision Models in the Vision-Language Era", CVPR, 2024 (ByteDance). [Paper][PyTorch]
- HIRI-ViT: "HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs", TPAMI, 2024 (HiDream.ai, China). [Paper]
- SPFormer: "SPFormer: Enhancing Vision Transformer with Superpixel Representation", arXiv, 2024 (JHU). [Paper]
- manifold-K: "A Manifold Representation of the Key in Vision Transformers", arXiv, 2024 (University of Oslo, Norway). [Paper]
- BiXT: "Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers", arXiv, 2024 (University of Melbourne). [Paper]
- VisionLLaMA: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks", arXiv, 2024 (Meituan). [Paper][Code (in construction)]
- xT: "xT: Nested Tokenization for Larger Context in Large Images", arXiv, 2024 (Berkeley). [Paper]
- ACC-ViT: "ACC-ViT: Atrous Convolution's Comeback in Vision Transformers", arXiv, 2024 (Purdue). [Paper]
- ViTAR: "ViTAR: Vision Transformer with Any Resolution", arXiv, 2024 (CAS). [Paper]
- iLLaMA: "Adapting LLaMA Decoder to Vision Transformer", arXiv, 2024 (Shanghai AI Lab). [Paper]
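For orientation, the common starting point of this whole family is the plain ViT: split the image into fixed-size patches, linearly embed them, prepend a learnable [CLS] token, add positional embeddings, and run a standard transformer encoder. A toy-sized sketch (the depth, width, and head count below are arbitrary, not any paper's configuration):

```python
# Toy ViT-style classifier: patchify with a strided conv, prepend a
# learnable [CLS] token, add positional embeddings, run a transformer
# encoder, and classify from the [CLS] output. Sizes are illustrative.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, classes=1000):
        super().__init__()
        n_patches = (img // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                                   # x: (B, 3, H, W)
        t = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls.expand(t.shape[0], -1, -1)
        t = torch.cat([cls, t], dim=1) + self.pos
        return self.head(self.encoder(t)[:, 0])             # predict from [CLS]

print(TinyViT()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```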
Efficient Vision Transformer
- DeiT: "Training data-efficient image transformers & distillation through attention", ICML, 2021 (Facebook). [Paper][PyTorch]
- ConViT: "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases", ICML, 2021 (Facebook). [Paper][Code]
- ?: "Improving the Efficiency of Transformers for Resource-Constrained Devices", DSD, 2021 (NavInfo Europe, Netherlands). [Paper]
- PS-ViT: "Vision Transformer with Progressive Sampling", ICCV, 2021 (CPII). [Paper]
- HVT: "Scalable Visual Transformers with Hierarchical Pooling", ICCV, 2021 (Monash University). [Paper][PyTorch]
- CrossViT: "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification", ICCV, 2021 (MIT-IBM). [Paper][PyTorch]
- ViL: "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding", ICCV, 2021 (Microsoft). [Paper][PyTorch]
- Visformer: "Visformer: The Vision-friendly Transformer", ICCV, 2021 (Beihang University). [Paper][PyTorch]
- MultiExitViT: "Multi-Exit Vision Transformer for Dynamic Inference", BMVC, 2021 (Aarhus University, Denmark). [Paper][Tensorflow]
- SViTE: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration", NeurIPS, 2021 (UT Austin). [Paper][PyTorch]
- DGE: "Dynamic Grained Encoder for Vision Transformers", NeurIPS, 2021 (Megvii). [Paper][PyTorch]
- GG-Transformer: "Glance-and-Gaze Vision Transformer", NeurIPS, 2021 (JHU). [Paper][Code (in construction)]
- DynamicViT: "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
- ResT: "ResT: An Efficient Transformer for Visual Recognition", NeurIPS, 2021 (Nanjing University). [Paper][PyTorch]
- Adder-Transformer: "Adder Attention for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
- SOFT: "SOFT: Softmax-free Transformer with Linear Complexity", NeurIPS, 2021 (Fudan). [Paper][PyTorch][Website]
- IA-RED<sup>2</sup>: "IA-RED<sup>2</sup>: Interpretability-Aware Redundancy Reduction for Vision Transformers", NeurIPS, 2021 (MIT-IBM). [Paper][Website]
- LocalViT: "LocalViT: Bringing Locality to Vision Transformers", arXiv, 2021 (ETHZ). [Paper][PyTorch]
- CCT: "Escaping the Big Data Paradigm with Compact Transformers", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
- DiversePatch: "Vision Transformers with Patch Diversification", arXiv, 2021 (UT Austin + Facebook). [Paper][PyTorch]
- SL-ViT: "Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead", arXiv, 2021 (Aarhus University). [Paper]
- ?: "Multi-Exit Vision Transformer for Dynamic Inference", arXiv, 2021 (Aarhus University, Denmark). [Paper]
- ViX: "Vision Xformers: Efficient Attention for Image Classification", arXiv, 2021 (Indian Institute of Technology Bombay). [Paper]
- Transformer-LS: "Long-Short Transformer: Efficient Transformers for Language and Vision", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
- WideNet: "Go Wider Instead of Deeper", arXiv, 2021 (NUS). [Paper]
- Armour: "Armour: Generalizable Compact Self-Attention for Vision Transformers", arXiv, 2021 (Arm). [Paper]
- IPE: "Exploring and Improving Mobile Level Vision Transformers", arXiv, 2021 (CUHK). [Paper]
- DS-Net++: "DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers", arXiv, 2021 (Monash University). [Paper][PyTorch]
- UFO-ViT: "UFO-ViT: High Performance Linear Vision Transformer without Softmax", arXiv, 2021 (Kakao). [Paper]
- Evo-ViT: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 (Tencent). [Paper][PyTorch]
- PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]
- ShiftViT: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 (Microsoft). [Paper][PyTorch]
- EViT: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 (Tencent). [Paper][PyTorch]
- QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][PyTorch]
- Anti-Oversmoothing: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 (UT Austin). [Paper][PyTorch]
- QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel-Aviv). [Paper][JAX]
- LVT: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 (Adobe). [Paper][PyTorch]
- A-ViT: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 (NVIDIA). [Paper][Website]
- PS-ViT: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 (Huawei). [Paper]
- Rev-MViT: "Reversible Vision Transformers", CVPR, 2022 (Meta). [Paper][PyTorch-1][PyTorch-2]
- AdaViT: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 (Fudan). [Paper]
- DQS: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 (Sorbonne Université, France). [Paper]
- ATS: "Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Website]
- EdgeViT: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 (Samsung). [Paper][PyTorch]
- SReT: "Sliced Recursive Transformer", ECCV, 2022 (CMU + MBZUAI). [Paper][PyTorch]
- SiT: "Self-slimmed Vision Transformer", ECCV, 2022 (SenseTime). [Paper][PyTorch]
- DFvT: "Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022 (Alibaba). [Paper]
- M<sup>3</sup>ViT: "M<sup>3</sup>ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design", NeurIPS, 2022 (UT Austin). [Paper][PyTorch]
- ResT-V2: "ResT V2: Simpler, Faster and Stronger", NeurIPS, 2022 (Nanjing University). [Paper][PyTorch]
- DeiT-Manifold: "Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation", NeurIPS, 2022 (Huawei). [Paper]
- EfficientFormer: "EfficientFormer: Vision Transformers at MobileNet Speed", NeurIPS, 2022 (Snap). [Paper][PyTorch]
- GhostNetV2: "GhostNetV2: Enhance Cheap Operation with Long-Range Attention", NeurIPS, 2022 (Huawei). [Paper][PyTorch]
- ?: "Training a Vision Transformer from scratch in less than 24 hours with 1 GPU", NeurIPSW, 2022 (Borealis AI, Canada). [Paper]
- TerViT: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 (Beihang University). [Paper]
- MT-ViT: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 (Wuhan University). [Paper]
- ViT-P: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 (Chongqing University of Technology). [Paper]
- CF-ViT: "Coarse-to-Fine Vision Transformer", arXiv, 2022 (Xiamen University + Tencent). [Paper][PyTorch]
- EIT: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 (Academy of Military Sciences, China). [Paper]
- SepViT: "SepViT: Separable Vision Transformer", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
- TRT-ViT: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
- SuperViT: "Super Vision Transformer", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
- Tutel: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 (Microsoft). [Paper][PyTorch]
- SimA: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][PyTorch]
- EdgeNeXt: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
- VVT: "Vicinity Vision Transformer", arXiv, 2022 (Australian National University). [Paper][Code (in construction)]
- SOFT: "Softmax-free Linear Transformers", arXiv, 2022 (Fudan). [Paper][PyTorch]
- MaiT: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 (Samsung). [Paper]
- LightViT: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
- Next-ViT: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 (ByteDance). [Paper]
- XFormer: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 (Samsung). [Paper]
- PatchDropout: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 (KTH, Sweden). [Paper]
- ClusTR: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 (The University of Adelaide, Australia). [Paper]
- DiNAT: "Dilated Neighborhood Attention Transformer", arXiv, 2022 (University of Oregon). [Paper][PyTorch]
- MobileViTv3: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 (Micron). [Paper][PyTorch]
- ViT-LSLA: "ViT-LSLA: Vision Transformer with Light Self-Limited-Attention", arXiv, 2022 (Southwest University). [Paper]
- Token-Pooling: "Token Pooling in Vision Transformers for Image Classification", WACV, 2023 (Apple). [Paper]
- Tri-Level: "Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training", AAAI, 2023 (Northeastern University). [Paper][Code (in construction)]
- ViTCoD: "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Georgia Tech). [Paper]
- ViTALiTy: "ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Rice University). [Paper]
- HeatViT: "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Northeastern University). [Paper]
- ToMe: "Token Merging: Your ViT But Faster", ICLR, 2023 (Meta). [Paper][PyTorch]
- HiViT: "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer", ICLR, 2023 (CAS). [Paper][PyTorch]
- STViT: "Making Vision Transformers Efficient from A Token Sparsification View", CVPR, 2023 (Alibaba). [Paper][PyTorch]
- SparseViT: "SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer", CVPR, 2023 (MIT). [Paper][Website]
- Slide-Transformer: "Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention", CVPR, 2023 (Tsinghua University). [Paper][Code (in construction)]
- RIFormer: "RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
- EfficientViT: "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention", CVPR, 2023 (Microsoft). [Paper][PyTorch]
- Castling-ViT: "Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference", CVPR, 2023 (Meta). [Paper]
- ViT-Ti: "RGB no more: Minimally-decoded JPEG Vision Transformers", CVPR, 2023 (UMich). [Paper]
- Sparsifiner: "Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers", CVPR, 2023 (University of Toronto). [Paper]
- ?: "Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers", CVPR, 2023 (Baidu). [Paper]
- LTMP: "Learned Thresholds Token Merging and Pruning for Vision Transformers", ICMLW, 2023 (Ghent University, Belgium). [Paper][PyTorch][Website]
- ReViT: "Make A Long Image Short: Adaptive Token Length for Vision Transformers", ECML PKDD, 2023 (Midea Group, China). [Paper]
- EfficientViT: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", ICCV, 2023 (MIT). [Paper][PyTorch]
- MPCViT: "MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention", ICCV, 2023 (Peking). [Paper][PyTorch]
- MST: "Masked Spiking Transformer", ICCV, 2023 (HKUST). [Paper]
- EfficientFormerV2: "Rethinking Vision Transformers for MobileNet Size and Speed", ICCV, 2023 (Snap). [Paper][PyTorch]
- DiffRate: "DiffRate: Differentiable Compression Rate for Efficient Vision Transformers", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- ElasticViT: "ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices", ICCV, 2023 (Microsoft). [Paper]
- FastViT: "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization", ICCV, 2023 (Apple). [Paper][PyTorch]
- SeiT: "SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage", ICCV, 2023 (NAVER). [Paper][PyTorch]
- TokenReduction: "Which Tokens to Use? Investigating Token Reduction in Vision Transformers", ICCVW, 2023 (Aalborg University, Denmark). [Paper][PyTorch][Website]
- LGViT: "LGViT: Dynamic Early Exiting for Accelerating Vision Transformer", ACMMM, 2023 (Beijing Institute of Technology). [Paper]
- LBP-WHT: "Efficient Low-rank Backpropagation for Vision Transformer Adaptation", NeurIPS, 2023 (UT Austin). [Paper]
- FAT: "Lightweight Vision Transformer with Bidirectional Interaction", NeurIPS, 2023 (CAS). [Paper][PyTorch]
- MCUFormer: "MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory", NeurIPS, 2023 (Tsinghua). [Paper][PyTorch]
- SoViT: "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design", NeurIPS, 2023 (DeepMind). [Paper]
- CloFormer: "Rethinking Local Perception in Lightweight Vision Transformer", arXiv, 2023 (CAS). [Paper]
- Quadformer: "Vision Transformers with Mixed-Resolution Tokenization", arXiv, 2023 (Tel Aviv). [Paper][Code (in construction)]
- SparseFormer: "SparseFormer: Sparse Visual Recognition via Limited Latent Tokens", arXiv, 2023 (NUS). [Paper][Code (in construction)]
- EMO: "Rethinking Mobile Block for Efficient Attention-based Models", arXiv, 2023 (Tencent). [Paper][PyTorch]
- ByteFormer: "Bytes Are All You Need: Transformers Operating Directly On File Bytes", arXiv, 2023 (Apple). [Paper]
- ?: "Muti-Scale And Token Mergence: Make Your ViT More Efficient", arXiv, 2023 (Jilin University). [Paper]
- FasterViT: "FasterViT: Fast Vision Transformers with Hierarchical Attention", arXiv, 2023 (NVIDIA). [Paper]
- NextViT: "Vision Transformer with Attention Map Hallucination and FFN Compaction", arXiv, 2023 (Baidu). [Paper]
- SkipAt: "Skip-Attention: Improving Vision Transformers by Paying Less Attention", arXiv, 2023 (Qualcomm). [Paper]
- MSViT: "MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers", arXiv, 2023 (Qualcomm). [Paper]
- DiT: "DiT: Efficient Vision Transformers with Dynamic Token Routing", arXiv, 2023 (Meituan). [Paper][Code (in construction)]
- ?: "Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers", arXiv, 2023 (German Research Center for Artificial Intelligence (DFKI)). [Paper][PyTorch]
- Mobile-V-MoEs: "Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts", arXiv, 2023 (Apple). [Paper]
- PPT: "PPT: Token Pruning and Pooling for Efficient Vision Transformers", arXiv, 2023 (Huawei). [Paper]
- MatFormer: "MatFormer: Nested Transformer for Elastic Inference", arXiv, 2023 (Google). [Paper]
- SparseFormer: "Bootstrapping SparseFormers from Vision Foundation Models", arXiv, 2023 (NUS). [Paper][PyTorch]
- GTP-ViT: "GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation", WACV, 2024 (CSIRO Data61, Australia). [Paper][PyTorch]
- ToFu: "Token Fusion: Bridging the Gap between Token Pruning and Token Merging", WACV, 2024 (Samsung). [Paper]
- Cached-Transformer: "Cached Transformers: Improving Transformers with Differentiable Memory Cache", AAAI, 2024 (CUHK). [Paper]
- LF-ViT: "LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition", AAAI, 2024 (Harbin Institute of Technology). [Paper][PyTorch]
- EfficientMod: "Efficient Modulation for Vision Networks", ICLR, 2024 (Microsoft). [Paper][PyTorch]
- NOSE: "MLP Can Be A Good Transformer Learner", CVPR, 2024 (MBZUAI). [Paper][PyTorch]
- SLAB: "SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization", ICML, 2024 (Huawei). [Paper][PyTorch]
- S<sup>2</sup>: "When Do We Not Need Larger Vision Models?", arXiv, 2024 (Berkeley). [Paper][PyTorch]
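A large fraction of the efficiency papers above (e.g., DynamicViT, EViT, ToMe, PPT) cut compute by reducing the token count inside the network. As one hedged illustration, here is an EViT-style pruning step that ranks patch tokens by the attention the [CLS] token pays them and keeps only the top fraction; `keep_rate` is an illustrative hyperparameter:

```python
# Sketch of CLS-attention-based token pruning (EViT-style): rank patch
# tokens by the attention the [CLS] token pays them, keep the top fraction.
# `keep_rate` is an illustrative hyperparameter, not a paper's setting.
import torch

def prune_tokens(tokens, cls_attn, keep_rate=0.5):
    """tokens: (B, 1+N, D) with [CLS] first; cls_attn: (B, N) attention
    from [CLS] to each patch token (e.g., averaged over heads)."""
    B, N1, D = tokens.shape
    n_keep = max(1, int((N1 - 1) * keep_rate))
    idx = cls_attn.topk(n_keep, dim=1).indices            # (B, n_keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, D) + 1         # offset past [CLS]
    kept = tokens.gather(1, idx)                          # (B, n_keep, D)
    return torch.cat([tokens[:, :1], kept], dim=1)        # [CLS] + kept tokens

tokens = torch.randn(2, 197, 192)            # [CLS] + 196 patch tokens
cls_attn = torch.rand(2, 196)
print(prune_tokens(tokens, cls_attn).shape)  # torch.Size([2, 99, 192])
```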
Conv + Transformer
- LeViT: "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference", ICCV, 2021 (Facebook). [Paper][PyTorch]
- CeiT: "Incorporating Convolution Designs into Visual Transformers", ICCV, 2021 (SenseTime). [Paper][PyTorch (rishikksh20)]
- Conformer: "Conformer: Local Features Coupling Global Representations for Visual Recognition", ICCV, 2021 (CAS). [Paper][PyTorch]
- CoaT: "Co-Scale Conv-Attentional Image Transformers", ICCV, 2021 (UCSD). [Paper][PyTorch]
- CvT: "CvT: Introducing Convolutions to Vision Transformers", ICCV, 2021 (Microsoft). [Paper][Code]
- ViTc: "Early Convolutions Help Transformers See Better", NeurIPS, 2021 (Facebook). [Paper]
- ConTNet: "ConTNet: Why not use convolution and transformer at the same time?", arXiv, 2021 (ByteDance). [Paper][PyTorch]
- SPACH: "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP", arXiv, 2021 (Microsoft). [Paper]
- MobileViT: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 (Apple). [Paper][PyTorch]
- CMT: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 (Huawei). [Paper]
- Mobile-Former: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch (in construction)]
- TinyViT: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
- CETNet: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 (OPPO). [Paper]
- ParC-Net: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 (Intellifusion, China). [Paper][PyTorch]
- ?: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 (MBZUAI). [Paper][PyTorch]
- DHVT: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 (USTC). [Paper][Code (in construction)]
- iFormer: "Inception Transformer", NeurIPS, 2022 (Sea AI Lab). [Paper][PyTorch]
- DenseDCT: "Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets", NeurIPSW, 2022 (University of Kansas). [Paper]
- CXV: "Convolutional Xformers for Vision", arXiv, 2022 (IIT Bombay). [Paper][PyTorch]
- ConvMixer: "Patches Are All You Need?", arXiv, 2022 (CMU). [Paper][PyTorch]
- MobileViTv2: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 (Apple). [Paper][PyTorch]
- UniFormer: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 (SenseTime). [Paper][PyTorch]
- EdgeFormer: "EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers", arXiv, 2022 (?). [Paper]
- MoCoViT: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
- DynamicViT: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
- ConvFormer: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 (National University of Defense Technology, China). [Paper]
- Fast-ParC: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 (Intellifusion, China). [Paper]
- MetaFormer: "MetaFormer Baselines for Vision", arXiv, 2022 (Sea AI Lab). [Paper][PyTorch]
- STM: "Demystify Transformers & Convolutions in Modern Image Deep Networks", arXiv, 2022 (Tsinghua University). [Paper][Code (in construction)]
- ParCNetV2: "ParCNetV2: Oversized Kernel with Enhanced Attention", arXiv, 2022 (Intellifusion, China). [Paper]
- VAN: "Visual Attention Network", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
- SD-MAE: "Masked autoencoders is an effective solution to transformer data-hungry", arXiv, 2022 (Hangzhou Dianzi University). [Paper][PyTorch (in construction)]
- SATA: "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 (University of Kansas). [Paper][PyTorch (in construction)]
- SparK: "Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning", ICLR, 2023 (ByteDance). [Paper][PyTorch]
- MOAT: "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", ICLR, 2023 (Google). [Paper][Tensorflow]
- InternImage: "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions", CVPR, 2023 (Shanghai AI Laboratory). [Paper][PyTorch]
- SwiftFormer: "SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications", ICCV, 2023 (MBZUAI). [Paper][PyTorch]
- SCSC: "SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers", ICCVW, 2023 (Megvii). [Paper]
- PSLT: "PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift", TPAMI, 2023 (Sun Yat-sen University). [Paper][Website]
- RepViT: "RepViT: Revisiting Mobile CNN From ViT Perspective", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
- ?: "Interpret Vision Transformers as ConvNets with Dynamic Convolutions", arXiv, 2023 (NTU, Singapore). [Paper]
- UPDP: "UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer", AAAI, 2024 (AMD). [Paper]
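The designs above interleave convolution (for cheap local mixing) with transformer layers (for global mixing). A generic sketch of that recipe, loosely in the spirit of MobileViT/CMT-style blocks but not reproducing any specific paper's design:

```python
# Generic conv + transformer hybrid block: a depthwise convolution handles
# local mixing, a transformer encoder layer handles global mixing. The
# composition and sizes are illustrative, not any specific paper's design.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise conv
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        self.global_mix = nn.TransformerEncoderLayer(
            dim, heads, dim * 2, batch_first=True, norm_first=True)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        x = x + self.local(x)                    # local features, residual
        t = x.flatten(2).transpose(1, 2)         # (B, H*W, C) token view
        t = self.global_mix(t)                   # global token mixing
        return t.transpose(1, 2).reshape(B, C, H, W)

x = torch.randn(1, 64, 14, 14)
print(HybridBlock()(x).shape)  # torch.Size([1, 64, 14, 14])
```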
Training + Transformer
- iGPT: "Generative Pretraining From Pixels", ICML, 2020 (OpenAI). [Paper][Tensorflow]
- CLIP: "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 (OpenAI). [Paper][PyTorch]
- MoCo-V3: "An Empirical Study of Training Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper]
- DINO: "Emerging Properties in Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
- drloc: "Efficient Training of Visual Transformers with Small Datasets", NeurIPS, 2021 (University of Trento). [Paper][PyTorch]
- CARE: "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning", NeurIPS, 2021 (Tencent). [Paper][PyTorch]
- MST: "MST: Masked Self-Supervised Transformer for Visual Representation", NeurIPS, 2021 (SenseTime). [Paper]
- SiT: "SiT: Self-supervised Vision Transformer", arXiv, 2021 (University of Surrey). [Paper][PyTorch]
- MoBY: "Self-Supervised Learning with Swin Transformers", arXiv, 2021 (Microsoft). [Paper][PyTorch]
- ?: "Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block", arXiv, 2021 (Pune Institute of Computer Technology, India). [Paper]
- Annotations-1.3B: "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 (Pinterest). [Paper]
- BEiT: "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 (Microsoft). [Paper][PyTorch]
- EsViT: "Efficient Self-supervised Vision Transformers for Representation Learning", ICLR, 2022 (Microsoft). [Paper]
- iBOT: "Image BERT Pre-training with Online Tokenizer", ICLR, 2022 (ByteDance). [Paper][PyTorch]
- MaskFeat: "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 (Facebook). [Paper]
- AutoProg: "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 (Monash University, Australia). [Paper][Code (in construction)]
- MAE: "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 (Facebook). [Paper][PyTorch][PyTorch (pengzhiliang)]
- SimMIM: "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- SelfPatch: "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 (KAIST). [Paper][PyTorch]
- Bootstrapping-ViTs: "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
- TransMix: "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 (JHU). [Paper][PyTorch]
- PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 (Arizona State). [Paper]
- SplitMask: "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 (Meta). [Paper]
- MC-SSL: "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 (University of Surrey, UK). [Paper]
- RelViT: "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 (University of Padova, Italy). [Paper]
- data2vec: "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language", ICML, 2022 (Meta). [Paper][PyTorch]
- SSTA: "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 (Tencent). [Paper][Code (in construction)]
- MP3: "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 (Apple). [Paper][PyTorch]
- CutMixSL: "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 (Yonsei University, Korea). [Paper]
- BootMAE: "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 (Microsoft). [Paper][PyTorch]
- TokenMix: "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 (CUHK). [Paper][PyTorch]
- ?: "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 (Peking University). [Paper][PyTorch]
- HAT: "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 (Tsinghua). [Paper][PyTorch]
- IDMM: "Training Vision Transformers with Only 2040 Images", ECCV, 2022 (Nanjing University). [Paper]
- AttMask: "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 (National Technical University of Athens). [Paper][PyTorch]
- SLIP: "SLIP: Self-supervision meets Language-Image Pre-training", ECCV, 2022 (Berkeley + Meta). [Paper][PyTorch]
- mc-BEiT: "mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training", ECCV, 2022 (Peking University). [Paper]
- SL2O: "Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models", ECCV, 2022 (UT Austin). [Paper][PyTorch]
- TokenMixup: "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 (Korea University). [Paper][PyTorch]
- PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", NeurIPSW, 2022 (Arizona State University). [Paper]
- GreenMIM: "Green Hierarchical Vision Transformer for Masked Image Modeling", NeurIPS, 2022 (The University of Tokyo). [Paper][PyTorch]
- DP-CutMix: "Differentially Private CutMix for Split Learning with Vision Transformer", NeurIPSW, 2022 (Yonsei University). [Paper]
- ?: "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 (Google). [Paper][Tensorflow][PyTorch (rwightman)]
- PeCo: "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
- RePre: "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
- Beyond-Masking: "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 (CAS). [Paper][Code (in construction)]
- Kronecker-Adaptation: "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
- DILEMMA: "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 (University of Bern, Switzerland). [Paper]
- DeiT-III: "DeiT III: Revenge of the ViT", arXiv, 2022 (Meta). [Paper]
- ?: "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 (Google). [Paper][Tensorflow]
- ConvMAE: "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 (Shanghai AI Laboratory). [Paper][PyTorch (in construction)]
- UM-MAE: "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
- GMML: "GMML is All you Need", arXiv, 2022 (University of Surrey, UK). [Paper][PyTorch]
- SIM: "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 (SenseTime). [Paper]
- SupMAE: "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 (UT Austin). [Paper][PyTorch]
- LoMaR: "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 (KAUST). [Paper]
- SAR: "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 (University of Trento, Italy). [Paper]
- ExtreMA: "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 (Microsoft). [Paper]
- ?: "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 (Nankai University). [Paper]
- ?: "Position Labels for Self-Supervised Vision Transformer", arXiv, 2022 (Southwest Jiaotong University). [Paper]
- Jigsaw-ViT: "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 (KU Leuven, Belgium). [Paper][PyTorch][Website]
- BEiT-v2: "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 (Microsoft). [Paper][PyTorch]
- MILAN: "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 (Princeton). [Paper][PyTorch (in construction)]
- PSS: "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 (Franklin and Marshall College, Pennsylvania). [Paper][PyTorch]
- dBOT: "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 (ByteDance). [Paper]
- PatchErasing: "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 (Alibaba). [Paper]
- Self-Distillation: "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 (KAIST). [Paper]
- AutoView: "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 (Sun Yat-sen University). [Paper][Code (in construction)]
- LOCA: "Location-Aware Self-Supervised Transformers", arXiv, 2022 (Google). [Paper]
- FT-CLIP: "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
- MixPro: "MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer", ICLR, 2023 (Beijing University of Chemical Technology). [Paper][PyTorch (in construction)]
- ConMIM: "Masked Image Modeling with Denoising Contrast", ICLR, 2023 (Tencent). [Paper][PyTorch]
- ccMIM: "Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining", ICLR, 2023 (Shanghai Jiao Tong). [Paper]
- CIM: "Corrupted Image Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (Microsoft). [Paper]
- MFM: "Masked Frequency Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (NTU, Singapore). [Paper][Website]
- Mask3D: "Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors", CVPR, 2023 (Meta). [Paper]
- VisualAtom: "Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves", CVPR, 2023 (National Institute of Advanced Industrial Science and Technology (AIST), Japan). [Paper][PyTorch][Website]
- MixedAE: "Mixed Autoencoder for Self-supervised Visual Representation Learning", CVPR, 2023 (Huawei). [Paper]
- TBM: "Token Boosting for Robust Self-Supervised Visual Transformer Pre-training", CVPR, 2023 (Singapore University of Technology and Design). [Paper]
- LGSimCLR: "Learning Visual Representations via Language-Guided Sampling", CVPR, 2023 (UMich). [Paper][PyTorch]
- DisCo-CLIP: "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training", CVPR, 2023 (IDEA). [Paper][PyTorch (in construction)]
- MaskCLIP: "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
- MAGE: "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis", CVPR, 2023 (Google). [Paper][PyTorch]
- MixMIM: "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", CVPR, 2023 (SenseTime). [Paper][PyTorch]
- iTPN: "Integrally Pre-Trained Transformer Pyramid Networks", CVPR, 2023 (CAS). [Paper][PyTorch]
- DropKey: "DropKey for Vision Transformer", CVPR, 2023 (Meitu). [Paper]
- FlexiViT: "FlexiViT: One Model for All Patch Sizes", CVPR, 2023 (Google). [Paper][Tensorflow]
- RA-CLIP: "RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training", CVPR, 2023 (Alibaba). [Paper]
- CLIPPO: "CLIPPO: Image-and-Language Understanding from Pixels Only", CVPR, 2023 (Google). [Paper][JAX]
- DMAE: "Masked Autoencoders Enable Efficient Knowledge Distillers", CVPR, 2023 (JHU + UC Santa Cruz). [Paper][PyTorch]
- HPM: "Hard Patches Mining for Masked Image Modeling", CVPR, 2023 (CAS). [Paper][PyTorch]
- LocalMIM: "Masked Image Modeling with Local Multi-Scale Reconstruction", CVPR, 2023 (Peking University). [Paper]
- MaskAlign: "Stare at What You See: Masked Image Modeling without Reconstruction", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- RILS: "RILS: Masked Visual Reconstruction in Language Semantic Space", CVPR, 2023 (Tencent). [Paper][Code (in construction)]
- RelaxMIM: "Understanding Masked Image Modeling via Learning Occlusion Invariant Feature", CVPR, 2023 (Megvii). [Paper]
- FDT: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
- ?: "Prefix Conditioning Unifies Language and Label Supervision", CVPR, 2023 (Google). [Paper]
- OpenCLIP: "Reproducible scaling laws for contrastive language-image learning", CVPR, 2023 (LAION). [Paper][PyTorch]
- DiHT: "Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training", CVPR, 2023 (Meta). [Paper][PyTorch]
- M3I-Pretraining: "Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information", CVPR, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- SN-Net: "Stitchable Neural Networks", CVPR, 2023 (Monash University). [Paper][PyTorch]
- MAE-Lite: "A Closer Look at Self-supervised Lightweight Vision Transformers", ICML, 2023 (Megvii). [Paper][PyTorch]
- ViT-22B: "Scaling Vision Transformers to 22 Billion Parameters", ICML, 2023 (Google). [Paper]
- GHN-3: "Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?", ICML, 2023 (Samsung). [Paper][PyTorch]
- A<sup>2</sup>MIM: "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", ICML, 2023 (Westlake University, China). [Paper][PyTorch]
- PQCL: "Patch-level Contrastive Learning via Positional Query for Visual Pre-training", ICML, 2023 (Alibaba). [Paper][PyTorch]
- DreamTeacher: "DreamTeacher: Pretraining Image Backbones with Deep Generative Models", ICCV, 2023 (NVIDIA). [Paper][Website]
- OFDB: "Pre-training Vision Transformers with Very Limited Synthesized Images", ICCV, 2023 (National Institute of Advanced Industrial Science and Technology (AIST), Japan). [Paper][PyTorch]
- MFF: "Improving Pixel-based MIM by Reducing Wasted Modeling Capability", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- TL-Align: "Token-Label Alignment for Vision Transformers", ICCV, 2023 (Tsinghua University). [Paper][PyTorch]
- SMMix: "SMMix: Self-Motivated Image Mixing for Vision Transformers", ICCV, 2023 (Xiamen University). [Paper][PyTorch]
- DiffMAE: "Diffusion Models as Masked Autoencoders", ICCV, 2023 (Meta). [Paper][Website]
- MAWS: "The effectiveness of MAE pre-pretraining for billion-scale pretraining", ICCV, 2023 (Meta). [Paper][PyTorch]
- CountBench: "Teaching CLIP to Count to Ten", ICCV, 2023 (Google). [Paper]
- CLIPpy: "Perceptual Grouping in Vision-Language Models", ICCV, 2023 (Apple). [Paper]
- CiT: "CiT: Curation in Training for Effective Vision-Language Data", ICCV, 2023 (Meta). [Paper][PyTorch]
- I-JEPA: "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", ICCV, 2023 (Meta). [Paper]
- EfficientTrain: "EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones", ICCV, 2023 (Tsinghua). [Paper][PyTorch]
- StableRep: "StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners", NeurIPS, 2023 (Google). [Paper][PyTorch]
- LaCLIP: "Improving CLIP Training with Language Rewrites", NeurIPS, 2023 (Google). [Paper][PyTorch]
- DesCo: "DesCo: Learning Object Recognition with Rich Language Descriptions", NeurIPS, 2023 (UCLA). [Paper]
- ?: "Stable and low-precision training for large-scale vision-language models", NeurIPS, 2023 (UW). [Paper]
- CapPa: "Image Captioners Are Scalable Vision Learners Too", NeurIPS, 2023 (DeepMind). [Paper][JAX]
- IV-CL: "Does Visual Pretraining Help End-to-End Reasoning?", NeurIPS, 2023 (Google). [Paper]
- CLIPA: "An Inverse Scaling Law for CLIP Training", NeurIPS, 2023 (UC Santa Cruz). [Paper][PyTorch]
- Hummingbird: "Towards In-context Scene Understanding", NeurIPS, 2023 (DeepMind). [Paper]
- RevColV2: "RevColV2: Exploring Disentangled Representations in Masked Image Modeling", NeurIPS, 2023 (Megvii). [Paper][PyTorch]
- ALIA: "Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation", NeurIPS, 2023 (Berkeley). [Paper][PyTorch]
- ?: "Improving Multimodal Datasets with Image Captioning", NeurIPS (Datasets and Benchmarks), 2023 (UW). [Paper]
- CCViT: "Centroid-centered Modeling for Efficient Vision Transformer Pre-training", arXiv, 2023 (Wuhan University). [Paper]
- SoftCLIP: "SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger", arXiv, 2023 (Tencent). [Paper]
- RECLIP: "RECLIP: Resource-efficient CLIP by Training with Small Images", arXiv, 2023 (Google). [Paper]
- DINOv2: "DINOv2: Learning Robust Visual Features without Supervision", arXiv, 2023 (Meta). [Paper]
- ?: "Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations", arXiv, 2023 (Meta). [Paper]
- Filter: "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness", arXiv, 2023 (Apple). [Paper]
- ?: "Improved baselines for vision-language pre-training", arXiv, 2023 (Meta). [Paper]
- 3T: "Three Towers: Flexible Contrastive Learning with Pretrained Image Models", arXiv, 2023 (Google). [Paper]
- ADDP: "ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process", arXiv, 2023 (CUHK + Tsinghua). [Paper]
- MOFI: "MOFI: Learning Image Representations from Noisy Entity Annotated Images", arXiv, 2023 (Apple). [Paper]
- MaPeT: "Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training", arXiv, 2023 (UniMoRE, Italy). [Paper][PyTorch]
- RECO: "Retrieval-Enhanced Contrastive Vision-Text Models", arXiv, 2023 (Google). [Paper]
- CLIPA-v2: "CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy", arXiv, 2023 (UC Santa Cruz). [Paper][PyTorch]
- PatchMixing: "Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing", arXiv, 2023 (Boston). [Paper][Website]
- SN-Netv2: "Stitched ViTs are Flexible Vision Backbones", arXiv, 2023 (Monash University). [Paper][PyTorch (in construction)]
- CLIP-GPT: "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts", arXiv, 2023 (Dublin City University, Ireland). [Paper]
- FlexPredict: "Predicting masked tokens in stochastic locations improves masked image modeling", arXiv, 2023 (Meta). [Paper]
- Soft-MoE: "From Sparse to Soft Mixtures of Experts", arXiv, 2023 (DeepMind). [Paper]
- DropPos: "DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions", NeurIPS, 2023 (CAS). [Paper][PyTorch]
- MIRL: "Masked Image Residual Learning for Scaling Deeper Vision Transformers", NeurIPS, 2023 (Baidu). [Paper]
- CMM: "Investigating the Limitation of CLIP Models: The Worst-Performing Categories", arXiv, 2023 (Nanjing University). [Paper]
- LC-MAE: "Longer-range Contextualized Masked Autoencoder", arXiv, 2023 (NAVER). [Paper]
- SILC: "SILC: Improving Vision Language Pretraining with Self-Distillation", arXiv, 2023 (ETHZ). [Paper]
- CLIPTex: "CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement", arXiv, 2023 (Apple). [Paper]
- NxTP: "Object Recognition as Next Token Prediction", arXiv, 2023 (Meta). [Paper][PyTorch]
- ?: "Scaling Laws of Synthetic Images for Model Training ... for Now", arXiv, 2023 (Google). [Paper][PyTorch]
- SynCLR: "Learning Vision from Models Rivals Learning Vision from Data", arXiv, 2023 (Google). [Paper][PyTorch]
- EWA: "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 (Fudan). [Paper]
- DTM: "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 (NAVER). [Paper]
- SSAT: "Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders", WACV, 2024 (UNC Charlotte). [Paper][Code (in construction)]
- FEC: "Neural Clustering based Visual Representation Learning", CVPR, 2024 (Zhejiang). [Paper]
- EfficientTrain++: "EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training", TPAMI, 2024 (Tsinghua). [Paper][PyTorch]
- DVT: "Denoising Vision Transformers", arXiv, 2024 (USC). [Paper][PyTorch][Website]
- AIM: "Scalable Pre-training of Large Autoregressive Image Models", arXiv, 2024 (Apple). [Paper][PyTorch]
- DDM: "Deconstructing Denoising Diffusion Models for Self-Supervised Learning", arXiv, 2024 (Meta). [Paper]
- CrossMAE: "Rethinking Patch Dependence for Masked Autoencoders", arXiv, 2024 (Berkeley). [Paper][PyTorch][Website]
- IWM: "Learning and Leveraging World Models in Visual Representation Learning", arXiv, 2024 (Meta). [Paper]
- ?: "Can Generative Models Improve Self-Supervised Representation Learning?", arXiv, 2024 (Vector Institute). [Paper]
Robustness + Transformer
- ViT-Robustness: "Understanding Robustness of Transformers for Image Classification", ICCV, 2021 (Google). [Paper]
- SAGA: "On the Robustness of Vision Transformers to Adversarial Examples", ICCV, 2021 (University of Connecticut). [Paper] (see the FGSM sketch at the end of this section)
- ?: "Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs", BMVC, 2021 (KAIST). [Paper][PyTorch]
- ViTs-vs-CNNs: "Are Transformers More Robust Than CNNs?", NeurIPS, 2021 (JHU + UC Santa Cruz). [Paper][PyTorch]
- T-CNN: "Transformed CNNs: recasting pre-trained convolutional layers with self-attention", arXiv, 2021 (Facebook). [Paper]
- Transformer-Attack: "On the Adversarial Robustness of Visual Transformers", arXiv, 2021 (Xi'an Jiaotong). [Paper]
- ?: "Reveal of Vision Transformers Robustness against Adversarial Attacks", arXiv, 2021 (University of Rennes). [Paper]
- ?: "On Improving Adversarial Transferability of Vision Transformers", arXiv, 2021 (ANU). [Paper][PyTorch]
- ?: "Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers", arXiv, 2021 (University of Pittsburgh). [Paper]
- Token-Attack: "Adversarial Token Attacks on Vision Transformers", arXiv, 2021 (New York University). [Paper]
- ?: "Discrete Representations Strengthen Vision Transformer Robustness", arXiv, 2021 (Google). [Paper]
- ?: "Vision Transformers are Robust Learners", AAAI, 2022 (PyImageSearch + IBM). [Paper][Tensorflow]
- PNA: "Towards Transferable Adversarial Attacks on Vision Transformers", AAAI, 2022 (Fudan + Maryland). [Paper][PyTorch]
- MIA-Former: "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 (Rice University). [Paper]
- Patch-Fool: "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 (Rice University). [Paper][PyTorch]
- Generalization-Enhanced-ViT: "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 (Beihang University + NTU, Singapore). [Paper]
- ECViT: "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 (Tencent). [Paper]
- Attention-Fool: "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 (Bosch). [Paper]
- Memory-Token: "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 (Google). [Paper]
- APRIL: "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 (CAS). [Paper]
- Smooth-ViT: "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 (MIT). [Paper][PyTorch]
- RVT: "Towards Robust Vision Transformer", CVPR, 2022 (Alibaba). [Paper][PyTorch]
- Pyramid: "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 (Google). [Paper]
- VARS: "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 (Berkeley + Microsoft). [Paper][PyTorch]
- FAN: "Understanding The Robustness in Vision Transformers", ICML, 2022 (NVIDIA). [Paper][PyTorch]
- CFA: "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 (The University of Tokyo). [Paper][PyTorch]
- ?: "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 (University of Exeter, UK). [Paper][PyTorch]
- ?: "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 (Oxford). [Paper]
- AGAT: "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 (Zhejiang University). [Paper]
- ?: "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 (TUM). [Paper]
- ViP: "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 (UC Santa Cruz). [Paper][PyTorch]
- ?: "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
- PAR: "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal", NeurIPS, 2022 (Tianjin University). [Paper]
- RobustViT: "Optimizing Relevance Maps of Vision Transformers Improves Robustness", NeurIPS, 2022 (Tel-Aviv). [Paper][PyTorch]
- ?: "Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation", NeurIPS, 2022 (Google). [Paper]
- NVD: "Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing", NeurIPS, 2022 (Boston). [Paper]
- ?: "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 (UW-Madison). [Paper]
- MA: "Boosting Adversarial Transferability of MLP-Mixer", arXiv, 2022 (Beijing Institute of Technology). [Paper]
- ?: "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 (Fudan + Microsoft). [Paper]
- ?: "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
- FedWAvg: "Federated Adversarial Training with Transformers", arXiv, 2022 (Institute of Electronics and Digital Technologies (IETR), France). [Paper]
- Backdoor-Transformer: "Backdoor Attacks on Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][Code (in construction)]
- ?: "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 (Baidu). [Paper]
- ?: "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
- ?: "Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 (Yonsei University). [Paper]
- CLIPping Privacy: "CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models", arXiv, 2022 (TUM). [Paper]
- ?: "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 (EPFL). [Paper]
- ?: "Attacking Compressed Vision Transformers", arXiv, 2022 (NYU). [Paper]
- C-AVP: "Visual Prompting for Adversarial Robustness", arXiv, 2022 (Michigan State). [Paper]
- ?: "Curved Representation Space of Vision Transformers", arXiv, 2022 (Yonsei University). [Paper]
- RKDE: "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 (UT Austin). [Paper]
- MRAP: "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 (Arizona State University). [Paper]
- model-soup: "Revisiting adapters with adversarial training", ICLR, 2023 (DeepMind). [Paper]
- ?: "Budgeted Training for Vision Transformer", ICLR, 2023 (Tsinghua). [Paper]
- RobustCNN: "Can CNNs Be More Robust Than Transformers?", ICLR, 2023 (UC Santa Cruz + JHU). [Paper][PyTorch]
- DMAE: "Denoising Masked AutoEncoders are Certifiable Robust Vision Learners", ICLR, 2023 (Peking). [Paper][PyTorch]
- TGR: "Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization", CVPR, 2023 (CUHK). [Paper][PyTorch]
- TrojViT: "TrojViT: Trojan Insertion in Vision Transformers", CVPR, 2023 (Indiana University Bloomington). [Paper]
- RSPC: "Improving Robustness of Vision Transformers by Reducing Sensitivity to Patch Corruptions", CVPR, 2023 (MPI). [Paper]
- TORA-ViT: "Trade-off between Robustness and Accuracy of Vision Transformers", CVPR, 2023 (The University of Sydney). [Paper]
- BadViT: "You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?", CVPR, 2023 (Huazhong University of Science and Technology). [Paper]
- ?: "Understanding and Defending Patched-based Adversarial Attacks for Vision Transformer", ICML, 2023 (University of Pittsburgh). [Paper]
- RobustMAE: "Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting", ICCV, 2023 (USTC). [Paper][PyTorch (in construction)]
- ?: "Efficiently Robustify Pre-trained Models", ICCV, 2023 (IIT Roorkee, India). [Paper]
- ?: "Transferable Adversarial Attack for Both Vision Transformers and Convolutional Networks via Momentum Integrated Gradients", ICCV, 2023 (Tsinghua). [Paper]
- CleanCLIP: "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning", ICCV, 2023 (UCLA). [Paper][PyTorch]
- QBBA: "Exploring Non-additive Randomness on ViT against Query-Based Black-Box Attacks", BMVC, 2023 (Oxford). [Paper]
- RBFormer: "RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias", BMVC, 2023 (HKUST). [Paper]
- PreLayerNorm: "Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding", PR, 2023 (POSTECH). [Paper]
- CertViT: "CertViT: Certified Robustness of Pre-Trained Vision Transformers", arXiv, 2023 (INRIA). [Paper][PyTorch]
- RoCLIP: "Robust Contrastive Language-Image Pretraining against Adversarial Attacks", arXiv, 2023 (UCLA). [Paper]
- DeepMIM: "DeepMIM: Deep Supervision for Masked Image Modeling", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
- TAP-ADL: "Robustifying Token Attention for Vision Transformers", ICCV, 2023 (MPI). [Paper][PyTorch]
- EWA: "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 (Fudan). [Paper]
- SlowFormer: "SlowFormer: Universal Adversarial Patch for Attack on Compute and Energy Efficiency of Inference Efficient Vision Transformers", arXiv, 2023 (UC Davis). [Paper][PyTorch]
- DTM: "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 (NAVER). [Paper]
- SWARM: "Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers", CVPR, 2024 (Zhejiang). [Paper][Code (in construction)]
- ?: "Safety of Multimodal Large Language Models on Images and Text", arXiv, 2024 (Shanghai AI Lab). [Paper]
Model Compression + Transformer
- ViT-quant: "Post-Training Quantization for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper] (see the quantization sketch at the end of this section)
- VTP: "Visual Transformer Pruning", arXiv, 2021 (Huawei). [Paper]
- MD-ViT: "Multi-Dimensional Model Compression of Vision Transformer", arXiv, 2021 (Princeton). [Paper]
- FQ-ViT: "FQ-ViT: Fully Quantized Vision Transformer without Retraining", arXiv, 2021 (Megvii). [Paper][PyTorch]
- UVC: "Unified Visual Transformer Compression", ICLR, 2022 (UT Austin). [Paper][PyTorch]
- MiniViT: "MiniViT: Compressing Vision Transformers with Weight Multiplexing", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- Auto-ViT-Acc: "Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization", International Conference on Field Programmable Logic and Applications (FPL), 2022 (Northeastern University). [Paper]
- APQ-ViT: "Towards Accurate Post-Training Quantization for Vision Transformer", ACMMM, 2022 (Beihang University). [Paper]
- SPViT: "SPViT: Enabling Faster Vision Transformers via Soft Token Pruning", ECCV, 2022 (Northeastern University). [Paper][PyTorch]
- PSAQ-ViT: "Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022 (CAS). [Paper][PyTorch]
- PTQ4ViT: "PTQ4ViT: Post-Training Quantization Framework for Vision Transformers", ECCV, 2022 (Peking University). [Paper]
- EAPruning: "EAPruning: Evolutionary Pruning for Vision Transformers and CNNs", BMVC, 2022 (Meituan). [Paper]
- Q-ViT: "Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022 (Beihang University). [Paper][PyTorch]
- SAViT: "SAViT: Structure-Aware Vision Transformer Pruning via Collaborative Optimization", NeurIPS, 2022 (Hikvision). [Paper]
- VTC-LFC: "VTC-LFC: Vision Transformer Compression with Low-Frequency Components", NeurIPS, 2022 (Alibaba). [Paper][PyTorch]
- Q-ViT: "Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022 (Megvii). [Paper]
- VAQF: "VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer", arXiv, 2022 (Northeastern University). [Paper]
- VTP: "Vision Transformer Compression with Structured Pruning and Low Rank Approximation", arXiv, 2022 (UCLA). [Paper]
- SiDT: "Searching Intrinsic Dimensions of Vision Transformers", arXiv, 2022 (UC Irvine). [Paper]
- PSAQ-ViT-V2: "PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", arXiv, 2022 (CAS). [Paper][PyTorch]
- AS: "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention", arXiv, 2022 (Baidu). [Paper]
- SaiT: "SaiT: Sparse Vision Transformers through Adaptive Token Pruning", arXiv, 2022 (Samsung). [Paper]
- oViT: "oViT: An Accurate Second-Order Pruning Framework for Vision Transformers", arXiv, 2022 (IST Austria). [Paper]
- CPT-V: "CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers", arXiv, 2022 (UT Austin). [Paper]
- TPS: "Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers", CVPR, 2023 (Megvii). [Paper][PyTorch]
- GPUSQ-ViT: "Boost Vision Transformer with GPU-Friendly Sparsity and Quantization", CVPR, 2023 (Fudan). [Paper]
- X-Pruner: "X-Pruner: eXplainable Pruning for Vision Transformers", CVPR, 2023 (James Cook University, Australia). [Paper][PyTorch (in construction)]
- NoisyQuant: "NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers", CVPR, 2023 (Nanjing University). [Paper]
- NViT: "Global Vision Transformer Pruning with Hessian-Aware Saliency", CVPR, 2023 (NVIDIA). [Paper]
- BinaryViT: "BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models", CVPRW, 2023 (Huawei). [Paper][PyTorch]
- OFQ: "Oscillation-free Quantization for Low-bit Vision Transformers", ICML, 2023 (HKUST). [Paper][PyTorch]
- UPop: "UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers", ICML, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- COMCAT: "COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models", ICML, 2023 (Rutgers). [Paper][PyTorch]
- Evol-Q: "Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers", ICCV, 2023 (UT Austin). [Paper][Code (in construction)]
- BiViT: "BiViT: Extremely Compressed Binary Vision Transformer", ICCV, 2023 (Zhejiang University). [Paper]
- I-ViT: "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", ICCV, 2023 (CAS). [Paper][PyTorch]
- RepQ-ViT: "RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers", ICCV, 2023 (CAS). [Paper][PyTorch]
- LLM-FP4: "LLM-FP4: 4-Bit Floating-Point Quantized Transformers", EMNLP, 2023 (HKUST). [Paper][Code (in construction)]
- Q-HyViT: "Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction", arXiv, 2023 (Electronics and Telecommunications Research Institute (ETRI), Korea). [Paper]
- Bi-ViT: "Bi-ViT: Pushing the Limit of Vision Transformer Quantization", arXiv, 2023 (Beihang University). [Paper]
- BinaryViT: "BinaryViT: Towards Efficient and Accurate Binary Vision Transformers", arXiv, 2023 (CAS). [Paper]
- Zero-TP: "Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers", arXiv, 2023 (Princeton). [Paper]
- ?: "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing", arXiv, 2023 (Qualcomm). [Paper]
- VVTQ: "Variation-aware Vision Transformer Quantization", arXiv, 2023 (HKUST). [Paper][PyTorch]
- DIMAP: "Data-independent Module-aware Pruning for Hierarchical Vision Transformers", ICLR, 2024 (A*STAR). [Paper][Code (in construction)]
- MADTP: "MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer", CVPR, 2024 (Fudan). [Paper][Code (in construction)]
- DC-ViT: "Dense Vision Transformer Compression with Few Samples", CVPR, 2024 (Nanjing University). [Paper]
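The post-training quantization entries above (FQ-ViT, PTQ4ViT, RepQ-ViT, ...) all start from the same primitive: map float tensors to low-bit integers with a scale calibrated from observed ranges, then study where naive calibration fails for ViTs (e.g., post-softmax or post-GELU activations). A minimal symmetric per-tensor quantizer, as a sketch of that primitive:

```python
# Minimal symmetric per-tensor quantize/dequantize (the primitive behind PTQ methods; sketch only).
import torch

def quantize_symmetric(w, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1                       # e.g., 127 for int8
    scale = w.abs().max() / qmax                       # calibrate the scale from the tensor range
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q.to(torch.int8), scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(768, 768)                              # e.g., a ViT linear-layer weight
q, scale = quantize_symmetric(w)
err = (dequantize(q, scale) - w).abs().mean()
print(f"mean abs quantization error: {err:.5f}")
```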
Attention-Free
MLP-Series
- RepMLP: "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition", arXiv, 2021 (Megvii). [Paper][PyTorch]
- EAMLP: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 (Tsinghua University). [Paper]
- Forward-Only: "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet", arXiv, 2021 (Oxford). [Paper][PyTorch]
- ResMLP: "ResMLP: Feedforward networks for image classification with data-efficient training", arXiv, 2021 (Facebook). [Paper]
- ?: "Can Attention Enable MLPs To Catch Up With CNNs?", arXiv, 2021 (Tsinghua). [Paper]
- ViP: "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
- CCS: "Rethinking Token-Mixing MLP for MLP-based Vision Backbone", arXiv, 2021 (Baidu). [Paper]
- S<sup>2</sup>-MLPv2: "S<sup>2</sup>-MLPv2: Improved Spatial-Shift MLP Architecture for Vision", arXiv, 2021 (Baidu). [Paper]
- RaftMLP: "RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?", arXiv, 2021 (Rikkyo University, Japan). [Paper][PyTorch]
- Hire-MLP: "Hire-MLP: Vision MLP via Hierarchical Rearrangement", arXiv, 2021 (Huawei). [Paper]
- Sparse-MLP: "Sparse-MLP: A Fully-MLP Architecture with Conditional Computation", arXiv, 2021 (NUS). [Paper]
- ConvMLP: "ConvMLP: Hierarchical Convolutional MLPs for Vision", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
- sMLP: "Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?", arXiv, 2021 (Microsoft). [Paper]
- MLP-Mixer: "MLP-Mixer: An all-MLP Architecture for Vision", NeurIPS, 2021 (Google). [Paper][Tensorflow][PyTorch-1 (lucidrains)][PyTorch-2 (rishikksh20)] (see the Mixer-block sketch at the end of this subsection)
- gMLP: "Pay Attention to MLPs", NeurIPS, 2021 (Google). [Paper][PyTorch (antonyvigouret)]
- S<sup>2</sup>-MLP: "S<sup>2</sup>-MLP: Spatial-Shift MLP Architecture for Vision", WACV, 2022 (Baidu). [Paper]
- CycleMLP: "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 (HKU). [Paper][PyTorch]
- AS-MLP: "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 (ShanghaiTech University). [Paper][PyTorch]
- Wave-MLP: "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 (Huawei). [Paper][PyTorch]
- DynaMixer: "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 (Tencent). [Paper][PyTorch]
- STD: "Spatial-Channel Token Distillation for Vision MLPs", ICML, 2022 (Huawei). [Paper]
- AMixer: "AMixer: Adaptive Weight Mixing for Self-Attention Free Vision Transformers", ECCV, 2022 (Tsinghua University). [Paper]
- MS-MLP: "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs", arXiv, 2022 (Microsoft). [Paper]
- ActiveMLP: "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 (Microsoft). [Paper]
- MDMLP: "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 (Jiangsu University). [Paper][PyTorch]
- PosMLP: "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 (University of Science and Technology of China). [Paper][PyTorch]
- SplitMixer: "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 (Quintic AI, California). [Paper][PyTorch]
- gSwin: "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 (PKSHATechnology, Japan). [Paper]
- ?: "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 (Berkeley). [Paper]
- AFFNet: "Adaptive Frequency Filters As Efficient Global Token Mixers", ICCV, 2023 (Microsoft). [Paper]
- Strip-MLP: "Strip-MLP: Efficient Token Interaction for Vision MLP", ICCV, 2023 (Southern University of Science and Technology). [Paper][PyTorch]
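The MLP models above replace self-attention with per-location and cross-location MLPs; MLP-Mixer's block is the common template. A compact sketch (dimensions illustrative):

```python
# One MLP-Mixer block: token-mixing MLP across patches, then channel-mixing MLP per patch.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_tokens=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(                # mixes information across token positions
            nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(              # mixes information across channels
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                              # x: (B, tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

x = torch.randn(2, 196, 512)
print(MixerBlock()(x).shape)                           # torch.Size([2, 196, 512])
```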
Other Attention-Free
- DWNet: "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 (Nankai University). [Paper][PyTorch]
- PoolFormer: "MetaFormer is Actually What You Need for Vision", CVPR, 2022 (Sea AI Lab). [Paper][PyTorch] (see the pooling-mixer sketch at the end of this subsection)
- ConvNext: "A ConvNet for the 2020s", CVPR, 2022 (Facebook). [Paper][PyTorch]
- RepLKNet: "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs", CVPR, 2022 (Megvii). [Paper][MegEngine][PyTorch]
- FocalNet: "Focal Modulation Networks", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
- HorNet: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch][Website]
- S4ND: "S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces", NeurIPS, 2022 (Stanford). [Paper]
- Sequencer: "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 (Rikkyo University, Japan). [Paper]
- MogaNet: "Efficient Multi-order Gated Aggregation Network", arXiv, 2022 (Westlake University, China). [Paper]
- Conv2Former: "Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition", arXiv, 2022 (ByteDance). [Paper]
- CoC: "Image as Set of Points", ICLR, 2023 (Northeastern). [Paper][PyTorch]
- SLaK: "More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity", ICLR, 2023 (UT Austin). [Paper][PyTorch]
- ConvNeXt-V2: "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders", CVPR, 2023 (Meta). [Paper][PyTorch]
- SPANet: "SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation", ICCV, 2023 (Korea Institute of Science and Technology). [Paper][Code (in construction)][Website]
- DFFormer: "FFT-based Dynamic Token Mixer for Vision", arXiv, 2023 (Rikkyo University, Japan). [Paper][Code (in construction)]
- ?: "ConvNets Match Vision Transformers at Scale", arXiv, 2023 (DeepMind). [Paper]
- VMamba: "VMamba: Visual State Space Model", arXiv, 2024 (CAS). [Paper][PyTorch]
- Vim: "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model", arXiv, 2024 (Huazhong University of Science and Technology). [Paper][PyTorch] (see the state-space scan sketch at the end of this subsection)
- VRWKV: "Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
- LocalMamba: "LocalMamba: Visual State Space Model with Windowed Selective Scan", arXiv, 2024 (University of Sydney). [Paper][PyTorch]
- SiMBA: "SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series", arXiv, 2024 (Microsoft). [Paper][PyTorch]
- PlainMamba: "PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition", arXiv, 2024 (University of Edinburgh, Scotland). [Paper][PyTorch]
- EfficientVMamba: "EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba", arXiv, 2024 (The University of Sydney). [Paper][PyTorch]
- RDNet: "DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs", arXiv, 2024 (NAVER). [Paper]
- MambaOut: "MambaOut: Do We Really Need Mamba for Vision?", arXiv, 2024 (NUS). [Paper][PyTorch]
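PoolFormer's claim is that the MetaFormer skeleton (token mixer + channel MLP + residuals) matters more than attention itself, since even average pooling works as the token mixer. A sketch of that pooling mixer:

```python
# PoolFormer-style token mixer: average pooling (minus identity) replaces self-attention.
import torch
import torch.nn as nn

class PoolMixer(nn.Module):
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):                # x: (B, C, H, W) feature map
        return self.pool(x) - x          # subtract the input, as in the PoolFormer paper

x = torch.randn(2, 64, 14, 14)
print(PoolMixer()(x).shape)              # torch.Size([2, 64, 14, 14])
```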
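The Mamba-family entries (VMamba, Vim, LocalMamba, ...) replace attention with a selective state-space scan: a linear recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t swept along the token sequence. A toy, non-selective scan to convey the mechanism; the cited models use input-dependent (selective) parameters and fused scan kernels:

```python
# Toy linear state-space scan over a token sequence (conceptual only; Mamba-style models
# make A, B, C input-dependent and use hardware-aware parallel scans).
import torch

def ssm_scan(x, A, B, C):
    """x: (T, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:                        # recurrence: h_t = A h_{t-1} + B x_t ; y_t = C h_t
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return torch.stack(ys)

T, d_in, d_state, d_out = 16, 8, 4, 8
y = ssm_scan(torch.randn(T, d_in), 0.9 * torch.eye(d_state),
             torch.randn(d_state, d_in), torch.randn(d_out, d_state))
print(y.shape)                           # torch.Size([16, 8])
```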
Analysis for Transformer
- Attention-CNN: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 (EPFL). [Paper][PyTorch][Website]
- Transformer-Explainability: "Transformer Interpretability Beyond Attention Visualization", CVPR, 2021 (Tel Aviv). [Paper][PyTorch] (see the attention-rollout sketch at the end of this section)
- ?: "Are Convolutional Neural Networks or Transformers more like human vision?", CogSci, 2021 (Princeton). [Paper]
- ?: "ConvNets vs. Transformers: Whose Visual Representations are More Transferable?", ICCVW, 2021 (HKU). [Paper]
- ?: "Do Vision Transformers See Like Convolutional Neural Networks?", NeurIPS, 2021 (Google). [Paper]
- ?: "Intriguing Properties of Vision Transformers", NeurIPS, 2021 (MBZUAI). [Paper][PyTorch]
- FoveaTer: "FoveaTer: Foveated Transformer for Image Classification", arXiv, 2021 (UCSB). [Paper]
- ?: "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight", arXiv, 2021 (Microsoft). [Paper]
- ?: "Revisiting the Calibration of Modern Neural Networks", arXiv, 2021 (Google). [Paper]
- ?: "What Makes for Hierarchical Vision Transformer?", arXiv, 2021 (Horizon Robotic). [Paper]
- ?: "Visualizing Paired Image Similarity in Transformer Networks", WACV, 2022 (Temple University). [Paper][PyTorch]
- FDSL: "Can Vision Transformers Learn without Natural Images?", AAAI, 2022 (AIST). [Paper][PyTorch][Website]
- AlterNet: "How Do Vision Transformers Work?", ICLR, 2022 (Yonsei University). [Paper][PyTorch]
- ?: "When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations", ICLR, 2022 (Google). [Paper][Tensorflow]
- ?: "Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers", ICML, 2022 (Stanford). [Paper]
- ?: "Three things everyone should know about Vision Transformers", ECCV, 2022 (Meta). [Paper]
- ?: "Vision Transformers provably learn spatial structure", NeurIPS, 2022 (Princeton). [Paper]
- AWD-ViT: "Visualizing and Understanding Patch Interactions in Vision Transformer", arXiv, 2022 (JD). [Paper]
- ?: "CNNs and Transformers Perceive Hybrid Images Similar to Humans", arXiv, 2022 (Quintic AI, CA). [Paper][Code]
- MJP: "Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers", CVPR, 2023 (Tencent). [Paper][PyTorch]
- ?: "A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
- ?: "How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification", arXiv, 2022 (University of Groningen, The Netherlands). [Paper]
- ?: "Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems", arXiv, 2022 (Technion Israel Institute Of Technology). [Paper]
- ProtoPFormer: "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
- ICLIP: "Exploring Visual Interpretability for Contrastive Language-Image Pre-training", arXiv, 2022 (HKUST). [Paper][Code (in construction)]
- ?: "Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers", arXiv, 2022 (Google). [Paper]
- ?: "Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?", arXiv, 2022 (Monash University). [Paper][PyTorch]
- ViT-CX: "ViT-CX: Causal Explanation of Vision Transformers", arXiv, 2022 (HKUST). [Paper]
- ?: "Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application", arXiv, 2022 (The Hong Kong Polytechnic University). [Paper]
- IAV: "Explanation on Pretraining Bias of Finetuned Vision Transformer", arXiv, 2022 (KAIST). [Paper]
- ViT-Shapley: "Learning to Estimate Shapley Values with Vision Transformers", ICLR, 2023 (UW). [Paper][PyTorch]
- ImageNet-X: "ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations", ICLR, 2023 (Meta). [Paper]
- ?: "A Theoretical Understanding of Vision Transformers: Learning, Generalization, and Sample Complexity", ICLR, 2023 (Rensselaer Polytechnic Institute, NY). [Paper]
- ?: "What Do Self-Supervised Vision Transformers Learn?", ICLR, 2023 (NAVER). [Paper][PyTorch (in construction)]
- ?: "When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?", ICLR, 2023 (Stanford). [Paper]
- CLIP-Dissect: "CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks", ICLR, 2023 (UCSD). [Paper]
- ?: "Understanding Masked Autoencoders via Hierarchical Latent Variable Models", CVPR, 2023 (CMU). [Paper]
- ?: "Teaching Matters: Investigating the Role of Supervision in Vision Transformers", CVPR, 2023 (Maryland). [Paper][PyTorch][Website]
- ?: "Masked Autoencoding Does Not Help Natural Language Supervision at Scale", CVPR, 2023 (Apple). [Paper]
- ?: "On Data Scaling in Masked Image Modeling", CVPR, 2023 (Microsoft). [Paper][PyTorch]
- ?: "Revealing the Dark Secrets of Masked Image Modeling", CVPR, 2023 (Microsoft). [Paper]
- Vision-DiffMask: "VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking", CVPRW, 2023 (University of Amsterdam). [Paper][PyTorch]
- ?: "A Multidimensional Analysis of Social Biases in Vision Transformers", ICCV, 2023 (University of Mannheim, Germany). [Paper][PyTorch]
- ?: "Analyzing Vision Transformers for Image Classification in Class Embedding Space", NeurIPS, 2023 (Goethe University Frankfurt, Germany). [Paper]
- BoB: "Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks", NeurIPS, 2023 (NYU). [Paper][PyTorch]
- ViT-CoT: "Are Vision Transformers More Data Hungry Than Newborn Visual Systems?", NeurIPS, 2023 (Indiana University Bloomington, Indiana). [Paper]
- AtMan: "AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation", NeurIPS, 2023 (Aleph Alpha, Germany). [Paper][PyTorch]
- AttentionViz: "AttentionViz: A Global View of Transformer Attention", arXiv, 2023 (Harvard). [Paper][Website]
- ?: "Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields", arXiv, 2023 (POSTECH). [Paper]
- ?: "Reviving Shift Equivariance in Vision Transformers", arXiv, 2023 (Maryland). [Paper]
- ViT-ReciproCAM: "ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer", arXiv, 2023 (Intel). [Paper]
- Eureka-moment: "Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems", arXiv, 2023 (Bosch). [Paper]
- INTR: "A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis", arXiv, 2023 (OSU). [Paper][PyTorch]
- ?: "Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention", AAAI, 2024 (Korea Institute of Science and Technology (KIST)). [Paper][PyTorch]
- RelatiViT: "Can Transformers Capture Spatial Relations between Objects?", ICLR, 2024 (Tsinghua). [Paper][Code (in construction)][Website]
- TokenTM: "Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer", CVPR, 2024 (Illinois Institute of Technology). [Paper]
- SaCo: "On the Faithfulness of Vision Transformer Explanations", CVPR, 2024 (Illinois Institute of Technology). [Paper]
- ?: "A Decade's Battle on Dataset Bias: Are We There Yet?", arXiv, 2024 (Meta). [Paper][Code (in construction)]
- LeGrad: "LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity", arXiv, 2024 (University of Bonn, Germany). [Paper][PyTorch]
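Several interpretability entries above start from raw attention maps. A standard baseline for turning per-layer attention into an input relevance map is attention rollout (Abnar & Zuidema, 2020): average heads, add the residual identity, renormalize, and compose across layers. A sketch:

```python
# Attention rollout (Abnar & Zuidema, 2020): a simple baseline for ViT attention visualization.
import torch

def attention_rollout(attn_maps):
    """attn_maps: list of per-layer attention tensors, each (heads, tokens, tokens)."""
    tokens = attn_maps[0].shape[-1]
    rollout = torch.eye(tokens)
    for attn in attn_maps:
        a = attn.mean(0)                           # average over heads
        a = a + torch.eye(tokens)                  # add the residual connection
        a = a / a.sum(-1, keepdim=True)            # re-normalize rows
        rollout = a @ rollout                      # compose with earlier layers
    return rollout                                 # row 0 ~ CLS-token relevance over patches

maps = [torch.rand(12, 197, 197).softmax(-1) for _ in range(12)]  # fake 12-layer ViT-B attention
print(attention_rollout(maps)[0].shape)            # torch.Size([197])
```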
Detection
Object Detection
- General:
- CNN-based backbone:
- DETR: "End-to-End Object Detection with Transformers", ECCV, 2020 (Facebook). [Paper][PyTorch] (see the bipartite-matching sketch at the end of this list)
- Deformable DETR: "Deformable DETR: Deformable Transformers for End-to-End Object Detection", ICLR, 2021 (SenseTime). [Paper][PyTorch]
- UP-DETR: "UP-DETR: Unsupervised Pre-training for Object Detection with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch]
- SMCA: "Fast Convergence of DETR with Spatially Modulated Co-Attention", ICCV, 2021 (CUHK). [Paper][PyTorch]
- Conditional-DETR: "Conditional DETR for Fast Training Convergence", ICCV, 2021 (Microsoft). [Paper]
- PnP-DETR: "PnP-DETR: Towards Efficient Visual Analysis with Transformers", ICCV, 2021 (Yitu). [Paper][Code (in construction)]
- TSP: "Rethinking Transformer-based Set Prediction for Object Detection", ICCV, 2021 (CMU). [Paper]
- Dynamic-DETR: "Dynamic DETR: End-to-End Object Detection With Dynamic Attention", ICCV, 2021 (Microsoft). [Paper]
- ViT-YOLO: "ViT-YOLO: Transformer-Based YOLO for Object Detection", ICCVW, 2021 (Xidian University). [Paper]
- ACT: "End-to-End Object Detection with Adaptive Clustering Transformer", BMVC, 2021 (Peking + CUHK). [Paper][PyTorch]
- DIL-ViT: "Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers", BMVC, 2021 (Monash University Malaysia). [Paper]
- Efficient-DETR: "Efficient DETR: Improving End-to-End Object Detector with Dense Prior", arXiv, 2021 (Megvii). [Paper]
- CA-FPN: "Content-Augmented Feature Pyramid Network with Light Linear Transformers", arXiv, 2021 (CAS). [Paper]
- DETReg: "DETReg: Unsupervised Pretraining with Region Priors for Object Detection", arXiv, 2021 (Tel-Aviv + Berkeley). [Paper][Website]
- GQPos: "Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads", arXiv, 2021 (Megvii). [Paper]
- Anchor-DETR: "Anchor DETR: Query Design for Transformer-Based Detector", AAAI, 2022 (Megvii). [Paper][PyTorch]
- Sparse-DETR: "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity", ICLR, 2022 (Kakao). [Paper][PyTorch]
- DAB-DETR: "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR", ICLR, 2022 (IDEA, China). [Paper][PyTorch]
- DN-DETR: "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising", CVPR, 2022 (International Digital Economy Academy (IDEA), China). [Paper][PyTorch]
- SAM-DETR: "Accelerating DETR Convergence via Semantic-Aligned Matching", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
- AdaMixer: "AdaMixer: A Fast-Converging Query-Based Object Detector", CVPR, 2022 (Nanjing University). [Paper][Code (in construction)]
- DESTR: "DESTR: Object Detection With Split Transformer", CVPR, 2022 (Oregon State). [Paper]
- REGO: "Recurrent Glimpse-based Decoder for Detection with Transformer", CVPR, 2022 (The University of Sydney). [Paper][PyTorch]
- ?: "Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer", CVPR, 2022 (Ant Group). [Paper]
- DE-DETR: "Towards Data-Efficient Detection Transformers", ECCV, 2022 (JD). [Paper][PyTorch]
- DFFT: "Efficient Decoder-free Object Detection with Transformers", ECCV, 2022 (Tencent). [Paper]
- Cornerformer: "Cornerformer: Purifying Instances for Corner-Based Detectors", ECCV, 2022 (Huawei). [Paper]
- ?: "A Simple Approach and Benchmark for 21,000-Category Object Detection", ECCV, 2022 (Microsoft). [Paper][Code (in construction)]
- Obj2Seq: "Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks", NeurIPS, 2022 (CAS). [Paper][PyTorch]
- KA: "Knowledge Amalgamation for Object Detection with Transformers", arXiv, 2022 (Zhejiang University). [Paper]
- TCC: "Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection", arXiv, 2022 (The University of Sydney). [Paper]
- Conditional-DETR-V2: "Conditional DETR V2: Efficient Detection Transformer with Box Queries", arXiv, 2022 (Peking University). [Paper]
- SAM-DETR++: "Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
- ComplETR: "ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers", arXiv, 2022 (Amazon). [Paper]
- Pair-DETR: "Pair DETR: Contrastive Learning Speeds Up DETR Training", arXiv, 2022 (Amazon). [Paper]
- Group-DETR-v2: "Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining", arXiv, 2022 (Baidu). [Paper]
- KD-DETR: "Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling", arXiv, 2022 (Baidu). [Paper]
- D<sup>3</sup>ETR: "D<sup>3</sup>ETR: Decoder Distillation for Detection Transformer", arXiv, 2022 (Peking University). [Paper]
- Teach-DETR: "Teach-DETR: Better Training DETR with Teachers", arXiv, 2022 (CUHK). [Paper][Code (in construction)]
- DETA: "NMS Strikes Back", arXiv, 2022 (UT Austin). [Paper][PyTorch]
- ViT-Adapter: "ViT-Adapter: Exploring Plain Vision Transformer for Accurate Dense Predictions", ICLR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- DDQ: "Dense Distinct Query for End-to-End Object Detection", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- SiameseDETR: "Siamese DETR", CVPR, 2023 (SenseTime). [Paper][PyTorch]
- SAP-DETR: "SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency", CVPR, 2023 (CAS). [Paper]
- Q-DETR: "Q-DETR: An Efficient Low-Bit Quantized Detection Transformer", CVPR, 2023 (Beihang University). [Paper][Code (in construction)]
- Lite-DETR: "Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR", CVPR, 2023 (IDEA). [Paper][PyTorch]
- H-DETR: "DETRs with Hybrid Matching", CVPR, 2023 (Microsoft). [Paper][PyTorch]
- MaskDINO: "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation", CVPR, 2023 (IDEA, China). [Paper][PyTorch]
- IMFA: "Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors", CVPR, 2023 (NTU, Singapore). [Paper][Code (in construction)]
- SQR: "Enhanced Training of Query-Based Object Detection via Selective Query Recollection", CVPR, 2023 (CMU). [Paper][PyTorch]
- DQ-Det: "Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation", ICML, 2023 (ByteDance). [Paper]
- SpeedDETR: "SpeedDETR: Speed-aware Transformers for End-to-end Object Detection", ICML, 2023 (Northeastern University). [Paper]
- AlignDet: "AlignDet: Aligning Pre-training and Fine-tuning in Object Detection", ICCV, 2023 (ByteDance). [Paper][PyTorch][Website]
- Focus-DETR: "Less is More: Focus Attention for Efficient DETR", ICCV, 2023 (Huawei). [Paper][PyTorch][MindSpore]
- Plain-DETR: "DETR Doesn't Need Multi-Scale or Locality Design", ICCV, 2023 (Microsoft). [Paper][Code (in construction)]
- ASAG: "ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
- MIMDet: "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection", ICCV, 2023 (Tencent). [Paper][PyTorch]
- Stable-DINO: "Detection Transformer with Stable Matching", ICCV, 2023 (IDEA). [Paper][Code (in construction)]
- imTED: "Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection", ICCV, 2023 (CAS). [Paper][PyTorch]
- Group-DETR: "Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment", ICCV, 2023 (Baidu). [Paper][Code (in construction)]
- Co-DETR: "DETRs with Collaborative Hybrid Assignments Training", ICCV, 2023 (SenseTime). [Paper][PyTorch]
- DETRDistill: "DETRDistill: A Universal Knowledge Distillation Framework for DETR-families", ICCV, 2023 (USTC). [Paper]
- Decoupled-DETR: "Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection", ICCV, 2023 (SenseTime). [Paper]
- StageInteractor: "StageInteractor: Query-based Object Detector with Cross-stage Interaction", ICCV, 2023 (Nanjing University). [Paper]
- Rank-DETR: "Rank-DETR for High Quality Object Detection", NeurIPS, 2023 (Tsinghua). [Paper][PyTorch]
- Cal-DETR: "Cal-DETR: Calibrated Detection Transformer", NeurIPS, 2023 (MBZUAI). [Paper][PyTorch]
- KS-DETR: "KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer", arXiv, 2023 (Toyota Technological Institute). [Paper][PyTorch]
- FeatAug-DETR: "FeatAug-DETR: Enriching One-to-Many Matching for DETRs with Feature Augmentation", arXiv, 2023 (CUHK). [Paper][Code (in construction)]
- RT-DETR: "DETRs Beat YOLOs on Real-time Object Detection", arXiv, 2023 (Baidu). [Paper]
- Align-DETR: "Align-DETR: Improving DETR with Simple IoU-aware BCE loss", arXiv, 2023 (Megvii). [Paper][PyTorch]
- Box-DETR: "Box-DETR: Understanding and Boxing Conditional Spatial Queries", arXiv, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch (in construction)]
- RefineBox: "Enhancing Your Trained DETRs with Box Refinement", arXiv, 2023 (CAS). [Paper][Code (in construction)]
- ?: "Revisiting DETR Pre-training for Object Detection", arXiv, 2023 (Toronto). [Paper]
- Gen2Det: "Gen2Det: Generate to Detect", arXiv, 2023 (Meta). [Paper]
- ViT-CoMer: "ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions", CVPR, 2024 (Baidu). [Paper][PyTorch]
- Salience-DETR: "Salience-DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement", CVPR, 2024 (Xi'an Jiaotong University). [Paper][PyTorch]
- MS-DETR: "MS-DETR: Efficient DETR Training with Mixed Supervision", arXiv, 2024 (Baidu). [Paper][Code (in construction)]
- Transformer-based backbone:
- ViT-FRCNN: "Toward Transformer-Based Object Detection", arXiv, 2020 (Pinterest). [Paper]
- WB-DETR: "WB-DETR: Transformer-Based Detector Without Backbone", ICCV, 2021 (CAS). [Paper]
- YOLOS: "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection", NeurIPS, 2021 (Horizon Robotics). [Paper][PyTorch]
- ?: "Benchmarking Detection Transfer Learning with Vision Transformers", arXiv, 2021 (Facebook). [Paper]
- ViDT: "ViDT: An Efficient and Effective Fully Transformer-based Object Detector", ICLR, 2022 (NAVER). [Paper][PyTorch]
- FP-DETR: "FP-DETR: Detection Transformer Advanced by Fully Pre-training", ICLR, 2022 (USTC). [Paper]
- DETR++: "DETR++: Taming Your Multi-Scale Detection Transformer", CVPRW, 2022 (Google). [Paper]
- ViTDet: "Exploring Plain Vision Transformer Backbones for Object Detection", ECCV, 2022 (Meta). [Paper]
- UViT: "A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation", ECCV, 2022 (Google). [Paper]
- CFDT: "A Transformer-Based Object Detector with Coarse-Fine Crossing Representations", NeurIPS, 2022 (Huawei). [Paper]
- D<sup>2</sup>ETR: "D<sup>2</sup>ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention", arXiv, 2022 (Alibaba). [Paper]
- DINO: "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection", ICLR, 2023 (IDEA, China). [Paper][PyTorch]
- SimPLR: "SimPLR: A Simple and Plain Transformer for Object Detection and Segmentation", arXiv, 2023 (UvA). [Paper]
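Most of the DETR variants above (e.g., H-DETR, Group-DETR, Align-DETR, Stable-DINO) revolve around the same core step: a one-to-one Hungarian assignment between predicted queries and ground-truth boxes. Below is a minimal sketch of that matching step, assuming a simplified cost of negative class probability plus L1 box distance (real implementations typically add a generalized-IoU term); the cost weights are illustrative, not any specific paper's recipe.

```python
# Minimal sketch of DETR-style bipartite matching (illustrative, not any
# single paper's exact recipe). Cost = -P(class) + L1 box distance.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    cost_class=1.0, cost_bbox=5.0):
    """pred_logits: [Q, C]; pred_boxes: [Q, 4] (cx, cy, w, h, normalized);
    gt_labels: [G]; gt_boxes: [G, 4]. Returns matched (query_idx, gt_idx)."""
    prob = pred_logits.softmax(-1)                   # [Q, C]
    c_class = -prob[:, gt_labels]                    # [Q, G]
    c_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # [Q, G]
    cost = cost_class * c_class + cost_bbox * c_bbox
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(q_idx), torch.as_tensor(g_idx)

# Toy usage: 100 queries, 4 classes, 3 ground-truth boxes.
logits, boxes = torch.randn(100, 4), torch.rand(100, 4)
gt_labels, gt_boxes = torch.tensor([0, 2, 1]), torch.rand(3, 4)
print(hungarian_match(logits, boxes, gt_labels, gt_boxes))
```

One-to-many variants such as Group-DETR, H-DETR, and Co-DETR densify supervision essentially by running this same assignment over extra query groups or duplicated ground truth.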
3D Object Detection
- AST-GRU: "LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention", CVPR, 2020 (Baidu). [Paper][Code (in construction)]
- Pointformer: "3D Object Detection with Pointformer", arXiv, 2020 (Tsinghua). [Paper]
- CT3D: "Improving 3D Object Detection with Channel-wise Transformer", ICCV, 2021 (Alibaba). [Paper][Code (in construction)]
- Group-Free-3D: "Group-Free 3D Object Detection via Transformers", ICCV, 2021 (Microsoft). [Paper][PyTorch]
- VoTr: "Voxel Transformer for 3D Object Detection", ICCV, 2021 (CUHK + NUS). [Paper]
- 3DETR: "An End-to-End Transformer Model for 3D Object Detection", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
- DETR3D: "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries", CoRL, 2021 (MIT). [Paper]
- M3DETR: "M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers", WACV, 2022 (University of Maryland). [Paper][PyTorch]
- MonoDTR: "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer", CVPR, 2022 (NTU). [Paper][Code (in construction)]
- VoxSeT: "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds", CVPR, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
- TransFusion: "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers", CVPR, 2022 (HKUST). [Paper][PyTorch]
- CAT-Det: "CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection", CVPR, 2022 (Beihang University). [Paper]
- TokenFusion: "Multimodal Token Fusion for Vision Transformers", CVPR, 2022 (Tsinghua). [Paper]
- SST: "Embracing Single Stride 3D Object Detector with Sparse Transformer", CVPR, 2022 (CAS). [Paper][PyTorch]
- LIFT: "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection", CVPR, 2022 (Shanghai Jiao Tong University). [Paper]
- BoxeR: "BoxeR: Box-Attention for 2D and 3D Transformers", CVPR, 2022 (University of Amsterdam). [Paper][PyTorch]
- BrT: "Bridged Transformer for Vision and Point Cloud 3D Object Detection", CVPR, 2022 (Tsinghua). [Paper]
- VISTA: "VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
- STRL: "Towards Self-Supervised Pre-Training of 3DETR for Label-Efficient 3D Object Detection", CVPRW, 2022 (Bosch). [Paper]
- MTrans: "Multimodal Transformer for Automatic 3D Annotation and Object Detection", ECCV, 2022 (HKU). [Paper][PyTorch]
- CenterFormer: "CenterFormer: Center-based Transformer for 3D Object Detection", ECCV, 2022 (TuSimple). [Paper][Code (in construction)]
- BUTD-DETR: "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", ECCV, 2022 (CMU). [Paper][PyTorch][Website]
- SpatialDETR: "SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention", ECCV, 2022 (Mercedes-Benz). [Paper][PyTorch]
- CramNet: "CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection", ECCV, 2022 (Waymo). [Paper]
- SWFormer: "SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds", ECCV, 2022 (Waymo). [Paper]
- EMMF-Det: "Enhancing Multi-modal Features Using Local Self-Attention for 3D Object Detection", ECCV, 2022 (Hikvision). [Paper]
- UVTR: "Unifying Voxel-based Representation with Transformer for 3D Object Detection", NeurIPS, 2022 (CUHK). [Paper][PyTorch]
- MsSVT: "MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds", NeurIPS, 2022 (Beijing Institute of Technology). [Paper][PyTorch]
- DeepInteraction: "DeepInteraction: 3D Object Detection via Modality Interaction", NeurIPS, 2022 (Fudan). [Paper][PyTorch]
- PETR: "PETR: Position Embedding Transformation for Multi-View 3D Object Detection", arXiv, 2022 (Megvii). [Paper]
- Graph-DETR3D: "Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection", arXiv, 2022 (University of Science and Technology of China). [Paper]
- PolarFormer: "PolarFormer: Multi-camera 3D Object Detection with Polar Transformer", arXiv, 2022 (Fudan University). [Paper][Code (in construction)]
- AST-GRU: "Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds", arXiv, 2022 (Beijing Institute of Technology). [Paper]
- SEFormer: "SEFormer: Structure Embedding Transformer for 3D Object Detection", arXiv, 2022 (Tsinghua University). [Paper]
- CRAFT: "CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer", arXiv, 2022 (KAIST). [Paper]
- CrossDTR: "CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection", arXiv, 2022 (NTU). [Paper][Code (in construction)]
- Focal-PETR: "Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection", arXiv, 2022 (Beijing Institute of Technology). [Paper]
- Li3DeTr: "Li3DeTr: A LiDAR based 3D Detection Transformer", WACV, 2023 (University of Coimbra, Portugal). [Paper]
- PiMAE: "PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection", CVPR, 2023 (Peking University). [Paper][PyTorch]
- OcTr: "OcTr: Octree-based Transformer for 3D Object Detection", CVPR, 2023 (Beihang University). [Paper]
- MonoATT: "MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer", CVPR, 2023 (Shanghai Jiao Tong). [Paper]
- PVT-SSD: "PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer", CVPR, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- ConQueR: "ConQueR: Query Contrast Voxel-DETR for 3D Object Detection", CVPR, 2023 (CUHK). [Paper][PyTorch][Website]
- FrustumFormer: "FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection", CVPR, 2023 (CAS). [Paper][PyTorch (in construction)]
- DSVT: "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets", CVPR, 2023 (Peking University). [Paper][PyTorch]
- AShapeFormer: "AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers", CVPR, 2023 (Hunan University). [Paper][Code (in construction)]
- MV-JAR: "MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training", CVPR, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- FocalFormer3D: "FocalFormer3D: Focusing on Hard Instance for 3D Object Detection", ICCV, 2023 (NVIDIA). [Paper][PyTorch]
- 3DPPE: "3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers", ICCV, 2023 (Houmo AI, China). [Paper][PyTorch]
- PARQ: "Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection", ICCV, 2023 (Northeastern). [Paper][PyTorch][Website]
- CMT: "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection", ICCV, 2023 (Megvii). [Paper][PyTorch]
- MonoDETR: "MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection", ICCV, 2023 (Shanghai AI Laboratory). [Paper][PyTorch]
- DTH: "Efficient Transformer-based 3D Object Detection with Dynamic Token Halting", ICCV, 2023 (Cruise). [Paper]
- PETRv2: "PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images", ICCV, 2023 (Megvii). [Paper][PyTorch]
- MV2D: "Object as Query: Lifting any 2D Object Detector to 3D Detection", ICCV, 2023 (Beihang University). [Paper]
- ?: "An Empirical Analysis of Range for 3D Object Detection", ICCVW, 2023 (CMU). [Paper]
- Uni3DETR: "Uni3DETR: Unified 3D Detection Transformer", NeurIPS, 2023 (Tsinghua). [Paper][PyTorch]
- Diffusion-SS3D: "Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection", NeurIPS, 2023 (NYCU). [Paper][PyTorch]
- STEMD: "Spatial-Temporal Enhanced Transformer Towards Multi-Frame 3D Object Detection", arXiv, 2023 (CUHK). [Paper][Code (in construction)](https://github.com/Eaphan/STEMD)
- V-DETR: "V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
- 3DiffTection: "3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features", arXiv, 2023 (NVIDIA). [Paper][Code (in construction)][Website]
- PTT: "PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection", arXiv, 2023 (UC Merced). [Paper][Code (in construction)]
- Point-DETR3D: "Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection", AAAI, 2024 (USTC). [Paper]
- MixSup: "MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection", ICLR, 2024 (CAS). [Paper][PyTorch]
- QAF2D: "Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors", CVPR, 2024 (Nullmax, China). [Paper]
- ScatterFormer: "ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention", arXiv, 2024 (The Hong Kong Polytechnic University). [Paper][Code (in construction)]
- MsSVT++: "MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection", arXiv, 2024 (Beijing Institute of Technology). [Paper][PyTorch]
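A mechanism shared by several of the multi-view camera methods above (DETR3D, PETR, Graph-DETR3D, PolarFormer) is projecting each object query's 3D reference point into the camera images to sample features there. Below is a minimal single-camera, single-scale sketch of that step; the projection matrix and feature shapes are illustrative placeholders, and real systems repeat this across cameras and feature levels.

```python
# Sketch of DETR3D-style feature sampling: project per-query 3D reference
# points into a camera view and bilinearly sample image features there.
# Simplified: one camera, one feature scale, depth-only validity check.
import torch
import torch.nn.functional as F

def sample_by_3d_points(feat, ref_points, proj, img_hw):
    """feat: [1, C, H, W] image features; ref_points: [Q, 3] world coords;
    proj: [3, 4] camera projection matrix; img_hw: (img_h, img_w) in pixels."""
    Q = ref_points.shape[0]
    homo = torch.cat([ref_points, ref_points.new_ones(Q, 1)], dim=1)  # [Q, 4]
    uvw = homo @ proj.T                                               # [Q, 3]
    depth = uvw[:, 2:3].clamp(min=1e-5)
    uv = uvw[:, :2] / depth                                           # pixel coords
    h, w = img_hw
    # Normalize to [-1, 1] for grid_sample; out-of-image points sample zeros.
    grid = torch.stack([uv[:, 0] / w * 2 - 1, uv[:, 1] / h * 2 - 1], dim=-1)
    sampled = F.grid_sample(feat, grid.view(1, Q, 1, 2), align_corners=False)
    valid = (uvw[:, 2] > 0).float().view(1, 1, Q, 1)  # mask points behind camera
    return (sampled * valid).squeeze(-1)              # [1, C, Q]

feat = torch.randn(1, 256, 32, 88)   # toy feature map
pts = torch.rand(900, 3) * 50        # 900 queries' 3D reference points
P = torch.randn(3, 4)                # hypothetical projection matrix
print(sample_by_3d_points(feat, pts, P, (512, 1408)).shape)
```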
Multi-Modal Detection
- OVR-CNN: "Open-Vocabulary Object Detection Using Captions", CVPR, 2021 (Snap). [Paper][PyTorch]
- MDETR: "MDETR - Modulated Detection for End-to-End Multi-Modal Understanding", ICCV, 2021 (NYU). [Paper][PyTorch][Website]
- FETNet: "FETNet: Feature Exchange Transformer Network for RGB-D Object Detection", BMVC, 2021 (Tsinghua). [Paper]
- MEDUSA: "Exploiting Scene Depth for Object Detection with Multimodal Transformers", BMVC, 2021 (Google). [Paper][PyTorch]
- StrucTexT: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", arXiv, 2021 (Baidu). [Paper]
- MAVL: "Class-agnostic Object Detection with Multi-modal Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
- OWL-ViT: "Simple Open-Vocabulary Object Detection with Vision Transformers", ECCV, 2022 (Google). [Paper][JAX][Hugging Face] (see the inference sketch after this list)
- X-DETR: "X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks", ECCV, 2022 (Amazon). [Paper]
- simCrossTrans: "simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers", arXiv, 2022 (The City University of New York). [Paper][PyTorch]
- ?: "DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection", arXiv, 2022 (USC). [Paper]
- YONOD: "You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers", arXiv, 2022 (CUNY). [Paper][PyTorch]
- OmDet: "OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training", arXiv, 2022 (Binjiang Institute of Zhejiang University). [Paper]
- ContFormer: "Video Referring Expression Comprehension via Transformer with Content-aware Query", arXiv, 2022 (Peking University). [Paper]
- DQ-DETR: "DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding", AAAI, 2023 (International Digital Economy Academy (IDEA)). [Paper][Code (in construction)]
- F-VLM: "F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models", ICLR, 2023 (Google). [Paper][Website]
- OV-3DET: "Open-Vocabulary Point-Cloud Object Detection without 3D Annotation", CVPR, 2023 (Peking University). [Paper][PyTorch]
- Detection-Hub: "Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding", CVPR, 2023 (Fudan + Microsoft). [Paper]
- OmniLabel: "OmniLabel: A Challenging Benchmark for Language-Based Object Detection", ICCV, 2023 (NEC). [Paper][GitHub][Website]
- MM-OVOD: "Multi-Modal Classifiers for Open-Vocabulary Object Detection", ICML, 2023 (Oxford). [Paper][Code (in construction)][Website]
- CoDA: "CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection", NeurIPS, 2023 (HKUST). [Paper][PyTorch][Website]
- ContextDET: "Contextual Object Detection with Multimodal Large Language Models", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)][Website]
- Object2Scene: "Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection", arXiv, 2023 (Shanghai AI Lab). [Paper]
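Among the open-vocabulary detectors above, OWL-ViT ships with a maintained Hugging Face implementation, making it a convenient entry point. A minimal text-conditioned inference sketch follows, assuming the released base checkpoint; the image path and text queries are placeholders.

```python
# Minimal zero-shot detection with OWL-ViT (see the OWL-ViT entry above)
# via Hugging Face transformers, using the released base checkpoint.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")   # placeholder path
texts = [["a pedestrian", "a traffic light", "a bicycle"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])   # (h, w)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{texts[0][int(label)]}: {score:.2f} at {box.tolist()}")
```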
HOI Detection
- HOI-Transformer: "End-to-End Human Object Interaction Detection with HOI Transformer", CVPR, 2021 (Megvii). [Paper][PyTorch]
- HOTR: "HOTR: End-to-End Human-Object Interaction Detection with Transformers", CVPR, 2021 (Kakao + Korea University). [Paper][PyTorch]
- MSTR: "MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection", CVPR, 2022 (Kakao). [Paper]
- SSRT: "What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions", CVPR, 2022 (Amazon). [Paper]
- CPC: "Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection", CVPR, 2022 (Korea University). [Paper][PyTorch (in construction)]
- DisTR: "Human-Object Interaction Detection via Disentangled Transformer", CVPR, 2022 (Baidu). [Paper]
- STIP: "Exploring Structure-Aware Transformer Over Interaction Proposals for Human-Object Interaction Detection", CVPR, 2022 (JD). [Paper][PyTorch]
- DOQ: "Distillation Using Oracle Queries for Transformer-Based Human-Object Interaction Detection", CVPR, 2022 (South China University of Technology). [Paper]
- UPT: "Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer", CVPR, 2022 (Australian Centre for Robotic Vision). [Paper][PyTorch][Website]
- CATN: "Category-Aware Transformer Network for Better Human-Object Interaction Detection", CVPR, 2022 (Huazhong University of Science and Technology). [Paper]
- GEN-VLKT: "GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection", CVPR, 2022 (Alibaba). [Paper][PyTorch]
- HQM: "Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection", ECCV, 2022 (South China University of Technology). [Paper][PyTorch]
- Iwin: "Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows", ECCV, 2022 (Shanghai Jiao Tong). [Paper]
- RLIP: "RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection", NeurIPS, 2022 (Alibaba). [Paper][PyTorch]
- TUTOR: "Video-based Human-Object Interaction Detection from Tubelet Tokens", NeurIPS, 2022 (Shanghai Jiao Tong). [Paper]
- ?: "Understanding Embodied Reference with Touch-Line Transformer", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
- ?: "Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning", ICLR, 2023 (KU Leuven). [Paper]
- HOICLIP: "HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models", CVPR, 2023 (ShanghaiTech). [Paper][Code (in construction)]
- ViPLO: "ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection", CVPR, 2023 (mAy-I, Korea). [Paper][PyTorch]
- OpenCat: "Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework", CVPR, 2023 (Renmin University of China). [Paper]
- CQL: "Category Query Learning for Human-Object Interaction Classification", CVPR, 2023 (Megvii). [Paper][Code (in construction)]
- RmLR: "Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection", ICCV, 2023 (Southeast University, China). [Paper]
- PViC: "Exploring Predicate Visual Context in Detecting of Human-Object Interactions", ICCV, 2023 (Microsoft). [Paper][PyTorch]
- AGER: "Agglomerative Transformer for Human-Object Interaction Detection", ICCV, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)]
- RLIPv2: "RLIPv2: Fast Scaling of Relational Language-Image Pre-training", ICCV, 2023 (Alibaba). [Paper][PyTorch]
- EgoPCA: "EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding", ICCV, 2023 (Shanghai Jiao Tong). [Paper][Website]
- UniHOI: "Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models", NeurIPS, 2023 (Southeast University). [Paper][Code (in construction)]
- LogicHOI: "Neural-Logic Human-Object Interaction Detection", NeurIPS, 2023 (University of Technology Sydney). [Paper][Code (in construction)]
- ?: "Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels", arXiv, 2023 (KU Leuven). [Paper]
- DP-HOI: "Disentangled Pre-training for Human-Object Interaction Detection", CVPR, 2024 (South China University of Technology). [Paper][Code (in construction)]
- HOI-Ref: "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision", arXiv, 2024 (University of Bristol, UK). [Paper][PyTorch][Website]
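A recurring design among the HOI detectors above (e.g., HOTR, UPT, STIP) is a two-stage pipeline: take human and object embeddings from a detector, enumerate human-object pairs, and classify the interaction on each pair. Below is a minimal sketch of such a pairwise head; the embedding dimension and MLP are illustrative, with 117 verbs matching the HICO-DET benchmark.

```python
# Sketch of a pairwise HOI head in the spirit of unary-pairwise designs
# (e.g., UPT): pair every detected human with every detected object and
# score verbs on the concatenated embeddings. Dimensions are illustrative.
import torch
import torch.nn as nn

class PairwiseHOIHead(nn.Module):
    def __init__(self, dim=256, num_verbs=117):  # 117 = HICO-DET verb count
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, num_verbs))

    def forward(self, human_emb, object_emb):
        """human_emb: [H, D]; object_emb: [O, D] -> verb logits [H, O, V]."""
        H, O = human_emb.shape[0], object_emb.shape[0]
        pairs = torch.cat([
            human_emb.unsqueeze(1).expand(H, O, -1),
            object_emb.unsqueeze(0).expand(H, O, -1)], dim=-1)
        return self.mlp(pairs)

head = PairwiseHOIHead()
print(head(torch.randn(3, 256), torch.randn(5, 256)).shape)  # [3, 5, 117]
```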
Salient Object Detection
- VST: "Visual Saliency Transformer", ICCV, 2021 (Northwestern Polytechnical University). [Paper]
- ?: "Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction", NeurIPS, 2021 (Baidu). [Paper]
- SwinNet: "SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection", TCSVT, 2021 (Anhui University). [Paper][Code]
- SOD-Transformer: "Transformer Transforms Salient Object Detection and Camouflaged Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
- GLSTR: "Unifying Global-Local Representations in Salient Object Detection with Transformer", arXiv, 2021 (South China University of Technology). [Paper]
- TriTransNet: "TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network", arXiv, 2021 (Anhui University). [Paper]
- AbiU-Net: "Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net", arXiv, 2021 (Nankai University). [Paper]
- TranSalNet: "TranSalNet: Visual saliency prediction using transformers", arXiv, 2021 (Cardiff University, UK). [Paper]
- DFTR: "DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection", arXiv, 2022 (Tencent). [Paper]
- GroupTransNet: "GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection", arXiv, 2022 (Nankai University). [Paper]
- SelfReformer: "SelfReformer: Self-Refined Network with Transformer for Salient Object Detection", arXiv, 2022 (NTU, Singapore). [Paper]
- DTMINet: "Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection", arXiv, 2022 (CUHK). [Paper]
- MCNet: "Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
- SiaTrans: "SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification", arXiv, 2022 (Shandong University of Science and Technology). [Paper]
- PSFormer: "PSFormer: Point Transformer for 3D Salient Object Detection", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper]
- RMFormer: "Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection", ACMMM, 2023 (Dalian University of Technology). [Paper]
Other Detection Tasks
- X-supervised:
- LOST: "Localizing Objects with Self-Supervised Transformers and no Labels", BMVC, 2021 (Valeo.ai). [Paper][PyTorch]
- Omni-DETR: "Omni-DETR: Omni-Supervised Object Detection with Transformers", CVPR, 2022 (Amazon). [Paper][PyTorch]
- TokenCut: "Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut", CVPR, 2022 (Univ. Grenoble Alpes, France). [Paper][PyTorch][Website]
- WS-DETR: "Scaling Novel Object Detection with Weakly Supervised Detection Transformers", CVPRW, 2022 (Microsoft). [Paper]
- TRT: "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
- TokenCut: "TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut", arXiv, 2022 (Univ. Grenoble Alpes, France). [Paper][PyTorch][Website]
- Semi-DETR: "Semi-DETR: Semi-Supervised Object Detection With Detection Transformers", CVPR, 2023 (Baidu). [Paper][Paddle (in construction)][PyTorch (JCZ404)]
- MoTok: "Object Discovery from Motion-Guided Tokens", CVPR, 2023 (Toyota). [Paper][PyTorch][Website]
- CutLER: "Cut and Learn for Unsupervised Object Detection and Instance Segmentation", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
- ISA-TS: "Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames", ICML, 2023 (Google). [Paper]
- MOST: "MOST: Multiple Object localization with Self-supervised Transformers for object discovery", ICCV, 2023 (Meta). [Paper][PyTorch][Website]
- GenPromp: "Generative Prompt Model for Weakly Supervised Object Localization", ICCV, 2023 (CAS). [Paper][PyTorch]
- SAT: "Spatial-Aware Token for Weakly Supervised Object Localization", ICCV, 2023 (USTC). [Paper][PyTorch]
- ALWOD: "ALWOD: Active Learning for Weakly-Supervised Object Detection", ICCV, 2023 (Rutgers). [Paper][Code (in construction)]
- HASSOD: "HASSOD: Hierarchical Adaptive Self-Supervised Object Detection", NeurIPS, 2023 (UIUC). [Paper][PyTorch][Website]
- SeqCo-DETR: "SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers", arXiv, 2023 (SenseTime). [Paper]
- R-MAE: "R-MAE: Regions Meet Masked Autoencoders", arXiv, 2023 (Meta). [Paper]
- SimDETR: "SimDETR: Simplifying self-supervised pretraining for DETR", arXiv, 2023 (Samsung). [Paper]
- U2Seg: "Unsupervised Universal Image Segmentation", arXiv, 2023 (Berkeley). [Paper][PyTorch]
- CuVLER: "CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers", CVPR, 2024 (Technion - Israel Institute of Technology). [Paper][PyTorch]
- Sparse-Semi-DETR: "Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection", CVPR, 2024 (DFKI, Germany). [Paper]
- X-Shot Object Detection:
- AIT: "Adaptive Image Transformer for One-Shot Object Detection", CVPR, 2021 (Academia Sinica). [Paper]
- Meta-DETR: "Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning", arXiv, 2021 (NTU, Singapore). [Paper][PyTorch]
- CAT: "CAT: Cross-Attention Transformer for One-Shot Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
- FCT: "Few-Shot Object Detection with Fully Cross-Transformer", CVPR, 2022 (Columbia). [Paper]
- SaFT: "Semantic-aligned Fusion Transformer for One-shot Object Detection", CVPR, 2022 (Microsoft). [Paper]
- TENET: "Time-rEversed diffusioN tEnsor Transformer: A New TENET of Few-Shot Object Detection", ECCV, 2022 (ANU). [Paper][PyTorch]
- Meta-DETR: "Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation", TPAMI, 2022 (NTU, Singapore). [Paper]
- Incremental-DETR: "Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning", arXiv, 2022 (NUS). [Paper]
- FS-DETR: "FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training", ICCV, 2023 (Samsung). [Paper]
- Meta-ZSDETR: "Meta-ZSDETR: Zero-shot DETR with Meta-learning", ICCV, 2023 (Fudan). [Paper]
- ?: "Revisiting Few-Shot Object Detection with Vision-Language Models", arXiv, 2023 (CMU). [Paper]
- Open-World/Vocabulary:
- OW-DETR: "OW-DETR: Open-world Detection Transformer", CVPR, 2022 (IIAI). [Paper][PyTorch]
- DetPro: "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
- RegionCLIP: "RegionCLIP: Region-based Language-Image Pretraining", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- PromptDet: "PromptDet: Towards Open-vocabulary Detection using Uncurated Images", ECCV, 2022 (Meituan). [Paper][PyTorch][Website]
- OV-DETR: "Open-Vocabulary DETR with Conditional Matching", ECCV, 2022 (NTU, Singapore). [Paper]
- VL-PLM: "Exploiting Unlabeled Data with Vision and Language Models for Object Detection", ECCV, 2022 (Rutgers University). [Paper][PyTorch][Website]
- DetCLIP: "DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection", NeurIPS, 2022 (HKUST). [Paper]
- WWbL: "What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs", NeurIPS, 2022 (Tel-Aviv). [Paper][PyTorch][Demo]
- P<sup>3</sup>OVD: "P<sup>3</sup>OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection", arXiv, 2022 (Sun Yat-sen University). [Paper]
- Open-World-DETR: "Open World DETR: Transformer based Open World Object Detection", arXiv, 2022 (NUS). [Paper]
- BARON: "Aligning Bag of Regions for Open-Vocabulary Object Detection", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch]
- CapDet: "CapDet: Unifying Dense Captioning and Open-World Detection Pretraining", CVPR, 2023 (Sun Yat-sen University). [Paper]
- CORA: "CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching", CVPR, 2023 (CUHK). [Paper][PyTorch]
- UniDetector: "Detecting Everything in the Open World: Towards Universal Object Detection", CVPR, 2023 (Tsinghua University). [Paper][PyTorch]
- DetCLIPv2: "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment", CVPR, 2023 (Huawei). [Paper]
- RO-ViT: "Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers", CVPR, 2023 (Google). [Paper]
- CAT: "CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection", CVPR, 2023 (Northeast University, China). [Paper][PyTorch]
- CondHead: "Learning to Detect and Segment for Open Vocabulary Object Detection", CVPR, 2023 (Sichuan University). [Paper]
- OADP: "Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection", CVPR, 2023 (Beihang University). [Paper][PyTorch]
- OVAD: "Open-vocabulary Attribute Detection", CVPR, 2023 (University of Freiburg, Germany). [Paper][Website]
- OvarNet: "OvarNet: Towards Open-vocabulary Object Attribute Recognition", CVPR, 2023 (Xiaohongshu). [Paper][Website][PyTorch]
- ALLOW: "Annealing-Based Label-Transfer Learning for Open World Object Detection", CVPR, 2023 (Beihang University). [Paper][PyTorch]
- PROB: "PROB: Probabilistic Objectness for Open World Object Detection", CVPR, 2023 (Stanford). [Paper][PyTorch][Website]
- RandBox: "Random Boxes Are Open-world Object Detectors", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch]
- Cascade-DETR: "Cascade-DETR: Delving into High-Quality Universal Object Detection", ICCV, 2023 (ETHZ + HKUST). [Paper][PyTorch]
- EdaDet: "EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment", ICCV, 2023 (ShanghaiTech). [Paper][Website]
- V3Det: "V3Det: Vast Vocabulary Visual Detection Dataset", ICCV, 2023 (Shanghai AI Lab). [Paper][GitHub][Website]
- CoDet: "CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection", NeurIPS, 2023 (ByteDance). [Paper][PyTorch]
- DAMEX: "DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets", NeurIPS, 2023 (Georgia Tech). [Paper][Code (in construction)]
- OWL-ST: "Scaling Open-Vocabulary Object Detection", NeurIPS, 2023 (DeepMind). [Paper]
- MQ-Det: "Multi-modal Queried Object Detection in the Wild", NeurIPS, 2023 (Tencent). [Paper][PyTorch]
- Grounding-DINO: "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection", arXiv, 2023 (IDEA). [Paper]
- GridCLIP: "GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning", arXiv, 2023 (Queen Mary University of London). [Paper]
- ?: "Three ways to improve feature alignment for open vocabulary detection", arXiv, 2023 (DeepMind). [Paper]
- PCL: "Open-Vocabulary Object Detection using Pseudo Caption Labels", arXiv, 2023 (Kakao). [Paper]
- Prompt-OVD: "Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection", arXiv, 2023 (NAVER). [Paper]
- LOWA: "LOWA: Localize Objects in the Wild with Attributes", arXiv, 2023 (Mineral, California). [Paper]
- SGDN: "Open-Vocabulary Object Detection via Scene Graph Discovery", arXiv, 2023 (Monash University). [Paper]
- SAS-Det: "Improving Pseudo Labels for Open-Vocabulary Object Detection", arXiv, 2023 (NEC). [Paper]
- DE-ViT: "Detect Every Thing with Few Examples", arXiv, 2023 (Rutgers). [Paper][PyTorch]
- CLIPSelf: "CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
- DST-Det: "DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)]
- DITO: "Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection", arXiv, 2023 (DeepMind). [Paper]
- RegionSpot: "Recognize Any Regions", arXiv, 2023 (University of Surrey, England). [Paper][Code (in construction)]
- DECOLA: "Language-conditioned Detection Transformer", arXiv, 2023 (UT Austin). [Paper][PyTorch]
- PLAC: "Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection", arXiv, 2023 (Kakao). [Paper]
- FOMO: "Open World Object Detection in the Era of Foundation Models", arXiv, 2023 (Stanford). [Paper][Website]
- LP-OVOD: "LP-OVOD: Open-Vocabulary Object Detection by Linear Probing", WACV, 2024 (VinAI, Vietnam). [Paper]
- ProxyDet: "ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open Vocabulary Object Detection", WACV, 2024 (NAVER). [Paper]
- WSOVOD: "Weakly Supervised Open-Vocabulary Object Detection", AAAI, 2024 (Xiamen University). [Paper][Code (in construction)]
- CLIM: "CLIM: Contrastive Language-Image Mosaic for Region Representation", AAAI, 2024 (NTU, Singapore). [Paper][PyTorch]
- SS-OWFormer: "Semi-supervised Open-World Object Detection", AAAI, 2024 (MBZUAI). [Paper][PyTorch]
- DVDet: "LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors", ICLR, 2024 (NTU, Singapore). [Paper]
- GenerateU: "Generative Region-Language Pretraining for Open-Ended Object Detection", CVPR, 2024 (Monash University). [Paper][PyTorch]
- DetCLIPv3: "DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection", CVPR, 2024 (Huawei). [Paper]
- RALF: "Retrieval-Augmented Open-Vocabulary Object Detection", CVPR, 2024 (Korea University). [Paper][Code (in construction)]
- SHiNe: "SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection", CVPR, 2024 (NAVER). [Paper]
- MM-Grounding-DINO: "An Open and Comprehensive Pipeline for Unified Object Grounding and Detection", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
- YOLO-World: "YOLO-World: Real-Time Open-Vocabulary Object Detection", arXiv, 2024 (Tencent). [Paper][Code (in construction)]
- T-Rex2: "T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy", arXiv, 2024 (IDEA). [Paper][PyTorch][Website]
- Grounding-DINO-1.5: "Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection", arXiv, 2024 (IDEA). [Paper][Code]
- Pedestrian Detection:
- PED: "DETR for Crowd Pedestrian Detection", arXiv, 2020 (Tsinghua). [Paper][PyTorch]
- ?: "Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection", NeurIPS, 2022 (ICL). [Paper]
- Pedestron: "Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond", arXiv, 2022 (IIAI). [Paper][PyTorch]
- VLPD: "VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision", CVPR, 2023 (University of Science and Technology Beijing). [Paper][PyTorch]
- Lane Detection:
- LSTR: "End-to-end Lane Shape Prediction with Transformers", WACV, 2021 (Xi'an Jiaotong). [Paper][PyTorch]
- LETR: "Line Segment Detection Using Transformers without Edges", CVPR, 2021 (UCSD). [Paper][PyTorch]
- Laneformer: "Laneformer: Object-aware Row-Column Transformers for Lane Detection", AAAI, 2022 (Huawei). [Paper]
- TLC: "Transformer Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World", CVPR, 2022 (Peking University). [Paper]
- PersFormer: "PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark", ECCV, 2022 (Shanghai AI Laboratory). [Paper][PyTorch]
- MHVA: "Lane Detection Transformer Based on Multi-Frame Horizontal and Vertical Attention and Visual Transformer Module", ECCV, 2022 (Beihang University). [Paper]
- PriorLane: "PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer", arXiv, 2022 (Zhejiang Lab). [Paper][PyTorch]
- CurveFormer: "CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention", arXiv, 2022 (NullMax, China). [Paper]
- LATR: "LATR: 3D Lane Detection from Monocular Images with Transformer", ICCV, 2023 (CUHK). [Paper][PyTorch]
- O2SFormer: "End to End Lane detection with One-to-Several Transformer", arXiv, 2023 (Southeast University, China). [Paper][PyTorch]
- Lane2Seq: "Lane2Seq: Towards Unified Lane Detection via Sequence Generation", CVPR, 2024 (Southeast University, China). [Paper]
- Object Localization:
- TS-CAM: "TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization", arXiv, 2021 (CAS). [Paper]
- LCTR: "LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization", AAAI, 2022 (Xiamen University). [Paper]
- ViTOL: "ViTOL: Vision Transformer for Weakly Supervised Object Localization", CVPRW, 2022 (Mercedes-Benz). [Paper][PyTorch]
- SCM: "Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration", ECCV, 2022 (CUHK). [Paper][PyTorch]
- CaFT: "CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper]
- CoW: "CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation", CVPR, 2023 (Columbia). [Paper][PyTorch][Website]
- ESC: "ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation", ICML, 2023 (UCSC). [Paper]
- Relation Detection:
- PST: "Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries", ICCV, 2021 (Amazon). [Paper]
- PST: "Visual Composite Set Detection Using Part-and-Sum Transformers", arXiv, 2021 (Amazon). [Paper]
- TROI: "Transformed ROIs for Capturing Visual Transformations in Videos", arXiv, 2021 (NUS, Singapore). [Paper]
- RelTransformer: "RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition", CVPR, 2022 (KAUST). [Paper][PyTorch]
- VReBERT: "VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection", ICPR, 2022 (ANU). [Paper]
- UniVRD: "Unified Visual Relationship Detection with Vision and Language Models", ICCV, 2023 (Google). [Paper][Code (in construction)]
- RECODE: "Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models", NeurIPS, 2023 (Zhejiang University). [Paper]
- SG-ViT: "Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection", arXiv, 2024 (DeepMind). [Paper]
- Anomaly Detection:
- VT-ADL: "VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization", ISIE, 2021 (University of Udine, Italy). [Paper]
- InTra: "Inpainting Transformer for Anomaly Detection", arXiv, 2021 (Fujitsu). [Paper]
- AnoViT: "AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder", arXiv, 2022 (Korea University). [Paper]
- WinCLIP: "WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation", CVPR, 2023 (Amazon). [Paper]
- M3DM: "Multimodal Industrial Anomaly Detection via Hybrid Fusion", CVPR, 2023 (Tencent). [Paper][PyTorch]
- Cross-Domain:
- SSTN: "SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
- MTTrans: "MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer", ECCV, 2022 (Beihang University). [Paper]
- OAA-OTA: "Improving Transferability for Domain Adaptive Detection Transformers", arXiv, 2022 (Beijing Institute of Technology). [Paper]
- SSTA: "Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
- DETR-GA: "DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection", CVPR, 2023 (Beihang University). [Paper]
- DA-DETR: "DA-DETR: Domain Adaptive Detection Transformer with Information Fusion", CVPR, 2023 (NTU, Singapore). [Paper]
- ?: "CLIP the Gap: A Single Domain Generalization Approach for Object Detection", CVPR, 2023 (EPFL). [Paper][PyTorch]
- PM-DETR: "PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers", arXiv, 2023 (Peking). [Paper]
- Co-Salient Object Detection:
- CoSformer: "CoSformer: Detecting Co-Salient Object with Transformers", arXiv, 2021 (Nanjing University). [Paper]
- Oriented Object Detection:
- O<sup>2</sup>DETR: "Oriented Object Detection with Transformer", arXiv, 2021 (Baidu). [Paper]
- AO2-DETR: "AO2-DETR: Arbitrary-Oriented Object Detection Transformer", arXiv, 2022 (Peking University). [Paper]
- ARS-DETR: "ARS-DETR: Aspect Ratio Sensitive Oriented Object Detection with Transformer", arXiv, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
- RHINO: "RHINO: Rotated DETR with Dynamic Denoising via Hungarian Matching for Oriented Object Detection", arXiv, 2023 (SI Analytics). [Paper]
- Multiview Detection:
- MVDeTr: "Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)", ACMMM, 2021 (ANU). [Paper]
- Polygon Detection:
- ?: "Investigating transformers in the decomposition of polygonal shapes as point collections", ICCVW, 2021 (Delft University of Technology, Netherlands). [Paper]
- Drone-view:
- TPH: "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios", ICCVW, 2021 (Beihang University). [Paper]
- TransVisDrone: "TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos", arXiv, 2022 (UCF). [Paper][Code (in construction)]
- Infrared:
- ?: "Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds", arXiv, 2021 (Chongqing University of Posts and Telecommunications). [Paper]
- MiPa: "MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection", arXiv, 2024 (ETS Montreal). [Paper][Code (in construction)]
- Text Detection:
- SwinTextSpotter: "SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
- TESTR: "Text Spotting Transformers", CVPR, 2022 (UCSD). [Paper][PyTorch]
- TTS: "Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer", CVPR, 2022 (Amazon). [Paper]
- oCLIP: "Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting", ECCV, 2022 (ByteDance). [Paper]
- TransDETR: "End-to-End Video Text Spotting with Transformer", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
- ?: "Arbitrary Shape Text Detection using Transformers", arXiv, 2022 (University of Waterloo, Canada). [Paper]
- ?: "Arbitrary Shape Text Detection via Boundary Transformer", arXiv, 2022 (University of Science and Technology Beijing). [Paper][Code (in construction)]
- DPTNet: "DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection", arXiv, 2022 (Xiamen University). [Paper]
- ATTR: "Aggregated Text Transformer for Scene Text Detection", arXiv, 2022 (Fudan). [Paper]
- DPText-DETR: "DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer", AAAI, 2023 (JD). [Paper][PyTorch]
- TCM: "Turning a CLIP Model into a Scene Text Detector", CVPR, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
- DeepSolo: "DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting", CVPR, 2023 (JD). [Paper][PyTorch]
- ESTextSpotter: "ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer", ICCV, 2023 (South China University of Technology). [Paper][PyTorch]
- PBFormer: "PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer", ACMMM, 2023 (Huawei). [Paper]
- DeepSolo++: "DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting", arXiv, 2023 (JD). [Paper][PyTorch]
- FastTCM: "Turning a CLIP Model into a Scene Text Spotter", arXiv, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
- SRFormer: "SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
- TGA: "Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis", CVPR, 2024 (Microsoft). [Paper]
- SwinTextSpotter-v2: "SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting", arXiv, 2024 (South China University of Technology). [Paper]
- Edge Detection:
- EDTER: "EDTER: Edge Detection with Transformer", CVPR, 2022 (Beijing Jiaotong University). [Paper][Code (in construction)]
- HEAT: "HEAT: Holistic Edge Attention Transformer for Structured Reconstruction", CVPR, 2022 (Simon Fraser). [Paper][PyTorch][Website]
- Manipulation Detection:
- ObjectFormer: "ObjectFormer for Image Manipulation Detection and Localization", CVPR, 2022 (Fudan University). [Paper]
- Shadow Detection:
- SCOTCH-SODA: "SCOTCH and SODA: A Transformer Video Shadow Detection Framework", CVPR, 2023 (University of Cambridge). [Paper]
- Keypoint Detection:
- SalViT: "From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection", arXiv, 2023 (ANU). [Paper]
- Continual Learning:
- CL-DETR: "Continual Detection Transformer for Incremental Object Detection", CVPR, 2023 (MPI). [Paper]
- Task-Driven Object Detection:
- CoTDet: "CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection", ICCV, 2023 (ShanghaiTech). [Paper]
- Diffusion:
- DiffusionEngine: "DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection", arXiv, 2023 (ByteDance). [Paper][PyTorch][Website]
- TADP: "Text-image Alignment for Diffusion-based Perception", arXiv, 2023 (CalTech). [Paper][Website]
- InstaGen: "InstaGen: Enhancing Object Detection by Training on Synthetic Dataset", arXiv, 2024 (Meituan). [Paper][Code (in construction)][Website]
Segmentation
Semantic Segmentation
- SETR: "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch][Website]
- TrSeg: "TrSeg: Transformer for semantic segmentation", PRL, 2021 (Korea University). [Paper][PyTorch]
- CWT: "Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer", ICCV, 2021 (University of Surrey, UK). [Paper][PyTorch]
- Segmenter: "Segmenter: Transformer for Semantic Segmentation", ICCV, 2021 (INRIA). [Paper][PyTorch]
- UN-EPT: "A Unified Efficient Pyramid Transformer for Semantic Segmentation", ICCVW, 2021 (Amazon). [Paper][PyTorch]
- FTN: "Fully Transformer Networks for Semantic Image Segmentation", arXiv, 2021 (Baidu). [Paper]
- SegFormer: "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch] (see the inference sketch after this list)
- MaskFormer: "Per-Pixel Classification is Not All You Need for Semantic Segmentation", NeurIPS, 2021 (UIUC + Facebook). [Paper][Website]
- OffRoadTranSeg: "OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments", arXiv, 2021 (IISER, India). [Paper]
- TRFS: "Boosting Few-shot Semantic Segmentation with Transformers", arXiv, 2021 (ETHZ). [Paper]
- Flying-Guide-Dog: "Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation", arXiv, 2021 (KIT, Germany). [Paper][Code (in construction)]
- VSPW: "Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models", arXiv, 2021 (Xiaomi). [Paper]
- SDTP: "SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction", arXiv, 2021 (?). [Paper]
- TopFormer: "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
- HRViT: "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch]
- GReaT: "Graph Reasoning Transformer for Image Parsing", ACMMM, 2022 (HKUST). [Paper]
- SegDeformer: "A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining", ECCV, 2022 (Shanghai Jiao Tong + Huawei). [Paper][PyTorch]
- PAUMER: "PAUMER: Patch Pausing Transformer for Semantic Segmentation", BMVC, 2022 (Idiap, Switzerland). [Paper]
- SegViT: "SegViT: Semantic Segmentation with Plain Vision Transformers", NeurIPS, 2022 (The University of Adelaide, Australia). [Paper][PyTorch]
- RTFormer: "RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer", NeurIPS, 2022 (Baidu). [Paper][Paddle]
- SegNeXt: "SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation", NeurIPS, 2022 (Tsinghua University). [Paper]
- Lawin: "Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
- PFT: "Pyramid Fusion Transformer for Semantic Segmentation", arXiv, 2022 (CUHK + SenseTime). [Paper]
- DFlatFormer: "Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation", arXiv, 2022 (OPPO). [Paper]
- FeSeFormer: "Feature Selective Transformer for Semantic Image Segmentation", arXiv, 2022 (Baidu). [Paper]
- StructToken: "StructToken: Rethinking Semantic Segmentation with Structural Prior", arXiv, 2022 (Shanghai AI Lab). [Paper]
- HILA: "Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention", arXiv, 2022 (University of Toronto). [Paper][Website][PyTorch]
- HLG: "Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective", arXiv, 2022 (Fudan University). [Paper][PyTorch]
- SSformer: "SSformer: A Lightweight Transformer for Semantic Segmentation", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
- NamedMask: "NamedMask: Distilling Segmenters from Complementary Foundation Models", arXiv, 2022 (Oxford). [Paper][PyTorch][Website]
- IncepFormer: "IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
- SeaFormer: "SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation", ICLR, 2023 (Tencent). [Paper]
- PPL: "Probabilistic Prompt Learning for Dense Prediction", CVPR, 2023 (Yonsei). [Paper]
- AFF: "AutoFocusFormer: Image Segmentation off the Grid", CVPR, 2023 (Apple). [Paper]
- CTS: "Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers", CVPR, 2023 (Eindhoven University of Technology, Netherlands). [Paper][PyTorch][Website]
- TSG: "Transformer Scale Gate for Semantic Segmentation", CVPR, 2023 (Monash University, Australia). [Paper]
- FASeg: "Dynamic Focus-aware Positional Queries for Semantic Segmentation", CVPR, 2023 (Monash University, Australia). [Paper][PyTorch]
- HFD-BSD: "A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation", ICCV, 2023 (HKUST). [Paper]
- DToP: "Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation", ICCV, 2023 (South China University of Technology + The University of Adelaide). [Paper]
- FreeMask: "FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models", NeurIPS, 2023 (HKU). [Paper][PyTorch]
- AiluRus: "AiluRus: A Scalable ViT Framework for Dense Prediction", NeurIPS, 2023 (Huawei). [Paper][Code (in construction)]
- SegViTv2: "SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 (The University of Adelaide, Australia). [Paper][PyTorch]
- DoViT: "Dynamic Token-Pass Transformers for Semantic Segmentation", arXiv, 2023 (Alibaba). [Paper]
- CFT: "Category Feature Transformer for Semantic Segmentation", arXiv, 2023 (Huawei). [Paper]
- ICPC: "ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation", arXiv, 2023 (Alibaba). [Paper]
- Superpixel-Association: "Superpixel Transformers for Efficient Semantic Segmentation", arXiv, 2023 (Google). [Paper]
- PlainSeg: "Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
- SCTNet: "SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation", AAAI, 2024 (Meituan). [Paper][Code (in construction)]
- ?: "Region-Based Representations Revisited", arXiv, 2024 (UIUC). [Paper]
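Several of the semantic segmentation models above have public checkpoints; as a concrete entry point, here is a minimal inference sketch for SegFormer via Hugging Face transformers, assuming NVIDIA's ADE20K-finetuned b0 checkpoint (the image path is a placeholder).

```python
# Minimal semantic-segmentation inference with SegFormer (listed above)
# via Hugging Face transformers, using the ADE20K-finetuned b0 checkpoint.
import torch
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

ckpt = "nvidia/segformer-b0-finetuned-ade-512-512"
processor = SegformerImageProcessor.from_pretrained(ckpt)
model = SegformerForSemanticSegmentation.from_pretrained(ckpt)

image = Image.open("scene.jpg").convert("RGB")   # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # [1, 150, h/4, w/4] on ADE20K

# Upsample to the input resolution and take the per-pixel argmax.
up = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False)
seg_map = up.argmax(dim=1)[0]                    # [H, W] class indices
print(seg_map.shape, seg_map.unique()[:10])
```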
Depth Estimation
- DPT: "Vision Transformers for Dense Prediction", ICCV, 2021 (Intel). [Paper][PyTorch] (see the inference sketch after this list)
- TransDepth: "Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction", ICCV, 2021 (Harbin Institute of Technology + University of Trento). [Paper][PyTorch]
- ASTransformer: "Transformer-based Monocular Depth Estimation with Attention Supervision", BMVC, 2021 (USTC). [Paper][PyTorch]
- MT-SfMLearner: "Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics", VISAPP, 2022 (NavInfo Europe, Netherlands). [Paper]
- DepthFormer: "Multi-Frame Self-Supervised Depth with Transformers", CVPR, 2022 (Toyota). [Paper]
- GuideFormer: "GuideFormer: Transformers for Image Guided Depth Completion", CVPR, 2022 (Agency for Defense Development, Korea). [Paper]
- SparseFormer: "SparseFormer: Attention-based Depth Completion Network", CVPRW, 2022 (Meta). [Paper]
- DEST: "Depth Estimation with Simplified Transformer", CVPRW, 2022 (NVIDIA). [Paper]
- MonoViT: "MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer", 3DV, 2022 (University of Bologna, Italy). [Paper][PyTorch]
- Spike-Transformer: "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV, 2022 (Peking University). [Paper][PyTorch]
- ?: "Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation", ECCVW, 2022 (IIT Madras). [Paper]
- GLPanoDepth: "GLPanoDepth: Global-to-Local Panoramic Depth Estimation", arXiv, 2022 (Nanjing University). [Paper]
- DepthFormer: "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
- BinsFormer: "BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
- SideRT: "SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation", arXiv, 2022 (Meituan). [Paper]
- MonoFormer: "MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers", arXiv, 2022 (DGIST, Korea). [Paper]
- Depthformer: "Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion", arXiv, 2022 (Indian Institute of Technology Delhi). [Paper]
- TODE-Trans: "TODE-Trans: Transparent Object Depth Estimation with Transformer", arXiv, 2022 (USTC). [Paper][Code (in construction)]
- ObjCAViT: "ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention", arXiv, 2022 (ICL). [Paper]
- ROIFormer: "ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation", AAAI, 2023 (OPPO). [Paper]
- TST: "Lightweight Monocular Depth Estimation via Token-Sharing Transformer", ICRA, 2023 (KAIST). [Paper]
- CompletionFormer: "CompletionFormer: Depth Completion with Convolutions and Vision Transformers", CVPR, 2023 (University of Bologna, Italy). [Paper][PyTorch][Website]
- Lite-Mono: "Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation", CVPR, 2023 (University of Twente, Netherlands). [Paper][PyTorch]
- EGformer: "EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation", ICCV, 2023 (SNU). [Paper]
- ZeroDepth: "Towards Zero-Shot Scale-Aware Monocular Depth Estimation", ICCV, 2023 (Toyota). [Paper][PyTorch][Website]
- Win-Win: "Win-Win: Training High-Resolution Vision Transformers from Two Windows", arXiv, 2023 (NAVER). [Paper]
- ?: "Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation", WACV, 2024 (Southern University of Science and Technology). [Paper]
- DeCoTR: "DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions", CVPR, 2024 (Qualcomm). [Paper]
- Depth-Anything: "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data", arXiv, 2024 (TikTok). [Paper][PyTorch][Website]
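DPT, the first entry in this subsection, also has a maintained Hugging Face port. A minimal monocular depth inference sketch follows, assuming the Intel/dpt-large checkpoint; note that the model predicts relative (inverse) depth, and the image path is a placeholder.

```python
# Minimal monocular depth estimation with DPT (listed above) via Hugging Face.
import torch
from PIL import Image
from transformers import DPTImageProcessor, DPTForDepthEstimation

processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

image = Image.open("room.jpg").convert("RGB")    # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    depth = model(**inputs).predicted_depth      # [1, h, w] relative depth

# Resize back to the original resolution for visualization.
depth = torch.nn.functional.interpolate(
    depth.unsqueeze(1), size=image.size[::-1],
    mode="bicubic", align_corners=False).squeeze()
print(depth.shape, float(depth.min()), float(depth.max()))
```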
Object Segmentation
- SOTR: "SOTR: Segmenting Objects with Transformers", ICCV, 2021 (China Agricultural University). [Paper][PyTorch]
- Trans4Trans: "Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World", ICCVW, 2021 (Karlsruhe Institute of Technology, Germany). [Paper][Code (in construction)]
- Trans2Seg: "Segmenting Transparent Object in the Wild with Transformer", arXiv, 2021 (HKU + SenseTime). [Paper][PyTorch]
- SOIT: "SOIT: Segmenting Objects with Instance-Aware Transformers", AAAI, 2022 (Hikvision). [Paper][PyTorch]
- CAST: "Concurrent Recognition and Segmentation with Adaptive Segment Tokens", arXiv, 2022 (Berkeley). [Paper]
- ?: "Learning Explicit Object-Centric Representations with Vision Transformers", arXiv, 2022 (Aalto University, Finland). [Paper]
- MSMFormer: "Mean Shift Mask Transformer for Unseen Object Instance Segmentation", arXiv, 2022 (UT Dallas). [Paper][PyTorch]
Other Segmentation Tasks
- Any-X/Every-X:
- SAM: "Segment Anything", ICCV, 2023 (Meta). [Paper][PyTorch][Website] (see the usage sketch after this list)
- SEEM: "Segment Everything Everywhere All at Once", NeurIPS, 2023 (Microsoft). [Paper][PyTorch]
- HQ-SAM: "Segment Anything in High Quality", NeurIPS, 2023 (ETHZ). [Paper][PyTorch]
- ?: "An Empirical Study on the Robustness of the Segment Anything Model (SAM)", arXiv, 2023 (UCSB). [Paper]
- ?: "A Comprehensive Survey on Segment Anything Model for Vision and Beyond", arXiv, 2023 (HKUST). [Paper]
- SAD: "SAD: Segment Any RGBD", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
- ?: "A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering", arXiv, 2023 (Kyung Hee University, Korea). [Paper]
- ?: "Robustness of SAM: Segment Anything Under Corruptions and Beyond", arXiv, 2023 (Kyung Hee University). [Paper]
- FastSAM: "Fast Segment Anything", arXiv, 2023 (CAS). [Paper][PyTorch]
- MobileSAM: "Faster Segment Anything: Towards Lightweight SAM for Mobile Applications", arXiv, 2023 (Kyung Hee University). [Paper][PyTorch]
- Semantic-SAM: "Semantic-SAM: Segment and Recognize Anything at Any Granularity", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
- Follow-Anything: "Follow Anything: Open-set detection, tracking, and following in real-time", arXiv, 2023 (MIT). [Paper]
- DINOv: "Visual In-Context Prompting", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
- Stable-SAM: "Stable Segment Anything Model", arXiv, 2023 (Kuaishou). [Paper][Code (in construction)]
- EfficientSAM: "EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything", arXiv, 2023 (Meta). [Paper]
- EdgeSAM: "EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
- RepViT-SAM: "RepViT-SAM: Towards Real-Time Segmenting Anything", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
- SlimSAM: "0.1% Data Makes Segment Anything Slim", arXiv, 2023 (NUS). [Paper][PyTorch]
- FIND: "Interfacing Foundation Models' Embeddings", arXiv, 2023 (Microsoft). [Paper][PyTorch (in construction)][Website]
- SqueezeSAM: "SqueezeSAM: User-friendly mobile interactive segmentation", arXiv, 2023 (Meta). [Paper]
- TAP: "Tokenize Anything via Prompting", arXiv, 2023 (BAAI). [Paper][PyTorch]
- MobileSAMv2: "MobileSAMv2: Faster Segment Anything to Everything", arXiv, 2023 (Kyung Hee University). [Paper][PyTorch]
- TinySAM: "TinySAM: Pushing the Envelope for Efficient Segment Anything Model", arXiv, 2023 (Huawei). [Paper][PyTorch]
- Conv-LoRA: "Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model", ICLR, 2024 (Amazon). [Paper][PyTorch]
- PerSAM: "Personalize Segment Anything Model with One Shot", ICLR, 2024 (CUHK). [Paper][PyTorch]
- VRP-SAM: "VRP-SAM: SAM with Visual Reference Prompt", CVPR, 2024 (Baidu). [Paper]
- UAD: "Unsegment Anything by Simulating Deformation", CVPR, 2024 (NUS). [Paper][PyTorch]
- ASAM: "ASAM: Boosting Segment Anything Model with Adversarial Tuning", CVPR, 2024 (vivo). [Paper][PyTorch][Website]
- PTQ4SAM: "PTQ4SAM: Post-Training Quantization for Segment Anything", CVPR, 2024 (Beihang). [Paper][PyTorch]
- BA-SAM: "BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model", arXiv, 2024 (Shanghai Jiao Tong). [Paper]
- OV-SAM: "Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively", arXiv, 2024 (NTU, Singapore). [Paper][PyTorch][Website]
- SSPrompt: "Learning to Prompt Segment Anything Models", arXiv, 2024 (NTU, Singapore). [Paper]
- RAP-SAM: "RAP-SAM: Towards Real-Time All-Purpose Segment Anything", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch][Website]
- PA-SAM: "PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation", arXiv, 2024 (OPPO). [Paper][PyTorch]
- Grounded-SAM: "Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks", arXiv, 2024 (IDEA). [Paper][PyTorch]
- EfficientViT-SAM: "EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss", arXiv, 2024 (NVIDIA). [Paper][PyTorch]
- DeiSAM: "DeiSAM: Segment Anything with Deictic Prompting", arXiv, 2024 (TU Darmstadt, Germany). [Paper]
- CAT-SAM: "CAT-SAM: Conditional Tuning Network for Few-Shot Adaptation of Segmentation Anything Model", arXiv, 2024 (NTU, Singapore). [Paper][PyTorch (in construction)][Website]
- BLO-SAM: "BLO-SAM: Bi-level Optimization Based Overfitting-Preventing Finetuning of SAM", arXiv, 2024 (UCSD). [Paper][PyTorch]
- P<sup>2</sup>SAM: "Part-aware Personalized Segment Anything Model for Patient-Specific Segmentation", arXiv, 2024 (UMich). [Paper]
- RA: "Practical Region-level Attack against Segment Anything Models", arXiv, 2024 (UIUC). [Paper]
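Most SAM variants above keep the original promptable interface. A minimal usage sketch with Meta's official `segment-anything` package (the checkpoint filename is the one distributed with the SAM repo; the image here is a stand-in):

```python
# Prompt-based SAM inference sketch using the official segment-anything package.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for an HxWx3 RGB image
predictor.set_image(image)                       # compute the image embedding once

# One foreground point prompt (label 1 = foreground, 0 = background).
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return 3 candidate masks at different granularities
)
```

The image embedding is computed once per image and reused across prompts, which is why the lightweight variants above (MobileSAM, EfficientSAM, EdgeSAM, etc.) focus on making that encoder cheaper.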
- Vision-Language:
- LSeg: "Language-driven Semantic Segmentation", ICLR, 2022 (Cornell). [Paper][PyTorch]
- ZegFormer: "Decoupling Zero-Shot Semantic Segmentation", CVPR, 2022 (Wuhan University). [Paper][PyTorch]
- CLIPSeg: "Image Segmentation Using Text and Image Prompts", CVPR, 2022 (University of Göttingen, Germany). [Paper][PyTorch]
- DenseCLIP: "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting", CVPR, 2022 (Tsinghua University). [Paper][PyTorch][Website]
- GroupViT: "GroupViT: Semantic Segmentation Emerges from Text Supervision", CVPR, 2022 (NVIDIA). [Paper][Website][PyTorch]
- MaskCLIP: "Extract Free Dense Labels from CLIP", ECCV, 2022 (NTU, Singapore). [Paper][PyTorch][Website] (see the sketch at the end of this sub-list)
- ViewCo: "ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency", ICLR, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
- LMSeg: "LMSeg: Language-guided Multi-dataset Segmentation", ICLR, 2023 (Alibaba). [Paper]
- VL-Fields: "VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations", ICRA, 2023 (University of Edinburgh, UK). [Paper][Website]
- X-Decoder: "Generalized Decoding for Pixel, Image, and Language", CVPR, 2023 (Microsoft). [Paper][PyTorch][Website]
- IFSeg: "IFSeg: Image-free Semantic Segmentation via Vision-Language Model", CVPR, 2023 (KAIST). [Paper][PyTorch]
- SAZS: "Delving into Shape-aware Zero-shot Semantic Segmentation", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
- CLIP-S<sup>4</sup>: "CLIP-S<sup>4</sup>: Language-Guided Self-Supervised Semantic Segmentation", CVPR, 2023 (Bosch). [Paper]
- D<sup>2</sup>Zero: "Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation", CVPR, 2023 (Zhejiang University). [Paper][Code (in construction)][Website]
- PADing: "Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation", CVPR, 2023 (Zhejiang University). [Paper][PyTorch][Website]
- LD-ZNet: "LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation", ICCV, 2023 (Amazon). [Paper][PyTorch][Website]
- MAFT: "Learning Mask-aware CLIP Representations for Zero-Shot Segmentation", NeurIPS, 2023 (Picsart). [Paper][PyTorch]
- PGSeg: "Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation", NeurIPS, 2023 (Shanghai Jiao Tong). [Paper][PyTorch]
- MESS: "What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation", NeurIPS (Datasets and Benchmarks), 2023 (IBM). [Paper][PyTorch][Website]
- ZegOT: "ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts", arXiv, 2023 (KAIST). [Paper]
- SimCon: "SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation", arXiv, 2023 (Amazon). [Paper]
- DiffusionSeg: "DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
- ASCG: "Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation", arXiv, 2023 (ByteDance). [Paper]
- ClsCLIP: "[CLS] Token is All You Need for Zero-Shot Semantic Segmentation", arXiv, 2023 (Eastern Institute for Advanced Study, China). [Paper]
- CLIPTeacher: "CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation", arXiv, 2023 (Nagoya University). [Paper]
- SAM-CLIP: "SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding", arXiv, 2023 (Apple). [Paper]
- GEM: "Grounding Everything: Emerging Localization Properties in Vision-Language Transformers", arXiv, 2023 (University of Bonn, Germany). [Paper][PyTorch]
- CaR: "CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor", arXiv, 2023 (Google). [Paper][Code (in construction)][Website]
- SPT: "Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation", AAAI, 2024 (Beijing University of Posts and Telecommunications). [Paper][PyTorch (in construction)]
- FMbSeg: "Annotation Free Semantic Segmentation with Vision Foundation Models", arXiv, 2024 (Toyota). [Paper]
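The training-free core shared by MaskCLIP-style entries above is reading dense labels directly out of CLIP. A rough sketch with the Hugging Face CLIP port; note this is a simplification (MaskCLIP additionally rewires the final attention pooling, omitted here):

```python
# Naive zero-shot dense labeling sketch: classify CLIP patch tokens against text.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

classes = ["a photo of a cat", "a photo of a dog", "a photo of grass"]
pixels = torch.zeros(1, 3, 224, 224)  # stand-in for a preprocessed image tensor

with torch.no_grad():
    text_inputs = processor(text=classes, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)       # (C, D)
    patches = model.vision_model(pixel_values=pixels).last_hidden_state[:, 1:, :]
    patches = model.vision_model.post_layernorm(patches)    # (1, 196, d), CLS dropped
    patch_emb = model.visual_projection(patches)            # (1, 196, D)

    patch_emb = patch_emb / patch_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    labels = (patch_emb @ text_emb.T).argmax(-1).reshape(1, 14, 14)  # coarse label map
```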
- Open-World/Vocabulary:
- ViL-Seg: "Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding", ECCV, 2022 (CUHK). [Paper]
- OVSS: "A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model", ECCV, 2022 (Microsoft). [Paper][PyTorch]
- OpenSeg: "Scaling Open-Vocabulary Image Segmentation with Image-Level Labels", ECCV, 2022 (Google). [Paper]
- Fusioner: "Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models", BMVC, 2022 (Shanghai Jiao Tong University). [Paper][Website]
- OVSeg: "Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
- ZegCLIP: "ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation", CVPR, 2023 (The University of Adelaide, Australia). [Paper][PyTorch]
- TCL: "Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs", CVPR, 2023 (Kakao). [Paper][PyTorch]
- ODISE: "Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models", CVPR, 2023 (NVIDIA). [Paper][PyTorch][Website]
- Mask-free-OVIS: "Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations", CVPR, 2023 (Salesforce). [Paper][PyTorch (in construction)]
- FreeSeg: "FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation", CVPR, 2023 (ByteDance). [Paper]
- SAN: "Side Adapter Network for Open-Vocabulary Semantic Segmentation", CVPR, 2023 (Microsoft). [Paper][PyTorch]
- OVSegmentor: "Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision", CVPR, 2023 (Fudan University). [Paper][PyTorch][Website]
- PACL: "Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning", CVPR, 2023 (Meta). [Paper]
- MaskCLIP: "Open-Vocabulary Universal Image Segmentation with MaskCLIP", ICML, 2023 (UCSD). [Paper][Website]
- SegCLIP: "SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation", ICML, 2023 (JD). [Paper][PyTorch]
- SWORD: "Exploring Transformers for Open-world Instance Segmentation", ICCV, 2023 (HKU). [Paper]
- Grounded-Diffusion: "Open-vocabulary Object Segmentation with Diffusion Models", ICCV, 2023 (Shanghai Jiao Tong). [Paper][PyTorch][Website]
- SegPrompt: "SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning", ICCV, 2023 (Zhejiang). [Paper][PyTorch]
- CGG: "Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation", ICCV, 2023 (SenseTime). [Paper][PyTorch][Website]
- OpenSeeD: "A Simple Framework for Open-Vocabulary Segmentation and Detection", ICCV, 2023 (IDEA). [Paper][PyTorch]
- OPSNet: "Open-vocabulary Panoptic Segmentation with Embedding Modulation", ICCV, 2023 (HKU). [Paper]
- GKC: "Global Knowledge Calibration for Fast Open-Vocabulary Segmentation", ICCV, 2023 (ByteDance). [Paper]
- ZeroSeg: "Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only", ICCV, 2023 (Meta). [Paper]
- MasQCLIP: "MasQCLIP for Open-Vocabulary Universal Image Segmentation", ICCV, 2023 (UCSD). [Paper][PyTorch][Website]
- VLPart: "Going Denser with Open-Vocabulary Part Segmentation", ICCV, 2023 (HKU). [Paper][PyTorch]
- DeOP: "Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network", ICCV, 2023 (Meituan). [Paper][PyTorch]
- MixReorg: "MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation", ICCV, 2023 (Sun Yat-sen University). [Paper]
- OV-PARTS: "OV-PARTS: Towards Open-Vocabulary Part Segmentation", NeurIPS (Datasets and Benchmarks), 2023 (Shanghai AI Lab). [Paper][PyTorch]
- HIPIE: "Hierarchical Open-vocabulary Universal Image Segmentation", NeurIPS, 2023 (Berkeley). [Paper][PyTorch][Website]
- ?: "Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation", NeurIPS, 2023 (Shanghai Jiao Tong). [Paper]
- FC-CLIP: "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP", NeurIPS, 2023 (ByteDance). [Paper][PyTorch]
- WLSegNet: "A Language-Guided Benchmark for Weakly Supervised Open Vocabulary Semantic Segmentation", arXiv, 2023 (IIT, New Delhi). [Paper]
- CAT-Seg: "CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (Korea University). [Paper][PyTorch][Website]
- MVP-SEG: "MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (Xiaohongshu, China). [Paper]
- TagCLIP: "TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation", arXiv, 2023 (CUHK). [Paper]
- OVDiff: "Diffusion Models for Zero-Shot Open-Vocabulary Segmentation", arXiv, 2023 (Oxford). [Paper][Website]
- UOVN: "Unified Open-Vocabulary Dense Visual Prediction", arXiv, 2023 (Monash University). [Paper]
- CLIP-DIY: "CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free", arXiv, 2023 (Warsaw University of Technology, Poland). [Paper]
- Entity: "Rethinking Evaluation Metrics of Open-Vocabulary Segmentation", arXiv, 2023 (Harbin Engineering University). [Paper][PyTorch]
- OSM: "Towards Open-Ended Visual Recognition with Large Language Model", arXiv, 2023 (ByteDance). [Paper][PyTorch]
- SED: "SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (Tianjin). [Paper][PyTorch (in construction)]
- PnP-OVSS: "Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models", arXiv, 2023 (NTU, Singapore). [Paper]
- SCLIP: "SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference", arXiv, 2023 (JHU). [Paper]
- GranSAM: "Towards Granularity-adjusted Pixel-level Semantic Annotation", arXiv, 2023 (UC Riverside). [Paper]
- Sambor: "Boosting Segment Anything Model Towards Open-Vocabulary Learning", arXiv, 2023 (Huawei). [Paper][Code (in construction)]
- SCAN: "Open-Vocabulary Segmentation with Semantic-Assisted Calibration", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
- Self-Seg: "Self-Guided Open-Vocabulary Semantic Segmentation", arXiv, 2023 (UvA). [Paper]
- OpenSD: "OpenSD: Unified Open-Vocabulary Segmentation and Detection", arXiv, 2023 (OPPO). [Paper]
- CLIP-DINOiser: "CLIP-DINOiser: Teaching CLIP a few DINO tricks", arXiv, 2023 (Warsaw University of Technology, Poland). [Paper][PyTorch]
- TagAlign: "TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification", arXiv, 2023 (Ant Group). [Paper][PyTorch][Website]
- OVFoodSeg: "OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation", CVPR, 2024 (Singapore Management University (SMU)). [Paper]
- FreeDA: "Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation", CVPR, 2024 (University of Modena and Reggio Emilia (UniMoRe), Italy). [Paper][Website]
- S-Seg: "Exploring Simple Open-Vocabulary Semantic Segmentation", arXiv, 2024 (Oxford). [Paper][Code (in construction)]
- PosSAM: "PosSAM: Panoptic Open-vocabulary Segment Anything", arXiv, 2024 (Qualcomm). [Paper][Code (in construction)][Website]
- LLM-based:
- LISA: "LISA: Reasoning Segmentation via Large Language Model", arXiv, 2023 (CUHK). [Paper][PyTorch]
- PixelLM: "PixelLM: Pixel Reasoning with Large Multimodal Model", arXiv, 2023 (ByteDance). [Paper][Code (in construction)][Website]
- PixelLLM: "Pixel Aligned Language Models", arXiv, 2023 (Google). [Paper][Website]
- GSVA: "GSVA: Generalized Segmentation via Multimodal Large Language Models", arXiv, 2023 (Tsinghua). [Paper]
- LISA++: "An Improved Baseline for Reasoning Segmentation with Large Language Model", arXiv, 2023 (CUHK). [Paper]
- GROUNDHOG: "GROUNDHOG: Grounding Large Language Models to Holistic Segmentation", CVPR, 2024 (Amazon). [Paper][Website]
- PSALM: "PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model", arXiv, 2024 (Huazhong University of Science and Technology). [Paper][PyTorch]
- LLaVASeg: "Empowering Segmentation Ability to Multi-modal Large Language Models", arXiv, 2024 (vivo). [Paper]
- LaSagnA: "LaSagnA: Language-based Segmentation Assistant for Complex Queries", arXiv, 2024 (Meituan). [Paper][PyTorch]
- Universal Segmentation:
- K-Net: "K-Net: Towards Unified Image Segmentation", NeurIPS, 2021 (NTU, Singapore). [Paper][PyTorch]
- Mask2Former: "Masked-attention Mask Transformer for Universal Image Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch][Website] (see the inference sketch after this sub-list)
- MP-Former: "MP-Former: Mask-Piloted Transformer for Image Segmentation", CVPR, 2023 (IDEA). [Paper][Code (in construction)]
- OneFormer: "OneFormer: One Transformer to Rule Universal Image Segmentation", CVPR, 2023 (Oregon). [Paper][PyTorch][Website]
- UNINEXT: "Universal Instance Perception as Object Discovery and Retrieval", CVPR, 2023 (ByteDance). [Paper][PyTorch]
- ClustSeg: "CLUSTSEG: Clustering for Universal Segmentation", ICML, 2023 (Rochester Institute of Technology). [Paper]
- DaTaSeg: "DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model", NeurIPS, 2023 (Google). [Paper]
- DFormer: "DFormer: Diffusion-guided Transformer for Universal Image Segmentation", arXiv, 2023 (Tianjin University). [Paper][Code (in construction)]
- ?: "A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task", arXiv, 2023 (OMRON SINIC X, Japan). [Paper]
- Mask2Anomaly: "Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation", arXiv, 2023 (Politecnico di Torino, Italy). [Paper]
- SegGen: "SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis", arXiv, 2023 (Adobe). [Paper][Code (in construction)][Website]
- PolyMaX: "PolyMaX: General Dense Prediction with Mask Transformer", WACV, 2024 (Google). [Paper][Tensorflow]
- PEM: "PEM: Prototype-based Efficient MaskFormer for Image Segmentation", CVPR, 2024 (Politecnico di Torino, Italy). [Paper][Code (in construction)]
- OMG-Seg: "OMG-Seg: Is One Model Good Enough For All Segmentation?", arXiv, 2024 (NTU, Singapore). [Paper][PyTorch][Website]
- Uni-OVSeg: "Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision", arXiv, 2024 (University of Sydney). [Paper][PyTorch (in construction)]
- PRO-SCALE: "Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation", arXiv, 2024 (NEC). [Paper]
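The query-based mask transformers in this sub-list expose essentially the same interface: N learned queries, each emitting a class prediction and a mask, merged into semantic/instance/panoptic outputs. A minimal inference sketch via the Hugging Face port of Mask2Former (the checkpoint id is an assumption; check the hub for available ones):

```python
# Panoptic inference sketch with the HF Mask2Former port; checkpoint id assumed.
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

ckpt = "facebook/mask2former-swin-tiny-coco-panoptic"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = Mask2FormerForUniversalSegmentation.from_pretrained(ckpt)

image = Image.new("RGB", (640, 480))  # stand-in for a real image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # per-query class logits + mask logits

# Merge the per-query masks into a panoptic map at the input resolution.
panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]  # PIL size is (W, H); we need (H, W)
)[0]
panoptic["segmentation"]   # (H, W) tensor of segment ids
panoptic["segments_info"]  # per-segment class labels and scores
```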
- Multi-Modal:
- UCTNet: "UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation", ECCV, 2022 (Lehigh University, Pennsylvania). [Paper]
- CMX: "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
- DeLiVER: "Delivering Arbitrary-Modal Semantic Segmentation", CVPR, 2023 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch][Website]
- DFormer: "DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation", arXiv, 2023 (Nankai University). [Paper][PyTorch]
- Sigma: "Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation", arXiv, 2024 (CMU). [Paper][PyTorch]
- Panoptic Segmentation:
- MaX-DeepLab: "MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers", CVPR, 2021 (Google). [Paper][PyTorch (conradry)]
- SIAin: "An End-to-End Trainable Video Panoptic Segmentation Method using Transformers", arXiv, 2021 (SI Analytics, South Korea). [Paper]
- VPS-Transformer: "Time-Space Transformers for Video Panoptic Segmentation", WACV, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
- CMT-DeepLab: "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", CVPR, 2022 (Google). [Paper]
- Panoptic-SegFormer: "Panoptic SegFormer", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
- kMaX-DeepLab: "k-means Mask Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
- Panoptic-PartFormer: "Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation", ECCV, 2022 (Peking). [Paper][PyTorch]
- CoMFormer: "CoMFormer: Continual Learning in Semantic and Panoptic Segmentation", CVPR, 2023 (Sorbonne Université, France). [Paper]
- YOSO: "You Only Segment Once: Towards Real-Time Panoptic Segmentation", CVPR, 2023 (Xiamen University). [Paper][PyTorch]
- Pix2Seq-D: "A Generalist Framework for Panoptic Segmentation of Images and Videos", ICCV, 2023 (DeepMind). [Paper][Tensorflow2]
- DeepDPS: "Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning", ICCV, 2023 (Dalian University of Technology). [Paper][Code (in construction)]
- ReMaX: "ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation", NeurIPS, 2023 (Google). [Paper][Tensorflow2]
- PanopticPartFormer++: "PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation", arXiv, 2023 (Peking). [Paper][PyTorch]
- MaXTron: "MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation", arXiv, 2023 (ByteDance). [Paper]
- ECLIPSE: "ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning", CVPR, 2024 (NAVER). [Paper][Code (in construction)]
- Instance Segmentation:
- ISTR: "ISTR: End-to-End Instance Segmentation with Transformers", arXiv, 2021 (Xiamen University). [Paper][PyTorch]
- Mask-Transfiner: "Mask Transfiner for High-Quality Instance Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch][Website]
- BoundaryFormer: "Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers", CVPR, 2022 (UCSD). [Paper]
- PPT: "Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation", CVPRW, 2022 (ByteDance). [Paper]
- TOIST: "TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch]
- MAL: "Vision Transformers Are Good Mask Auto-Labelers", CVPR, 2023 (NVIDIA). [Paper][PyTorch]
- FastInst: "FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation", CVPR, 2023 (Alibaba). [Paper][PyTorch]
- SP: "Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt", CVPR, 2023 (Northwestern Polytechnical University, China). [Paper]
- X-Paste: "X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion", ICML, 2023 (USTC). [Paper][PyTorch]
- DynaMITe: "DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer", ICCV, 2023 (RWTH Aachen University, Germany). [Paper][PyTorch][Website]
- Mask-Frozen-DETR: "Mask Frozen-DETR: High Quality Instance Segmentation with One GPU", arXiv, 2023 (Microsoft). [Paper]
- Optical Flow:
- CRAFT: "CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow", CVPR, 2022 (A*STAR, Singapore). [Paper][PyTorch]
- KPA-Flow: "Learning Optical Flow With Kernel Patch Attention", CVPR, 2022 (Megvii). [Paper][PyTorch (in construction)]
- GMFlowNet: "Global Matching with Overlapping Attention for Optical Flow Estimation", CVPR, 2022 (Rutgers). [Paper][PyTorch]
- FlowFormer: "FlowFormer: A Transformer Architecture for Optical Flow", ECCV, 2022 (CUHK). [Paper][Website]
- TransFlow: "TransFlow: Transformer as Flow Learner", CVPR, 2023 (Rochester Institute of Technology). [Paper]
- FlowFormer++: "FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation", CVPR, 2023 (CUHK). [Paper] (a cost-volume sketch follows this list)
- FlowFormer: "FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow", arXiv, 2023 (CUHK). [Paper]
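The cost volume these flow transformers build and then tokenize is just the all-pairs similarity between the two frames' feature maps. A minimal sketch (the 1/sqrt(C) scaling follows RAFT-style practice and is an assumption here):

```python
# All-pairs correlation (cost) volume between two feature maps.
import torch

def all_pairs_cost_volume(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """f1, f2: (B, C, H, W) features -> (B, H, W, H, W) correlation volume."""
    B, C, H, W = f1.shape
    corr = torch.einsum("bci,bcj->bij", f1.flatten(2), f2.flatten(2)) / C**0.5
    return corr.view(B, H, W, H, W)

corr = all_pairs_cost_volume(torch.randn(1, 256, 46, 62), torch.randn(1, 256, 46, 62))
```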
- Panoramic Semantic Segmentation:
- Trans4PASS: "Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation", CVPR, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
- SGAT4PASS: "SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation", IJCAI, 2023 (Tencent). [Paper][Code (in construction)]
- X-Shot:
- CyCTR: "Few-Shot Segmentation via Cycle-Consistent Transformer", NeurIPS, 2021 (University of Technology Sydney). [Paper]
- CATrans: "CATrans: Context and Affinity Transformer for Few-Shot Segmentation", IJCAI, 2022 (Baidu). [Paper]
- VAT: "Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation", ECCV, 2022 (Korea University). [Paper][PyTorch][Website]
- DCAMA: "Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation", ECCV, 2022 (Tencent). [Paper]
- AAFormer: "Adaptive Agent Transformer for Few-Shot Segmentation", ECCV, 2022 (USTC). [Paper]
- IPMT: "Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation", NeurIPS, 2022 (Northwestern Polytechnical University). [Paper][PyTorch]
- TAFT: "Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation", arXiv, 2022 (KAIST). [Paper]
- MSANet: "MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation", arXiv, 2022 (AiV Research Group, Korea). [Paper][PyTorch]
- MuHS: "Suppressing the Heterogeneity: A Strong Feature Extractor for Few-shot Segmentation", ICLR, 2023 (Zhejiang University). [Paper]
- VTM: "Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching", ICLR, 2023 (KAIST). [Paper][PyTorch]
- SegGPT: "SegGPT: Segmenting Everything In Context", ICCV, 2023 (BAAI). [Paper][PyTorch]
- AMFormer: "Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation", NeurIPS, 2023 (ISTC). [Paper][Code (in construction)]
- RefT: "Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
- ?: "Multi-Modal Prototypes for Open-Set Semantic Segmentation", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
- SPINO: "Few-Shot Panoptic Segmentation With Foundation Models", arXiv, 2023 (University of Freiburg, Germany). [Paper][Website]
- ?: "Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach", CVPR, 2024 (UBC). [Paper]
- RefLDM-Seg: "Explore In-Context Segmentation via Latent Diffusion Models", arXiv, 2024 (NTU, Singapore). [Paper][Code (in construction)][Website]
- Chameleon: "Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild", arXiv, 2024 (KAIST). [Paper]
- X-Supervised:
- MCTformer: "Multi-class Token Transformer for Weakly Supervised Semantic Segmentation", CVPR, 2022 (The University of Western Australia). [Paper][Code (in construction)] (see the attention-rollout sketch at the end of this sub-list)
- AFA: "Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers", CVPR, 2022 (Wuhan University). [Paper][PyTorch]
- HSG: "Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers", CVPR, 2022 (Berkeley). [Paper][PyTorch]
- CLIMS: "Cross Language Image Matching for Weakly Supervised Semantic Segmentation", CVPR, 2022 (Shenzhen University). [Paper][PyTorch]
- ?: "Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks", CVPRW, 2022 (Université Paris-Saclay, France). [Paper]
- SegSwap: "Learning Co-segmentation by Segment Swapping for Retrieval and Discovery", CVPRW, 2022 (École des Ponts ParisTech). [Paper][PyTorch][Website]
- ViT-PCM: "Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation", ECCV, 2022 (Sapienza University, Italy). [Paper][Tensorflow]
- TransFGU: "TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation", ECCV, 2022 (Alibaba). [Paper][PyTorch]
- TransCAM: "TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation", arXiv, 2022 (University of Toronto). [Paper][PyTorch]
- WegFormer: "WegFormer: Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2022 (Tongji University, China). [Paper]
- MaskDistill: "Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation", arXiv, 2022 (KU Leuven). [Paper][PyTorch]
- eX-ViT: "eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (La Trobe University, Australia). [Paper]
- TCC: "Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students", arXiv, 2022 (Alibaba). [Paper]
- SemFormer: "SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (Shenzhen University). [Paper][PyTorch]
- CLIP-ES: "CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation", CVPR, 2023 (Zhejiang University). [Paper][PyTorch]
- ToCo: "Token Contrast for Weakly-Supervised Semantic Segmentation", CVPR, 2023 (JD). [Paper][PyTorch]
- DPF: "DPF: Learning Dense Prediction Fields with Weak Supervision", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
- SemiCVT: "SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation", CVPR, 2023 (Zhejiang University). [Paper]
- AttentionShift: "AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation", CVPR, 2023 (CAS). [Paper]
- MMCST: "Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization", CVPR, 2023 (The University of Western Australia). [Paper]
- SimSeg: "A Simple Framework for Text-Supervised Semantic Segmentation", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
- SIM: "SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation", CVPR, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch (in construction)]
- Point2Mask: "Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport", ICCV, 2023 (Zhejiang). [Paper][PyTorch]
- BoxSnake: "BoxSnake: Polygonal Instance Segmentation with Box Supervision", ICCV, 2023 (Tencent). [Paper]
- QA-CLIMS: "Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation", ACMMM, 2023 (Shenzhen University). [Paper][Code (in construction)]
- CoCu: "Bridging Semantic Gaps for Language-Supervised Semantic Segmentation", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch]
- APro: "Label-efficient Segmentation via Affinity Propagation", NeurIPS, 2023 (Zhejiang). [Paper][PyTorch][Website]
- PaintSeg: "PaintSeg: Training-free Segmentation via Painting", NeurIPS, 2023 (Microsoft). [Paper]
- SmooSeg: "SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation", NeurIPS, 2023 (NTU, Singapore). [Paper][PyTorch]
- VLOSS: "Towards Universal Vision-language Omni-supervised Segmentation", arXiv, 2023 (Harbin Institute of Technology). [Paper]
- MECPformer: "MECPformer: Multi-estimations Complementary Patch with CNN-Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2023 (Tongji University). [Paper][Code (in construction)]
- WeakTr: "WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation", arXiv, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
- SAM-WSSS: "An Alternative to WSSS? An Empirical Study of the Segment Anything Model (SAM) on Weakly-Supervised Semantic Segmentation Problems", arXiv, 2023 (ANU). [Paper]
- ?: "Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation", arXiv, 2023 (Zhejiang University + Nankai University). [Paper]
- AReAM: "Mitigating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 (Zhejiang University). [Paper]
- SEPL: "Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation", arXiv, 2023 (OSU). [Paper][Code (in construction)]
- MIMIC: "MIMIC: Masked Image Modeling with Image Correspondences", arXiv, 2023 (UW). [Paper][PyTorch]
- POLE: "Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation", arXiv, 2023 (ETS Montreal, Canada). [Paper][PyTorch]
- GD: "Guided Distillation for Semi-Supervised Instance Segmentation", arXiv, 2023 (Meta). [Paper]
- MCTformer+: "MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 (The University of Western Australia). [Paper][PyTorch]
- MMC: "Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding", arXiv, 2023 (University of Surrey, UK). [Paper]
- CRATE: "Emergence of Segmentation with Minimalistic White-Box Transformers", arXiv, 2023 (Berkeley). [Paper][PyTorch]
- ?: "Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models", arXiv, 2023 (Singapore Management University). [Paper]
- MCC: "Masked Collaborative Contrast for Weakly Supervised Semantic Segmentation", arXiv, 2023 (Zhejiang Lab, China). [Paper][PyTorch]
- CRATE: "White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?", arXiv, 2023 (Berkeley). [Paper][PyTorch][Website]
- SAMS: "Foundation Model Assisted Weakly Supervised Semantic Segmentation", arXiv, 2023 (Zhejiang). [Paper]
- SemiVL: "SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance", arXiv, 2023 (Google). [Paper][PyTorch]
- Self-reinforcement: "Progressive Uncertain Feature Self-reinforcement for Weakly Supervised Semantic Segmentation", AAAI, 2024 (Zhejiang Lab). [Paper][PyTorch]
- FeatUp: "FeatUp: A Model-Agnostic Framework for Features at Any Resolution", ICLR, 2024 (MIT). [Paper]
- Zip-Your-CLIP: "The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models", ICLR, 2024 (ShanghaiTech). [Paper][PyTorch]
- SeCo: "Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation", CVPR, 2024 (Fudan). [Paper][Code (in construction)]
- AllSpark: "AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation", CVPR, 2024 (HKUST). [Paper][PyTorch]
- CPAL: "Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation", CVPR, 2024 (Monash University). [Paper][Code (in construction)]
- DuPL: "DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation", CVPR, 2024 (Shanghai University). [Paper][PyTorch]
- CoDe: "Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation", CVPR, 2024 (NTU). [Paper][Code (in construction)]
- SemPLeS: "SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation", arXiv, 2024 (NVIDIA). [Paper]
- WeakSAM: "WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition", arXiv, 2024 (Huazhong University of Science & Technology (HUST)). [Paper][PyTorch]
- CoSA: "Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation", arXiv, 2024 (Lancaster University, UK). [Paper][Code (in construction)]
- CoBra: "CoBra: Complementary Branch Fusing Class and Semantic Knowledge for Robust Weakly Supervised Semantic Segmentation", arXiv, 2024 (Yonsei). [Paper]
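A recurring primitive behind the weakly supervised entries above is a localization map read out of ViT attention, which the methods then refine into pseudo masks. A minimal sketch of attention rollout (Abnar & Zuidema's recipe: mix in the residual identity, renormalize, multiply across layers):

```python
# Attention rollout: aggregate per-layer attention into CLS-to-patch scores.
import torch

def attention_rollout(attns):
    """attns: list of per-layer, head-averaged attention maps, each (B, L, L),
    with the CLS token first. Returns CLS-to-patch scores of shape (B, L-1)."""
    B, L, _ = attns[0].shape
    rollout = torch.eye(L).expand(B, L, L)
    for a in attns:
        a = 0.5 * a + 0.5 * torch.eye(L)     # account for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)  # renormalize rows
        rollout = a @ rollout
    return rollout[:, 0, 1:]

# e.g. 12 ViT-B/16 layers on 224x224 input: 196 patches + 1 CLS token -> L = 197
maps = attention_rollout([torch.rand(1, 197, 197).softmax(-1) for _ in range(12)])
heatmap = maps.reshape(1, 14, 14)  # coarse localization map
```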
- Cross-Domain:
- DAFormer: "DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch]
- HGFormer: "HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation", CVPR, 2023 (Wuhan University). [Paper][Code (in construction)]
- UniDAformer: "UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration", CVPR, 2023 (NTU, Singapore). [Paper]
- MIC: "MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation", CVPR, 2023 (ETHZ). [Paper][PyTorch]
- CDAC: "CDAC: Cross-domain Attention Consistency in Transformer for Domain Adaptive Semantic Segmentation", ICCV, 2023 (Boston). [Paper][PyTorch]
- EDAPS: "EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation", ICCV, 2023 (ETHZ). [Paper][PyTorch]
- PTDiffSeg: "Prompting Diffusion Representations for Cross-Domain Semantic Segmentation", arXiv, 2023 (ETHZ). [Paper][Code (in construction)]
- Rein: "Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation", arXiv, 2023 (USTC). [Paper]
- Continual Learning:
- TISS: "Delving into Transformer for Incremental Semantic Segmentation", arXiv, 2022 (Tencent). [Paper]
- Incrementer: "Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class", CVPR, 2023 (University of Electronic Science and Technology of China). [Paper]
- Crack Detection:
- CrackFormer: "CrackFormer: Transformer Network for Fine-Grained Crack Detection", ICCV, 2021 (Nanjing University of Science and Technology). [Paper]
- Camouflaged/Concealed Object:
- UGTR: "Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection", ICCV, 2021 (Group42, Abu Dhabi). [Paper][PyTorch]
- COD: "Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer", ICPR, 2022 (Anhui University, China). [Paper][Code (in construction)]
- OSFormer: "OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
- FSPNet: "Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers", CVPR, 2023 (Sichuan Changhong Electric, China). [Paper][PyTorch][Website]
- MFG: "Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping", NeurIPS, 2023 (Tsinghua). [Paper][Code (in construction)]
- Background Separation:
- TransBlast: "TransBlast: Self-Supervised Learning Using Augmented Subspace With Transformer for Background/Foreground Separation", ICCVW, 2021 (University of British Columbia). [Paper]
- Scene Understanding:
- BANet: "Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images", arXiv, 2021 (Wuhan University). [Paper]
- Cerberus-Transformer: "Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
- IRISformer: "IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes", CVPR, 2022 (UCSD). [Paper][Code (in construction)]
- 3D Segmentation:
- Stratified-Transformer: "Stratified Transformer for 3D Point Cloud Segmentation", CVPR, 2022 (CUHK). [Paper][PyTorch]
- CodedVTR: "CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance", CVPR, 2022 (Tsinghua). [Paper]
- M2F3D: "M2F3D: Mask2Former for 3D Instance Segmentation", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper][Website]
- 3DSeg: "3D Segmenter: 3D Transformer based Semantic Segmentation via 2D Panoramic Distillation", ICLR, 2023 (The University of Tokyo). [Paper]
- Analogical-Network: "Analogical Networks for Memory-Modulated 3D Parsing", ICLR, 2023 (CMU). [Paper]
- VoxFormer: "VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion", CVPR, 2023 (NVIDIA). [Paper][PyTorch]
- GrowSP: "GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds", CVPR, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
- RangeViT: "RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving", CVPR, 2023 (Valeo.ai, France). [Paper][Code (in construction)]
- MeshFormer: "Heat Diffusion based Multi-scale and Geometric Structure-aware Transformer for Mesh Segmentation", CVPR, 2023 (University of Macau). [Paper]
- MSeg3D: "MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving", CVPR, 2023 (Zhejiang University). [Paper][PyTorch]
- SGVF-SVFE: "See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data", ICCV, 2023 (ShanghaiTech). [Paper]
- SVQNet: "SVQNet: Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic Segmentation", ICCV, 2023 (Tsinghua). [Paper]
- MAF-Transformer: "Mask-Attention-Free Transformer for 3D Instance Segmentation", ICCV, 2023 (CUHK). [Paper][PyTorch]
- UniSeg: "UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- MIT: "2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision", ICCV, 2023 (NTU). [Paper]
- CVSformer: "CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion", ICCV, 2023 (Tianjin University). [Paper]
- SPT: "Efficient 3D Semantic Segmentation with Superpoint Transformer", ICCV, 2023 (Univ Gustave Eiffel, France). [Paper][PyTorch]
- SATR: "SATR: Zero-Shot Semantic Segmentation of 3D Shapes", ICCV, 2023 (KAUST). [Paper][PyTorch][Website]
- 3D-OWIS: "3D Indoor Instance Segmentation in an Open-World", NeurIPS, 2023 (MBZUAI). [Paper]
- SA3D: "Segment Anything in 3D with NeRFs", NeurIPS, 2023 (SJTU). [Paper][PyTorch][Website]
- Contrastive-Lift: "Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion", NeurIPS, 2023 (Oxford). [Paper][PyTorch][Website]
- P3Former: "Position-Guided Point Cloud Panoptic Segmentation Transformer", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- UnScene3D: "UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes", arXiv, 2023 (TUM). [Paper][Website]
- CNS: "Towards Label-free Scene Understanding by Vision Foundation Models", NeurIPS, 2023 (HKU). [Paper][Code (in construction)]
- DCTNet: "Dynamic Clustering Transformer Network for Point Cloud Segmentation", arXiv, 2023 (University of Waterloo, Canada). [Paper]
- Symphonies: "Symphonize 3D Semantic Scene Completion with Contextual Instance Queries", arXiv, 2023 (Horizon Robotics). [Paper][PyTorch]
- TFS3D: "Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks", arXiv, 2023 (CUHK). [Paper][PyTorch]
- CIP-WPIS: "When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision", arXiv, 2023 (Australian National University). [Paper]
- ?: "SAM-guided Unsupervised Domain Adaptation for 3D Segmentation", arXiv, 2023 (ShanghaiTech). [Paper]
- CSF: "Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation", arXiv, 2023 (NTU, Singapore). [Paper]
- ?: "Understanding Self-Supervised Features for Learning Unsupervised Instance Segmentation", arXiv, 2023 (Oxford). [Paper]
- OneFormer3D: "OneFormer3D: One Transformer for Unified Point Cloud Segmentation", arXiv, 2023 (Samsung). [Paper]
- SAGA: "Segment Any 3D Gaussians", arXiv, 2023 (SJTU). [Paper][Code (in construction)][Website]
- SANeRF-HQ: "SANeRF-HQ: Segment Anything for NeRF in High Quality", arXiv, 2023 (HKUST). [Paper][Code (in construction)][Website]
- SAM-Graph: "SAM-guided Graph Cut for 3D Instance Segmentation", arXiv, 2023 (Zhejiang). [Paper][Code (in construction)][Website]
- SAI3D: "SAI3D: Segment Any Instance in 3D Scenes", arXiv, 2023 (Peking). [Paper]
- COSeg: "Rethinking Few-shot 3D Point Cloud Semantic Segmentation", CVPR, 2024 (ETHZ). [Paper][Code (in construction)]
- CSC: "Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception", CVPR, 2024 (East China Normal University). [Paper][Code (in construction)]
- Multi-Task:
- InvPT: "Inverted Pyramid Multi-task Transformer for Dense Scene Understanding", ECCV, 2022 (HKUST). [Paper][PyTorch]
- MTFormer: "MTFormer: Multi-task Learning via Transformer and Cross-Task Reasoning", ECCV, 2022 (CUHK). [Paper]
- MQTransformer: "Multi-Task Learning with Multi-Query Transformer for Dense Prediction", arXiv, 2022 (Wuhan University). [Paper]
- DeMT: "DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction", AAAI, 2023 (Wuhan University). [Paper][PyTorch]
- TaskPrompter: "TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding", ICLR, 2023 (HKUST). [Paper][PyTorch (in construction)]
- AiT: "All in Tokens: Unifying Output Space of Visual Tasks via Soft Token", ICCV, 2023 (Microsoft). [Paper][PyTorch]
- InvPT++: "InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding", arXiv, 2023 (HKUST). [Paper]
- DeMTG: "Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction", arXiv, 2023 (Wuhan University). [Paper][PyTorch]
- SRT: "Sub-token ViT Embedding via Stochastic Resonance Transformers", arXiv, 2023 (UCLA). [Paper]
- MLoRE: "Multi-Task Dense Prediction via Mixture of Low-Rank Experts", CVPR, 2024 (vivo). [Paper]
- ODIN: "ODIN: A Single Model for 2D and 3D Perception", arXiv, 2024 (CMU). [Paper][Code (in construction)][Website]
- LiFT: "LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors", arXiv, 2024 (Maryland). [Paper]
- Forecasting:
- DiffAttn: "Joint Forecasting of Panoptic Segmentations with Difference Attention", CVPR, 2022 (UIUC). [Paper][Code (in construction)]
- LiDAR:
- HelixNet: "Online Segmentation of LiDAR Sequences: Dataset and Algorithm", CVPRW, 2022 (CNRS, France). [Paper][Website][PyTorch]
- Gaussian-Radar-Transformer: "Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data", RA-L, 2022 (University of Bonn, Germany). [Paper]
- MOST: "Lidar Panoptic Segmentation and Tracking without Bells and Whistles", IROS, 2023 (CMU). [Paper][PyTorch]
- 4D-Former: "4D-Former: Multimodal 4D Panoptic Segmentation", CoRL, 2023 (Waabi, Canada). [Paper][Website]
- MASK4D: "MASK4D: Mask Transformer for 4D Panoptic Segmentation", arXiv, 2023 (RWTH Aachen University, Germany). [Paper]
- Co-Segmentation:
- ReCo: "ReCo: Retrieve and Co-segment for Zero-shot Transfer", NeurIPS, 2022 (Oxford). [Paper][PyTorch][Website]
- DINO-ViT-feature: "Deep ViT Features as Dense Visual Descriptors", arXiv, 2022 (Weizmann Institute of Science, Israel). [Paper][PyTorch][Website]
- LCCo: "LCCo: Lending CLIP to Co-Segmentation", arXiv, 2023 (Beijing Institute of Technology). [Paper]
- Top-Down Semantic Segmentation:
- Trans4Map: "Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
- Surface Normal:
- Normal-Transformer: "Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics", arXiv, 2022 (University of Technology Sydney). [Paper]
- Applications:
- FloodTransformer: "Transformer-based Flood Scene Segmentation for Developing Countries", NeurIPSW, 2022 (BITS Pilani, India). [Paper]
- Diffusion:
- VPD: "Unleashing Text-to-Image Diffusion Models for Visual Perception", ICCV, 2023 (Tsinghua University). [Paper][PyTorch][Website]
- Dataset-Diffusion: "Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation", NeurIPS, 2023 (VinAI, Vietnam). [Paper][PyTorch][Website]
- SegRefiner: "SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process", NeurIPS, 2023 (ByteDance). [Paper][PyTorch]
- DatasetDM: "DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models", NeurIPS, 2023 (Zhejiang). [Paper][PyTorch][Website]
- DiffSeg: "Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion", arXiv, 2023 (Georgia Tech). [Paper]
- DiffSegmenter: "Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter", arXiv, 2023 (Beihang University). [Paper]
- ?: "From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models", arXiv, 2023 (Tsinghua). [Paper]
- LDMSeg: "A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting", arXiv, 2024 (Segments.ai, Belgium). [Paper][PyTorch]
- Low-Level Structure Segmentation:
- EVP: "Explicit Visual Prompting for Low-Level Structure Segmentations", CVPR, 2023 (Tencent). [Paper][PyTorch]
- EVP: "Explicit Visual Prompting for Universal Foreground Segmentations", arXiv, 2023 (Tencent). [Paper][PyTorch]
- EmerDiff: "EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models", arXiv, 2024 (NVIDIA). [Paper][Website]
- Zero-Guidance Segmentation:
- Part Segmentation:
- Entity Segmentation:
- Evaluation:
- Interactive Segmentation:
- InterFormer: "InterFormer: Real-time Interactive Image Segmentation", ICCV, 2023 (Xiamen University). [Paper][PyTorch]
- SimpleClick: "SimpleClick: Interactive Image Segmentation with Simple Vision Transformers", ICCV, 2023 (UNC). [Paper][PyTorch]
- iCMFormer: "Interactive Image Segmentation with Cross-Modality Vision Transformers", arXiv, 2023 (University of Twente, Netherlands). [Paper][Code (in construction)]
- MFP: "MFP: Making Full Use of Probability Maps for Interactive Image Segmentation", CVPR, 2024 (Korea University). [Paper][Code (in construction)]
- GraCo: "GraCo: Granularity-Controllable Interactive Segmentation", CVPR, 2024 (Peking). [Paper][Website]
- Amodal Segmentation:
- AISFormer: "AISFormer: Amodal Instance Segmentation with Transformer", BMVC, 2022 (University of Arkansas). [Paper][PyTorch]
- C2F-Seg: "Coarse-to-Fine Amodal Segmentation with Shape Prior", ICCV, 2023 (Fudan). [Paper][Code (in construction)][Website]
- EoRaS: "Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation", ICCV, 2023 (Fudan). [Paper][Code (in construction)]
- MP3D-Amodal: "Amodal Ground Truth and Completion in the Wild", arXiv, 2023 (Oxford). [Paper][Website (in construction)]
- Anomaly Segmentation:
- In-Context Segmentation:
- SEGIC: "SegIC: Unleashing the Emergent Correspondence for In-Context Segmentation", arXiv, 2023 (Fudan). [Paper][Code (in construction)]
Video (High-level)
Action Recognition
- RGB mainly
- Action Transformer: "Video Action Transformer Network", CVPR, 2019 (DeepMind). [Paper][Code (ppriyank)]
- ViViT-Ensemble: "Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition", CVPRW, 2021 (Alibaba). [Paper]
- TimeSformer: "Is Space-Time Attention All You Need for Video Understanding?", ICML, 2021 (Facebook). [Paper][PyTorch (lucidrains)] (see the divided-attention sketch at the end of this sub-list)
- MViT: "Multiscale Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
- VidTr: "VidTr: Video Transformer Without Convolutions", ICCV, 2021 (Amazon). [Paper][PyTorch]
- ViViT: "ViViT: A Video Vision Transformer", ICCV, 2021 (Google). [Paper][PyTorch (rishikksh20)]
- VTN: "Video Transformer Network", ICCVW, 2021 (Theator). [Paper][PyTorch]
- TokShift: "Token Shift Transformer for Video Classification", ACMMM, 2021 (CUHK). [Paper][PyTorch]
- Motionformer: "Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers", NeurIPS, 2021 (Facebook). [Paper][PyTorch][Website]
- X-ViT: "Space-time Mixing Attention for Video Transformer", NeurIPS, 2021 (Samsung). [Paper][PyTorch]
- SCT: "Shifted Chunk Transformer for Spatio-Temporal Representational Learning", NeurIPS, 2021 (Kuaishou). [Paper]
- RSANet: "Relational Self-Attention: What's Missing in Attention for Video Understanding", NeurIPS, 2021 (POSTECH). [Paper][PyTorch][Website]
- STAM: "An Image is Worth 16x16 Words, What is a Video Worth?", arXiv, 2021 (Alibaba). [Paper][Code]
- GAT: "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training", arXiv, 2021 (Samsung). [Paper]
- TokenLearner: "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?", arXiv, 2021 (Google). [Paper]
- VLF: "VideoLightFormer: Lightweight Action Recognition using Transformers", arXiv, 2021 (The University of Sheffield). [Paper]
- UniFormer: "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning", ICLR, 2022 (CAS + SenseTime). [Paper][PyTorch]
- Video-Swin: "Video Swin Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- DirecFormer: "DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition", CVPR, 2022 (University of Arkansas). [Paper][Code (in construction)]
- DVT: "Deformable Video Transformer", CVPR, 2022 (Meta). [Paper]
- MeMViT: "MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition", CVPR, 2022 (Meta). [Paper]
- MLP-3D: "MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing", CVPR, 2022 (JD). [Paper][PyTorch (in construction)]
- RViT: "Recurring the Transformer for Video Action Recognition", CVPR, 2022 (TCL Corporate Research, HK). [Paper]
- SIFA: "Stand-Alone Inter-Frame Attention in Video Models", CVPR, 2022 (JD). [Paper][PyTorch]
- MViTv2: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection", CVPR, 2022 (Meta). [Paper][PyTorch]
- MTV: "Multiview Transformers for Video Recognition", CVPR, 2022 (Google). [Paper][Tensorflow]
- ORViT: "Object-Region Video Transformers", CVPR, 2022 (Tel Aviv). [Paper][Website]
- TIME: "Time Is MattEr: Temporal Self-supervision for Video Transformers", ICML, 2022 (KAIST). [Paper][PyTorch]
- TPS: "Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition", ECCV, 2022 (Alibaba). [Paper][PyTorch]
- DualFormer: "DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
- STTS: "Efficient Video Transformers with Spatial-Temporal Token Selection", ECCV, 2022 (Fudan University). [Paper][PyTorch]
- Turbo: "Turbo Training with Token Dropout", BMVC, 2022 (Oxford). [Paper]
- MultiTrain: "Multi-dataset Training of Transformers for Robust Action Recognition", NeurIPS, 2022 (Tencent). [Paper][Code (in construction)]
- SViT: "Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens", NeurIPS, 2022 (Tel Aviv). [Paper][Website]
- ST-Adapter: "ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning", NeurIPS, 2022 (CUHK). [Paper][Code (in construction)]
- ATA: "Alignment-guided Temporal Attention for Video Action Recognition", NeurIPS, 2022 (Microsoft). [Paper]
- AIA: "Attention in Attention: Modeling Context Correlation for Efficient Video Classification", TCSVT, 2022 (University of Science and Technology of China). [Paper][PyTorch]
- MSCA: "Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition", arXiv, 2022 (Nagoya Institute of Technology). [Paper]
- VAST: "Efficient Attention-free Video Shift Transformers", arXiv, 2022 (Samsung). [Paper]
- Video-MobileFormer: "Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling", arXiv, 2022 (Microsoft). [Paper]
- MAM<sup>2</sup>: "It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training", arXiv, 2022 (Baidu). [Paper]
- ?: "Linear Video Transformer with Feature Fixation", arXiv, 2022 (SenseTime). [Paper]
- STAN: "Two-Stream Transformer Architecture for Long Video Understanding", arXiv, 2022 (The University of Surrey, UK). [Paper]
- PatchBlender: "PatchBlender: A Motion Prior for Video Transformers", arXiv, 2022 (Mila). [Paper]
- DualPath: "Dual-path Adaptation from Image to Video Transformers", CVPR, 2023 (Yonsei University). [Paper][PyTorch (in construction)]
- S-ViT: "Streaming Video Model", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
- TubeViT: "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning", CVPR, 2023 (Google). [Paper]
- AdaMAE: "AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders", CVPR, 2023 (JHU). [Paper][PyTorch]
- ObjectViViT: "How can objects help action recognition?", CVPR, 2023 (Google). [Paper]
- Hiera: "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles", ICML, 2023 (Meta). [Paper][PyTorch]
- Video-FocalNet: "Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition", ICCV, 2023 (MBZUAI). [Paper][PyTorch][Website]
- ATM: "What Can Simple Arithmetic Operations Do for Temporal Modeling?", ICCV, 2023 (Baidu). [Paper][Code (in construction)]
- STA: "Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation", ICCV, 2023 (Huawei). [Paper]
- Helping-Hands: "Helping Hands: An Object-Aware Ego-Centric Video Recognition Model", ICCV, 2023 (Oxford). [Paper][PyTorch]
- SUM-L: "Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition", ICCV, 2023 (University of Delaware). [Paper][Code (in construction)]
- BEAR: "A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition", ICCV, 2023 (UCF). [Paper][GitHub]
- UniFormerV2: "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer", ICCV, 2023 (CAS). [Paper][PyTorch]
- CAST: "CAST: Cross-Attention in Space and Time for Video Action Recognition", NeurIPS, 2023 (Kyung Hee University). [Paper][PyTorch][Website]
- PPMA: "Learning Human Action Recognition Representations Without Real Humans", NeurIPS (Datasets and Benchmarks), 2023 (IBM). [Paper][PyTorch]
- SVT: "SVT: Supertoken Video Transformer for Efficient Video Understanding", arXiv, 2023 (Meta). [Paper]
- PLAR: "Prompt Learning for Action Recognition", arXiv, 2023 (Maryland). [Paper]
- SFA-ViViT: "Optimizing ViViT Training: Time and Memory Reduction for Action Recognition", arXiv, 2023 (Google). [Paper]
- TAdaConv: "Temporally-Adaptive Models for Efficient Video Understanding", arXiv, 2023 (NUS). [Paper][PyTorch]
- ZeroI2V: "ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video", arXiv, 2023 (Nanjing University). [Paper]
- MV-Former: "Multi-entity Video Transformers for Fine-Grained Video Representation Learning", arXiv, 2023 (Meta). [Paper][PyTorch]
- GeoDeformer: "GeoDeformer: Geometric Deformable Transformer for Action Recognition", arXiv, 2023 (HKUST). [Paper]
- Early-ViT: "Early Action Recognition with Action Prototypes", arXiv, 2023 (Amazon). [Paper]
- MCA: "Don't Judge by the Look: A Motion Coherent Augmentation for Video Recognition", ICLR, 2024 (Northeastern University). [Paper][PyTorch]
- StructViT: "Learning Correlation Structures for Vision Transformers", CVPR, 2024 (POSTECH). [Paper]
- VideoMamba: "VideoMamba: State Space Model for Efficient Video Understanding", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
- Video-Mamba-Suite: "Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
- Depth:
- Trear: "Trear: Transformer-based RGB-D Egocentric Action Recognition", IEEE Transactions on Cognitive and Developmental Systems, 2021 (Tianjing University). [Paper]
- Pose/Skeleton:
- ST-TR: "Spatial Temporal Transformer Network for Skeleton-based Action Recognition", ICPRW, 2020 (Polytechnic University of Milan). [Paper]
- AcT: "Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition", arXiv, 2021 (Politecnico di Torino, Italy). [Paper][Code (in construction)]
- STAR: "STAR: Sparse Transformer-based Action Recognition", arXiv, 2021 (UCLA). [Paper]
- GCsT: "GCsT: Graph Convolutional Skeleton Transformer for Action Recognition", arXiv, 2021 (CAS). [Paper]
- GL-Transformer: "Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning", ECCV, 2022 (Seoul National University). [Paper][PyTorch]
- ?: "Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer", International Conference on Multimodal Interaction (ICMI), 2022 (University of Delaware). [Paper]
- FG-STFormer: "Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition", ACCV, 2022 (Zhengzhou University). [Paper]
- STTFormer: "Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition", arXiv, 2022 (Xidian University). [Paper][Code (in construction)]
- ProFormer: "ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
- ?: "Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition", arXiv, 2022 (Harbin Institute of Technology). [Paper]
- HyperSA: "Hypergraph Transformer for Skeleton-based Action Recognition", arXiv, 2022 (University of Mannheim, Germany). [Paper]
- STAR-Transformer: "STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition", WACV, 2023 (Keimyung University, Korea). [Paper]
- STMT: "STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition", CVPR, 2023 (CMU). [Paper][Code (in construction)]
- SkeletonMAE: "SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training", ICCV, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
- MAMP: "Masked Motion Predictors are Strong 3D Action Representation Learners", ICCV, 2023 (USTC). [Paper][PyTorch]
- LAC: "LAC - Latent Action Composition for Skeleton-based Action Segmentation", ICCV, 2023 (INRIA). [Paper][Website]
- SkeleTR: "SkeleTR: Towards Skeleton-based Action Recognition in the Wild", ICCV, 2023 (Amazon). [Paper]
- PCM<sup>3</sup>: "Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning", ACMMM, 2023 (Peking). [Paper][Website]
- PoseAwareVT: "Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers", arXiv, 2023 (Amazon). [Paper][PyTorch]
- HandFormer: "On the Utility of 3D Hand Poses for Action Recognition", arXiv, 2024 (NUS). [Paper][Code (in construction)][Website]
- SkateFormer: "SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition", arXiv, 2024 (KAIST). [Paper][Code (in construction)][Website]
- Multi-modal:
- MBT: "Attention Bottlenecks for Multimodal Fusion", NeurIPS, 2021 (Google). [Paper]
- MM-ViT: "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition", WACV, 2022 (OPPO). [Paper]
- MMT-NCRC: "Multimodal Transformer for Nursing Activity Recognition", CVPRW, 2022 (UCF). [Paper][Code (in construction)]
- M&M: "M&M Mix: A Multimodal Multiview Transformer Ensemble", CVPRW, 2022 (Google). [Paper]
- VT-CE: "Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition", CVPRW, 2022 (A*STAR). [Paper]
- Hi-TRS: "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning", ECCV, 2022 (Rutgers). [Paper][PyTorch]
- MVFT: "Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition", arXiv, 2022 (Alibaba). [Paper]
- MOV: "Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models", arXiv, 2022 (Google). [Paper]
- 3Mformer: "3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition", CVPR, 2023 (ANU). [Paper]
- UMT: "On Uni-Modal Feature Learning in Supervised Multi-Modal Learning", ICML, 2023 (Tsinghua). [Paper]
- ?: "Multimodal Distillation for Egocentric Action Recognition", ICCV, 2023 (KU Leuven). [Paper]
- MotionBERT: "MotionBERT: Unified Pretraining for Human Motion Analysis", ICCV, 2023 (Peking University). [Paper][PyTorch][Website]
- TIM: "TIM: A Time Interval Machine for Audio-Visual Action Recognition", CVPR, 2024 (University of Bristol + Oxford). [Paper][PyTorch][Website]
- Group Activity:
- GroupFormer: "GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer", ICCV, 2021 (SenseTime). [Paper]
- ?: "Hunting Group Clues with Transformers for Social Group Activity Recognition", ECCV, 2022 (Hitachi). [Paper]
- GAFL: "Learning Group Activity Features Through Person Attribute Prediction", CVPR, 2024 (Toyota Technological Institute, Japan). [Paper]
Action Detection/Localization
- OadTR: "OadTR: Online Action Detection with Transformers", ICCV, 2021 (Huazhong University of Science and Technology). [Paper][PyTorch]
- RTD-Net: "Relaxed Transformer Decoders for Direct Action Proposal Generation", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
- FS-TAL: "Few-Shot Temporal Action Localization with Query Adaptive Transformer", BMVC, 2021 (University of Surrey, UK). [Paper][PyTorch]
- LSTR: "Long Short-Term Transformer for Online Action Detection", NeurIPS, 2021 (Amazon). [Paper][PyTorch][Website]
- ATAG: "Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation", arXiv, 2021 (Alibaba). [Paper]
- TAPG-Transformer: "Temporal Action Proposal Generation with Transformers", arXiv, 2021 (Harbin Institute of Technology). [Paper]
- TadTR: "End-to-end Temporal Action Detection with Transformer", arXiv, 2021 (Alibaba). [Paper][Code (in construction)]
- Vidpress-Soccer: "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection", arXiv, 2021 (Baidu). [Paper][GitHub]
- MS-TCT: "MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection", CVPR, 2022 (INRIA). [Paper][PyTorch]
- UGPT: "Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition", CVPR, 2022 (Rensselaer Polytechnic Institute, NY). [Paper]
- TubeR: "TubeR: Tube-Transformer for Action Detection", CVPR, 2022 (Amazon). [Paper]
- DDM-Net: "Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
- ?: "Dual-Stream Transformer for Generic Event Boundary Captioning", CVPRW, 2022 (ByteDance). [Paper][PyTorch]
- ?: "Exploring Anchor-based Detection for Ego4D Natural Language Query", arXiv, 2022 (Renmin University of China). [Paper]
- EAMAT: "Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos", IJCAI, 2022 (Beijing Institute of Technology). [Paper][Code (in construction)]
- STPT: "An Efficient Spatio-Temporal Pyramid Transformer for Action Detection", ECCV, 2022 (Monash University, Australia). [Paper]
- TeSTra: "Real-time Online Video Detection with Temporal Smoothing Transformers", ECCV, 2022 (UT Austin). [Paper][PyTorch]
- TALLFormer: "TALLFormer: Temporal Action Localization with Long-memory Transformer", ECCV, 2022 (UNC). [Paper][PyTorch]
- ?: "Uncertainty-Based Spatial-Temporal Attention for Online Action Detection", ECCV, 2022 (Rensselaer Polytechnic Institute, NY). [Paper]
- ActionFormer: "ActionFormer: Localizing Moments of Actions with Transformers", ECCV, 2022 (UW-Madison). [Paper][PyTorch]
- ActionFormer: "Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge", ECCVW, 2022 (UW-Madison). [Paper][Pytorch]
- CoOadTR: "Continual Transformers: Redundancy-Free Attention for Online Inference", arXiv, 2022 (Aarhus University, Denmark). [Paper][PyTorch]
- Temporal-Perceiver: "Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection", arXiv, 2022 (Nanjing University). [Paper]
- LocATe: "LocATe: End-to-end Localization of Actions in 3D with Transformers", arXiv, 2022 (Stanford). [Paper]
- HTNet: "HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers", arXiv, 2022 (Korea University). [Paper]
- AdaPerFormer: "Adaptive Perception Transformer for Temporal Action Localization", arXiv, 2022 (Tianjin University). [Paper]
- CWC-Trans: "A Circular Window-based Cascade Transformer for Online Action Detection", arXiv, 2022 (Meituan). [Paper]
- HIT: "Holistic Interaction Transformer Network for Action Detection", WACV, 2023 (NTHU). [Paper][PyTorch]
- LART: "On the Benefits of 3D Pose and Tracking for Human Action Recognition", CVPR, 2023 (Meta). [Paper][Website]
- TranS4mer: "Efficient Movie Scene Detection using State-Space Transformers", CVPR, 2023 (Comcast). [Paper]
- TTM: "Token Turing Machines", CVPR, 2023 (Google). [Paper][JAX]
- ?: "Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection", CVPR, 2023 (NAVER). [Paper]
- Self-DETR: "Self-Feedback DETR for Temporal Action Detection", ICCV, 2023 (Sungkyunkwan University, Korea). [Paper]
- UnLoc: "UnLoc: A Unified Framework for Video Localization Tasks", ICCV, 2023 (Google). [Paper][JAX]
- EVAD: "Efficient Video Action Detection with Token Dropout and Context Refinement", ICCV, 2023 (Nanjing University). [Paper][PyTorch]
- MS-DETR: "MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction", ACL, 2023 (NTU, Singapore). [Paper][PyTorch]
- STAR: "End-to-End Spatio-Temporal Action Localisation with Video Transformers", arXiv, 2023 (Google). [Paper]
- DiffTAD: "DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion", arXiv, 2023 (University of Surrey, UK). [Paper][PyTorch (in construction)]
- MNA-ZBD: "No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection", arXiv, 2023 (Renmin University of China). [Paper]
- PAT: "PAT: Position-Aware Transformer for Dense Multi-Label Action Detection", arXiv, 2023 (University of Surrey, UK). [Paper]
- ViT-TAD: "Adapting Short-Term Transformers for Action Detection in Untrimmed Videos", arXiv, 2023 (Nanjing University (NJU)). [Paper]
- Cafe: "Towards More Practical Group Activity Detection: A New Benchmark and Model", arXiv, 2023 (POSTECH). [Paper][PyTorch][Website]
- ?: "Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization", arXiv, 2023 (Queen Mary, UK). [Paper]
- SMAST: "A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection", TPAMI, 2024 (University of Virginia). [Paper]
- OV-STAD: "Open-Vocabulary Spatio-Temporal Action Detection", arXiv, 2024 (Nanjing University). [Paper]
Action Prediction/Anticipation
- AVT: "Anticipative Video Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
- TTPP: "TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation", Neurocomputing, 2021 (CAS). [Paper]
- HORST: "Higher Order Recurrent Space-Time Transformer", arXiv, 2021 (NVIDIA). [Paper][PyTorch]
- ?: "Action Forecasting with Feature-wise Self-Attention", arXiv, 2021 (A*STAR). [Paper]
- FUTR: "Future Transformer for Long-term Action Anticipation", CVPR, 2022 (POSTECH). [Paper]
- VPTR: "VPTR: Efficient Transformers for Video Prediction", ICPR, 2022 (Polytechnique Montreal, Canada). [Paper][PyTorch]
- Earthformer: "Earthformer: Exploring Space-Time Transformers for Earth System Forecasting", NeurIPS, 2022 (Amazon). [Paper]
- InAViT: "Interaction Visual Transformer for Egocentric Action Anticipation", arXiv, 2022 (A*STAR). [Paper]
- VPTR: "Video Prediction by Efficient Transformers", IVC, 2022 (Polytechnique Montreal, Canada). [Paper][Pytorch]
- AFFT: "Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation", WACV, 2023 (Karlsruhe Institute of Technology, Germany). [Paper][Code (in construction)]
- GliTr: "GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction", WACV, 2023 (McGill University, Canada). [Paper]
- RAFTformer: "Latency Matters: Real-Time Action Forecasting Transformer", CVPR, 2023 (Honda). [Paper]
- AdamsFormer: "AdamsFormer for Spatial Action Localization in the Future", CVPR, 2023 (Honda). [Paper]
- TemPr: "The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction", CVPR, 2023 (University of Bristol). [Paper][PyTorch][Website]
- MAT: "Memory-and-Anticipation Transformer for Online Action Understanding", ICCV, 2023 (Nanjing University). [Paper][PyTorch]
- SwinLSTM: "SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM", ICCV, 2023 (Hainan University). [Paper][PyTorch]
- MVP: "Multiscale Video Pretraining for Long-Term Activity Forecasting", arXiv, 2023 (Boston). [Paper]
- DiffAnt: "DiffAnt: Diffusion Models for Action Anticipation", arXiv, 2023 (Karlsruhe Institute of Technology (KIT), Germany). [Paper]
- LALM: "LALM: Long-Term Action Anticipation with Language Models", arXiv, 2023 (ETHZ). [Paper]
- ?: "Learning from One Continuous Video Stream", arXiv, 2023 (DeepMind). [Paper]
- ObjectPrompt: "Object-centric Video Representation for Long-term Action Anticipation", WACV, 2024 (Honda). [Paper][Code (in construction)]
Video Object Segmentation
- GC: "Fast Video Object Segmentation using the Global Context Module", ECCV, 2020 (Tencent). [Paper]
- SSTVOS: "SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation", CVPR, 2021 (Modiface). [Paper][Code (in construction)]
- JOINT: "Joint Inductive and Transductive Learning for Video Object Segmentation", ICCV, 2021 (University of Science and Technology of China). [Paper][PyTorch]
- AOT: "Associating Objects with Transformers for Video Object Segmentation", NeurIPS, 2021 (Zhejiang University). [Paper][PyTorch (yoxu515)][Code (in construction)]
- TransVOS: "TransVOS: Video Object Segmentation with Transformers", arXiv, 2021 (Zhejiang University). [Paper]
- SITVOS: "Siamese Network with Interactive Transformer for Video Object Segmentation", AAAI, 2022 (JD). [Paper]
- HODOR: "Differentiable Soft-Masked Attention", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper]
- BATMAN: "BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation", ECCV, 2022 (Microsoft). [Paper]
- DeAOT: "Decoupling Features in Hierarchical Propagation for Video Object Segmentation", NeurIPS, 2022 (Zhejiang University). [Paper][PyTorch]
- AOT: "Associating Objects with Scalable Transformers for Video Object Segmentation", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
- MED-VT: "MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation", CVPR, 2023 (York University). [Paper][Website]
- ?: "Boosting Video Object Segmentation via Space-time Correspondence Learning", CVPR, 2023 (Shanghai Jiao Tong University (SJTU)). [Paper]
- Isomer: "Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation", ICCV, 2023 (Dalian University of Technology). [Paper][PyTorch]
- SimVOS: "Scalable Video Object Segmentation with Simplified Framework", ICCV, 2023 (CUHK). [Paper]
- MITS: "Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation", ICCV, 2023 (Zhejiang University). [Paper][PyTorch]
- VIPMT: "Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation", ICCV, 2023 (MBZUAI). [Paper][Code (in construction)]
- MOSE: "MOSE: A New Dataset for Video Object Segmentation in Complex Scenes", ICCV, 2023 (NTU, Singapore). [Paper][GitHub][Website]
- LVOS: "LVOS: A Benchmark for Long-term Video Object Segmentation", ICCV, 2023 (Fudan). [Paper][GitHub][Website]
- JointFormer: "Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation", arXiv, 2023 (Nanjing University). [Paper]
- PanoVOS: "PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation", arXiv, 2023 (Fudan). [Paper][Code (in construction)][Website]
- Cutie: "Putting the Object Back into Video Object Segmentation", arXiv, 2023 (UIUC). [Paper][PyTorch][Website]
- M<sup>3</sup>T: "M<sup>3</sup>T: Multi-Scale Memory Matching for Video Object Segmentation and Tracking", arXiv, 2023 (UBC). [Paper]
- ?: "Appearance-based Refinement for Object-Centric Motion Segmentation", arXiv, 2023 (Oxford). [Paper]
- DATTT: "Depth-aware Test-Time Training for Zero-shot Video Object Segmentation", CVPR, 2024 (University of Macau). [Paper][PyTorch][Website]
- LLE-VOS: "Event-assisted Low-Light Video Object Segmentation", CVPR, 2024 (USTC). [Paper]
- Point-VOS: "Point-VOS: Pointing Up Video Object Segmentation", arXiv, 2024 (RWTH Aachen University, Germany). [Paper][Website]
- MAVOS: "Efficient Video Object Segmentation via Modulated Cross-Attention Memory", arXiv, 2024 (MBZUAI). [Paper][Code (in construction)]
- STMA: "Spatial-Temporal Multi-level Association for Video Object Segmentation", arXiv, 2024 (Harbin Institute of Technology). [Paper]
- Flow-SAM: "Moving Object Segmentation: All You Need Is SAM (and Flow)", arXiv, 2024 (Oxford). [Paper][Website]
- LVOSv2: "LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation", arXiv, 2024 (Fudan). [Paper][GitHub][Website]
Video Instance Segmentation
- VisTR: "End-to-End Video Instance Segmentation with Transformers", CVPR, 2021 (Meituan). [Paper][PyTorch]
- IFC: "Video Instance Segmentation using Inter-Frame Communication Transformers", NeurIPS, 2021 (Yonsei University). [Paper][PyTorch]
- Deformable-VisTR: "Deformable VisTR: Spatio temporal deformable attention for video instance segmentation", ICASSP, 2022 (University at Buffalo). [Paper][Code (in construction)]
- TeViT: "Temporally Efficient Vision Transformer for Video Instance Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
- GMP-VIS: "A Graph Matching Perspective With Transformers on Video Instance Segmentation", CVPR, 2022 (Shandong University). [Paper]
- VMT: "Video Mask Transfiner for High-Quality Video Instance Segmentation", ECCV, 2022 (ETHZ). [Paper][GitHub][Website]
- SeqFormer: "SeqFormer: Sequential Transformer for Video Instance Segmentation", ECCV, 2022 (ByteDance). [Paper][PyTorch]
- MS-STS: "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
- MinVIS: "MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training", NeurIPS, 2022 (NVIDIA). [Paper][PyTorch]
- VITA: "VITA: Video Instance Segmentation via Object Token Association", NeurIPS, 2022 (Yonsei University). [Paper][PyTorch]
- IFR: "Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention", arXiv, 2022 (Microsoft). [Paper]
- DeVIS: "DeVIS: Making Deformable Transformers Work for Video Instance Segmentation", arXiv, 2022 (TUM). [Paper][PyTorch]
- InstanceFormer: "InstanceFormer: An Online Video Instance Segmentation Framework", arXiv, 2022 (Ludwig Maximilian University of Munich). [Paper][Code (in construction)]
- MaskFreeVIS: "Mask-Free Video Instance Segmentation", CVPR, 2023 (ETHZ). [Paper][PyTorch]
- MDQE: "MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos", CVPR, 2023 (Hong Kong Polytechnic University). [Paper][PyTorch]
- GenVIS: "A Generalized Framework for Video Instance Segmentation", CVPR, 2023 (Yonsei). [Paper][PyTorch]
- CTVIS: "CTVIS: Consistent Training for Online Video Instance Segmentation", ICCV, 2023 (Zhejiang University). [Paper][PyTorch]
- TCOVIS: "TCOVIS: Temporally Consistent Online Video Instance Segmentation", ICCV, 2023 (Tsinghua). [Paper][Code (in construction)]
- DVIS: "DVIS: Decoupled Video Instance Segmentation Framework", ICCV, 2023 (Wuhan University). [Paper][PyTorch]
- TMT-VIS: "TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation", NeurIPS, 2023 (HKU). [Paper][Code (in construction)]
- BoxVIS: "BoxVIS: Video Instance Segmentation with Box Annotations", arXiv, 2023 (Hong Kong Polytechnic University). [Paper][Code (in construction)]
- OW-VISFormer: "Video Instance Segmentation in an Open-World", arXiv, 2023 (MBZUAI). [Paper][Code (in construction)]
- GRAtt-VIS: "GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation", arXiv, 2023 (LMU Munich). [Paper][Code (in construction)]
- RefineVIS: "RefineVIS: Video Instance Segmentation with Temporal Attention Refinement", arXiv, 2023 (Microsoft). [Paper]
- VideoCutLER: "VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation", arXiv, 2023 (Meta). [Paper][PyTorch]
- NOVIS: "NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation", arXiv, 2023 (TUM). [Paper]
- VISAGE: "VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement", arXiv, 2023 (Yonsei). [Paper][Code (in construction)]
- OW-VISCap: "OW-VISCap: Open-World Video Instance Segmentation and Captioning", arXiv, 2024 (UIUC). [Paper][Website]
- PointVIS: "What is Point Supervision Worth in Video Instance Segmentation?", arXiv, 2024 (NVIDIA). [Paper]
Other Video Tasks
- Action Segmentation
- ASFormer: "ASFormer: Transformer for Action Segmentation", BMVC, 2021 (Peking University). [Paper][PyTorch]
- Bridge-Prompt: "Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
- SC-Transformer++: "SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection", CVPRW, 2022 (CAS). [Paper][Code (in construction)]
- UVAST: "Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation", ECCV, 2022 (Bosch). [Paper][PyTorch]
- ?: "Transformers in Action: Weakly Supervised Action Segmentation", arXiv, 2022 (TUM). [Paper]
- CETNet: "Cross-Enhancement Transformer for Action Segmentation", arXiv, 2022 (Shijiazhuang Tiedao University). [Paper]
- EUT: "Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation", arXiv, 2022 (CAS). [Paper]
- SC-Transformer: "Structured Context Transformer for Generic Event Boundary Detection", arXiv, 2022 (CAS). [Paper]
- DXFormer: "Enhancing Transformer Backbone for Egocentric Video Action Segmentation", CVPRW, 2023 (Northeastern University). [Paper][Website (in construction)]
- LTContext: "How Much Temporal Long-Term Context is Needed for Action Segmentation?", ICCV, 2023 (University of Bonn). [Paper][PyTorch]
- DiffAct: "Diffusion Action Segmentation", ICCV, 2023 (The University of Sydney). [Paper][PyTorch]
- TST: "Temporal Segment Transformer for Action Segmentation", arXiv, 2023 (Shanghai Tech). [Paper]
- Video X Segmentation:
- STT: "Video Semantic Segmentation via Sparse Temporal Transformer", MM, 2021 (Shanghai Jiao Tong). [Paper]
- CFFM: "Coarse-to-Fine Feature Mining for Video Semantic Segmentation", CVPR, 2022 (ETH Zurich). [Paper][PyTorch]
- TF-DL: "TubeFormer-DeepLab: Video Mask Transformer", CVPR, 2022 (Google). [Paper]
- Video-K-Net: "Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation", CVPR, 2022 (Peking University). [Paper][PyTorch]
- MRCFA: "Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation", ECCV, 2022 (ETH Zurich). [Paper][PyTorch]
- PolyphonicFormer: "PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation", ECCV, 2022 (Wuhan University). [Paper][Code (in construction)]
- ?: "Time-Space Transformers for Video Panoptic Segmentation", arXiv, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
- CAROQ: "Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation", CVPR, 2023 (UIUC). [Paper][PyTorch][Website]
- TarViS: "TarViS: A Unified Approach for Target-based Video Segmentation", CVPR, 2023 (RWTH Aachen University, Germany). [Paper][PyTorch]
- MEGA: "MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation", ICCV, 2023 (Amazon). [Paper]
- DEVA: "Tracking Anything with Decoupled Video Segmentation", ICCV, 2023 (UIUC). [Paper][PyTorch][Website]
- Tube-Link: "Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch]
- THE-Mask: "Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation", BMVC, 2023 (ETHZ). [Paper][Code (in construction)]
- MPVSS: "Mask Propagation for Efficient Video Semantic Segmentation", NeurIPS, 2023 (Monash University, Australia). [Paper][Code (in construction)]
- Video-kMaX: "Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation", arXiv, 2023 (Google). [Paper]
- SAM-PT: "Segment Anything Meets Point Tracking", arXiv, 2023 (ETHZ). [Paper][Code (in construction)]
- TTT-MAE: "Test-Time Training on Video Streams", arXiv, 2023 (Berkeley). [Paper][Website]
- UniVS: "UniVS: Unified and Universal Video Segmentation with Prompts as Queries", CVPR, 2024 (OPPO). [Paper][PyTorch][Website]
- DVIS++: "DVIS++: Improved Decoupled Framework for Universal Video Segmentation", arXiv, 2024 (Wuhan University). [Paper][PyTorch]
- SAM-PD: "SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising", arXiv, 2024 (Zhejiang). [Paper][PyTorch (in construction)]
- OneVOS: "OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework", arXiv, 2024 (Fudan). [Paper]
- Video Object Detection:
- TransVOD: "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv, 2021 (Shanghai Jiao Tong + SenseTime). [Paper][Code (in construction)]
- MODETR: "MODETR: Moving Object Detection with Transformers", arXiv, 2021 (Valeo, Egypt). [Paper]
- ST-MTL: "Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation", arXiv, 2021 (Valeo, Egypt). [Paper]
- ST-DETR: "ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer", arXiv, 2021 (Valeo, Egypt). [Paper]
- PTSEFormer: "PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection", ECCV, 2022 (Shanghai Jiao Tong University). [Paper][PyTorch]
- TransVOD: "TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers", arXiv, 2022 (Shanghai Jiao Tong + SenseTime). [Paper]
- ?: "Learning Future Object Prediction with a Spatiotemporal Detection Transformer", arXiv, 2022 (Zenseact, Sweden). [Paper]
- ClipVID: "Identity-Consistent Aggregation for Video Object Detection", ICCV, 2023 (University of Adelaide, Australia). [Paper][Code (in construction)]
- OCL: "Unsupervised Open-Vocabulary Object Localization in Videos", ICCV, 2023 (Amazon). [Paper]
- CETR: "Context Enhanced Transformer for Single Image Object Detection", AAAI, 2024 (Korea University). [Paper][Code (in construction)][Website]
- Dense Video Tasks (Detection + Segmentation):
- TDViT: "TDViT: Temporal Dilated Video Transformer for Dense Video Tasks", ECCV, 2022 (Queen's University Belfast, UK). [Paper][Code (in construction)]
- FAQ: "Feature Aggregated Queries for Transformer-Based Video Object Detectors", CVPR, 2023 (UCF). [Paper][PyTorch]
- Video-OWL-ViT: "Video OWL-ViT: Temporally-consistent open-world localization in video", ICCV, 2023 (DeepMind). [Paper]
- Video Retrieval:
- SVRTN: "Self-supervised Video Retrieval Transformer Network", arXiv, 2021 (Alibaba). [Paper]
- Video Hashing:
- Video-Language:
- ActionCLIP: "ActionCLIP: A New Paradigm for Video Action Recognition", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
- ?: "Prompting Visual-Language Models for Efficient Video Understanding", ECCV, 2022 (Shanghai Jiao Tong + Oxford). [Paper][PyTorch][Website]
- X-CLIP: "Expanding Language-Image Pretrained Models for General Video Recognition", ECCV, 2022 (Microsoft). [Paper][PyTorch]
- EVL: "Frozen CLIP Models are Efficient Video Learners", ECCV, 2022 (CUHK). [Paper][PyTorch (in construction)]
- STALE: "Zero-Shot Temporal Action Detection via Vision-Language Prompting", ECCV, 2022 (University of Surrey, UK). [Paper][Code (in construction)]
- ?: "Knowledge Prompting for Few-shot Action Recognition", arXiv, 2022 (Beijing Laboratory of Intelligent Information Technology). [Paper]
- VLG: "VLG: General Video Recognition with Web Textual Knowledge", arXiv, 2022 (Nanjing University). [Paper]
- InternVideo: "InternVideo: General Video Foundation Models via Generative and Discriminative Learning", arXiv, 2022 (Shanghai AI Lab). [Paper][Code (in construction)][Website]
- PromptonomyViT: "PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data", arXiv, 2022 (Tel Aviv + IBM). [Paper]
- MUPPET: "Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation", arXiv, 2022 (Meta). [Paper][Code (in construction)]
- MovieCLIP: "MovieCLIP: Visual Scene Recognition in Movies", WACV, 2023 (USC). [Paper][Website]
- TranZAD: "Semantics Guided Contrastive Learning of Transformers for Zero-Shot Temporal Activity Detection", WACV, 2023 (UC Riverside). [Paper]
- Text4Vis: "Revisiting Classifier: Transferring Vision-Language Models for Video Recognition", AAAI, 2023 (Baidu). [Paper][PyTorch]
- AIM: "AIM: Adapting Image Models for Efficient Video Action Recognition", ICLR, 2023 (Amazon). [Paper][PyTorch][Website]
- ViFi-CLIP: "Fine-tuned CLIP Models are Efficient Video Learners", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
- LaViLa: "Learning Video Representations from Large Language Models", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
- TVP: "Text-Visual Prompting for Efficient 2D Temporal Video Grounding", CVPR, 2023 (Intel). [Paper]
- Vita-CLIP: "Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
- STAN: "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring", CVPR, 2023 (Peking University). [Paper][PyTorch]
- CBP-VLP: "Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization", CVPR, 2023 (Shanghai Jiao Tong). [Paper]
- BIKE: "Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models", CVPR, 2023 (The University of Sydney). [Paper][PyTorch]
- HierVL: "HierVL: Learning Hierarchical Video-Language Embeddings", CVPR, 2023 (Meta). [Paper][PyTorch]
- ?: "Test of Time: Instilling Video-Language Models with a Sense of Time", CVPR, 2023 (University of Amsterdam). [Paper][PyTorch][Website]
- Open-VCLIP: "Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization", ICML, 2023 (Fudan). [Paper][PyTorch]
- ILA: "Implicit Temporal Modeling with Learnable Alignment for Video Recognition", ICCV, 2023 (Fudan). [Paper][PyTorch]
- OV2Seg: "Towards Open-Vocabulary Video Instance Segmentation", ICCV, 2023 (University of Amsterdam). [Paper][PyTorch]
- DiST: "Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning", ICCV, 2023 (Alibaba). [Paper][PyTorch]
- GAP: "Generative Action Description Prompts for Skeleton-based Action Recognition", ICCV, 2023 (Alibaba). [Paper][PyTorch]
- MAXI: "MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge", ICCV, 2023 (Graz University of Technology, Austria). [Paper][PyTorch]
- ?: "Language as the Medium: Multimodal Video Classification through text only", ICCVW, 2023 (Unitary, UK). [Paper]
- MAP: "Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning", ACMMM, 2023 (Tencent). [Paper]
- OTI: "Orthogonal Temporal Interpolation for Zero-Shot Video Recognition", ACMMM, 2023 (CAS). [Paper][Code (in construction)]
- Symbol-LLM: "Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning", NeurIPS, 2023 (Shanghai Jiao Tong University (SJTU)). [Paper][Code (in construction)][Website]
- OAP-AOP: "Opening the Vocabulary of Egocentric Actions", NeurIPS, 2023 (NUS). [Paper][PyTorch (in construction)][Website]
- CLIP-FSAR: "CLIP-guided Prototype Modulating for Few-shot Action Recognition", arXiv, 2023 (Alibaba). [Paper][PyTorch]
- ?: "Multi-modal Prompting for Low-Shot Temporal Action Localization", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
- VicTR: "VicTR: Video-conditioned Text Representations for Activity Recognition", arXiv, 2023 (Google). [Paper]
- OpenVIS: "OpenVIS: Open-vocabulary Video Instance Segmentation", arXiv, 2023 (Fudan). [Paper]
- ALGO: "Discovering Novel Actions in an Open World with Object-Grounded Visual Commonsense Reasoning", arXiv, 2023 (Oklahoma State University). [Paper]
- ?: "Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features", arXiv, 2023 (Google). [Paper]
- MSQNet: "MSQNet: Actor-agnostic Action Recognition with Multi-modal Query", arXiv, 2023 (University of Surrey, England). [Paper][Code (in construction)]
- AVION: "Training a Large Video Model on a Single Machine in a Day", arXiv, 2023 (UT Austin). [Paper][PyTorch]
- Open-VCLIP: "Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data", arXiv, 2023 (Fudan). [Paper][PyTorch]
- Videoprompter: "Videoprompter: an ensemble of foundational models for zero-shot video understanding", arXiv, 2023 (UCF). [Paper]
- MM-VID: "MM-VID: Advancing Video Understanding with GPT-4V(vision)", arXiv, 2023 (Microsoft). [Paper][Website]
- Chat-UniVi: "Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding", arXiv, 2023 (Peking). [Paper]
- Side4Video: "Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
- ALT: "Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition", arXiv, 2023 (Huawei). [Paper]
- MM-Narrator: "MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning", arXiv, 2023 (Microsoft). [Paper][Website]
- Spacewalk-18: "Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains", arXiv, 2023 (Brown). [Paper][Website]
- OST: "OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition", arXiv, 2023 (Hunan University (HNU)). [Paper][Code (in construction)][Website]
- AP-CLIP: "Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition", arXiv, 2023 (Xi'an Jiaotong). [Paper]
- EZ-CLIP: "EZ-CLIP: Efficient Zeroshot Video Action Recognition", arXiv, 2023 (Østfold University College, Norway). [Paper][PyTorch (in construction)]
- M<sup>2</sup>-CLIP: "M<sup>2</sup>-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition", AAAI, 2024 (Zhejiang). [Paper]
- FROSTER: "FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition", ICLR, 2024 (Baidu). [Paper][PyTorch][Website]
- LaIAR: "Language Model Guided Interpretable Video Action Reasoning", CVPR, 2024 (Xidian University). [Paper][Code (in construction)]
- BriVIS: "Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation", arXiv, 2024 (Peking). [Paper][PyTorch (in construction)]
- ActionHub: "ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition", arXiv, 2024 (Sun Yat-sen University). [Paper]
- ZERO: "Zero Shot Open-ended Video Inference", arXiv, 2024 (A*STAR). [Paper]
- SATA: "Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition", arXiv, 2024 (Sun Yat-sen University). [Paper][Code (in construction)]
- CLIP-VIS: "CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation", arXiv, 2024 (Shanghai AI Lab). [Paper][PyTorch]
- X-supervised Learning:
- LSTCL: "Long-Short Temporal Contrastive Learning of Video Transformers", CVPR, 2022 (Facebook). [Paper]
- SVT: "Self-supervised Video Transformer", CVPR, 2022 (Stony Brook). [Paper][PyTorch][Website]
- BEVT: "BEVT: BERT Pretraining of Video Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- SCVRL: "SCVRL: Shuffled Contrastive Video Representation Learning", CVPRW, 2022 (Amazon). [Paper]
- VIMPAC: "VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning", CVPRW, 2022 (UNC). [Paper][PyTorch]
- ?: "Static and Dynamic Concepts for Self-supervised Video Representation Learning", ECCV, 2022 (CUHK). [Paper]
- VideoMAE: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training", NeurIPS, 2022 (Tencent). [Paper][Pytorch]
- MAE-ST: "Masked Autoencoders As Spatiotemporal Learners", NeurIPS, 2022 (Meta). [Paper][PyTorch]
- ?: "On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition", arXiv, 2022 (Georgia Tech). [Paper]
- MaskViT: "MaskViT: Masked Visual Pre-Training for Video Prediction", ICLR, 2023 (Stanford). [Paper][Code (in construction)][Website]
- WeakSVR: "Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos", CVPR, 2023 (ShanghaiTech). [Paper][PyTorch]
- VideoMAE-V2: "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- SVFormer: "SVFormer: Semi-supervised Video Transformer for Action Recognition", CVPR, 2023 (Fudan University). [Paper][PyTorch]
- OmniMAE: "OmniMAE: Single Model Masked Pretraining on Images and Videos", CVPR, 2023 (Meta). [Paper][PyTorch]
- MVD: "Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning", CVPR, 2023 (Fudan Univeristy). [Paper][PyTorch]
- MME: "Masked Motion Encoding for Self-Supervised Video Representation Learning", CVPR, 2023 (South China University of Technology). [Paper][PyTorch]
- MGMAE: "MGMAE: Motion Guided Masking for Video Masked Autoencoding", ICCV, 2023 (Shanghai AI Lab). [Paper]
- MGM: "Motion-Guided Masking for Spatiotemporal Representation Learning", ICCV, 2023 (Amazon). [Paper]
- TimeT: "Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations", ICCV, 2023 (UvA). [Paper][PyTorch]
- LSS: "Language-based Action Concept Spaces Improve Video Self-Supervised Learning", NeurIPS, 2023 (Stony Brook). [Paper]
- VITO: "Self-supervised video pretraining yields human-aligned visual representations", NeurIPS, 2023 (DeepMind). [Paper]
- SiamMAE: "Siamese Masked Autoencoders", NeurIPS, 2023 (Stanford). [Paper][Website]
- ViC-MAE: "Visual Representation Learning from Unlabeled Video using Contrastive Masked Autoencoders", arXiv, 2023 (Rice University). [Paper]
- LSTA: "Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation", arXiv, 2023 (Hangzhou Dianzi University). [Paper]
- DoRA: "Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video", arXiv, 2023 (INRIA). [Paper]
- AMD: "Asymmetric Masked Distillation for Pre-Training Small Foundation Models", arXiv, 2023 (Nanjing University). [Paper]
- SSL-UVOS: "Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation", arXiv, 2023 (CUHK). [Paper]
- NMS: "No More Shortcuts: Realizing the Potential of Temporal Self-Supervision", AAAI, 2024 (Adobe). [Paper][Website]
- VideoMAC: "VideoMAC: Video Masked Autoencoders Meet ConvNets", CVPR, 2024 (Nanjing University of Science and Technology). [Paper]
- GPM: "Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention", arXiv, 2024 (HKUST). [Paper]
- MV2MAE: "MV2MAE: Multi-View Video Masked Autoencoders", arXiv, 2024 (Amazon). [Paper][PyTorch]
- V-JEPA: "Revisiting Feature Prediction for Learning Visual Representations from Video", arXiv, 2024 (Meta). [Paper][PyTorch][Website]
- Transfer Learning/Adaptation:
- X-shot:
- ResT: "Cross-modal Representation Learning for Zero-shot Action Recognition", CVPR, 2022 (Microsoft). [Paper]
- ViSET: "Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding", arXiv, 2022 (University of South Florida). [Paper]
- REST: "REST: REtrieve & Self-Train for generative action recognition", arXiv, 2022 (Samsung). [Paper]
- MoLo: "MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition", CVPR, 2023 (Alibaba). [Paper][Code (in construction)]
- MA-CLIP: "Multimodal Adaptation of CLIP for Few-Shot Action Recognition", arXiv, 2023 (Zhejiang). [Paper]
- SA-CT: "On the Importance of Spatial Relations for Few-shot Action Recognition", arXiv, 2023 (Fudan). [Paper]
- CapFSAR: "Few-shot Action Recognition with Captioning Foundation Models", arXiv, 2023 (Alibaba). [Paper]
- Multi-Task:
- EgoPack: "A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives", CVPR, 2024 (Politecnico di Torino, Italy). [Paper][PyTorch (in construction)][Website]
- Anomaly Detection:
- CT-D2GAN: "Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection", ACMMM, 2021 (NEC). [Paper]
- ADTR: "ADTR: Anomaly Detection Transformer with Feature Reconstruction", International Conference on Neural Information Processing (ICONIP), 2022 (Shanghai Jiao Tong University). [Paper]
- SSMCTB: "Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection", arXiv, 2022 (UCF). [Paper][Code (in construction)]
- ?: "Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection", arXiv, 2022 (Korea University). [Paper]
- CLIP-TSA: "CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection", ICIP, 2023 (University of Arkansas). [Paper]
- ?: "Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features", CVPR, 2023 (Konica Minolta, Japan). [Paper]
- TPWNG: "Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection", CVPR, 2024 (Xidian University). [Paper]
- Relation Detection:
- VidVRD: "Video Relation Detection via Tracklet based Visual Transformer", ACMMMW, 2021 (Zhejiang University). [Paper][PyTorch]
- VRDFormer: "VRDFormer: End-to-End Video Visual Relation Detection With Transformers", CVPR, 2022 (Renmin University of China). [Paper][Code (in construction)]
- VidSGG-BIG: "Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
- RePro: "Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection", ICLR, 2023 (Zhejiang University). [Paper][PyTorch (in construction)]
- Saliency Prediction:
- STSANet: "Spatio-Temporal Self-Attention Network for Video Saliency Prediction", arXiv, 2021 (Shanghai University). [Paper]
- UFO: "A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection", arXiv, 2022 (South China University of Technology). [Paper][PyTorch]
- DMT: "Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection", CVPR, 2023 (Northwestern Polytechnical University). [Paper][PyTorch]
- CASP-Net: "CASP-Net: Rethinking Video Saliency Prediction from an Audio-VisualConsistency Perceptual Perspective", CVPR, 2023 (Northwestern Polytechnical University). [Paper]
- Video Inpainting Detection:
- FAST: "Frequency-Aware Spatiotemporal Transformers for Video Inpainting Detection", ICCV, 2021 (Tsinghua University). [Paper]
- Driver Activity:
- TransDARC: "TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
- ?: "Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers", arXiv, 2022 (Jericho High School, NY). [Paper]
- ViT-DD: "Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection", arXiv, 2022 (Purdue). [Paper][PyTorch (in construction)]
- Video Alignment:
- DGWT: "Dynamic Graph Warping Transformer for Video Alignment", BMVC, 2021 (University of New South Wales, Australia). [Paper]
- Sport-related:
- Skating-Mixer: "Skating-Mixer: Multimodal MLP for Scoring Figure Skating", arXiv, 2022 (Southern University of Science and Technology). [Paper]
- Action Counting:
- Action Quality Assessment:
- Human Interaction:
- IGFormer: "IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition", ECCV, 2022 (The University of Melbourne). [Paper]
- Cross-Domain:
- UDAVT: "Unsupervised Domain Adaptation for Video Transformers in Action Recognition", ICPR, 2022 (University of Trento). [Paper][Code (in construction)]
- AutoLabel: "AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation", CVPR, 2023 (University of Trento). [Paper][PyTorch]
- DALL-V: "The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation", ICCV, 2023 (University of Trento). [Paper][PyTorch]
- Multi-Camera Editing:
- TC-Transformer: "Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows", ECCVW, 2022 (CUHK). [Paper]
- Instructional/Procedural Video:
- ProcedureVRL: "Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations", CVPR, 2023 (Meta). [Paper]
- Paprika: "Procedure-Aware Pretraining for Instructional Video Understanding", CVPR, 2023 (Salesforce). [Paper][PyTorch]
- StepFormer: "StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos", CVPR, 2023 (Samsung). [Paper]
- E3P: "Event-Guided Procedure Planning from Instructional Videos with Text Supervision", ICCV, 2023 (Sun Yat-sen University). [Paper]
- VLaMP: "Pretrained Language Models as Visual Planners for Human Assistance", ICCV, 2023 (Meta). [Paper]
- VINA: "Learning to Ground Instructional Articles in Videos through Narrations", ICCV, 2023 (Meta). [Paper][Website]
- PREGO: "PREGO: online mistake detection in PRocedural EGOcentric videos", CVPR, 2024 (Sapienza University of Rome, Italy). [Paper][Code (in construction)]
- Continual Learning:
- PIVOT: "PIVOT: Prompting for Video Continual Learning", CVPR, 2023 (KAUST). [Paper]
- 3D:
- Audio-Video:
- AVGN: "Audio-Visual Glance Network for Efficient Video Recognition", ICCV, 2023 (KAIST). [Paper]
- Event Camera:
- Long Video:
- EgoSchema: "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding", NeurIPS, 2023 (Berkeley). [Paper][PyTorch][Website]
- KTS: "Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding", arXiv, 2023 (Meta). [Paper]
- TCR: "Text-Conditioned Resampler For Long Form Video Understanding", arXiv, 2023 (Google). [Paper]
- MC-ViT: "Memory Consolidation Enables Long-Context Video Understanding", arXiv, 2024 (DeepMind). [Paper]
- VideoAgent: "VideoAgent: Long-form Video Understanding with Large Language Model as Agent", arXiv, 2024 (Stanford). [Paper]
- Video Story:
- Analysis:
References
- Online Resources:
- Papers with Code
- Transformer tutorial (Lucas Beyer)
- CS25: Transformers United (Course @ Stanford)
- The Annotated Transformer (Blog)
- 3D Vision with Transformers (GitHub)
- Networks Beyond Attention (GitHub)
- Practical Introduction to Transformers (GitHub)
- Awesome Transformer Architecture Search (GitHub)
- Transformer-in-Vision (GitHub)
- Awesome Visual-Transformer (GitHub)
- Awesome Transformer for Vision Resources List (GitHub)
- Transformer-in-Computer-Vision (GitHub)
- Transformer Tutorial (ICASSP 2022)