# Awesome-Attention-Mechanism-in-cv

## Table of Contents

- [Introduction](#introduction)
- [Attention Mechanism](#attention-mechanism)
- [Dynamic Networks](#dynamic-networks)
- [Plug and Play Module](#plug-and-play-module)
- [Vision Transformer](#vision-transformer)
- [Contributing](#contributing)

## Introduction

This is a curated list of attention mechanisms used in computer vision, together with a collection of plug-and-play modules. Given limited time and resources, some modules may be missing; suggestions and improvements via an issue or PR are welcome.

## Attention Mechanism

| Paper | Publish | Link | Blog |
| --- | --- | --- | --- |
| Squeeze and Excitation Network | CVPR18 | SENet | zhihu |
| Global Second-order Pooling Convolutional Networks | CVPR19 | GSoPNet | |
| Neural Architecture Search for Lightweight Non-Local Networks | CVPR20 | AutoNL | |
| Selective Kernel Network | CVPR19 | SKNet | zhihu |
| Convolutional Block Attention Module | ECCV18 | CBAM | zhihu |
| BottleNeck Attention Module | BMVC18 | BAM | zhihu |
| Concurrent Spatial and Channel 'Squeeze & Excitation' in Fully Convolutional Networks | MICCAI18 | scSE | zhihu |
| Non-local Neural Networks | CVPR18 | Non-Local (NL) | zhihu |
| GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | ICCVW19 | GCNet | zhihu |
| CCNet: Criss-Cross Attention for Semantic Segmentation | ICCV19 | CCNet | |
| SA-Net: Shuffle Attention for Deep Convolutional Neural Networks | ICASSP21 | SANet | zhihu |
| ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks | CVPR20 | ECANet | |
| Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks | CoRR19 | SGENet | |
| FcaNet: Frequency Channel Attention Networks | ICCV21 | FcaNet | |
| $A^2$-Nets: Double Attention Networks | NeurIPS18 | DANet | |
| Asymmetric Non-local Neural Networks for Semantic Segmentation | ICCV19 | APNB | |
| Efficient Attention: Attention with Linear Complexities | CoRR18 | EfficientAttention | |
| Image Restoration via Residual Non-local Attention Networks | ICLR19 | RNAN | |
| Exploring Self-attention for Image Recognition | CVPR20 | SAN | |
| An Empirical Study of Spatial Attention Mechanisms in Deep Networks | ICCV19 | None | |
| Object-Contextual Representations for Semantic Segmentation | ECCV20 | OCRNet | |
| IAUnet: Global Context-Aware Feature Learning for Person Re-Identification | TNNLS20 | IAUNet | |
| ResNeSt: Split-Attention Networks | CoRR20 | ResNeSt | |
| Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks | NeurIPS18 | GENet | |
| Improving Convolutional Networks with Self-calibrated Convolutions | CVPR20 | SCNet | |
| Rotate to Attend: Convolutional Triplet Attention Module | WACV21 | TripletAttention | |
| Dual Attention Network for Scene Segmentation | CVPR19 | DANet | |
| Relation-Aware Global Attention for Person Re-identification | CVPR20 | RGANet | |
| Attentional Feature Fusion | WACV21 | AFF | |
| An Attentive Survey of Attention Models | CoRR19 | None | |
| Stand-Alone Self-Attention in Vision Models | NeurIPS19 | FullAttention | |
| BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation | ECCV18 | BiSeNet | zhihu |
| DCANet: Learning Connected Attentions for Convolutional Neural Networks | CoRR20 | DCANet | |
| Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition | CVPR17 Oral | RA-CNN | |
| Guided Attention Network for Object Detection and Counting on Drones | ACM MM20 | GANet | |
| Attention Augmented Convolutional Networks | ICCV19 | AANet | |
| Global Self-Attention Networks for Image Recognition | ICLR21 | GSA | |
| Attention-Guided Hierarchical Structure Aggregation for Image Matting | CVPR20 | HAttMatting | |
| Weight Excitation: Built-in Attention Mechanisms in Convolutional Neural Networks | ECCV20 | None | |
| Expectation-Maximization Attention Networks for Semantic Segmentation | ICCV19 Oral | EMANet | |
| Dense-and-Implicit Attention Network | AAAI20 | DIANet | |
| Coordinate Attention for Efficient Mobile Network Design | CVPR21 | CoordAttention | |
| Cross-channel Communication Networks | NeurIPS19 | C3Net | |
| Gated Convolutional Networks with Hybrid Connectivity for Image Classification | AAAI20 | HCGNet | |
| Weighted Channel Dropout for Regularization of Deep Convolutional Neural Network | AAAI19 | None | |
| $BA^2M$: A Batch Aware Attention Module for Image Classification | CVPR21 | None | |
| EPSANet: An Efficient Pyramid Split Attention Block on Convolutional Neural Network | CoRR21 | EPSANet | |
| Stand-Alone Self-Attention in Vision Models | NeurIPS19 | SASA | |
| ResT: An Efficient Transformer for Visual Recognition | CoRR21 | ResT | |
| SPANet: Spatial Pyramid Attention Network for Enhanced Image Recognition | ICME20 | SPANet | |
| Space-time Mixing Attention for Video Transformer | CoRR21 | None | |
| DMSANet: Dual Multi Scale Attention Network | CoRR21 | None | |
| CompConv: A Compact Convolution Module for Efficient Feature Learning | CoRR21 | None | |
| VOLO: Vision Outlooker for Visual Recognition | CoRR21 | VOLO | |
| Interflow: Aggregating Multi-layer Feature Mappings with Attention Mechanism | CoRR21 | None | |
| MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning | CoRR21 | None | |
| Polarized Self-Attention: Towards High-quality Pixel-wise Regression | CoRR21 | PSA | |
| CA-Net: Comprehensive Attention Convolutional Neural Networks for Explainable Medical Image Segmentation | TMI21 | CA-Net | |
| BAM: A Lightweight and Efficient Balanced Attention Mechanism for Single Image Super Resolution | CoRR21 | BAM | |
| Attention as Activation | CoRR21 | ATAC | |
| Region-based Non-local Operation for Video Classification | CoRR21 | RNL | |
| MSAF: Multimodal Split Attention Fusion | CoRR21 | MSAF | |
| All-Attention Layer | CoRR19 | None | |
| Compact Global Descriptor | CoRR20 | CGD | |
| SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks | ICML21 | SimAM | |
| Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks With Octave Convolution | ICCV19 | OctConv | |
| Contextual Transformer Networks for Visual Recognition | ICCV21 | CoTNet | |
| Residual Attention: A Simple but Effective Method for Multi-Label Recognition | ICCV21 | CSRA | |
| Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation | CVPR20 | SEAM | |
| An Attention Module for Convolutional Neural Networks | ICCV21 | AW-Conv | |
| Attentive Normalization | CoRR20 | None | |
| Person Re-identification via Attention Pyramid | TIP21 | APNet | |
| Unifying Nonlocal Blocks for Neural Networks | ICCV21 | SNL | |
| Tiled Squeeze-and-Excite: Channel Attention With Local Spatial Context | ICCVW21 | None | |
| PP-NAS: Searching for Plug-and-Play Blocks on Convolutional Neural Network | ICCVW21 | PP-NAS | |
| Distilling Knowledge via Knowledge Review | CVPR21 | ReviewKD | |
| Dynamic Region-Aware Convolution | CVPR21 | None | |
| Encoder Fusion Network With Co-Attention Embedding for Referring Image Segmentation | CVPR21 | None | |
| Introvert: Human Trajectory Prediction via Conditional 3D Attention | CVPR21 | None | |
| SSAN: Separable Self-Attention Network for Video Representation Learning | CVPR21 | None | |
| Delving Deep into Many-to-many Attention for Few-shot Video Object Segmentation | CVPR21 | DANet | |
| $A^2$-FPN: Attention Aggregation Based Feature Pyramid Network for Instance Segmentation | CVPR21 | None | |
| Image Super-Resolution with Non-Local Sparse Attention | CVPR21 | None | |
| Keep Your Eyes on the Lane: Real-time Attention-guided Lane Detection | CVPR21 | LaneATT | |
| NAM: Normalization-based Attention Module | CoRR21 | NAM | |
| NAS-SCAM: Neural Architecture Search-Based Spatial and Channel Joint Attention Module for Nuclei Semantic Segmentation and Classification | MICCAI20 | NAS-SCAM | |
| NASABN: A Neural Architecture Search Framework for Attention-Based Networks | IJCNN20 | None | |
| Att-DARTS: Differentiable Neural Architecture Search for Attention | IJCNN20 | Att-Darts | |
| On the Integration of Self-Attention and Convolution | CoRR21 | ACMix | |
| BoxeR: Box-Attention for 2D and 3D Transformers | CoRR21 | None | |
| CoAtNet: Marrying Convolution and Attention for All Data Sizes | NeurIPS21 | coatnet | |
| Pay Attention to MLPs | NeurIPS21 | gmlp | |
| IC-Conv: Inception Convolution With Efficient Dilation Search | CVPR21 Oral | IC-Conv | |
| SRM: A Style-based Recalibration Module for Convolutional Neural Networks | ICCV19 | SRM | |
| Competitive Inner-Imaging Squeeze and Excitation for Residual Network | CoRR18 | Competitive-SENet | |
| ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks | WACV20 | ULSAM | |
| Augmenting Convolutional Networks with Attention-based Aggregation | CoRR21 | None | |
| Context-aware Attentional Pooling (CAP) for Fine-grained Visual Classification | AAAI21 | CAP | |
| Instance Enhancement Batch Normalization: An Adaptive Regulator of Batch Noise | AAAI20 | IEBN | |
| ASR: Attention-alike Structural Re-parameterization | CoRR23 | None | |
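Many of the channel-attention modules above (SENet, ECA-Net, SRM, and others) share the same squeeze-then-excite pattern: pool each channel to a scalar, pass the pooled vector through a small gating network, and rescale the channels. A minimal NumPy sketch of that pattern; the weight shapes and reduction ratio `r` here are illustrative, not taken from any particular paper:

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation-style channel attention:
    squeeze (global average pool), excite (bottleneck MLP),
    then rescale each channel by a sigmoid gate."""
    z = x.mean(axis=(1, 2))                  # squeeze: (C,)
    h = np.maximum(w1 @ z, 0.0)              # bottleneck + ReLU: (C//r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))      # sigmoid gate in (0, 1): (C,)
    return x * s[:, None, None]              # channel-wise rescaling

rng = np.random.default_rng(0)
C, r = 8, 2                                  # channels, reduction ratio
x = rng.standard_normal((C, 4, 4))           # one (C, H, W) feature map
w1 = rng.standard_normal((C // r, C)) * 0.1  # squeeze FC weights
w2 = rng.standard_normal((C, C // r)) * 0.1  # excite FC weights
y = se_block(x, w1, w2)
```

Because the gate is a single scalar per channel, the block adds only `2·C²/r` parameters and leaves the spatial layout of `x` untouched, which is what makes these modules easy to drop into an existing backbone.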

## Dynamic Networks

| Title | Publish | Github |
| --- | --- | --- |
| Dynamic Neural Networks: A Survey | CoRR21 | None |
| CondConv: Conditionally Parameterized Convolutions for Efficient Inference | NeurIPS19 | CondConv |
| DyNet: Dynamic Convolution for Accelerating Convolutional Neural Networks | CoRR20 | None |
| Dynamic Convolution: Attention over Convolution Kernels | CVPR20 | Dynamic-convolution-Pytorch |
| WeightNet: Revisiting the Design Space of Weight Network | ECCV20 | weightNet |
| Dynamic Filter Networks | NeurIPS16 | None |
| Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-offs by Selective Execution | AAAI17 | None |
| SkipNet: Learning Dynamic Routing in Convolutional Networks | ECCV18 | SkipNet |
| Pay Less Attention with Lightweight and Dynamic Convolutions | ICLR19 | fairseq |
| Unified Dynamic Convolutional Network for Super-Resolution with Variational Degradations | CVPR20 | None |
| Dynamic Group Convolution for Accelerating Convolutional Neural Networks | ECCV20 | dgc |
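The common thread in the dynamic-convolution papers above (CondConv, DyNet, Dynamic Convolution) is that the kernel is no longer fixed: a routing function computed from the input mixes several expert kernels into one per-example kernel. A minimal NumPy sketch of that idea for a 1x1 convolution; the function and variable names are illustrative only:

```python
import numpy as np

def dynamic_conv1x1(x, experts, w_route):
    """Dynamic 1x1 convolution: routing weights computed from the
    input mix K expert kernels into a single per-example kernel."""
    z = x.mean(axis=(1, 2))                     # global context: (Cin,)
    logits = w_route @ z                        # routing logits: (K,)
    e = np.exp(logits - logits.max())
    alpha = e / e.sum()                         # softmax over experts: (K,)
    w = np.tensordot(alpha, experts, axes=1)    # mixed kernel: (Cout, Cin)
    return np.tensordot(w, x, axes=([1], [0]))  # 1x1 conv: (Cout, H, W)

rng = np.random.default_rng(1)
K, Cin, Cout = 4, 3, 6                          # experts, in/out channels
x = rng.standard_normal((Cin, 5, 5))
experts = rng.standard_normal((K, Cout, Cin))
w_route = rng.standard_normal((K, Cin))
y = dynamic_conv1x1(x, experts, w_route)
```

The key efficiency argument is that mixing kernels first (a `K`-way weighted sum) and convolving once costs far less than running `K` separate convolutions and mixing their outputs, while being mathematically equivalent for linear convolutions.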

## Plug and Play Module

| Title | Publish | Github |
| --- | --- | --- |
| ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks | ICCV19 | ACNet |
| DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs | TPAMI18 | ASPP |
| MixConv: Mixed Depthwise Convolutional Kernels | BMVC19 | MixedConv |
| Pyramid Scene Parsing Network | CVPR17 | PSP |
| Receptive Field Block Net for Accurate and Fast Object Detection | ECCV18 | RFB |
| Strip Pooling: Rethinking Spatial Pooling for Scene Parsing | CVPR20 | SPNet |
| SSH: Single Stage Headless Face Detector | ICCV17 | SSH |
| GhostNet: More Features from Cheap Operations | CVPR20 | GhostNet |
| SlimConv: Reducing Channel Redundancy in Convolutional Neural Networks by Weights Flipping | TIP21 | SlimConv |
| EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | ICML19 | EfficientNet |
| CondConv: Conditionally Parameterized Convolutions for Efficient Inference | NeurIPS19 | CondConv |
| PP-NAS: Searching for Plug-and-Play Blocks on Convolutional Neural Network | ICCVW21 | PPNAS |
| Dynamic Convolution: Attention over Convolution Kernels | CVPR20 | DynamicConv |
| PSConv: Squeezing Feature Pyramid into One Compact Poly-Scale Convolutional Layer | ECCV20 | PSConv |
| DCANet: Dense Context-Aware Network for Semantic Segmentation | ECCV20 | DCANet |
| Enhancing Feature Fusion for Human Pose Estimation | MVA20 | SEB |
| Object-Contextual Representations for Semantic Segmentation | ECCV20 | HRNet-OCR |
| DO-Conv: Depthwise Over-parameterized Convolutional Layer | CoRR20 | DO-Conv |
| Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition | CoRR20 | PyConv |
| ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks | WACV20 | ULSAM |
| Dynamic Group Convolution for Accelerating Convolutional Neural Networks | ECCV20 | DGC |
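Several plug-and-play context modules above (PSP, ASPP, Strip Pooling) aggregate features at multiple scales and concatenate the results back onto the input. A minimal NumPy sketch of PSPNet-style pyramid pooling, assuming for simplicity that the spatial size is divisible by every bin size:

```python
import numpy as np

def pyramid_pool(x, bins=(1, 2, 4)):
    """PSPNet-style pyramid pooling: average-pool the feature map onto
    several coarse grids, upsample each grid back (nearest neighbour),
    and concatenate everything along the channel axis."""
    C, H, W = x.shape
    outs = [x]
    for b in bins:  # assumes H and W are divisible by each b
        # average-pool to a (b, b) grid
        p = x.reshape(C, b, H // b, b, W // b).mean(axis=(2, 4))
        # nearest-neighbour upsample back to (H, W)
        up = np.repeat(np.repeat(p, H // b, axis=1), W // b, axis=2)
        outs.append(up)
    return np.concatenate(outs, axis=0)  # (C * (1 + len(bins)), H, W)

x = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)
out = pyramid_pool(x)
```

The 1x1 bin injects a global prior (each channel's mean broadcast everywhere), while the finer bins preserve progressively more spatial detail; a real implementation would follow each pooled branch with a learned 1x1 convolution before concatenation.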

## Vision Transformer

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR21, ViT ([paper] [Github])

| Title | Publish | Github |
| --- | --- | --- |
| Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ICCV21 | SwinT |
| CPVT: Conditional Positional Encodings for Vision Transformer | CoRR21 | CPVT |
| GLiT: Neural Architecture Search for Global and Local Image Transformer | CoRR21 | GLiT |
| ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases | CoRR21 | ConViT |
| CeiT: Incorporating Convolution Designs into Visual Transformers | CoRR21 | CeiT |
| BoTNet: Bottleneck Transformers for Visual Recognition | CVPR21 | BoTNet |
| CvT: Introducing Convolutions to Vision Transformers | ICCV21 | CvT |
| TransCNN: Transformer in Convolutional Neural Networks | CoRR21 | TransCNN |
| ResT: An Efficient Transformer for Visual Recognition | CoRR21 | ResT |
| CoaT: Co-Scale Conv-Attentional Image Transformers | CoRR21 | CoaT |
| ConTNet: Why not use convolution and transformer at the same time? | CoRR21 | ConTNet |
| DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification | NeurIPS21 | DynamicViT |
| DVT: Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition | NeurIPS21 | DVT |
| CoAtNet: Marrying Convolution and Attention for All Data Sizes | CoRR21 | CoAtNet |
| Early Convolutions Help Transformers See Better | CoRR21 | None |
| Compact Transformers: Escaping the Big Data Paradigm with Compact Transformers | CoRR21 | CCT |
| MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer | CoRR21 | MobileViT |
| LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference | CoRR21 | LeViT |
| Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer | CoRR21 | ShuffleTransformer |
| ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias | CoRR21 | ViTAE |
| LocalViT: Bringing Locality to Vision Transformers | CoRR21 | LocalViT |
| DeiT: Training data-efficient image transformers & distillation through attention | ICML21 | DeiT |
| CaiT: Going deeper with Image Transformers | ICCV21 | CaiT |
| Efficient Training of Visual Transformers with Small-Size Datasets | NeurIPS21 | None |
| Vision Transformer with Deformable Attention | CoRR22 | DAT |
| MaxViT: Multi-Axis Vision Transformer | CoRR22 | None |
| Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition | CoRR22 | Conv2Former |
| Rethinking Mobile Block for Efficient Neural Models | CoRR23 | EMO |
| Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning | ECCV22 | Wave-ViT |
| Dual Vision Transformer | CoRR23 | Dual-ViT |
| CoTNet: Contextual Transformer Networks for Visual Recognition | TPAMI22 | CoTNet |
| ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders | CoRR23 | ConvNeXt-V2 |
| A Close Look at Spatial Modeling: From Attention to Convolution | CoRR22 | FCViT |
| Scalable Diffusion Models with Transformers | ICCV23 | DiT |
| Dynamic Grained Encoder for Vision Transformers | NeurIPS21 | vtpack |
| Segment Anything | CoRR23 | SAM |
| Improved Robustness of Vision Transformers via PreLayerNorm in Patch Embedding | PR23 | None |
For several of the models above, the main idea in brief:

| Title | Publish | Github | Main Idea |
| --- | --- | --- | --- |
| GLiT: Neural Architecture Search for Global and Local Image Transformer | CoRR21 | GLiT | NAS |
| ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases | CoRR21 | ConViT | GPSA |
| CeiT: Incorporating Convolution Designs into Visual Transformers | CoRR21 | CeiT | LCA, LeFF |
| BoTNet: Bottleneck Transformers for Visual Recognition | CVPR21 | BoTNet | Non-local-block-like self-attention |
| CvT: Introducing Convolutions to Vision Transformers | ICCV21 | CvT | Convolutional projection |
| Vision Transformer with Deformable Attention | CoRR22 | DAT | DeformConv + self-attention |
| MaxViT: Multi-Axis Vision Transformer | CoRR22 | None | Dilated attention |
| Demystify Transformers & Convolutions in Modern Image Deep Networks | CoRR22 | STM-Evaluation | Unified evaluation by Jifeng Dai et al. |
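All of the vision transformers listed above are built around the same core operation: scaled dot-product self-attention over a sequence of patch tokens. A minimal single-head NumPy sketch, with token count and dimensions chosen only for illustration:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over patch tokens:
    project tokens to queries/keys/values, softmax the similarity
    scores row-wise, and take the attention-weighted sum of values."""
    q, k, v = x @ wq, x @ wk, x @ wv         # projections: (N, d) each
    scores = q @ k.T / np.sqrt(k.shape[-1])  # token-to-token similarities
    scores -= scores.max(axis=-1, keepdims=True)
    a = np.exp(scores)
    a /= a.sum(axis=-1, keepdims=True)       # row-wise softmax: (N, N)
    return a @ v                             # attention-weighted values

rng = np.random.default_rng(2)
N, d = 16, 8                                 # e.g. 16 patch tokens of dim 8
x = rng.standard_normal((N, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
y = self_attention(x, wq, wk, wv)
```

The `(N, N)` attention matrix is exactly what the efficient variants in the tables attack: Swin restricts it to shifted windows, DynamicViT prunes tokens, and DAT makes the sampled keys deformable.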

## Contributing

If you know of other awesome attention mechanisms or plug-and-play modules in computer vision, please add them via a PR or an issue.

Additional papers and corresponding code links are also welcome in the issues.

Thanks to @dedekinds for pointing out the problem in the DIANet description.