Home

Awesome

Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?

Official PyTorch implementation and pretrained models for ICCV 2023 SimPool. [arXiv], [paper], [poster], [demo]

<div align="center"> <img width="100%" alt="SimPool illustration" src=".github/overview.png"> </div>

Overview

Motivation

<div align="center"> <img width="100%" alt="SimPool illustration" src=".github/cnn_vit.png"> </div> <br> In this work, we develop a *generic pooling framework* and then formulate a number of existing methods as instantiations of it. By discussing the properties of each group of methods, we derive *SimPool*, a simple attention-based pooling mechanism that replaces the default one in both convolutional and transformer encoders. We find that, whether *supervised* or *self-supervised*, this improves performance on pre-training and downstream tasks and provides attention maps *delineating object boundaries* in all cases. One could thus call SimPool *universal*. To our knowledge, we are the first to obtain attention maps in supervised transformers of at least as good quality as self-supervised ones, without explicit losses or modifying the architecture.

Approach

We introduce SimPool, a simple attention-based pooling method applied at the end of the network, which yields clean attention maps under supervision or self-supervision, for both convolutional and transformer encoders.

<div align="center"> <img width="100%" alt="SimPool attention maps" src=".github/attmaps.png"> </div> Note that when using SimPool with Vision Transformers, the [CLS] token is completely discarded. <br> <div align="center"> <img width="100%" alt="SimPool attention maps" src=".github/cnn_attmaps.png"> </div> <br>
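To make the idea concrete, below is a minimal, hedged sketch of attention-based pooling in the spirit of SimPool: the global average of the patch tokens acts as a query that attends over all tokens, and the attention-weighted sum becomes the pooled representation. This is an illustrative simplification, not the official `sp.SimPool` implementation (which additionally supports the `gamma` option and convolutional feature maps):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Simplified attention-based pooling sketch (not the official sp.SimPool)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.q = nn.Linear(dim, dim)   # query projection of the pooled vector
        self.k = nn.Linear(dim, dim)   # key projection of the patch tokens
        self.scale = dim ** -0.5

    def forward(self, x):                                   # x: (B, N, d) patch tokens
        q = self.q(self.norm(x.mean(dim=1, keepdim=True)))  # GAP as initial query: (B, 1, d)
        k = self.k(self.norm(x))                            # (B, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # attention over tokens: (B, 1, N)
        attn = attn.softmax(dim=-1)                         # this map is what gets visualized
        return (attn @ x).squeeze(1)                        # weighted sum of tokens: (B, d)

x = torch.randn(2, 196, 384)       # e.g. 14x14 patch tokens from a ViT-S
pooled = AttentionPool(384)(x)
print(pooled.shape)                # torch.Size([2, 384])
```

The `(B, 1, N)` attention map is exactly what is rendered as a heatmap in the figures above.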

:loudspeaker: NOTE: Considering integrating SimPool into your workflow?
Use SimPool when you need high-quality attention maps that delineate object boundaries, or simply as a drop-in alternative to your default pooling mechanism. It's super easy to try!

SimPool Attention Map Visualizer 🌌

Check out the SimPool interactive [demo] for attention map visualization:

Demo of SimPool Attention Map Visualizer

Integration

SimPool is plug-and-play by design.

To integrate SimPool into any architecture (convolutional network or transformer) or any setting (supervised, self-supervised, etc.), follow the steps below:

1. Initialization (__init__ method):

from sp import SimPool

# this part goes into your model's __init__()
self.simpool = SimPool(dim, gamma=None) # dim is depth (channels)

:exclamation: NOTE: Remember to adapt the value of gamma to the architecture, e.g. gamma=2.0 for convolutional networks. The snippet above shows the naive case without gamma.

2. Model Forward Pass (forward method):

Assuming the input tensor x has shape (B, d, H, W) for convolutional networks or (B, N, d) for transformers, where:

B = batch size, d = depth (channels), H = height of the feature map, W = width of the feature map, N = number of patch tokens

# this part goes into your model's forward()
cls = self.simpool(x) # (B, d)

:exclamation: NOTE: Remember to integrate the above code snippets into the appropriate locations in your model definition.
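Putting both steps together, here is a hedged, self-contained sketch of where the two snippets land in a model definition. `SimPoolStub` below is a hypothetical stand-in (plain average pooling) so the example runs without the repository; in practice you would import the real `SimPool` from `sp.py`:

```python
import torch
import torch.nn as nn

class SimPoolStub(nn.Module):
    """Placeholder with the same interface as SimPool; averages tokens instead of attending."""
    def __init__(self, dim, gamma=None):
        super().__init__()
        self.dim = dim

    def forward(self, x):                      # x: (B, d, H, W) or (B, N, d)
        if x.dim() == 4:                       # flatten conv feature maps into tokens
            x = x.flatten(2).transpose(1, 2)   # (B, d, H, W) -> (B, H*W, d)
        return x.mean(dim=1)                   # (B, d); the real SimPool attends here

class TinyModel(nn.Module):
    def __init__(self, dim=64, num_classes=10):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=7, stride=4)  # toy feature extractor
        self.simpool = SimPoolStub(dim, gamma=None)                 # step 1: __init__
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.backbone(x)                   # (B, d, H, W)
        cls = self.simpool(x)                  # step 2: forward -> (B, d)
        return self.head(cls)

logits = TinyModel()(torch.randn(2, 3, 64, 64))
print(logits.shape)                            # torch.Size([2, 10])
```

Swapping `SimPoolStub` for the real `SimPool(dim, gamma=...)` is the only change needed; the rest of the model definition stays the same.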

Experiments

We provide experiments on ImageNet in both supervised and self-supervised learning. Have a look at the respective folders for pre-trained models, reproduction recipes, etc.

Preliminaries

We use two separate Anaconda environments, both based on PyTorch. For both, you will first need to download ImageNet.

Self-supervised learning environment

Create this environment for self-supervised learning experiments.

conda create -n simpoolself python=3.8 -y
conda activate simpoolself
pip3 install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip3 install timm==0.3.2 tensorboardX six

Supervised learning environment

Create this environment for supervised learning experiments.

conda create -n simpoolsuper python=3.9 -y
conda activate simpoolsuper
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip3 install pyyaml

Acknowledgement

This repository is built using Attmask, DINO, ConvNeXt, DETR, timm and Metrix repositories.

NTUA thanks NVIDIA for the support with the donation of GPU hardware. Bill thanks IARAI for the hardware support.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Citation

If you find this repository useful, please consider giving a star 🌟 and citation:

@InProceedings{psomas2023simpool,
    author    = {Psomas, Bill and Kakogeorgiou, Ioannis and Karantzalos, Konstantinos and Avrithis, Yannis},
    title     = {Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {5350-5360}
}