Implementing Stand-Alone Self-Attention in Vision Models using PyTorch (13 Jun 2019)

Method

  1. Replacing Spatial Convolutions
     - This work applies the transform to the ResNet family of architectures, swapping every 3 × 3 spatial convolution with a self-attention layer as defined in Equation 3.
     - Whenever spatial downsampling is required, a 2 × 2 average pooling operation with stride 2 follows the attention layer.
  2. Replacing the Convolutional Stem
     - The initial layers of a CNN, sometimes referred to as the stem, play a critical role in learning local features such as edges, which later layers use to identify global objects.
     - The stem performs self-attention within each 4 × 4 spatial block of the original image, followed by batch normalization and a 4 × 4 max pool operation.
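The convolution swap above can be sketched as a local self-attention layer: each pixel produces a query, and attention weights are computed over the keys in its k × k neighborhood (Equation 3 in the paper). This is a minimal single-head sketch that omits the multi-head grouping and relative position embeddings the paper uses; the class name and layer sizes are illustrative, not the repository's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    """Single-head local self-attention, a drop-in sketch for a
    3x3 spatial convolution (no relative position embeddings)."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        self.padding = kernel_size // 2
        # 1x1 convolutions produce per-pixel queries, keys, and values.
        self.query = nn.Conv2d(in_channels, out_channels, 1)
        self.key = nn.Conv2d(in_channels, out_channels, 1)
        self.value = nn.Conv2d(in_channels, out_channels, 1)

    def forward(self, x):
        b, _, h, w = x.shape
        q = self.query(x)                                        # (b, c, h, w)
        k = self.key(x)
        v = self.value(x)
        c = q.shape[1]
        # Gather the k x k neighborhood of every pixel for keys and values.
        k = F.unfold(k, self.kernel_size, padding=self.padding)  # (b, c*k*k, h*w)
        v = F.unfold(v, self.kernel_size, padding=self.padding)
        k = k.view(b, c, self.kernel_size ** 2, h * w)
        v = v.view(b, c, self.kernel_size ** 2, h * w)
        q = q.view(b, c, 1, h * w)
        # Softmax over each pixel's neighborhood, then weight the values.
        attn = F.softmax((q * k).sum(dim=1, keepdim=True), dim=2)  # (b, 1, k*k, h*w)
        out = (attn * v).sum(dim=2)                                # (b, c, h*w)
        return out.view(b, c, h, w)
```

Because the attention weights are computed per pixel over a fixed-size window, the layer keeps the spatial resolution of its input; the 2 × 2 average pooling for downsampling is applied separately, after this layer.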

Experiments

Setup

| Dataset  | Model                        | Accuracy | Parameters (My Model, Paper Model) |
|----------|------------------------------|----------|------------------------------------|
| CIFAR-10 | ResNet 26                    | 90.94%   | 8.30M, -                           |
| CIFAR-10 | Naive ResNet 26              | 94.29%   | 8.74M                              |
| CIFAR-10 | ResNet 26 + stem             | 90.22%   | 8.30M, -                           |
| CIFAR-10 | ResNet 38 (WORK IN PROGRESS) | 89.46%   | 12.1M, -                           |
| CIFAR-10 | Naive ResNet 38              | 94.93%   | 15.0M                              |
| CIFAR-10 | ResNet 50 (WORK IN PROGRESS) |          | 16.0M, -                           |
| IMAGENET | ResNet 26 (WORK IN PROGRESS) |          | 10.3M, 10.3M                       |
| IMAGENET | ResNet 38 (WORK IN PROGRESS) |          | 14.1M, 14.1M                       |
| IMAGENET | ResNet 50 (WORK IN PROGRESS) |          | 18.0M, 18.0M                       |

Usage

Requirements

Todo

Reference