MyDLNote - Attention: [2020 CVPR] Exploring Self-attention for Image Recognition


[PAPER] Exploring Self-attention for Image Recognition

[CODE] https://github.com/hszhao/SAN


Contents

MyDLNote - Attention: [2020 CVPR] Exploring Self-attention for Image Recognition

Abstract

Introduction

Related Work

Self-attention Networks

Pairwise Self-attention

Patchwise Self-attention

Self-attention Block

Official Code

Comparison


Abstract

Recent work has shown that self-attention can serve as a basic building block for image recognition models. We explore variations of self-attention and assess their effectiveness for image recognition. We consider two forms of self-attention. One is pairwise self-attention, which generalizes standard dot-product attention and is fundamentally a set operator. The other is patchwise self-attention, which is strictly more powerful than convolution. Our pairwise self-attention networks match or outperform their convolutional counterparts, and the patchwise models substantially outperform the convolutional baselines. We also conduct experiments that probe the robustness of learned representations and conclude that self-attention networks may have significant benefits in terms of robustness and generalization.

 


Introduction

Here the story begins:

Convolutional networks have revolutionized computer vision. Thirty years ago, they were applied successfully to recognizing handwritten digits [19]. Building directly on this work, convolutional networks were scaled up in 2012 to achieve breakthrough accuracy on the ImageNet dataset, outperforming all prior methods by a large margin and launching the deep learning era in computer vision [18, 29]. Subsequent architectural improvements yielded successively larger and more accurate convolutional networks for image recognition, including GoogLeNet [31], VGG [30], ResNet [12], DenseNet [16], and squeeze-and-excitation [15]. These architectures in turn serve as templates for applications in computer vision and beyond.

All these networks, from LeNet [19] onwards, are based fundamentally on the discrete convolution. The discrete convolution operator ∗ can be defined as follows:

(F \ast k)(p) = \sum_{s+t=p} F(s)\, k(t).              (1)

Here F is a discrete function and k is a discrete filter. A key characteristic of the convolution is its translation invariance: the same filter k is applied across the image F. While the convolution has undoubtedly been effective as the basic operator in modern image recognition, it is not without drawbacks. For example, the convolution lacks rotation invariance. The number of parameters that must be learned grows with the footprint of the kernel k. And the stationarity of the filter can be seen as a drawback: the aggregation of information from a neighborhood cannot adapt to its content. Is it possible that networks based on the discrete convolution are a local optimum in the design space of image recognition models? Could other parts of the design space yield models with interesting new capabilities?
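
As a quick toy illustration (my own snippet, not from the paper), the following PyTorch lines apply one fixed kernel at every position of a feature map, which is exactly the stationarity that Eq. (1) describes. Note that PyTorch's conv2d actually computes a cross-correlation, which only changes the indexing convention in Eq. (1).

import torch
import torch.nn.functional as nnf

F_map = torch.randn(1, 1, 8, 8)   # a toy single-channel feature map F
k = torch.randn(1, 1, 3, 3)       # one fixed 3x3 kernel k

# (F * k)(p): the very same weights k are used at every position p,
# regardless of the content of the neighborhood being aggregated.
out = nnf.conv2d(F_map, k, padding=1)
print(out.shape)                  # torch.Size([1, 1, 8, 8])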

First two paragraphs: broad background, recounting the glorious history and the problems of traditional convolution (quite similar to the background section of my own PhD thesis).

Two problems of the traditional discrete convolution: it lacks rotation invariance, and the aggregation of information from a neighborhood cannot adapt to the content of that neighborhood.

Recent work has shown that self-attention may constitute a viable alternative for building image recognition models [13, 27]. The self-attention operator has been adopted from natural language processing, where it serves as the basis for powerful architectures that have displaced recurrent and convolutional models across a variety of tasks [33, 7, 6, 40]. The development of effective self-attention architectures in computer vision holds the exciting prospect of discovering models with different and perhaps complementary properties to convolutional networks.

Third paragraph: the traditional discrete convolution has problems, self-attention arrives on the scene and alleviates the drawbacks of convolution. A transition paragraph.

In this work, we explore variations of the self-attention operator and assess their effectiveness as the basic building block for image recognition models. We explore two types of self-attention. The first is pairwise self-attention, which generalizes the standard dot-product attention used in natural language processing [33]. Pairwise attention is compelling because, unlike the convolution, it is fundamentally a set operator, rather than a sequence operator. Unlike the convolution, it does not attach stationary weights to specific locations (s in equation (1)) and is invariant to permutation and cardinality. One consequence is that the footprint of a self-attention operator can be increased (e.g., from a 3×3 to a 7×7 patch) or even made irregular without any impact on the number of parameters. We present a number of variants of pairwise attention that have greater expressive power than dot-product attention while retaining these invariance properties. In particular, our weight computation does not collapse the channel dimension and allows the feature aggregation to adapt to each channel.

Fourth paragraph: since self-attention is so good, the work in this paper is built on it.

First contribution: pairwise self-attention.

Unlike convolution, it is fundamentally a set operator rather than a sequence operator;

Invariance to permutation and cardinality (roughly: the output does not depend on the order of the features inside the footprint, and the parameter count does not depend on how many features the footprint contains; I am not entirely clear on these concepts myself, so corrections from experts in the comments are welcome);

More expressive than dot-product attention;

The weight computation does not collapse the channel dimension and allows the feature aggregation to adapt to each channel. (I had recently been thinking about exactly this; unfortunately this paper has already solved it, so I am one step too late.)

 

Next, we explore a different class of operators, which we term patchwise self-attention. These operators, like the convolution, have the ability to uniquely identify specific locations within their footprint. They do not have the permutation or cardinality invariance of pairwise attention, but are strictly more powerful than convolution.

Fifth paragraph: second contribution: patchwise self-attention.

Like convolution, it can uniquely identify specific locations within its footprint;

It does not have the permutation or cardinality invariance of pairwise attention;

But it is strictly more powerful than convolution.

Our experiments indicate that both forms of self-attention are effective for building image recognition models. We construct self-attention networks that can be directly compared to convolutional ResNet models [12], and conduct experiments on the ImageNet dataset [29]. Our pairwise self-attention networks match or outperform their convolutional counterparts, with similar or lower parameter and FLOP budgets. Controlled experiments also indicate that our vectorial operators outperform standard scalar attention. Furthermore, our patchwise models substantially outperform the convolutional baselines. For example, our mid-sized SAN15 with patchwise attention outperforms the much larger ResNet50, with a 78% top-1 accuracy for SAN15 versus 76.9% for ResNet50, with a 37% lower parameter and FLOP count. Finally, we conduct experiments that probe the robustness of learned representations and conclude that self-attention networks may have significant benefits in terms of robustness and generalization.

Last paragraph: the conclusions of this little story:

Pairwise self-attention matches or outperforms its convolutional counterparts, with similar or lower parameter counts and FLOPs; the vectorial operators also surpass standard scalar attention;

Patchwise models substantially outperform the convolutional baselines;

Self-attention networks may have significant advantages in terms of robustness and generalization.

 


Related Work

Most closely related to our work are the recent results of Hu et al. [13] and Ramachandran et al. [27]. One of their key innovations is restricting the scope of self-attention to a local patch (for example, 7×7 pixels), in contrast to earlier constructions that applied self-attention globally over a whole feature map [35, 1]. Such local attention is key to limiting the memory and computation consumed by the model, facilitating successful application of self-attention throughout the network, including early high-resolution layers. Our work builds on these results and explores a broader variety of self-attention formulations. In particular, our primary self-attention mechanisms compute a vector attention that adapts to different channels, rather than a shared scalar weight. We also explore a family of patchwise attention operators that are structurally different from the forms used in [13, 27] and constitute strict generalizations of convolution. We show that all the presented forms of self-attention can be implemented at scale, with favorable parameter and FLOP budgets.

[13] Local relation networks for image recognition. In ICCV, 2019

[27] Stand-alone self-attention in vision models. In NeurIPS, 2019

One of the key innovations of [13] and [27] is restricting the scope of self-attention to a local patch (e.g., 7×7 pixels); earlier constructions applied self-attention globally over the whole feature map.

This sharply limits the memory and computation consumed by the model, so self-attention can be applied throughout the network, including the early high-resolution layers.

In particular, the primary self-attention mechanisms in this paper compute a vector attention that adapts to different channels, rather than a shared scalar weight.

The results show that all the presented forms of self-attention can be implemented at scale, with favorable parameter and FLOP budgets.

 


Self-attention Networks

In convolutional networks for image recognition, the layers of the network perform two functions. The first is feature aggregation, which the convolution operation performs by combining features from all locations tapped by the kernel. The second function is feature transformation, which is performed by successive linear mappings and nonlinear scalar functions: these successive mappings and nonlinear operations shatter the feature space and give rise to complex piecewise mappings.

In convolutional networks for image recognition, each layer performs two functions.

The first is feature aggregation: the convolution combines features from all the locations covered by the kernel.

The second is feature transformation, carried out by successive linear mappings and nonlinear scalar functions; these successive mappings and nonlinearities shatter the feature space and give rise to complex piecewise mappings.

One observation that underlies our construction is that these two functions – feature aggregation and feature transformation – can be decoupled. If we have a mechanism that performs feature aggregation, then feature transformation can be performed by perceptron layers that process each feature vector (for each pixel) separately. A perceptron layer consists of a linear mapping and a nonlinear scalar function: this pointwise operation performs feature transformation. Our construction therefore focuses on feature aggregation.

Feature aggregation and feature transformation can be decoupled.

If there is a mechanism that performs feature aggregation, then feature transformation can be handled by perceptron layers that process each feature vector (each pixel) separately. A perceptron layer consists of a linear mapping and a nonlinear scalar function: this pointwise operation performs the feature transformation.

The construction therefore focuses on feature aggregation.

The convolution operator performs feature aggregation by a fixed kernel that applies pretrained weights to linearly combine feature values from a set of nearby locations. The weights are fixed and do not adapt to the content of the features. And since each location must be processed with a dedicated weight vector, the number of parameters scales linearly with the number of aggregated features. We present a number of alternative aggregation schemes and construct high-performing image recognition architectures that interleave feature aggregation (via self-attention) and feature transformation (via elementwise perceptrons).

The paper presents a number of alternative aggregation schemes and builds high-performing image recognition architectures that interleave feature aggregation (via self-attention) with feature transformation (via element-wise perceptrons); a small sketch of such a pointwise perceptron follows.
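
A minimal sketch (my own illustration, not the authors' code) of a pointwise perceptron layer: a 1×1 convolution is a linear map applied to each pixel's feature vector independently, followed by a scalar nonlinearity, so it performs feature transformation without any aggregation.

import torch
import torch.nn as nn

class PointwisePerceptron(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # a 1x1 convolution is a linear mapping applied independently at every pixel
        self.linear = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (N, C, H, W)
        return self.act(self.linear(x))

y = PointwisePerceptron(64, 128)(torch.randn(2, 64, 32, 32))  # transformation only; aggregation is done elsewhere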

 

Pairwise Self-attention

We explore two types of self-attention. The first, which we refer to as pairwise, has the following form:

y_i = \sum_{j\in {R}(i)} \alpha(x_i , x_j ) \odot \beta(x_j ),                                 (2)

where \odot is the Hadamard product, i is the spatial index of feature vector x_i (i.e., its location in the feature map), and R(i) is the local footprint of the aggregation. The footprint R(i) is a set of indices that specifies which feature vectors are aggregated to construct the new feature y_i.

The function \beta produces the feature vectors \beta(x_j) that are aggregated by the adaptive weight vectors \alpha(x_i , x_j ). Possible instantiations of this function, along with feature transformation elements that surround self-attention operations in our architecture, are discussed later in this section.

In this formula, \alpha plays the role of the attention map, \beta corresponds to the input features (after a linear/convolutional mapping), and R(i) is analogous to the convolution window.

Taking classic self-attention as an example: with input X, \beta corresponds to V = W_v X and \alpha corresponds to Q \times K^T.

What is the Hadamard product? Simply element-wise multiplication.

 

The function \alpha computes the weights \alpha(x_i , x_j ) that are used to combine the transformed features \beta(x_j). To simplify exposition of different forms of self-attention, we decompose \alpha as follows:

\alpha(x_i , x_j ) = \gamma(\delta(x_i , x_j )).                                            (3)

The relation function \delta outputs a single vector that represents the features x_i and x_j.

The function \gamma then maps this vector into a vector that can be combined with \beta(x_j) as shown in Eq. 2. The function \gamma enables us to explore relations \delta that produce vectors of varying dimensionality that need not match the dimensionality of \beta(x_j). It also allows us to introduce additional trainable transformations into the construction of the weights \alpha(x_i , x_j ), making this construction more expressive. This function performs a linear mapping, followed by a nonlinearity, followed by another linear mapping; i.e., \gamma={Linear\rightarrow ReLU\rightarrow Linear}. The output dimensionality of \gamma does not need to match that of \beta as attention weights can be shared across a group of channels.

We explore multiple forms for the relation function \delta:

Summation: \delta(x_i, x_j) = \phi(x_i) + \psi(x_j)

Subtraction: \delta(x_i, x_j) = \phi(x_i) - \psi(x_j)

Concatenation: \delta(x_i, x_j) = [\phi(x_i), \psi(x_j)]

Hadamard product: \delta(x_i, x_j) = \phi(x_i) \odot \psi(x_j)

Dot product: \delta(x_i, x_j) = \phi(x_i)^T \psi(x_j)

Here \phi and \psi are trainable transformations such as linear mappings, and have matching output dimensionality. With summation, subtraction, and Hadamard product, the dimensionality of \delta(x_i , x_j ) is the same as the dimensionality of the transformation functions. With concatenation, the dimensionality of \delta(x_i , x_j ) will be doubled. With the dot product, the dimensionality of \delta(x_i , x_j ) is 1.
 

In this part, the proposed pairwise self-attention differs from conventional self-attention in two ways:

1) The dot-product part of conventional self-attention can be replaced by several different relation forms;

2) Instead of a sigmoid or softmax, the attention weights are produced by \gamma = \{Linear \rightarrow ReLU \rightarrow Linear\}.

The output dimensionality of each form is also analyzed. Note that the dot product yields a spatial attention value shared across channels; summation, subtraction, and the Hadamard product compute spatial attention per channel; concatenation doubles the channel dimensionality, which a subsequent linear mapping then fuses back (typically halving it again), so its spatial attention is likewise computed per channel rather than shared. A naive PyTorch sketch of the pairwise construction follows.
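
Below is a naive, unoptimized PyTorch sketch of the pairwise construction, assuming the subtraction relation and \gamma = Linear→ReLU→Linear. It omits the softmax normalization, the position encoding, and the channel-group sharing (share_planes) used in the official CUDA implementation; the layer names phi/psi/beta/gamma are simply my labels for the paper's symbols, and the channel sizes are arbitrary.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NaivePairwiseSA(nn.Module):
    """Eq. (2)-(3) with delta(x_i, x_j) = phi(x_i) - psi(x_j):
       y_i = sum_{j in R(i)} gamma(delta(x_i, x_j)) * beta(x_j)."""
    def __init__(self, c_in, c_rel, c_mid, footprint=7):
        super().__init__()
        self.k = footprint
        self.phi = nn.Conv2d(c_in, c_rel, 1)
        self.psi = nn.Conv2d(c_in, c_rel, 1)
        self.beta = nn.Conv2d(c_in, c_mid, 1)
        # gamma = Linear -> ReLU -> Linear, realized as 1x1 convolutions over the relation vector
        self.gamma = nn.Sequential(nn.Conv2d(c_rel, c_rel, 1), nn.ReLU(inplace=True),
                                   nn.Conv2d(c_rel, c_mid, 1))

    def forward(self, x):                                       # x: (N, C, H, W)
        n, _, h, w = x.shape
        k2, pad = self.k * self.k, self.k // 2
        q = self.phi(x).view(n, -1, 1, h * w)                   # phi(x_i)
        kf = F.unfold(self.psi(x), self.k, padding=pad).view(n, -1, k2, h * w)   # psi(x_j), j in R(i)
        v = F.unfold(self.beta(x), self.k, padding=pad).view(n, -1, k2, h * w)   # beta(x_j)
        alpha = self.gamma(q - kf)                              # per-channel weight vectors alpha(x_i, x_j)
        y = (alpha * v).sum(dim=2)                              # Hadamard product, summed over R(i)
        return y.view(n, -1, h, w)

y = NaivePairwiseSA(c_in=64, c_rel=8, c_mid=16)(torch.randn(2, 64, 32, 32))   # (2, 16, 32, 32)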

 

Another question that has puzzled me for a long time: CBAM, self-attention, non-local attention, and almost all previous spatial attention models compute a single spatial attention map that is shared across all channels. Why? Why not compute a separate spatial attention map for each channel?

In RAM: Residual Attention Module for Single Image Super-Resolution, the authors used depth-wise convolution to compute a spatial attention map per channel, but when I experimented with it on image dehazing, its effect on network performance was negative! That paper was indeed rejected from CVPR and has very few citations; perhaps the reviewers also noticed problems with the method.

So why is per-channel spatial attention, proposed again in this paper, effective here?

Whether the spatial attention map should be shared across channels is something I hope readers will discuss in the comments!

 

  • Position encoding

A distinguishing characteristic of pairwise attention is that feature vectors x_j are processed independently and the weight computation \alpha(x_i , x_j ) cannot incorporate information from any location other than i and j. To provide some spatial context to the model, we augment the feature maps with position information. The position is encoded as follows. The horizontal and vertical coordinates along the feature map are first normalized to the range [−1, 1] in each dimension. These normalized two-dimensional coordinates are then passed through a trainable linear layer, which can map them to an appropriate range for each layer in the network. This linear mapping outputs a two-dimensional position feature p_i for each location i in the feature map. For each pair (i, j) such that j \in R(i), we encode the relative position information by calculating the difference p_i-p_j. The output of \delta(x_i , x_j ) is augmented by concatenating [p_i-p_j] prior to the mapping \gamma.

This part describes how positions are encoded. The issue stems from self-attention lacking a mechanism for recognizing spatial direction (relative position). In NLP, Universal Transformer and Transformer-XL were proposed precisely to address the Transformer's lack of relative position encoding.

Where is this used?

The output of \delta(x_i, x_j) is concatenated with [p_i - p_j] and then fed into the mapping \gamma; a small sketch is given below.
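
A small sketch of how the relative position term can be built (my reading of the paragraph above; the official position() helper appears in the code listing later in this post): normalized coordinates are mapped by a trainable 1×1 convolution, and p_i - p_j is formed for every j in the footprint.

import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_position(h, w, footprint=7):
    # normalized coordinates in [-1, 1] along each axis
    ys = torch.linspace(-1.0, 1.0, h).view(h, 1).expand(h, w)
    xs = torch.linspace(-1.0, 1.0, w).view(1, w).expand(h, w)
    coords = torch.stack([xs, ys], dim=0).unsqueeze(0)          # (1, 2, H, W)
    p = nn.Conv2d(2, 2, kernel_size=1)(coords)                  # trainable linear map -> p_i
    p_j = F.unfold(p, footprint, padding=footprint // 2)        # p_j for every j in R(i)
    p_j = p_j.view(1, 2, footprint * footprint, h * w)
    p_i = p.view(1, 2, 1, h * w)
    return p_i - p_j   # (1, 2, k*k, H*W), concatenated to delta(x_i, x_j) before gamma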

 

Patchwise Self-attention

The other type of self-attention we explore is referred to as patchwise and has the following form:

y_i = \sum_{j\in R(i)} \alpha(x_{R(i)})_j \odot \beta(x_j ),                       (4)

where x_{R(i)} is the patch of feature vectors in the footprint R(i). \alpha(x_{R(i)}) is a tensor of the same spatial dimensionality as the patch x_{R(i)}. \alpha(x_{R(i)})_j is the vector at location j in this tensor, corresponding spatially to the vector x_j in x_{R(i)} .

The general form of patchwise attention is not hard to understand: the attention \alpha is computed within a local patch R(i).

 

In patchwise self-attention, we allow the construction of the weight vector that is applied to \beta(x_j ) to refer to and incorporate information from all feature vectors in the footprint R(i). Note that, unlike pairwise self-attention, patchwise self-attention is no longer a set operation with respect to the features x_j. It is not permutation-invariant or cardinality-invariant: the weight computation \alpha(x_{R(i)}) can index the feature vectors x_j individually, by location, and can intermix information from feature vectors from different locations within the footprint. Patchwise self-attention is thus strictly more powerful than convolution.

This paragraph analyzes patchwise attention.

I do not yet fully understand the part highlighted in bold; discussion in the comments is welcome!

 

We decompose \alpha(x_{R(i)}) as follows:

\alpha(x_{R(i)})=\gamma(\delta(x_{R(i)})) .                           (5)

The function \gamma maps a vector produced by \delta(x_{R(i)}) to a tensor of appropriate dimensionality. This tensor comprises weight vectors for all locations j. The function \delta combines the feature vectors x_j from the patch x_{R(i)} . We explore the following forms for this combination:

Star-product: \delta(x_{R(i)}) = [\phi(x_i)^T \psi(x_j)]_{\forall j \in R(i)}

Clique-product: \delta(x_{R(i)}) = [\phi(x_j)^T \psi(x_k)]_{\forall j,k \in R(i)}

Concatenation: \delta(x_{R(i)}) = [\phi(x_i), [\psi(x_j)]_{\forall j \in R(i)}]
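
Analogously to the pairwise sketch earlier, here is a naive PyTorch sketch of patchwise self-attention using the concatenation form of \delta. It again ignores the share_planes grouping and the custom operators of the official code, and the intermediate channel sizes are arbitrary choices of mine; the point is that the weights for every j in R(i) are predicted jointly from the whole patch descriptor.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NaivePatchwiseSA(nn.Module):
    """Eq. (4)-(5) with delta(x_{R(i)}) = [phi(x_i), [psi(x_j)]_{j in R(i)}]."""
    def __init__(self, c_in, c_rel, c_mid, footprint=7):
        super().__init__()
        self.k = footprint
        k2 = footprint * footprint
        self.phi = nn.Conv2d(c_in, c_rel, 1)
        self.psi = nn.Conv2d(c_in, c_rel, 1)
        self.beta = nn.Conv2d(c_in, c_mid, 1)
        # gamma maps the concatenated patch descriptor to k*k weight vectors of dimension c_mid
        self.gamma = nn.Sequential(nn.Conv2d(c_rel * (k2 + 1), c_mid, 1), nn.ReLU(inplace=True),
                                   nn.Conv2d(c_mid, k2 * c_mid, 1))

    def forward(self, x):                                       # x: (N, C, H, W)
        n, _, h, w = x.shape
        k2, pad = self.k * self.k, self.k // 2
        q = self.phi(x).view(n, -1, 1, h * w)                   # phi(x_i)
        kf = F.unfold(self.psi(x), self.k, padding=pad).view(n, -1, 1, h * w)    # all psi(x_j) in R(i)
        alpha = self.gamma(torch.cat([q, kf], dim=1))           # weights predicted from the whole patch
        alpha = alpha.view(n, -1, k2, h * w)                    # (N, c_mid, k*k, H*W)
        v = F.unfold(self.beta(x), self.k, padding=pad).view(n, -1, k2, h * w)   # beta(x_j)
        y = (alpha * v).sum(dim=2)                              # weight vectors can differ for each j
        return y.view(n, -1, h, w)

y = NaivePatchwiseSA(c_in=64, c_rel=8, c_mid=16)(torch.randn(2, 64, 32, 32))    # (2, 16, 32, 32)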

 

Self-attention Block

The self-attention operations described in Sections 3.1 and 3.2 can be used to construct residual blocks [12] that perform both feature aggregation and feature transformation. Our self-attention block is illustrated in Figure 1. The input feature tensor (channel dimensionality C) is passed through two processing streams. The left stream evaluates the attention weights \alpha by computing the function \delta (via the mappings \phi and \psi) and a subsequent mapping \gamma. The right stream applies a linear transformation \beta that transforms the input features and reduces their dimensionality for efficient processing. The outputs of the two streams are then aggregated via a Hadamard product. The combined features are passed through a normalization and an elementwise nonlinearity, and are processed by a final linear layer that expands their dimensionality back to C.

This paragraph describes Figure 1.

Note 1: the aggregation is a Hadamard product, which requires the two tensors to have matching dimensions.

Note 2: the leftmost branch passes its input through linear layers that output C/r_1 channels, while the middle branch's linear layer outputs C/r_2 channels; how can they then be combined with a Hadamard product? Through \gamma = \{Linear \rightarrow ReLU \rightarrow Linear\}. Is it the first or the second linear that maps C/r_1 to C/r_2? Judging from the official code below, it is the second linear in \gamma (the last convolution inside conv_w), whose output dimensionality is chosen to match the \beta stream up to the group sharing controlled by share_planes.

 

Figure 1. Our self-attention block. C is the channel dimensionality. The left stream evaluates the attention weights \alpha, the right stream transforms the features via a linear mapping \beta. Both streams reduce the channel dimensionality for efficient processing. The outputs of the streams are aggregated via a Hadamard product and the dimensionality is subsequently expanded back to C.

 

  • Network Architectures

Our network architectures generally follow residual networks, which we will use as baselines [12]. Table 1 presents three architectures obtained by stacking self-attention blocks at different resolutions. These architectures – SAN10, SAN15, and SAN19 – are in rough correspondence with ResNet26, ResNet38, and ResNet50. The number X in SANX refers to the number of self-attention blocks. Our architectures are based fully on self-attention.

Table 1. Self-attention networks for image recognition. ‘C-d linear’ means that the output dimensionality of the linear layer is ‘C’. ‘C-d sa’ stands for a self-attention operation with output dimensionality ‘C’. SAN10, SAN15, and SAN19 are in rough correspondence with ResNet26, ResNet38, and ResNet50, respectively. The number X in SANX refers to the number of self-attention blocks. Our architectures are based fully on self-attention.

The proposed SAN consists of three kinds of components: SA blocks, transition layers, and a classification head.

Backbone. The backbone of SAN has five stages, each with different spatial resolution, yielding a resolution reduction factor of 32. Each stage comprises multiple self-attention blocks. Consecutive stages are bridged by transition layers that reduce spatial resolution and expand channel dimensionality. The output of the last stage is processed by a classification layer that comprises global average pooling, a linear layer, and a softmax.

Transition. Transition layers reduce spatial resolution, thus reducing the computational burden and expanding receptive field. The transition comprises a batch normalization layer, a ReLU [25], 2×2 max pooling with stride 2, and a linear mapping that expands channel dimensionality.

Footprint of self-attention. The local footprint R(i) controls the amount of context gathered by a self-attention operator from the preceding feature layer. We set the footprint size to 7×7 for the last four stages of SAN. The footprint is set to 3×3 in the first stage due to the high resolution of that stage and the consequent memory consumption. Note that increasing the footprint size has no impact on the number of parameters in pairwise self-attention. We will study the effect of footprint size on accuracy, capacity, and FLOPs in Section 5.3.

Instantiations. The number of self-attention blocks in each stage can be adjusted to obtain networks with different capacities. In the networks presented in Table 1, the number of self-attention blocks used in the last four stages is the same as the number of residual blocks in ResNet26, ResNet38, and ResNet50, respectively.

The network structure is already clear from Table 1; a minimal sketch of a transition layer is given below.
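
A minimal sketch of a transition layer as described above (the official code later in this post arranges these pieces slightly differently inside SAN.forward, as pool → 1×1 conv → SA blocks → BN → ReLU):

import torch
import torch.nn as nn

class Transition(nn.Module):
    """BN -> ReLU -> 2x2 max pooling (stride 2) -> linear channel expansion."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.bn = nn.BatchNorm2d(c_in)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)                # halve the spatial resolution
        self.expand = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)  # expand channel dimensionality

    def forward(self, x):
        return self.expand(self.pool(self.relu(self.bn(x))))

y = Transition(64, 256)(torch.randn(2, 64, 56, 56))   # (2, 256, 28, 28)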


Official Code

import torch
import torch.nn as nn

from lib.sa.modules import Subtraction, Subtraction2, Aggregation
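# Subtraction, Subtraction2, and Aggregation are custom operators shipped with the SAN
# repository (lib/sa): footprint-wise feature differences and the weighted aggregation of Eq. (2)/(4).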


def conv1x1(in_planes, out_planes, stride=1):
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)


def position(H, W, is_cuda=True):
    if is_cuda:
        loc_w = torch.linspace(-1.0, 1.0, W).cuda().unsqueeze(0).repeat(H, 1)
        loc_h = torch.linspace(-1.0, 1.0, H).cuda().unsqueeze(1).repeat(1, W)
    else:
        loc_w = torch.linspace(-1.0, 1.0, W).unsqueeze(0).repeat(H, 1)
        loc_h = torch.linspace(-1.0, 1.0, H).unsqueeze(1).repeat(1, W)
    loc = torch.cat([loc_w.unsqueeze(0), loc_h.unsqueeze(0)], 0).unsqueeze(0)
    return loc


class SAM(nn.Module):
    def __init__(self, sa_type, in_planes, rel_planes, out_planes, share_planes, kernel_size=3, stride=1, dilation=1):
        super(SAM, self).__init__()
        self.sa_type, self.kernel_size, self.stride = sa_type, kernel_size, stride
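        # conv1/conv2/conv3 play the roles of phi/psi/beta from the paper;
        # conv_w realizes gamma (Linear -> ReLU -> Linear) and conv_p maps the
        # 2-channel normalized position grid (used by the pairwise branch only).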
        self.conv1 = nn.Conv2d(in_planes, rel_planes, kernel_size=1)
        self.conv2 = nn.Conv2d(in_planes, rel_planes, kernel_size=1)
        self.conv3 = nn.Conv2d(in_planes, out_planes, kernel_size=1)
        if sa_type == 0:
            self.conv_w = nn.Sequential(nn.BatchNorm2d(rel_planes + 2), nn.ReLU(inplace=True),
                                        nn.Conv2d(rel_planes + 2, rel_planes, kernel_size=1, bias=False),
                                        nn.BatchNorm2d(rel_planes), nn.ReLU(inplace=True),
                                        nn.Conv2d(rel_planes, out_planes // share_planes, kernel_size=1))
            self.conv_p = nn.Conv2d(2, 2, kernel_size=1)
            self.subtraction = Subtraction(kernel_size, stride, (dilation * (kernel_size - 1) + 1) // 2, dilation, pad_mode=1)
            self.subtraction2 = Subtraction2(kernel_size, stride, (dilation * (kernel_size - 1) + 1) // 2, dilation, pad_mode=1)
            self.softmax = nn.Softmax(dim=-2)
        else:
            self.conv_w = nn.Sequential(nn.BatchNorm2d(rel_planes * (pow(kernel_size, 2) + 1)), nn.ReLU(inplace=True),
                                        nn.Conv2d(rel_planes * (pow(kernel_size, 2) + 1), out_planes // share_planes, kernel_size=1, bias=False),
                                        nn.BatchNorm2d(out_planes // share_planes), nn.ReLU(inplace=True),
                                        nn.Conv2d(out_planes // share_planes, pow(kernel_size, 2) * out_planes // share_planes, kernel_size=1))
            self.unfold_i = nn.Unfold(kernel_size=1, dilation=dilation, padding=0, stride=stride)
            self.unfold_j = nn.Unfold(kernel_size=kernel_size, dilation=dilation, padding=0, stride=stride)
            self.pad = nn.ReflectionPad2d(kernel_size // 2)
        self.aggregation = Aggregation(kernel_size, stride, (dilation * (kernel_size - 1) + 1) // 2, dilation, pad_mode=1)

    def forward(self, x):
        x1, x2, x3 = self.conv1(x), self.conv2(x), self.conv3(x)
        if self.sa_type == 0:  # pairwise
            p = self.conv_p(position(x.shape[2], x.shape[3], x.is_cuda))
            w = self.softmax(self.conv_w(torch.cat([self.subtraction2(x1, x2), self.subtraction(p).repeat(x.shape[0], 1, 1, 1)], 1)))
        else:  # patchwise
            if self.stride != 1:
                x1 = self.unfold_i(x1)
            x1 = x1.view(x.shape[0], -1, 1, x.shape[2]*x.shape[3])
            x2 = self.unfold_j(self.pad(x2)).view(x.shape[0], -1, 1, x1.shape[-1])
            w = self.conv_w(torch.cat([x1, x2], 1)).view(x.shape[0], -1, pow(self.kernel_size, 2), x1.shape[-1])
        x = self.aggregation(x3, w)
        return x


class Bottleneck(nn.Module):
    def __init__(self, sa_type, in_planes, rel_planes, mid_planes, out_planes, share_planes=8, kernel_size=7, stride=1):
        super(Bottleneck, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.sam = SAM(sa_type, in_planes, rel_planes, mid_planes, share_planes, kernel_size, stride)
        self.bn2 = nn.BatchNorm2d(mid_planes)
        self.conv = nn.Conv2d(mid_planes, out_planes, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.stride = stride

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(x))
        out = self.relu(self.bn2(self.sam(out)))
        out = self.conv(out)
        out += identity
        return out


class SAN(nn.Module):
    def __init__(self, sa_type, block, layers, kernels, num_classes):
        super(SAN, self).__init__()
        c = 64
        self.conv_in, self.bn_in = conv1x1(3, c), nn.BatchNorm2d(c)
        self.conv0, self.bn0 = conv1x1(c, c), nn.BatchNorm2d(c)
        self.layer0 = self._make_layer(sa_type, block, c, layers[0], kernels[0])

        c *= 4
        self.conv1, self.bn1 = conv1x1(c // 4, c), nn.BatchNorm2d(c)
        self.layer1 = self._make_layer(sa_type, block, c, layers[1], kernels[1])

        c *= 2
        self.conv2, self.bn2 = conv1x1(c // 2, c), nn.BatchNorm2d(c)
        self.layer2 = self._make_layer(sa_type, block, c, layers[2], kernels[2])

        c *= 2
        self.conv3, self.bn3 = conv1x1(c // 2, c), nn.BatchNorm2d(c)
        self.layer3 = self._make_layer(sa_type, block, c, layers[3], kernels[3])

        c *= 2
        self.conv4, self.bn4 = conv1x1(c // 2, c), nn.BatchNorm2d(c)
        self.layer4 = self._make_layer(sa_type, block, c, layers[4], kernels[4])

        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(c, num_classes)

    def _make_layer(self, sa_type, block, planes, blocks, kernel_size=7, stride=1):
        layers = []
        for _ in range(0, blocks):
            layers.append(block(sa_type, planes, planes // 16, planes // 4, planes, 8, kernel_size, stride))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.relu(self.bn_in(self.conv_in(x)))
        x = self.relu(self.bn0(self.layer0(self.conv0(self.pool(x)))))
        x = self.relu(self.bn1(self.layer1(self.conv1(self.pool(x)))))
        x = self.relu(self.bn2(self.layer2(self.conv2(self.pool(x)))))
        x = self.relu(self.bn3(self.layer3(self.conv3(self.pool(x)))))
        x = self.relu(self.bn4(self.layer4(self.conv4(self.pool(x)))))

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x


def san(sa_type, layers, kernels, num_classes):
    model = SAN(sa_type, Bottleneck, layers, kernels, num_classes)
    return model


if __name__ == '__main__':
    net = san(sa_type=0, layers=(3, 4, 6, 8, 3), kernels=[3, 7, 7, 7, 7], num_classes=1000).cuda().eval()
    print(net)
    y = net(torch.randn(4, 3, 224, 224).cuda())
    print(y.size())

Comparison

In this section, we relate the family of self-attention operators presented in Section 3 to other constructions, including convolution [19] and scalar attention [33, 35, 27, 13]. Table 2 summarizes some differences between the constructions. These are discussed in more detail below.

Table 2. The convolution does not adapt to the content of the image. Scalar attention produces scalar weights that do not vary along the channel dimension. Our operators efficiently compute attention weights that adapt across both spatial dimensions and channels.

Convolution. The regular convolution operator has fixed kernel weights that are independent of the content of the image. It does not adapt to the input content. The kernel weights can vary across channels.

Scalar attention. Scalar attention, as used in the transformer [33] and related constructions in computer vision [35, 27, 13], typically has the following form:

y_i = \sum_{j\in R(i)} \phi(x_i)^T \psi(x_j)\, \beta(x_j)                       (6)

(A softmax and other forms of normalization can be added.) Unlike the convolution, the aggregation weights can vary across different locations, depending on the content of the image. On the other hand, the weight \phi (x_i)^T\psi (x_j ) is a scalar that is shared across all channels. (Hu et al. [13] explored alternatives to the dot product, but these alternatives operated on scalar weights that were likewise shared across channels.) This construction does not adapt the attention weights at different channels. Although this can be mitigated to some extent by introducing multiple heads [33], the number of heads is a small constant and scalar weights are shared by all channels within a head.

Vector attention (ours). The operators presented in Section 3 subsume scalar attention and generalize it in important ways. First, within the pairwise attention family, the relation function \delta can produce vector output. This is the case for the summation, subtraction, Hadamard, and concatenation forms. This vector can then be further processed and mapped to the right dimensionality by \gamma, which can also take position encoding channels as input. The mapping \gamma produces a vector that has compatible dimensionality to the transformed features \beta. This gives the construction significant flexibility in accommodating different relation functions and auxiliary inputs, expressive power due to multiple linear mappings and nonlinearities along the computation graph, ability to produce attention weights that vary along both spatial and channel dimensions, and computational efficiency due to the ability to reduce dimensionality by the mappings \gamma and \beta.

The patchwise family of operators generalizes convolution while retaining parameter and FLOP efficiency. This family of operators produces weight vectors for all positions along a feature map that also vary along the channel dimension. The weight vectors are informed by the entirety of the footprint of the operator.
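
A toy shape comparison (hypothetical tensors, my own illustration) of the distinction summarized in Table 2: scalar attention produces one weight per pair (i, j) that is broadcast over all channels, whereas the vector attention used here produces a weight vector per pair that also varies along the channel dimension.

import torch

N, C, k2, L = 2, 64, 9, 32 * 32          # batch, channels, footprint size, number of locations
v = torch.randn(N, C, k2, L)             # transformed features beta(x_j) over each footprint

w_scalar = torch.randn(N, 1, k2, L)      # Eq. (6): a single scalar per j, shared by all channels
y_scalar = (w_scalar * v).sum(dim=2)     # (N, C, L)

w_vector = torch.randn(N, C, k2, L)      # Eq. (2)/(4): a full weight vector per j
y_vector = (w_vector * v).sum(dim=2)     # (N, C, L) -- the weights now adapt per channel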

[This post is continuously being updated...]

 
