MyDLNote - Network: [NLA Series] Efficient Attention: Attention with Linear Complexities

Efficient Attention: Attention with Linear Complexities

[paper] Efficient Attention: Attention with Linear Complexities

[Project] https://cmsflash.github.io/ai/2019/12/02/efficient-attention.html

[GitHub] https://github.com/cmsflash/efficient-attention

Let me start with my take on this paper. Technically, efficient attention reorders the multiplication of the query, key, and value matrices in non-local attention. Rather than computing a similarity between every pair of pixels, it treats the key as a set of attention maps, each representing one semantic aspect; the number of semantic aspects can be understood as the number of key channels. These semantic attention maps aggregate the value matrix into a matrix of global context vectors. The query then serves as a set of coefficients and is multiplied with this context matrix. As a result, the computation drops from quadratic in the number of pixels to quadratic in the number of semantic aspects.

 

[Non-Local Attention Series]

Non-local neural networks

GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond [my CSDN]

Asymmetric Non-local Neural Networks for Semantic Segmentation [my CSDN]

Efficient Attention: Attention with Linear Complexities [my CSDN]

CCNet: Criss-Cross Attention for Semantic Segmentation [my CSDN]

Non-locally Enhanced Encoder-Decoder Network for Single Image De-raining [my CSDN]

Image Restoration via Residual Non-local Attention Networks [my CSDN]


Contents

Efficient Attention: Attention with Linear Complexities

Abstract

Introduction

Related Works

Dot-Product Attention

Scaling Attention

Method

3.1. A Revisit of Dot-Product Attention

3.2. Efficient Attention

3.3. Equivalence between Dot-Product and Efficient Attention

3.4. Interpretation of Efficient Attention

3.5. Efficiency Advantage



Abstract

The attention mechanism has seen wide applications in computer vision and natural language processing. Recent works developed the dot-product attention mechanism and applied it to various vision and language tasks. However, the memory and computational costs of dot-product attention grow quadratically with the spatiotemporal size of the input. Such growth prohibits the application of the mechanism on large inputs, e.g., long sequences, high-resolution images, or large videos.

The problem this paper tackles: dot-product attention (i.e., the conventional non-local network, self-attention, and so on) is simply too expensive to compute.

To remedy this drawback, this paper proposes a novel efficient attention mechanism, which is equivalent to dot-product attention but has substantially less memory and computational costs. The resource efficiency allows more widespread and flexible incorporation of efficient attention modules into a neural network, which leads to improved accuracies.

Key point: the paper puts forward the notion of resource efficiency, i.e., extracting enough useful information from a small resource budget, which allows efficient attention modules to be incorporated into a neural network more widely and flexibly, and thereby improves accuracy.

Further, the resource efficiency of the mechanism democratizes attention to complicated models, which were unable to incorporate original dot-product attention due to prohibitively high costs. As an exemplar, an efficient attention-augmented model achieved state-of-the-art accuracies for stereo depth estimation on the Scene Flow dataset. 

Dot-product attention is too costly to use in complicated models; the proposed efficient attention is not.

 


Introduction

Attention is a mechanism in neural networks that focuses on long-range dependency modeling, a key challenge to deep learning that convolution and recurrence struggle to solve. A recent series of works developed the highly successful dot-product attention mechanism, which facilitates easy integration into a deep neural network. The mechanism computes the response at every position as a weighted sum of features at all positions in the previous layer. In contrast to the limited spatial and temporal receptive fields of convolution and recurrence, dot-product attention expands the receptive field to the entire input in one pass. Using dot-product attention to efficiently model long-range dependencies allows convolution and recurrence to focus on local dependency modeling, in which they specialize. Dot-product attention-based models now hold state-of-the-art records on nearly all tasks in natural language processing [25, 20, 6, 21]. The non-local module [26], an adaptation of dot-product attention for computer vision, achieved state-of-the-art performance on video classification [26] and generative adversarial image modeling [29, 3] and demonstrated significant improvements on object detection [26], instance segmentation [26], person re-identification [14], and image deraining [12], etc.

In short: attention targets long-range dependency modeling, a key challenge that convolution and recurrence struggle with. Dot-product attention computes each position's response as a weighted sum of the features at all positions in the previous layer, so unlike the limited receptive fields of convolution and recurrence, its receptive field covers the entire input in a single pass.

However, global dependency modeling on large inputs, e.g. long sequences, high-definition images, and large videos, remains an unsolved problem. The quadratic memory and computational complexities with respect to the input size of existing dot-product attention modules inhibit their application on such large inputs.

The high memory and computational costs constrain the application of dot-product attention to the low-resolution or short-temporal-span parts of models [26, 29, 3] and prohibits its use for resolution-sensitive or resource-hungry tasks.

In short: global dependency modeling on large inputs, such as long sequences, high-resolution images, and large videos, remains an unsolved problem, because the quadratic costs of dot-product attention rule it out there.

The need for global dependency modeling on large inputs greatly motivates the exploration for a resource-efficient attention algorithm. An investigation into the nonlocal module revealed an intriguing phenomenon. The attention maps at each position, despite generated independently, are correlated. As [ Non-local neural networks ] and [ Self-attention generative adversarial networks ] analyzed, the attention map of a position mainly focuses on semantically related regions. Figure 1 shows the learned attention maps in a non-local module. When generating an image of a bird before a bush, pixels on the legs tend to attend to other leg pixels for structural consistency. Similarly, body pixels mainly attend to the body, and background pixels focus on the bush.

In short: the need for global dependency modeling on large inputs motivates a resource-efficient attention algorithm. Inspecting the non-local module reveals an intriguing phenomenon: although the attention map at each position is generated independently, the maps are correlated and mainly focus on semantically related regions. When generating an image of a bird in front of a bush, leg pixels tend to attend to other leg pixels for structural consistency; likewise, body pixels mainly attend to the body, and background pixels focus on the bush.

Figure 1. An illustration of the learned attention maps in a nonlocal module. The first image identifies five query positions with colored dots. Each of the subsequent images illustrates the attention map for one of the positions. Adapted from [ Self-attention generative adversarial networks].

 

This observation inspired the design of the efficient attention mechanism that this paper proposes.

The mechanism first generates a key feature map, a query feature map, and a value feature map from an input.

It interprets each channel of the key feature map as a global attention map.

Using each global attention map as the weights, efficient attention aggregates the value feature map to produce a global context vector that summarizes an aspect of the global features.

Then, at each position, the module regards the query feature as a set of coefficients over the global context vectors.

Finally, the module computes a sum of the global context vectors with the query feature as the weights to produce the output feature at the position.

This algorithm avoids the generation of the pairwise attention matrix, whose size is quadratic in the spatiotemporal size of the input. Therefore, it achieves linear complexities with respect to input size and obtains significant efficiency improvements.

The efficient attention mechanism, step by step:

1. Generate a key feature map, a query feature map, and a value feature map from the input, just as in conventional dot-product attention.

2. Interpret each channel of the key feature map as a global attention map.

3. Using each global attention map as the weights, aggregate the value feature map into a global context vector that summarizes one aspect of the global features.

4. At each position, treat the query feature as a set of coefficients over the global context vectors.

5. Compute the weighted sum of the global context vectors, with the query feature as the weights, to produce the output feature at that position.

This algorithm never forms the pairwise attention matrix, whose size is quadratic in the spatiotemporal size of the input, so it achieves linear complexity with respect to input size and a substantial efficiency gain. The sketch below traces the five steps with plain tensor operations.
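As a minimal sketch of these five steps (the softmax variant), assuming a flattened input `x` of shape `(n, d)` and hypothetical random projection weights in place of learned layers:

```python
import torch

# Hypothetical sizes: n positions, input dim d, key/query dim d_k, value dim d_v
n, d, d_k, d_v = 1024, 64, 32, 64
x = torch.randn(n, d)                                   # flattened input features

# Step 1: linear projections to query, key, and value features
W_q, W_k, W_v = torch.randn(d, d_k), torch.randn(d, d_k), torch.randn(d, d_v)
Q, K, V = x @ W_q, x @ W_k, x @ W_v                     # (n, d_k), (n, d_k), (n, d_v)

# Step 2: read each of the d_k key channels as a global attention map,
# normalized over the n positions so that each map sums to 1
attn_maps = torch.softmax(K, dim=0)                     # (n, d_k)

# Step 3: aggregate the value features with each global attention map,
# giving d_k global context vectors
G = attn_maps.transpose(0, 1) @ V                       # (d_k, d_v)

# Steps 4-5: each position's query acts as coefficients over the global
# context vectors; their weighted sum is that position's output feature
coeffs = torch.softmax(Q, dim=1)                        # (n, d_k)
out = coeffs @ G                                        # (n, d_v)
print(out.shape)                                        # torch.Size([1024, 64])
```

Note that no n × n matrix appears anywhere: the largest intermediate is the d_k × d_v context matrix.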

 

The principal contribution of this paper is the efficient attention mechanism, which:

1. has linear memory and computational complexities with respect to the spatiotemporal size of the input;

Memory and computation scale linearly with the input's spatiotemporal size.

2. possesses the same representational power as the widely adopted dot-product attention mechanism;

It has the same representational power as the widely adopted dot-product attention mechanism.

3. allows the incorporation of significantly more attention modules into a neural network, which brings substantial performance boosts to tasks such as object detection and instance segmentation (on MS-COCO 2017) and image classification (on ImageNet); and

Far more attention modules can be incorporated into a network, giving substantial gains on object detection and instance segmentation (MS-COCO 2017) and image classification (ImageNet).

4. facilitates the application of attention on resource-hungry tasks, such as stereo depth estimation (on the Scene Flow dataset).

Attention becomes affordable for resource-hungry tasks such as stereo depth estimation (Scene Flow dataset).

 

 


Related Works

Dot-Product Attention

[Neural machine translation by jointly learning to align and translate] proposed the initial formulation of the dot-product attention mechanism to improve word alignment in machine translation. Successively, [Attention is all you need] proposed to completely replace recurrence with attention and named the resultant architecture the Transformer. The Transformer architecture is highly successful on sequence tasks. Transformer-based models hold the state-of-the-art records on virtually all tasks in natural language processing [6, 21, 28] and are highly competitive on end-to-end speech recognition [7, 19]. [Non-local neural networks] first adapted dot-product attention for computer vision and proposed the non-local module. They achieved state-of-the-art performance on video classification and demonstrated significant improvements on object detection, instance segmentation, and pose estimation. Subsequent works applied it to various fields in computer vision, including image restoration [Non-local recurrent network for image restoration], video person re-identification [14], and notably generative adversarial image modeling, where SAGAN [Self-attention generative adversarial networks] and BigGAN [Large Scale GAN Training for High Fidelity Natural Image Synthesis] substantially advanced the state-of-the-art using the non-local module.

Efficient attention mainly builds upon the version of dot-product attention in the non-local module. Following [Non-local neural networks], the team conducted most experiments on object detection and instance segmentation. The paper compares the resource efficiency of the efficient attention module against the non-local module under the same performance and their performance under the same resource constraints.

The paper compares the resource efficiency of the efficient attention module against the non-local module at equal performance, and compares their performance under equal resource constraints.

 

Scaling Attention

Besides dot-product attention, there is a separate set of techniques the literature refers to as attention. This section refers to them as scaling attention. While dot-product attention is effective for global dependency modeling, scaling attention focuses on emphasizing important features and suppressing uninformative ones. For example, the squeeze-and-excitation (SE) module uses global average pooling and a linear layer to compute a scaling factor for each channel and then scales the channels accordingly. SE-enhanced models achieved state-of-the-art performance on image classification and substantial improvements on scene segmentation and object detection. On top of SE, CBAM added global max pooling beside global average pooling and an extra spatial attention submodule. It further improved SE's performance.

These are the scaling attention models; a minimal SE sketch is shown below.
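For reference, a squeeze-and-excitation block looks roughly like the following (a sketch of the general idea, not code from the SE paper; the reduction ratio of 16 is its usual default):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block: global average pooling followed
    by a small bottleneck MLP that produces per-channel scaling factors."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        scale = self.fc(x.mean(dim=(2, 3)))               # squeeze: (B, C)
        return x * scale[:, :, None, None]                # excite: rescale channels
```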

Despite both names containing attention, dot-product attention and scaling attention are two completely separate sets of techniques with very different goals. When appropriate, one might take both techniques and let them work in conjunction. Therefore, it is unnecessary to make any comparison of efficient attention with scaling attention techniques.

Dot-product attention and scaling attention are two entirely separate families of attention techniques, and they can be used in combination.

 


Method

This section introduces the efficient attention mechanism. It is mathematically equivalent to the widely adopted dot-product attention mechanism in computer vision (i.e., the attention mechanism in the Transformer and the non-local module). However, efficient attention has linear memory and computational complexities with respect to the number of pixels or words (hereafter referred to as positions).

Section 3.1 reviews the dot-product attention mechanism and identifies its critical drawback on large inputs to motivate efficient attention. The introduction of the efficient attention mechanism is in Section 3.2. Section 3.3 shows the equivalence between dot-product and efficient attention. Section 3.4 discusses the interpretation of the mechanism. Section 3.5 analyzes its efficiency advantage over dot-product attention.

Sections 3.1 through 3.5 cover, in order: a review of dot-product attention and its critical drawback on large inputs, which motivates efficient attention; the efficient attention mechanism itself; the equivalence between dot-product and efficient attention; the interpretation of the mechanism; and its efficiency advantage over dot-product attention.

 

3.1. A Revisit of Dot-Product Attention

Modeling long-range dependencies has been a central challenge for natural language processing and computer vision. [Neural machine translation by jointly learning to align and translate] initially proposed dot-product attention for machine translation. Subsequently, the Transformer [Attention is all you need] adopted the mechanism to model long-range temporal dependencies between words. [Non-local neural networks] introduced dot-product attention for the modeling of long-range dependencies between pixels in image and video understanding.

 

For each input feature vector x_i \in \mathbb{R}^d that corresponds to the i-th position, dot-product attention first uses three linear layers to convert x_i into three feature vectors, i.e., the query feature q_i \in \mathbb{R}^{d_k} , the key feature k_i \in \mathbb{R}^{d_k}, and the value feature v_i \in \mathbb{R}^{d_v}. The query and key features must have the same feature dimension d_k. One can measure the similarity between the i-th query and the j-th key as \rho(q^T_i k_j ), where \rho is a normalization function. In general, the similarities are asymmetric, since the query and key features are the outputs of two separate layers. The dot-product attention module calculates the similarities between all pairs of positions. Using the similarities as weights, position i aggregates the value features from all positions via weighted summation to obtain its output feature.

The similarity between the i-th query and the j-th key is measured as \rho(q^T_i k_j), where \rho is a normalization function. In general the similarities are asymmetric, since the query and key features come from two separate layers. The module computes the similarities between all pairs of positions and, using them as weights, aggregates the value features from all positions by weighted summation to obtain each output feature.

If one represents all n positions’ query, key, and value features in matrix forms as Q \in\mathbb{R}^{n\times d_k} , K\in\mathbb{R}^{n\times d_k} ,V\in\mathbb{R}^{n\times d_v} , respectively, the output of dot-product attention is

D(Q,K,V)=\rho(QK^T)V                      (1)

The normalization function \rho has two common choices:

Scaling: \rho(Y)=Y/n \\ Softmax: \rho(Y)=\sigma_{row}(Y)                  (2)

where \sigma_{row} denotes applying the softmax function along each row of matrix Y . An illustration of the dot-product attention module is in Figure 2 (left).

This is the mathematical formulation of non-local (dot-product) attention; a minimal sketch is shown below.
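A minimal sketch of Equation (1), softmax variant, with illustrative shapes; note the explicit n × n attention matrix:

```python
import torch

n, d_k, d_v = 4096, 32, 64                 # e.g. a 64 x 64 feature map, flattened
Q = torch.randn(n, d_k)
K = torch.randn(n, d_k)
V = torch.randn(n, d_v)

# rho(QK^T): an n x n matrix of pairwise similarities -- this is the term
# whose memory and computation grow quadratically with the input size
S = torch.softmax(Q @ K.transpose(0, 1), dim=1)   # (n, n)
D = S @ V                                         # (n, d_v)
print(S.shape, D.shape)                           # torch.Size([4096, 4096]) torch.Size([4096, 64])
```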

The critical drawback of this mechanism is its resource demands. Since it computes a similarity between each pair of positions, there are n^2 such similarities, which results in O(n^2 ) memory complexity and O(d_k n^2 ) computational complexity. Therefore, dot-product attention’s resource demands get prohibitively high on large inputs. In practice, application of the mechanism is only possible on low-resolution features.

In short: computing all n^2 pairwise similarities costs O(n^2) memory and O(d_k n^2) computation, so on large inputs dot-product attention becomes prohibitively expensive and, in practice, is only applicable to low-resolution features.

Figure 2. Illustration of the architecture of dot-product and efficient attention. Each box represents an input, output, or intermediate matrix. Above each box is the name of the corresponding matrix. The variable name and the size of the matrix are inside each box. ⊗ denotes matrix multiplication.

 

3.2. Efficient Attention

Observing the critical drawback of dot-product attention, this paper proposes the efficient attention mechanism, which is mathematically equivalent to dot-product attention but much faster and more memory efficient. In efficient attention, the input features X\in \mathbb{R}^{n\times d} still pass through three linear layers to form the query features Q \in \mathbb{R}^{ n\times d_k }, key features K \in \mathbb{R}^{ n\times d_k }, and value features V \in \mathbb{R}^{ n\times d_v }. However, instead of interpreting the key features as n feature vectors in \mathbb{R}^{ d_k }, the module regards them as d_k single-channel feature maps. Efficient attention uses each of these feature maps as a weighting over all positions and aggregates the value features from all positions through weighted summation to form a global context vector. The name reflects the fact that the vector does not correspond to a specific position, but is a global description of the input features.

In short: the input features X\in \mathbb{R}^{n\times d} still pass through three linear layers to form the query features Q \in \mathbb{R}^{ n\times d_k }, key features K \in \mathbb{R}^{ n\times d_k }, and value features V \in \mathbb{R}^{ n\times d_v }, but K is read as d_k single-channel feature maps rather than n per-position feature vectors. Each of these maps weights all positions, and the value features aggregated under these weights form a global context vector, which describes the input globally rather than any particular position.

The following equation characterizes the efficient attention mechanism:

E(Q,K,V)=\rho_q(Q)(\rho_k(K)^TV)                    (3)

where \rho_q and \rho_k are normalization functions for the query and key features, respectively. The implementations of the same two normalization methods as for dot-product attention are

Scaling: \rho_q(Y)=\rho_k(Y)=Y/\sqrt{n} \\ Softmax: \rho_q(Y)=\sigma_{row}(Y), \ \rho_k(Y)=\sigma_{col}(Y)         (4)

where \sigma_{row}, \sigma_{col} denote applying the softmax function along each row or column of matrix Y, respectively. The efficient attention module is a concrete implementation of the mechanism for computer vision data. For an input feature map X\in \mathbb{R}^{ h\times w\times d }, the module flattens it to a matrix X\in \mathbb{R}^{ hw\times d }, applies the efficient attention mechanism on it, and reshapes the result to h\times w \times d_v. If d_v \neq d, it further applies a 1\times 1 convolution to restore the dimensionality to d. Finally, it adds the resultant features to the input features to form a residual structure.

As you can see, efficient attention simply changes the order in which Q, K, and V are multiplied in the non-local module. Intuitively, the matrices are tall and thin: the long side is the number of pixels h\times w and the short side is the channel count. Originally the two were multiplied along the short side, producing a huge square matrix on the long side; now they are multiplied along the long side, producing a small square matrix on the short side. This change also gives Q, K, and V a new meaning.

The next two subsections compare efficient attention and non-local attention from two angles: algebraically, the formulas show the two are essentially equivalent; semantically, Q, K, and V receive a new interpretation. A module-level sketch follows.
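Below is a rough single-head sketch of the module described above, softmax variant with 1×1 convolutions and a residual connection. It is a simplification for illustration only; the official PyTorch implementation at the GitHub link above differs in details (e.g., multi-head support).

```python
import torch
import torch.nn as nn

class EfficientAttention(nn.Module):
    """Simplified single-head efficient attention (softmax variant of Eq. (3))."""

    def __init__(self, d: int, d_k: int, d_v: int):
        super().__init__()
        self.to_q = nn.Conv2d(d, d_k, kernel_size=1)
        self.to_k = nn.Conv2d(d, d_k, kernel_size=1)
        self.to_v = nn.Conv2d(d, d_v, kernel_size=1)
        # restore the channel dimension to d when d_v != d
        self.proj = nn.Conv2d(d_v, d, kernel_size=1) if d_v != d else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, d, H, W)
        b, _, h, w = x.shape
        q = self.to_q(x).flatten(2)                       # (B, d_k, n), n = H * W
        k = self.to_k(x).flatten(2)                       # (B, d_k, n)
        v = self.to_v(x).flatten(2)                       # (B, d_v, n)

        k = torch.softmax(k, dim=2)                       # rho_k: softmax over positions
        q = torch.softmax(q, dim=1)                       # rho_q: softmax over channels

        context = k @ v.transpose(1, 2)                   # (B, d_k, d_v): global context vectors
        out = (context.transpose(1, 2) @ q).view(b, -1, h, w)  # (B, d_v, H, W)
        return x + self.proj(out)                         # residual connection


# Usage: attn = EfficientAttention(d=64, d_k=32, d_v=64); y = attn(torch.randn(2, 64, 32, 32))
```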

 

3.3. Equivalence between Dot-Product and Efficient Attention

Following is a formal proof of the equivalence between dot-product and efficient attention when using scaling normalization. Substituting the scaling normalization formula in Equation (2) into Equation (1) gives

D(Q,K,V)=\frac{QK^T}{n}V                     (5)

Similarly, plugging the scaling normalization formulae in Equation (4) into Equation (3) results in

E(Q,K,V)=\frac{Q}{\sqrt{n}}(\frac{K^T}{\sqrt{n}}V)              (6)

Since scalar multiplication is commutative with matrix multiplication and matrix multiplication is associative, we have

E(Q,K,V)=\frac{Q}{\sqrt{n}}\left(\frac{K^T}{\sqrt{n}}V\right)=\frac{1}{n}Q(K^TV)=\frac{1}{n}(QK^T)V=\frac{QK^T}{n}V               (7)

Comparing Equations (5) and (7), we get

D(Q,K,V)=E(Q,K,V)               (8)

Thus, the proof is complete.

The proof is simple and speaks for itself; a quick numerical check is given below.
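The equivalence under scaling normalization is easy to verify numerically (a quick sanity check, not from the paper):

```python
import torch

n, d_k, d_v = 128, 16, 32
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_v)

D = (Q @ K.T / n) @ V                        # Eq. (5): dot-product attention, scaling normalization
E = (Q / n ** 0.5) @ ((K.T / n ** 0.5) @ V)  # Eq. (6): efficient attention, scaling normalization

print(torch.allclose(D, E, atol=1e-5))       # True (up to floating-point error)
```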

The above proof works for the softmax normalization variant with one caveat. The two softmax operations on Q, K are not exactly equivalent to the single softmax on QK^T. However, they closely approximate the effect of the original softmax function. The critical property of \sigma_{row}(QK^T) is that each row of it sums up to 1 and represents a normalized attention distribution over all positions. The matrix \sigma_{row}(Q)\sigma_{col}(K)^T shares this property. Therefore, the softmax variant of efficient attention is a close approximation of that variant of dot-product attention. Section 4.1 demonstrates this claim empirically.

In short: the softmax variants are not exactly equivalent, since applying softmax to Q and K separately is not the same as applying a single softmax to QK^T. However, the critical property of \sigma_{row}(QK^T), namely that each row sums to 1 and represents a normalized attention distribution over all positions, is shared by \sigma_{row}(Q)\sigma_{col}(K)^T. The softmax variant of efficient attention is therefore a close approximation of the softmax variant of dot-product attention, as Section 4.1 demonstrates empirically; a quick check of the row-sum property is shown below.
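The shared row-sum property can be verified directly (a quick numerical check; the n × n product is formed here only for illustration, the module itself never materializes it):

```python
import torch

n, d_k = 64, 8
Q, K = torch.randn(n, d_k), torch.randn(n, d_k)

# The implicit n x n attention matrix of the softmax variant
P = torch.softmax(Q, dim=1) @ torch.softmax(K, dim=0).T   # sigma_row(Q) sigma_col(K)^T

print(torch.allclose(P.sum(dim=1), torch.ones(n)))        # True: each row sums to 1
print(bool((P >= 0).all()))                               # True: non-negative weights
```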

 

3.4. Interpretation of Efficient Attention

Efficient attention brings a new interpretation of the attention mechanism. In dot-product attention, selecting position i as the reference position, one can collect the similarities of all positions to position i and form an attention map s_i for that position. The attention map s_i represents the degree to which position i attends to each position j in the input. A higher value for position j on s_i means position i attends more to position j. In dot-product attention, every position i has such an attention map s_i, which the mechanism uses to aggregate the value features V to produce the output at position i.

In short: in dot-product attention, each position i has its own attention map s_i, whose value at position j measures how much i attends to j; this map is used to aggregate the value features V into the output at position i.

 

In contrast, efficient attention does not generate an attention map for each position. Instead, it interprets the key features K\in \mathbb{R}^{n\times d_k} as d_k attention maps. Each k^T_j is a global attention map that does not correspond to any specific position. Instead, each of them corresponds to a semantic aspect of the entire input. For example, one such attention map might cover the persons in the input. Another might correspond to the background. Section 4.3 gives several concrete examples. Efficient attention uses each k^T_j to aggregate the value features V and produce a global context vector g_j. Since k^T_j describes a global, semantic aspect of the input, g_j also summarizes a global, semantic aspect of the input. Then, position i uses q_i as a set of coefficients over g_0, g_1, \dots, g_{d_k-1}. Using the previous example, a person pixel might place a large weight on the global context vector for persons to refine its representation. A pixel at the boundary of an object might have large weights on the global context vectors for both the object and the background to enhance the contrast.

 

In short: efficient attention has no per-position attention maps. It interprets the key features K\in \mathbb{R}^{n\times d_k} as d_k attention maps, where each k^T_j is global and corresponds not to a specific position but to a semantic aspect of the whole input, e.g. one map may cover the persons in the input and another the background (Section 4.3 gives concrete examples). Each k^T_j aggregates the value features V into a global context vector g_j, which thus also summarizes a global, semantic aspect of the input. Position i then uses q_i as a set of coefficients over g_0, g_1, \dots, g_{d_k-1}: a person pixel might put a large weight on the person context vector to refine its representation, while a pixel on an object boundary might weight both the object and the background context vectors to enhance contrast. A small visualization sketch follows.
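To make this concrete, the d_k global attention maps are simply the normalized key channels reshaped back onto the spatial grid, as in this hypothetical visualization helper (assuming key features extracted from a trained module such as the sketch in Section 3.2):

```python
import torch

def key_attention_maps(k: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """k: key features of shape (d_k, h*w) from a trained module.
    Returns d_k global attention maps of shape (d_k, h, w), each a
    distribution over all positions rather than over a reference pixel."""
    maps = torch.softmax(k, dim=1)           # normalize each channel over positions
    return maps.view(k.shape[0], h, w)

# Each maps[j] can be rendered as a heatmap; it highlights one semantic
# aspect of the input (e.g. persons or background), not one query position.
```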

 

3.5. Efficiency Advantage

This section analyzes the efficiency advantage of efficient attention over dot-product attention in memory and computation. The reason behind the efficiency advantage is that efficient attention does not compute a similarity between each pair of positions, which would occupy O(n^2) memory and require O(d_kn^2) computation to generate. Instead, it only generates d_k global context vectors in \mathbb{R}^{d_v}. This change eliminates the O(n^2) terms from both the memory and computational complexities of the module. Consequently, efficient attention has O((d_k + d)n + d_kd) memory complexity and O((d_kd + d^2)n) computational complexity, assuming the common setting of d_v = d. Table 1 shows complexity formulae of the efficient attention module and the non-local module (using dot-product attention) in detail. In computer vision, this complexity difference is very significant. Firstly, n itself is quadratic in image side length and often very large in practice. Secondly, d_k is a parameter of the module, which the designer of a network can tune to meet different resource requirements. Section 4.2.3 shows that, within a reasonable range, this parameter has minimal impact on performance. This result means that an efficient attention module can typically have a small d_k, which further increases its efficiency advantage over dot-product attention. Table 2 compares the complexities of the efficient attention module with the ResBlock. The table shows that the resource demands of the efficient attention module are on par with (less than in most cases) the ResBlock, which gives an intuitive idea of the level of efficiency of the module.

In short: efficient attention never computes pairwise similarities (which would need O(n^2) memory and O(d_k n^2) computation); it only produces d_k global context vectors in \mathbb{R}^{d_v}, eliminating the O(n^2) terms. Assuming d_v = d, this gives O((d_k + d)n + d_k d) memory complexity and O((d_k d + d^2)n) computational complexity (Table 1 lists the formulas in detail). In computer vision this difference is very significant: n is quadratic in the image side length and usually very large, and d_k is a module parameter the network designer can tune; Section 4.2.3 shows it has minimal impact on performance within a reasonable range, so it can typically be kept small, further widening the gap. Table 2 shows that the module's resource demands are on par with (and in most cases below) those of a ResBlock. A rough cost estimator is given below.
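Plugging concrete sizes into these asymptotic expressions gives a feel for the gap (a rough back-of-envelope helper based on the dominant terms quoted above; it ignores constants and implementation details, so the numbers are only indicative):

```python
def dot_product_costs(n: int, d: int, d_k: int) -> dict:
    # The pairwise attention matrix dominates: O(n^2) memory, O(d_k n^2) MACs.
    return {"memory": n * n, "macs": d_k * n * n}

def efficient_costs(n: int, d: int, d_k: int) -> dict:
    # No n x n matrix: O((d_k + d) n + d_k d) memory, O((d_k d + d^2) n) MACs.
    return {"memory": (d_k + d) * n + d_k * d, "macs": (d_k * d + d * d) * n}

n = 256 * 256              # a 256 x 256 feature map, flattened
d, d_k = 64, 32            # d_v = d, as assumed in the paper's tables
print(dot_product_costs(n, d, d_k))   # {'memory': 4294967296, 'macs': 137438953472}
print(efficient_costs(n, d, d_k))     # {'memory': 6293504, 'macs': 402653184}
```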

Table 1. Comparison of resource usage of the efficient attention and non-local modules. This table assumes that d_v=d, which is the setting for all experiments in Section 4 and also a common setting in the literature for dot-product attention.

The rest of this section will give several concrete examples comparing the resource demands of the efficient attention and non-local modules. Figure 3 compares their resource consumption for image features with different sizes. Directly substituting the non-local module on the 64 × 64 feature map in SAGAN [29] yields a 17-time saving of memory and 32-time saving of computation. The gap widens rapidly with the increase of the input size. For a 256 × 256 feature map, a non-local module would require impractical amounts of memory (17.2 GB) and computation (412 GMACC). With the same input size, an efficient attention module uses 1/260 the amount of memory and 1/515 the amount of computation. The difference is more prominent for videos. Replacing the non-local module on the tiny 28 × 28 × 4 feature volume in res3 of the non-local I3DResNet-50 network [26] results in 2-time memory and computational saving. On a larger 64 × 64 × 32 feature volume, an efficient attention module requires 1/32 the amount of memory and 1/1025 the amount of computation.

In short: directly replacing the non-local module on SAGAN's 64 × 64 feature map saves 17× memory and 32× computation, and the gap widens rapidly with input size. On a 256 × 256 feature map, a non-local module would need an impractical 17.2 GB of memory and 412 GMACC of computation, whereas an efficient attention module uses 1/260 of the memory and 1/515 of the computation. The difference is even more prominent for video: replacing the non-local module on the tiny 28 × 28 × 4 feature volume in res3 of the non-local I3D-ResNet-50 saves 2× memory and computation, and on a larger 64 × 64 × 32 volume an efficient attention module needs 1/32 of the memory and 1/1025 of the computation.

Table 2. Comparison of resource usage of the efficient attention module and the ResBlock. Since the ResBlock does not have the parameters d_k, d_v, this table sets d_v = d, d_k = d/2, the typical values for these parameters.

Figure 3. Resource requirements under different input sizes.
The blue and orange bars depict the resource requirements of the efficient attention and non-local modules, respectively. The calculation assumes d = d_v = 2d_k = 64. The figure is in log scale.
