【CV-Paper 11】 SENet-2017



Squeeze-and-Excitation Networks


The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ∼25%. Models and code are available at https://github.com/hujie-frank/SENet.


Index Terms—Squeeze-and-Excitation, Image representations, Attention, Convolutional Neural Networks.


CONVOLUTIONAL neural networks (CNNs) have proven to be useful models for tackling a wide range of visual tasks [1], [2], [3], [4]. At each convolutional layer in the network, a collection of filters expresses neighbourhood spatial connectivity patterns along input channels—fusing spatial and channel-wise information together within local receptive fields. By interleaving a series of convolutional layers with non-linear activation functions and downsampling operators, CNNs are able to produce image representations that capture hierarchical patterns and attain global theoretical receptive fields. A central theme of computer vision research is the search for more powerful representations that capture only those properties of an image that are most salient for a given task, enabling improved performance. As a widely-used family of models for vision tasks, the development of new neural network architecture designs now represents a key frontier in this search. Recent research has shown that the representations produced by CNNs can be strengthened by integrating learning mechanisms into the network that help capture spatial correlations between features. One such approach, popularised by the Inception family of architectures [5], [6], incorporates multi-scale processes into network modules to achieve improved performance. Further work has sought to better model spatial dependencies [7], [8] and incorporate spatial attention into the structure of the network [9].


In this paper, we investigate a different aspect of network design - the relationship between channels. We introduce a new architectural unit, which we term the Squeeze-and-Excitation (SE) block, with the goal of improving the quality of representations produced by a network by explicitly modelling the interdependencies between the channels of its convolutional features. To this end, we propose a mechanism that allows the network to perform feature recalibration, through which it can learn to use global information to selectively emphasise informative features and suppress less useful ones.


The structure of the SE building block is depicted in Fig. 1. For any given transformation $F_{tr}$ mapping the input $X$ to the feature maps $U$ where $U \in \R^{H×W×C}$, e.g. a convolution, we can construct a corresponding SE block to perform feature recalibration. The features $U$ are first passed through a squeeze operation, which produces a channel descriptor by aggregating feature maps across their spatial dimensions $(H×W)$. The function of this descriptor is to produce an embedding of the global distribution of channel-wise feature responses, allowing information from the global receptive field of the network to be used by all its layers. The aggregation is followed by an excitation operation, which takes the form of a simple self-gating mechanism that takes the embedding as input and produces a collection of per-channel modulation weights. These weights are applied to the feature maps $U$ to generate the output of the SE block which can be fed directly into subsequent layers of the network.

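The squeeze–excitation–scale pipeline just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the weight names `W1`/`W2` and the toy sizes are our own.

```python
import numpy as np

def se_block(U, W1, b1, W2, b2):
    """Recalibrate features U of shape (H, W, C) channel-wise.

    squeeze: global average pooling over (H, W) -> z, shape (C,)
    excite:  sigmoid(W2 @ relu(W1 @ z))         -> gates s, shape (C,)
    scale:   broadcast-multiply U by s.
    """
    z = U.mean(axis=(0, 1))                    # squeeze
    h = np.maximum(W1 @ z + b1, 0.0)           # bottleneck FC + ReLU
    s = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # sigmoid self-gating
    return U * s                               # scale

# Toy usage with C = 8 channels and reduction ratio r = 4.
rng = np.random.default_rng(0)
H, W, C, r = 6, 6, 8, 4
U = rng.standard_normal((H, W, C))
W1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
W2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
out = se_block(U, W1, b1, W2, b2)
print(out.shape)  # (6, 6, 8): same shape as U, per-channel rescaled
```

Because each gate lies strictly in (0, 1), the block can only attenuate or preserve a channel's response, never amplify it beyond the original magnitude.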


It is possible to construct an SE network (SENet) by simply stacking a collection of SE blocks. Moreover, these SE blocks can also be used as a drop-in replacement for the original block at a range of depths in the network architecture (Section 6.4). While the template for the building block is generic, the role it performs at different depths differs throughout the network. In earlier layers, it excites informative features in a class-agnostic manner, strengthening the shared low-level representations. In later layers, the SE blocks become increasingly specialised, and respond to different inputs in a highly class-specific manner (Section 7.2). As a consequence, the benefits of the feature recalibration performed by SE blocks can be accumulated through the network.


The design and development of new CNN architectures is a difficult engineering task, typically requiring the selection of many new hyperparameters and layer configurations. By contrast, the structure of the SE block is simple and can be used directly in existing state-of-the-art architectures by replacing components with their SE counterparts, where the performance can be effectively enhanced. SE blocks are also computationally lightweight and impose only a slight increase in model complexity and computational burden.


To provide evidence for these claims, we develop several SENets and conduct an extensive evaluation on the ImageNet dataset [10]. We also present results beyond ImageNet that indicate that the benefits of our approach are not restricted to a specific dataset or task. By making use of SENets, we ranked first in the ILSVRC 2017 classification competition. Our best model ensemble achieves a 2.251% top-5 error on the test set. This represents roughly a 25% relative improvement when compared to the winning entry of the previous year (top-5 error of 2.991%).



Deeper architectures. VGGNets [11] and Inception models [5] showed that increasing the depth of a network could significantly increase the quality of representations that it was capable of learning. By regulating the distribution of the inputs to each layer, Batch Normalization (BN) [6] added stability to the learning process in deep networks and produced smoother optimisation surfaces [12]. Building on these works, ResNets demonstrated that it was possible to learn considerably deeper and stronger networks through the use of identity-based skip connections [13], [14]. Highway networks [15] introduced a gating mechanism to regulate the flow of information along shortcut connections. Following these works, there have been further reformulations of the connections between network layers [16], [17], which show promising improvements to the learning and representational properties of deep networks.


An alternative, but closely related line of research has focused on methods to improve the functional form of the computational elements contained within a network. Grouped convolutions have proven to be a popular approach for increasing the cardinality of learned transformations [18], [19]. More flexible compositions of operators can be achieved with multi-branch convolutions [5], [6], [20], [21], which can be viewed as a natural extension of the grouping operator. In prior work, cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [22], [23] or jointly by using standard convolutional filters [24] with 1 × 1 convolutions. Much of this research has concentrated on the objective of reducing model and computational complexity, reflecting an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we claim that providing the unit with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process, and significantly enhance the representational power of the network.


Algorithmic Architecture Search. Alongside the works described above, there is also a rich history of research that aims to forgo manual architecture design and instead seeks to learn the structure of the network automatically. Much of the early work in this domain was conducted in the neuro-evolution community, which established methods for searching across network topologies with evolutionary methods [25], [26]. While often computationally demanding, evolutionary search has had notable successes which include finding good memory cells for sequence models [27], [28] and learning sophisticated architectures for large-scale image classification [29], [30], [31]. With the goal of reducing the computational burden of these methods, efficient alternatives to this approach have been proposed based on Lamarckian inheritance [32] and differentiable architecture search [33].


By formulating architecture search as hyperparameter optimisation, random search [34] and other more sophisticated model-based optimisation techniques [35], [36] can also be used to tackle the problem. Topology selection as a path through a fabric of possible designs [37] and direct architecture prediction [38], [39] have been proposed as additional viable architecture search tools. Particularly strong results have been achieved with techniques from reinforcement learning [40], [41], [42], [43], [44]. SE blocks can be used as atomic building blocks for these search algorithms, and were demonstrated to be highly effective in this capacity in concurrent work [45].


Attention and gating mechanisms. Attention can be interpreted as a means of biasing the allocation of available computational resources towards the most informative components of a signal [46], [47], [48], [49], [50], [51]. Attention mechanisms have demonstrated their utility across many tasks including sequence learning [52], [53], localisation and understanding in images [9], [54], image captioning [55], [56] and lip reading [57]. In these applications, it can be incorporated as an operator following one or more layers representing higher-level abstractions for adaptation between modalities. Some works provide interesting studies into the combined use of spatial and channel attention [58], [59]. Wang et al. [58] introduced a powerful trunk-and-mask attention mechanism based on hourglass modules [8] that is inserted between the intermediate stages of deep residual networks. By contrast, our proposed SE block comprises a lightweight gating mechanism which focuses on enhancing the representational power of the network by modelling channel-wise relationships in a computationally efficient manner.



A Squeeze-and-Excitation block is a computational unit which can be built upon a transformation $F_{tr}$ mapping an input $X \in \R^{H'×W'×C'}$ to feature maps $U \in \R^{H×W×C}$. In the notation that follows we take $F_{tr}$ to be a convolutional operator and use $V = [v_1, v_2, \ldots, v_C]$ to denote the learned set of filter kernels, where $v_c$ refers to the parameters of the $c$-th filter. We can then write the outputs as $U = [u_1, u_2, \ldots, u_C]$, where

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s.$$


Here $*$ denotes convolution, $v_c = [v_c^1, v_c^2, \ldots, v_c^{C'}]$, $X = [x^1, x^2, \ldots, x^{C'}]$ and $u_c \in \R^{H×W}$. $v_c^s$ is a 2D spatial kernel representing a single channel of $v_c$ that acts on the corresponding channel of $X$. To simplify the notation, bias terms are omitted. Since the output is produced by a summation through all channels, channel dependencies are implicitly embedded in $v_c$, but are entangled with the local spatial correlation captured by the filters. The channel relationships modelled by convolution are inherently implicit and local (except the ones at top-most layers). We expect the learning of convolutional features to be enhanced by explicitly modelling channel interdependencies, so that the network is able to increase its sensitivity to informative features which can be exploited by subsequent transformations. Consequently, we would like to provide it with access to global information and recalibrate filter responses in two steps, squeeze and excitation, before they are fed into the next transformation. A diagram illustrating the structure of an SE block is shown in Fig. 1.

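The channel summation in the equation above can be checked with a naive NumPy convolution. This is an illustrative sketch only; `conv2d_single` is our own helper (a "valid"-mode cross-correlation), and real CNN libraries use optimised kernels instead.

```python
import numpy as np

def conv2d_single(x, k):
    """Naive 'valid' 2D cross-correlation of one input channel x with kernel k."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
C_in, H, W, k = 3, 5, 5, 3
X = rng.standard_normal((C_in, H, W))     # input channels x^1 .. x^{C'}
v_c = rng.standard_normal((C_in, k, k))   # one filter: a 2D kernel per input channel

# u_c = v_c * X = sum_s (v_c^s * x^s): the per-channel responses are summed,
# so channel dependencies end up implicitly entangled in v_c.
u_c = sum(conv2d_single(X[s], v_c[s]) for s in range(C_in))
print(u_c.shape)  # (3, 3)
```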

3.1 Squeeze: Global Information Embedding

In order to tackle the issue of exploiting channel dependencies, we first consider the signal to each channel in the output features. Each of the learned filters operates with a local receptive field and consequently each unit of the transformation output $U$ is unable to exploit contextual information outside of this region.


To mitigate this problem, we propose to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic $z \in \R^C$ is generated by shrinking $U$ through its spatial dimensions $H×W$, such that the $c$-th element of $z$ is calculated by:

$$z_c = F_{sq}(u_c) = \frac{1}{H×W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j).$$

Discussion. The output of the transformation U can be interpreted as a collection of the local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in prior feature engineering work [60], [61], [62]. We opt for the simplest aggregation technique, global average pooling, noting that more sophisticated strategies could be employed here as well.

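As a quick sanity check, the double sum in the statistic above is exactly a spatial mean, i.e. global average pooling over $(H, W)$. A toy NumPy example (sizes our own):

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, C = 4, 5, 3
U = rng.standard_normal((H, W, C))

# z_c = (1 / (H*W)) * sum_i sum_j u_c(i, j), written out explicitly:
z = np.array([sum(U[i, j, c] for i in range(H) for j in range(W)) / (H * W)
              for c in range(C)])

# ... which is just global average pooling over the spatial dimensions.
assert np.allclose(z, U.mean(axis=(0, 1)))
print(z.shape)  # (3,)
```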

3.2 Excitation: Adaptive Recalibration

To make use of the information aggregated in the squeeze operation, we follow it with a second operation which aims to fully capture channel-wise dependencies. To fulfil this objective, the function must meet two criteria: first, it must be flexible (in particular, it must be capable of learning a nonlinear interaction between channels) and second, it must learn a non-mutually-exclusive relationship since we would like to ensure that multiple channels are allowed to be emphasised (rather than enforcing a one-hot activation). To meet these criteria, we opt to employ a simple gating mechanism with a sigmoid activation:

$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2\,\delta(W_1 z)),$$


where $\delta$ refers to the ReLU [63] function, $W_1 \in \R^{\frac{C}{r}×C}$ and $W_2 \in \R^{C×\frac{C}{r}}$. To limit model complexity and aid generalisation, we parameterise the gating mechanism by forming a bottleneck with two fully-connected (FC) layers around the non-linearity, i.e. a dimensionality-reduction layer with reduction ratio $r$ (this parameter choice is discussed in Section 6.1), a ReLU and then a dimensionality-increasing layer returning to the channel dimension of the transformation output $U$. The final output of the block is obtained by rescaling $U$ with the activations $s$:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c\, u_c,$$


where $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_C]$ and $F_{scale}(u_c, s_c)$ refers to channel-wise multiplication between the scalar $s_c$ and the feature map $u_c \in \R^{H×W}$.

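The non-mutually-exclusive requirement above is why a sigmoid is used rather than, say, a softmax: a sigmoid gates each channel independently, so several channels can be emphasised at once, whereas a softmax forces channels to compete. A toy comparison (illustrative only; the pre-activation values are our own):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Four channels; the first two have equally strong pre-activations.
a = np.array([4.0, 4.0, -4.0, -4.0])

s_sig = sigmoid(a)   # each channel gated independently
s_soft = softmax(a)  # channels compete for a fixed probability budget

# With the sigmoid, both strong channels keep gates near 1; under the
# softmax the same two channels are forced to split the mass (~0.5 each).
```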

Discussion. The excitation operator maps the input-specific descriptor z to a set of channel weights. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, which can be regarded as a self-attention function on channels whose relationships are not confined to the local receptive field the convolutional filters are responsive to.


Fig. 2. The schema of the original Inception module (left) and the SE-Inception module (right).

Fig. 3. The schema of the original Residual module (left) and the SE-ResNet module (right).


3.3 Instantiations

The SE block can be integrated into standard architectures such as VGGNet [11] by insertion after the non-linearity following each convolution. Moreover, the flexibility of the SE block means that it can be directly applied to transformations beyond standard convolutions. To illustrate this point, we develop SENets by incorporating SE blocks into several examples of more complex architectures, described next.


We first consider the construction of SE blocks for Inception networks [5]. Here, we simply take the transformation $F_{tr}$ to be an entire Inception module (see Fig. 2) and by making this change for each such module in the architecture, we obtain an SE-Inception network. SE blocks can also be used directly with residual networks (Fig. 3 depicts the schema of an SE-ResNet module). Here, the SE block transformation $F_{tr}$ is taken to be the non-identity branch of a residual module. Squeeze and Excitation both act before summation with the identity branch. Further variants that integrate SE blocks with ResNeXt [19], Inception-ResNet [21], MobileNet [64] and ShuffleNet [65] can be constructed by following similar schemes. For concrete examples of SENet architectures, a detailed description of SE-ResNet-50 and SE-ResNeXt-50 is given in Table 1.

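The SE-ResNet arrangement just described, recalibrating the non-identity branch before summation with the identity shortcut, can be sketched as follows. This is a hypothetical NumPy sketch: `residual_fn` stands in for the block's convolutional stack, and the weights are toy random values.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def se_resnet_block(x, residual_fn, W1, W2):
    """SE-ResNet pattern (Fig. 3): recalibrate the residual branch,
    then add the identity shortcut. Weight names W1/W2 are our own."""
    U = residual_fn(x)                                # residual branch, (H, W, C)
    z = U.mean(axis=(0, 1))                           # squeeze
    s = 1.0 / (1.0 + np.exp(-(W2 @ relu(W1 @ z))))    # excitation
    return relu(x + U * s)                            # scale, then sum with identity

rng = np.random.default_rng(3)
H, W, C, r = 4, 4, 8, 2
x = rng.standard_normal((H, W, C))
W1 = rng.standard_normal((C // r, C))
W2 = rng.standard_normal((C, C // r))
# An identity residual_fn keeps the sketch self-contained:
y = se_resnet_block(x, lambda t: t, W1, W2)
print(y.shape)  # (4, 4, 8)
```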

One consequence of the flexible nature of the SE block is that there are several viable ways in which it could be integrated into these architectures. Therefore, to assess sensitivity to the integration strategy used to incorporate SE blocks into a network architecture, we also provide ablation experiments exploring different designs for block inclusion in Section 6.5.



For the proposed SE block design to be of practical use, it must offer a good trade-off between improved performance and increased model complexity. To illustrate the computational burden associated with the module, we consider a comparison between ResNet-50 and SE-ResNet-50. Each SE block makes use of a global average pooling operation in the squeeze phase and two small FC layers in the excitation phase, followed by an inexpensive channel-wise scaling operation. In the aggregate, when setting the reduction ratio r (introduced in Section 3.2) to 16, SE-ResNet-50 requires ∼3.87 GFLOPs, corresponding to a 0.26% relative increase over the original ResNet-50. In exchange for this slight additional computational burden, the accuracy of SE-ResNet-50 surpasses that of ResNet-50 and indeed, approaches that of a deeper ResNet-101 network requiring ∼7.58 GFLOPs (Table 2).


In practical terms, a single pass forwards and backwards through ResNet-50 takes 190 ms, compared to 209 ms for SE-ResNet-50 with a training minibatch of 256 images (both timings are performed on a server with 8 NVIDIA Titan X GPUs). We suggest that this represents a reasonable runtime overhead, which may be further reduced as global pooling and small inner-product operations receive further optimisation in popular GPU libraries. Due to its importance for embedded device applications, we further benchmark CPU inference time for each model: for a 224 × 224 pixel input image, ResNet-50 takes 164 ms in comparison to 167 ms for SE-ResNet-50. We believe that the small additional computational cost incurred by the SE block is justified by its contribution to model performance.


We next consider the additional parameters introduced by the proposed SE block. These additional parameters result solely from the two FC layers of the gating mechanism and therefore constitute a small fraction of the total network capacity. Concretely, the total number introduced by the weight parameters of these FC layers is given by:

$$\frac{2}{r} \sum_{s=1}^{S} N_s \cdot C_s^2,$$

where $r$ denotes the reduction ratio, $S$ refers to the number of stages (a stage refers to the collection of blocks operating on feature maps of a common spatial dimension), $C_s$ denotes the dimension of the output channels and $N_s$ denotes the number of repeated blocks for stage $s$ (when bias terms are used in FC layers, the introduced parameters and computational cost are typically negligible). SE-ResNet-50 introduces ∼2.5 million additional parameters beyond the ∼25 million parameters required by ResNet-50, corresponding to a ∼10% increase. In practice, the majority of these parameters come from the final stage of the network, where the excitation operation is performed across the greatest number of channels. However, we found that this comparatively costly final stage of SE blocks could be removed at only a small cost in performance (<0.1% top-5 error on ImageNet), reducing the relative parameter increase to ∼4%, which may prove useful in cases where parameter usage is a key consideration (see Sections 6.4 and 7.2 for further discussion).

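Plugging the standard ResNet-50 stage configuration (block counts 3/4/6/3 with output widths 256/512/1024/2048, which we assume here) into the formula above reproduces the quoted ∼2.5 million figure:

```python
# extra_params = (2 / r) * sum_s N_s * C_s^2   (bias terms ignored, as in the text)
def se_extra_params(r, stages):
    """stages: list of (N_s, C_s) = (blocks per stage, output channel width)."""
    return int(2 / r * sum(n * c * c for n, c in stages))

# Assumed ResNet-50 stage configuration, with the default r = 16:
resnet50_stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]
extra = se_extra_params(r=16, stages=resnet50_stages)
print(extra)  # 2514944, i.e. the ~2.5 million additional parameters quoted
```

Most of the total comes from the last stage's $3 \times 2048^2$ term, which is why dropping the final-stage SE blocks cuts the overhead so sharply.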

(Left) ResNet-50 [13]. (Middle) SE-ResNet-50. (Right) SE-ResNeXt-50 with a 32×4d template. The shapes and operations with specific parameter settings of a residual building block are listed inside the brackets, and the number of stacked blocks in a stage is presented outside. The inner brackets following fc indicate the output dimension of the two fully-connected layers of an SE module.

Single-crop error rates (%) on the ImageNet validation set and complexity comparisons. The original column refers to the results reported in the original papers (the results of ResNets are obtained from the website: https://github.com/Kaiminghe/deep-residual-networks). To enable a fair comparison, we re-train the baseline models and report the scores in the re-implementation column. The SENet column refers to the corresponding architectures in which SE blocks have been added. The numbers in brackets denote the performance improvement over the re-implemented baselines. † indicates that the model has been evaluated on the non-blacklisted subset of the validation set (this is discussed in more detail in [21]), which may slightly improve results. VGG-16 and SE-VGG-16 are trained with Batch Normalization.

Single-crop error rates (%) on the ImageNet validation set and complexity comparisons. MobileNet refers to “1.0 MobileNet-224” in [64] and ShuffleNet refers to “ShuffleNet 1 × (g = 3)” in [65]. The numbers in brackets denote the performance improvement over the re-implementation.



In this section, we conduct experiments to investigate the effectiveness of SE blocks across a range of tasks, datasets and model architectures.


5.1 Image Classification

To evaluate the influence of SE blocks, we first perform experiments on the ImageNet 2012 dataset [10] which comprises 1.28 million training images and 50K validation images from 1000 different classes. We train networks on the training set and report the top-1 and top-5 error on the validation set.


Each baseline network architecture and its corresponding SE counterpart are trained with identical optimisation schemes. We follow standard practices and perform data augmentation with random cropping using scale and aspect ratio [5] to a size of 224 × 224 pixels (or 299 × 299 for Inception-ResNet-v2 [21] and SE-Inception-ResNet-v2) and perform random horizontal flipping. Each input image is normalised through mean RGB-channel subtraction. All models are trained on our distributed learning system ROCS which is designed to handle efficient parallel training of large networks. Optimisation is performed using synchronous SGD with momentum 0.9 and a minibatch size of 1024. The initial learning rate is set to 0.6 and decreased by a factor of 10 every 30 epochs. Models are trained for 100 epochs from scratch, using the weight initialisation strategy described in [66]. The reduction ratio r (in Section 3.2) is set to 16 by default (except where stated otherwise).

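The step schedule described above can be written out directly. This is a sketch of the stated hyperparameters, not the authors' training code:

```python
def learning_rate(epoch, base_lr=0.6, drop_every=30, factor=10):
    """Step schedule from the text: start at 0.6, divide by 10 every 30 epochs."""
    return base_lr / factor ** (epoch // drop_every)

# Over the 100-epoch run this gives 0.6, then 0.06, 0.006, and finally 0.0006.
schedule = [learning_rate(e) for e in (0, 29, 30, 60, 99)]
```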

When evaluating the models we apply centre-cropping so that 224 × 224 pixels are cropped from each image, after its shorter edge is first resized to 256 (299 × 299 from each image whose shorter edge is first resized to 352 for Inception-ResNet-v2 and SE-Inception-ResNet-v2).


Network depth. We begin by comparing SE-ResNet against ResNet architectures with different depths and report the results in Table 2. We observe that SE blocks consistently improve performance across different depths with an extremely small increase in computational complexity. Remarkably, SE-ResNet-50 achieves a single-crop top-5 validation error of 6.62%, exceeding ResNet-50 (7.48%) by 0.86% and approaching the performance achieved by the much deeper ResNet-101 network (6.52% top-5 error) with only half of the total computational burden (3.87 GFLOPs vs. 7.58 GFLOPs). This pattern is repeated at greater depth, where SE-ResNet-101 (6.07% top-5 error) not only matches, but outperforms the deeper ResNet-152 network (6.34% top-5 error) by 0.27%. While it should be noted that the SE blocks themselves add depth, they do so in an extremely computationally efficient manner and yield good returns even at the point at which extending the depth of the base architecture achieves diminishing returns. Moreover, we see that the gains are consistent across a range of different network depths, suggesting that the improvements induced by SE blocks may be complementary to those obtained by simply increasing the depth of the base architecture.


Integration with modern architectures. We next study the effect of integrating SE blocks with two further state-of-the-art architectures, Inception-ResNet-v2 [21] and ResNeXt (using the setting of 32 × 4d) [19], both of which introduce additional computational building blocks into the base network. We construct SENet equivalents of these networks, SE-Inception-ResNet-v2 and SE-ResNeXt (the configuration of SE-ResNeXt-50 is given in Table 1) and report results in Table 2. As with the previous experiments, we observe significant performance improvements induced by the introduction of SE blocks into both architectures. In particular, SE-ResNeXt-50 has a top-5 error of 5.49% which is superior to both its direct counterpart ResNeXt-50 (5.90% top-5 error) as well as the deeper ResNeXt-101 (5.57% top-5 error), a model which has almost twice the total number of parameters and computational overhead. We note a slight difference in performance between our re-implementation of Inception-ResNet-v2 and the result reported in [21]. However, we observe a similar trend with regard to the effect of SE blocks, finding that SE counterpart (4.79% top-5 error) outperforms our reimplemented Inception-ResNet-v2 baseline (5.21% top-5 error) by 0.42% as well as the reported result in [21].


We also assess the effect of SE blocks when operating on non-residual networks by conducting experiments with the VGG-16 [11] and BN-Inception architecture [6]. To facilitate the training of VGG-16 from scratch, we add Batch Normalization layers after each convolution. We use identical training schemes for both VGG-16 and SE-VGG-16. The results of the comparison are shown in Table 2. Similarly to the results reported for the residual baseline architectures, we observe that SE blocks bring improvements in performance on the non-residual settings.


To provide some insight into the influence of SE blocks on the optimisation of these models, example training curves for runs of the baseline architectures and their respective SE counterparts are depicted in Fig. 4. We observe that SE blocks yield a steady improvement throughout the optimisation procedure. Moreover, this trend is fairly consistent across a range of network architectures considered as baselines.


Mobile setting. Finally, we consider two representative architectures from the class of mobile-optimised networks, MobileNet [64] and ShuffleNet [65]. For these experiments, we used a minibatch size of 256 and slightly less aggressive data augmentation and regularisation than in [65]. We trained the models across 8 GPUs using SGD with momentum (set to 0.9) and an initial learning rate of 0.1, which was reduced by a factor of 10 each time the validation loss plateaued. The total training process required ∼400 epochs (enabling us to reproduce the baseline performance of [65]). The results reported in Table 3 show that SE blocks consistently improve the accuracy by a large margin at a minimal increase in computational cost.
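The training schedule above (drop the learning rate by a factor of 10 whenever the validation loss plateaus) can be sketched as follows; `patience` and `eps` are our own illustrative choices, as the text specifies only the drop factor:

```python
def plateau_schedule(val_losses, lr=0.1, factor=0.1, patience=5, eps=1e-3):
    """Reduce the learning rate by `factor` whenever the validation loss
    has not improved by at least `eps` for `patience` consecutive epochs.
    Returns the learning rate used at each epoch."""
    best, wait, lrs = float("inf"), 0, []
    for loss in val_losses:
        if loss < best - eps:
            best, wait = loss, 0          # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:          # plateau detected: drop the rate
                lr *= factor
                wait = 0
        lrs.append(lr)
    return lrs
```

A flat loss curve triggers a drop after `patience` stagnant epochs, mirroring the factor-of-10 reductions described above.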


Additional datasets. We next investigate whether the benefits of SE blocks generalise to datasets beyond ImageNet. We perform experiments with several popular baseline architectures and techniques (ResNet-110 [14], ResNet-164 [14], WideResNet-16-8 [67], Shake-Shake [68] and Cutout [69]) on the CIFAR-10 and CIFAR-100 datasets [70]. These comprise a collection of 50k training and 10k test 32 × 32 pixel RGB images, labelled with 10 and 100 classes respectively. The integration of SE blocks into these networks follows the same approach that was described in Section 3.3. Each baseline and its SENet counterpart are trained with standard data augmentation strategies [24], [71]. During training, images are randomly horizontally flipped and zero-padded on each side with four pixels before taking a random 32 × 32 crop. Mean and standard deviation normalisation is also applied. The settings of the training hyperparameters (e.g. minibatch size, initial learning rate, weight decay) match those suggested by the original papers. We report the performance of each baseline and its SENet counterpart on CIFAR-10 in Table 4 and performance on CIFAR-100 in Table 5. We observe that in every comparison SENets outperform the baseline architectures, suggesting that the benefits of SE blocks are not confined to the ImageNet dataset.


Fig. 4. Training baseline architectures and their SENet counterparts on ImageNet. SENets exhibit improved optimisation characteristics and produce consistent gains in performance which are sustained throughout the training process.



5.2 Scene Classification

We also conduct experiments on the Places365-Challenge dataset [73] for scene classification. This dataset comprises 8 million training images and 36,500 validation images across 365 categories. Relative to classification, the task of scene understanding offers an alternative assessment of a model’s ability to generalise well and handle abstraction. This is because it often requires the model to handle more complex data associations and to be robust to a greater level of appearance variation.


We opted to use ResNet-152 as a strong baseline to assess the effectiveness of SE blocks and follow the training and evaluation protocols described in [72], [74]. In these experiments, models are trained from scratch. We report the results in Table 6, comparing also with prior work. We observe that SE-ResNet-152 (11.01% top-5 error) achieves a lower validation error than ResNet-152 (11.61% top-5 error), providing evidence that SE blocks can also yield improvements for scene classification. This SENet surpasses the previous state-of-the-art model Places-365-CNN [72] which has a top-5 error of 11.48% on this task.


5.3 Object Detection on COCO

We further assess the generalisation of SE blocks on the task of object detection using the COCO dataset [75]. As in previous work [19], we use the minival protocol, i.e., training the models on the union of the 80k training set and a 35k val subset and evaluating on the remaining 5k val subset. Weights are initialised by the parameters of the model trained on the ImageNet dataset. We use the Faster R-CNN [4] detection framework as the basis for evaluating our models and follow the hyperparameter setting described in [76] (i.e., end-to-end training with the '2x' learning schedule). Our goal is to evaluate the effect of replacing the trunk architecture (ResNet) in the object detector with SE-ResNet, so that any changes in performance can be attributed to better representations. Table 7 reports the validation set performance of the object detector using ResNet-50, ResNet-101 and their SE counterparts as trunk architectures. SE-ResNet-50 outperforms ResNet-50 by 2.4% (a relative 6.3% improvement) on COCO's standard AP metric and by 3.1% on AP@IoU=0.5. SE blocks also benefit the deeper ResNet-101 architecture, achieving a 2.0% improvement (5.0% relative improvement) on the AP metric. In summary, this set of experiments demonstrates the generalisability of SE blocks. The induced improvements can be realised across a broad range of architectures, tasks and datasets.


5.4 ILSVRC 2017 Classification Competition

SENets formed the foundation of our submission to the ILSVRC competition where we achieved first place. Our winning entry comprised a small ensemble of SENets that employed a standard multi-scale and multi-crop fusion strategy to obtain a top-5 error of 2.251% on the test set. As part of this submission, we constructed an additional model, SENet-154, by integrating SE blocks with a modified ResNeXt [19] (the details of the architecture are provided in Appendix). We compare this model with prior work on the ImageNet validation set in Table 8 using standard crop sizes (224×224 and 320×320). We observe that SENet-154 achieves a top-1 error of 18.68% and a top-5 error of 4.47% using a 224 × 224 centre crop evaluation, which represents the strongest reported result.


Following the challenge, there has been a great deal of further progress on the ImageNet benchmark. For comparison, we include the strongest results that we are currently aware of in Table 9. The best performance using only ImageNet data was recently reported by [79]. This method uses reinforcement learning to develop new policies for data augmentation during training to improve the performance of the architecture searched by [31]. The best overall performance was reported by [80] using a ResNeXt-101 32×48d architecture. This was achieved by pretraining their model on approximately one billion weakly labelled images and finetuning on ImageNet. The improvements yielded by more sophisticated data augmentation [79] and extensive pretraining [80] may be complementary to our proposed changes to the network architecture.




6 Ablation Study

In this section we conduct ablation experiments to gain a better understanding of the effect of using different configurations on components of the SE blocks. All ablation experiments are performed on the ImageNet dataset on a single machine (with 8 GPUs). ResNet-50 is used as the backbone architecture. We found empirically that on ResNet architectures, removing the biases of the FC layers in the excitation operation facilitates the modelling of channel dependencies, and use this configuration in the following experiments. The data augmentation strategy follows the approach described in Section 5.1. To allow us to study the upper limit of performance for each variant, the learning rate is initialised to 0.1 and training continues until the validation loss plateaus (∼300 epochs in total). The learning rate is then reduced by a factor of 10 and then this process is repeated (three times in total). Label-smoothing regularisation [20] is used during training.


6.1 Reduction ratio

The reduction ratio r introduced in Eqn. 5 is a hyperparameter which allows us to vary the capacity and computational cost of the SE blocks in the network. To investigate the trade-off between performance and computational cost mediated by this hyperparameter, we conduct experiments with SE-ResNet-50 for a range of different r values. The comparison in Table 10 shows that performance is robust to a range of reduction ratios. Increased complexity does not improve performance monotonically, while a smaller ratio dramatically increases the parameter size of the model. Setting r = 16 achieves a good balance between accuracy and complexity. In practice, using an identical ratio throughout a network may not be optimal (due to the distinct roles performed by different layers), so further improvements may be achievable by tuning the ratios to meet the needs of a given base architecture.
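As a concrete sketch of how r mediates this trade-off, the block below implements the squeeze, the bottlenecked excitation, and the channel-wise rescaling in NumPy; the function and weight names are ours, and the bias-free FC layers follow the configuration stated at the start of this section:

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation applied to a feature map x of shape (C, H, W).

    w1: (C // r, C) weights of the dimensionality-reduction FC layer.
    w2: (C, C // r) weights of the dimensionality-expansion FC layer.
    Biases are omitted, matching the bias-free configuration used here.
    """
    z = x.mean(axis=(1, 2))               # squeeze: global average pooling -> (C,)
    s = np.maximum(w1 @ z, 0.0)           # FC + ReLU bottleneck -> (C // r,)
    g = 1.0 / (1.0 + np.exp(-(w2 @ s)))   # FC + sigmoid gate -> (C,)
    return x * g[:, None, None]           # excitation: channel-wise rescaling

def se_params(C, r):
    """Extra weights added by one SE block: two bias-free FCs, 2 * C^2 / r."""
    return C * (C // r) + (C // r) * C
```

With r = 16, a 256-channel block adds 2 × 256 × 16 = 8192 weights; halving r doubles that count, which is the parameter growth referred to above.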


6.2 Squeeze Operator

We examine the significance of using global average pooling as opposed to global max pooling as our choice of squeeze operator (since this worked well, we did not consider more sophisticated alternatives). The results are reported in Table 11. While both max and average pooling are effective, average pooling achieves slightly better performance, justifying its selection as the basis of the squeeze operation. However, we note that the performance of SE blocks is fairly robust to the choice of specific aggregation operator.
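The two candidate squeeze operators differ only in how each channel's spatial map is aggregated into a single descriptor; a minimal sketch (our naming):

```python
import numpy as np

def squeeze_avg(x):
    # global average pooling: (C, H, W) -> (C,), summarises the whole map
    return x.mean(axis=(1, 2))

def squeeze_max(x):
    # global max pooling: (C, H, W) -> (C,), driven by the single
    # strongest activation in each channel
    return x.max(axis=(1, 2))
```

Average pooling uses every spatial position of a channel, while max pooling responds only to its peak response, which is one plausible reason for the small gap observed between the two.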



6.3 Excitation Operator

We next assess the choice of non-linearity for the excitation mechanism. We consider two further options: ReLU and tanh, and experiment with replacing the sigmoid with these alternative non-linearities. The results are reported in Table 12. We see that exchanging the sigmoid for tanh slightly worsens performance, while using ReLU is dramatically worse and in fact causes the performance of SE-ResNet-50 to drop below that of the ResNet-50 baseline. This suggests that for the SE block to be effective, careful construction of the excitation operator is important.
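The three gating functions compared here can be written down directly; the sketch below (our code) makes the qualitative differences visible: only the sigmoid is bounded in (0, 1), tanh can flip a channel's sign, and ReLU is unbounded and can zero channels out entirely.

```python
import numpy as np

def gate(u, kind="sigmoid"):
    """Candidate excitation non-linearities applied to the second FC output."""
    if kind == "sigmoid":
        return 1.0 / (1.0 + np.exp(-u))  # (0, 1): soft, non-exclusive weights
    if kind == "tanh":
        return np.tanh(u)                # (-1, 1): may negate a channel
    if kind == "relu":
        return np.maximum(u, 0.0)        # [0, inf): unbounded, hard zeroing
    raise ValueError(kind)
```

The hard zeroing and unbounded scaling of the ReLU gate are consistent with the severe degradation reported in Table 12, though the table itself is the evidence, not this sketch.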


6.4 Different stages

We explore the influence of SE blocks at different stages by integrating SE blocks into ResNet-50, one stage at a time. Specifically, we add SE blocks to the intermediate stages: stage 2, stage 3 and stage 4, and report the results in Table 13. We observe that SE blocks bring performance benefits when introduced at each of these stages of the architecture. Moreover, the gains induced by SE blocks at different stages are complementary, in the sense that they can be combined effectively to further bolster network performance.


6.5 Integration strategy

Finally, we perform an ablation study to assess the influence of the location of the SE block when integrating it into existing architectures. In addition to the proposed SE design, we consider three variants: (1) SE-PRE block, in which the SE block is moved before the residual unit; (2) SE-POST block, in which the SE unit is moved after the summation with the identity branch (after ReLU) and (3) SE-Identity block, in which the SE unit is placed on the identity connection in parallel to the residual unit. These variants are illustrated in Figure 5 and the performance of each variant is reported in Table 14. We observe that the SE-PRE, SE-Identity and proposed SE block each perform similarly well, while usage of the SE-POST block leads to a drop in performance. This experiment suggests that the performance improvements produced by SE units are fairly robust to their location, provided that they are applied prior to branch aggregation.
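Ignoring the ReLU that follows the summation, the four placements can be summarised in a few lines (a sketch with our own names; `f` stands for the residual branch and `se` for the channel-gating operator):

```python
import numpy as np

def residual_variants(x, f, se):
    """The four integration strategies compared in this ablation.

    x : input tensor, f : residual branch, se : SE operator (channel gating).
    Returns the output of each placement (post-summation ReLU omitted).
    """
    return {
        "SE":          x + se(f(x)),  # proposed: recalibrate the residual branch
        "SE-PRE":      x + f(se(x)),  # SE before the residual unit
        "SE-POST":     se(x + f(x)),  # SE after the summation
        "SE-Identity": se(x) + f(x),  # SE on the identity connection
    }
```

Only SE-POST gates the aggregated signal x + f(x); the other three gate a single branch before aggregation, which matches the observation that the recalibration should be applied prior to branch aggregation.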


In the experiments above, each SE block was placed outside the structure of a residual unit. We also construct a variant of the design which moves the SE block inside the residual unit, placing it directly after the 3 × 3 convolutional layer. Since the 3 × 3 convolutional layer possesses fewer channels, the number of parameters introduced by the corresponding SE block is also reduced. The comparison in Table 15 shows that the SE 3×3 variant achieves comparable classification accuracy with fewer parameters than the standard SE block. Although it is beyond the scope of this work, we anticipate that further efficiency gains will be achievable by tailoring SE block usage for specific architectures.
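The parameter saving is easy to make concrete. Assuming illustrative bottleneck widths for a ResNet-50 stage-2 unit (256 output channels, 64 channels on the 3 × 3 layer) and r = 16:

```python
def se_extra_params(C, r=16):
    # weights of the two bias-free FC layers in one SE block: C -> C/r -> C
    return 2 * C * (C // r)

# Illustrative widths from a ResNet-50 stage-2 bottleneck:
standard = se_extra_params(256)  # SE on the 256-channel block output
variant = se_extra_params(64)    # SE 3x3 on the 64-channel 3x3 layer
```

Moving the block onto the narrower 3 × 3 layer shrinks its weight count by a factor of (256/64)², a 16× reduction per block under these assumed widths.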



7 Role of SE Blocks

Although the proposed SE block has been shown to improve network performance on multiple visual tasks, we would also like to understand the relative importance of the squeeze operation and how the excitation mechanism operates in practice. A rigorous theoretical analysis of the representations learned by deep neural networks remains challenging; we therefore take an empirical approach to examining the role played by the SE block, with the goal of attaining at least a primitive understanding of its practical function.


7.1 Effect of Squeeze

To assess whether the global embedding produced by the squeeze operation plays an important role in performance, we experiment with a variant of the SE block that adds an equal number of parameters, but does not perform global average pooling. Specifically, we remove the pooling operation and replace the two FC layers with corresponding 1 × 1 convolutions with identical channel dimensions in the excitation operator, namely NoSqueeze, where the excitation output maintains the same spatial dimensions as its input. In contrast to the SE block, these point-wise convolutions can only remap the channels as a function of the output of a local operator. While in practice the later layers of a deep network will typically possess a (theoretical) global receptive field, global embeddings are no longer directly accessible throughout the network in the NoSqueeze variant. The accuracy and computational complexity of both models are compared to a standard ResNet-50 model in Table 16. We observe that the use of global information has a significant influence on the model performance, underlining the importance of the squeeze operation. Moreover, in comparison to the NoSqueeze design, the SE block allows this global information to be used in a computationally parsimonious manner.
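A sketch of the NoSqueeze variant (our NumPy rendering, with illustrative weight shapes): replacing the FC layers with 1 × 1 convolutions means the gate is computed independently at every spatial position, so no global context enters the recalibration.

```python
import numpy as np

def no_squeeze(x, w1, w2):
    """NoSqueeze variant: the two FCs become 1x1 convolutions applied
    per spatial position, so the gate has shape (C, H, W) rather than
    (C,). Each position sees only its own channel vector.

    x : (C, H, W), w1 : (D, C), w2 : (C, D), where D = C // r.
    """
    s = np.maximum(np.einsum("dc,chw->dhw", w1, x), 0.0)        # 1x1 conv + ReLU
    g = 1.0 / (1.0 + np.exp(-np.einsum("cd,dhw->chw", w2, s)))  # 1x1 conv + sigmoid
    return x * g  # gate retains the spatial dimensions of the input
```

Contrast this with the SE block, whose gate is a single length-C vector broadcast over all positions after the global pooling step.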



7.2 Role of Excitation

To provide a clearer picture of the function of the excitation operator in SE blocks, in this section we study example activations from the SE-ResNet-50 model and examine their distribution with respect to different classes and different input images at various depths in the network. In particular, we would like to understand how excitations vary across images of different classes, and across images within a class.


We first consider the distribution of excitations for different classes. Specifically, we sample four classes from the ImageNet dataset that exhibit semantic and appearance diversity, namely goldfish, pug, plane and cliff (example images from these classes are shown in Appendix). We then draw fifty samples for each class from the validation set and compute the average activations for fifty uniformly sampled channels in the last SE block of each stage (immediately prior to downsampling) and plot their distribution in Fig. 6. For reference, we also plot the distribution of the mean activations across all of the 1000 classes.


We make the following three observations about the role of the excitation operation. First, the distribution across different classes is very similar at the earlier layers of the network, e.g. SE_2_3. This suggests that the importance of feature channels is likely to be shared by different classes in the early stages. The second observation is that at greater depth, the value of each channel becomes much more class-specific, as different classes exhibit different preferences for the discriminative value of features, e.g. SE_4_6 and SE_5_1. These observations are consistent with findings in previous work [81], [82], namely that earlier layer features are typically more general (e.g. class agnostic in the context of the classification task) while later layer features exhibit greater levels of specificity [83].


Next, we observe a somewhat different phenomenon in the last stage of the network. SE_5_2 exhibits an interesting tendency towards a saturated state in which most of the activations are close to one. At the point at which all activations take the value one, an SE block reduces to the identity operator. At the end of the network in SE_5_3 (which is immediately followed by global pooling prior to the classifier), a similar pattern emerges over different classes, up to a modest change in scale (which could be tuned by the classifier). This suggests that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network. This finding is consistent with the result of the empirical investigation in Section 4, which demonstrated that the additional parameter count could be significantly reduced by removing the SE blocks for the last stage with only a marginal loss of performance.


Finally, we show the mean and standard deviations of the activations for image instances within the same class for two sample classes (goldfish and plane) in Fig. 7. We observe a trend consistent with the inter-class visualisation, indicating that the dynamic behaviour of SE blocks varies over both classes and instances within a class. Particularly in the later layers of the network where there is considerable diversity of representation within a single class, the network learns to take advantage of feature recalibration to improve its discriminative performance [84]. In summary, SE blocks produce instance-specific responses which nevertheless function to support the increasingly class-specific needs of the model at different layers in the architecture.



In this paper we proposed the SE block, an architectural unit designed to improve the representational power of a network by enabling it to perform dynamic channel-wise feature recalibration. A wide range of experiments show the effectiveness of SENets, which achieve state-of-the-art performance across multiple datasets and tasks. In addition, SE blocks shed some light on the inability of previous architectures to adequately model channel-wise feature dependencies. We hope this insight may prove useful for other tasks requiring strong discriminative features. Finally, the feature importance values produced by SE blocks may be of use for other tasks such as network pruning for model compression.

