Paper Reading: Identity Mappings in Deep Residual Networks (ResNetV2)

1. Paper Overview

This paper takes a closer look at the residual branch and the identity (shortcut) mapping in ResNet and proposes an improved version, ResNetV2. In my opinion, the original ResNet-50 or ResNet-101 is good enough in most cases; ResNetV2 is mainly an improvement for extremely deep CNNs, say beyond 100 layers and up to around 1000 layers, and only then is it worth switching to ResNetV2.


The paper's work has two main parts: (1) it studies the shortcut mapping h in the residual unit, replacing it with a 1×1 convolution, a gating function, constant scaling, and so on, and finds that doing nothing at all, i.e. h(x) = x, still works best; (2) it studies the position of the activation function f applied after adding h and F, and finds that moving f inside the residual branch F works better for very deep networks, which leads to ResNetV2.
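To make these two design axes concrete, here is a minimal PyTorch-style sketch (my own illustration, not code from the paper; all names are made up) of the generic residual unit the paper analyzes, y_l = h(x_l) + F(x_l, W_l) followed by x_{l+1} = f(y_l). The two studies above correspond to changing `shortcut` (the choice of h) and moving or removing `after_add` (the position of f).

```python
import torch
import torch.nn as nn

class GenericResidualUnit(nn.Module):
    """Generic unit analyzed in the paper:
        y_l     = h(x_l) + F(x_l, W_l)
        x_{l+1} = f(y_l)
    Class and attribute names are illustrative only."""
    def __init__(self, channels):
        super().__init__()
        # F: the residual branch (two 3x3 convolutions with BN/ReLU)
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # h: the shortcut; the paper finds plain identity works best
        self.shortcut = nn.Identity()
        # f: applied after the addition; ReLU in the original ResNet,
        # an identity in ResNetV2 (activations moved inside the branch)
        self.after_add = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.shortcut(x) + self.residual(x)
        return self.after_add(y)
```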


To understand the role of skip connections, we analyze and compare various
types of h(xl). We find that the identity mapping h(xl) = xl chosen in [1]
achieves the fastest error reduction and lowest training loss among all variants
we investigated, whereas skip connections of scaling, gating [5,6,7], and 1×1
convolutions all lead to higher training loss and error. These experiments suggest
that keeping a “clean” information path (indicated by the grey arrows in Fig. 1, 2,
and 4) is helpful for easing optimization.
To construct an identity mapping f(yl) = yl, we view the activation functions (ReLU and BN [8]) as “pre-activation” of the weight layers, in contrast to conventional wisdom of “post-activation”. This point of view leads to a new residual unit design, shown in (Fig. 1(b)). Based on this unit, we present competitive results on CIFAR-10/100 with a 1001-layer ResNet, which is much easier
to train and generalizes better than the original ResNet in [1]. We further report
improved results on ImageNet using a 200-layer ResNet, for which the counterpart of [1] starts to overfit. These results suggest that there is much room to exploit the dimension of network depth, a key to the success of modern deep
learning.

2. What Changes When f Becomes an Identity Mapping

[Figure: forward and backward propagation through residual units when f is an identity mapping]
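The key derivation behind this figure (Eqns. (3)–(5) in the paper) is worth writing out. When the shortcut h is an identity and f is also an identity, each unit computes x_{l+1} = x_l + F(x_l, W_l), so any deeper feature x_L is the sum of a shallower feature x_l plus a series of residual functions, and back-propagation contains a term that flows to x_l directly, without passing through any weight layer:

```latex
% With h(x_l) = x_l and f = identity (paper Eqns. (3)-(5)):
x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l)
\;\;\Rightarrow\;\;
x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)

\frac{\partial \mathcal{E}}{\partial x_l}
  = \frac{\partial \mathcal{E}}{\partial x_L}
    \Bigl(1 + \frac{\partial}{\partial x_l}
      \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)\Bigr)
```

The additive "1" means the gradient of the loss E reaches every shallower unit directly, so it is unlikely to vanish even when the weighted terms are small; this is the formal argument for keeping the shortcut path "clean".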

3. The Importance of the Identity Shortcut

[Figure: variants of the shortcut connection h studied in the paper: identity, constant scaling, gating, 1×1 convolution, dropout]
The figure above shows the various modifications of h, and the table below gives the corresponding experimental results; the original identity shortcut still performs best.

[Table: error rates for the different shortcut types]

As indicated by the grey arrows in Fig. 2, the shortcut connections are the
most direct paths for the information to propagate. Multiplicative manipulations
(scaling, gating, 1×1 convolutions, and dropout) on the shortcuts can hamper
information propagation and lead to optimization problems.
It is noteworthy that the gating and 1×1 convolutional shortcuts introduce
more parameters, and should have stronger representational abilities than identity shortcuts. In fact, the shortcut-only gating and 1×1 convolution cover the
solution space of identity shortcuts (i.e., they could be optimized as identity
shortcuts). However, their training error is higher than that of identity shortcuts, indicating that the degradation of these models is caused by optimization
issues, instead of representational abilities. (In other words, the degradation is not due to a lack of representational power; rather, existing optimization methods struggle to optimize these shortcut variants well.)
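As a concrete reference, here is a small PyTorch-style sketch (my own, with illustrative names) of several of the shortcut variants compared above. Only the treatment of the skip path h differs between variants; the residual branch output F(x) is passed in unchanged.

```python
import torch
import torch.nn as nn

class ShortcutVariant(nn.Module):
    """h(x) variants from the paper's comparison; names are illustrative.
    `residual` is the output of the residual branch F(x)."""
    def __init__(self, channels, variant="identity"):
        super().__init__()
        self.variant = variant
        if variant == "conv1x1":
            # 1x1 convolution on the shortcut
            self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        elif variant == "gating":
            # shortcut-only gating: scale the skip path by 1 - g(x),
            # where g(x) is the sigmoid of a 1x1 convolution
            self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x, residual):
        if self.variant == "identity":   # h(x) = x: best results in the paper
            return x + residual
        if self.variant == "scaling":    # constant scaling, e.g. 0.5
            return 0.5 * x + residual
        if self.variant == "gating":     # shortcut-only gating
            g = torch.sigmoid(self.gate(x))
            return (1.0 - g) * x + residual
        if self.variant == "conv1x1":    # 1x1 convolution shortcut
            return self.proj(x) + residual
        raise ValueError(f"unknown variant: {self.variant}")
```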

4. Effect of the Position of the Activation Function

[Figure: the five activation arrangements compared in the paper: (a) original, (b) BN after addition, (c) ReLU before addition, (d) ReLU-only pre-activation, (e) full pre-activation]

(a) is the arrangement used by the original ResNet; (b) places BN after the addition, which changes the distribution of the signal on the shortcut path and appears to hurt gradient propagation (see the post ResNetV2:ResNet深度解析, reference [2], for details); (c) places ReLU before the addition, so the residual branch can only produce non-negative outputs; (d) performs about the same as the original; (e) is the ResNetV2 unit proposed in this paper, which improves on the original design when the network is extremely deep.

Note: both (d) and (e) are pre-activation designs.
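For completeness, here is a minimal PyTorch-style sketch (my own illustration, not the paper's code) of the full pre-activation unit in (e): BN and ReLU are moved in front of each convolution, so nothing follows the addition and f becomes an identity. Design (d) would instead move only the ReLU in front of the first convolution.

```python
import torch.nn as nn

class PreActResidualUnit(nn.Module):
    """Full pre-activation unit, design (e) / ResNetV2; names illustrative."""
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # nothing after the addition: f is an identity, so stacking these
        # units keeps a clean path from any shallow layer to any deep layer
        return x + self.branch(x)
```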

5. Two Advantages of Pre-activation

We find the impact of pre-activation is twofold. First, the optimization is further
eased (comparing with the baseline ResNet) because f is an identity mapping.
Second, using BN as pre-activation improves regularization of the models.

A more detailed explanation can be found on pages 11 and 12 of the paper.

6. How Training and Testing Scales Are Used

Table 5 shows the results of ResNet-152 [1] and ResNet-200, all trained from
scratch. We notice that the original ResNet paper [1] trained the models using
scale jittering with shorter side s ∈ [256, 480], and so the test of a 224×224 crop
on s = 256 (as did in [1]) is negatively biased. Instead, we test a single 320×320
crop from s = 320, for all original and our ResNets. Even though the ResNets
are trained on smaller crops, they can be easily tested on larger crops because
the ResNets are fully convolutional by design. This size is also close to 299×299
used by Inception v3 [19], allowing a fairer comparison.
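The reason models trained on smaller crops can be tested directly on 320×320 crops is that everything before the classifier is convolutional and the spatial dimensions are collapsed by global average pooling, so the fully-connected layer always sees a fixed-length vector. A tiny self-contained sketch (a toy network of my own, not the paper's architecture) illustrates this:

```python
import torch
import torch.nn as nn

# Toy network: conv layers + global average pooling + linear classifier.
# The output shape does not depend on the input crop size.
net = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, 3, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),   # global average pooling removes spatial dims
    nn.Flatten(),
    nn.Linear(128, 1000),
)

for size in (224, 320):                # training crop vs. larger test crop
    x = torch.randn(1, 3, size, size)
    print(size, net(x).shape)          # both print torch.Size([1, 1000])
```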

References

[1] ResNetV2:Identity Mappings in Deep Residual Networks 论文阅读

[2] ResNetV2:ResNet深度解析
