CV Basic Algorithms 04: GoogLeNet-v2

Latest recommended article published on 2022-05-04 12:44:50

东阳z

Categories: Computer Vision, Artificial Intelligence

Article link: https://blog.csdn.net/qq_22473333/article/details/108112768

Dongyang's study notes. Persistence is victory!

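This is the second paper in the GoogLeNet series. Building on v1, its main contribution is BN (Batch Normalization); it does not propose major changes to the v1 network architecture.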

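Research Significance

Accelerated the development of deep learning
Opened a new era of neural network design: normalization layers have become standard in deep neural networks (a normalization layer is not necessarily a BN layer)
Batch Normalization inspired a family of normalization layers, such as Layer Normalization (LN), Instance Normalization (IN), and Group Normalization (GN)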

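Research Background

ICS (Internal Covariate Shift)

ICS: the distribution of each layer's inputs keeps changing during training, which makes the model hard to train and affects deep neural networks strongly (see the figure below). This is covered in more detail later. (The BN layer speeds up training by mitigating ICS, and it also turned out to improve accuracy.)
[Figure: illustration of ICS]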

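Whitening

Whitening: removes redundant information from the input data so that the features have low correlation with each other and all features have the same variance
References [1] and [2] transform the data to zero mean and unit standard deviation as a form of whitening
From probability theory: N(x) = (x − mean) / std gives x zero mean and unit standard deviation, since mean − mean = 0 and std × (1 / std) = 1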
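The standardization formula above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `standardize`, the small `eps` guard against division by zero, and the toy data are not from the paper. Note that it standardizes each feature but does not decorrelate the features.

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Shift each feature (column) to zero mean and scale it to unit std."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / (std + eps)  # eps is an assumed safeguard, not part of the formula

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # toy data: 1000 samples, 4 features
x_norm = standardize(x)

print(x_norm.mean(axis=0))  # every entry is close to 0
print(x_norm.std(axis=0))   # every entry is close to 1
```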

[1] LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and Muller, K. (eds.), Neural Networks: Tricks of the Trade. Springer, 1998b.
[2] Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657–665, Granada, Spain, December 2011.

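Abstract

Problem: changes in data distribution make training difficult (PS: weights change → outputs change → inputs to the next layer change → training becomes difficult)
Existing approaches: small learning rates and careful weight initialization, but training is slow and the results are poor
ICS: the phenomenon above is called ICS; the paper proposes a normalization layer to mitigate it
BN: adding BN layers to the model performs normalization over each mini-batch, which allows large learning rates, removes the need for careful weight initialization, and reduces the need for Dropout
Results: BN reaches the same accuracy with 14 times fewer training steps and significantly improves classification accuracy, surpassing human performance on the ImageNet classification task (that is, the performance of a human after two weeks of "training")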

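BN (Batch Normalization)

Understanding "batch"

The "batch" in batch normalization refers to:
x[:, 0, :, :], where the four dimensions are (B, C, H, W). The statistics for one channel (feature map) are computed over all samples in the mini-batch and all spatial positions, so there is one "batch" of values per feature dimension.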
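A small NumPy sketch of which values enter one channel's statistics. The tensor shape and variable names are illustrative assumptions; the point is only that the reduction runs over the B, H, and W axes and leaves one mean and one variance per channel.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3, 5, 5))  # assumed toy tensor with layout (B, C, H, W)

# One mean and one variance per channel, pooled over batch and spatial axes.
mean = x.mean(axis=(0, 2, 3))
var = x.var(axis=(0, 2, 3))

# The statistics of channel 0 are exactly those of the slice x[:, 0, :, :].
channel0 = x[:, 0, :, :]
print(mean.shape, var.shape)                 # (3,) (3,)
print(channel0.size)                         # 200, i.e. 8 * 5 * 5 values
print(np.isclose(mean[0], channel0.mean()))  # True
```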

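Normalization: transform the feature map from the previous step to zero mean and unit standard deviation. (Note: during training the mean and variance come from the current mini-batch; the exponential moving average of these statistics is what is used at inference time.)
After normalization, the data is pulled into the non-saturated region of the sigmoid activation function, which mitigates the vanishing gradient problem.
[Figure: sigmoid activation curve]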

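Remaining Problems

Problem 1

After BN (taking sigmoid as an example, see the figure above), the data is concentrated in the near-linear region of sigmoid, which reduces the model's expressive power.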

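Solution:
y = γ · x̂ + β, where x̂ is the normalized activation
The learnable parameters gamma (scale) and beta (shift) add a linear transform that restores the network's expressive power and also makes an identity mapping possible (γ = sqrt(Var[x]) and β = E[x] recover the original activation).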
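The scale-and-shift step can be sketched as a training-mode forward pass in NumPy. This is a minimal sketch, not the paper's implementation: the function name `batch_norm_train`, the (B, C, H, W) layout, and `eps = 1e-5` are assumptions. The second half checks the identity-mapping claim by setting gamma and beta to the batch standard deviation and mean.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BN for a (B, C, H, W) tensor, using mini-batch statistics."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # One gamma and one beta per channel, broadcast over batch and spatial axes.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=4.0, size=(16, 3, 4, 4))

# gamma = 1, beta = 0 gives plain standardization.
y = batch_norm_train(x, np.ones(3), np.zeros(3))

# gamma = sqrt(var + eps), beta = mean undoes the normalization (identity mapping).
eps = 1e-5
gamma_id = np.sqrt(x.var(axis=(0, 2, 3)) + eps)
beta_id = x.mean(axis=(0, 2, 3))
y_id = batch_norm_train(x, gamma_id, beta_id, eps=eps)

print(np.allclose(y_id, x))  # True
```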

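Problem 2

Using mini-batch statistics as a stand-in for the population statistics is inaccurate
Solution: use an exponential moving average (EMA) of the mini-batch statistics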
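A minimal sketch of how running statistics could be tracked with an exponential moving average. The helper name `update_running`, the decay value 0.9, and the toy data are assumptions for illustration; the paper only states that moving averages of the mini-batch statistics are used, and frameworks differ in how they define the momentum parameter.

```python
import numpy as np

def update_running(running, batch_stat, momentum=0.9):
    """Exponential moving average: keep `momentum` of the old value (assumed convention)."""
    return momentum * running + (1.0 - momentum) * batch_stat

rng = np.random.default_rng(0)
running_mean, running_var = 0.0, 1.0
for _ in range(200):
    batch = rng.normal(loc=3.0, scale=2.0, size=32)  # one mini-batch of a scalar feature
    running_mean = update_running(running_mean, batch.mean())
    running_var = update_running(running_var, batch.var(ddof=1))  # unbiased variance estimate

# The running values drift toward the population mean 3.0 and variance 4.0.
print(running_mean, running_var)
```

At inference time these fixed running values would replace the per-batch mean and variance, so the output for a sample no longer depends on which other samples share its batch.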

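Notes on Using BN Layers

The layer immediately before a BN layer does not need a bias: the bias is cancelled by BN's mean subtraction, and BN's shift (beta) takes over its role
In convolutional networks, BN is applied per feature map, i.e. it is a 2D BN operation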
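The bias cancellation can be checked numerically. The sketch below uses a fully connected layer for brevity; the helper `normalize`, the shapes, and the `eps` value are illustrative assumptions. Adding a constant bias shifts the batch mean by the same constant and leaves the variance unchanged, so the normalized outputs match.

```python
import numpy as np

def normalize(z, eps=1e-5):
    """Standardize each feature (column) over the batch axis."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(32, 10))  # mini-batch of inputs
W = rng.normal(size=(10, 4))   # weights of the layer before BN
b = rng.normal(size=4)         # bias of the layer before BN

with_bias = normalize(u @ W + b)
without_bias = normalize(u @ W)

print(np.allclose(with_bias, without_bias))  # True
```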

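Network Architecture

Apart from the added BN layers, the v2 architecture is not fundamentally different from v1. The differences are:

BN is inserted before each activation function
Each 5×5 convolution is replaced by two 3×3 convolutions
The first group of Inception modules gains one additional Inception module
The number of filters in the "5×5" branch is increased
Downsampling is done by stride-2 convolutions (pooling layers are no longer used between Inception modules; open question: why not use pooling?)
The depth grows by 9 layers (10 − 1) to 31 layers
(10 is the number of Inception modules)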
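To make the 5×5 versus two 3×3 trade-off concrete, here is a small arithmetic sketch. The helper names and the channel count of 64 are hypothetical, and the comparison assumes equal input and output channels. Since the v2 network also widens this branch, the result should not be read as a statement about the total parameter count of the network.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out

def stacked_receptive_field(kernels):
    """Receptive field of stride-1 convolutions applied one after another."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf

c = 64  # hypothetical channel count, the same for input and output
single_5x5 = conv_weights(5, c, c)
double_3x3 = 2 * conv_weights(3, c, c)

print(stacked_receptive_field([3, 3]))  # 5, the same receptive field as one 5x5
print(single_5x5, double_3x3)           # 102400 73728
print(double_3x3 / single_5x5)          # 0.72
```

At equal width, the stacked version covers the same 5×5 window with 18 instead of 25 weights per channel pair and adds one extra nonlinearity between the two layers.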

Result Analysis

MNIST Experiment

With BN layers, convergence is faster
With BN, the activation values are more stable, which mitigates ICS

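ILSVRC Classification Experiment 1: Speed Comparison

Settings: initial learning rate = 0.0015
x5 means learning rate = 0.0015 × 5 = 0.0075

BN is faster: BN-Baseline is about twice as fast as Inception
Large learning rates become usable: BN-x5 is 14 times faster than Inception
BN gives higher accuracy: BN-x30 is slower than BN-x5 but reaches higher accuracy
With sigmoid, BN gives higher accuracy: BN-x5-Sigmoid has the lowest accuracy among the BN variants, yet it is far above Inception-Sigmoid (by comparison, the choice between sigmoid and ReLU matters much less)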

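Model Ensemble (multi-model fusion?)

Surpassing human performance!
The final version is an ensemble of six BN-x30 networks. The six networks differ from the plain BN-x30 in the following ways:

Larger initial weights, i.e. a larger standard deviation of the initialization distribution
Dropout set to 5% or 10%, versus 40% in GoogLeNet-v1 (this achieved a lower top-5 error)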

Key Points & Innovations

The BN layer is proposed to mitigate the training difficulty caused by ICS. With it:

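A larger learning rate can be used, which accelerates convergence
Careful weight initialization is no longer required
Dropout can be removed or reduced
L2 regularization can be removed, or a smaller weight decay can be used
LRN (local response normalization) can be removed
Separately, following VGG, every 5×5 convolution is replaced by a stack of two 3×3 convolutions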

Reflections and Outlook

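The paper combines the strengths of two different models in a single sentence: the 14x speedup comes from BN-x5, while the substantial accuracy gain comes from BN-x30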
we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin. （1 Introduction p6）
Data with zero mean and unit standard deviation speeds up network training
It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. (2 Towards Reducing Internal Covariate Shift p1)
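Even without decorrelation, a zero-mean, unit-variance distribution speeds up network training (open question: what does the decorrelation step involve?)
As shown in (LeCun et al., 1998b), such normalization speeds up convergence, even when the features are not decorrelated. (3 Normalization via Mini-Batch Statistics p1)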
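At inference time BN is just a linear transform (scale plus shift), so the BN layer can be fused into the preceding convolutional layer (open question: how is the fusion done?)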
Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation. (3.1 p1)
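The effect of the bias is cancelled, so no bias is needed, and beta in the linear transform can play the role of the bias (the layer before a BN layer can omit its bias)
Note that, since we normalize Wu + b, the bias b can be ignored since its effect will be canceled by the subsequent mean subtraction (3.2 p2)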
In the BN of a convolutional layer, the statistics cover not only the batch dimension but also the spatial dimensions: the mean and variance are computed per feature map
we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m′ = |B| = m · pq. We learn a pair of parameters γ(k) and β(k) per feature map, rather than per activation. (3.2 p2)
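The output for one sample depends on the other samples in its mini-batch (through the shared mini-batch statistics), which can be seen as a form of regularization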
a training example is seen in conjunction with other examples in the mini-batch, and the training network no longer producing deterministic values for a given training example. （3.4 p1）
Stacks of two 3×3 convolutions replace every 5×5 convolution, with more filters
The main difference to the network described in (Szegedy et al., 2014) is that the 5 × 5 convolutional layers are replaced by two consecutive layers of 3 × 3 convolutions with up to 128 filters. (4.2 p1)
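On the fusion question raised in the notes above: because the mean and variance are fixed at inference time, BN can be folded into the weights and bias of the preceding layer. The sketch below shows one common way to do this for a fully connected layer y = x @ W + b. The function name `fold_bn`, the shapes, and `eps` are assumptions, and this is not taken from the paper.

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-mode BN into the preceding layer y = x @ W + b.

    W has shape (in_features, out_features); BN parameters have shape (out_features,).
    """
    scale = gamma / np.sqrt(var + eps)
    return W * scale, (b - mean) * scale + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W, b = rng.normal(size=(8, 3)), rng.normal(size=3)
gamma, beta = rng.normal(size=3), rng.normal(size=3)
mean = rng.normal(size=3)                # fixed running mean (toy values)
var = rng.uniform(0.5, 2.0, size=3)      # fixed running variance (toy values)

# Reference: the layer followed by inference-mode BN.
eps = 1e-5
reference = gamma * ((x @ W + b) - mean) / np.sqrt(var + eps) + beta

# Fused: a single layer with rescaled weights and an adjusted bias.
W_f, b_f = fold_bn(W, b, gamma, beta, mean, var, eps)
fused = x @ W_f + b_f

print(np.allclose(reference, fused))  # True
```

For a convolution the same idea applies per output channel: each output channel's kernel is multiplied by that channel's scale, and the bias is adjusted in the same way.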

Seven Modifications for Accelerating BN Networks

Increase learning rate: enabled by BN
Remove Dropout: BN can act as a regularizer
Reduce the L2 weight regularization by a factor of 5.
Because BN tolerates larger weights, the constraint on weight magnitude can be relaxed
Accelerate the learning rate decay
Remove Local Response Normalization (avoids an unnecessary operation)
Shuffle training examples more thoroughly
Reduce the photometric distortions