[Deep Learning Basics] Batch Normalization

two_star

Article link: https://blog.csdn.net/qq_25024883/article/details/84451069

Batch Normalization

Internal Covariate Shift
- 1. Concept
- 2. Whitening
Batch Normalization
Hyperparameter tuning
Softmax classification

Internal Covariate Shift

1. Concept

The distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities.
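In other words, the distribution of each layer's inputs shifts during training because the parameters of the preceding layers keep changing. This forces lower learning rates and careful parameter initialization, which slows training, and it makes models with saturating nonlinearities especially hard to train. The paper calls this phenomenon internal covariate shift.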

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift.
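That is, internal covariate shift is the change in the distributions of a deep network's internal nodes over the course of training.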

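Note that Internal Covariate Shift != Covariate Shift.
The figure below shows the explanation given in the original paper.
[Figure: explanation from the original paper; image not included]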

2. Whitening

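If the network's inputs are whitened, that is, linearly transformed to have zero mean and unit variance and then decorrelated, training converges faster.
Since each layer observes the inputs produced by the layers below it, it would be advantageous to apply the same whitening to the inputs of every layer. Whitening each layer's inputs is a step toward a fixed input distribution, which would remove the ill effects of internal covariate shift.
The drawbacks of whitening are: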

However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.
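Whitening is computationally expensive, and it is not differentiable everywhere. For these reasons, full whitening is rarely used in deep learning.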
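To make the whitening discussed above concrete, here is a minimal NumPy sketch. It assumes ZCA whitening as the specific variant; the function name `zca_whiten`, the `eps` term, and the synthetic data are illustrative choices, not something the post or the paper prescribes.

```python
import numpy as np

def zca_whiten(x, eps=1e-5):
    """Whiten a batch x of shape (n_samples, n_features) (ZCA variant, an assumption)."""
    mean = x.mean(axis=0)
    centered = x - mean
    # Full d x d feature covariance: building and decomposing it is what makes whitening costly.
    cov = centered.T @ centered / x.shape[0]
    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Inverse square root of the covariance; eps guards against tiny eigenvalues.
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ inv_sqrt

rng = np.random.default_rng(0)
# Correlated 3-feature data: independent noise mixed by a fixed matrix, plus an offset.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.3, 0.2]])
x = rng.normal(size=(5000, 3)) @ mixing + 4.0

white = zca_whiten(x)
white_cov = white.T @ white / white.shape[0]  # close to the identity matrix
```

The covariance matrix and its eigendecomposition would have to be recomputed for every layer as the parameters change, which is one way to see why applying this to each layer is expensive.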

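The approximate whitening (preprocessing) formula is:
$\hat x^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$
Here $E[x^{(k)}]$ is the mean of neuron $x^{(k)}$ over each batch of data, and the denominator is the standard deviation of the activation of neuron $x^{(k)}$ over each batch.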
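The per-feature normalization above can be sketched in a few lines of NumPy. This is a minimal illustration: the function name `normalize_per_feature` and the toy batch are made up, and the small `eps` added under the square root is an assumption for numerical stability that does not appear in the formula.

```python
import numpy as np

def normalize_per_feature(x, eps=1e-5):
    """Normalize each feature (column) of x independently to zero mean, unit variance."""
    mean = x.mean(axis=0)   # E[x^(k)] for each feature k
    var = x.var(axis=0)     # Var[x^(k)] for each feature k
    # eps is an added assumption to avoid division by zero.
    return (x - mean) / np.sqrt(var + eps)

# Toy batch: 3 samples, 2 features on very different scales.
batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
x_hat = normalize_per_feature(batch)
```

Both columns end up on the same scale, but they remain perfectly correlated: unlike full whitening, this approximation does not decorrelate the features.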

Batch Normalization

1. Explanation

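If we only apply the normalization formula above to the output of some layer A and then feed the result into the next layer B, the normalization can alter the features that layer A has learned. BN handles this in two steps.
The first step is a transform-and-reconstruct step that introduces the learnable parameters $\gamma$ and $\beta$.
$y^{(k)} = \gamma^{(k)} \hat x^{(k)} + \beta^{(k)}$
The second step is to compute the mean and variance from the data in each mini-batch.
The forward pass of BN is therefore:
[Figure: BN forward-pass algorithm; image not included]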
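The forward pass described above (mini-batch mean and variance, normalization, then the learnable scale and shift) can be sketched in NumPy as follows. This is a training-mode sketch only; the function name `batchnorm_forward`, the `eps` constant, and the random input are assumptions for illustration, not a reproduction of the figure.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BN forward pass for x of shape (batch_size, n_features)."""
    mu = x.mean(axis=0)                     # mini-batch mean, per feature
    var = x.var(axis=0)                     # mini-batch (biased) variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize; eps is an assumed stabilizer
    y = gamma * x_hat + beta                # scale and shift with learnable parameters
    return y, x_hat, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))

# gamma = 1 and beta = 0 (an assumed starting point) give a plainly normalized output.
y, x_hat, mu, var = batchnorm_forward(x, np.ones(4), np.zeros(4))

# Choosing gamma = sqrt(var + eps) and beta = mu undoes the normalization,
# so the layer can recover the original activations if that is what training prefers.
y_identity, _, _, _ = batchnorm_forward(x, np.sqrt(var + 1e-5), mu)
```

The second call illustrates why $\gamma$ and $\beta$ matter: with suitable values the transform reduces to the identity, so normalization does not have to destroy what the previous layer learned.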