[Casual Blog Post: Paper Reading Notes] How Does Batch Normalization Help Optimization?

Article link: https://blog.csdn.net/PeaceInMind/article/details/80581074

[2018-NIPS-oral] How Does Batch Normalization Help Optimization[paper]

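This paper designs several sets of controlled experiments to analyze why Batch Normalization (BN) works, and closes with some theoretical proofs. Its main claim is that BN does not succeed by addressing internal covariate shift (ICS), as the original BN authors believed, but because BN "reparametrizes the underlying optimization problem to make its landscape significantly more smooth".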

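From a layman's point of view, the paper only observes that adding BN makes the optimization problem "smooth" and the gradients easier to predict. It does not fully prove that smoothness is what causes the better performance; perhaps good performance is what makes the optimization problem smooth. Most rich people buy luxury goods, but someone who buys luxury goods is not necessarily rich.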

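1 The original BN authors motivated BN as follows: because the lower layers of the network keep changing, the input to any given layer keeps changing as training proceeds.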

they describe ICS as the phenomenon wherein the distribution of inputs to a layer in the network changes due to an update of parameters of the previous layers

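To rebut this, the authors ran an experiment with VGG on a small dataset, CIFAR-10, shown in the figure below. The left side of the figure first demonstrates how powerful BN is, especially at large learning rates, where training hardly converges without BN. The point of the figure, however, is the right side: the authors plot the inputs to several layers and find that, even without BN, the way these input distributions change over the course of training is about the same as with BN (this refers to the change in the distribution, not the distribution itself). In other words, even without BN, the input distribution of a layer does not change much as the lower layers change.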

2 The authors design a further experiment to test whether BN is effective because it controls ICS.

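The experiment is an interesting one: since BN is supposed to normalize the inputs, the authors add random noise after the normalization, which breaks the stable input distribution that BN is supposed to provide. (Personally, I find it a bit odd that test accuracy is not reported. If the test accuracy were poor, what would that tell us?)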

We train networks with random noise injected after BatchNorm layers. Specifically, we perturb each activation for each sample in the batch using i.i.d. noise sampled from a non-zero mean and non-unit variance distribution. We emphasize that this noise distribution changes at each time step

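On the right side of the figure above, with noisy BatchNorm the ICS at layer 13 appears to increase noticeably, yet the training accuracy is barely different from that of standard BN.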
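To make the "noisy BatchNorm" idea concrete, here is a minimal NumPy sketch. It assumes purely additive noise, and the ranges used to redraw the noise mean and standard deviation on every call are made up for illustration; the paper's exact noise construction is not reproduced here. The names `batchnorm` and `noisy_batchnorm` are hypothetical helpers, not code from the paper.

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    """Normalize each feature (column) to zero mean, unit variance over the batch."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def noisy_batchnorm(x, rng, eps=1e-5):
    """BatchNorm followed by random noise added to every activation.

    The noise distribution is redrawn on every call (one mean and one
    standard deviation per feature), then i.i.d. noise is sampled from it
    for each sample. The ranges below are illustrative assumptions.
    """
    y = batchnorm(x, eps)
    n_features = x.shape[1]
    noise_mean = rng.uniform(-2.0, 2.0, size=(1, n_features))  # non-zero mean
    noise_std = rng.uniform(0.5, 2.0, size=(1, n_features))    # non-unit std
    noise = rng.normal(noise_mean, noise_std, size=y.shape)
    return y + noise

rng = np.random.default_rng(0)
x = rng.normal(3.0, 5.0, size=(64, 8))   # a batch of 64 samples, 8 features
clean = batchnorm(x)
noisy = noisy_batchnorm(x, rng)
print("per-feature mean after BN:      ", clean.mean(axis=0).round(3))
print("per-feature mean after noisy BN:", noisy.mean(axis=0).round(3))
```

After plain BN every feature has mean close to zero; after the noisy variant the per-feature statistics are shifted, and they shift differently on each call, which is the sense in which the next layer's input distribution is deliberately destabilized.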

3 The authors check once more whether BatchNorm actually reduces ICS.

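Because ICS has never had a precise definition, the authors define one themselves. Roughly: for a given layer, update the weights of the lower layers (those closer to the input) while leaving the upper layers unchanged, then measure the magnitude of the change in the current layer's gradient. (Personally, I have reservations about this definition. I do not quite see why the angle between the gradient directions is not used, since I feel the direction of the gradient matters more than its magnitude.)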

To quantify the extent to which the parameters in a layer would have to “adjust” in reaction to a parameter update in the previous layers, we measure the difference between the gradients of each layer before and after updates to all the previous layers

The authors find that this measure is sometimes even larger with BN.
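The measurement described above can be sketched on a toy model. The code below assumes a two-layer linear network `X @ W1 @ W2` with a mean squared error loss and hand-written gradients; this is only a stand-in for the VGG networks used in the paper. It computes the gradient of the second layer before and after the first layer takes one gradient step, and reports the L2 difference. It also reports the cosine between the two gradients, for readers who share the preference for direction voiced in the note above. All function names are hypothetical.

```python
import numpy as np

def grads(W1, W2, X, Y):
    """Gradients of 0.5 * mean squared error for the linear net X @ W1 @ W2."""
    n = X.shape[0]
    h = X @ W1                      # hidden activations
    r = h @ W2 - Y                  # residuals
    gW2 = h.T @ r / n
    gW1 = X.T @ (r @ W2.T) / n
    return gW1, gW2

def ics_of_second_layer(W1, W2, X, Y, lr):
    """Change in the second layer's gradient caused by updating only the first layer."""
    gW1, g_before = grads(W1, W2, X, Y)
    W1_next = W1 - lr * gW1                 # update the earlier layer only
    _, g_after = grads(W1_next, W2, X, Y)   # W2 is kept fixed
    l2_diff = np.linalg.norm(g_before - g_after)
    cosine = np.sum(g_before * g_after) / (
        np.linalg.norm(g_before) * np.linalg.norm(g_after)
    )
    return l2_diff, cosine

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
Y = rng.normal(size=(32, 2))
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(4, 2))
l2_diff, cosine = ics_of_second_layer(W1, W2, X, Y, lr=0.1)
print(f"L2 difference: {l2_diff:.4f}, cosine: {cosine:.4f}")
```

An L2 difference of 0 (and a cosine of 1) would mean the second layer does not have to "adjust" at all to the update of the first layer; larger differences correspond to more ICS under this definition.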

4 The authors then put forward their own argument.

Indeed, we identify the key impact that BatchNorm has on the training process: it reparametrizes the underlying optimization problem to make its landscape significantly more smooth.

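The analysis mainly looks at how taking a step along the current gradient direction affects the loss and the gradient. With BN the effect is smaller: one can keep moving along the current gradient direction and the direction changes little, so larger steps are possible. In a network without BN the gradient direction changes more, so a large step ends up going the wrong way.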
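The kind of probe described here can be sketched as follows: fix a point, take steps of several sizes along the negative gradient, and record how much the loss varies and how far the gradient moves from its original value. The two quadratic losses below are toy stand-ins for a "smooth" and a "sharp" landscape, chosen only for illustration; they are not networks with and without BN, and `probe_along_gradient` is a hypothetical helper.

```python
import numpy as np

def probe_along_gradient(loss_fn, grad_fn, w, step_sizes):
    """Record the loss and the gradient change at several steps along -gradient."""
    g = grad_fn(w)
    losses, grad_changes = [], []
    for eta in step_sizes:
        w_new = w - eta * g
        losses.append(loss_fn(w_new))
        grad_changes.append(np.linalg.norm(grad_fn(w_new) - g))
    return np.array(losses), np.array(grad_changes)

def quadratic(curvatures):
    """Loss 0.5 * w^T A w with A = diag(curvatures), plus its gradient A w."""
    A = np.diag(curvatures)
    loss_fn = lambda w: 0.5 * w @ A @ w
    grad_fn = lambda w: A @ w
    return loss_fn, grad_fn

w0 = np.array([1.0, 1.0])
steps = np.linspace(0.0, 0.05, 11)

smooth_losses, smooth_changes = probe_along_gradient(*quadratic([1.0, 1.0]), w0, steps)
sharp_losses, sharp_changes = probe_along_gradient(*quadratic([1.0, 100.0]), w0, steps)

print("loss range (smooth):", smooth_losses.max() - smooth_losses.min())
print("loss range (sharp): ", sharp_losses.max() - sharp_losses.min())
print("max gradient change (smooth):", smooth_changes.max())
print("max gradient change (sharp): ", sharp_changes.max())
```

On the sharp quadratic the same range of step sizes produces a much wider spread of loss values and a much larger gradient change, which is the behavior the note attributes to networks without BN: the gradient at the current point stops being a good guide once the step gets large.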