A Detailed Walkthrough of the Original Batch Normalization Paper

This post has two parts:
one is the high-level description of BN (short for Batch Normalization) given in [3],
the other is the full description in the original paper [1].

#################### First, the book [3] ############################################
Batch Normalization was first proposed in [1]; the original paper gives no concrete diagram.
Let us first summarize the gist of Batch Normalization as presented in [3]:
(Figure: Algorithm 1, as reproduced in [3])
The figure above is taken from [3].
[1] and [2] discuss whether Batch Normalization should be inserted before or after the activation function.
[2] contains a passage complaining that [1] never makes the insertion point of BN clear; it is quoted below:

5.2.1 WHERE TO PUT BN – BEFORE OR AFTER NON-LINEARITY?
It is not clear from the paper Ioffe & Szegedy (2015) where to put the batch-normalization layer – before the input of each layer, as stated in Section 3.1, or before the non-linearity, as stated in Section 3.2 – so we have conducted an experiment with FitNet4 on CIFAR-10 to clarify this. Results are shown in Table 5.
Exact numbers vary from run to run, but in the most cases, batch normalization put after non-linearity performs better.
In the next experiment we compare BN-FitNet4, initialized with Xavier and LSUV-initialized FitNet4. Batch-normalization reduces training time in terms of needed number of iterations, but each iteration becomes slower because of extra computations. The accuracy versus wall-clock-time graphs are shown in Figure 3.

Note:
the code accompanying [3] does not implement the Batch Normalization transform for convolutional layers described in [1].

The first of the two Batch Normalization algorithms (this is Algorithm 1 proposed in [1]):
(Figure: Algorithm 1 from [1], the Batch Normalizing Transform)

In plain language, what does this mean?
The Batch Norm layer in the figure above takes the $m$ examples of a mini-batch as input and outputs $y_i$;
the mapping from the mini-batch to $y_i$ is exactly what Algorithm 1 implements (a minimal code sketch is given just below).
Note that [3] does not discuss backpropagation through BN, its code contains no such implementation, and it does not implement the BN transform for convolutional layers either.
For a complete implementation of Batch Normalization you still need to read the TensorFlow source code.
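To make Algorithm 1 concrete, here is a minimal NumPy sketch of the forward transform for a fully connected layer (the function name, shapes and `cache` layout are my own illustration, not code from [1] or [3]):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Algorithm 1 (the Batch Normalizing Transform) on a 2-D input.

    x     : (m, d) mini-batch of m examples with d features
    gamma : (d,) learned scale
    beta  : (d,) learned shift
    """
    mu = x.mean(axis=0)                    # mini-batch mean, per feature
    var = x.var(axis=0)                    # mini-batch (biased) variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    y = gamma * x_hat + beta               # scale and shift
    cache = (x_hat, mu, var, gamma, eps)   # intermediates for the backward pass
    return y, cache

# Example: y, cache = batchnorm_forward(np.random.randn(32, 100),
#                                       np.ones(100), np.zeros(100))
```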

####################### Now the structure of the original paper [1] #############################
Paper structure:
$$Batch\ Normalization=\left\{ \begin{aligned} &\text{Abstract} \\ &\text{1. Introduction} \\ &\text{2. Towards Reducing Internal Covariate Shift}\\ &\text{3. Normalization via Mini-Batch Statistics}\\ &\text{4. Experiments} \end{aligned} \right.$$
∴ the heart of the paper is Sections 2 and 3.

######### First, what does the Introduction say ####################
Deep learning has dramatically advanced the state of the
art in vision, speech, and many other areas. Stochastic gradient descent (SGD) has proved to be an effec-
tive way of training deep networks, and SGD variants such as momentum (Sutskever et al., 2013) and Adagrad
(Duchi et al., 2011) have been used to achieve state of the art performance. SGD optimizes the parameters Θ of the
network, so as to minimize the loss
$$\Theta=\arg \min _{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell\left(\mathrm{x}_{i}, \Theta\right)$$
where $\mathrm{x}_{1 \ldots N}$ is the training data set. With SGD, the training proceeds in steps, and at each step we consider a mini-batch $\mathrm{x}_{1 \ldots m}$ of size $m$. The mini-batch is used to approximate the gradient of the loss function with respect to the
parameters, by computing
$$\frac{1}{m} \frac{\partial \ell\left(\mathrm{x}_{i}, \Theta\right)}{\partial \Theta}$$
(Here $\Theta$ denotes the weights of the network. Most of the passage above is standard boilerplate.)

Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient
of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch
size increases. Second, computation over a batch can be much more efficient than m computations for individual
examples, due to the parallelism afforded by the modern computing platforms. (The point: using all the data when updating the weights gives a better gradient estimate, but using a mini-batch is much faster. This passage is mainly a review of basics.)

While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers – so that small changes to the network parameters amplify as the network becomes deeper.
(The last sentence is the key point: small changes in the shallow layers are amplified into large changes in the deep layers.)

The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift (this is the first time the paper mentions covariate shift) can be extended beyond the
learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing

$$\ell=F_{2}\left(F_{1}\left(\mathrm{u}, \Theta_{1}\right), \Theta_{2}\right)$$

where $F_1$ and $F_2$ are arbitrary transformations (in the network these correspond to the layer transformations, e.g. the activation functions), and the parameters $\Theta_1, \Theta_2$ are to be learned so as to minimize the loss $\ell$. Learning $\Theta_2$ can be viewed as if the inputs $\mathrm{x}=F_1(\mathrm{u},\Theta_1)$ are fed into the sub-network
$$\ell=F_{2}\left(\mathrm{x}, \Theta_{2}\right)$$

For example, a gradient descent step
$$\Theta_{2} \leftarrow \Theta_{2}-\frac{\alpha}{m} \sum_{i=1}^{m} \frac{\partial F_{2}\left(\mathrm{x}_{i}, \Theta_{2}\right)}{\partial \Theta_{2}}$$

(for batch size $m$ and learning rate $\alpha$) is exactly equivalent to that for a stand-alone network $F_2$ with input $\mathrm{x}$. Therefore, the input distribution properties that make training more efficient – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such it is advantageous for the distribution of $\mathrm{x}$ to remain fixed over time. Then, $\Theta_2$ does not have to readjust to compensate for the change in the distribution of $\mathrm{x}$.
(This part is still a review of basics: how the weights are updated, where $F_1$ and $F_2$ again refer to the layer transformations / activation functions.)

Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the sub-network, as well. Consider a layer with a sigmoid activation function $\mathrm{z} = g(W\mathrm{u} + b)$ where $\mathrm{u}$ is the layer input, the weight matrix $W$ and bias vector $b$ are the layer parameters to be learned, and $g(x)=\frac{1}{1+\exp(-x)}$. As $|x|$ increases, $g'(x)$ tends to zero. This means that for all dimensions of $\mathrm{x} = W\mathrm{u}+b$ except those with small absolute values, the gradient flowing down to $\mathrm{u}$ will vanish and the model will train slowly. However, since $\mathrm{x}$ is affected by $W, b$ and the parameters of all the layers below, changes to those parameters during training will likely move many dimensions of $\mathrm{x}$ into the saturated regime of the nonlinearity and slow down the convergence. This effect is amplified as the network depth increases. In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010) $ReLU(x) = \max(x, 0)$, careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.
The passage above says: during training, keep the inputs out of the saturation region of the activation function; once they land there it is very hard to adjust them back out, which lengthens training.
In other words, stable weights can mean one of two things: 1. convergence, or 2. vanishing gradients.
For the concepts of exploding and vanishing gradients, [7] gives a very good explanation.
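A quick numeric check of the saturation argument (my own sketch, not from the paper): the sigmoid derivative $g'(x)=g(x)(1-g(x))$ collapses toward zero as $|x|$ grows, so gradients flowing through saturated units all but vanish.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # g'(x) = g(x) * (1 - g(x)) tends to 0 as |x| grows
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   g'(x) = {sigmoid_grad(x):.2e}")
# x =   0.0   g'(x) = 2.50e-01
# x =   2.0   g'(x) = 1.05e-01
# x =   5.0   g'(x) = 6.65e-03
# x =  10.0   g'(x) = 4.54e-05
```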

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. (This is the first time the paper mentions internal covariate shift.) Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.
Note: the difference between covariate shift and internal covariate shift is that the former refers to the learning system as a whole, while the latter applies to its parts (i.e. individual layers, each kind of layer being handled differently).
This paragraph lists a pile of benefits, but most of them were discovered by the authors along the way rather than planned from the start.

In Sec. 4.2, we apply Batch Normalization to the best-performing ImageNet classification network, and show that we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin. Using an ensemble of such networks trained with Batch Normalization, we achieve the top-5 error rate that improves upon the best known results on ImageNet classification.
The key point here: with BN, the number of training steps needed drops to just 7% of the original.

################### End of Introduction ######################################

################# Section 2 begins ####################

This section describes the authors' exploration: is it a good idea to apply the normalization outside of (i.e. after) the gradient step? The authors argue it is not, because the bias update gets cancelled out. The analysis is as follows:
For example, consider a layer with the input u that adds the learned bias b, and normalizes the result by subtracting the mean of the activation computed over the training data:
$$\widehat{x}=x-\mathrm{E}[x]$$
where $x=u+b$, $\mathcal{X}=\left\{x_{1 \ldots N}\right\}$ is the set of values of $x$ over the training set, and
$$\mathrm{E}[x]=\frac{1}{N} \sum_{i=1}^{N} x_{i}.$$
If a gradient descent step ignores the dependence of $\mathrm{E}[x]$ on $b$, then it will update $b \leftarrow b+\Delta b$, where $\Delta b \propto-\partial \ell / \partial \widehat{x}$. Then
$$u+(b+\Delta b)-\mathrm{E}[u+(b+\Delta b)]=u+b-\mathrm{E}[u+b].$$

Thus, the combination of the update to b and subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss.

What does this mean? If the normalization is applied after BP (with the gradient ignoring it), then after BP every $b$ gains a $\Delta b$, and every $\mathrm{E}[u+b]$ gains the same $\Delta b$; once the normalization subtracts the mean, the two cancel, and the work done by BP is wasted. A tiny numeric sketch follows.
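The cancellation in numbers (my own illustration): adding $\Delta b$ to the bias shifts every $x_i$ and the mean by the same amount, so $x-\mathrm{E}[x]$ is unchanged.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0, 4.0])   # layer input over a small batch
b, delta_b = 0.5, 0.3                # bias, and an increment from a gradient step

x_old = u + b                        # before the bias update
x_new = u + (b + delta_b)            # after the bias update

# Subtracting the mean removes the bias update entirely:
print(x_old - x_old.mean())   # [-1.5 -0.5  0.5  1.5]
print(x_new - x_new.mean())   # [-1.5 -0.5  0.5  1.5]  (identical)
```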

Also, a quick note on terminology used in this paper:
inference step: the forward pass, since at inference time the network must be evaluated from start to end to produce an output.
gradient step: the backpropagation / parameter-update phase.

One weakness of this paper is that some terms are not used consistently throughout.

#################### End of Section 2 #################

################### Section 3 ################################
3. presents Algorithm 1 shown above, plus how BN is handled during backpropagation
3.1 presents Algorithm 2, which is essentially a description of the workflow of the whole network once Algorithm 1 has been plugged in
3.2 how BN is applied to convolutional layers (covered only briefly, in plain prose)
3.3 and 3.4 discuss the benefits of BN

3.1 is covered further below; here let us focus on how Section 3 handles backpropagation.

$$\frac{\partial \ell}{\partial \widehat{x}_{i}}=\frac{\partial \ell}{\partial y_{i}} \cdot \gamma$$

$$\frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^{2}}=\sum_{i=1}^{m} \frac{\partial \ell}{\partial \widehat{x}_{i}} \cdot\left(x_{i}-\mu_{\mathcal{B}}\right) \cdot \frac{-1}{2}\left(\sigma_{\mathcal{B}}^{2}+\epsilon\right)^{-3 / 2}$$

$$\frac{\partial \ell}{\partial \mu_{\mathcal{B}}}=\left(\sum_{i=1}^{m} \frac{\partial \ell}{\partial \widehat{x}_{i}} \cdot \frac{-1}{\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}}\right)+\frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^{2}} \cdot \frac{\sum_{i=1}^{m}-2\left(x_{i}-\mu_{\mathcal{B}}\right)}{m}$$

$$\frac{\partial \ell}{\partial x_{i}}=\frac{\partial \ell}{\partial \widehat{x}_{i}} \cdot \frac{1}{\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}}+\frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^{2}} \cdot \frac{2\left(x_{i}-\mu_{\mathcal{B}}\right)}{m}+\frac{\partial \ell}{\partial \mu_{\mathcal{B}}} \cdot \frac{1}{m}$$

$$\frac{\partial \ell}{\partial \gamma}=\sum_{i=1}^{m} \frac{\partial \ell}{\partial y_{i}} \cdot \widehat{x}_{i}$$

$$\frac{\partial \ell}{\partial \beta}=\sum_{i=1}^{m} \frac{\partial \ell}{\partial y_{i}}$$

Although the paper gives the derivatives of the loss $\ell$ with respect to $\gamma$ and $\beta$, it does not spell out how $\gamma$ and $\beta$ are updated (in practice they are simply updated by gradient descent like any other parameter). The paper comes from Google, and TensorFlow is developed by Google, so for the concrete details you still have to read the TensorFlow code.

The derivations above follow a clear pattern: each line treats the partial derivatives computed on the previous lines as known constants and reuses them in the current line's chain rule. A code sketch of this backward pass is given below.
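Here is a minimal NumPy sketch of the backward pass that follows the six equations above line by line (it reuses the `cache` from the forward sketch earlier; this is my own illustration, not TensorFlow's implementation):

```python
import numpy as np

def batchnorm_backward(dy, x, cache):
    """Backward pass of BN, following the paper's chain-rule equations.

    dy : (m, d) gradient of the loss with respect to y
    x  : (m, d) the original mini-batch input
    """
    x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]

    dx_hat = dy * gamma                                   # dL/dx_hat_i
    dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * (var + eps) ** -1.5
    dmu = (np.sum(-dx_hat / np.sqrt(var + eps), axis=0)
           + dvar * np.sum(-2.0 * (x - mu), axis=0) / m)
    dx = (dx_hat / np.sqrt(var + eps)
          + dvar * 2.0 * (x - mu) / m
          + dmu / m)
    dgamma = np.sum(dy * x_hat, axis=0)                   # dL/dgamma
    dbeta = np.sum(dy, axis=0)                            # dL/dbeta
    return dx, dgamma, dbeta

# gamma and beta are then updated like any other parameter, e.g.
#   gamma -= lr * dgamma;  beta -= lr * dbeta
```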

################### End of Section 3 ################################
What follows in the paper are the experiments (omitted here).

############################################################

The idea of the paper is to take the whitening operation discussed in [5][6] and apply it inside the layers of a neural network.
The paper proposes two algorithms. Algorithm 1 is the one shown above; Algorithm 2 is the following:
(Figure: Algorithm 2 from [1], training a batch-normalized network)

First, let us explain the factor $\frac{m}{m-1}$ appearing above.
It comes from using the unbiased estimate of the variance.
From probability theory and statistics we know that the unbiased sample variance satisfies $\mathrm{E}(S^2)=D(X)$, i.e. its expectation equals the population variance.
During training, each mini-batch uses the biased (population-style) variance
$$\sigma_B^2=\frac{1}{m}\sum_{i=1}^m\left(x_i-\overline{x}\right)^2$$
where $m$ is the number of examples in a batch.
At inference time, however, we want an unbiased estimate of the variance of the data rather than the raw batch statistic, and the unbiased estimate is
$$\frac{1}{m-1}\sum_{i=1}^m\left(x_i-\overline{x}\right)^2=\frac{m}{m-1}\,\sigma_B^2,$$
so averaging over training mini-batches gives the paper's $\mathrm{Var}[x]=\frac{m}{m-1}\,\mathrm{E}_B\left[\sigma_B^2\right]$.
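A quick NumPy check (my own sketch): `ddof=0` gives the biased mini-batch variance $\sigma_B^2$ used during training, and multiplying by $\frac{m}{m-1}$ recovers the unbiased estimate (`ddof=1`).

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])   # one mini-batch, m = 4
m = len(x)

var_biased = x.var(ddof=0)       # (1/m)     * sum (x_i - mean)^2  ->  5.25
var_unbiased = x.var(ddof=1)     # (1/(m-1)) * sum (x_i - mean)^2  ->  7.0

print(np.isclose(var_biased * m / (m - 1), var_unbiased))   # True
```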

Note:
the paper does not make clear how $\gamma$ and $\beta$ in Algorithm 2 are updated;
Algorithm 2 calls Algorithm 1;
the flow of Algorithm 2 corresponds to the network structure figure shown earlier.
A sketch of the inference-time step of Algorithm 2 is given below.
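To show what Algorithm 2 boils down to at inference time, here is a minimal sketch (my own illustration): the batch statistics are replaced by fixed population estimates $\mathrm{E}[x]$ and $\mathrm{Var}[x]$, which turns BN into a deterministic per-feature linear transform. The paper obtains these estimates by averaging over many training mini-batches; common implementations track them with a running (exponential moving) average.

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference-time BN: fixed population statistics instead of batch
    statistics, so the whole transform is a per-feature linear map."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# During training the population statistics are commonly tracked with an
# exponential moving average of the mini-batch statistics, for example:
#   running_mean = momentum * running_mean + (1 - momentum) * mu
#   running_var  = momentum * running_var  + (1 - momentum) * var * m / (m - 1)
```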

######################################################################
Finally, for interview preparation: what are the benefits of BN? Summarized from [8]:
① It greatly speeds up training, so convergence is reached much faster;
② It also improves classification performance; one explanation is that it acts as a regularizer similar to Dropout, helping to prevent overfitting, so comparable results can be obtained even without Dropout;
③ Hyper-parameter tuning also becomes much easier: the network is less sensitive to initialization, and much larger learning rates can be used.
######################################################################

Reference:
[1] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift - Sergey Ioffe and Christian Szegedy
[2] All You Need Is a Good Init - Dmytro Mishkin and Jiri Matas
[3] 深度学习入门-基于Python的理论与实现 (Deep Learning from Scratch)
[4] Understanding the backward pass through Batch Normalization Layer
[5] Efficient BackProp - Yann A. LeCun, Leon Bottou, Genevieve B. Orr et al.
[6] A Convergence Analysis of Log-Linear Training - Simon Wiesler and Hermann Ney
[7] 深度学习中 Batch Normalization为什么效果好?
[8] 【深度学习】深入理解Batch Normalization批标准化
