[Quick Paper Read] Batch Normalization

Link to this post: https://blog.csdn.net/tfcy694/article/details/80305241

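This post covers how Batch Normalization (BN) speeds up the training of deep neural networks by reducing internal covariate shift. By normalizing each mini-batch of layer inputs, the method allows higher learning rates, reduces the need for Dropout, and makes training less sensitive to parameter initialization and to the nonlinearity of the activation function.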


Title: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Link: https://arxiv.org/abs/1502.03167
Authors: Sergey Ioffe, Christian Szegedy
Abstract:
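What makes DNNs painful to train: whenever the parameters of one layer change, the distribution of the inputs to the following layers changes with them ("internal covariate shift"). This forces low learning rates and careful parameter initialization, and makes networks with saturating nonlinearities hard to train. The authors' solution is to normalize the layer inputs over each mini-batch. This permits higher learning rates, reduces the need for Dropout, makes the gradient flow less dependent on the initialization and scale of the parameters, keeps activations out of the saturated regime, speeds up training, and reaches state-of-the-art results.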

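The authors define Internal Covariate Shift (ICS) as the change in the distribution of network activations caused by the change in network parameters during training. One would therefore expect ICS to shrink over the course of training, as the parameter updates get smaller. In addition, LeCun et al. have shown that whitened inputs make training converge faster.
During the forward pass, when the inputs to a saturating nonlinearity such as the sigmoid drift far from 0, the gradients can vanish, so the inputs of every layer should be normalized. However, this adjustment directly interferes with what the layer can represent, so learnable parameters are still needed to undo it in a controlled way.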
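The saturation effect mentioned above can be sketched in a few lines. The sample inputs are arbitrary illustrative values, not taken from the paper; the point is only that the sigmoid derivative is largest at 0 and decays quickly as the input moves away from it.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    # Derivative of the sigmoid: s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

# The further the pre-activation is from 0, the smaller the gradient.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_grad(x):.6f}")
```

Keeping pre-activations centered near 0 with unit variance is one way to stay in the region where this derivative is not negligible.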
Where BN is placed:
[Figure: position of the BN layer in the network]
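In the BN algorithm below, the parameter vectors $\gamma$ and $\beta$ are learned and updated during training, just like the network weights:
[Figure: Algorithm 1, the BN transform applied over a mini-batch]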
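For reference, the Batch Normalizing Transform of the paper's Algorithm 1 can be written out as follows for a mini-batch $\mathcal{B}=\{x_1,\dots,x_m\}$ of a single activation. Here $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$ denote the mini-batch mean and (biased) variance, and $\epsilon$ is a small constant added for numerical stability:

$$
\begin{aligned}
\mu_{\mathcal{B}} &= \frac{1}{m}\sum_{i=1}^{m} x_i \\
\sigma_{\mathcal{B}}^2 &= \frac{1}{m}\sum_{i=1}^{m} (x_i-\mu_{\mathcal{B}})^2 \\
\hat{x}_i &= \frac{x_i-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2+\epsilon}} \\
y_i &= \gamma\,\hat{x}_i+\beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)
\end{aligned}
$$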
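With the algorithm above, setting $\beta=\dfrac{1}{m}\sum\limits_{i=1}^mx_i$ (the mini-batch mean) and $\gamma=\sqrt{\dfrac{1}{m}\sum\limits_{i=1}^m(x_i-\beta)^2}$ (the mini-batch standard deviation, ignoring $\epsilon$) makes BN reproduce the output of the previous layer, i.e. the normalization is undone.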
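The recovery property can be checked numerically. The following is a minimal sketch in plain Python, assuming a mini-batch of scalar activations; the function name `batch_norm` and the sample values are made up for illustration. The scale is set to $\sqrt{\sigma^2+\epsilon}$ rather than the bare standard deviation so that the recovery is exact even with $\epsilon$ present.

```python
import math

def batch_norm(xs, gamma, beta, eps=1e-5):
    """BN transform for one activation over a mini-batch (hypothetical helper)."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m  # biased variance, divides by m
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

xs = [1.0, 4.0, 2.5, 7.0]  # arbitrary illustrative mini-batch
eps = 1e-5
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

# gamma = 1, beta = 0: plain normalization to zero mean and (almost) unit variance.
normalized = batch_norm(xs, gamma=1.0, beta=0.0, eps=eps)
# gamma = batch std, beta = batch mean: the original activations come back.
restored = batch_norm(xs, gamma=math.sqrt(var + eps), beta=mean, eps=eps)

print(normalized)
print(restored)
```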

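The algorithm in the figure below shows the complete procedure for a BN network:
[Figure: full training and inference procedure for a BN network]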
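The part of this procedure that differs most from training is inference: the paper replaces the mini-batch statistics with population estimates, $E[x]=E_{\mathcal{B}}[\mu_{\mathcal{B}}]$ and $\mathrm{Var}[x]=\frac{m}{m-1}E_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]$, which turns BN into a fixed linear transform. Below is a rough sketch for a single scalar activation; all function names and the toy batches are assumptions for illustration, and the batches are assumed to have equal size $m$.

```python
import math

def batch_stats(xs):
    # Mini-batch mean and biased variance.
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    return mean, var

def population_stats(batches):
    # Average the per-batch statistics; m/(m-1) gives the unbiased variance estimate.
    m = len(batches[0])
    stats = [batch_stats(b) for b in batches]
    pop_mean = sum(mu for mu, _ in stats) / len(stats)
    pop_var = m / (m - 1) * sum(v for _, v in stats) / len(stats)
    return pop_mean, pop_var

def bn_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    # At inference BN is a fixed affine map: y = scale * x + shift.
    scale = gamma / math.sqrt(pop_var + eps)
    return scale * x + (beta - scale * pop_mean)

batches = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]  # toy training mini-batches
pop_mean, pop_var = population_stats(batches)
y = bn_inference(3.0, gamma=1.0, beta=0.0, pop_mean=pop_mean, pop_var=pop_var)
print(pop_mean, pop_var, y)
```

Note that PyTorch's `nn.BatchNorm*` modules track these statistics with an exponential moving average (the `momentum` argument) instead of a plain average over batches, so their running values will generally differ from this sketch.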
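In practice, for convolutional layers BN is usually applied not at the level of individual neurons but at the level of feature maps: each feature map gets a single pair $\gamma$, $\beta$, and its statistics are computed over the whole mini-batch and all spatial locations.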
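A short sketch of what "per feature map" means for a tensor of shape (N, C, H, W): the mean and variance are reduced over the batch and both spatial axes, leaving one value per channel. The tensor sizes are arbitrary; the comparison against `nn.BatchNorm2d` assumes the module is in its default training mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 8, 3, 4)           # (N, C, H, W)
bn = nn.BatchNorm2d(8, affine=True)   # training mode by default: uses batch statistics
y = bn(x)

# One mean and one biased variance per channel, shared across N, H and W.
mean = x.mean(dim=(0, 2, 3), keepdim=True)                # shape (1, 8, 1, 1)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # shape (1, 8, 1, 1)

# gamma and beta are per-channel vectors, broadcast over N, H and W.
weight = bn.weight.view(1, -1, 1, 1)
bias = bn.bias.view(1, -1, 1, 1)
y_manual = (x - mean) / torch.sqrt(var + bn.eps) * weight + bias

print(torch.allclose(y, y_manual, atol=1e-5))
```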
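One can show that BN is unaffected by the scale of the layer's weights: for a scalar $a$, $\mathrm{BN}(Wu)=\mathrm{BN}((aW)u)$, so the scale changes neither the layer Jacobian, $\dfrac{\partial\,\mathrm{BN}((aW)u)}{\partial u}=\dfrac{\partial\,\mathrm{BN}(Wu)}{\partial u}$, nor the gradient propagation, while $\dfrac{\partial\,\mathrm{BN}((aW)u)}{\partial (aW)}=\dfrac{1}{a}\cdot\dfrac{\partial\,\mathrm{BN}(Wu)}{\partial W}$ means that larger weights receive smaller gradients. Combined with the fact noted above that BN can be undone through $\gamma$ and $\beta$, this makes training more stable and less sensitive to parameter scale. The paper presents this as an argument for stability, not as a proof of convergence.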
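The scale-invariance claim can be checked with autograd. This is only a sketch under stated assumptions: the helper names `bn` and `grads`, the shapes, the scale `a = 10` and the random weighting `c` are all made up for illustration, $\gamma=1$ and $\beta=0$ are implied, and $\epsilon$ is dropped so that the invariance holds exactly.

```python
import torch

torch.manual_seed(0)

def bn(z):
    # Normalize each column over the batch dimension (no eps, no gamma/beta).
    mu = z.mean(dim=0, keepdim=True)
    var = z.var(dim=0, unbiased=False, keepdim=True)
    return (z - mu) / torch.sqrt(var)

def grads(W, u, c):
    # Returns BN(Wu) plus the gradients of a scalar loss w.r.t. u and W.
    W = W.clone().requires_grad_(True)
    u = u.clone().requires_grad_(True)
    out = bn(u @ W.T)
    (out * c).sum().backward()
    return out.detach(), u.grad, W.grad

a = 10.0
u = torch.randn(16, 5, dtype=torch.float64)   # batch of layer inputs
W = torch.randn(3, 5, dtype=torch.float64)    # layer weights
c = torch.randn(16, 3, dtype=torch.float64)   # arbitrary weighting to form a scalar loss

out1, du1, dW1 = grads(W, u, c)
out2, du2, dW2 = grads(a * W, u, c)

print(torch.allclose(out1, out2))      # BN(Wu) vs BN((aW)u)
print(torch.allclose(du1, du2))        # gradient w.r.t. the input u
print(torch.allclose(dW2, dW1 / a))    # gradient w.r.t. the scaled weights
```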

# Check nn.BatchNorm2d against a manual computation for a single output element.
import torch
import torch.nn as nn

m = nn.BatchNorm2d(8, affine=True)  # 8 channels, learnable weight (gamma) and bias (beta)
x = torch.randn(2, 8, 3, 4)         # (N, C, H, W)
output = m(x)                       # the module is in training mode, so batch statistics are used

print("Input:")
print(x)
print("BN weight (gamma):")
print(m.weight)  # shape (8,): one entry per channel
print("BN bias (beta):")
print(m.bias)    # shape (8,): one entry per channel
print("BN output:")
print(output)    # shape (2, 8, 3, 4), same as the input

print("Channel 0 of the first sample:")
print(x[0][0])
# Statistics are computed per channel, over the batch and spatial dimensions.
channel0 = x[:, 0]
channel0_mean = channel0.mean()
channel0_var = channel0.var(unbiased=False)  # biased variance: Bessel's correction is not applied
print("eps:")
print(m.eps)
print("Mean of channel 0:")
print(channel0_mean)
print("Variance of channel 0:")
print(channel0_var)

# Should match output[0][0][0][0]
bn_first = (x[0][0][0][0] - channel0_mean) / torch.sqrt(channel0_var + m.eps) * m.weight[0] + m.bias[0]
print(bn_first)
print(output[0][0][0][0])