In machine learning, data is usually normalized before model training so that it follows a consistent distribution. Deep neural networks are typically trained one batch at a time rather than on the full dataset, and each batch has a slightly different distribution. This gives rise to the internal covariate shift problem: the distribution of each layer's inputs changes during training, making learning harder for the downstream layers. Batch Normalization forcibly pulls the data back to a normal distribution with mean 0 and variance 1, which keeps the distributions consistent and helps avoid vanishing gradients, thereby speeding up training convergence.
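As a minimal sketch of the core computation (plain tensor arithmetic, not the nn.BatchNorm1d module introduced below; the 32×8 batch is purely illustrative):
import torch

# Standardize each feature of one batch to mean 0, variance 1, then apply a
# learnable scale (gamma) and shift (beta): the core BN computation.
x = torch.randn(32, 8)                       # a batch of 32 samples, 8 features
mean = x.mean(dim=0)                         # per-feature batch mean
var = x.var(dim=0, unbiased=False)           # per-feature (biased) batch variance
x_hat = (x - mean) / torch.sqrt(var + 1e-5)  # mean 0, variance 1 per feature
gamma, beta = torch.ones(8), torch.zeros(8)  # learnable parameters in the real module
y = gamma * x_hat + beta
print(y.mean(dim=0))                         # ~0 per feature
print(y.var(dim=0, unbiased=False))          # ~1 per feature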
References:
Ioffe & Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.”
“PyTorch踩坑指南(1) nn.BatchNorm2d()函数”, 白水煮蝎子, CSDN blog
<1>torch.nn.BatchNorm1d
CLASS torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)
Parameters:
- num_features (int) – number of features or channels C of the input; the feature dimension of each sample. The input is typically a matrix.
- eps (float) – a value added to the denominator for numerical stability, preventing division by zero during normalization. Default: 1e-5.
- momentum (float) – the value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average). Default: 0.1.
- affine (bool) – when set to True, this module has learnable affine parameters γ and β; γ is initialized to 1 and β to 0. Default: True.
- track_running_stats (bool) – when set to True, this module tracks the running mean and variance, with running_mean initialized to 0 and running_var initialized to 1. When set to False, it does not track such statistics and initializes the running_mean and running_var buffers as None; when these buffers are None, the module always uses the current batch's statistics for normalization, in both training and eval modes. Default: True.
Shape:
- Input: (N, C), where N is the batch size and C is the number of features or channels. The input is a matrix.
- Output: (N, C) (same shape as input).
Note: the affine parameters γ and β are learned via backpropagation, whereas running_mean and running_var are statistics accumulated during the forward pass.
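This split is visible directly on the module: γ and β are registered as trainable parameters, while the running statistics are non-trainable buffers. A minimal sketch using the standard named_parameters()/named_buffers() accessors:
import torch

m = torch.nn.BatchNorm1d(4)
# gamma (weight) and beta (bias) are nn.Parameter objects, updated by the optimizer
for name, p in m.named_parameters():
    print(name, p.requires_grad)   # weight True, bias True
# running_mean / running_var (and num_batches_tracked) are buffers,
# updated inside forward(), not by gradient descent
for name, b in m.named_buffers():
    print(name)                    # running_mean, running_var, num_batches_tracked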
References:
BatchNorm1d — PyTorch 2.0 documentation
“nn.BatchNorm1d”, harry_tea, CSDN blog
1. How the model's mean and variance are updated, and how the data is normalized
Note that track_running_stats takes effect only when the BatchNorm module is constructed: whether the running_mean/running_var buffers exist is decided at that point, and assigning the attribute afterwards does not create or remove the buffers, as the following example shows:
import torch
m = torch.nn.BatchNorm1d(8, momentum=0.1, affine=True, track_running_stats=True)
m.track_running_stats = False  # setting the attribute after construction does not remove the buffers
print('running_mean:', m.running_mean)  # initial values
print('running_var:', m.running_var)
print('track_running_stats:', m.track_running_stats)
m2 = torch.nn.BatchNorm1d(8, momentum=0.1, affine=True, track_running_stats=False)
print('running_mean:', m2.running_mean)  # initialized as None
print('running_var:', m2.running_var)
running_mean: tensor([0., 0., 0., 0., 0., 0., 0., 0.])
running_var: tensor([1., 1., 1., 1., 1., 1., 1., 1.])
track_running_stats: False
running_mean: None
running_var: None
Whether the running_mean and running_var statistics are updated depends on the combination of training and track_running_stats, and the normalization behavior also differs across these combinations.
(1) training = True, track_running_stats = True: the model is in training mode. On every normalization it updates running_mean and running_var, i.e. it tracks the mean and variance of each batch.
Update rule for the running statistics:
$$\hat{x}_{\text{new}} = (1 - \text{momentum}) \times \hat{x} + \text{momentum} \times x_t$$
where $\hat{x}$ is the model's running mean or variance, $x_t$ is the mean or variance observed on the current batch, $\hat{x}_{\text{new}}$ is the updated value, and momentum is the update factor.
Normalization of the observed batch:
$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \times \gamma + \beta$$
where $\mathrm{E}[x]$ is the batch mean, $\mathrm{Var}[x]$ is the batch variance, and $y$ is the normalized data of the corresponding channel; that is, normalization uses the current batch's own mean and variance.
Unbiased estimate of the variance:
$$\sigma^2_{\text{unbiased}} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$$
Biased estimate of the variance:
$$\sigma^2_{\text{biased}} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$$
Note: the running_mean and running_var updates use the unbiased variance, while normalization of the data uses the biased variance. When $N$ is large, the two estimates are essentially identical.
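A quick numeric check of the two estimators (a plain-tensor sketch, independent of the BatchNorm module):
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
n = x.numel()
mean = x.mean()
print(((x - mean) ** 2).sum() / (n - 1))  # unbiased: 1.6667
print(torch.var(x, unbiased=True))        # same: 1.6667
print(((x - mean) ** 2).sum() / n)        # biased: 1.25
print(torch.var(x, unbiased=False))       # same: 1.25
The full demonstration for case (1):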
import torch
m = torch.nn.BatchNorm1d(8, momentum=0.1, affine=True, track_running_stats=True)
print('running_mean:', m.running_mean)  # initial values
print('running_var:', m.running_var)
print('weight:', m.weight)
print('bias:', m.bias)
input = torch.randn(5, 8)
print('input:', input)
print('input[...,0]:', input[...,0])  # first column
obser_mean = torch.Tensor([input[...,i].mean() for i in range(8)])  # per-feature batch mean
obser_var_unbiased = torch.Tensor([input[...,i].var() for i in range(8)])  # unbiased variance, same as torch.var(input, dim=0, unbiased=True)
obser_var_biased = torch.Tensor([input[...,i].var(unbiased=False) for i in range(8)])  # biased variance, same as torch.var(input, dim=0, unbiased=False)
print('obser_mean:', obser_mean)
print('obser_var_unbiased:', obser_var_unbiased)
print('obser_var_biased:', obser_var_biased)
obser_running_mean = (1-m.momentum)*m.running_mean + m.momentum*obser_mean  # manual update rule
obser_running_var = (1-m.momentum)*m.running_var + m.momentum*obser_var_unbiased
output = m(input)
output_obser = (input[...,0] - obser_mean[0])/(pow(obser_var_biased[0] + m.eps, 0.5))  # manual normalization of the first feature
print('obser_running_mean:', obser_running_mean)
print('obser_running_var:', obser_running_var)
print('running_mean:', m.running_mean)
print('running_var:', m.running_var)
print('output[...,0]:', output[...,0])  # normalized data
print('output_obser:', output_obser)
running_mean: tensor([0., 0., 0., 0., 0., 0., 0., 0.])
running_var: tensor([1., 1., 1., 1., 1., 1., 1., 1.])
weight: Parameter containing:
tensor([1., 1., 1., 1., 1., 1., 1., 1.], requires_grad=True)
bias: Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
input: tensor([[ 0.4901, -0.1794, 1.1301, -0.1901, -0.7794, -0.2863, -0.9673, 1.5712],
[ 0.7150, -0.6555, 0.1724, 1.8487, 0.3064, -0.0863, 1.3970, 0.3117],
[-1.4870, -1.0768, -0.8371, 1.7132, 0.9250, 0.6004, -0.2488, 0.8714],
[-0.7459, 0.3344, -1.1203, 1.7061, 0.6755, -0.2490, 1.4969, 0.6247],
[-0.2600, 2.0536, 0.5194, -1.4121, -1.3856, -0.2249, 0.0729, -0.4737]])
input[...,0]: tensor([ 0.4901, 0.7150, -1.4870, -0.7459, -0.2600])
obser_mean: tensor([-0.2576, 0.0953, -0.0271, 0.7332, -0.0516, -0.0492, 0.3501, 0.5810])
obser_var_unbiased: tensor([0.8138, 1.4763, 0.8822, 2.1516, 0.9799, 0.1376, 1.1456, 0.5629])
obser_var_biased: tensor([0.6510, 1.1810, 0.7058, 1.7213, 0.7839, 0.1101, 0.9165, 0.4503])
obser_running_mean: tensor([-0.0258, 0.0095, -0.0027, 0.0733, -0.0052, -0.0049, 0.0350, 0.0581])
obser_running_var: tensor([0.9814, 1.0476, 0.9882, 1.1152, 0.9980, 0.9138, 1.0146, 0.9563])
running_mean: tensor([-0.0258, 0.0095, -0.0027, 0.0733, -0.0052, -0.0049, 0.0350, 0.0581])
running_var: tensor([0.9814, 1.0476, 0.9882, 1.1152, 0.9980, 0.9138, 1.0146, 0.9563])
output[...,0]: tensor([ 0.9267, 1.2054, -1.5237, -0.6053, -0.0031],
grad_fn=<SelectBackward0>)
output_obser: tensor([ 0.9267, 1.2054, -1.5237, -0.6053, -0.0031])
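The example above uses momentum=0.1, i.e. an exponential moving average. As the parameter list notes, momentum=None switches to a cumulative (simple) average over all batches seen so far; a short sketch of that documented behavior:
import torch

m = torch.nn.BatchNorm1d(4, momentum=None)  # cumulative moving average
b1, b2 = torch.randn(16, 4), torch.randn(16, 4)
m(b1)
m(b2)
# running stats equal the simple average of the two batches' statistics
# (mean, and unbiased variance, per the update rule above)
expected_mean = (b1.mean(dim=0) + b2.mean(dim=0)) / 2
expected_var = (b1.var(dim=0, unbiased=True) + b2.var(dim=0, unbiased=True)) / 2
print(torch.allclose(m.running_mean, expected_mean, atol=1e-6))  # True
print(torch.allclose(m.running_var, expected_var, atol=1e-6))    # True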
(2) training = False, track_running_stats = True: the model is in eval mode. Normalization uses the stored running mean and variance, and these statistics are not updated.
# eval mode
m.eval()
print(m.training)
print(m.track_running_stats)
input = torch.randn(5, 8)
obser_mean = torch.mean(input, dim=0)  # per-feature batch mean
obser_var_biased = torch.var(input, dim=0, unbiased=False)  # biased variance
print('obser_mean:', obser_mean)
print('obser_var_biased:', obser_var_biased)
print('running_mean:', m.running_mean)
print('running_var:', m.running_var)
output = m(input)
output_obser = (input[...,0] - obser_mean[0])/(pow(obser_var_biased[0] + m.eps, 0.5))  # normalization with batch statistics (does not match eval output)
output2_obser = (input[...,0] - m.running_mean[0])/(pow(m.running_var[0] + m.eps, 0.5))  # normalization with running statistics (matches eval output)
print('output[...,0]:', output[...,0])
print('output_obser:', output_obser)
print('output2_obser:', output2_obser)
False
True
obser_mean: tensor([-0.1944, -0.0978, -0.2910, 0.1413, -0.3192, 0.2840, 0.2374, 0.0797])
obser_var_biased: tensor([2.0099, 0.1910, 0.8365, 0.2942, 1.6348, 1.7859, 1.5220, 0.6940])
running_mean: tensor([ 0.0830, 0.0147, 0.0465, 0.0591, 0.0311, 0.0579, -0.0096, 0.0033])
running_var: tensor([1.0621, 0.9541, 0.9964, 1.1851, 0.9764, 1.0526, 1.0279, 0.9809])
output[...,0]: tensor([ 0.4747, -2.2079, -1.2251, 1.7868, -0.1744],
grad_fn=<SelectBackward0>)
output_obser: tensor([ 0.5407, -1.4093, -0.6949, 1.4946, 0.0689])
output2_obser: tensor([ 0.4747, -2.2079, -1.2251, 1.7868, -0.1744])
(3) training = True or False, track_running_stats = False: whether in training or eval mode, the model normalizes with the current batch's mean and variance, and never updates running statistics (the buffers stay None).
import torch
m = torch.nn.BatchNorm1d(8, momentum=0.1, affine=True, track_running_stats=False)
print('running_mean:', m.running_mean)  # initialized as None
print('running_var:', m.running_var)
input = torch.randn(5, 8)
obser_mean = torch.Tensor([input[...,i].mean() for i in range(8)])  # per-feature batch mean
obser_var_biased = torch.Tensor([input[...,i].var(unbiased=False) for i in range(8)])  # biased variance
print('obser_mean:', obser_mean)
print('obser_var_biased:', obser_var_biased)
output = m(input)
output_obser = (input[...,0] - obser_mean[0])/(pow(obser_var_biased[0] + m.eps, 0.5))  # manual normalization with batch statistics
print('running_mean:', m.running_mean)
print('running_var:', m.running_var)
print('output[...,0]:', output[...,0])  # normalized data
print('output_obser:', output_obser)
# eval mode
m.eval()
print(m.training)
print(m.track_running_stats)
input = torch.randn(5, 8)
obser_mean = torch.mean(input, dim=0)  # per-feature batch mean
obser_var_biased = torch.var(input, dim=0, unbiased=False)  # biased variance
print('obser_mean:', obser_mean)
print('obser_var_biased:', obser_var_biased)
print('running_mean:', m.running_mean)
print('running_var:', m.running_var)
output = m(input)
output_obser = (input[...,0] - obser_mean[0])/(pow(obser_var_biased[0] + m.eps, 0.5))  # still batch statistics, even in eval mode
print('output[...,0]:', output[...,0])
print('output_obser:', output_obser)
running_mean: None
running_var: None
obser_mean: tensor([-0.4277, 0.2008, -0.3871, 0.4741, 0.5016, 0.6817, -0.2613, 0.0763])
obser_var_biased: tensor([0.3961, 0.7895, 1.1211, 0.2614, 0.2954, 0.4563, 1.8461, 0.7862])
running_mean: None
running_var: None
output[...,0]: tensor([ 0.6016, 1.0995, -1.0294, 0.6994, -1.3712],
grad_fn=<SelectBackward0>)
output_obser: tensor([ 0.6016, 1.0995, -1.0294, 0.6994, -1.3712])
False
False
obser_mean: tensor([-0.7911, -0.0979, 0.5710, -0.8198, 0.3552, -0.0772, 0.7881, 0.7573])
obser_var_biased: tensor([2.0702, 1.2274, 1.2483, 0.5527, 0.3471, 0.2689, 1.0752, 0.7770])
running_mean: None
running_var: None
output[...,0]: tensor([ 0.1504, 0.6443, -0.4143, 1.2792, -1.6596],
grad_fn=<SelectBackward0>)
output_obser: tensor([ 0.1504, 0.6443, -0.4143, 1.2792, -1.6596])
<2>torch.nn.BatchNorm2d
CLASS torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)
Parameters:
- num_features (int) – C from an expected input of size (N, C, H, W).
Shape:
- Input: (N, C, H, W)
- Output: (N, C, H, W) (same shape as input)
BatchNorm2d differs from BatchNorm1d only in the shape of its input. When computing the observed batch's mean and variance, the statistics are taken over all elements of the same channel across the whole batch. Assume an input $x$ of shape $(N, C, H, W)$; then for each channel $c$:
Per-channel mean:
$$\mu_c = \frac{1}{N \cdot H \cdot W}\sum_{n=1}^{N}\sum_{h=1}^{H}\sum_{w=1}^{W} x_{n,c,h,w}$$
Per-channel variance:
$$\sigma_c^2 = \frac{1}{N \cdot H \cdot W}\sum_{n=1}^{N}\sum_{h=1}^{H}\sum_{w=1}^{W}(x_{n,c,h,w} - \mu_c)^2$$
Normalization of each element of channel $c$:
$$y_{n,c,h,w} = \frac{x_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} \times \gamma_c + \beta_c$$
BatchNorm2d's update mechanism for the running mean and variance, and its normalization mechanism, are the same as BatchNorm1d's.
import torch
m = torch.nn.BatchNorm2d(3, eps=0, momentum=0.5, affine=True, track_running_stats=True)
print('running_mean:', m.running_mean)  # initial values
print('running_var:', m.running_var)
print('weight:', m.weight)
print('bias:', m.bias)
input = torch.randn(1, 3, 5, 5)
print('input[0][0]:', input[0][0])  # first sample, first channel
obser_mean = torch.Tensor([input[0][i].mean() for i in range(3)])  # per-channel mean
obser_var_unbiased = torch.Tensor([input[0][i].var() for i in range(3)])  # unbiased variance, same as torch.var(input[0][i], unbiased=True)
obser_var_biased = torch.Tensor([input[0][i].var(unbiased=False) for i in range(3)])  # biased variance, same as torch.var(input[0][i], unbiased=False)
print('obser_mean:', obser_mean)
print('obser_var_unbiased:', obser_var_unbiased)
print('obser_var_biased:', obser_var_biased)
obser_running_mean = (1-m.momentum)*m.running_mean + m.momentum*obser_mean  # manual update rule
obser_running_var = (1-m.momentum)*m.running_var + m.momentum*obser_var_unbiased
output = m(input)
print('obser_running_mean:', obser_running_mean)
print('obser_running_var:', obser_running_var)
print('running_mean:', m.running_mean)
print('running_var:', m.running_var)
output_obser = (input[0][0] - obser_mean[0])/(pow(obser_var_biased[0] + m.eps, 0.5))  # manual normalization of channel 0
print('output[0][0]:', output[0][0])  # normalized data
print('output_obser:', output_obser)
running_mean: tensor([0., 0., 0.])
running_var: tensor([1., 1., 1.])
weight: Parameter containing:
tensor([1., 1., 1.], requires_grad=True)
bias: Parameter containing:
tensor([0., 0., 0.], requires_grad=True)
input[0][0]: tensor([[-0.3550, -1.1596, -0.4947, -0.8188, -0.1722],
[-1.3371, 0.9375, 0.5564, 2.3561, -0.5711],
[ 0.3932, 2.6657, 0.3440, -0.9300, 0.1791],
[-1.0307, 0.2115, 0.4953, 1.8088, 0.0496],
[-1.0584, 0.4566, -0.1415, 1.2106, 0.4498]])
obser_mean: tensor([ 0.1618, 0.2137, -0.0836])
obser_var_unbiased: tensor([1.1056, 1.1707, 0.9300])
obser_var_biased: tensor([1.0613, 1.1239, 0.8928])
obser_running_mean: tensor([ 0.0809, 0.1068, -0.0418])
obser_running_var: tensor([1.0528, 1.0853, 0.9650])
running_mean: tensor([ 0.0809, 0.1068, -0.0418])
running_var: tensor([1.0528, 1.0853, 0.9650])
output[0][0]: tensor([[-0.5016, -1.2827, -0.6372, -0.9518, -0.3242],
[-1.4549, 0.7529, 0.3830, 2.1300, -0.7114],
[ 0.2246, 2.4305, 0.1768, -1.0598, 0.0168],
[-1.1575, 0.0482, 0.3237, 1.5987, -0.1089],
[-1.1844, 0.2861, -0.2944, 1.0180, 0.2795]],
grad_fn=<SelectBackward0>)
output_obser: tensor([[-0.5016, -1.2827, -0.6372, -0.9518, -0.3242],
[-1.4549, 0.7529, 0.3830, 2.1300, -0.7114],
[ 0.2246, 2.4305, 0.1768, -1.0598, 0.0168],
[-1.1575, 0.0482, 0.3237, 1.5987, -0.1089],
[-1.1844, 0.2861, -0.2944, 1.0180, 0.2795]])
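As a complementary check that the statistics really are per channel over all N*H*W elements, here is a sketch with an illustrative multi-sample batch (affine disabled to isolate the normalization itself), flattening each channel across the whole batch:
import torch

x = torch.randn(2, 3, 4, 4)                         # (N, C, H, W)
per_channel = x.permute(1, 0, 2, 3).reshape(3, -1)  # (C, N*H*W): one row per channel
mu = per_channel.mean(dim=1)
var_biased = per_channel.var(dim=1, unbiased=False)

m = torch.nn.BatchNorm2d(3, affine=False)           # training mode: uses batch statistics
y = m(x)
y_manual = (x - mu[None, :, None, None]) / torch.sqrt(var_biased[None, :, None, None] + m.eps)
print(torch.allclose(y, y_manual, atol=1e-5))       # True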