Normalization Principles
- BatchNorm: normalizes each layer over the mini-batch, computing the mean and std over (N, H, W). Its main drawback is sensitivity to batch size: since the mean and variance are computed per batch, a batch that is too small yields statistics that do not represent the full data distribution, so it performs poorly with small batch sizes.
- LayerNorm: normalizes each sample in the mini-batch, computing the mean and std over (C, H, W); it is especially effective for RNNs.
- InstanceNorm: normalizes each channel of each sample in the mini-batch, computing the mean and std over (H, W); it is used in style transfer. In image stylization the generated result depends mainly on an individual image instance, so normalizing over the whole batch is inappropriate; normalizing over (H, W) instead speeds up convergence while keeping each image instance independent.
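The difference between the three boils down to which dimensions the statistics are reduced over. A minimal sketch on an (N, C, H, W) tensor, using plain `mean` reductions to show the shape of the resulting statistics:

```python
import torch

x = torch.rand(2, 3, 4, 4)  # (N, C, H, W)

# BatchNorm: statistics over (N, H, W) -> one mean/std per channel C
bn_mean = x.mean(dim=(0, 2, 3))   # shape (C,)

# LayerNorm: statistics over (C, H, W) -> one mean/std per sample N
ln_mean = x.mean(dim=(1, 2, 3))   # shape (N,)

# InstanceNorm: statistics over (H, W) -> one mean/std per (N, C) pair
in_mean = x.mean(dim=(2, 3))      # shape (N, C)
```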
bn = torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
#input: (N, C, L) num_features=C or (N, L) num_features=L
#output:(N, C, L) or (N, L)
bn = torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
#input: (N, C, H,W) num_features=C
#output:(N, C, H, W)
bn = torch.nn.BatchNorm3d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
#input: (N, C, D, H, W)
#output:(N, C, D, H, W)
ln = torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True)
#input: (N, *), normalized_shape = input.size()[1:]
#output:(N, *)
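To make the `normalized_shape` argument concrete, here is a minimal sketch that applies `LayerNorm` to a 3-D input and reproduces its output by hand over the trailing dimensions (assuming the default `elementwise_affine=True`, whose gamma/beta initialize to 1/0):

```python
import torch

x = torch.rand(2, 3, 4)
ln = torch.nn.LayerNorm(x.size()[1:])  # normalized_shape = (3, 4)
out = ln(x)

# each sample is normalized over its own C*L elements
manual = (x - x.mean(dim=(1, 2), keepdim=True)) / torch.sqrt(
    x.var(dim=(1, 2), unbiased=False, keepdim=True) + ln.eps)
```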
BN, LN, and IN all normalize the input with the same formula below; the key difference among the three is which dimensions the mean and std are computed over.
y = \frac{x - mean}{std} * gamma + beta
- gamma and beta are the scale and shift parameters. Intuitively, if gamma and beta were set to the std and mean respectively, the input would be restored to its pre-BN state. They are learned via gradients during training; at test time they are used as-is and not updated.
- momentum controls the moving-average update of the mean and std; in torch it is typically 0.1, 0.01, or 0.001. The running statistics are not used during training; they are kept for test time and updated as follows:
running_mean = (1 - momentum) * running_mean + momentum * batch_mean
running_std = (1 - momentum) * running_std + momentum * batch_std
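This update rule can be verified directly against `running_mean` after one training-mode forward pass. A minimal sketch, starting from the initial running mean of zero:

```python
import torch

momentum = 0.1
bn = torch.nn.BatchNorm1d(3, momentum=momentum)
x = torch.rand(4, 3)

prev_mean = bn.running_mean.clone()  # initialized to zeros
bn.train()
bn(x)  # training-mode forward pass updates the running statistics

batch_mean = x.mean(dim=0)
expected = (1 - momentum) * prev_mean + momentum * batch_mean
# bn.running_mean now matches the moving-average formula above
```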
Implementing normalization in Python
BatchNorm
PyTorch implementation
import torch
inp = torch.rand(2,3,4)
bn = torch.nn.BatchNorm1d(3, affine=False)  # affine=False makes gamma and beta None
bn_out = bn(inp)
>>inp
>>tensor([[[0.1319, 0.1956, 0.2887, 0.1659],
[0.2575, 0.5717, 0.8141, 0.3247],
[0.2323, 0.4682, 0.6605, 0.5281]],
[[0.6290, 0.2745, 0.4449, 0.2501],
[0.0568, 0.1737, 0.3457, 0.9583],
[0.8578, 0.0243, 0.5147, 0.2992]]])
>>bn_out
>>tensor([[[-1.0761, -0.6623, -0.0576, -0.8551],
[-0.6099, 0.4529, 1.2728, -0.3827],
[-0.8881, 0.0827, 0.8736, 0.3289]],
[[ 2.1523, -0.1501, 0.9568, -0.3080],
[-1.2888, -0.8934, -0.3115, 1.7605],
[ 1.6855, -1.7436, 0.2738, -0.6128]]])
Python implementation
inp1 = inp.permute(1,2,0).reshape(inp.size()[1],-1)
mean = inp1.mean(-1).reshape(1, inp.size()[1], 1)
std = inp1.std(-1, unbiased=False).reshape(1, inp.size()[1], 1)
#unbiased=True gives the unbiased estimate: [(x1-x)^2+(x2-x)^2+...+(xn-x)^2]/(n-1)
#unbiased=False gives the biased estimate:  [(x1-x)^2+(x2-x)^2+...+(xn-x)^2]/n
mybn_out = (inp-mean)/std
>>mybn_out
>>tensor([[[-1.0762, -0.6624, -0.0576, -0.8553],
[-0.6099, 0.4529, 1.2729, -0.3826],
[-0.8881, 0.0825, 0.8738, 0.3290]],
[[ 2.1528, -0.1499, 0.9570, -0.3084],
[-1.2888, -0.8934, -0.3116, 1.7606],
[ 1.6856, -1.7439, 0.2739, -0.6128]]])