This course is from deepshare.net (深度之眼); some screenshots are taken from the course videos.
Introduction
Learn the common normalization methods in deep learning:
Batch Normalization, Layer Normalization, Instance Normalization and Group Normalization
Batch Normalization: the Concept
Batch Normalization: batch standardization
Batch: a batch of data, usually a mini-batch
Standardization: zero mean, unit variance
Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
The paper lists the advantages of BN; note, though, that later work has questioned whether reducing internal covariate shift is really why BN works.
Advantages (1 "can use", 4 "no longer need"):
1. Can use a larger learning rate, accelerating model convergence
2. No longer need to carefully design weight initialization
3. No longer need dropout, or can use a smaller dropout
4. No longer need L2 regularization, or can use a smaller weight decay
5. No longer need LRN (local response normalization)
How it is computed:
A mini-batch contains data $x_1, x_2, \dots, x_m$, and there are two learnable parameters $\gamma, \beta$. BN then transforms each $x_i$ through these two parameters:
$y_i = \mathrm{BN}_{\gamma,\beta}(x_i)$
The concrete formulas are:

$\mu_B \leftarrow \frac{1}{m}\sum_{i=1}^{m}x_i$

$\sigma_B^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2$

$\widehat{x}_i \leftarrow \frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}$

$y_i \leftarrow \gamma\widehat{x}_i+\beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$

In the normalize step, a small correction term $\epsilon$ is added to the denominator to prevent it from being 0. The resulting $\widehat{x}_i$ follows a distribution with mean 0 and standard deviation 1. Finally, the scale and shift of $\widehat{x}_i$ is carried out by the two learnable parameters $\gamma$ and $\beta$.
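The four formulas above can be checked numerically. Below is a minimal sketch (the batch size, feature count and tolerance are my own choices, not from the course): compute BN by hand per feature across the batch, then compare against nn.BatchNorm1d in training mode.

```python
import torch

torch.manual_seed(0)
m, num_features = 8, 4                          # mini-batch size m and feature count
x = torch.randn(m, num_features)
eps = 1e-5

# The four BN steps, computed by hand per feature across the batch:
mu = x.mean(dim=0)                              # mini-batch mean
var = x.var(dim=0, unbiased=False)              # mini-batch variance (biased, as in the paper)
x_hat = (x - mu) / torch.sqrt(var + eps)        # normalize
gamma, beta = torch.ones(num_features), torch.zeros(num_features)
y = gamma * x_hat + beta                        # scale and shift

# nn.BatchNorm1d in training mode should produce the same result
# (its weight/bias initialize to 1/0, matching gamma/beta above)
bn = torch.nn.BatchNorm1d(num_features, eps=eps)
bn.train()
print(torch.allclose(y, bn(x), atol=1e-5))  # True
```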
BN was originally proposed to address ICS (Internal Covariate Shift), i.e. to prevent gradient explosion or vanishing in deep networks when weights become too large or too small. The advantages listed above (1 "can use", 4 "no longer need") come along as side benefits.
The code example below demonstrates that without a BN layer and without constructing initial weight values, gradients vanish, while with Kaiming initialization the signal can still propagate. With a BN layer, it makes no difference whether the weights are specially initialized or not: the std of the data at every layer stays in a much better range.
```python
# -*- coding: utf-8 -*-
"""
# @file name : bn_and_initialize.py
# @author    : TingsongYu https://github.com/TingsongYu
# @date      : 2019-11-01
# @brief     : BN and weight initialization
"""
import torch
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers=100):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(neural_num) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear), bn in zip(enumerate(self.linears), self.bns):
            x = linear(x)
            x = bn(x)  # note the position of the BN layer
            x = torch.relu(x)

            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break

            print("layers:{}, std:{}".format(i, x.std().item()))
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                # method 1
                # nn.init.normal_(m.weight.data, std=1)  # normal: mean=0, std=1

                # method 2: kaiming
                nn.init.kaiming_normal_(m.weight.data)


neural_nums = 256
layer_nums = 100
batch_size = 16

net = MLP(neural_nums, layer_nums)
# net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)
```
Then apply the same modification to the LeNet from the earlier RMB binary classification task.
Without specific weight initialization: (screenshot omitted)
With specific weight initialization: (screenshot omitted)
With BN: (screenshot omitted)
Note that the weights stay within a range of 0.5.
PyTorch's Batch Normalization 1d/2d/3d Implementations
_BatchNorm (base class)
·nn.BatchNorm1d
·nn.BatchNorm2d
·nn.BatchNorm3d
Parameters of the base class:
·num_features: the number of features of one sample (most important)
·eps: denominator correction term, usually 1e-5
·momentum: exponential weighted average for estimating the current mean/var
·affine: whether an affine transform is needed, default is True
·track_running_stats: training mode vs. testing mode
Main attributes:
·running_mean: the mean, the $\mu$ in the formula above
·running_var: the variance, the $\sigma^2$ in the formula above
·weight: the $\gamma$ in the affine transform (learnable)
·bias: the $\beta$ in the affine transform (learnable)
Of the four attributes above, the last two are learnable; what about the first two?
During training: the mean and variance are estimated with an exponential weighted average:
running_mean = (1 - momentum) * pre_running_mean + momentum * mean_t
running_var = (1 - momentum) * pre_running_var + momentum * var_t
During testing: the current statistics (the values already estimated) are used.
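This update rule can be verified directly against nn.BatchNorm1d. A small sketch (the momentum value and input tensor are my own choices); one subtlety worth a comment: PyTorch updates running_var with the *unbiased* batch variance, even though normalization itself uses the biased one.

```python
import torch
import torch.nn as nn

momentum = 0.1
bn = nn.BatchNorm1d(3, momentum=momentum)
x = torch.tensor([[1., 2., 3.],
                  [3., 4., 5.]])

# Hand-computed update from the initial buffers (running_mean=0, running_var=1)
mean_t = x.mean(dim=0)               # per-feature batch mean
var_t = x.var(dim=0, unbiased=True)  # running_var uses the unbiased estimate
expected_mean = (1 - momentum) * 0 + momentum * mean_t
expected_var = (1 - momentum) * 1 + momentum * var_t

bn.train()
bn(x)  # one forward pass updates the buffers
print(torch.allclose(bn.running_mean, expected_mean))  # True
print(torch.allclose(bn.running_var, expected_var))    # True
```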
Next, the input shapes that Batch Normalization 1d/2d/3d expect:
1D
·nn.BatchNorm1d input = B × number of features × 1d feature
As the figure shows, the batch holds 3 samples, each sample has 5 features, and each feature is 1-dimensional, so the size is 3 × 5 × 1; the trailing 1 is sometimes omitted, giving 3 × 5.
How are the four attributes of BatchNorm1d computed?
Look at each feature horizontally, across the batch: compute the mean and variance of the three 1s and learn $\gamma$ and $\beta$ to obtain the four attributes of the first feature; likewise compute them over the three 2s to obtain the four attributes of the second feature, and over the three 3s to obtain the four attributes of the third feature.
For the figure above, here is example code computing the mean and variance:

```python
import torch
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set random seed

# ======================================== nn.BatchNorm1d
flag = 1
# flag = 0
if flag:

    batch_size = 3
    num_features = 5  # 5 features
    momentum = 0.3

    features_shape = (1)  # 1d: each feature in the figure is 1-dimensional

    feature_map = torch.ones(features_shape)                                             # a tensor of ones
    feature_maps = torch.stack([feature_map*(i+1) for i in range(num_features)], dim=0)  # expand along the feature axis (the y axis in the figure)
    feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)      # expand along the batch axis (the x axis in the figure)
    print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))  # should print 3*5*1

    bn = nn.BatchNorm1d(num_features=num_features, momentum=momentum)

    running_mean, running_var = 0, 1
    # compute mean and variance
    for i in range(2):
        outputs = bn(feature_maps_bs)

        print("\niteration:{}, running mean: {} ".format(i, bn.running_mean))
        print("iteration:{}, running var:{} ".format(i, bn.running_var))

        mean_t, var_t = 2, 0  # batch statistics of the second feature (three 2s)

        running_mean = (1 - momentum) * running_mean + momentum * mean_t
        running_var = (1 - momentum) * running_var + momentum * var_t

        print("iteration:{}, running mean of the 2nd feature: {} ".format(i, running_mean))
        print("iteration:{}, running var of the 2nd feature: {}".format(i, running_var))
```
The printed results: (output screenshot omitted)
The input data matches the settings in the figure (1, 2, 3, 4, 5 across the features); the interesting part is the mean and variance computation below it.
Notice that the first running mean is clearly not 1. Why? The mean of the three 1s should be 1; the reason is momentum = 0.3.
By the formula: running_mean = (1 - momentum) * pre_running_mean + momentum * mean_t
Since this is the first iteration, pre_running_mean (the previous mean) does not exist; a starting value must be chosen, and the default here is 0.
The current mean is mean_t = (1+1+1)/3 = 1, so
running_mean = (1-0.3)×0 + 0.3×1 = 0.3
Likewise, the second feature (three 2s) gives 0.6,
and the third feature (three 3s) gives 0.9.
Now the second iteration: momentum = 0.3, mean_t = (1+1+1)/3 = 1 (the input for the first feature has not changed),
pre_running_mean = 0.3, so
running_mean = (1-0.3)×0.3 + 0.3×1 = 0.51
Likewise, the second feature (three 2s) gives (1-0.3)×0.6 + 0.3×2 = 1.02,
and the third feature (three 3s) gives (1-0.3)×0.9 + 0.3×3 = 1.53.
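The hand calculation for the first feature (starting running_mean of 0, momentum of 0.3, constant batch mean of 1) can be reproduced in a few lines of plain Python:

```python
momentum = 0.3
running_mean = 0.0  # PyTorch initializes running_mean to 0
mean_t = 1.0        # batch mean of the first feature (three 1s)

for i in range(2):
    running_mean = (1 - momentum) * running_mean + momentum * mean_t
    print(f"iteration {i}: running_mean = {running_mean:.2f}")
# iteration 0: running_mean = 0.30
# iteration 1: running_mean = 0.51
```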
2D
·nn.BatchNorm2d input = B × number of features × 2d feature
The shape in the figure is 3 × 3 × 2 × 2: each feature is 2 × 2, each sample has 3 features, and there are 3 samples in total.
```python
flag = 1
# flag = 0
if flag:

    batch_size = 3
    num_features = 3  # 3 features per sample, as in the figure
    momentum = 0.3

    features_shape = (2, 2)  # 2d features

    feature_map = torch.ones(features_shape)                                             # 2D
    feature_maps = torch.stack([feature_map*(i+1) for i in range(num_features)], dim=0)  # 3D
    feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)      # 4D

    print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

    bn = nn.BatchNorm2d(num_features=num_features, momentum=momentum)

    running_mean, running_var = 0, 1

    for i in range(2):
        outputs = bn(feature_maps_bs)

        print("\niter:{}, running_mean.shape: {}".format(i, bn.running_mean.shape))
        print("iter:{}, running_var.shape: {}".format(i, bn.running_var.shape))
        print("iter:{}, weight.shape: {}".format(i, bn.weight.shape))
        print("iter:{}, bias.shape: {}".format(i, bn.bias.shape))
```
Since num_features is 3, each of the four attributes also has shape 3.
3D
·nn.BatchNorm3d input = B × number of features × 3d feature
The shape in the figure is 3 × 4 × 2 × 2 × 3: each feature is 2 × 2 × 3, each sample has 4 features, and there are 3 samples in total.
```python
# flag = 1
flag = 0
if flag:

    batch_size = 3
    num_features = 4
    momentum = 0.3

    features_shape = (2, 2, 3)

    feature = torch.ones(features_shape)                                                # 3D
    feature_map = torch.stack([feature * (i + 1) for i in range(num_features)], dim=0)  # 4D
    feature_maps = torch.stack([feature_map for i in range(batch_size)], dim=0)         # 5D

    print("input data:\n{} shape is {}".format(feature_maps, feature_maps.shape))

    bn = nn.BatchNorm3d(num_features=num_features, momentum=momentum)

    running_mean, running_var = 0, 1

    for i in range(2):
        outputs = bn(feature_maps)

        print("\niter:{}, running_mean.shape: {}".format(i, bn.running_mean.shape))
        print("iter:{}, running_var.shape: {}".format(i, bn.running_var.shape))
        print("iter:{}, weight.shape: {}".format(i, bn.weight.shape))
        print("iter:{}, bias.shape: {}".format(i, bn.bias.shape))
```
Common Normalizations — BN, LN, IN and GN
- Batch Normalization (BN)
- Layer Normalization (LN)
- Instance Normalization (IN)
- Group Normalization (GN)
What the four normalization methods have in common is that they all use the same formula: subtract, divide, multiply, add — subtract $\mu$, divide by $\sigma$, multiply by $\gamma$, add $\beta$:

$\widehat{x}_i \leftarrow \dfrac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}$

$y_i \leftarrow \gamma\widehat{x}_i+\beta \equiv \mathrm{N}_{\gamma,\beta}(x_i)$

Where they differ is in how the mean and variance are computed.
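This difference can be made concrete in one function: the shared "subtract, divide, multiply, add" recipe is fixed, and only the reduction axes change. A sketch assuming a 4-D B×C×H×W input (the shapes and the helper name `normalize` are my own, not from the course):

```python
import torch

def normalize(x, dims, gamma=1.0, beta=0.0, eps=1e-5):
    """Subtract mu, divide by sigma, multiply by gamma, add beta."""
    mu = x.mean(dim=dims, keepdim=True)
    var = x.var(dim=dims, unbiased=False, keepdim=True)
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

x = torch.randn(2, 4, 3, 3)              # B x C x H x W
bn_like = normalize(x, dims=(0, 2, 3))   # BN: stats over batch + spatial dims, per channel
ln_like = normalize(x, dims=(1, 2, 3))   # LN: stats over all features, per sample
in_like = normalize(x, dims=(2, 3))      # IN: stats over spatial dims, per sample per channel

print(torch.allclose(in_like, torch.nn.InstanceNorm2d(4)(x), atol=1e-4))  # True
```

GN follows the same recipe but first reshapes the channels into groups before choosing the reduction axes.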
1. Layer Normalization
Reference: Layer Normalization
Motivation: BN is not applicable to variable-length networks such as RNNs
Idea: compute the mean and variance per layer, over the three circled regions in the figure
Notes:
1. There are no longer running_mean and running_var
2. gamma and beta are element-wise (per feature)
nn.LayerNorm
Main parameters:
·normalized_shape: the shape of the features in this layer
·eps: denominator correction term (the $\epsilon$ in the formula)
·elementwise_affine: whether an affine transform is needed
```python
# -*- coding: utf-8 -*-
"""
# @file name : bn_and_initialize.py
# @author    : TingsongYu https://github.com/TingsongYu
# @date      : 2019-11-03
# @brief     : common normalization layers in pytorch
"""
import torch
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set random seed

# ======================================== nn.layer norm
# flag = 1
flag = 0
if flag:
    batch_size = 8
    num_features = 6

    features_shape = (3, 4)

    feature_map = torch.ones(features_shape)                                                 # 2D
    feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
    feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)          # 4D

    # feature_maps_bs shape is [8, 6, 3, 4], B * C * H * W
    ln = nn.LayerNorm(feature_maps_bs.size()[1:], elementwise_affine=True)  # tell PyTorch the shape of each "layer", i.e. the circled region in the figure
    # ln = nn.LayerNorm(feature_maps_bs.size()[1:], elementwise_affine=False)
    # ln = nn.LayerNorm([6, 3, 4])  # the shape can also be given by hand, but it must match the shape of feature_maps_bs [8, 6, 3, 4] counted from the back
    # ln = nn.LayerNorm([6, 3])  # this raises an error: the given shape does not match the trailing dims of feature_maps_bs

    output = ln(feature_maps_bs)

    print("Layer Normalization")
    print(ln.weight.shape)
    print(feature_maps_bs[0, ...])
    print(output[0, ...])
```
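As a cross-check on what normalized_shape means, LN statistics can be computed by hand over exactly the dimensions named by normalized_shape. A small sketch (shapes mirror the example above; affine is disabled so the comparison is pure normalization):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 6, 3, 4)  # B x C x H x W
ln = nn.LayerNorm(x.size()[1:], elementwise_affine=False)

# LN normalizes each sample over the trailing dims given by normalized_shape
mu = x.mean(dim=(1, 2, 3), keepdim=True)
var = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + ln.eps)

print(torch.allclose(manual, ln(x), atol=1e-5))  # True
```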
2. Instance Normalization
References:
Instance Normalization: The Missing Ingredient for Fast Stylization
Image Style Transfer Using Convolutional Neural Networks
Motivation: BN is not applicable to image generation
In image generation tasks (as in the figure), each sample in a batch has a different style, so computing a mean across the features of different samples is clearly a bad idea.
Idea: compute the mean and variance per instance (per channel)
Main parameters of nn.InstanceNorm (the same as BN's; InstanceNorm likewise comes in 1d/2d/3d variants, not repeated here):
·num_features: the number of features of one sample (most important)
·eps: denominator correction term
·momentum: exponential weighted average for estimating the current mean/var
·affine: whether an affine transform is needed
·track_running_stats: training mode vs. testing mode
```python
# ======================================== nn.instance norm 2d
# flag = 1
flag = 0
if flag:

    batch_size = 3
    num_features = 3
    momentum = 0.3

    features_shape = (2, 2)  # same size as in the figure above

    feature_map = torch.ones(features_shape)                                                 # 2D
    feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
    feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)          # 4D

    print("Instance Normalization")
    print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

    instance_n = nn.InstanceNorm2d(num_features=num_features, momentum=momentum)

    for i in range(1):
        outputs = instance_n(feature_maps_bs)
        print(outputs)
```
Here the outputs are all 0. The reason is that the mean is computed per instance (per channel), every element within a channel is identical, so subtracting the mean gives 0.
3. Group Normalization
Reference: Group Normalization
Motivation: with a small batch of samples, BN's estimates are inaccurate. A batch size is usually 64, 128 or larger; if memory cannot fit that much data, the batch size may be as small as 2, as in the figure.
Following BN's approach, the mean and variance would be computed like this, but with so little data the estimates would be inaccurate.
Idea: when data is insufficient, make up for it with channels. Note: in the figure below, nothing is computed along the x (batch) axis; the computation is grouped only along the y (feature/channel) axis.
Notes:
1. There are no longer running_mean and running_var
2. gamma and beta are per channel
Application scenario: large-model (small batch size) tasks
nn.GroupNorm
Main parameters:
·num_groups: the number of groups, usually 2, 4, 8, 16 or 32
·num_channels: the number of channels (features); e.g. with 256 channels and 4 groups, each group has 64 channels
·eps: denominator correction term
·affine: whether an affine transform is needed
```python
flag = 1
# flag = 0
if flag:

    batch_size = 2
    num_features = 4
    num_groups = 2  # using 3 here raises: Expected number of channels in input to be divisible by num_groups
    # num_groups must divide num_features evenly, otherwise an error is raised

    features_shape = (2, 2)

    feature_map = torch.ones(features_shape)                                                      # 2D
    feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)       # 3D
    feature_maps_bs = torch.stack([feature_maps * (i + 1) for i in range(batch_size)], dim=0)     # 4D

    gn = nn.GroupNorm(num_groups, num_features)
    outputs = gn(feature_maps_bs)

    print("Group Normalization")
    print(gn.weight.shape)
    print(outputs[0])
```
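GN can likewise be checked by hand: reshape the channels into groups and compute the statistics within each group of each sample. A sketch with my own small shapes (random input, affine disabled so the comparison is pure normalization):

```python
import torch
import torch.nn as nn

B, C, H, W, G = 2, 4, 2, 2, 2
x = torch.randn(B, C, H, W)
gn = nn.GroupNorm(num_groups=G, num_channels=C, affine=False)

# Reshape to (B, G, C//G * H * W) and normalize within each group
xg = x.reshape(B, G, -1)
mu = xg.mean(dim=2, keepdim=True)
var = xg.var(dim=2, unbiased=False, keepdim=True)
manual = ((xg - mu) / torch.sqrt(var + gn.eps)).reshape(B, C, H, W)

print(torch.allclose(manual, gn(x), atol=1e-5))  # True
```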
Normalization Summary
Summary: BN, LN, IN and GN were all proposed to overcome Internal Covariate Shift (ICS).
Homework:
1. What are the 4 important parameters of Batch Normalization? What operations does the BN layer perform on X with these four parameters?
2. What does the "subtract, divide, multiply, add" mentioned at the end of the lecture mean?