Introduction
The first part of this section introduces Batch Normalization, one of the most important normalization methods in deep learning, and analyzes how it is computed. It also examines the computation and principles of the three BN modules in PyTorch: nn.BatchNorm1d, nn.BatchNorm2d, and nn.BatchNorm3d.
The second part introduces the common normalization methods that appeared after 2015: Layer Normalization, Instance Normalization, and Group Normalization. It explains where each method comes from and where it applies, and compares the computational differences among BN, LN, IN, and GN.
I. The Concept of Batch Normalization
Batch Normalization means "batch standardization". "Batch" refers to a batch of data, usually a mini-batch; "standardization" means transforming the data to mean=0, std=1.
Batch Normalization has the following advantages:
- A larger learning rate can be used, speeding up convergence. Without Batch Normalization, a learning rate that is too large easily causes the gradients to explode, making the model untrainable.
- Careful weight initialization is no longer required. Careful initialization is otherwise needed because the scale of the data may gradually grow or shrink through the layers, causing gradients to explode or vanish and making the model untrainable. For details, see: PyTorch学习—11.权值初始化.
- Dropout can be removed or reduced (experimental result from the paper).
- L2 regularization / weight decay can be removed or reduced (experimental result from the paper).
- LRN (local response normalization) is no longer needed.
For details, see the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".
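To make the "mean=0, std=1" standardization concrete, here is a minimal sketch (the tensor shapes are chosen for illustration) comparing nn.BatchNorm1d in training mode against a manual per-feature standardization. With the default affine parameters (γ initialized to 1, β to 0), the two match:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4) * 3 + 5  # batch of 8 samples, 4 features, non-standard scale

bn = nn.BatchNorm1d(num_features=4)
bn.train()  # training mode: normalize with the current batch's statistics
y = bn(x)

# manual computation: per-feature mean/variance over the batch dimension
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)  # BN uses the biased variance in the forward pass
x_hat = (x - mean) / torch.sqrt(var + bn.eps)
y_manual = bn.weight * x_hat + bn.bias  # gamma * x_hat + beta

print(torch.allclose(y, y_manual, atol=1e-5))  # True
print(y.mean(dim=0))  # ~0 per feature
print(y.std(dim=0, unbiased=False))  # ~1 per feature
```

Note that `unbiased=False` matters: BN divides by m, not m-1, when normalizing during training.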
1. How Batch Normalization Is Computed
The essence of neural-network training is learning the data distribution; if the training data and test data follow different distributions, the network's generalization ability drops sharply, which is why all inputs are normalized before training begins. However, as training proceeds, the parameter updates in each hidden layer change the inputs of the following layer, so the distribution of each mini-batch of activations shifts as well. The network then has to fit a different data distribution at every iteration, which increases training difficulty and the risk of overfitting. Batch normalization can be viewed as an extra computation layer inserted between one layer's output and the next layer's input: it normalizes the data, forcing every batch into a common distribution and thereby improving the model's generalization.
Although batch normalization improves generalization, it also reduces the model's fitting capacity. The concrete implementation therefore introduces a transform reconstruction with two learnable parameters, γ (scale) and β (shift), which become parameters of the layer; with just these two parameters, the layer can recover whatever input distribution is optimal.
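In PyTorch, the learnable γ and β correspond to the weight and bias attributes of the BN modules (a quick inspection, assuming the default affine=True):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=5)  # affine=True by default

# gamma is stored as bn.weight (initialized to 1), beta as bn.bias (initialized to 0)
print(bn.weight.shape, bn.bias.shape)   # torch.Size([5]) torch.Size([5])
print(bn.weight.requires_grad)          # True: gamma and beta are updated by backprop
print(bn.running_mean, bn.running_var)  # running statistics: updated in train(), used in eval()
```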
The complete forward pass of a batch normalization layer, for a mini-batch B = {x_1, ..., x_m}:

μ_B = (1/m) Σᵢ xᵢ                      (mini-batch mean)
σ²_B = (1/m) Σᵢ (xᵢ − μ_B)²            (mini-batch variance)
x̂ᵢ = (xᵢ − μ_B) / √(σ²_B + ε)          (normalize)
yᵢ = γ·x̂ᵢ + β                          (scale and shift)
In the original paper, Batch Normalization was proposed to address the Internal Covariate Shift problem: the distribution (scale) of the data changes from layer to layer, making training difficult. When studying weight initialization, we analyzed how the variance changes through the network; Batch Normalization solves exactly this problem, and solving it brings the advantages listed above. Below, we observe these advantages through code.
With no weight initialization and no BN, the data scale changes drastically (it collapses toward zero):
import torch
import numpy as np
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, neural_num, layers=100):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(neural_num) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        # forward pass
        for (i, linear), bn in zip(enumerate(self.linears), self.bns):
            x = linear(x)
            # x = bn(x)
            x = torch.relu(x)
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
            print("layers:{}, std:{}".format(i, x.std().item()))
        return x

    # weight initialization
    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                # method 1
                # nn.init.normal_(m.weight.data, std=1)  # normal: mean=0, std=1
                # method 2: kaiming
                nn.init.kaiming_normal_(m.weight.data)


neural_nums = 256
layer_nums = 100
batch_size = 16

net = MLP(neural_nums, layer_nums)
# net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1
output = net(inputs)
print(output)
layers:0, std:0.32943010330200195
layers:1, std:0.1318356990814209
layers:2, std:0.052585702389478683
layers:3, std:0.02193078212440014
layers:4, std:0.00893945898860693
layers:5, std:0.0036938649136573076
layers:6, std:0.0015579452738165855
layers:7, std:0.0006277725333347917
layers:8, std:0.0002646839711815119
...
layers:47, std:7.859776476425103e-20
layers:48, std:2.9688882202417704e-20
layers:49, std:1.1333666053890026e-20
layers:50, std:4.123510701654615e-21
layers:51, std:1.6957295453597266e-21
layers:52, std:6.230306804239323e-22
layers:53, std:2.4648755417600425e-22
...
layers:90, std:5.762514160668824e-37
layers:91, std:2.4922294974995734e-37
layers:92, std:9.623848803677693e-38
layers:93, std:4.248455843360902e-38
layers:94, std:1.7813144929173677e-38
layers:95, std:6.768123045051648e-39
layers:96, std:2.81580556797436e-39
layers:97, std:1.1762345166711439e-39
layers:98, std:4.812521354984649e-40
layers:99, std:1.925075804320147e-40
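As a preview of what the commented-out `x = bn(x)` line is for: enabling it, still with no weight initialization, keeps the activation scale roughly constant across all 100 layers instead of collapsing. A minimal standalone sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
neural_num, layers, batch_size = 256, 100, 16

linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for _ in range(layers)])
bns = nn.ModuleList([nn.BatchNorm1d(neural_num) for _ in range(layers)])

x = torch.randn(batch_size, neural_num)
for linear, bn in zip(linears, bns):
    # BN re-standardizes each layer's pre-activation, so the scale cannot drift
    x = torch.relu(bn(linear(x)))

# without BN the std had collapsed to ~1e-40 by layer 99; with BN it stays on the order of 1
print("layer 99 std:", x.std().item())
```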
If instead we initialize the weights from a standard normal distribution (still without BN), the data scale still changes dramatically: now it explodes.
import torch
import numpy as np
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, neural_num, layers=100):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(neural_num) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        # forward pass
        for (i, linear), bn in zip(enumerate(self.linears), self.bns):
            x = linear(x)
            # x = bn(x)
            x = torch.relu(x)
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
            print("layers:{}, std:{}".format(i, x.std().item()))
        return x

    # weight initialization
    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                # method 1
                nn.init.normal_(m.weight.data, std=1)  # normal: mean=0, std=1
                # method 2: kaiming
                # nn.init.kaiming_normal_(m.weight.data)


neural_nums = 256
layer_nums = 100
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1
output = net(inputs)
print(output)
layers:0, std:9.278536796569824
layers:1, std:110.19374084472656
layers:2, std:1147.586181640625
layers:3, std:12954.3427734375
layers:4, std:140433.40625
layers:5, std:1587572.125
layers:6, std:17144180.0
layers:7, std:193045600.0
...
layers:29, std:2.299191217901132e+31
layers:30, std:2.5098634669304556e+32
layers:31, std:2.8851340804932224e+33
layers:32, std:3.580790586058518e+34
layers:33, std:3.951448749544361e+35
layers:34, std:4.6563579532070865e+36
output is nan in 35 layers
If we use Kaiming initialization instead, the data scale stays in a reasonable range.
import torch
import numpy as np
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, neural_num, layers=100):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(neural_num) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        # forward pass
        for (i, linear), bn in zip(enumerate(self.linears), self.bns):
            x = linear(x)
            # x = bn(x)
            x = torch.relu(x)
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
            print("layers:{}, std:{}".format(i, x.std().item()))
        return x

    # weight initialization
    def initialize(self):
        for m in self.modules():