Proper weight initialization can speed up model convergence; improper initialization can cause vanishing or exploding gradients.
Vanishing and Exploding Gradients
How improper initialization causes vanishing and exploding gradients:
Consider a three-layer fully-connected network, and look at how the gradient of the second hidden layer's weight W2 is obtained:

H1 = X * W1,  H2 = H1 * W2,  out = H2 * W3

ΔW2 = ∂Loss/∂W2 = ∂Loss/∂out * ∂out/∂H2 * ∂H2/∂W2 = ∂Loss/∂out * ∂out/∂H2 * H1

The gradient of W2 therefore depends on H1, the output of the previous layer:
if H1 → 0, then ΔW2 → 0, causing vanishing gradients;
if H1 → ∞, then ΔW2 → ∞, causing exploding gradients.
Once gradients vanish or explode, the model can no longer be trained. The formula shows that to prevent this we must control the range of each layer's output values: the outputs must be neither too large nor too small.
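The dependence of ΔW2 on H1 can be checked directly with autograd. This is a minimal sketch with hypothetical 4-unit layers, not code from the course:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 4)
w1 = torch.randn(4, 4, requires_grad=True)
w2 = torch.randn(4, 4, requires_grad=True)
w3 = torch.randn(4, 4, requires_grad=True)

h1 = x @ w1            # H1 = X * W1
h2 = h1 @ w2           # H2 = H1 * W2
out = h2 @ w3          # out = H2 * W3
loss = out.sum()
loss.backward()

# For a sum loss, dLoss/dout is all ones, so by the chain rule
# dLoss/dW2 = H1^T @ (dLoss/dH2) with dLoss/dH2 = ones @ W3^T.
grad_h2 = torch.ones(1, 4) @ w3.t()
manual = h1.t() @ grad_h2          # depends directly on H1
print(torch.allclose(w2.grad, manual, atol=1e-5))  # True
```

Scaling H1 scales w2.grad by the same factor, which is exactly why a too-small or too-large layer output drags the gradient down to zero or up to infinity.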
Let's look at the output of a stack of fully-connected layers in code:
import os
import torch
import random
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data)  # normal: mean=0, std=1

# flag = 0
flag = 1

if flag:
    layer_nums = 100
    neural_nums = 256
    batch_size = 16

    net = MLP(neural_nums, layer_nums)
    net.initialize()

    inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

    output = net(inputs)
    print(output)
The result is all nan: the values have become so large (or so small) that they exceed the representable range of the current floating-point precision. Next, we observe inside forward when the data first becomes nan, using the standard deviation to measure the scale of the data. We print each layer's standard deviation and stop propagation as soon as nan appears:
import os
import torch
import random
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data)  # normal: mean=0, std=1

# flag = 0
flag = 1

if flag:
    layer_nums = 100
    neural_nums = 256
    batch_size = 16

    net = MLP(neural_nums, layer_nums)
    net.initialize()

    inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

    output = net(inputs)
    print(output)
We can see that the data already becomes nan at layer 31; beyond that, the current precision can no longer represent such large or small values. We can also see that the standard deviation grows layer by layer. Below we derive the variance to explain why the standard deviation keeps growing until it exceeds the representable range.
The first layer's standard deviation is 15.9599; each layer has 256 neurons and sqrt(256) = 16, so the first layer's standard deviation is about 16 and the second layer's becomes about 256. In other words, each layer multiplies the previous layer's standard deviation by sqrt(n) = 16, so the scale of the data keeps expanding until, at layer 31, it exceeds the range the precision can represent.
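This sqrt(n) growth per layer can be verified with a single matrix multiplication (a quick standalone check, not part of the course code):

```python
import torch

torch.manual_seed(1)
n = 256
x = torch.randn(16, n)   # input: std ~ 1
w = torch.randn(n, n)    # weights: std ~ 1
y = x @ w                # one linear layer without bias

# std(y) should be about sqrt(n) = 16, matching the ~15.96 observed at layer 0
print(y.std())
```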
For independent random variables X and Y:
1. E(X*Y) = E(X) * E(Y)
2. D(X) = E(X^2) - [E(X)]^2
3. D(X+Y) = D(X) + D(Y)
From 1, 2 and 3: D(X*Y) = D(X)*D(Y) + D(X)*[E(Y)]^2 + D(Y)*[E(X)]^2
If E(X) = 0 and E(Y) = 0, then D(X*Y) = D(X) * D(Y)
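The identity D(X*Y) = D(X)*D(Y) for independent zero-mean variables can be checked with a quick Monte-Carlo estimate (an illustrative sketch, not part of the course code):

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000)   # mean 0, variance 1
y = torch.randn(1_000_000)   # independent, mean 0, variance 1

# Empirical variance of the product should be close to D(X)*D(Y) = 1
print((x * y).var())
```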
Now let's look at the standard deviation of a single neuron in a network layer:
H_11 = sum_{i=1..n} X_i * W_1i, where each X_i and W_1i has mean 0 and variance 1, so its variance is:
D(H_11) = sum_{i=1..n} D(X_i) * D(W_1i) = n * (1 * 1) = n, where n is the number of neurons per layer
Therefore std(H_11) = sqrt(D(H_11)) = sqrt(n)
With every additional layer, the standard deviation is multiplied by sqrt(n), so it eventually exceeds the range of our precision and produces nan.
From the above, the output standard deviation is determined by three factors: the number of layers, the variance of the input values, and the variance of the layer weights.
To keep the scale of the layer outputs unchanged, each neuron's output variance must be 1; from D(H_1) = n * D(X) * D(W) = 1 we get D(W) = 1/n, i.e. std(W) = sqrt(1/n).
Now set the weight standard deviation to sqrt(1/n) in code:
import os
import torch
import random
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, std=np.sqrt(1/self.neural_num))  # normal: mean=0, std=sqrt(1/n)

# flag = 0
flag = 1

if flag:
    layer_nums = 100
    neural_nums = 256
    batch_size = 16

    net = MLP(neural_nums, layer_nums)
    net.initialize()

    inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

    output = net(inputs)
    print(output)
We can see that every layer's standard deviation now stays around 1, which is a healthy output distribution. With a proper weight initialization, the output scale of a deep fully-connected network can be kept within a reasonable range, neither too large nor too small. However, we have not yet considered weight initialization in the presence of activation functions; let's add tanh to the forward pass:
import os
import torch
import random
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.tanh(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, std=np.sqrt(1/self.neural_num))  # normal: mean=0, std=sqrt(1/n)

# flag = 0
flag = 1

if flag:
    layer_nums = 100
    neural_nums = 256
    batch_size = 16

    net = MLP(neural_nums, layer_nums)
    net.initialize()

    inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

    output = net(inputs)
    print(output)
As shown above, after adding the activation function the standard deviation shrinks layer by layer, which leads to vanishing gradients. In general, with a saturating activation function, the neurons' variance decreases as the network gets deeper, causing vanishing gradients.
Xavier Initialization
In 2010, Glorot and Bengio published a paper that works out in detail how to initialize weights in the presence of activation functions. The paper relies on variance consistency, i.e. keeping the data scale within a proper range (variance around 1), and its analysis targets saturating activation functions such as Sigmoid and Tanh.
Reference: "Understanding the difficulty of training deep feedforward neural networks"
The derivation in the paper yields the following two equations:
n_i * D(W) = 1 and n_{i+1} * D(W) = 1
where n_i is the number of input neurons, D(W) the weight variance, and n_{i+1} the number of output neurons. Taking the data scale of both forward and backward propagation into account gives D(W) = 2 / (n_i + n_{i+1}).
Now derive the upper and lower bounds of the Xavier initialization. Consider a uniform distribution: if W follows U(-a, a), then
D(W) = (2a)^2 / 12 = a^2 / 3
Setting a^2 / 3 = 2 / (n_i + n_{i+1}) gives a = sqrt(6 / (n_i + n_{i+1})), so W follows
U(-sqrt(6 / (n_i + n_{i+1})), sqrt(6 / (n_i + n_{i+1})))
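As a quick numerical sanity check of this bound (using n_i = n_{i+1} = 256, the layer width used throughout this section):

```python
import numpy as np

n_i, n_o = 256, 256
a = np.sqrt(6 / (n_i + n_o))   # Xavier uniform bound

# Variance of U(-a, a) is (2a)^2 / 12 = a^2 / 3,
# which should equal the target 2 / (n_i + n_{i+1})
print(a)                       # the bound itself
print(a**2 / 3, 2 / (n_i + n_o))
```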
Now initialize the weights with Xavier and observe the layer outputs: first compute the bounds of the uniform distribution, then compute the activation function's gain (the change in standard deviation that data undergoes when passing through the activation function), and finally initialize the weights from a uniform distribution with those bounds.
import os
import torch
import random
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.tanh(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                a = np.sqrt(6 / (self.neural_num + self.neural_num))  # Xavier uniform bound

                tanh_gain = nn.init.calculate_gain('tanh')
                a *= tanh_gain

                nn.init.uniform_(m.weight.data, -a, a)

# flag = 0
flag = 1

if flag:
    layer_nums = 100
    neural_nums = 256
    batch_size = 16

    net = MLP(neural_nums, layer_nums)
    net.initialize()

    inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

    output = net(inputs)
    print(output)
We can see that every layer's standard deviation now stays around 0.65, so no layer's data distribution becomes too large or too small. PyTorch provides the Xavier method built in, and we can call it directly:
import os
import torch
import random
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.tanh(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                tanh_gain = nn.init.calculate_gain('tanh')
                nn.init.xavier_uniform_(m.weight.data, gain=tanh_gain)

# flag = 0
flag = 1

if flag:
    layer_nums = 100
    neural_nums = 256
    batch_size = 16

    net = MLP(neural_nums, layer_nums)
    net.initialize()

    inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

    output = net(inputs)
    print(output)
The first few layers computed with PyTorch's Xavier method match the manual calculation. Xavier is an effective initialization for saturating activation functions such as sigmoid and tanh, but after 2010 the non-saturating relu came into ever wider use, and because of its different properties Xavier no longer applies. Let's change the activation function to relu:
import os
import torch
import random
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                # Xavier (with tanh gain) kept unchanged on purpose,
                # to show that it breaks down under relu
                tanh_gain = nn.init.calculate_gain('tanh')
                nn.init.xavier_uniform_(m.weight.data, gain=tanh_gain)

# flag = 0
flag = 1

if flag:
    layer_nums = 100
    neural_nums = 256
    batch_size = 16

    net = MLP(neural_nums, layer_nums)
    net.initialize()

    inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

    output = net(inputs)
    print(output)
The results show that with the relu activation function the layer variances grow larger and larger. In general, adding a non-saturating activation function such as relu makes the variance increase with depth, which leads to exploding gradients; the Kaiming weight initialization method was proposed for exactly this case.
Kaiming Initialization
Variance consistency: keep the data scale within a proper range, usually variance 1
Activation functions: ReLU and its variants
The weight variance is: D(W) = 2 / n_i
For ReLU variants: D(W) = 2 / ((1 + a^2) * n_i), where a is the slope of the negative half-axis
Therefore std(W) = sqrt(2 / ((1 + a^2) * n_i))
import os
import torch
import random
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, std=np.sqrt(2 / self.neural_num))  # Kaiming: mean=0, std=sqrt(2/n)

# flag = 0
flag = 1

if flag:
    layer_nums = 100
    neural_nums = 256
    batch_size = 16

    net = MLP(neural_nums, layer_nums)
    net.initialize()

    inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

    output = net(inputs)
    print(output)
As shown above, the neurons' standard deviations now fluctuate around a constant value.
PyTorch also provides this method built in:
import os
import torch
import random
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight.data)

# flag = 0
flag = 1

if flag:
    layer_nums = 100
    neural_nums = 256
    batch_size = 16

    net = MLP(neural_nums, layer_nums)
    net.initialize()

    inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

    output = net(inputs)
    print(output)
The results with the built-in Kaiming method match the manual calculation.
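We can also confirm directly that kaiming_normal_ with its default arguments (fan_in mode, ReLU-family gain sqrt(2)) draws weights with std = sqrt(2/n_i), here sqrt(2/256):

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
w = torch.empty(256, 256)
nn.init.kaiming_normal_(w)       # defaults: mode='fan_in', leaky_relu with a=0

# Expected std = sqrt(2 / fan_in) = sqrt(2 / 256) ~ 0.0884
print(w.std().item())
print(math.sqrt(2 / 256))
```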
The ten initialization methods provided by PyTorch:
1. Xavier uniform distribution
2. Xavier normal distribution
3. Kaiming uniform distribution
4. Kaiming normal distribution
5. Uniform distribution
6. Normal distribution
7. Constant
8. Orthogonal matrix initialization
9. Identity matrix initialization
10. Sparse matrix initialization
https://pytorch.org/docs/stable/nn.init.html
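A few of these methods in action, applied to a small placeholder tensor (a minimal usage sketch; see the documentation link above for each function's parameters):

```python
import torch
import torch.nn as nn

w = torch.empty(3, 3)
nn.init.xavier_normal_(w)        # 2. Xavier normal distribution
nn.init.kaiming_uniform_(w)      # 3. Kaiming uniform distribution
nn.init.constant_(w, 0.5)        # 7. constant
nn.init.eye_(w)                  # 9. identity matrix
nn.init.orthogonal_(w)           # 8. orthogonal matrix: w @ w.T ~ I

print(w @ w.t())                 # approximately the identity matrix
```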
Finally, let's look at a special function that computes the variance scaling of an activation function: the gain is the ratio of the standard deviation of the input data to the standard deviation of the output data after it passes through the activation function.
nn.init.calculate_gain(nonlinearity, param=None)
Main arguments:
- nonlinearity: name of the activation function
- param: parameter of the activation function, e.g. the negative_slope of Leaky ReLU
# ========================= calculate gain =========================
# flag = 0
flag = 1

if flag:
    x = torch.randn(10000)
    out = torch.tanh(x)

    gain = x.std() / out.std()
    print('gain:{}'.format(gain))

    tanh_gain = nn.init.calculate_gain('tanh')
    print('tanh_gain in PyTorch:', tanh_gain)
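Beyond tanh, calculate_gain returns documented constants for the other common nonlinearities: 1 for linear layers, 5/3 for tanh, sqrt(2) for relu, and sqrt(2/(1+slope^2)) for leaky relu:

```python
import math
import torch.nn as nn

print(nn.init.calculate_gain('linear'))           # 1
print(nn.init.calculate_gain('tanh'))             # 5/3 ~ 1.6667
print(nn.init.calculate_gain('relu'))             # sqrt(2) ~ 1.4142
print(nn.init.calculate_gain('leaky_relu', 0.1))  # sqrt(2 / 1.01)
```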