Introduction
This section explains why weight initialization matters. We first analyze how an overly large weight variance causes gradients to explode in a neural network, then derive the Xavier and Kaiming initialization methods from the principle of variance consistency, and finally introduce the ten initialization methods provided by PyTorch.
1. Vanishing and Exploding Gradients
Proper weight initialization speeds up convergence, while improper initialization causes gradients to explode or vanish and ultimately makes the model untrainable. Let's look at how improper weight initialization leads to vanishing and exploding gradients.
To avoid vanishing and exploding gradients, we must strictly control the scale of each layer's output: the output values should be neither too large nor too small. The following 100-layer MLP, whose weights are drawn from a standard normal distribution, shows what happens otherwise.
```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList(
            [nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for i, linear in enumerate(self.linears):
            x = linear(x)
            # x = torch.relu(x)
            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data)  # normal: mean=0, std=1


layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1
output = net(inputs)
print(output)
```
```
layer:0, std:16.27458381652832
layer:1, std:258.77801513671875
layer:2, std:4141.2744140625
layer:3, std:65769.9140625
layer:4, std:1037747.625
layer:5, std:16792116.0
layer:6, std:270513248.0
layer:7, std:4271523072.0
layer:8, std:69773983744.0
layer:9, std:1135388917760.0
layer:10, std:17972382924800.0
layer:11, std:290361566035968.0
layer:12, std:4646467956375552.0
layer:13, std:7.375517986167194e+16
layer:14, std:1.191924480578945e+18
layer:15, std:1.9444053896660517e+19
layer:16, std:3.0528445240940115e+20
layer:17, std:4.825792774212115e+21
layer:18, std:7.649714347139797e+22
layer:19, std:1.2594570721248546e+24
layer:20, std:2.009953775506879e+25
layer:21, std:3.221620181771679e+26
layer:22, std:5.130724860334453e+27
layer:23, std:8.456056429939108e+28
layer:24, std:1.337131496090396e+30
layer:25, std:2.1005011162403366e+31
layer:26, std:3.2537079800422226e+32
layer:27, std:5.275016630929896e+33
layer:28, std:8.364601398774255e+34
layer:29, std:1.3522442523642847e+36
layer:30, std:2.1151457613842935e+37
layer:31, std:nan
output is nan in 31 layers
tensor([[        -inf,         -inf,         -inf,  ...,         -inf,
          1.1685e+38,  1.4362e+38],
        [-2.9867e+38,  2.6396e+35,  1.1949e+38,  ...,  8.2088e+37,
          1.9041e+38, -5.0897e+37],
        [-1.4351e+38,  9.9710e+37, -1.4595e+37,  ...,  4.2070e+36,
                 -inf, -1.2583e+38],
        ...,
        [         inf,          inf,          inf,  ...,  1.6265e+38,
          1.3557e+38,         -inf],
        [-1.2538e+38, -1.9771e+38,         -inf,  ..., -1.8160e+38,
          1.0576e+38,          inf],
        [         inf,  5.5094e+37, -4.4087e+36,  ...,         -inf,
         -2.4495e+38,  7.9425e+37]], grad_fn=<MmBackward>)
```
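Notice that the printed std grows by roughly the same factor at every layer. As a quick sanity check (using the first few values from the log above), that per-layer growth factor is close to $\sqrt{256} = 16$, the square root of the layer width:

```python
import math

# First few per-layer stds copied from the log above
stds = [16.27458381652832, 258.77801513671875, 4141.2744140625, 65769.9140625]

# Ratio of consecutive stds: the per-layer growth factor
ratios = [stds[i + 1] / stds[i] for i in range(len(stds) - 1)]
print(ratios)          # each ratio is close to 16
print(math.sqrt(256))  # 16.0 -- square root of the number of neurons per layer
```

This is exactly what the variance derivation below predicts.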
Next, we use a variance derivation to see why the standard deviation of each layer's output keeps growing until it exceeds the representable range of floating-point numbers. For independent random variables $X$ and $Y$, we have:
$$E(XY) = E(X)E(Y)$$
$$D(X) = E(X^2) - [E(X)]^2$$
$$D(X+Y) = D(X) + D(Y)$$
$$D(XY) = D(X)D(Y) + D(X)[E(Y)]^2 + D(Y)[E(X)]^2$$

If $E(X) = 0$ and $E(Y) = 0$, the last formula simplifies to

$$D(XY) = D(X)D(Y)$$

Now consider the first hidden unit $H_{11} = \sum_{i=0}^{n} X_i W_{1i}$, where the inputs $X_i$ and weights $W_{1i}$ all have mean 0 and variance 1. Then

$$D(H_{11}) = \sum_{i=0}^{n} D(X_i) D(W_{1i}) = n \times (1 \times 1) = n$$
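So each layer multiplies the variance by $n$, i.e. the standard deviation grows by $\sqrt{n}$ per layer. The derivation also suggests the fix: if each weight instead has variance $1/n$ (std $1/\sqrt{n}$), then $D(H_{11}) = n \times (1 \times 1/n) = 1$ and the output scale stays constant across layers. A minimal sketch of this, with plain matrix multiplies standing in for the linear layers above:

```python
import torch

torch.manual_seed(0)

n, layers, batch_size = 256, 100, 16
x = torch.randn(batch_size, n)  # inputs: mean=0, std=1

with torch.no_grad():
    for i in range(layers):
        # weights with mean 0 and std 1/sqrt(n), so D(H) = n * (1 * 1/n) = 1
        w = torch.randn(n, n) / (n ** 0.5)
        x = x @ w

print(x.std())  # stays on the order of 1 instead of blowing up
```

This is precisely the variance-consistency idea that the Xavier and Kaiming methods build on.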