PyTorch Learning—11. Weight Initialization

Introduction

  This section explains why weight initialization matters. We first analyze how an overly large variance of the weights in a neural network causes exploding gradients, then use the variance-consistency principle to derive the Xavier and Kaiming initialization methods, and finally introduce the ten initialization methods provided by PyTorch.

1. Vanishing and Exploding Gradients

  Proper weight initialization speeds up convergence, whereas improper initialization leads to exploding or vanishing gradients and ultimately makes the model impossible to train. Let's look at how improper weight initialization causes gradients to vanish or explode.
During backpropagation, the gradient of each layer's weights is proportional to the output of the layer before it, so if the layer outputs keep growing (or shrinking) from layer to layer, the gradients explode (or vanish) with them. To avoid vanishing and exploding gradients, we therefore need to strictly control the scale of each layer's output: it must be neither too large nor too small. The following experiment shows what happens when we do not.

import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        # `layers` identical fully-connected layers (neural_num -> neural_num, no bias)
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            # x = torch.relu(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):   # stop as soon as the output std overflows to nan
                print("output is nan in {} layers".format(i))
                break

        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data)    # normal: mean=0, std=1

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

Output:
layer:0, std:16.27458381652832
layer:1, std:258.77801513671875
layer:2, std:4141.2744140625
layer:3, std:65769.9140625
layer:4, std:1037747.625
layer:5, std:16792116.0
layer:6, std:270513248.0
layer:7, std:4271523072.0
layer:8, std:69773983744.0
layer:9, std:1135388917760.0
layer:10, std:17972382924800.0
layer:11, std:290361566035968.0
layer:12, std:4646467956375552.0
layer:13, std:7.375517986167194e+16
layer:14, std:1.191924480578945e+18
layer:15, std:1.9444053896660517e+19
layer:16, std:3.0528445240940115e+20
layer:17, std:4.825792774212115e+21
layer:18, std:7.649714347139797e+22
layer:19, std:1.2594570721248546e+24
layer:20, std:2.009953775506879e+25
layer:21, std:3.221620181771679e+26
layer:22, std:5.130724860334453e+27
layer:23, std:8.456056429939108e+28
layer:24, std:1.337131496090396e+30
layer:25, std:2.1005011162403366e+31
layer:26, std:3.2537079800422226e+32
layer:27, std:5.275016630929896e+33
layer:28, std:8.364601398774255e+34
layer:29, std:1.3522442523642847e+36
layer:30, std:2.1151457613842935e+37
layer:31, std:nan
output is nan in 31 layers
tensor([[       -inf,        -inf,        -inf,  ...,        -inf,
          1.1685e+38,  1.4362e+38],
        [-2.9867e+38,  2.6396e+35,  1.1949e+38,  ...,  8.2088e+37,
          1.9041e+38, -5.0897e+37],
        [-1.4351e+38,  9.9710e+37, -1.4595e+37,  ...,  4.2070e+36,
                -inf, -1.2583e+38],
        ...,
        [        inf,         inf,         inf,  ...,  1.6265e+38,
          1.3557e+38,        -inf],
        [-1.2538e+38, -1.9771e+38,        -inf,  ..., -1.8160e+38,
          1.0576e+38,         inf],
        [        inf,  5.5094e+37, -4.4087e+36,  ...,        -inf,
         -2.4495e+38,  7.9425e+37]], grad_fn=<MmBackward>)

  Below we use the basic variance formulas to work out why the standard deviation of the layer outputs keeps growing until it finally exceeds the range that float32 can represent.
For independent random variables X and Y:

$$E(X * Y) = E(X) * E(Y)$$
$$D(X) = E(X^2) - [E(X)]^2$$
$$D(X + Y) = D(X) + D(Y)$$
$$D(X * Y) = D(X) * D(Y) + D(X) * [E(Y)]^2 + D(Y) * [E(X)]^2$$

When $E(X) = 0$ and $E(Y) = 0$, the last formula reduces to

$$D(X * Y) = D(X) * D(Y)$$
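A quick numerical sanity check of this product formula (a minimal sketch, not part of the original derivation), using two independent standard-normal samples so that D(X) = D(Y) = 1:

import torch

torch.manual_seed(0)
x = torch.randn(1000000)   # X ~ N(0, 1), so D(X) = 1
y = torch.randn(1000000)   # Y ~ N(0, 1), so D(Y) = 1

print(x.var().item(), y.var().item())   # both close to 1
print((x * y).var().item())             # close to D(X) * D(Y) = 1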
Consider the first neuron of the first hidden layer, $H_{11}$, whose inputs $X_i$ and weights $W_{1i}$ are all independent with mean 0 and variance 1:

$$H_{11} = \sum_{i=0}^{n} X_i * W_{1i}$$
$$D(H_{11}) = \sum_{i=0}^{n} D(X_i) * D(W_{1i}) = n * (1 * 1) = n$$

So $\mathrm{std}(H_{11}) = \sqrt{n} = \sqrt{256} = 16$, which matches the std of about 16.27 printed for layer 0 above. Each layer multiplies the standard deviation by roughly $\sqrt{n}$, so after about thirty layers the values overflow float32 and become nan.
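Based on this derivation, a natural fix is to make $D(W) = 1/n$, i.e. $\mathrm{std}(W) = 1/\sqrt{n}$, so that $D(H_{11}) = n * D(X) * D(W) = 1$ and the output scale stays constant from layer to layer. A minimal sketch (assuming we simply modify the initialize method of the MLP above, keeping __init__ and forward unchanged):

import math

class MLP(nn.Module):
    # __init__ and forward are the same as in the version above

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                # std(W) = 1 / sqrt(n)  =>  D(H) = n * D(X) * D(W) = 1
                nn.init.normal_(m.weight.data, std=1 / math.sqrt(self.neural_num))

With this change, the printed std should stay close to 1 across all 100 layers (in the purely linear case, without the relu).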
