PyTorch: Weight Initialization

1. Vanishing and Exploding Gradients

Consider a three-layer fully connected network, and look at how the gradient of the second hidden layer's weights $W_2$ is computed.

By the chain rule, the gradient of $W_2$ is:
$$\mathrm{H}_{2}=\mathrm{H}_{1} * \mathrm{W}_{2}$$
$$\Delta \mathrm{W}_{2}=\frac{\partial \mathrm{Loss}}{\partial \mathrm{W}_{2}}=\frac{\partial \mathrm{Loss}}{\partial \mathrm{out}} * \frac{\partial \mathrm{out}}{\partial \mathrm{H}_{2}} * \frac{\partial \mathrm{H}_{2}}{\partial \mathrm{W}_{2}}=\frac{\partial \mathrm{Loss}}{\partial \mathrm{out}} * \frac{\partial \mathrm{out}}{\partial \mathrm{H}_{2}} * \mathrm{H}_{1}$$
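In the formula above, $H_1$ is the output of the previous layer, so the gradient of $W_2$ depends on that output. If $H_1$ tends to zero, the gradient of $W_2$ also tends to zero, which causes vanishing gradients. If $H_1$ tends to infinity, the gradient of $W_2$ also tends to infinity, which causes exploding gradients.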

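So, to avoid vanishing or exploding gradients, the range of every layer's output must be kept under control: the outputs of each layer should be neither too large nor too small.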
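The dependence of the gradient of $W_2$ on $H_1$ can be checked with a small autograd experiment. This is only a sketch: the shapes, the toy loss (the sum of $H_2$) and the helper name `grad_norm_w2` are assumptions made for illustration and do not come from the network in this post.

```python
import torch

torch.manual_seed(0)

base_h1 = torch.randn(4, 8)   # stand-in for H1 (batch of 4, 8 neurons); shapes are illustrative
w2_init = torch.randn(8, 8)   # stand-in for W2


def grad_norm_w2(scale):
    """Norm of dLoss/dW2 when H1 is multiplied by `scale` (hypothetical helper)."""
    h1 = base_h1 * scale
    w2 = w2_init.clone().requires_grad_(True)
    h2 = h1 @ w2              # H2 = H1 * W2
    loss = h2.sum()           # toy loss, so dLoss/dH2 is a matrix of ones
    loss.backward()
    return w2.grad.norm().item()


small = grad_norm_w2(1e-3)
normal = grad_norm_w2(1.0)
large = grad_norm_w2(1e3)
print("H1 scaled by 1e-3:", small)
print("H1 scaled by 1   :", normal)
print("H1 scaled by 1e3 :", large)
```

With this toy loss the gradient norm scales linearly with $H_1$, so shrinking or inflating the previous layer's output shrinks or inflates the gradient by the same factor.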

The following code inspects the output of a fully connected network:

import os
import torch
import random
import numpy as np
import torch.nn as nn
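from toolss.common_tools import set_seed  # helper from the author's own toolss package, not part of PyTorch

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model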
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data)

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

Run the code and look at output:

tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], grad_fn=<MmBackward>)

Every value in the output is nan: the magnitudes have grown beyond the range that the current floating-point precision can represent.
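How an out-of-range value ends up as nan can be reproduced in isolation. The sketch below assumes the default float32 dtype; the specific numbers are chosen only for illustration.

```python
import torch

print("float32 max:", torch.finfo(torch.float32).max)   # about 3.4e38

# Two values close to the float32 limit
big = torch.tensor([3.0e38, -3.0e38], dtype=torch.float32)

overflowed = big * 10           # outside the representable range -> inf / -inf
print(overflowed)

total = overflowed.sum()        # inf + (-inf) is undefined -> nan
spread = overflowed.std()       # the mean is already nan, so the std is nan too
print(total, spread)
```

Once a layer produces both inf and -inf, any later sum that mixes them yields nan, which is why the final output contains only nan.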

Now go back to forward() and find out where the data turns into nan, using the standard deviation as a measure of the scale of the data:

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std becomes nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data)

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

Running the code above gives the following output:

layer:0, std:15.959932327270508
layer:1, std:256.6237487792969
layer:2, std:4107.24560546875
layer:3, std:65576.8125
layer:4, std:1045011.875
layer:5, std:17110408.0
layer:6, std:275461408.0
layer:7, std:4402537984.0
layer:8, std:71323615232.0
layer:9, std:1148104736768.0
layer:10, std:17911758454784.0
layer:11, std:283574846619648.0
layer:12, std:4480599809064960.0
layer:13, std:7.196814275405414e+16
layer:14, std:1.1507761512626258e+18
layer:15, std:1.853110740188555e+19
layer:16, std:2.9677725826641455e+20
layer:17, std:4.780376223769898e+21
layer:18, std:7.613223480799065e+22
layer:19, std:1.2092652108825478e+24
layer:20, std:1.923257075956356e+25
layer:21, std:3.134467063655912e+26
layer:22, std:5.014437766285408e+27
layer:23, std:8.066615144249704e+28
layer:24, std:1.2392661553516338e+30
layer:25, std:1.9455688099759845e+31
layer:26, std:3.0238180658999113e+32
layer:27, std:4.950357571077011e+33
layer:28, std:8.150925520353362e+34
layer:29, std:1.322983152787379e+36
layer:30, std:2.0786820453988485e+37
layer:31, std:nan
output is nan in 31 layers
tensor([[        inf, -2.6817e+38,         inf,  ...,         inf,
                 inf,         inf],
        [       -inf,        -inf,  1.4387e+38,  ..., -1.3409e+38,
         -1.9659e+38,        -inf],
        [-1.5873e+37,         inf,        -inf,  ...,         inf,
                -inf,  1.1484e+38],
        ...,
        [ 2.7754e+38, -1.6783e+38, -1.5531e+38,  ...,         inf,
         -9.9440e+37, -2.5132e+38],
        [-7.7184e+37,        -inf,         inf,  ..., -2.6505e+38,
                 inf,         inf],
        [        inf,         inf,        -inf,  ...,        -inf,
                 inf,  1.7432e+38]], grad_fn=<MmBackward>)

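The log shows that the standard deviation becomes nan at layer 31 (counting from 0), by which point many activations have already overflowed to inf.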

Next, a derivation based on variance explains why the standard deviation of the outputs keeps growing until it exceeds the representable range.
Before the derivation, recall three basic formulas:
(1) The expectation of the product of two independent random variables X and Y: $E(X * Y)=E(X) * E(Y)$
(2) The variance formula: $D(X)=E(X^{2})-[E(X)]^{2}$
(3) The variance of the sum of two independent random variables X and Y: $D(X+Y)=D(X)+D(Y)$
From these three formulas, the variance of the product of two independent random variables is:
$$D(X * Y)=D(X) * D(Y)+D(X) *[E(Y)]^{2}+D(Y) *[E(X)]^{2}$$
Here X and Y are assumed to have zero mean and unit standard deviation, i.e. $E(X)=0, E(Y)=0$, so the formula simplifies to:
$$D(X * Y)=D(X) * D(Y)$$
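The product-variance formula can be sanity-checked by sampling. The sketch below uses only the standard library; the means and standard deviations are arbitrary values picked for the check, and `variance` is a small helper defined here.

```python
import random

random.seed(0)
N = 100_000


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# General case: X ~ N(1, 2^2) and Y ~ N(-3, 0.5^2), drawn independently
xs = [random.gauss(1.0, 2.0) for _ in range(N)]
ys = [random.gauss(-3.0, 0.5) for _ in range(N)]
d_x, d_y, e_x, e_y = 4.0, 0.25, 1.0, -3.0
predicted = d_x * d_y + d_x * e_y ** 2 + d_y * e_x ** 2
empirical = variance([x * y for x, y in zip(xs, ys)])
print("general case  :", predicted, empirical)

# Zero-mean, unit-variance case: the formula reduces to D(X) * D(Y) = 1
us = [random.gauss(0.0, 1.0) for _ in range(N)]
vs = [random.gauss(0.0, 1.0) for _ in range(N)]
empirical_zero_mean = variance([u * v for u, v in zip(us, vs)])
print("zero-mean case:", 1.0, empirical_zero_mean)
```

The sampled variances should land within a few percent of the predicted values; the remaining gap is sampling noise.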

Now look at the standard deviation of a network layer. Take the first neuron of the first hidden layer, denoted $H_{11}$, which is computed as:
$$\mathrm{H}_{11}=\sum_{i=1}^{n} X_{i} * W_{1 i}$$
Applying $D(X * Y)=D(X) * D(Y)$ to each term and formula (3) to the sum, and using the fact that both X and W have zero mean and unit standard deviation, the variance of $H_{11}$ is:
$$D(\mathrm{H}_{11})=\sum_{i=1}^{n} D(X_{i}) * D(W_{1 i})=n *(1 * 1)=n$$
Here n is the number of neurons in the layer, and the two 1s are the variances of $X_i$ and $W_{1i}$: the input X has zero mean and unit standard deviation, and W follows a standard normal distribution. The standard deviation of $H_{11}$ is therefore:
$$\operatorname{std}(\mathrm{H}_{11})=\sqrt{D(\mathrm{H}_{11})}=\sqrt{n}$$
So the output of the first hidden layer has variance n while the input has variance 1: a single forward pass through one layer multiplies the variance by n and the standard deviation by $\sqrt{n}$. By the same argument, after the second hidden layer the standard deviation becomes n. Every further layer multiplies the scale by another factor of $\sqrt{n}$, until the values exceed the representable range and turn into nan. With n = 256 the factor is 16, which matches the log above, where the standard deviation grows from about 16 to about 256 to about 4107.
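The growth rate derived above also predicts where the overflow happens. The sketch below assumes float32 storage and unit-variance inputs and weights; `predicted_std` and `FLOAT32_MAX` are names introduced here for illustration.

```python
import math

n = 256                                  # neurons per layer, as in the experiment
FLOAT32_MAX = 3.4028234663852886e38      # largest finite float32 value


def predicted_std(layer_index, n=n):
    """Std predicted after layer `layer_index` (0-based): sqrt(n) ** (layer_index + 1)."""
    return math.sqrt(n) ** (layer_index + 1)


for i in (0, 1, 2, 30):
    print("layer", i, "predicted std", predicted_std(i))

# First layer whose predicted std no longer fits into a finite float32
first_overflow = next(i for i in range(100) if predicted_std(i) > FLOAT32_MAX)
print("predicted std exceeds float32 at layer", first_overflow)
```

The prediction for layer 30 is about 2.1e37, close to the logged 2.08e37, and the first layer whose predicted standard deviation exceeds the float32 limit is layer 31, the same layer at which the log reports nan.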

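The formula shows that the standard deviation of a layer's output is determined by three factors: n, the number of neurons per layer; the variance of X, the input; and the variance of W, the layer's weights. To keep the scale unchanged from layer to layer, the output variance of every layer must equal 1, because the variances are multiplied across layers and a product of 1s is still 1.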

To make the variance of each layer equal to 1, we require:
$$D(\mathrm{H}_{1})=n * D(X) * D(W)=1$$
With $D(X)=1$, the variance of W follows:
$$D(W)=\frac{1}{n} \Rightarrow \operatorname{std}(W)=\sqrt{\frac{1}{n}}$$
With this choice the output variance of every layer is 1.

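Back in the code, initialize the weights from a zero-mean distribution with standard deviation $\sqrt{\frac{1}{n}}$ and observe the standard deviation of each layer's output again: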

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std becomes nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, std=np.sqrt(1/self.neural_num))

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

The output is now:

layer:0, std:0.9974957704544067
layer:1, std:1.0024365186691284
layer:2, std:1.002745509147644
layer:3, std:1.0006227493286133
layer:4, std:0.9966009855270386
layer:5, std:1.019859790802002
layer:6, std:1.026173710823059
layer:7, std:1.0250457525253296
layer:8, std:1.0378952026367188
layer:9, std:1.0441951751708984
layer:10, std:1.0181655883789062
layer:11, std:1.0074602365493774
layer:12, std:0.9948930144309998
layer:13, std:0.9987586140632629
layer:14, std:0.9981392025947571
layer:15, std:1.0045733451843262
layer:16, std:1.0055204629898071
layer:17, std:1.0122840404510498
layer:18, std:1.0076017379760742
layer:19, std:1.000280737876892
layer:20, std:0.9943006038665771
layer:21, std:1.012800931930542
layer:22, std:1.012657642364502
layer:23, std:1.018149971961975
layer:24, std:0.9776086211204529
layer:25, std:0.9592394828796387
layer:26, std:0.9317858815193176
layer:27, std:0.9534041881561279
layer:28, std:0.9811319708824158
layer:29, std:0.9953019022941589
layer:30, std:0.9773916006088257
layer:31, std:0.9655940532684326
layer:32, std:0.9270440936088562
layer:33, std:0.9329946637153625
layer:34, std:0.9311841726303101
layer:35, std:0.9354336261749268
layer:36, std:0.9492132067680359
layer:37, std:0.9679954648017883
layer:38, std:0.9849981665611267
layer:39, std:0.9982335567474365
layer:40, std:0.9616852402687073
layer:41, std:0.9439758658409119
layer:42, std:0.9631161093711853
layer:43, std:0.958673894405365
layer:44, std:0.9675614237785339
layer:45, std:0.9837557077407837
layer:46, std:0.9867278337478638
layer:47, std:0.9920817017555237
layer:48, std:0.9650403261184692
layer:49, std:0.9991624355316162
layer:50, std:0.9946174025535583
layer:51, std:0.9662044048309326
layer:52, std:0.9827387928962708
layer:53, std:0.9887880086898804
layer:54, std:0.9932605624198914
layer:55, std:1.0237400531768799
layer:56, std:0.9702046513557434
layer:57, std:1.0045380592346191
layer:58, std:0.9943899512290955
layer:59, std:0.9900636076927185
layer:60, std:0.99446702003479
layer:61, std:0.9768352508544922
layer:62, std:0.9797843098640442
layer:63, std:0.9951220750808716
layer:64, std:0.9980446696281433
layer:65, std:1.0086933374404907
layer:66, std:1.0276142358779907
layer:67, std:1.0429234504699707
layer:68, std:1.0197855234146118
layer:69, std:1.0319130420684814
layer:70, std:1.0540012121200562
layer:71, std:1.026781439781189
layer:72, std:1.0331352949142456
layer:73, std:1.0666675567626953
layer:74, std:1.0413838624954224
layer:75, std:1.0733673572540283
layer:76, std:1.0404183864593506
layer:77, std:1.0344083309173584
layer:78, std:1.0022705793380737
layer:79, std:0.99835205078125
layer:80, std:0.9732587337493896
layer:81, std:0.9777462482452393
layer:82, std:0.9753198623657227
layer:83, std:0.9938382506370544
layer:84, std:0.9472599029541016
layer:85, std:0.9511011242866516
layer:86, std:0.9737769961357117
layer:87, std:1.005651831626892
layer:88, std:1.0043526887893677
layer:89, std:0.9889539480209351
layer:90, std:1.0130352973937988
layer:91, std:1.0030947923660278
layer:92, std:0.9993206262588501
layer:93, std:1.0342745780944824
layer:94, std:1.031973123550415
layer:95, std:1.0413124561309814
layer:96, std:1.0817031860351562
layer:97, std:1.128799557685852
layer:98, std:1.1617802381515503
layer:99, std:1.2215303182601929
tensor([[-1.0696, -1.1373,  0.5047,  ..., -0.4766,  1.5904, -0.1076],
        [ 0.4572,  1.6211,  1.9659,  ..., -0.3558, -1.1235,  0.0979],
        [ 0.3908, -0.9998, -0.8680,  ..., -2.4161,  0.5035,  0.2814],
        ...,
        [ 0.1876,  0.7971, -0.5918,  ...,  0.5395, -0.8932,  0.1211],
        [-0.0102, -1.5027, -2.6860,  ...,  0.6954, -0.1858, -0.8027],
        [-0.5871, -1.3739, -2.9027,  ...,  1.6734,  0.5094, -0.9986]],
       grad_fn=<MmBackward>)

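The output shows that the standard deviation of each layer stays close to 1. A suitable weight initialization can therefore keep the outputs of a deep fully connected network within a stable range, neither too large nor too small. The example shows that the output variance of every layer should be kept at 1, but so far activation functions have been ignored. The next step is weight initialization in the presence of activation functions.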

Now add a tanh activation in forward() and observe the network's output:

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.tanh(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std becomes nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, std=np.sqrt(1/self.neural_num))

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

Running the code gives:

layer:0, std:0.6273701786994934
layer:1, std:0.48910173773765564
layer:2, std:0.4099564850330353
layer:3, std:0.35637012124061584
layer:4, std:0.32117360830307007
layer:5, std:0.2981105148792267
layer:6, std:0.27730831503868103
layer:7, std:0.2589356303215027
layer:8, std:0.2468511462211609
layer:9, std:0.23721906542778015
layer:10, std:0.22171513736248016
layer:11, std:0.21079954504966736
layer:12, std:0.19820132851600647
layer:13, std:0.19069305062294006
layer:14, std:0.18555502593517303
layer:15, std:0.17953835427761078
layer:16, std:0.17485804855823517
layer:17, std:0.1702701896429062
layer:18, std:0.16508983075618744
layer:19, std:0.1591130942106247
layer:20, std:0.15480302274227142
layer:21, std:0.15263864398002625
layer:22, std:0.148549422621727
layer:23, std:0.14617665112018585
layer:24, std:0.13876433670520782
layer:25, std:0.13316625356674194
layer:26, std:0.12660598754882812
layer:27, std:0.12537944316864014
layer:28, std:0.12535445392131805
layer:29, std:0.1258980631828308
layer:30, std:0.11994212120771408
layer:31, std:0.11700888723134995
layer:32, std:0.11137298494577408
layer:33, std:0.11154613643884659
layer:34, std:0.10991233587265015
layer:35, std:0.10996390879154205
layer:36, std:0.10969001054763794
layer:37, std:0.10975217074155807
layer:38, std:0.11063199490308762
layer:39, std:0.11021336913108826
layer:40, std:0.10465587675571442
layer:41, std:0.10141163319349289
layer:42, std:0.1026025339961052
layer:43, std:0.10079070925712585
layer:44, std:0.10096712410449982
layer:45, std:0.10117629915475845
layer:46, std:0.10145658254623413
layer:47, std:0.09987485408782959
layer:48, std:0.09677786380052567
layer:49, std:0.099615179002285
layer:50, std:0.09867013245820999
layer:51, std:0.09398546814918518
layer:52, std:0.09388342499732971
layer:53, std:0.09352942556142807
layer:54, std:0.09336657077074051
layer:55, std:0.094817616045475
layer:56, std:0.08856320381164551
layer:57, std:0.09024856984615326
layer:58, std:0.0886448472738266
layer:59, std:0.08766943961381912
layer:60, std:0.08726290613412857
layer:61, std:0.08623497188091278
layer:62, std:0.08549781143665314
layer:63, std:0.08555219322443008
layer:64, std:0.08536665141582489
layer:65, std:0.08462796360254288
layer:66, std:0.08521939814090729
layer:67, std:0.08562128990888596
layer:68, std:0.08368432521820068
layer:69, std:0.08476376533508301
layer:70, std:0.08536301553249359
layer:71, std:0.08237562328577042
layer:72, std:0.08133520931005478
layer:73, std:0.08416961133480072
layer:74, std:0.08226993680000305
layer:75, std:0.08379077166318893
layer:76, std:0.08003699779510498
layer:77, std:0.07888863980770111
layer:78, std:0.07618381083011627
layer:79, std:0.07458438724279404
layer:80, std:0.07207277417182922
layer:81, std:0.07079191505908966
layer:82, std:0.0712786540389061
layer:83, std:0.07165778428316116
layer:84, std:0.06893911212682724
layer:85, std:0.06902473419904709
layer:86, std:0.07030880451202393
layer:87, std:0.07283663004636765
layer:88, std:0.07280216366052628
layer:89, std:0.07130247354507446
layer:90, std:0.07225216180086136
layer:91, std:0.0712454691529274
layer:92, std:0.07088855654001236
layer:93, std:0.0730612725019455
layer:94, std:0.07276969403028488
layer:95, std:0.07259569317102432
layer:96, std:0.0758652538061142
layer:97, std:0.07769152522087097
layer:98, std:0.07842093706130981
layer:99, std:0.08206242322921753
tensor([[-0.1103, -0.0739,  0.1278,  ..., -0.0508,  0.1544, -0.0107],
        [ 0.0807,  0.1208,  0.0030,  ..., -0.0385, -0.1887, -0.0294],
        [ 0.0321, -0.0833, -0.1482,  ..., -0.1133,  0.0206,  0.0155],
        ...,
        [ 0.0108,  0.0560, -0.1099,  ...,  0.0459, -0.0961, -0.0124],
        [ 0.0398, -0.0874, -0.2312,  ...,  0.0294, -0.0562, -0.0556],
        [-0.0234, -0.0297, -0.1155,  ...,  0.1143,  0.0083, -0.0675]],
       grad_fn=<TanhBackward>)

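The standard deviation of the layer outputs gets smaller and smaller as the data propagates forward, which leads to vanishing gradients. To handle weight initialization in the presence of activation functions, the Xavier and Kaiming methods were proposed.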

2. Xavier and Kaiming Initialization

2.1 Xavier Initialization

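The 2010 paper "Understanding the difficulty of training deep feedforward neural networks" discusses in detail how to initialize weights when activation functions are present. It follows the variance consistency principle, i.e. keeping the output variance of every layer as close to 1 as possible, and its analysis targets saturating activation functions such as Sigmoid and Tanh.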

The derivation in the paper yields two equations:
$$n_{i} * D(W)=1$$
$$n_{i+1} * D(W)=1$$
where $n_i$ is the number of input neurons and $n_{i+1}$ is the number of output neurons. The two equations come from considering forward propagation and backward propagation together. Combining them under the variance consistency principle gives the weight variance:
$$D(W)=\frac{2}{n_{i}+n_{i+1}}$$
Xavier initialization usually uses a uniform distribution. To derive its bounds, let the lower bound be $-a$ and the upper bound be $a$:
$$W \sim U[-a, a]$$
$$D(W)=\frac{(-a-a)^{2}}{12}=\frac{(2 a)^{2}}{12}=\frac{a^{2}}{3}$$
Combining the formulas above gives:
$$\frac{2}{n_{i}+n_{i+1}}=\frac{a^{2}}{3} \Rightarrow a=\frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}$$
$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}, \frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}\right]$$
The following code applies Xavier initialization and observes the layer outputs:

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.tanh(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std becomes nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                a = np.sqrt(6 / (self.neural_num + self.neural_num))  # Xavier bound: sqrt(6 / (n_i + n_{i+1}))
                tanh_gain = nn.init.calculate_gain('tanh')
                a *= tanh_gain
                nn.init.uniform_(m.weight.data, -a, a)

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

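Observing the output of this code, the standard deviation of each layer settles around a fixed value. PyTorch also provides a built-in implementation of Xavier initialization: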

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.tanh(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std becomes nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                tanh_gain = nn.init.calculate_gain('tanh')
                nn.init.xavier_uniform_(m.weight.data, gain=tanh_gain)

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

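Running the code above shows that it behaves like the hand-written Xavier initialization. Xavier provides an effective initialization for saturating activation functions such as Sigmoid and Tanh, but it is no longer suitable for the non-saturating activation function ReLU.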
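As a rough check that the built-in function follows the bound derived in this subsection, the sketch below compares a tensor initialized by `nn.init.xavier_uniform_` with the hand-computed value of a (scaled by the tanh gain). The tensor shape and variable names are assumptions for illustration.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in, fan_out = 256, 256
gain = nn.init.calculate_gain('tanh')
bound = gain * math.sqrt(6.0 / (fan_in + fan_out))   # a from the derivation, scaled by the gain

w = torch.empty(fan_out, fan_in)
nn.init.xavier_uniform_(w, gain=gain)

expected_std = bound / math.sqrt(3.0)                # std of U[-a, a] is a / sqrt(3)
max_abs = w.abs().max().item()
empirical_std = w.std().item()
print("bound:", bound, "max |w|:", max_abs)
print("expected std:", expected_std, "empirical std:", empirical_std)
```

All sampled weights stay inside [-a, a], and their standard deviation is close to $a/\sqrt{3}$, consistent with $D(W)=a^2/3$.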

2.2 Kaiming Initialization

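Because Xavier initialization does not work well with the non-saturating ReLU activation, the Kaiming initialization method was proposed in 2015. It also follows the variance consistency principle, keeping the scale of the data in a suitable range (usually a variance of 1), and it targets ReLU and its variants.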

For the ReLU activation, the derivation gives the weight variance:
$$D(W)=\frac{2}{n_{i}}$$
where $n_i$ is the number of input neurons. For ReLU variants, whose negative half-axis has a nonzero slope, the weight variance should be:
$$D(W)=\frac{2}{\left(1+a^{2}\right) * n_{i}}$$
where a is the slope of the negative half-axis. For plain ReLU this slope is 0, i.e. a = 0. The standard deviation of the weights is therefore:
$$\operatorname{std}(W)=\sqrt{\frac{2}{\left(1+a^{2}\right) * n_{i}}}$$
The following code implements Kaiming initialization:

import os
import torch
import random
import numpy as np
import torch.nn as nn
from toolss.common_tools import set_seed

set_seed(1)  # set the random seed


class MLP(nn.Module):  # build the fully connected model
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)  # Kaiming initialization targets ReLU

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):  # stop once the std becomes nan
                print("output is nan in {} layers".format(i))
                break
        return x

    def initialize(self):  # initialize the model parameters
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, std=np.sqrt(2 / self.neural_num))  # Kaiming: std = sqrt(2 / n_i)

layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

Likewise, PyTorch's init module provides a built-in Kaiming initialization:

nn.init.kaiming_normal_(m.weight.data)
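To see that the built-in call matches the standard deviation formula above, the sketch below initializes two tensors, one with the defaults (a = 0) and one with an assumed negative slope of 0.2, and compares their empirical standard deviations with $\sqrt{2 / ((1+a^2) * n_i)}$. The helper `kaiming_std` and the tensor shapes are introduced here for illustration.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in = 256


def kaiming_std(a, fan_in):
    """std(W) = sqrt(2 / ((1 + a^2) * n_i)) from the formula above."""
    return math.sqrt(2.0 / ((1.0 + a ** 2) * fan_in))


# Defaults: a=0, mode='fan_in', nonlinearity='leaky_relu' (equivalent to plain ReLU)
w_relu = torch.empty(256, fan_in)
nn.init.kaiming_normal_(w_relu)

# Leaky ReLU with an illustrative negative slope of 0.2
w_leaky = torch.empty(256, fan_in)
nn.init.kaiming_normal_(w_leaky, a=0.2, nonlinearity='leaky_relu')

std_relu = w_relu.std().item()
std_leaky = w_leaky.std().item()
print("a=0   expected:", kaiming_std(0.0, fan_in), "empirical:", std_relu)
print("a=0.2 expected:", kaiming_std(0.2, fan_in), "empirical:", std_leaky)
```

With the default `mode='fan_in'`, the number of input neurons $n_i$ is taken from the second dimension of the weight tensor.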

3. Common Initialization Methods

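Poor initialization causes vanishing or exploding gradients, and the model ends up unable to train properly. To avoid this, the scale of each layer's output must be controlled. The derivations above show that the output variance of every layer should be as close to 1 as possible, following the variance consistency principle. PyTorch provides ten weight initialization methods: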

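  1. Xavier uniform distribution;
  2. Xavier normal distribution;
  3. Kaiming uniform distribution;
  4. Kaiming normal distribution;
  5. uniform distribution;
  6. normal distribution;
  7. constant;
  8. orthogonal matrix initialization;
  9. identity matrix initialization;
  10. sparse matrix initialization;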

Which method to choose depends on the specific problem.
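The ten methods listed above appear to correspond to the following functions in `torch.nn.init`. The mapping is this sketch's reading of the list, and the tensor size and arguments (bounds, std, constant value, sparsity) are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def fresh():
    """Return a new uninitialized 64 x 64 weight tensor."""
    return torch.empty(64, 64)


initialized = {
    "xavier_uniform": nn.init.xavier_uniform_(fresh()),
    "xavier_normal": nn.init.xavier_normal_(fresh()),
    "kaiming_uniform": nn.init.kaiming_uniform_(fresh(), nonlinearity='relu'),
    "kaiming_normal": nn.init.kaiming_normal_(fresh(), nonlinearity='relu'),
    "uniform": nn.init.uniform_(fresh(), -0.1, 0.1),
    "normal": nn.init.normal_(fresh(), mean=0.0, std=0.02),
    "constant": nn.init.constant_(fresh(), 0.5),
    "orthogonal": nn.init.orthogonal_(fresh()),
    "eye": nn.init.eye_(fresh()),
    "sparse": nn.init.sparse_(fresh(), sparsity=0.9),
}

for name, w in initialized.items():
    print(f"{name:16s} mean={w.mean().item():+.4f} std={w.std().item():.4f}")
```

Each function modifies the tensor in place and returns it, so the same calls can be applied to `m.weight.data` inside an `initialize()` method like the ones used earlier.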

Finally, consider a special function, nn.init.calculate_gain:

nn.init.calculate_gain(nonlinearity, param=None)

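Main function

  • computes the gain of an activation function, i.e. how much the activation changes the scale of the data;

Main parameters

  • nonlinearity: name of the activation function;
  • param: parameter of the activation function, e.g. negative_slope for Leaky ReLU;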

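The gain is the ratio between the scale of the input and the scale of the output after the activation function. In the code below it is measured as the standard deviation of the input divided by the standard deviation of the output.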

The following code illustrates what the function does:

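import torch
import torch.nn as nn

x = torch.randn(10000)
out = torch.tanh(x)

# empirical gain: std of the input divided by std of the activation output
gain = x.std() / out.std()
print('gain:{}'.format(gain))

# gain recommended by PyTorch for tanh
tanh_gain = nn.init.calculate_gain('tanh')
print('tanh_gain in PyTorch:', tanh_gain)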
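For reference, the sketch below queries the gain for a few common activation functions and compares the ReLU and Leaky ReLU values with the Kaiming factor $\sqrt{2/(1+a^2)}$ from section 2.2. The negative slope of 0.2 is an arbitrary value used for illustration.

```python
import math

import torch.nn as nn

gains = {
    "linear": nn.init.calculate_gain('linear'),
    "sigmoid": nn.init.calculate_gain('sigmoid'),
    "tanh": nn.init.calculate_gain('tanh'),
    "relu": nn.init.calculate_gain('relu'),
    "leaky_relu(0.2)": nn.init.calculate_gain('leaky_relu', 0.2),
}

for name, value in gains.items():
    print(f"{name:16s} gain = {value:.4f}")

# The ReLU family follows sqrt(2 / (1 + a^2)), with a = 0 for plain ReLU
print("sqrt(2)            =", math.sqrt(2.0))
print("sqrt(2 / (1+0.04)) =", math.sqrt(2.0 / (1.0 + 0.2 ** 2)))
```

The value returned for tanh is the fixed constant 5/3, so an empirical ratio measured on random data, as in the snippet above, is not expected to match it exactly.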