1 Parameter Description
2 Backpropagation
To understand why the vanishing gradient arises, consider an extremely simple deep neural network in which every layer has a single neuron (see Section 7 of this post for the full theoretical derivation). Below is a network with three hidden layers:
$\sigma$ denotes the sigmoid activation function; the loss $L$ can equivalently be written as the cost $C$.
Note: the sigmoid derivative satisfies $\sigma'(z) \le 0.25$.
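This bound is easy to confirm numerically; a minimal sketch (my own check, not part of the original derivation) of the sigmoid and its derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-10, 10, 2001)
print(sigmoid_prime(0.0))      # maximum value, attained at z = 0: 0.25
print(sigmoid_prime(z).max())  # never exceeds 0.25 anywhere
```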
where
$$z_{1} = xw_{1} + b_{1},\qquad a_{1} = \sigma(z_{1})$$
$$z_{2} = a_{1}w_{2} + b_{2},\qquad a_{2} = \sigma(z_{2})$$
$$z_{3} = a_{2}w_{3} + b_{3},\qquad a_{3} = \sigma(z_{3})$$
$$z_{4} = a_{3}w_{4} + b_{4},\qquad a_{4} = y = \sigma(z_{4})$$
Substituting repeatedly, we obtain:
$$y = a_{4} = \sigma(z_{4}) = \sigma(a_{3}w_{4} + b_{4}) = \sigma(\sigma(a_{2}w_{3} + b_{3})w_{4} + b_{4}) = \sigma(\sigma(\sigma(a_{1}w_{2} + b_{2})w_{3} + b_{3})w_{4} + b_{4}) = \sigma(\sigma(\sigma(\sigma(xw_{1} + b_{1})w_{2} + b_{2})w_{3} + b_{3})w_{4} + b_{4})$$
$$C = f(y)$$
e.g.:
$$C = (y - \hat{y})^{2}$$
Then
$$C = f(\sigma(\sigma(\sigma(\sigma(xw_{1} + b_{1})w_{2} + b_{2})w_{3} + b_{3})w_{4} + b_{4}))$$
Some of the derivative relations between adjacent activations:
$$\frac{\partial a_{4}}{\partial a_{3}} = \sigma'(z_{4})w_{4},\qquad \frac{\partial a_{3}}{\partial a_{2}} = \sigma'(z_{3})w_{3},\qquad \frac{\partial a_{2}}{\partial a_{1}} = \sigma'(z_{2})w_{2}$$
Then the derivative of $C$ with respect to $w_{1}$ is:
$$\frac{\partial C}{\partial w_{1}} = \frac{\partial C}{\partial a_{4}}\frac{\partial a_{4}}{\partial a_{3}}\frac{\partial a_{3}}{\partial w_{1}} = \dots = \frac{\partial C}{\partial a_{4}}\frac{\partial a_{4}}{\partial a_{3}}\frac{\partial a_{3}}{\partial a_{2}}\frac{\partial a_{2}}{\partial a_{1}}\frac{\partial a_{1}}{\partial w_{1}}$$
$$= \frac{\partial C}{\partial a_{4}}\,\sigma'(z_{4})w_{4}\,\sigma'(z_{3})w_{3}\,\sigma'(z_{2})w_{2}\,\sigma'(z_{1})\,x$$
Similarly, the derivative of $C$ with respect to $b_{1}$ is:
$$\frac{\partial C}{\partial b_{1}} = \frac{\partial C}{\partial a_{4}}\,\sigma'(z_{4})w_{4}\,\sigma'(z_{3})w_{3}\,\sigma'(z_{2})w_{2}\,\sigma'(z_{1})$$
The derivative of $C$ with respect to $w_{2}$ is:
$$\frac{\partial C}{\partial w_{2}} = \frac{\partial C}{\partial a_{4}}\frac{\partial a_{4}}{\partial a_{3}}\frac{\partial a_{3}}{\partial w_{2}} = \dots = \frac{\partial C}{\partial a_{4}}\frac{\partial a_{4}}{\partial a_{3}}\frac{\partial a_{3}}{\partial a_{2}}\frac{\partial a_{2}}{\partial w_{2}}$$
$$= \frac{\partial C}{\partial a_{4}}\,\sigma'(z_{4})w_{4}\,\sigma'(z_{3})w_{3}\,\sigma'(z_{2})\,a_{1}$$
Similarly, the derivative of $C$ with respect to $b_{2}$ is:
$$\frac{\partial C}{\partial b_{2}} = \frac{\partial C}{\partial a_{4}}\,\sigma'(z_{4})w_{4}\,\sigma'(z_{3})w_{3}\,\sigma'(z_{2})$$
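These closed forms are straightforward to check against autograd. The sketch below is a minimal verification of the product formula for $\frac{\partial C}{\partial w_{1}}$; the concrete values of $x$, the weights, the biases, and the target $\hat{y}$ are my own illustrative choices:

```python
import torch

# an assumed toy instantiation of the 4-layer single-neuron chain above
x = torch.tensor(0.5)
w = [torch.tensor(v, requires_grad=True) for v in (0.8, 1.2, -0.7, 0.9)]
b = [torch.tensor(v, requires_grad=True) for v in (0.1, -0.2, 0.3, 0.0)]

# forward pass: z_k = a_{k-1} w_k + b_k, a_k = sigmoid(z_k)
a, zs = x, []
for wk, bk in zip(w, b):
    z = a * wk + bk
    zs.append(z)
    a = torch.sigmoid(z)

y_hat = torch.tensor(1.0)
C = (a - y_hat) ** 2  # C = (y - y_hat)^2, as in the example above
C.backward()

# closed form: dC/dw1 = dC/da4 * sigma'(z4) w4 * sigma'(z3) w3 * sigma'(z2) w2 * sigma'(z1) * x
sp = [(torch.sigmoid(z) * (1 - torch.sigmoid(z))).detach() for z in zs]
manual = (2 * (a.detach() - y_hat) * sp[3] * w[3].detach() * sp[2] * w[2].detach()
          * sp[1] * w[1].detach() * sp[0] * x)
print(torch.allclose(w[0].grad, manual))  # True
```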
3 Vanishing Gradients
Comparing $\frac{\partial C}{\partial b_{1}}$ with $\frac{\partial C}{\partial b_{3}}$: the former contains the extra factors $w_{3}\sigma'(z_{2})\,w_{2}\sigma'(z_{1})$, so $\frac{\partial C}{\partial b_{1}}$ is far smaller than $\frac{\partial C}{\partial b_{3}}$.
Hence the root cause of vanishing gradients is the constraint $|w_{j}\sigma'(z_{j})| < \frac{1}{4}$ (which holds whenever $|w_{j}| < 1$, since $\sigma'(z) \le \frac{1}{4}$): the gradient of an early layer is a product of many such factors, so it shrinks exponentially with depth.
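A tiny numerical sketch makes the exponential decay concrete; the weight value 0.9 and the pre-activations $z_j = 0$ are my own illustrative assumptions (chosen so each factor is as large as the bound allows for that weight):

```python
import numpy as np

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)), maximized at z = 0
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

# each extra layer multiplies the early-layer gradient by w_j * sigma'(z_j);
# with w_j = 0.9 and z_j = 0 that factor is 0.9 * 0.25 = 0.225 < 1
factor = 0.9 * sigmoid_prime(0.0)
for depth in (2, 10, 30):
    print(depth, factor ** depth)  # shrinks exponentially with depth
```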
4 Exploding Gradients
If instead the network's weights are set large and the biases keep the $\sigma'(z_{j})$ terms from becoming too small, the factors $|w_{j}\sigma'(z_{j})|$ exceed 1 and the gradient grows exponentially toward the early layers.
5 Unstable Gradients
The unstable-gradient problem: the fundamental issue is neither the vanishing nor the exploding gradient per se, but the fact that the gradient in early layers is a product of terms contributed by all the later layers. With many layers, this is an intrinsically unstable situation. The only way for every layer to learn at roughly the same speed is for these products of terms to stay balanced, and without some mechanism or deeper guarantee enforcing that balance, the network is very likely to be unstable. In short, neural networks suffer from unstable gradients; with a standard gradient-based learning algorithm, different layers of the network will therefore learn at very different speeds.
6 Remedies for Vanishing and Exploding Gradients
- Option 1: pre-training plus fine-tuning (Hinton)
- Option 2: gradient clipping (a threshold that controls explosion) and regularization (controls explosion)
- Option 3: ReLU, LeakyReLU, ELU, and similar activation functions
- Option 4: batchnorm (short for batch normalization, BN: normalizes the signal to zero mean and unit variance, stabilizing the network)
- Option 5: residual connections
- Option 6: LSTM
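Of these, gradient clipping is the easiest to show in a few lines. The sketch below is a minimal PyTorch illustration; the toy model, the 1e6 loss scaling used to provoke a large gradient, and the max_norm threshold are all my own assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.Sigmoid(), nn.Linear(8, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = nn.MSELoss()(model(x), y) * 1e6  # scaled up to provoke a huge gradient
loss.backward()

# rescale all gradients so that their global L2 norm is at most max_norm;
# clip_grad_norm_ returns the norm as it was *before* clipping
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
post_clip_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
print(pre_clip_norm.item(), post_clip_norm.item())  # large vs. <= 1.0
opt.step()
```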
7 Derivation of the Backpropagation Formulas
Section 2 derived the backpropagation formulas on a toy network; this section gives a detailed derivation on a more general network (material from Backpropagation).
7.1 Problem Statement
7.2 Chain rule
In Section 2 we applied the chain rule as $\frac{\partial C}{\partial w} = \frac{\partial C}{\partial a}\frac{\partial a}{\partial w}$, which is the more intuitive form. In this section we instead use $\frac{\partial C}{\partial w} = \frac{\partial C}{\partial z}\frac{\partial z}{\partial w}$; the two differ only superficially. Both $a$ and $z$ are merely intermediate variables: one is a layer's output, the other a layer's input.
7.2.1 First Term
The first factor of the chain rule splits into two cases: layer $> 1$ and layer $= 1$.
7.2.2 Second Term
We use the symbol $\delta$ for the second factor of the chain rule.
Together with the figure below, solving for the Second Term comes down to two questions:
- How to compute $\delta^{L}$
- The relation between $\delta^{l}$ and $\delta^{l+1}$
1) How to compute $\delta^{L}$
2) The relation between $\delta^{l}$ and $\delta^{l+1}$
Simplifying a little:
Backpropagation now reveals its true face.
Simplifying further and writing it in vector form:
7.3 Summary
7.4 Compare with forward propagation
7.5 Verification
In Section 2, the derivative of $C$ with respect to $w_{1}$ was:
$$\frac{\partial C}{\partial w_{1}} = \frac{\partial C}{\partial a_{4}}\,\sigma'(z_{4})w_{4}\,\sigma'(z_{3})w_{3}\,\sigma'(z_{2})w_{2}\,\sigma'(z_{1})\,x$$
In Section 7, the derivative of $C$ with respect to $w_{1}$ is:
$$\frac{\partial C}{\partial w_{1}} = \frac{\partial C}{\partial z_{1}}\frac{\partial z_{1}}{\partial w_{1}} = \delta^{1}x$$
$$= \sigma'(z_{1})w_{2}\,\delta^{2}x$$
$$= \sigma'(z_{1})w_{2}\,\sigma'(z_{2})w_{3}\,\delta^{3}x$$
$$= \sigma'(z_{1})w_{2}\,\sigma'(z_{2})w_{3}\,\sigma'(z_{3})w_{4}\,\delta^{4}x$$
$$= \sigma'(z_{1})w_{2}\,\sigma'(z_{2})w_{3}\,\sigma'(z_{3})w_{4}\,\sigma'(z_{4})\frac{\partial C}{\partial y}x$$
$$= \sigma'(z_{1})w_{2}\,\sigma'(z_{2})w_{3}\,\sigma'(z_{3})w_{4}\,\sigma'(z_{4})\frac{\partial C}{\partial a_{4}}x$$
The result is the same, so the derivation checks out!
We can see that the gradient depends on the input, on the weight magnitudes, and on the derivative of the activation function!
- The input must not be too large: a large input saturates the sigmoid, making its derivative very small. It must not be too small either, because the gradient also scales with the input, and a tiny input again yields tiny gradients that weaken the network's ability to learn. A good suggestion is to rescale the inputs into the range 0.0-1.0.
- Design the outputs to match what the activation function can actually produce, avoiding values it can never reach. For example, the sigmoid can only output values between 0 and 1; if the targets fall outside this range, there is a risk of driving the weights to excessively large values!
- Weight initialization: if the initial weights are too large, the activation function saturates and its derivative becomes tiny; too small, and the gradient (which also scales with the weights) again becomes too small, hurting learning. One option is to draw the initial weights uniformly at random from -1.0 to 1.0. The rule of thumb mathematicians have arrived at is to sample the initial weights roughly within the range given by the inverse square root of the number of incoming links to a node. E.g., if each node has 3 incoming links, the initial weights should range from $-1/\sqrt{3}$ to $1/\sqrt{3}$, i.e. $\pm 0.577$. (*Make Your Own Neural Network* (《Python 神经网络编程》), Tariq Rashid)
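The rule of thumb in the last bullet can be sketched in a few lines of PyTorch; the layer sizes here are my own example, and this is just one of several reasonable schemes (Xavier/Kaiming initializers are close relatives):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def init_inv_sqrt_fan_in(m):
    # sample each weight uniformly from [-1/sqrt(n_in), 1/sqrt(n_in)],
    # where n_in is the number of incoming links to a node
    if isinstance(m, nn.Linear):
        bound = 1.0 / m.in_features ** 0.5
        nn.init.uniform_(m.weight, -bound, bound)

layer = nn.Linear(in_features=3, out_features=2, bias=False)
layer.apply(init_inv_sqrt_fan_in)
print(layer.weight)  # every entry lies in [-0.577, 0.577]
```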
8 Demo
8.1 Demo1
Let's start with a simple example (from Backpropagation(反向传播)方法介绍).
8.2 Demo2
Now a slightly more complex one.
Applying the notation from before, see the parts in red in the figure below!
Image source: *Make Your Own Neural Network* (《Python 神经网络编程》), Tariq Rashid (in the last row of the figure above, $a^{3}$ should read $a^{2}$).
Let's compute the gradient of $W_{1,1}$, assuming
$$C = \frac{1}{2}(y-a)^{2} = \sum_{i=1}^{2}\frac{1}{2}(y_{i}-a_{i}^{3})^{2} = \sum_{i=1}^{2}\frac{1}{2}e_{i}^{2}$$
where $y$ is the ground truth, $a$ is the neural network's output, and $e = (y-a)$.
The input to neuron 1 of hidden layer 2 is $a_{1}^{2}$, i.e. $o_{j=1}$.
$$\begin{aligned} \frac{\partial C}{\partial W_{1,1}} & = \frac{\partial C}{\partial a_{1}^{3}} \cdot \sigma'(z_{1}^{3}) \cdot a_{1}^{2} \\ & = \frac{\partial C}{\partial a_{1}^{3}} \cdot \sigma'(W_{1,1}a_{1}^{2} + W_{2,1}a_{2}^{2}) \cdot a_{1}^{2} \\ & = \frac{\partial C}{\partial a_{1}^{3}} \cdot \sigma(W_{1,1}a_{1}^{2} + W_{2,1}a_{2}^{2})\bigl(1-\sigma(W_{1,1}a_{1}^{2} + W_{2,1}a_{2}^{2})\bigr) \cdot a_{1}^{2} \\ & = -e_{1} \cdot \sigma(W_{1,1}o_{j=1} + W_{2,1}o_{j=2})\bigl(1-\sigma(W_{1,1}o_{j=1} + W_{2,1}o_{j=2})\bigr) \cdot o_{j=1} \\ & = -0.8 \times \sigma(2.3) \times (1-\sigma(2.3)) \times 0.4 \\ & = -0.0265 \end{aligned}$$
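The arithmetic in the last two lines can be sanity-checked in a couple of lines (my own quick check, reusing the numbers from the figure):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

e1 = 0.8   # error of output node 1
o1 = 0.4   # o_{j=1}, the output feeding into W_{1,1}
z = 2.3    # W_{1,1} * o_{j=1} + W_{2,1} * o_{j=2}

grad = -e1 * sigmoid(z) * (1 - sigmoid(z)) * o1
print(round(grad, 4))  # -0.0265
```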
8.3 Demo3
Finally, an example that puts the theory into practice; the learning reference is Backpropagation(反向传播)方法介绍.
First define a simple MLP. The output is out; the intermediate variables y1 and y2 are kept only so that we can compute the gradients by hand.
```python
import torch
import torch.nn as nn
import numpy as np
from torch.autograd import Variable


class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.l1 = nn.Linear(in_features=1, out_features=2, bias=False)
        self.l2 = nn.Linear(in_features=2, out_features=2, bias=False)
        self.act = nn.Sigmoid()

    def forward(self, x):
        out = self.l1(x)
        # print("l1:", out)  # l1: tensor([1., 2.], grad_fn=<SqueezeBackward3>)
        out = self.act(out)
        # print("act:", out)  # act: tensor([0.7311, 0.8808], grad_fn=<SigmoidBackward0>)
        y1 = out.detach().numpy().reshape(2, 1)
        # print(y1)
        """
        [[0.7310586]
         [0.880797 ]]
        """
        out = self.l2(out)
        # print("l2:", out)  # l2: tensor([5.7164, 8.9401], grad_fn=<SqueezeBackward3>)
        out = self.act(out)
        # print("act:", out)  # act: tensor([0.9967, 0.9999], grad_fn=<SigmoidBackward0>)
        y2 = out.detach().numpy().reshape(2, 1)
        # print(y2)
        """
        [[0.9967192]
         [0.999869 ]]
        """
        # hand-computed gradient w.r.t. W2: delta2 @ y1^T, delta2 = (y2 - label) * y2 * (1 - y2)
        grad_W2 = np.dot((y2 - [[1], [0]]) * y2 * (1 - y2), y1.T)
        # print("grad W2:", grad_W2)
        """
        grad W2: [[-7.84312986e-06 -9.44959200e-06]
                  [ 9.57516307e-05  1.15363874e-04]]
        """
        # hand-computed gradient w.r.t. W1: (W2^T @ delta2) * y1 * (1 - y1) * x
        grad_W1 = np.dot([[3, 5], [4, 6]], (y2 - [[1], [0]]) * y2 * (1 - y2)) * y1 * (1 - y1) * [1]
        # print("grad W1:", grad_W1)
        """
        grad W1: [[1.22429862e-04]
                  [7.80046216e-05]]
        """
        return out
```
The network structure is as follows.
Next, initialize the weights:
```python
def init_weights(m):
    if type(m) == nn.Linear:
        # torch.nn.init.xavier_uniform_(m.weight)
        torch.nn.init.uniform_(m.weight, a=0, b=1)


model = CNN()
# model.apply(init_weights)
for num, (name, params) in enumerate(model.named_parameters()):
    print(num, "-", name, "-", params.data.shape, "\n", params.data)
```
output

```
0 - l1.weight - torch.Size([2, 1])
tensor([[ 0.0711],
        [-0.0089]])
1 - l2.weight - torch.Size([2, 2])
tensor([[ 0.1987,  0.5589],
        [ 0.2070, -0.1293]])
```
If the weights are not set explicitly they are randomly initialized; def init_weights can specify an initialization scheme. To make the gradients easy to compute by hand, we set the weights to the following simple values:
```python
model.l1.weight.data = torch.FloatTensor([[1], [2]])
model.l2.weight.data = torch.FloatTensor([[3, 4], [5, 6]])
for num, (name, params) in enumerate(model.named_parameters()):
    print(num, "-", name, "-", params.data.shape, "\n", params.data)
```
output

```
0 - l1.weight - torch.Size([2, 1])
tensor([[1.],
        [2.]])
1 - l2.weight - torch.Size([2, 2])
tensor([[3., 4.],
        [5., 6.]])
```
Run the forward pass and compute the loss:
```python
"======forward======"
input_data = Variable(torch.FloatTensor([1]))
out = model(input_data)
label = Variable(torch.FloatTensor([1, 0]))
loss_fn = nn.MSELoss(reduction="mean")
loss = loss_fn(out, label)
print("loss:", loss)
```
output

```
loss: tensor(0.4999, grad_fn=<MseLossBackward0>)
```
At this point there are no gradients yet, and the weights have not been updated:
```python
print(model.l1.weight.grad)
print(model.l2.weight.grad)
for num, (name, params) in enumerate(model.named_parameters()):
    print(num, "-", name, "-", params.data.shape, "\n", params.data)
```
output

```
None
None
0 - l1.weight - torch.Size([2, 1])
tensor([[1.],
        [2.]])
1 - l2.weight - torch.Size([2, 2])
tensor([[3., 4.],
        [5., 6.]])
```
Run the backward pass and inspect the gradients:
```python
"======backward======"
loss.backward()
print(model.l1.weight.grad)
print(model.l2.weight.grad)
for num, (name, params) in enumerate(model.named_parameters()):
    print(num, "-", name, "-", params.data.shape, "\n", params.data)
```
output

```
tensor([[1.2243e-04],
        [7.8005e-05]])
tensor([[-7.8431e-06, -9.4496e-06],
        [ 9.5752e-05,  1.1536e-04]])
0 - l1.weight - torch.Size([2, 1])
tensor([[1.],
        [2.]])
1 - l2.weight - torch.Size([2, 2])
tensor([[3., 4.],
        [5., 6.]])
```
Now the gradients exist, but the weights are still unchanged; let's update them:
```python
"======update weight======"
optimiser = torch.optim.SGD(params=model.parameters(), lr=1)
optimiser.step()
optimiser.zero_grad()
print(model.l1.weight.grad)
print(model.l2.weight.grad)
for num, (name, params) in enumerate(model.named_parameters()):
    print(num, "-", name, "-", params.data.shape, "\n", params.data)
```
output

```
tensor([[0.],
        [0.]])
tensor([[0., 0.],
        [0., 0.]])
0 - l1.weight - torch.Size([2, 1])
tensor([[0.9999],
        [1.9999]])
1 - l2.weight - torch.Size([2, 2])
tensor([[3.0000, 4.0000],
        [4.9999, 5.9999]])
```
OK, we have now run one complete forward and backward pass.
Let's verify the hand-computed weight gradients; the answers to match are:
```
model.l1.weight.grad
tensor([[1.2243e-04],
        [7.8005e-05]])
model.l2.weight.grad
tensor([[-7.8431e-06, -9.4496e-06],
        [ 9.5752e-05,  1.1536e-04]])
```
Derivation of the formulas
The figure above already gives the formulas for $\frac{\partial loss}{\partial y_2}$ and $\frac{\partial y_2}{\partial z_2}$; below we derive $\frac{\partial z_2}{\partial w_2}$.
$$\vec{y_1} = \begin{bmatrix} 0.7311 \\ 0.8808 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix},\qquad \vec{z_2} = \begin{bmatrix} 5.7164 \\ 8.9401 \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}$$
$$W_2 = \begin{bmatrix} 3 & 4 \\ 5 & 6 \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}$$
$$\vec{z_2} = W_2 \cdot \vec{y_1} = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} w_{11}y_1 + w_{12}y_2 \\ w_{21}y_1 + w_{22}y_2 \end{bmatrix}$$
By formulas (6) and (14) in the appendix of this post:
$$\frac{\partial \vec{z_2}}{\partial W_2} = \begin{bmatrix} \dfrac{\partial z_1}{\partial W_2} \\[4pt] \dfrac{\partial z_2}{\partial W_2} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial (w_{11}y_1 + w_{12}y_2)}{\partial w_{11}} & \dfrac{\partial (w_{11}y_1 + w_{12}y_2)}{\partial w_{12}} \\[4pt] \dfrac{\partial (w_{21}y_1 + w_{22}y_2)}{\partial w_{21}} & \dfrac{\partial (w_{21}y_1 + w_{22}y_2)}{\partial w_{22}} \end{bmatrix} = \begin{bmatrix} y_1 & y_2 \\ y_1 & y_2 \end{bmatrix} = \begin{bmatrix} 0.7311 & 0.8808 \\ 0.7311 & 0.8808 \end{bmatrix}$$
The result checks out.
Rule of thumb: differentiating a matrix product with respect to one factor yields the transpose of the other factor.
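This rule of thumb is easy to verify numerically. The sketch below uses the $W_2$ and $\vec{y_1}$ values from above but an assumed upstream gradient $\delta$; it checks the outer-product form of the weight gradient against a finite difference:

```python
import numpy as np

W = np.array([[3.0, 4.0], [5.0, 6.0]])  # W_2 from above
y = np.array([[0.7311], [0.8808]])      # y_1 from above
delta = np.array([[0.1], [-0.2]])       # assumed upstream gradient dL/dz

grad_W = delta @ y.T  # dL/dW = delta @ y^T (transpose of the other factor)
grad_y = W.T @ delta  # dL/dy = W^T @ delta

def loss(Wm):
    # L = delta^T (W y), a linear functional with dL/dz = delta
    return float((delta * (Wm @ y)).sum())

# finite-difference check on the (0, 1) entry of W
eps = 1e-6
Wp = W.copy()
Wp[0, 1] += eps
fd = (loss(Wp) - loss(W)) / eps
print(abs(fd - grad_W[0, 1]) < 1e-6)  # True
```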
The final result for $\frac{\partial loss}{\partial w_2}$ already appears in the code inside def forward; grad_W2 is exactly

```
grad W2: [[-7.84312986e-06 -9.44959200e-06]
 [ 9.57516307e-05  1.15363874e-04]]
```
$\frac{\partial loss}{\partial w_1}$ is more involved than $\frac{\partial loss}{\partial w_2}$, but part of the computation is shared.
The part boxed in the figure above is shared with $\frac{\partial loss}{\partial w_2}$; only the three remaining partial derivatives need to be computed:
- $\frac{\partial z_2}{\partial y_1} = w_2^{T}$ (the transpose of $w_2$)
- $\frac{\partial y_1}{\partial z_1}$ is computed like $\frac{\partial y_2}{\partial z_2}$; the result is $\mathrm{sigmoid}(z_1)\cdot[1-\mathrm{sigmoid}(z_1)] = y_1\cdot(1-y_1)$
- $\frac{\partial z_1}{\partial w_1} = x^{T}$
The final result for $\frac{\partial loss}{\partial w_1}$ likewise appears in def forward; grad_W1 is exactly

```
grad W1: [[1.22429862e-04]
 [7.80046216e-05]]
loss: tensor(0.4999, grad_fn=<MseLossBackward0>)
```
Appendix: Matrix Calculus
References
[1] [Machine Learning] 深度学习中消失的梯度
[2] 神经网络梯度消失的解释
[3] 梯度下降法解神经网络
[5] Backpropagation (Hung-yi Lee's lecture slides)