Computational Graphs
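Any function can be represented as a computational graph, in which each node is one step of the computation to be performed. In the linear classifier shown in the figure above, the inputs are $x$ and $W$, and the $*$ node denotes the matrix multiplication $W*x$, which outputs the score vector. Another node is the hinge loss, which computes the data loss term $L_{i}$, and the regularization term sits at the bottom right. The total loss $L$ at the end is the sum of the regularization term and the data term.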
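The graph described above can be sketched as a forward pass in which each function stands for one node. This is only an illustration under assumptions not stated in the text: the hinge loss is taken to be the multiclass form with a margin of 1, the regularizer is taken to be L2, and `reg_strength`, the weights, and the input are made-up values.

```python
def matvec(W, x):
    """The * node: multiply matrix W (a list of rows) by vector x."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]


def hinge_loss(scores, correct_class, margin=1.0):
    """The hinge loss node: data loss L_i (margin=1.0 is an assumption)."""
    s_correct = scores[correct_class]
    return sum(
        max(0.0, s - s_correct + margin)
        for j, s in enumerate(scores)
        if j != correct_class
    )


def l2_regularization(W):
    """The regularization node (assumed to be the sum of squared weights)."""
    return sum(w * w for row in W for w in row)


def total_loss(W, x, correct_class, reg_strength=0.1):
    """The final + node: total loss L = data term + regularization term."""
    scores = matvec(W, x)
    data_loss = hinge_loss(scores, correct_class)
    reg_loss = reg_strength * l2_regularization(W)
    return data_loss + reg_loss


# Hypothetical 3-class example with a 2-dimensional input.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [1.0, 2.0]
loss = total_loss(W, x, correct_class=1)
print(loss)
```

Writing each node as its own function mirrors the graph: the backward pass only needs the local derivative of each of these small steps.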
Once the computational graph is drawn, the chain rule gives the gradient at every node.
Chain rule for (x+y)z
Let $f(x, y, z)=(x+y)z$ and $q(x, y)=x+y$, so that $f=qz$ and $\frac{\partial f}{\partial q}=z$. Then

$$\frac{\partial f}{\partial x}=\frac{\partial f}{\partial q}\times \frac{\partial q}{\partial x}=z\times 1=z$$

$$\frac{\partial f}{\partial y}=\frac{\partial f}{\partial q}\times \frac{\partial q}{\partial y}=z\times 1=z$$

$$\frac{\partial f}{\partial z}=q=x+y$$
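A minimal sketch of these three formulas as a forward pass followed by a backward pass. The function name `forward_backward` and the input values are illustrative choices, not taken from the text.

```python
def forward_backward(x, y, z):
    """Evaluate f = (x + y) * z and its gradients with the chain rule."""
    # Forward pass: compute the intermediate node q, then the output f.
    q = x + y
    f = q * z

    # Backward pass: local derivatives, multiplied along each path.
    df_dq = z              # f = q * z, so df/dq = z
    df_dz = q              # f = q * z, so df/dz = q = x + y
    df_dx = df_dq * 1.0    # dq/dx = 1
    df_dy = df_dq * 1.0    # dq/dy = 1
    return f, (df_dx, df_dy, df_dz)


# Illustrative inputs (assumed values).
f, grads = forward_backward(-2.0, 5.0, -4.0)
print(f, grads)
```

With these inputs, $q=3$, so the gradients with respect to $x$ and $y$ both equal $z=-4$ and the gradient with respect to $z$ equals $q=3$.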
A worked example of computing gradients backward
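The forward pass through the computational graph is shown in the figure. The backward pass proceeds as follows:
The gradient at the output, where the backward pass starts, is 1.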
Let $f(x)=\frac{1}{x}$; its derivative is $f'(x)=-\frac{1}{x^{2}}$. Substituting $x=1.37$ gives $f'(x)=-0.53$, so the gradient at this node is $-0.53\times 1=-0.53$.
Let $f(x)=x+1$; its derivative is $f'(x)=1$, so the gradient at this node is $1\times (-0.53)=-0.53$.
Let $f(x)=e^{x}$; its derivative is $f'(x)=e^{x}$. Substituting $x=-1$ gives $f'(x)=0.37$, so the gradient at this node is $0.37\times (-0.53)\approx -0.2$.
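The three steps above can be reproduced numerically. The sketch below starts from the input $-1$ of the exponential node given in the text and rebuilds the intermediate values from it; the variable names are illustrative. Because it uses unrounded intermediate values, the results match the rounded figures in the text only to about two decimal places.

```python
import math

# Forward pass through the three nodes, from the exp input upward.
a = -1.0            # input to the exp node (value from the text)
b = math.exp(a)     # exp node output, about 0.37
c = b + 1.0         # +1 node output, about 1.37
d = 1.0 / c         # 1/x node output

# Backward pass: each gradient is (local derivative) * (upstream gradient).
grad_d = 1.0                        # gradient at the output
grad_c = (-1.0 / c ** 2) * grad_d   # through f(x) = 1/x
grad_b = 1.0 * grad_c               # through f(x) = x + 1
grad_a = math.exp(a) * grad_b       # through f(x) = e^x

print(round(grad_c, 2), round(grad_b, 2), round(grad_a, 2))
```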
Continuing in the same way, we obtain all the gradients:
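The boxed part of the figure above is in fact the sigmoid function. Instead of working backward one node at a time from the output to the point where the gradient is 0.20, we can obtain that gradient directly from the derivative of the sigmoid.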
Derivative of the sigmoid
$$\sigma(x)=\frac{1}{1+e^{-x}}$$

$$\frac{d\sigma(x)}{dx}=\frac{e^{-x}}{(1+e^{-x})^{2}}=\left(\frac{1+e^{-x}-1}{1+e^{-x}}\right)\left(\frac{1}{1+e^{-x}}\right)=(1-\sigma(x))\,\sigma(x)$$
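As a quick sanity check, the closed form $(1-\sigma(x))\sigma(x)$ can be compared with a finite-difference estimate. This is a sketch: the helper names are illustrative, and the evaluation point $x=1$ is an assumption, chosen because it is consistent with the $e^{-1}$ step in the worked example.

```python
import math


def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Closed-form derivative: (1 - sigma(x)) * sigma(x)."""
    s = sigmoid(x)
    return (1.0 - s) * s


def numerical_grad(f, x, h=1e-5):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x0 = 1.0  # assumed evaluation point
print(round(sigmoid(x0), 2), round(sigmoid_grad(x0), 2))
print(abs(sigmoid_grad(x0) - numerical_grad(sigmoid, x0)))
```

At this point the closed form gives roughly 0.20 in a single step, which is the shortcut the text describes.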
Backpropagation with vectors
As shown in the figure above, differentiating $f$ with respect to $q_{i}$ gives $\frac{\partial f}{\partial q_{i}}=2q_{i}$, so the backward pass yields the gradient $\begin{bmatrix} 0.44 \\ 0.52 \end{bmatrix}$ with respect to $q$.
Differentiating $q_{1}$ (that is, $W_{1, 1}x_{1}+W_{1, 2}x_{2}$) with respect to $W_{1, 1}$ gives

$$\frac{\partial q_{1}}{\partial W_{1, 1}}=x_{1}=0.2$$

Differentiating $q_{1}$ with respect to $W_{1, 2}$ gives

$$\frac{\partial q_{1}}{\partial W_{1, 2}}=x_{2}=0.4$$

Differentiating $q_{1}$ with respect to $W_{2, 1}$ gives

$$\frac{\partial q_{1}}{\partial W_{2, 1}}=0$$

Differentiating $q_{1}$ with respect to $W_{2, 2}$ gives

$$\frac{\partial q_{1}}{\partial W_{2, 2}}=0$$

Similarly, $\frac{\partial q_{2}}{\partial W_{1, 1}}=0$, $\frac{\partial q_{2}}{\partial W_{1, 2}}=0$, $\frac{\partial q_{2}}{\partial W_{2, 1}}=x_{1}=0.2$, and $\frac{\partial q_{2}}{\partial W_{2, 2}}=x_{2}=0.4$.
That is:

$$\frac{\partial q_{k}}{\partial W_{i, j}}=1_{k=i}\,x_{j}$$

where $1_{k=i}$ is the indicator: $1_{k=i}=1$ if $k=i$, and $0$ otherwise.
Therefore:

$$\frac{\partial f}{\partial W_{i, j}}=\sum_{k}\frac{\partial f}{\partial q_{k}}\frac{\partial q_{k}}{\partial W_{i, j}}=\sum_{k}(2q_{k})(1_{k=i}\,x_{j})=2q_{i}x_{j}$$
So:

$$\frac{\partial f}{\partial W_{1, 1}}=2q_{1}x_{1}=0.088$$

$$\frac{\partial f}{\partial W_{1, 2}}=2q_{1}x_{2}=0.176$$

$$\frac{\partial f}{\partial W_{2, 1}}=2q_{2}x_{1}=0.104$$

$$\frac{\partial f}{\partial W_{2, 2}}=2q_{2}x_{2}=0.208$$
This gives:

$$\frac{\partial f}{\partial W}=\begin{bmatrix} 0.088 & 0.176 \\ 0.104 & 0.208 \end{bmatrix}$$
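The matrix above can be reproduced with a few lines of plain Python. This sketch uses the values $W_{1,1}=0.1$, $W_{1,2}=0.5$, $W_{2,1}=-0.3$, $W_{2,2}=0.8$, $x_{1}=0.2$, $x_{2}=0.4$ that appear in the example. It never evaluates $f$ itself; it relies only on $\frac{\partial f}{\partial q_{i}}=2q_{i}$ from the text. The function name `grad_wrt_W` is hypothetical.

```python
W = [[0.1, 0.5], [-0.3, 0.8]]
x = [0.2, 0.4]


def grad_wrt_W(W, x):
    """Return q = W x and df/dW, where df/dW[i][j] = 2 * q_i * x_j."""
    # Forward pass: q_i = sum_j W[i][j] * x_j.
    q = [sum(w * x_j for w, x_j in zip(row, x)) for row in W]
    # Upstream gradient from the text: df/dq_i = 2 * q_i.
    df_dq = [2.0 * q_i for q_i in q]
    # The indicator 1_{k=i} leaves only the k = i term of the sum.
    df_dW = [[df_dq[i] * x[j] for j in range(len(x))] for i in range(len(W))]
    return q, df_dW


q, df_dW = grad_wrt_W(W, x)
print([round(v, 2) for v in q])
print([[round(v, 3) for v in row] for row in df_dW])
```

The result has the same shape as $W$: each entry of the gradient is the upstream gradient of the row's output multiplied by the input of the column.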
Next, differentiating $q_{1}$ with respect to $x_{1}$ gives

$$\frac{\partial q_{1}}{\partial x_{1}}=W_{1, 1}=0.1$$

Similarly,

$$\frac{\partial q_{1}}{\partial x_{2}}=W_{1, 2}=0.5$$

$$\frac{\partial q_{2}}{\partial x_{1}}=W_{2, 1}=-0.3$$

$$\frac{\partial q_{2}}{\partial x_{2}}=W_{2, 2}=0.8$$
That is:

$$\frac{\partial q_{k}}{\partial x_{i}}=W_{k, i}$$

and, by the chain rule,

$$\frac{\partial f}{\partial x_{i}}=\sum_{k}\frac{\partial f}{\partial q_{k}}\frac{\partial q_{k}}{\partial x_{i}}=\sum_{k}2q_{k}W_{k, i}$$
So:

$$\frac{\partial f}{\partial x_{1}}=2q_{1}W_{1, 1}+2q_{2}W_{2, 1}=-0.112$$

$$\frac{\partial f}{\partial x_{2}}=2q_{1}W_{1, 2}+2q_{2}W_{2, 2}=0.636$$
Therefore:

$$\frac{\partial f}{\partial x}=\begin{bmatrix} -0.112 \\ 0.636 \end{bmatrix}$$
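The same result can be checked in code, both analytically and against finite differences. This sketch assumes $f(q)=\sum_{k}q_{k}^{2}$, which is the form implied by $\frac{\partial f}{\partial q_{k}}=2q_{k}$ but not written out in the text. The helper names are illustrative.

```python
W = [[0.1, 0.5], [-0.3, 0.8]]
x = [0.2, 0.4]


def forward(W, x):
    """Return q = W x and f = sum_k q_k^2 (assumed form of f)."""
    q = [sum(w * x_j for w, x_j in zip(row, x)) for row in W]
    return q, sum(q_k * q_k for q_k in q)


def grad_wrt_x(W, x):
    """Analytic gradient: df/dx_i = sum_k 2 * q_k * W[k][i]."""
    q, _ = forward(W, x)
    return [
        sum(2.0 * q[k] * W[k][i] for k in range(len(W)))
        for i in range(len(x))
    ]


def numerical_grad_wrt_x(W, x, h=1e-6):
    """Central finite differences, used only as a sanity check."""
    grads = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        f_plus = forward(W, x_plus)[1]
        f_minus = forward(W, x_minus)[1]
        grads.append((f_plus - f_minus) / (2.0 * h))
    return grads


df_dx = grad_wrt_x(W, x)
df_dx_num = numerical_grad_wrt_x(W, x)
print([round(g, 3) for g in df_dx])
print([round(g, 3) for g in df_dx_num])
```

Unlike the gradient with respect to $W$, every output $q_{k}$ depends on every $x_{i}$, so the sum over $k$ keeps all of its terms.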
Final result: