Darknet Forward Prediction and Backpropagation
This article has no figures; for much of this material, a figure would make things harder to explain, not easier.
For forward and backward propagation, I only work through the computation, by analogy with gradient descent and together with the source code. I make no attempt to analyze why this works or how well it works; that is beyond my ability, and I certainly do not know what each layer of the network is really doing. Call it alchemy.
Consider the directional derivative of a function $f$ along a unit direction with direction cosines $\cos\beta_k$:

$$\sum_{k=1}^n\frac{\partial f(x)}{\partial x_k}\cos \beta_k$$
where:

$$\sum_{k=1}^n\cos^2\beta_k=1$$
Then, by the Schwarz inequality:

$$\Big(\sum_{k=1}^n\frac{\partial f(x)}{\partial x_k}\cos \beta_k\Big)^2\leq \Big(\sum_{k=1}^{n}\Big(\frac{\partial f(x)}{\partial x_k}\Big)^2\Big)\Big(\sum_{k=1}^{n}\cos^2\beta_k\Big)$$
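Equality holds exactly when the direction cosines are proportional to the partial derivatives, a step the text leaves implicit; this is why the gradient is the direction of fastest increase:

$$\cos\beta_k=\frac{1}{\|\nabla f(x)\|}\frac{\partial f(x)}{\partial x_k}\ \Longrightarrow\ \sum_{k=1}^n\frac{\partial f(x)}{\partial x_k}\cos\beta_k=\|\nabla f(x)\|$$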
Gradient descent itself is simple. From the Schwarz inequality and basic knowledge of total differentials, we know that for a differentiable function the gradient is the direction of fastest increase. "Fastest" here has an instantaneous flavor: it only holds at the current point. The opposite direction is then the direction of fastest decrease, and moving orthogonally to the gradient amounts to moving along a level surface, where the function value does not change.
Searching along the negative gradient direction guarantees that the function value decreases until the gradient vanishes; when it does, the algorithm stops at a local minimum. Some people say that, for the training set, no optimal result exists. That is not the right way to put it: sometimes no closed-form solution exists, but an optimal value still does, because the $loss$ function is bounded below.
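As a minimal sketch of the descent step just described (a toy objective of my own choosing, not Darknet code):

```c
#include <stdio.h>

/* Toy objective f(x, y) = (x - 1)^2 + (y + 2)^2, minimized at (1, -2). */
static void gradient(const float p[2], float g[2])
{
    g[0] = 2.0f * (p[0] - 1.0f);
    g[1] = 2.0f * (p[1] + 2.0f);
}

int main(void)
{
    float p[2] = {5.0f, 5.0f};   /* starting point */
    float lr = 0.1f;             /* learning rate  */
    for (int step = 0; step < 100; ++step) {
        float g[2];
        gradient(p, g);
        p[0] -= lr * g[0];       /* step against the gradient */
        p[1] -= lr * g[1];
    }
    printf("approx minimum at (%f, %f)\n", p[0], p[1]);
    return 0;
}
```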
A fully connected layer can actually be viewed as a special convolutional layer; by the same token, you could call the convolutional layer "not fully connected", which I think is a rather fitting name.
The fully connected layer
Forward pass of the fully connected layer
(If you do not know what fully connected means, a web search will turn up plenty of introductions.)
Let $c_t$ be the number of neurons in layer $t$. The fully connected operation, written as a matrix product:
$$\left[\begin{matrix} w_t(0,0)&w_t(0,1)&\dots&w_t(0,c_{t-1})\\ w_t(1,0)&w_t(1,1)&\dots&w_t(1,c_{t-1})\\ \vdots&\vdots&\ddots &\vdots\\ w_t(c_t,0)&w_t(c_t,1)&\dots&w_t(c_t,c_{t-1}) \end{matrix}\right]\left[\begin{matrix}y_{t-1}(0)\\ y_{t-1}(1)\\ \vdots\\ y_{t-1}(c_{t-1}) \end{matrix}\right]=v_t$$
Here $y_{t-1}$ is the previous layer's output, $w_t$ holds the individual synaptic weights, and $v_t$ is the not-yet-activated signal (the local induced field).
Write:

$$\varphi_t(v_t)=\left[\begin{matrix}\varphi_t(v_{t}(0))\\ \varphi_t(v_{t}(1))\\ \vdots\\ \varphi_t(v_{t}(c_t)) \end{matrix}\right]$$
where $\varphi_t$ is the activation function of layer $t$.
In simplified matrix form:

$$v_t=w_ty_{t-1}\\ y_t=\varphi_t(v_t)$$
These operations correspond directly to the code:
```c
void forward_connected_layer(connected_layer l, network_state state)
{
    int i;
    fill_cpu(l.outputs*l.batch, 0, l.output, 1);  /* zero the output buffer */
    int m = l.batch;          /* number of samples         */
    int k = l.inputs;         /* input dimension, c_{t-1}  */
    int n = l.outputs;        /* output dimension, c_t     */
    float *a = state.input;   /* y_{t-1} */
    float *b = l.weights;     /* w_t     */
    float *c = l.output;      /* v_t     */
    /* c = a * b^T, i.e. v_t = w_t y_{t-1} for every sample in the batch */
    gemm(0, 1, m, n, k, 1, a, k, b, k, 1, c, n);
    /* y_t = phi_t(v_t) */
    activate_array(l.output, l.outputs*l.batch, l.activation);
}
```
The $batch\ normalize$ part of the code has been removed above; you can ignore that part for now.
Here, the `gemm` function performs the matrix multiplication.
`batch` is a training parameter (look it up if you are curious); at prediction time it is forcibly set to $1$.
The first two arguments of `gemm` say whether each matrix is transposed. `m` is the number of data samples, and `n`, `k` are the dimensions. The result is stored in `c`, that is, in `l.output`.
The correspondence (in the code above, `a` is `state.input`):

$$c : v_t \\ a : y_{t-1}\\ activate\_array():\varphi(v_t)$$
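To make the `gemm(0, 1, ...)` call concrete, here is a naive reference version of the product it computes; the function `gemm_nt_naive` and its reduced argument list are my own sketch, not Darknet's API:

```c
/* Naive version of C += A * B^T, the case gemm(0, 1, ...) handles:
 * A is M x K (the batch of inputs), B is N x K (the weights),
 * C is M x N (the pre-activations v_t for each sample). */
void gemm_nt_naive(int M, int N, int K,
                   const float *A, const float *B, float *C)
{
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float sum = 0;
            for (int p = 0; p < K; ++p) {
                sum += A[i*K + p] * B[j*K + p];  /* row i of A dot row j of B */
            }
            C[i*N + j] += sum;
        }
    }
}
```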
Backpropagation in the fully connected layer
Here we take $batch=1$ and define a cost function for each layer:
$$J_t=\frac{1}{2}\sum_{k=0}^{c_t}(d_t(k)-y_t(k))^2=\frac{1}{2}[e_t,e_t]\\ e_t(k)=d_t(k)-y_t(k)$$
where $d_t(k)$ is the output we desire from layer $t$. For the last layer, the output layer, this desired output is just our labeled data.
When $t$ is the output layer, we can directly obtain the error $e_t$ between the desired and the actual output, and the outermost weight gradient $\nabla J_t$ is then easy to compute.
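As a small sketch of this cost for $batch=1$ (the function name and buffers are mine, not Darknet's):

```c
/* J = 1/2 * sum_k (d(k) - y(k))^2, with e(k) = d(k) - y(k). */
float l2_cost(int n, const float *d, const float *y, float *e)
{
    float J = 0;
    for (int k = 0; k < n; ++k) {
        e[k] = d[k] - y[k];        /* error of neuron k */
        J += 0.5f * e[k] * e[k];
    }
    return J;
}
```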
$$\frac{\partial J_t}{\partial w_t(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial w_t(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial v_t(k)}\frac{\partial v_t(k)}{\partial w_t(a,b)}\\ =-e_t(a)\varphi'_t(v_t(a))\frac{\partial \sum_{i}w_{t}(a,i)y_{t-1}(i)}{\partial w_t(a,b)}\\ =-e_t(a)\varphi'_t(v_t(a))y_{t-1}(b)$$
But those were the outermost neurons. For the inner layers, although we assumed each layer has a desired output, the inner layers' desired outputs are unknown; this reflects how little we currently understand of what each layer does. At this point the algorithm uses the outermost layer's error as the inner layers' error.
Continuing from the computation above, write:

$$g_t(a)=e_t(a)\varphi'_t(v_t(a))$$
then:

$$\frac{\partial J_t}{\partial w_t(a,b)}=-g_t(a)y_{t-1}(b)$$
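For reference, the descent update this implies, with learning rate $\eta$ (the update rule itself is standard, though not spelled out in the text):

$$w_t(a,b)\leftarrow w_t(a,b)-\eta\frac{\partial J_t}{\partial w_t(a,b)}=w_t(a,b)+\eta\,g_t(a)\,y_{t-1}(b)$$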
Now consider computing:

$$\frac{\partial J_{t}}{\partial w_{t-1}(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial v_t(k)}\frac{\partial v_t(k)}{\partial w_{t -1}(a,b)}\\ =\sum_{k=0}^{c_t}-g_t(k)\frac{\partial \sum_iw_t(k,i)y_{t-1}(i)}{\partial w_{t-1}(a,b)}\\ =\sum_{k=0}^{c_t}-g_t(k)\sum_{i=0}^{c_{t-1}}\frac{\partial w_t(k,i)y_{t-1}(i)}{\partial y_{t-1}(i)}\frac{\partial y_{t-1}(i)}{\partial w_{t-1}(a,b)}\\ = \sum_{k=0}^{c_t}-g_t(k)\sum_{i=0}^{c_{t-1}}w_{t}(k,i)\varphi'_{t-1}(v_{t-1}(i))\frac{\partial \sum_jw_{t-1}(i,j)y_{t-2}(j)}{\partial w_{t-1}(a,b)}\\ =\sum_{k=0}^{c_t}-g_{t}(k)w_t(k,a)\varphi'_{t-1}(v_{t-1}(a))y_{t-2}(b)\\ =y_{t-2}(b)\varphi'_{t-1}(v_{t-1}(a))\sum_{k=0}^{c_t}-g_t(k)w_t(k,a)$$
Hence:

$$g_{t-1}(a)=\varphi'_{t-1}(v_{t-1}(a))\sum_{k=0}^{c_t}g_t(k)w_t(k,a)\\ \frac{\partial J_t}{\partial w_{t-1}(a,b)}=-g_{t-1}(a)y_{t-2}(b)$$
This relation is not limited to layer $t-1$ with $t$ the output layer; it keeps being passed down.
Consider computing the gradient for a fully connected layer $l$, with the loss function of the output layer $t$ as the objective:
$$\frac{\partial J_t}{\partial w_{l}(a,b)}=\frac{\partial J_t}{\partial y_{l}(a)}\frac{\partial y_l(a)}{\partial w_{l}(a,b)}\\ =\frac{\partial J_t}{\partial y_{l}(a)}\varphi_l'(v_l(a))y_{l-1}(b)$$
Since the errors are mutually independent, and recalling the forward pass, we have (this step is hard to grasp; an explanation is given at the end of the article):
$$\frac{\partial J_t}{\partial y_l(a)} = \sum_{k=0}^{c_{l+1}}\frac{\partial J_t}{\partial y_{l+1}(k)}\frac{\partial y_{l+1}(k)}{\partial y_l(a)}\\ =\sum_{k=0}^{c_{l+1}}\frac{\partial J_t}{\partial y_{l+1}(k)}\frac{ \partial y_{l+1}(k)}{\partial v_{l+1}(k)}w_{l+1}(k,a)$$
where:

$$\frac{\partial J_t}{\partial w_{l}(k,a)}=\frac{\partial J_t}{\partial y_l(k)}\frac{\partial y_{l}(k)}{\partial v_l(k)}\frac{\partial v_l(k)}{\partial w_l(k,a)}\\ =\frac{\partial J_t}{\partial y_l(k)}\frac{\partial y_{l}(k)}{\partial v_l(k)}y_{l-1}(a)$$
This amounts to redefining $g$ as $g_l(k)=-\frac{\partial J_t}{\partial y_l(k)}\frac{\partial y_{l}(k)}{\partial v_l(k)}$ (the minus sign keeps it consistent with the output-layer case, where $\partial J_t/\partial y_t(k)=-e_t(k)$), and we get:

$$\frac{\partial J_t}{\partial w_l(k,a)}=-g_l(k)y_{l-1}(a)$$
To summarize: the proof above took $t$ as the output layer; the summary below uses $n$ for the output layer:
$$\frac{\partial J_n}{\partial w_t(a,b)}=-g_t(a)y_{t-1}(b)\\ g_{t}(a)=\varphi'_{t}(v_{t}(a))\sum_{k=0}^{c_{t+1}}g_{t+1}(k)w_{t+1}(k,a)$$
Note that $g_t$ here is not the error itself; it implicitly contains the propagated error.
Matrix form of backpropagation
$$\frac{\partial J_n}{\partial w_t(a,b)}=-g_t(a)y_{t-1}(b)\\ g_{t}(a)=\varphi'_{t}(v_{t}(a))\sum_{k=0}^{c_{t+1}}g_{t+1}(k)w_{t+1}(k,a)$$
From this set of relations, let:
$$g_t = \left[\begin{matrix}g_{t}(0)\\g_{t}(1)\\ \vdots\\g_{t}(c_t)\end{matrix}\right]$$
Take all vectors to be column vectors. Then:

$$\frac{\partial J_n}{\partial w_t}=-g_ty_{t-1}^T\\ \ \\ g_t=\mathrm{diag}(\varphi'_t(v_t))\,w_{t+1}^Tg_{t+1} \\ =\varphi'_t(v_t)\bigodot w_{t+1}^Tg_{t+1}$$
The operator $\bigodot$ denotes the elementwise product of two matrices:

$$c_{i,j}=a_{i,j}b_{i,j}$$
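Putting the two matrix relations into code, here is a minimal sketch of one backward step for a fully connected layer with $batch=1$. All of the names are mine; Darknet's actual `backward_connected_layer` arranges the same products as `gemm` calls.

```c
/* Backward step for one FC layer, batch = 1 (a sketch, not Darknet code).
 * Computes g = phi'(v) (.) (w_next^T g_next) and dJ/dw = -g y^T. */
void fc_backward(int n_in, int n_out, int n_next,
                 const float *v,        /* pre-activations of this layer, n_out */
                 const float *y_prev,   /* previous layer's output, n_in        */
                 const float *w_next,   /* next layer's weights, n_next x n_out */
                 const float *g_next,   /* next layer's g, n_next               */
                 float (*dphi)(float),  /* derivative of the activation         */
                 float *g,              /* out: this layer's g, n_out           */
                 float *dw)             /* out: gradient of J, n_out x n_in     */
{
    for (int a = 0; a < n_out; ++a) {
        float s = 0;
        for (int k = 0; k < n_next; ++k)
            s += g_next[k] * w_next[k*n_out + a];   /* (w_next^T g_next)(a) */
        g[a] = dphi(v[a]) * s;
        for (int b = 0; b < n_in; ++b)
            dw[a*n_in + b] = -g[a] * y_prev[b];     /* dJ/dw(a,b) = -g(a) y(b) */
    }
}
```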
Convolutional layer (the non-fully-connected layer)
```c
void forward_convolutional_layer(convolutional_layer l, network_state state)
{
    int out_h = convolutional_out_height(l);
    int out_w = convolutional_out_width(l);
    int i;
    fill_cpu(l.outputs*l.batch, 0, l.output, 1);  /* zero the output buffer */
    int m = l.n;                 /* number of filters               */
    int k = l.size*l.size*l.c;   /* weights per filter              */
    int n = out_h*out_w;         /* spatial positions in the output */
    float *a = l.weights;
    float *b = state.workspace;  /* im2col buffer                   */
    float *c = l.output;
    static int u = 0;
    u++;
    for(i = 0; i < l.batch; ++i){
        /* unfold the input into columns, then convolve as one matrix multiply */
        im2col_cpu_custom(state.input, l.c, l.h, l.w, l.size, l.stride, l.pad, b);
        gemm(0, 0, m, n, k, 1, a, k, b, n, 1, c, n);
        c += n*m;
        state.input += l.c*l.h*l.w;
    }
    add_bias(l.output, l.biases, l.batch, l.n, out_h*out_w);
    activate_array_cpu_custom(l.output, m*n*l.batch, l.activation);
}
```
The basic operation is `im2col_cpu_custom`.
This operation rearranges the previous layer's output into a matrix of shape $[(size)^2l.c] \times[out\_h\times out\_w]$, which serves as the current layer's input $y$.
The current layer's weights $w$ have shape $filters \times (size)^2l.c$. Here $filters$ is the number of channels of the next layer; you can also think of it as the number of neurons, or of convolution kernels. $size$ is the side length (size) of the kernel.
Seen this way, the convolution operation becomes a matrix multiplication: $w\times y$ yields an output of shape $filters\times out\_h\times out\_w$.
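For intuition, a minimal im2col sketch with stride 1 and no padding (a simplification of what `im2col_cpu_custom` supports):

```c
/* Minimal im2col: stride 1, no padding. Each output column holds one
 * size x size x c input patch, so convolution reduces to one matrix
 * multiply against the unfolded matrix. */
void im2col_simple(const float *im, int c, int h, int w,
                   int size, float *col)
{
    int out_h = h - size + 1;
    int out_w = w - size + 1;
    for (int ch = 0; ch < c; ++ch)
      for (int ki = 0; ki < size; ++ki)
        for (int kj = 0; kj < size; ++kj) {
            int row = (ch*size + ki)*size + kj;  /* row in the unfolded matrix */
            for (int oi = 0; oi < out_h; ++oi)
              for (int oj = 0; oj < out_w; ++oj)
                col[row*out_h*out_w + oi*out_w + oj] =
                    im[ch*h*w + (oi + ki)*w + (oj + kj)];
        }
}
```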
Consider the outermost weight gradient when the convolutional layer is itself the output layer:
$$\frac{\partial J_n}{\partial w_n(a,b)}=\frac{1}{2}\sum_{k}\sum_{m}\frac{\partial e^2_n(k,m)}{\partial w_n(a,b)}\\ =\sum_{k}\sum_{m}e_n(k,m)\frac{\partial e_n(k,m)}{\partial v_n(k,m)}\frac{\partial v_n(k,m)}{\partial w_n(a,b)} \\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial \sum_{i}w_n(k,i)y_{n-1}(i,m)}{\partial w_n(a,b)} \\ =\sum_{m}-e_n(a,m)\varphi'_n(v_n(a,m))y_{n-1}(b,m)$$
For layer $n-1$, still taking the outermost layer's loss as the objective function:
$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\frac{1}{2}\sum_{k}\sum_{m}\frac{\partial e^2_n(k,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}e_n(k,m)\frac{\partial e_n(k,m)}{\partial v_{n}(k,m)}\frac{\partial v_n(k,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial v_n(k,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial \sum_iw_n(k,i)y_{n-1}(i,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\sum_iw_n(k,i)\frac{\partial y_{n-1}(i,m)}{\partial v_{n-1}(i,m)}\frac{\partial v_{n-1}(i,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\sum_iw_n(k,i)\varphi'_{n-1}(v_{n-1}(i,m))\frac{\partial \sum_{j}w_{n-1}(i,j)y_{n-2}(j,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))w_n(k,a)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)$$
Based on the computation above, let:

$$g_n(a,m)=e_n(a,m)\varphi'_n(v_n(a,m))$$
then:

$$\frac{\partial J_n}{\partial w_n(a,b)}=\sum_{m}-g_n(a,m)y_{n-1}(b,m)$$
$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\sum_{k}\sum_{m}-g_n(k,m)w_n(k,a)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)\\ =\sum_{m}\Big(\sum_{k}-g_n(k,m)w_n(k,a)\Big)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)$$
Here we let:

$$g_{n-1}(a,m)=\varphi'_{n-1}(v_{n-1}(a,m))\sum_{k}g_n(k,m)w_{n}(k,a)$$
so that:

$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\sum_{m}-g_{n-1}(a,m)y_{n-2}(b,m)$$
In fact this propagating property of $g$ can still be maintained in general. Redefine $g$:
$$\frac{\partial J_n}{\partial w_{t}(a,b)}=\sum_{k}\sum_{m}\frac{\partial J_n}{\partial y_{t}(k,m)}\frac{\partial y_t(k,m)}{\partial w_{t}(a,b)}\\ =\sum_{k}\sum_{m}\frac{\partial J_n}{\partial y_{t}(k,m)}\frac{\partial y_t(k,m)}{\partial v_{t}(k,m)}\frac{\partial v_t(k,m)}{\partial w_t(a,b)}\\ =\sum_{m}\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}y_{t-1}(b,m)$$
Here, let:

$$g_t(a,m)=-\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}$$

(the minus sign keeps this consistent with $g_n(a,m)=e_n(a,m)\varphi'_n(v_n(a,m))$, since $\partial J_n/\partial y_n(a,m)=-e_n(a,m)$).
Clearly this holds for $t=n$ and $t=n-1$. By induction:
$$\frac{\partial J_n}{\partial w_{t}(a,b)}=\sum_{m}\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}y_{t-1}(b,m)$$
For the following step (again hard to grasp; see the explanation at the end):
$$\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}=\frac{\partial J_n}{\partial y_{t}(a,m)}\varphi_{t}'(v_{t}(a,m))\\ =\varphi_{t}'(v_{t}(a,m)) \sum_{i}\frac{\partial J_n}{\partial y_{t+1}(i,m)}\frac{\partial y_{t+1}(i,m)}{\partial v_{t+1}(i,m)}\frac{\partial v_{t+1}(i,m)}{\partial y_t(a,m)}\\ =\varphi_{t}'(v_{t}(a,m)) \sum_{i}\frac{\partial J_n}{\partial y_{t+1}(i,m)}\frac{\partial y_{t+1}(i,m)}{\partial v_{t+1}(i,m)}w_{t+1}(i,a)$$
Hence:

$$g_t(a,m)=\varphi'_t(v_t(a,m))\sum_{k}g_{t+1}(k,m)w_{t+1}(k,a)$$
Convolutional backpropagation can also be written in matrix form:

$$g_t=\varphi'_t(v_t)\bigodot w_{t+1}^Tg_{t+1}\\ \frac{\partial J_n}{\partial w_t}=-g_ty_{t-1}^T$$
Convolutional and fully connected layers are updated in the same way.
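As a sketch of this shared update in matrix form, here is the weight-gradient product for a convolutional layer on one sample; the names are mine, and in Darknet the same product is a `gemm` call against the im2col buffer:

```c
/* dJ/dw = -g y^T for a conv layer in im2col form (one sample).
 * g is n_filters x M, y_col is K x M (the im2col buffer), where
 * M = out_h*out_w and K = size*size*c; dw is n_filters x K. */
void conv_weight_grad(int n_filters, int K, int M,
                      const float *g, const float *y_col, float *dw)
{
    for (int a = 0; a < n_filters; ++a)
        for (int b = 0; b < K; ++b) {
            float sum = 0;
            for (int m = 0; m < M; ++m)
                sum += g[a*M + m] * y_col[b*M + m];  /* sum over spatial positions */
            dw[a*K + b] = -sum;                      /* dJ/dw(a,b) = -sum_m g(a,m) y(b,m) */
        }
}
```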
An explanation for the parts that are hard to grasp:
For the two hard-to-grasp steps above, you can reason as follows. Consider a continuous function

$$h(x_1,x_2,\dots,x_n)$$
Let:

$$f(t)=h(t,t,\dots,t)$$

then:

$$\frac{df(t)}{dt}=\sum_{k=1}^n\frac{\partial h}{\partial x_k}\Big|_{x_k=t}$$
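A quick sanity check of this identity with a concrete function (my example): take $h(x_1,x_2)=x_1x_2$, so $f(t)=t^2$ and

$$\frac{df(t)}{dt}=2t,\qquad \frac{\partial h}{\partial x_1}\Big|_{(t,t)}+\frac{\partial h}{\partial x_2}\Big|_{(t,t)}=x_2\big|_{(t,t)}+x_1\big|_{(t,t)}=t+t=2t$$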
The output $y_t(a)$ of a single neuron in layer $t$ influences the loss function of the output layer $n$ along many dimensions (through every neuron of layer $t+1$); it is just that each of those dimensions takes this one variable as input. Clearly, this influence is also differentiable:

$$\frac{\partial J_n}{\partial y_t(a)}=\sum_{k=0}^{c_{t+1}}\frac{\partial J_n}{\partial y_{t+1}(k)}\frac{\partial y_{t+1}(k)}{\partial y_t(a)}$$