1. Forward Propagation

1.1 Concept

Forward propagation feeds the input data through the neural network layer by layer, computing forward until the output layer is reached.

The discrepancy between the final result of the forward pass and the true value is the error, which is measured by the loss function.

1.2 Forward Propagation Computation

Consider a simple neural network whose activation function is the sigmoid:
$$net_{h1}=w_1i_1+w_2i_2+b=0.15\times0.05+0.2\times0.1+0.35=0.3775$$

$$net_{h2}=w_3i_1+w_4i_2+b=0.25\times0.05+0.3\times0.1+0.35=0.3925$$

$$out_{h1}=\frac{1}{1+e^{-net_{h1}}}=\frac{1}{1+e^{-0.3775}}=0.5933$$

$$out_{h2}=\frac{1}{1+e^{-net_{h2}}}=\frac{1}{1+e^{-0.3925}}=0.5969$$

$$net_{o1}=w_5out_{h1}+w_6out_{h2}+b=0.4\times0.5933+0.45\times0.5969+0.6=1.1059$$

$$net_{o2}=w_7out_{h1}+w_8out_{h2}+b=0.5\times0.5933+0.55\times0.5969+0.6=1.2249$$

$$out_{o1}=\frac{1}{1+e^{-net_{o1}}}=\frac{1}{1+e^{-1.1059}}=0.7514$$

$$out_{o2}=\frac{1}{1+e^{-net_{o2}}}=\frac{1}{1+e^{-1.2249}}=0.7729$$

$$E_{total}=\sum\frac{1}{2}(target-output)^2$$

$$E_{total}=E_{o1}+E_{o2}=0.2748+0.0236=0.2984$$
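The forward pass above can be reproduced with a short NumPy sketch (the variable names are my own; the weights, biases, and the targets 0.01 and 0.99 are the ones used throughout this example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# inputs, weights, biases, and targets from the worked example
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

# hidden layer
net_h1 = w1 * i1 + w2 * i2 + b1          # 0.3775
net_h2 = w3 * i1 + w4 * i2 + b1          # 0.3925
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)

# output layer
net_o1 = w5 * out_h1 + w6 * out_h2 + b2
net_o2 = w7 * out_h1 + w8 * out_h2 + b2
out_o1, out_o2 = sigmoid(net_o1), sigmoid(net_o2)

# squared-error loss, summed over both outputs
E_total = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2
print(round(out_o1, 4), round(out_o2, 4), round(E_total, 4))
# 0.7514 0.7729 0.2984
```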
2. The Chain Rule

For a complicated composite function, we decompose it into a series of elementary functions (addition, subtraction, multiplication, division, exponentials, logarithms, trigonometric functions, and so on) and complete the differentiation via the chain rule. We illustrate the process with a composite function common in neural networks; let $f(x;w,b)$ be:

$$f(x;w,b)=\frac{1}{\exp\left(-(wx+b)\right)+1}$$
where $x$ is the input data, $w$ is the weight, and $b$ is the bias. We decompose the composite function as:
| Function | Derivative |
| --- | --- |
| $h_1=x\cdot w$ | $\frac{\partial h_1}{\partial w}=x,\ \frac{\partial h_1}{\partial x}=w$ |
| $h_2=h_1+b$ | $\frac{\partial h_2}{\partial h_1}=1,\ \frac{\partial h_2}{\partial b}=1$ |
| $h_3=-h_2$ | $\frac{\partial h_3}{\partial h_2}=-1$ |
| $h_4=\exp(h_3)$ | $\frac{\partial h_4}{\partial h_3}=\exp(h_3)$ |
| $h_5=h_4+1$ | $\frac{\partial h_5}{\partial h_4}=1$ |
| $h_6=\frac{1}{h_5}$ | $\frac{\partial h_6}{\partial h_5}=-\frac{1}{h_5^2}$ |
Represented graphically:

The derivative of the whole composite function $f(x;w,b)$ with respect to the parameters $w$ and $b$ can be obtained by multiplying together all the derivatives along the path between $f(x;w,b)$ and each parameter:
$$\frac{\partial f(x;w,b)}{\partial w}=\frac{\partial f(x;w,b)}{\partial h_6}\cdot \frac{\partial h_6}{\partial h_5}\cdot \frac{\partial h_5}{\partial h_4}\cdot \frac{\partial h_4}{\partial h_3}\cdot \frac{\partial h_3}{\partial h_2}\cdot \frac{\partial h_2}{\partial h_1}\cdot \frac{\partial h_1}{\partial w}$$
$$\frac{\partial f(x;w,b)}{\partial b}=\frac{\partial f(x;w,b)}{\partial h_6}\cdot \frac{\partial h_6}{\partial h_5}\cdot \frac{\partial h_5}{\partial h_4}\cdot \frac{\partial h_4}{\partial h_3}\cdot \frac{\partial h_3}{\partial h_2}\cdot \frac{\partial h_2}{\partial b}$$
Taking $w$ as an example, with $x=1$, $w=0$, $b=0$ we get:

$$h_1=x\cdot w=0$$
$$h_2=h_1+b=0$$
$$h_3=-h_2=0$$
$$h_4=\exp(h_3)=1$$
$$h_5=h_4+1=2$$
$$h_6=\frac{1}{h_5}=\frac{1}{2}$$
$$f(x;w,b)=h_6=\frac{1}{2}$$
$$\begin{aligned} \frac{\partial f(x;w,b)}{\partial w}\Big|_{x=1,w=0,b=0} & =\frac{\partial f(x;w,b)}{\partial h_6}\cdot \frac{\partial h_6}{\partial h_5}\cdot \frac{\partial h_5}{\partial h_4}\cdot \frac{\partial h_4}{\partial h_3}\cdot \frac{\partial h_3}{\partial h_2}\cdot \frac{\partial h_2}{\partial h_1}\cdot \frac{\partial h_1}{\partial w}\\ & =1\times(-0.25)\times1\times1\times(-1)\times1\times1 \\ &=0.25 \end{aligned}$$
$$\begin{aligned} \frac{\partial f(x;w,b)}{\partial b}\Big|_{x=1,w=0,b=0} & =\frac{\partial f(x;w,b)}{\partial h_6}\cdot \frac{\partial h_6}{\partial h_5}\cdot \frac{\partial h_5}{\partial h_4}\cdot \frac{\partial h_4}{\partial h_3}\cdot \frac{\partial h_3}{\partial h_2}\cdot \frac{\partial h_2}{\partial b}\\ & =1\times(-0.25)\times1\times1\times(-1)\times1\\ &=0.25 \end{aligned}$$
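The decomposition and the path-product evaluation above can be checked with a few lines of plain Python (a sketch; the forward values and local derivatives mirror the table of elementary functions):

```python
import math

# evaluate f(x; w, b) = 1 / (exp(-(w*x + b)) + 1) step by step,
# following the decomposition h1..h6, at x=1, w=0, b=0
x, w, b = 1.0, 0.0, 0.0

# forward pass through the elementary steps
h1 = x * w          # 0
h2 = h1 + b         # 0
h3 = -h2            # 0
h4 = math.exp(h3)   # 1
h5 = h4 + 1         # 2
h6 = 1 / h5         # 0.5

# backward pass: multiply local derivatives along the path, in reverse order
d_h6 = 1.0                      # df/dh6
d_h5 = d_h6 * (-1 / h5 ** 2)    # dh6/dh5 = -1/h5^2 = -0.25
d_h4 = d_h5 * 1.0               # dh5/dh4 = 1
d_h3 = d_h4 * math.exp(h3)      # dh4/dh3 = exp(h3) = 1
d_h2 = d_h3 * (-1.0)            # dh3/dh2 = -1
d_h1 = d_h2 * 1.0               # dh2/dh1 = 1
d_w  = d_h1 * x                 # dh1/dw = x
d_b  = d_h2 * 1.0               # dh2/db = 1
print(h6, d_w, d_b)             # 0.5 0.25 0.25
```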
3. The BP (Backpropagation) Algorithm

The backpropagation algorithm uses the chain rule to update the weights at every node of the neural network.

- Output-layer weights: $$w_{jk}=w_{jk}-\eta \frac{\partial E}{\partial w_{jk}}$$
- Hidden-layer weights: $$w_{ij}=w_{ij}-\eta \frac{\partial E}{\partial w_{ij}}$$
- Bias update: $$b_{j}=b_{j}-\eta \frac{\partial E}{\partial b_{j}}$$

Continuing with the forward-propagation example, we start with the simplest case: the derivative of the error $E$ with respect to $w_5$. The chain rule fixes the order of the computation: first take the derivative of $E$ with respect to $out_{o1}$, then that of $out_{o1}$ with respect to $net_{o1}$, and finally that of $net_{o1}$ with respect to $w_5$; chaining these together yields the derivative of $E$ with respect to $w_5$, as shown in the figure below:
3.1 Computing the Derivatives

$$E_{total}=\frac{1}{2}(target_{o1}-out_{o1})^2+\frac{1}{2}(target_{o2}-out_{o2})^2$$

$$\frac{\partial E_{total}}{\partial out_{o1}}=2\times\frac{1}{2}\times(target_{o1}-out_{o1})^{2-1}\times(-1)+0=-(target_{o1}-out_{o1})=-(0.01-0.7514)=0.7414$$

$$out_{o1}=\frac{1}{1+e^{-net_{o1}}}$$

$$\frac{\partial out_{o1}}{\partial net_{o1}}=out_{o1}(1-out_{o1})=0.7514\times(1-0.7514)=0.1868$$

$$net_{o1}=w_5out_{h1}+w_6out_{h2}+b$$

$$\frac{\partial net_{o1}}{\partial w_5}=out_{h1}+0+0=0.5933$$

Therefore:

$$\frac{\partial E_{total}}{\partial w_5} =\frac{\partial E_{total}}{\partial out_{o1}}\cdot\frac{\partial out_{o1}}{\partial net_{o1}}\cdot\frac{\partial net_{o1}}{\partial w_5}=0.7414\times0.1868\times0.5933 =0.0822$$
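As a quick check, the three chain-rule factors can be multiplied in code (a sketch using the rounded intermediate values from the derivation above):

```python
# numerical check of dE_total/dw5, using the rounded values
# target_o1, out_o1, out_h1 from the worked example
target_o1, out_o1, out_h1 = 0.01, 0.7514, 0.5933

dE_dout   = -(target_o1 - out_o1)      # 0.7414
dout_dnet = out_o1 * (1 - out_o1)      # ≈ 0.1868
dnet_dw5  = out_h1                     # 0.5933

dE_dw5 = dE_dout * dout_dnet * dnet_dw5
print(round(dE_dw5, 4))                # 0.0822
```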
3.2 Parameter Updates

From the derivation above:

$$\begin{aligned} \frac{\partial E_{total}}{\partial w_5}&=-(target_{o1}-out_{o1})\cdot out_{o1}(1-out_{o1})\cdot out_{h1}\\ &=0.0822 \end{aligned}$$
$$\begin{aligned} \frac{\partial E_{total}}{\partial w_8}&=-(target_{o2}-out_{o2})\cdot out_{o2}(1-out_{o2})\cdot out_{h2}\\ &=-0.0227 \end{aligned}$$
$$w_5^+=w_5-\eta\cdot\frac{\partial E_{total}}{\partial w_5}=0.4-0.5\times0.0822=0.3589$$
$$w_6^+=w_6-\eta\cdot\frac{\partial E_{total}}{\partial w_6}=0.45-0.5\times0.0827=0.4087$$
$$w_7^+=w_7-\eta\cdot\frac{\partial E_{total}}{\partial w_7}=0.50-0.5\times(-0.0226)=0.5113$$
$$w_8^+=w_8-\eta\cdot\frac{\partial E_{total}}{\partial w_8}=0.55-0.5\times(-0.0227)=0.5614$$
The derivative of the error $E$ with respect to $w_1$ follows more than one path; the computation is illustrated in the figure below:
$$\frac{\partial E_{total}}{\partial w_1}=\frac{\partial E_{total}}{\partial out_{h1}}\cdot\frac{\partial out_{h1}}{\partial net_{h1}}\cdot\frac{\partial net_{h1}}{\partial w_1}$$

$$\frac{\partial E_{total}}{\partial out_{h1}}=\frac{\partial E_{o1}}{\partial out_{h1}}+\frac{\partial E_{o2}}{\partial out_{h1}}$$

$$\frac{\partial E_{o1}}{\partial out_{h1}}=\frac{\partial E_{o1}}{\partial out_{o1}}\cdot\frac{\partial out_{o1}}{\partial net_{o1}}\cdot\frac{\partial net_{o1}}{\partial out_{h1}}$$

$$\frac{\partial E_{o2}}{\partial out_{h1}}=\frac{\partial E_{o2}}{\partial out_{o2}}\cdot\frac{\partial out_{o2}}{\partial net_{o2}}\cdot\frac{\partial net_{o2}}{\partial out_{h1}}$$

$$\frac{\partial E_{total}}{\partial w_1}=\left(\frac{\partial E_{o1}}{\partial out_{o1}}\cdot\frac{\partial out_{o1}}{\partial net_{o1}}\cdot\frac{\partial net_{o1}}{\partial out_{h1}}+\frac{\partial E_{o2}}{\partial out_{o2}}\cdot\frac{\partial out_{o2}}{\partial net_{o2}}\cdot\frac{\partial net_{o2}}{\partial out_{h1}}\right)\cdot\frac{\partial out_{h1}}{\partial net_{h1}}\cdot\frac{\partial net_{h1}}{\partial w_1}$$
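The multi-path computation can be verified numerically; a sketch using the example's weights (the variable names are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# forward pass with the example's weights and inputs
i1, i2 = 0.05, 0.10
out_h1 = sigmoid(0.15 * i1 + 0.20 * i2 + 0.35)
out_h2 = sigmoid(0.25 * i1 + 0.30 * i2 + 0.35)
out_o1 = sigmoid(0.40 * out_h1 + 0.45 * out_h2 + 0.60)
out_o2 = sigmoid(0.50 * out_h1 + 0.55 * out_h2 + 0.60)
t1, t2 = 0.01, 0.99

# output-layer deltas: dE_o / dnet_o for each output neuron
delta_o1 = -(t1 - out_o1) * out_o1 * (1 - out_o1)
delta_o2 = -(t2 - out_o2) * out_o2 * (1 - out_o2)

# sum both paths back to out_h1 (through w5 and w7), then chain down to w1
dE_douth1 = delta_o1 * 0.40 + delta_o2 * 0.50
dE_dw1 = dE_douth1 * out_h1 * (1 - out_h1) * i1
print(dE_dw1)   # ≈ 0.000438568
```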
This gives:

$$w_1^+=w_1-\eta\cdot\frac{\partial E_{total}}{\partial w_1}=0.15-0.5\times0.000438568=0.149780716$$
$$w_2^+=0.19956143$$
$$w_3^+=0.24975114$$
$$w_4^+=0.29950229$$
Through the steps above, all the weights have been updated. With the original inputs 0.05 and 0.1, the network's initial error was 0.298371109; after the first round of backpropagation, the total error drops to 0.291027924. After 10,000 repetitions the error falls to 0.000035085, and the two output neurons produce 0.015912196 (versus the target 0.01) and 0.984065734 (versus the target 0.99).
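The whole procedure can be condensed into a short NumPy training loop (a sketch; it keeps the biases fixed, as the worked example does, and uses my own variable names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# network parameters from the worked example
W1 = np.array([[0.15, 0.20], [0.25, 0.30]])   # hidden weights (w1..w4)
W2 = np.array([[0.40, 0.45], [0.50, 0.55]])   # output weights (w5..w8)
b1, b2 = 0.35, 0.60
x = np.array([0.05, 0.10])                     # inputs
t = np.array([0.01, 0.99])                     # targets
eta = 0.5                                      # learning rate

for step in range(10000):
    # forward pass
    out_h = sigmoid(W1 @ x + b1)
    out_o = sigmoid(W2 @ out_h + b2)
    # backward pass: deltas per the chain-rule derivation
    delta_o = -(t - out_o) * out_o * (1 - out_o)
    delta_h = (W2.T @ delta_o) * out_h * (1 - out_h)
    # gradient-descent updates (biases kept fixed, as in the example)
    W2 -= eta * np.outer(delta_o, out_h)
    W1 -= eta * np.outer(delta_h, x)

E_total = 0.5 * np.sum((t - sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)) ** 2)
print(E_total)   # ≈ 3.5e-5 after 10,000 rounds
```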