This article is my own reorganized summary based on that source; if anything is unclear, please refer to the original.
1 Preliminaries
1.1 Derivative of the Sigmoid activation function
The Sigmoid function is defined as:
$$\sigma(x)=\frac{1}{1+e^{-x}}$$
Its derivative is:
$$\begin{aligned} \frac{d\sigma(x)}{dx}&=\frac{d}{dx}\left(\frac{1}{1+e^{-x}}\right)\\ &=\frac{e^{-x}}{\left(1+e^{-x}\right)^2}=\frac{\left(1+e^{-x}\right)-1}{\left(1+e^{-x}\right)^2}\\ &=\frac{1+e^{-x}}{\left(1+e^{-x}\right)^2}-\left(\frac{1}{1+e^{-x}}\right)^2\\ &=\sigma(x)-\sigma(x)^2=\sigma(x)\left(1-\sigma(x)\right) \end{aligned}$$
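The identity $\sigma'(x)=\sigma(x)(1-\sigma(x))$ can be checked numerically; a minimal sketch in Python (the helper names are my own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # Uses the identity derived above: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Sanity check against a central finite difference at an arbitrary point
x, h = 0.3775, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert abs(sigmoid_prime(x) - numeric) < 1e-8
```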
2 Network structure diagram
$i$ is the input; $h$ and $o$ are two fully connected layers.
The network's weights $w$ and biases $b$ are initialized as shown in the figure below:
3 Forward propagation
1. For the $h$ layer:
1. Compute the total input to node $h_1$:
$$\begin{aligned} net_{h1}&=w_1\times i_1+w_2\times i_2+b_1\times 1\\ &=0.15\times 0.05+0.2\times 0.1+0.35\times 1\\ &=0.3775 \end{aligned}$$
2. Compute the output of node $h_1$. The formula for node $h_1$ is $out_{h1}=\sigma\left(wx+b\right)$, where $x$ is the node's input (here, $i$), $w$ is the weight, $b$ is the bias, and $\sigma$ is the activation function (here, the Sigmoid function from Section 1.1). The output of node $h_1$ is then:
$$\begin{aligned} out_{h1}&=\sigma\left(wx+b\right)=\sigma\left(net_{h1}\right)\\ &=\frac{1}{1+e^{-net_{h1}}}=\frac{1}{1+e^{-0.3775}}\\ &=0.593269992 \end{aligned}$$
3. By the same method, $out_{h2}=0.596884378$.
2. For the $o$ layer:
1. Repeat the process above for the $o$ layer:
$$\begin{aligned} net_{o1}&=w_5\times out_{h1}+w_6\times out_{h2}+b_2\times 1\\ &=0.4\times 0.593269992+0.45\times 0.596884378+0.6\\ &=1.105905967 \end{aligned}$$
The output is then:
$$out_{o1}=\frac{1}{1+e^{-net_{o1}}}=0.75136507$$
Similarly, $out_{o2}=0.772928465$ (the value is fixed here to be consistent with the error $E_{o2}$ computed below).
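The forward pass above can be reproduced in a few lines of Python. Note that $w_3$, $w_4$, $w_7$, and $w_8$ come from the initialization figure (not shown here); the values below are assumptions chosen to be consistent with the reported outputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs, weights and biases from the text; w3, w4, w7, w8 are assumed
# from the (missing) initialization figure.
i1, i2 = 0.05, 0.10
w1, w2, w3, w4, b1 = 0.15, 0.20, 0.25, 0.30, 0.35
w5, w6, w7, w8, b2 = 0.40, 0.45, 0.50, 0.55, 0.60

# Hidden layer
net_h1 = w1 * i1 + w2 * i2 + b1 * 1      # 0.3775
out_h1 = sigmoid(net_h1)                 # ~ 0.593269992
net_h2 = w3 * i1 + w4 * i2 + b1 * 1
out_h2 = sigmoid(net_h2)                 # ~ 0.596884378

# Output layer
net_o1 = w5 * out_h1 + w6 * out_h2 + b2  # ~ 1.105905967
out_o1 = sigmoid(net_o1)                 # ~ 0.75136507
net_o2 = w7 * out_h1 + w8 * out_h2 + b2
out_o2 = sigmoid(net_o2)
```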
4 Computing the error (Loss)
Here we use the mean squared error loss, defined as: $$E_{total}=\frac{1}{2}\sum_{k=1}^K\left(y_k-o_k\right)^2$$ where $y_k$ is the true (expected) value and $o_k$ is the output value.
As shown in Figure 2 above, the true value for node $o_1$ is 0.01, while the forward pass produced an output of 0.75136507, so its error is:
$$\begin{aligned} E_{o1}&=\frac{1}{2}\left(target-output\right)^2=\frac{1}{2}\times\left(0.01-0.75136507\right)^2\\ &=0.274811 \end{aligned}$$
Similarly, $E_{o2}=0.023560026$.
Putting these together, the total error is: $$E_{total}=E_{o1}+E_{o2}=0.274811+0.023560026=0.298371$$
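As a quick check, the loss computation in Python (the target 0.99 for $o_2$ is an assumption from the missing figure; it is consistent with the reported $E_{o2}$):

```python
# MSE loss as defined above: E_total = 1/2 * sum_k (y_k - o_k)^2
targets = [0.01, 0.99]   # target for o2 assumed from the figure
outputs = [0.75136507, 0.772928465]

errors = [0.5 * (y - o) ** 2 for y, o in zip(targets, outputs)]
E_o1, E_o2 = errors      # ~ 0.274811, ~ 0.023560
E_total = sum(errors)    # ~ 0.298371
```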
5 Backpropagation
1. For the output layer ($o$ layer)
For $w_5$, we want to know how much changing it affects the total error, so we need to compute $\frac{\partial E_{total}}{\partial w_5}$.
By the chain rule: $$\frac{\partial E_{total}}{\partial w_5}=\frac{\partial E_{total}}{\partial out_{o1}}\times\frac{\partial out_{o1}}{\partial net_{o1}}\times\frac{\partial net_{o1}}{\partial w_5}$$
1. For $\frac{\partial E_{total}}{\partial out_{o1}}$:
$$\begin{aligned} E_{total}&=\frac{1}{2}\left(target_{o1}-out_{o1}\right)^2+\frac{1}{2}\left(target_{o2}-out_{o2}\right)^2\\ \frac{\partial E_{total}}{\partial out_{o1}}&=2\times\frac{1}{2}\left(target_{o1}-out_{o1}\right)^{2-1}\times\left(-1\right)+0\\ &=-\left(target_{o1}-out_{o1}\right)\\ &=-\left(0.01-0.75136507\right)=0.741365 \end{aligned}$$
2. For $\frac{\partial out_{o1}}{\partial net_{o1}}$:
$$\begin{aligned} out_{o1}&=\frac{1}{1+e^{-net_{o1}}}\\ \frac{\partial out_{o1}}{\partial net_{o1}}&=out_{o1}\left(1-out_{o1}\right)=0.186815602 \end{aligned}$$
3. For $\frac{\partial net_{o1}}{\partial w_5}$:
$$\begin{aligned} net_{o1}&=w_5\times out_{h1}+w_6\times out_{h2}+b_2\times 1\\ \frac{\partial net_{o1}}{\partial w_5}&=1\times out_{h1}\times w_5^{\left(1-1\right)}+0+0=out_{h1}=0.593269992 \end{aligned}$$
Putting these together:
$$\begin{aligned} \frac{\partial E_{total}}{\partial w_5}&=\frac{\partial E_{total}}{\partial out_{o1}}\times\frac{\partial out_{o1}}{\partial net_{o1}}\times\frac{\partial net_{o1}}{\partial w_5}\\ &=0.741365\times 0.186815602\times 0.593269992\\ &=0.082167 \end{aligned}$$
Next, an optimizer uses this value to adjust the weight $w_5$ (for more on optimizers, see Link). Here we use the most basic form, standard gradient descent (GD), with learning rate $\eta=0.5$:
$$w_5^+=w_5-\eta\times\frac{\partial E_{total}}{\partial w_5}=0.4-0.5\times 0.082167041=0.358916$$
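The chain-rule computation and the gradient descent update for $w_5$ can be sketched as follows (the variable names are my own):

```python
# Values carried over from the forward pass and loss sections
target_o1 = 0.01
out_o1 = 0.75136507
out_h1 = 0.593269992
w5, eta = 0.40, 0.5

# Chain rule: dE/dw5 = dE/dout_o1 * dout_o1/dnet_o1 * dnet_o1/dw5
dE_dout   = -(target_o1 - out_o1)         # ~ 0.741365
dout_dnet = out_o1 * (1.0 - out_o1)       # ~ 0.186815602
dnet_dw5  = out_h1                        # ~ 0.593269992
grad_w5 = dE_dout * dout_dnet * dnet_dw5  # ~ 0.082167

# Standard gradient descent step
w5_new = w5 - eta * grad_w5               # ~ 0.358916
```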
This gives the updated value $w_5^+$ of the weight $w_5$.
Repeating the same steps yields $w_6^+$, $w_7^+$, and $w_8^+$.
2. For the hidden layer ($h$ layer)
The procedure for the $h$ layer is similar to that for the output layer; for $w_1$ we need to compute: $$\frac{\partial E_{total}}{\partial w_1}=\frac{\partial E_{total}}{\partial out_{h1}}\times\frac{\partial out_{h1}}{\partial net_{h1}}\times\frac{\partial net_{h1}}{\partial w_1}$$ The one difference is that $\frac{\partial E_{total}}{\partial out_{h1}}$ must sum the contributions from both output nodes, since $out_{h1}$ feeds into both $o_1$ and $o_2$.
The remaining steps are the same as before, giving $w_1^+$, $w_2^+$, $w_3^+$, and $w_4^+$.
In summary, by repeating the operations above, the parameters of every layer in the network can be updated, which is what training does.
During training, the weights are updated via these steps after every batch; after many epochs the network converges (the error becomes very small), the forward pass outputs are very close to the expected values, and training of the network is complete.
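Putting everything together, the whole procedure can be sketched as a small training loop. This is a minimal illustration under stated assumptions, not a general implementation: $w_3=0.25$, $w_4=0.30$, $w_7=0.50$, $w_8=0.55$, and $target_{o2}=0.99$ are assumed from the (missing) initialization figure, chosen to match the reported numbers, and the biases are kept fixed for simplicity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs and targets from the text; w3, w4, w7, w8 and target_o2 are
# assumed from the (missing) figure, consistent with the reported values.
i = [0.05, 0.10]
targets = [0.01, 0.99]
b1, b2 = 0.35, 0.60
eta = 0.5

def forward(w):
    """One forward pass; returns hidden and output activations."""
    out_h = [sigmoid(w["w1"] * i[0] + w["w2"] * i[1] + b1),
             sigmoid(w["w3"] * i[0] + w["w4"] * i[1] + b1)]
    out_o = [sigmoid(w["w5"] * out_h[0] + w["w6"] * out_h[1] + b2),
             sigmoid(w["w7"] * out_h[0] + w["w8"] * out_h[1] + b2)]
    return out_h, out_o

def step(w):
    """One forward pass plus one gradient descent update of all 8 weights."""
    out_h, out_o = forward(w)
    # Output deltas: dE/dnet_ok = -(target_k - out_ok) * out_ok * (1 - out_ok)
    d_o = [-(t - o) * o * (1 - o) for t, o in zip(targets, out_o)]
    # Hidden deltas sum the contributions from both output nodes
    d_h = [(d_o[0] * w["w5"] + d_o[1] * w["w7"]) * out_h[0] * (1 - out_h[0]),
           (d_o[0] * w["w6"] + d_o[1] * w["w8"]) * out_h[1] * (1 - out_h[1])]
    return {
        "w1": w["w1"] - eta * d_h[0] * i[0],
        "w2": w["w2"] - eta * d_h[0] * i[1],
        "w3": w["w3"] - eta * d_h[1] * i[0],
        "w4": w["w4"] - eta * d_h[1] * i[1],
        "w5": w["w5"] - eta * d_o[0] * out_h[0],
        "w6": w["w6"] - eta * d_o[0] * out_h[1],
        "w7": w["w7"] - eta * d_o[1] * out_h[0],
        "w8": w["w8"] - eta * d_o[1] * out_h[1],
    }

w = {"w1": 0.15, "w2": 0.20, "w3": 0.25, "w4": 0.30,
     "w5": 0.40, "w6": 0.45, "w7": 0.50, "w8": 0.55}
w_after_one = step(w)   # w_after_one["w5"] ~ 0.358916, as computed by hand

for _ in range(1000):
    w = step(w)

_, out_o = forward(w)
E_total = sum(0.5 * (t - o) ** 2 for t, o in zip(targets, out_o))
```

With these assumptions the first update reproduces $w_5^+\approx 0.358916$ from the hand calculation above, and the loss decreases steadily from its initial value of about 0.298371 as the iterations proceed.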