- Forward propagation:
$$
\begin{aligned}
a_3 &= g(W_{13}\times a_1+W_{23}\times a_2+b_1)\\
a_4 &= g(W_{14}\times a_1+W_{24}\times a_2+b_2)\\
a_5 &= g(W_{35}\times a_3+W_{45}\times a_4+b_3)
\end{aligned}
$$
- Backpropagation (the derivation uses $g'(z)=g(z)(1-g(z))$, i.e. $g$ is the sigmoid function):
$$L = -y\log(a_5)-(1-y)\log(1-a_5)$$
$$\frac{\partial L}{\partial a_5}=-\frac{y}{a_5}+\frac{1-y}{1-a_5}$$
$$\frac{\partial L}{\partial W_{35}}=\left(-\frac{y}{a_5}+\frac{1-y}{1-a_5}\right)\times a_5(1-a_5)\times a_3=(a_5-y)\times a_3$$
$$\frac{\partial L}{\partial W_{45}}=\left(-\frac{y}{a_5}+\frac{1-y}{1-a_5}\right)\times a_5(1-a_5)\times a_4=(a_5-y)\times a_4$$
$$\frac{\partial L}{\partial b_{3}}=\left(-\frac{y}{a_5}+\frac{1-y}{1-a_5}\right)\times a_5(1-a_5)\times 1=(a_5-y)$$
$$\frac{\partial L}{\partial a_{3}}=\left(-\frac{y}{a_5}+\frac{1-y}{1-a_5}\right)\times a_5(1-a_5)\times W_{35}=(a_5-y)\times W_{35}$$
$$\frac{\partial L}{\partial a_{4}}=\left(-\frac{y}{a_5}+\frac{1-y}{1-a_5}\right)\times a_5(1-a_5)\times W_{45}=(a_5-y)\times W_{45}$$
$$\frac{\partial L}{\partial W_{13}}=(a_5-y)\times W_{35}\times a_3(1-a_3)\times a_1$$
$$\frac{\partial L}{\partial W_{14}}=(a_5-y)\times W_{45}\times a_4(1-a_4)\times a_1$$
$$\frac{\partial L}{\partial W_{23}}=(a_5-y)\times W_{35}\times a_3(1-a_3)\times a_2$$
$$\frac{\partial L}{\partial W_{24}}=(a_5-y)\times W_{45}\times a_4(1-a_4)\times a_2$$
$$\frac{\partial L}{\partial b_{2}}=(a_5-y)\times W_{45}\times a_4(1-a_4)$$
$$\frac{\partial L}{\partial b_{1}}=(a_5-y)\times W_{35}\times a_3(1-a_3)$$
$$\frac{\partial L}{\partial a_{1}}=(a_5-y)\times W_{35}\times a_3(1-a_3)\times W_{13}+(a_5-y)\times W_{45}\times a_4(1-a_4)\times W_{14}$$
$$\frac{\partial L}{\partial a_{2}}=(a_5-y)\times W_{35}\times a_3(1-a_3)\times W_{23}+(a_5-y)\times W_{45}\times a_4(1-a_4)\times W_{24}$$
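
The chain-rule gradients for this 2-2-1 network can be sanity-checked against finite differences. The following is a minimal sketch, not a definitive implementation: it assumes $g$ is the sigmoid, treats every quantity as a scalar, and uses arbitrary illustrative inputs. The function names and the dictionary layout are hypothetical choices made for this example.

```python
import numpy as np

NAMES = ["W13", "W14", "W23", "W24", "W35", "W45", "b1", "b2", "b3"]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(p, a1, a2):
    # Hidden units a3, a4 and output a5, as in the forward-propagation equations.
    a3 = sigmoid(p["W13"] * a1 + p["W23"] * a2 + p["b1"])
    a4 = sigmoid(p["W14"] * a1 + p["W24"] * a2 + p["b2"])
    a5 = sigmoid(p["W35"] * a3 + p["W45"] * a4 + p["b3"])
    return a3, a4, a5

def loss(p, a1, a2, y):
    # Binary cross-entropy on the single output a5.
    _, _, a5 = forward(p, a1, a2)
    return -y * np.log(a5) - (1 - y) * np.log(1 - a5)

def analytic_gradients(p, a1, a2, y):
    a3, a4, a5 = forward(p, a1, a2)
    d5 = a5 - y                          # dL/da5 times sigmoid'(z5)
    d3 = d5 * p["W35"] * a3 * (1 - a3)   # error signal at hidden unit 3
    d4 = d5 * p["W45"] * a4 * (1 - a4)   # error signal at hidden unit 4
    return {
        "W35": d5 * a3, "W45": d5 * a4, "b3": d5,
        "W13": d3 * a1, "W23": d3 * a2, "b1": d3,
        "W14": d4 * a1, "W24": d4 * a2, "b2": d4,
    }

def numerical_gradients(p, a1, a2, y, eps=1e-6):
    # Central finite differences, one parameter at a time.
    grads = {}
    for k in p:
        plus, minus = dict(p), dict(p)
        plus[k] += eps
        minus[k] -= eps
        grads[k] = (loss(plus, a1, a2, y) - loss(minus, a1, a2, y)) / (2 * eps)
    return grads

# Arbitrary illustrative parameters and a single training example.
rng = np.random.default_rng(0)
params = {k: float(rng.normal()) for k in NAMES}
a1, a2, y = 0.5, -1.2, 1.0

analytic = analytic_gradients(params, a1, a2, y)
numeric = numerical_gradients(params, a1, a2, y)
max_err = max(abs(analytic[k] - numeric[k]) for k in NAMES)
print("largest analytic/numeric gap:", max_err)
```

If the analytic expressions are right, the reported gap should be tiny compared with the gradients themselves.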
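- When all $W$ are set to $0$ and all $b$ are set to $0$
In the first gradient update, only $W_{35}, W_{45}, b_{3}$ change. The gradients of all other parameters contain a factor $W_{35}$ or $W_{45}$, so they are $0$ and those parameters stay unchanged. In the second gradient update, $a_3=a_4$ and $W_{35}=W_{45}$, so the two hidden units receive identical updates: $W_{13}$ and $W_{14}$ change by the same amount, as do $W_{23}$ and $W_{24}$, and $b_{1}$ and $b_2$. The third, fourth, and all later updates behave like the second, so the two hidden units never become different from each other.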
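- When all $W$ are set to $0$ and $b$ is randomly initialized
In the first gradient update, only $W_{35}, W_{45}, b_{3}$ change; the gradients of the other parameters are $0$, so they stay unchanged. In the second gradient update, all parameters are updated. This scheme suffers from problems such as slow updates, vanishing gradients, and exploding gradients, so it is usually not chosen in practice.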
- When $W$ is randomly initialized and $b$ is set to $0$
During backpropagation, every $W$ and every $b$ is updated.
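
The three initialization cases can be reproduced with a few steps of plain gradient descent on a single training example. This is a sketch under stated assumptions: $g$ is the sigmoid, the inputs `a1=1.0`, `a2=2.0`, the label `y=1.0`, and the learning rate are arbitrary, and the fixed nonzero values below merely stand in for a random draw so that the run is reproducible. The `train` helper is hypothetical, not taken from any library.

```python
import numpy as np

W_NAMES = ["W13", "W14", "W23", "W24", "W35", "W45"]
HIDDEN = ["W13", "W14", "W23", "W24", "b1", "b2"]  # first-layer parameters

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(params, steps=3, lr=0.5, a1=1.0, a2=2.0, y=1.0):
    """Run a few gradient-descent steps; return final params and per-step gradients."""
    p = dict(params)
    history = []
    for _ in range(steps):
        # Forward pass.
        a3 = sigmoid(p["W13"] * a1 + p["W23"] * a2 + p["b1"])
        a4 = sigmoid(p["W14"] * a1 + p["W24"] * a2 + p["b2"])
        a5 = sigmoid(p["W35"] * a3 + p["W45"] * a4 + p["b3"])
        # Backward pass (cross-entropy loss, sigmoid activations).
        d5 = a5 - y
        d3 = d5 * p["W35"] * a3 * (1 - a3)
        d4 = d5 * p["W45"] * a4 * (1 - a4)
        g = {
            "W35": d5 * a3, "W45": d5 * a4, "b3": d5,
            "W13": d3 * a1, "W23": d3 * a2, "b1": d3,
            "W14": d4 * a1, "W24": d4 * a2, "b2": d4,
        }
        history.append(g)
        p = {k: p[k] - lr * g[k] for k in p}
    return p, history

zero_W = {k: 0.0 for k in W_NAMES}
zero_b = {"b1": 0.0, "b2": 0.0, "b3": 0.0}
# Fixed nonzero values standing in for a random initialization (assumption).
rand_b = {"b1": 0.3, "b2": -0.7, "b3": 0.1}
rand_W = {"W13": 0.2, "W14": -0.4, "W23": 0.1,
          "W24": 0.3, "W35": 0.5, "W45": -0.6}

p_zero, h_zero = train({**zero_W, **zero_b})   # case 1: W = 0, b = 0
p_wb, h_wb = train({**zero_W, **rand_b})       # case 2: W = 0, b "random"
p_rand, h_rand = train({**rand_W, **zero_b})   # case 3: W "random", b = 0

for name, p, h in [("W=0, b=0", p_zero, h_zero),
                   ("W=0, b random", p_wb, h_wb),
                   ("W random, b=0", p_rand, h_rand)]:
    frozen = [k for k in HIDDEN if h[0][k] == 0]
    print(name)
    print("  first-layer parameters with zero gradient at step 1:", frozen)
    print("  W13 vs W14 after training:", p["W13"], p["W14"])
```

Comparing `W13` with `W14` after training shows whether the two hidden units have separated: in the all-zero case they should stay equal, while in the other two cases they should differ.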