Backpropagation intuition
Consider a simple two-layer shallow neural network whose first layer uses $\tanh(z)$ as its activation function and whose second layer uses $\mathrm{sigmoid}(z)$.
The network architecture is shown in the figure below:
Expressed as a computational graph, it looks like the figure below:
In the formulas below, $\log a^{[2]}$ means $\ln a^{[2]}$; $da^{[2]}$, $dz^{[2]}$, and so on are shorthand symbols for the corresponding derivatives of the loss. The formulas below are for a single instance and are not yet vectorized.
$$L(a^{[2]}, y) = -y \log a^{[2]} - (1-y)\log(1-a^{[2]}) \tag{1.1}$$
$$da^{[2]}_{[1\times 1]} = \frac{d}{da^{[2]}}L(a^{[2]}, y) = -\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}} \tag{1.2}$$
$$g(z^{[2]}) = \mathrm{sigmoid}(z^{[2]}) = a^{[2]} \tag{1.3}$$
$$\begin{aligned}
dz^{[2]}_{[1\times 1]} &= \frac{d}{dz^{[2]}}L(a^{[2]}, y) = \frac{d}{da^{[2]}}L(a^{[2]}, y)\cdot\frac{d}{dz^{[2]}}a^{[2]} \\
&= da^{[2]}\cdot g'(z^{[2]}) \\
&= \left(-\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}}\right)\cdot\left(g(z^{[2]})\,(1-g(z^{[2]}))\right) \\
&= \left(-\frac{y}{a^{[2]}} + \frac{1-y}{1-a^{[2]}}\right)\cdot a^{[2]}\cdot(1-a^{[2]}) \\
&= a^{[2]} - y
\end{aligned} \tag{1.4}$$
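The simplification in (1.4) can be sanity-checked numerically. The sketch below is only an illustration: the helper names `loss`, `dz2_analytic`, and `dz2_numeric` are hypothetical and not part of the original post. It compares $a^{[2]} - y$ against a central finite difference of the loss with respect to $z^{[2]}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z2, y):
    # Cross-entropy loss (1.1), written as a function of the pre-activation z2.
    a2 = sigmoid(z2)
    return -y * np.log(a2) - (1 - y) * np.log(1 - a2)

def dz2_analytic(z2, y):
    # Closed form from (1.4): dz2 = a2 - y.
    return sigmoid(z2) - y

def dz2_numeric(z2, y, eps=1e-6):
    # Central finite difference of the loss with respect to z2.
    return (loss(z2 + eps, y) - loss(z2 - eps, y)) / (2 * eps)

# Illustrative values, chosen arbitrarily.
z2, y = 0.3, 1.0
print(dz2_analytic(z2, y), dz2_numeric(z2, y))
```

The two printed values should agree to several decimal places, which is the usual way to gain confidence in a hand-derived gradient.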
$$\begin{aligned}
dW^{[2]}_{[1\times 4]} &= \frac{d}{dW^{[2]}}L(a^{[2]}, y) = \frac{d}{da^{[2]}}L(a^{[2]}, y)\cdot\frac{d}{dz^{[2]}}a^{[2]}\cdot\frac{d}{dW^{[2]}}z^{[2]} \\
&= dz^{[2]}\cdot a^{[1]} = dz^{[2]}_{[1\times 1]}\left(a^{[1]}_{[4\times 1]}\right)^T
\end{aligned} \tag{1.5}$$
$$db^{[2]}_{[1\times 1]} = \frac{d}{db^{[2]}}L(a^{[2]}, y) = \frac{d}{da^{[2]}}L(a^{[2]}, y)\cdot\frac{d}{dz^{[2]}}a^{[2]}\cdot\frac{d}{db^{[2]}}z^{[2]} = dz^{[2]}_{[1\times 1]} \tag{1.6}$$
$$\begin{aligned}
da^{[1]}_{[4\times 1]} &= \frac{d}{da^{[1]}}L(a^{[2]}, y) = \frac{d}{da^{[2]}}L(a^{[2]}, y)\cdot\frac{d}{dz^{[2]}}a^{[2]}\cdot\frac{d}{da^{[1]}}z^{[2]} \\
&= dz^{[2]}\cdot W^{[2]} = \left(W^{[2]}_{[1\times 4]}\right)^T dz^{[2]}_{[1\times 1]}
\end{aligned} \tag{1.7}$$
$$g(z^{[1]}) = \tanh(z^{[1]}) = a^{[1]} \tag{1.8}$$
$$\begin{aligned}
dz^{[1]}_{[4\times 1]} &= \frac{d}{dz^{[1]}}L(a^{[2]}, y) = \frac{d}{da^{[2]}}L(a^{[2]}, y)\cdot\frac{d}{dz^{[2]}}a^{[2]}\cdot\frac{d}{da^{[1]}}z^{[2]}\cdot\frac{d}{dz^{[1]}}a^{[1]} \\
&= da^{[1]}\cdot g'(z^{[1]}) = \left(W^{[2]}_{[1\times 4]}\right)^T dz^{[2]}_{[1\times 1]} \ast g'(z^{[1]})_{[4\times 1]}
\end{aligned} \tag{1.9}$$
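One step that (1.9) leaves implicit is the concrete form of $g'(z^{[1]})$ for the tanh activation. Since the derivative of $\tanh$ is $1 - \tanh^2$, it can be written in terms of the activation $a^{[1]}$ already computed in the forward pass, with $\ast$ denoting the element-wise product as above:

$$g'(z^{[1]}) = 1 - \tanh^2(z^{[1]}) = 1 - \left(a^{[1]}\right)^2 \quad\Longrightarrow\quad dz^{[1]} = \left(W^{[2]}\right)^T dz^{[2]} \ast \left(1 - a^{[1]} \ast a^{[1]}\right)$$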
$$\begin{aligned}
dW^{[1]}_{[4\times 3]} &= \frac{d}{dW^{[1]}}L(a^{[2]}, y) = \frac{d}{da^{[2]}}L(a^{[2]}, y)\cdot\frac{d}{dz^{[2]}}a^{[2]}\cdot\frac{d}{da^{[1]}}z^{[2]}\cdot\frac{d}{dz^{[1]}}a^{[1]}\cdot\frac{d}{dW^{[1]}}z^{[1]} \\
&= dz^{[1]}\cdot x = dz^{[1]}_{[4\times 1]}\left(a^{[0]}_{[3\times 1]}\right)^T
\end{aligned} \tag{1.10}$$
$$\begin{aligned}
db^{[1]}_{[4\times 1]} &= \frac{d}{db^{[1]}}L(a^{[2]}, y) = \frac{d}{da^{[2]}}L(a^{[2]}, y)\cdot\frac{d}{dz^{[2]}}a^{[2]}\cdot\frac{d}{da^{[1]}}z^{[2]}\cdot\frac{d}{dz^{[1]}}a^{[1]}\cdot\frac{d}{db^{[1]}}z^{[1]} \\
&= dz^{[1]}_{[4\times 1]}
\end{aligned} \tag{1.11}$$
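The single-instance formulas (1.4) through (1.11) translate almost line for line into NumPy. The following is a minimal sketch under assumptions: the layer sizes (3 inputs, 4 hidden units, 1 output) are read off the dimension annotations above, while the function names `forward` and `backward`, the random initialization, and the sample values are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1        # shape (4, 1)
    a1 = np.tanh(z1)        # (1.8)
    z2 = W2 @ a1 + b2       # shape (1, 1)
    a2 = sigmoid(z2)        # (1.3)
    return z1, a1, z2, a2

def backward(params, x, y):
    W1, b1, W2, b2 = params
    z1, a1, z2, a2 = forward(params, x)
    dz2 = a2 - y                    # (1.4), shape (1, 1)
    dW2 = dz2 @ a1.T                # (1.5), shape (1, 4)
    db2 = dz2                       # (1.6), shape (1, 1)
    da1 = W2.T @ dz2                # (1.7), shape (4, 1)
    dz1 = da1 * (1 - a1 ** 2)       # (1.9), tanh'(z1) = 1 - a1^2
    dW1 = dz1 @ x.T                 # (1.10), shape (4, 3)
    db1 = dz1                       # (1.11), shape (4, 1)
    return dW1, db1, dW2, db2

# Hypothetical parameters and a single training instance.
rng = np.random.default_rng(0)
params = (rng.standard_normal((4, 3)), rng.standard_normal((4, 1)),
          rng.standard_normal((1, 4)), rng.standard_normal((1, 1)))
x = rng.standard_normal((3, 1))
y = 1.0

dW1, db1, dW2, db2 = backward(params, x, y)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)  # (4, 3) (4, 1) (1, 4) (1, 1)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick check that the transposes in (1.5), (1.7), and (1.10) are in the right place.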
Below are the backpropagation formulas after vectorization:
$$L(A^{[2]}, Y) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log A^{[2](i)} - (1-y^{(i)})\log\left(1-A^{[2](i)}\right)\right] \tag{2.1}$$
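As a small illustration of (2.1), the cost over $m$ examples can be computed with element-wise NumPy operations. The function name `cross_entropy_cost` and the sample arrays are assumptions made for this sketch, not part of the original post.

```python
import numpy as np

def cross_entropy_cost(A2, Y):
    # A2 and Y both have shape (1, m); np.log is the natural logarithm.
    m = Y.shape[1]
    losses = -Y * np.log(A2) - (1 - Y) * np.log(1 - A2)
    return float(np.sum(losses) / m)

# Hypothetical predictions and labels for m = 3 examples.
A2 = np.array([[0.9, 0.2, 0.5]])
Y = np.array([[1.0, 0.0, 1.0]])
cost = cross_entropy_cost(A2, Y)
print(cost)
```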
$$dA^{[2]}_{[1\times m]} = \left[\left(-\frac{Y^{(1)}}{A^{[2](1)}} + \frac{1-Y^{(1)}}{1-A^{[2](1)}}\right), \cdots, \left(-\frac{Y^{(m)}}{A^{[2](m)}} + \frac{1-Y^{(m)}}{1-A^{[2](m)}}\right)\right] \tag{2.2}$$
$$\begin{aligned}
dZ^{[2]}_{[1\times m]} &= \left[\left(-\frac{Y^{(1)}}{A^{[2](1)}} + \frac{1-Y^{(1)}}{1-A^{[2](1)}}\right), \cdots, \left(-\frac{Y^{(m)}}{A^{[2](m)}} + \frac{1-Y^{(m)}}{1-A^{[2](m)}}\right)\right] \\
&\quad \ast \left[A^{[2](1)}\left(1-A^{[2](1)}\right), \cdots, A^{[2](m)}\left(1-A^{[2](m)}\right)\right] \\
&= \left[\left(A^{[2](1)} - Y^{(1)}\right), \cdots, \left(A^{[2](m)} - Y^{(m)}\right)\right] \\
&= A^{[2]} - Y
\end{aligned} \tag{2.3}$$
$$dW^{[2]}_{[1\times 4]} = \frac{1}{m}\, dZ^{[2]}_{[1\times m]}\left(A^{[1]}_{[4\times m]}\right)^T \tag{2.4}$$
$$db^{[2]}_{[1\times 1]} = \frac{1}{m}\,\mathrm{np.sum}\left(dZ^{[2]},\ \mathrm{axis}=1,\ \mathrm{keepdims}=\mathrm{True}\right) \tag{2.5}$$
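The `axis=1, keepdims=True` arguments in (2.5) matter for the shape of the result. The short sketch below, using made-up values for $dZ^{[2]}$, shows that `keepdims=True` keeps the bias gradient as a `(1, 1)` column, matching the shape of $b^{[2]}$, whereas omitting it collapses the result to a 1-D array.

```python
import numpy as np

# Hypothetical dZ2 for m = 3 examples, shape (1, m).
dZ2 = np.array([[0.2, -0.5, 0.1]])
m = dZ2.shape[1]

db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # stays 2-D
db2_flat = np.sum(dZ2, axis=1) / m             # summed axis is dropped

print(db2.shape, db2_flat.shape)  # (1, 1) (1,)
```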
$$dZ^{[1]}_{[4\times m]} = \left(W^{[2]}_{[1\times 4]}\right)^T dZ^{[2]}_{[1\times m]} \ast g^{[1]\prime}\left(Z^{[1]}\right)_{[4\times m]}$$
$$dW^{[1]}_{[4\times 3]} = \frac{1}{m}\, dZ^{[1]}_{[4\times m]}\left(A^{[0]}_{[3\times m]}\right)^T \tag{2.6}$$
$$db^{[1]}_{[4\times 1]} = \frac{1}{m}\,\mathrm{np.sum}\left(dZ^{[1]}_{[4\times m]},\ \mathrm{axis}=1,\ \mathrm{keepdims}=\mathrm{True}\right) \tag{2.7}$$
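Put together, the vectorized formulas can be sketched as one NumPy function. This is an illustration under assumptions rather than a reference implementation: the function name `backward_vectorized`, the zero-initialized biases, and the random data are hypothetical, and the forward pass is included only so the example is self-contained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward_vectorized(W1, b1, W2, b2, X, Y):
    m = X.shape[1]
    # Forward pass over all m examples at once (biases broadcast across columns).
    Z1 = W1 @ X + b1                                   # (4, m)
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2                                  # (1, m)
    A2 = sigmoid(Z2)
    # Backward pass.
    dZ2 = A2 - Y                                       # (2.3)
    dW2 = dZ2 @ A1.T / m                               # (2.4)
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m       # (2.5)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)                 # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m                                # (2.6)
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m       # (2.7)
    return dW1, db1, dW2, db2

# Hypothetical parameters and data: 3 features, 4 hidden units, m = 5 examples.
rng = np.random.default_rng(0)
m = 5
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))
X = rng.standard_normal((3, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)

dW1, db1, dW2, db2 = backward_vectorized(W1, b1, W2, b2, X, Y)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)  # (4, 3) (4, 1) (1, 4) (1, 1)
```

Because the matrix products in (2.4) and (2.6) sum over the example dimension, dividing by `m` yields the average of the per-example gradients from the single-instance formulas.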