$$
J(\Theta) = -\frac{1}{m} \sum_{t=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(t)} \log\left(h_\Theta(x^{(t)})_k\right) + \left(1 - y_k^{(t)}\right) \log\left(1 - h_\Theta(x^{(t)})_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{j,i}^{(l)}\right)^2
$$
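As a rough illustration, the cost above can be evaluated directly in plain Python. This is only a sketch under stated assumptions: the helper names `sigmoid`, `forward` and `cost`, the nested-list layout of the weights (bias weight in column 0 of each matrix), and the sample values are hypothetical and not part of the original text, and $g$ is taken to be the sigmoid function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Return h_Theta(x); each theta is a nested list with the bias weight in column 0."""
    a = list(x)
    for theta in thetas:
        a_with_bias = [1.0] + a
        a = [sigmoid(sum(w * v for w, v in zip(row, a_with_bias))) for row in theta]
    return a

def cost(thetas, xs, ys, lam):
    """Regularized cross-entropy cost J(Theta) over m examples."""
    m = len(xs)
    data_term = 0.0
    for x, y in zip(xs, ys):
        h = forward(thetas, x)
        data_term += sum(yk * math.log(hk) + (1 - yk) * math.log(1 - hk)
                         for yk, hk in zip(y, h))
    # The regularization sum starts at i = 1, so column 0 (bias weights) is skipped.
    reg_term = sum(w * w for theta in thetas for row in theta for w in row[1:])
    return -data_term / m + lam / (2 * m) * reg_term

# Hypothetical toy network: one weight matrix, 2 inputs -> 1 output.
thetas = [[[0.5, 1.0, -1.0]]]
xs, ys = [[1.0, 2.0]], [[1]]
unregularized = cost(thetas, xs, ys, lam=0.0)
regularized = cost(thetas, xs, ys, lam=2.0)
```

With `lam=2.0` and `m=1`, the two results differ by exactly the penalty term, which here only involves the two non-bias weights.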
Consider a single training example (x, y) and ignore regularization. The cost function (with K = 4) then becomes:
$$
\mathrm{Cost}(x) = \sum_{k=1}^{K} \mathrm{Cost}(x)_k
$$

$$
\mathrm{Cost}(x)_k = -\left[ y_k \log\left(h_\Theta(x)_k\right) + (1 - y_k) \log\left(1 - h_\Theta(x)_k\right) \right]
$$
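A minimal sketch of the single-example cost, assuming the network outputs $h_\Theta(x)_k$ are already available as a list. The function name `cost_single` and the sample values of `h` and `y` are made up for illustration.

```python
import math

def cost_single(h, y):
    """Sum over k of Cost(x)_k for one example, without regularization."""
    return sum(-(yk * math.log(hk) + (1 - yk) * math.log(1 - hk))
               for hk, yk in zip(h, y))

h = [0.9, 0.1, 0.2, 0.05]   # hypothetical outputs h_Theta(x)_k, K = 4
y = [1, 0, 0, 0]            # one-hot label
total = cost_single(h, y)
```

Each output unit contributes its own term, so a confident correct prediction (here the first unit) adds only a small amount to the total.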
2. Deriving $\frac{\partial \mathrm{Cost}(x)}{\partial \Theta^{(3)}}$
For this neural network, we first state a few facts:
$h_\Theta(x)$ is exactly $a^{(4)}$, that is, $h_\Theta(x) = a^{(4)}$.
$a^{(4)} = g(z^{(4)})$ with $g$ the sigmoid function, and $\frac{\partial a^{(4)}}{\partial z^{(4)}} = a^{(4)}(1 - a^{(4)})$.
$z_j^{(4)} = \sum_{i=0}^{5} \Theta_{ji}^{(3)} a_i^{(3)}$, where $1 \leqslant j \leqslant 4$.
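The second fact, that the sigmoid derivative equals $a(1 - a)$, can be checked numerically. The sketch below compares it with a central finite difference; the helper names and the test points are illustrative assumptions, not part of the original text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative g'(z) = a (1 - a), with a = g(z)."""
    a = sigmoid(z)
    return a * (1.0 - a)

def numeric_grad(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Largest disagreement between the two derivatives at a few sample points.
max_err = max(abs(sigmoid_grad(z) - numeric_grad(sigmoid, z))
              for z in (-2.0, 0.0, 1.5))
```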
$$
\begin{aligned}
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta_{10}^{(3)}}
&= \frac{\partial \sum_{k=1}^{K} \mathrm{Cost}(x)_k}{\partial \Theta_{10}^{(3)}}
 = \sum_{k=1}^{K} \frac{\partial \mathrm{Cost}(x)_k}{\partial \Theta_{10}^{(3)}} \\
&= \frac{\partial \mathrm{Cost}(x)_1}{\partial \Theta_{10}^{(3)}} + 0 + 0 + 0 \\
&= \frac{\partial \mathrm{Cost}(x)_1}{\partial h_\Theta(x)_1} \times \frac{\partial h_\Theta(x)_1}{\partial z_1^{(4)}} \times \frac{\partial z_1^{(4)}}{\partial \Theta_{10}^{(3)}} \\
&= -\left[ y_1 \frac{1}{h_\Theta(x)_1} + (1 - y_1) \frac{-1}{1 - h_\Theta(x)_1} \right] \times h_\Theta(x)_1 \left(1 - h_\Theta(x)_1\right) \times a_0^{(3)} \\
&= \left[ h_\Theta(x)_1 - y_1 \right] a_0^{(3)} \\
&= \left[ a_1^{(4)} - y_1 \right] a_0^{(3)}
\end{aligned}
$$

The terms with $k \neq 1$ vanish because $\Theta_{10}^{(3)}$ enters only $z_1^{(4)}$, and therefore only $h_\Theta(x)_1$.
$$
\begin{aligned}
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta_{20}^{(3)}}
&= 0 + \frac{\partial \mathrm{Cost}(x)_2}{\partial \Theta_{20}^{(3)}} + 0 + 0 \\
&= \frac{\partial \mathrm{Cost}(x)_2}{\partial h_\Theta(x)_2} \times \frac{\partial h_\Theta(x)_2}{\partial z_2^{(4)}} \times \frac{\partial z_2^{(4)}}{\partial \Theta_{20}^{(3)}} \\
&= -\left[ y_2 \frac{1}{h_\Theta(x)_2} + (1 - y_2) \frac{-1}{1 - h_\Theta(x)_2} \right] \times h_\Theta(x)_2 \left(1 - h_\Theta(x)_2\right) \times a_0^{(3)} \\
&= \left[ h_\Theta(x)_2 - y_2 \right] a_0^{(3)}
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta_{21}^{(3)}}
&= 0 + \frac{\partial \mathrm{Cost}(x)_2}{\partial \Theta_{21}^{(3)}} + 0 + 0 \\
&= \frac{\partial \mathrm{Cost}(x)_2}{\partial h_\Theta(x)_2} \times \frac{\partial h_\Theta(x)_2}{\partial z_2^{(4)}} \times \frac{\partial z_2^{(4)}}{\partial \Theta_{21}^{(3)}} \\
&= -\left[ y_2 \frac{1}{h_\Theta(x)_2} + (1 - y_2) \frac{-1}{1 - h_\Theta(x)_2} \right] \times h_\Theta(x)_2 \left(1 - h_\Theta(x)_2\right) \times a_1^{(3)} \\
&= \left[ h_\Theta(x)_2 - y_2 \right] a_1^{(3)}
\end{aligned}
$$
In summary:

$$
\begin{aligned}
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta_{ji}^{(3)}}
&= \frac{\partial \mathrm{Cost}(x)_j}{\partial h_\Theta(x)_j} \times \frac{\partial h_\Theta(x)_j}{\partial z_j^{(4)}} \times \frac{\partial z_j^{(4)}}{\partial \Theta_{ji}^{(3)}} \\
&= -\left[ y_j \frac{1}{h_\Theta(x)_j} + (1 - y_j) \frac{-1}{1 - h_\Theta(x)_j} \right] \times h_\Theta(x)_j \left(1 - h_\Theta(x)_j\right) \times a_i^{(3)} \\
&= \left[ h_\Theta(x)_j - y_j \right] a_i^{(3)}
\end{aligned}
$$
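The last simplification in the derivation above may be easier to follow when expanded. Writing $h_j$ as shorthand for $h_\Theta(x)_j$ (a notation introduced here only for brevity), the algebra is:

$$
\begin{aligned}
-\left[ \frac{y_j}{h_j} - \frac{1 - y_j}{1 - h_j} \right] h_j (1 - h_j)
&= -\left[ y_j (1 - h_j) - (1 - y_j) h_j \right] \\
&= -\left[ y_j - y_j h_j - h_j + y_j h_j \right] \\
&= h_j - y_j
\end{aligned}
$$

The factor $h_j(1 - h_j)$ from the sigmoid derivative cancels both denominators, which is why the result is so compact.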
Introducing $\delta^{(4)} = a^{(4)} - y$ and using $h_\Theta(x) = a^{(4)}$, we obtain

$$
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta_{ji}^{(3)}} = \delta_j^{(4)} a_i^{(3)}
$$
Dropping the matrix subscripts, the formula becomes:

$$
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta^{(3)}} = \delta^{(4)} \left(a^{(3)}\right)^T
$$
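The outer-product form $\delta^{(4)} (a^{(3)})^T$ can be sanity-checked against a finite-difference gradient. The following is a sketch only: the function names, the weight values and the activations are hypothetical, chosen to match the 4 outputs and 5 hidden units plus bias used in this section. Python indices start at 0, so row `j` in the code corresponds to output unit $j + 1$ in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_layer(theta3, a3):
    """Compute a4 = g(Theta3 a3); a3 already includes the bias unit a3[0] = 1."""
    return [sigmoid(sum(w * a for w, a in zip(row, a3))) for row in theta3]

def cost(theta3, a3, y):
    h = output_layer(theta3, a3)
    return sum(-(yk * math.log(hk) + (1 - yk) * math.log(1 - hk))
               for hk, yk in zip(h, y))

def grad_theta3(theta3, a3, y):
    """Analytic gradient delta4 (a3)^T as a 4 x 6 nested list."""
    delta4 = [hk - yk for hk, yk in zip(output_layer(theta3, a3), y)]
    return [[dj * ai for ai in a3] for dj in delta4]

def numeric_grad(theta3, a3, y, j, i, eps=1e-6):
    """Central finite difference with respect to theta3[j][i]."""
    plus = [row[:] for row in theta3]
    minus = [row[:] for row in theta3]
    plus[j][i] += eps
    minus[j][i] -= eps
    return (cost(plus, a3, y) - cost(minus, a3, y)) / (2 * eps)

# Hypothetical values: 5 hidden units plus bias, K = 4 outputs.
a3 = [1.0, 0.2, 0.7, 0.4, 0.9, 0.1]
y = [0, 1, 0, 0]
theta3 = [[0.1 * (j + 1) - 0.05 * i for i in range(6)] for j in range(4)]

analytic = grad_theta3(theta3, a3, y)
max_err = max(abs(analytic[j][i] - numeric_grad(theta3, a3, y, j, i))
              for j in range(4) for i in range(6))
```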
3. Deriving $\frac{\partial \mathrm{Cost}(x)}{\partial \Theta^{(2)}}$ and $\frac{\partial \mathrm{Cost}(x)}{\partial \Theta^{(1)}}$
Examples:

$$
\begin{aligned}
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta_{10}^{(2)}}
&= \sum_{k=1}^{K} \frac{\partial \mathrm{Cost}(x)_k}{\partial \Theta_{10}^{(2)}} \\
&= \sum_{k=1}^{K} \left[ \frac{\partial \mathrm{Cost}(x)_k}{\partial h_\Theta(x)_k} \times \frac{\partial h_\Theta(x)_k}{\partial z_k^{(4)}} \times \frac{\partial z_k^{(4)}}{\partial a_1^{(3)}} \times \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}} \times \frac{\partial z_1^{(3)}}{\partial \Theta_{10}^{(2)}} \right] \\
&= \sum_{k=1}^{K} \left[ \left(h_\Theta(x)_k - y_k\right) \times \Theta_{k1}^{(3)} \times a_1^{(3)} \left(1 - a_1^{(3)}\right) \times a_0^{(2)} \right]
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta_{20}^{(2)}}
&= \sum_{k=1}^{K} \frac{\partial \mathrm{Cost}(x)_k}{\partial \Theta_{20}^{(2)}} \\
&= \sum_{k=1}^{K} \left[ \frac{\partial \mathrm{Cost}(x)_k}{\partial h_\Theta(x)_k} \times \frac{\partial h_\Theta(x)_k}{\partial z_k^{(4)}} \times \frac{\partial z_k^{(4)}}{\partial a_2^{(3)}} \times \frac{\partial a_2^{(3)}}{\partial z_2^{(3)}} \times \frac{\partial z_2^{(3)}}{\partial \Theta_{20}^{(2)}} \right] \\
&= \sum_{k=1}^{K} \left[ \left(h_\Theta(x)_k - y_k\right) \times \Theta_{k2}^{(3)} \times a_2^{(3)} \left(1 - a_2^{(3)}\right) \times a_0^{(2)} \right]
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta_{21}^{(2)}}
&= \sum_{k=1}^{K} \frac{\partial \mathrm{Cost}(x)_k}{\partial \Theta_{21}^{(2)}} \\
&= \sum_{k=1}^{K} \left[ \frac{\partial \mathrm{Cost}(x)_k}{\partial h_\Theta(x)_k} \times \frac{\partial h_\Theta(x)_k}{\partial z_k^{(4)}} \times \frac{\partial z_k^{(4)}}{\partial a_2^{(3)}} \times \frac{\partial a_2^{(3)}}{\partial z_2^{(3)}} \times \frac{\partial z_2^{(3)}}{\partial \Theta_{21}^{(2)}} \right] \\
&= \sum_{k=1}^{K} \left[ \left(h_\Theta(x)_k - y_k\right) \times \Theta_{k2}^{(3)} \times a_2^{(3)} \left(1 - a_2^{(3)}\right) \times a_1^{(2)} \right]
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta_{32}^{(2)}}
&= \sum_{k=1}^{K} \frac{\partial \mathrm{Cost}(x)_k}{\partial \Theta_{32}^{(2)}} \\
&= \sum_{k=1}^{K} \left[ \frac{\partial \mathrm{Cost}(x)_k}{\partial h_\Theta(x)_k} \times \frac{\partial h_\Theta(x)_k}{\partial z_k^{(4)}} \times \frac{\partial z_k^{(4)}}{\partial a_3^{(3)}} \times \frac{\partial a_3^{(3)}}{\partial z_3^{(3)}} \times \frac{\partial z_3^{(3)}}{\partial \Theta_{32}^{(2)}} \right] \\
&= \sum_{k=1}^{K} \left[ \left(h_\Theta(x)_k - y_k\right) \times \Theta_{k3}^{(3)} \times a_3^{(3)} \left(1 - a_3^{(3)}\right) \times a_2^{(2)} \right]
\end{aligned}
$$
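Recall that $\delta^{(4)} = a^{(4)} - y$ and $h_\Theta(x) = a^{(4)}$. Generalizing from the examples above (here $(\Theta^{(3)})^T_{j:}$ denotes row $j$ of the matrix $(\Theta^{(3)})^T$; the first multiplication sign is a matrix product, the others are products of real numbers):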
$$
\begin{aligned}
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta_{ji}^{(2)}}
&= \sum_{k=1}^{K} \left[ \left(h_\Theta(x)_k - y_k\right) \times \Theta_{kj}^{(3)} \times a_j^{(3)} \left(1 - a_j^{(3)}\right) \times a_i^{(2)} \right] \\
&= \sum_{k=1}^{K} \left[ \delta_k^{(4)} \times \Theta_{kj}^{(3)} \right] \times a_j^{(3)} \left(1 - a_j^{(3)}\right) \times a_i^{(2)} \\
&= \left(\Theta^{(3)}\right)^T_{j:} \times \delta^{(4)} \times a_j^{(3)} \left(1 - a_j^{(3)}\right) \times a_i^{(2)}
\end{aligned}
$$
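Dropping the matrix subscripts, the formula becomes (the two multiplication signs denote matrix products, .* denotes element-wise multiplication):

$$
\frac{\partial \mathrm{Cost}(x)}{\partial \Theta^{(2)}} = \left( \left(\Theta^{(3)}\right)^T \times \delta^{(4)} \right) \mathbin{.*} a^{(3)} \mathbin{.*} \left(1 - a^{(3)}\right) \times \left(a^{(2)}\right)^T
$$

Here $a^{(2)}$ must be transposed into a row vector, exactly as in $\delta^{(4)} (a^{(3)})^T$, so that the product is a matrix. The entry of the left factor that belongs to the bias unit ($j = 0$) is discarded, which gives the result the same shape as $\Theta^{(2)}$.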
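The matrix expression for the $\Theta^{(2)}$ gradient can be checked the same way as the output layer. The sketch below assumes a deliberately small, hypothetical network (2 units plus bias in layer 2, 3 units in layer 3, 2 outputs) rather than the network of this section. All names and numbers are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def forward(theta2, theta3, a2):
    """a2 includes its bias unit; returns (a3 with bias prepended, a4)."""
    a3 = [1.0] + [sigmoid(z) for z in matvec(theta2, a2)]
    a4 = [sigmoid(z) for z in matvec(theta3, a3)]
    return a3, a4

def cost(theta2, theta3, a2, y):
    _, a4 = forward(theta2, theta3, a2)
    return sum(-(yk * math.log(hk) + (1 - yk) * math.log(1 - hk))
               for hk, yk in zip(a4, y))

def grad_theta2(theta2, theta3, a2, y):
    a3, a4 = forward(theta2, theta3, a2)
    delta4 = [hk - yk for hk, yk in zip(a4, y)]
    # (Theta3^T delta4) .* a3 .* (1 - a3), skipping the bias entry j = 0.
    delta3 = [sum(theta3[k][j] * delta4[k] for k in range(len(delta4)))
              * a3[j] * (1.0 - a3[j])
              for j in range(1, len(a3))]
    # Outer product with (a2)^T gives a matrix shaped like theta2.
    return [[dj * ai for ai in a2] for dj in delta3]

def numeric_grad(theta2, theta3, a2, y, j, i, eps=1e-6):
    plus = [row[:] for row in theta2]
    minus = [row[:] for row in theta2]
    plus[j][i] += eps
    minus[j][i] -= eps
    return (cost(plus, theta3, a2, y) - cost(minus, theta3, a2, y)) / (2 * eps)

# Hypothetical values for the small network described above.
a2 = [1.0, 0.3, 0.8]
y = [1, 0]
theta2 = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.5], [-0.3, 0.2, 0.6]]
theta3 = [[0.2, -0.4, 0.1, 0.5], [-0.1, 0.3, -0.2, 0.7]]

analytic = grad_theta2(theta2, theta3, a2, y)
max_err = max(abs(analytic[j][i] - numeric_grad(theta2, theta3, a2, y, j, i))
              for j in range(3) for i in range(3))
```

Row `j` of `theta2` feeds hidden unit `a3[j + 1]`, which is why `delta3` is built from index 1 onward.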
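Why compute the error $\delta$ for every layer? Because, after the lengthy chain of derivatives above, $\delta$ lets us compute the partial derivative of the cost function with respect to every parameter of every layer's weight matrix (for a single training example, without regularization, i.e. $\lambda = 0$): $\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta) = \delta_i^{(l+1)} a_j^{(l)}$, where every factor is a real number.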
Key points:
Error formula: $\delta^{(l)} = \left( \left(\Theta^{(l)}\right)^T \delta^{(l+1)} \right) \mathbin{.*} g'(z^{(l)})$, where $g'(z^{(l)}) = a^{(l)} \mathbin{.*} \left(1 - a^{(l)}\right)$.
Partial derivative (gradient) formula: $\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta) = \delta_i^{(l+1)} a_j^{(l)}$
Partial derivative (gradient) formula, matrix form: $\frac{\partial}{\partial \Theta^{(l)}} J(\Theta) = \delta^{(l+1)} \left(a^{(l)}\right)^T$
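Putting the key formulas together, backpropagation for one example could look roughly like the sketch below. It is an illustration under stated assumptions, not a reference implementation: the names `backprop`, `cost` and `numeric_grad`, the nested-list weight layout (bias weight in column 0) and the toy network sizes are all hypothetical, and regularization is ignored.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(thetas, x, y):
    a = list(x)
    for theta in thetas:
        a = [1.0] + a
        a = [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]
    return sum(-(yk * math.log(ak) + (1 - yk) * math.log(1 - ak))
               for ak, yk in zip(a, y))

def backprop(thetas, x, y):
    """Gradients of the single-example, unregularized cost for every weight matrix."""
    # Forward pass: keep each layer's activations with the bias unit prepended.
    activations = []
    a = list(x)
    for theta in thetas:
        a = [1.0] + a
        activations.append(a)
        a = [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]
    # Output layer error: delta = a - y.
    delta = [ak - yk for ak, yk in zip(a, y)]
    grads = [None] * len(thetas)
    for l in range(len(thetas) - 1, -1, -1):
        a_l = activations[l]
        # Gradient formula: delta^(l+1) (a^(l))^T.
        grads[l] = [[d * v for v in a_l] for d in delta]
        if l > 0:
            # Error formula: (Theta^T delta) .* a .* (1 - a), bias entry dropped.
            delta = [sum(thetas[l][k][j] * delta[k] for k in range(len(delta)))
                     * a_l[j] * (1.0 - a_l[j])
                     for j in range(1, len(a_l))]
    return grads

def numeric_grad(thetas, x, y, l, j, i, eps=1e-6):
    plus = [[row[:] for row in t] for t in thetas]
    minus = [[row[:] for row in t] for t in thetas]
    plus[l][j][i] += eps
    minus[l][j][i] -= eps
    return (cost(plus, x, y) - cost(minus, x, y)) / (2 * eps)

# Hypothetical 4-layer network: 2 inputs -> 3 -> 2 -> 2 outputs.
shapes = [(3, 3), (2, 4), (2, 3)]
thetas = [[[0.1 * ((r + 2 * c + l) % 5) - 0.2 for c in range(cols)]
           for r in range(rows)]
          for l, (rows, cols) in enumerate(shapes)]
x, y = [0.5, -1.0], [1, 0]

grads = backprop(thetas, x, y)
max_err = max(abs(grads[l][j][i] - numeric_grad(thetas, x, y, l, j, i))
              for l, (rows, cols) in enumerate(shapes)
              for j in range(rows) for i in range(cols))
```

Comparing against finite differences, as done with `max_err`, is a common way to catch indexing mistakes such as forgetting to drop the bias entry when propagating the error.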