Loss function $J(\theta)$
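$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y^{(i)}_k \ln\left(h_\theta(X^{(i)})_k\right) + \left(1 - y^{(i)}_k\right)\ln\left(1 - h_\theta(X^{(i)})_k\right) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l+1}}\sum_{j=1}^{s_l}\left(\theta^{(l)}_{i,j}\right)^2
$$
Here $m$ is the number of training samples, $K$ the number of outputs, $L$ the number of layers, and $s_l$ the number of units in layer $l$ (so $s_1$ is the input dimension and $s_L = K$).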
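The formula can be evaluated directly once the network outputs are known. The following is a minimal NumPy sketch, not a reference implementation: the function name `loss` and the array layout (one row per sample) are assumptions made for illustration, and `H` is assumed to already hold the values $h_\theta(X^{(i)})_k$.

```python
import numpy as np

def loss(H, Y, thetas, lam):
    """Regularized cross-entropy J(theta) (illustrative helper, not from the source).

    H:      (m, K) array, H[i, k] = h_theta(X^(i))_k, entries strictly in (0, 1)
    Y:      (m, K) array of 0/1 labels
    thetas: list of weight matrices theta^(1), ..., theta^(L-1)
    lam:    regularization strength lambda
    """
    m = Y.shape[0]
    # Data term: averaged cross-entropy over all samples and outputs.
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization term: squared entries of every weight matrix.
    reg_term = lam / (2 * m) * sum(np.sum(t ** 2) for t in thetas)
    return data_term + reg_term
```

Entries of `H` equal to exactly 0 or 1 would make the logarithm diverge; a practical implementation would likely clip them, which this sketch omits.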
Single-sample loss function $\operatorname{cost}(\theta; X, Y)$ when $\lambda = 0$
When $\lambda = 0$, the loss for a single sample
$$
X = \begin{pmatrix} x_1 \\ \vdots \\ x_{s_1} \end{pmatrix}, \quad Y = \begin{pmatrix} y_1 \\ \vdots \\ y_K \end{pmatrix}
$$
is
$$
\operatorname{cost}(\theta; X, Y) = -\sum_{k=1}^{K}\left[ y_k \ln\left(h_\theta(X)_k\right) + (1 - y_k)\ln\left(1 - h_\theta(X)_k\right) \right]
$$
Let
$$
a^{(1)} = X
$$
$$
Z^{(l+1)} = \theta^{(l)} a^{(l)}, \quad 1 \le l \le L-1
$$
$$
a^{(l)} = g(Z^{(l)}), \quad 2 \le l \le L,
$$
where $g$ is the Logistic function, applied element-wise.
Then $h_\theta(X) = a^{(L)}$, and therefore
$$
\operatorname{cost}(\theta; X, Y) = -\sum_{k=1}^{K}\left[ y_k \ln a^{(L)}_k + (1 - y_k)\ln\left(1 - a^{(L)}_k\right) \right]
$$
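The forward recursion and the single-sample cost can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the names `sigmoid`, `forward` and `cost` are hypothetical, `thetas[l-1]` stands for $\theta^{(l)}$, and, as in the text, no bias units are added.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    """Return the activations [a^(1), ..., a^(L)] for one sample x."""
    activations = [x]                     # a^(1) = X
    for theta in thetas:                  # theta^(1), ..., theta^(L-1)
        z = theta @ activations[-1]       # Z^(l+1) = theta^(l) a^(l)
        activations.append(sigmoid(z))    # a^(l+1) = g(Z^(l+1))
    return activations

def cost(thetas, x, y):
    """Single-sample cross-entropy cost with lambda = 0."""
    a_L = forward(thetas, x)[-1]          # a^(L) = h_theta(X)
    return -np.sum(y * np.log(a_L) + (1 - y) * np.log(1 - a_L))
```

With all weights set to zero every output equals $g(0) = 0.5$, so the cost of any 0/1 label vector with $K$ entries is $K \ln 2$, which makes a convenient sanity check.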
Consequently,
$$
J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\operatorname{cost}(\theta; X^{(i)}, Y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l+1}}\sum_{j=1}^{s_l}\left(\theta^{(l)}_{i,j}\right)^2
$$
Gradient of $\operatorname{cost}(\theta; X, Y)$ with respect to $Z^{(l)}$
Let
$$
\delta^{(l)} = \frac{\partial}{\partial Z^{(l)}}\operatorname{cost}(\theta; X, Y) = \begin{pmatrix} \dfrac{\partial}{\partial z^{(l)}_1}\operatorname{cost}(\theta; X, Y) \\ \vdots \\ \dfrac{\partial}{\partial z^{(l)}_{s_l}}\operatorname{cost}(\theta; X, Y) \end{pmatrix}, \quad 2 \le l \le L,
$$
Then
$$
\delta^{(l)} = \begin{cases} a^{(L)} - Y, & l = L, \\ \left(\theta^{(l)}\right)^{\top}\delta^{(l+1)} \mathbin{.\ast} a^{(l)} \mathbin{.\ast} \left(1 - a^{(l)}\right), & 2 \le l \le L-1, \end{cases}
$$
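where the operator $.\ast$ denotes the element-wise product, for example
$$
\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \mathbin{.\ast} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_1 y_1 \\ \vdots \\ x_n y_n \end{pmatrix}.
$$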
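The recursion for $\delta^{(l)}$ translates almost literally into code. The sketch below is one possible NumPy rendering, assuming a hypothetical function name `backward_deltas` and dictionaries keyed by the layer number $l$ so that the indices match the text; NumPy's `*` on arrays is the element-wise product written `.*` above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def backward_deltas(thetas, x, y):
    """Return (a, delta): dicts mapping layer number l to a^(l) and delta^(l).

    thetas[l - 1] plays the role of theta^(l); no bias units, as in the text.
    """
    L = len(thetas) + 1
    # Forward pass: a^(1) = X, a^(l+1) = g(theta^(l) a^(l)).
    a = {1: x}
    for l in range(1, L):
        a[l + 1] = sigmoid(thetas[l - 1] @ a[l])
    # Backward pass: delta^(L) = a^(L) - Y, then l = L-1, ..., 2.
    delta = {L: a[L] - y}
    for l in range(L - 1, 1, -1):
        delta[l] = (thetas[l - 1].T @ delta[l + 1]) * a[l] * (1 - a[l])
    return a, delta
```

Note that no $\delta^{(1)}$ is computed: the input layer has no pre-activation $Z^{(1)}$, so the recursion stops at $l = 2$.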
Proof
The claim is equivalent to the component-wise statement
$$
\delta^{(l)}_j = \begin{cases} a^{(L)}_j - y_j, & l = L, \\ \left[\sum_{i=1}^{s_{l+1}} \theta^{(l)}_{i,j}\,\delta^{(l+1)}_i\right] \cdot a^{(l)}_j\left(1 - a^{(l)}_j\right), & 2 \le l \le L-1, \end{cases} \quad 1 \le j \le s_l
$$
From
$$
\begin{cases} Z^{(l+1)} = \theta^{(l)} a^{(l)}, & 1 \le l \le L-1, \\ a^{(l)} = g(Z^{(l)}), & 2 \le l \le L, \end{cases}
$$
we obtain
$$
\begin{cases} \dfrac{\partial z^{(l+1)}_i}{\partial a^{(l)}_j} = \theta^{(l)}_{i,j}, & 1 \le l \le L-1, \\[2ex] \dfrac{d a^{(l)}_j}{d z^{(l)}_j} = g'\left(z^{(l)}_j\right) = a^{(l)}_j\left(1 - a^{(l)}_j\right), & 2 \le l \le L, \end{cases}
$$
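The identity $g'(z) = g(z)\left(1 - g(z)\right)$ used in the second line follows directly from the definition of the Logistic function; written out:
$$
g(z) = \frac{1}{1 + e^{-z}}, \qquad g'(z) = \frac{e^{-z}}{\left(1 + e^{-z}\right)^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = g(z)\left(1 - g(z)\right),
$$
since $\dfrac{e^{-z}}{1 + e^{-z}} = 1 - \dfrac{1}{1 + e^{-z}}$.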
Therefore, by the chain rule,
$$
\frac{\partial z^{(l+1)}_i}{\partial z^{(l)}_j} = \theta^{(l)}_{i,j}\, a^{(l)}_j\left(1 - a^{(l)}_j\right), \quad 2 \le l \le L-1,
$$
so, because the cost depends on $z^{(l)}_j$ only through $z^{(l+1)}_1, \dots, z^{(l+1)}_{s_{l+1}}$,
$$
\begin{aligned}
\delta^{(l)}_j &= \sum_{i=1}^{s_{l+1}} \delta^{(l+1)}_i \frac{\partial z^{(l+1)}_i}{\partial z^{(l)}_j} \\
&= \sum_{i=1}^{s_{l+1}} \delta^{(l+1)}_i\, \theta^{(l)}_{i,j}\, a^{(l)}_j\left(1 - a^{(l)}_j\right) \\
&= \left[\sum_{i=1}^{s_{l+1}} \theta^{(l)}_{i,j}\,\delta^{(l+1)}_i\right] \cdot a^{(l)}_j\left(1 - a^{(l)}_j\right), \quad 2 \le l \le L-1.
\end{aligned}
$$
For the output layer, since
$$
\begin{aligned}
\frac{\partial}{\partial a^{(L)}_k}\operatorname{cost}(\theta; X, Y) &= -\left[ y_k \frac{1}{a^{(L)}_k} - (1 - y_k)\frac{1}{1 - a^{(L)}_k} \right] \\
&= -\left(y_k - a^{(L)}_k\right)\frac{1}{a^{(L)}_k\left(1 - a^{(L)}_k\right)} \\
&= \left(a^{(L)}_k - y_k\right)\frac{1}{a^{(L)}_k\left(1 - a^{(L)}_k\right)}, \quad 1 \le k \le s_L = K,
\end{aligned}
$$
it follows that
$$
\begin{aligned}
\left(\delta^{(L)}\right)_j &= \frac{\partial}{\partial a^{(L)}_j}\operatorname{cost}(\theta; X, Y)\,\frac{d a^{(L)}_j}{d z^{(L)}_j} \\
&= \left(a^{(L)}_j - y_j\right)\frac{1}{a^{(L)}_j\left(1 - a^{(L)}_j\right)}\, a^{(L)}_j\left(1 - a^{(L)}_j\right) \\
&= a^{(L)}_j - y_j, \quad 1 \le j \le s_L.
\end{aligned}
$$
Hence the claim holds.
Gradient of $\operatorname{cost}(\theta; X, Y)$ with respect to $\theta$
$$
\frac{\partial}{\partial \theta^{(l)}_{i,j}}\operatorname{cost}(\theta; X, Y) = \delta^{(l+1)}_i\, a^{(l)}_j, \quad 1 \le l \le L-1
$$
Proof
From
$$
\frac{\partial z^{(l+1)}_i}{\partial \theta^{(l)}_{i,j}} = a^{(l)}_j, \quad 1 \le l \le L-1,
$$
we obtain
$$
\frac{\partial}{\partial \theta^{(l)}_{i,j}}\operatorname{cost}(\theta; X, Y) = \delta^{(l+1)}_i \frac{\partial z^{(l+1)}_i}{\partial \theta^{(l)}_{i,j}} = \delta^{(l+1)}_i\, a^{(l)}_j, \quad 1 \le l \le L-1
$$
Corollary
$$
\frac{\partial}{\partial \theta^{(l)}}\operatorname{cost}(\theta; X, Y) = \delta^{(l+1)}\left(a^{(l)}\right)^{\top}, \quad 1 \le l \le L-1
$$
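In matrix form the gradient for each layer is an outer product, which makes single-sample backpropagation very compact. The sketch below is illustrative only: `cost_and_grads` is a hypothetical name, `thetas[k]` stands for $\theta^{(k+1)}$, and the network has no bias units, matching the derivation.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(thetas, x, y):
    """Single-sample cost (lambda = 0) and its gradient w.r.t. each theta^(l)."""
    # Forward pass: acts[k] holds a^(k+1).
    acts = [x]
    for theta in thetas:
        acts.append(sigmoid(theta @ acts[-1]))
    a_L = acts[-1]
    cost = -np.sum(y * np.log(a_L) + (1 - y) * np.log(1 - a_L))

    # Backward pass, starting from delta^(L) = a^(L) - Y.
    delta = a_L - y
    grads = [None] * len(thetas)
    for k in range(len(thetas) - 1, -1, -1):
        # Here delta is delta^(k+2) and acts[k] is a^(k+1),
        # so this is delta^(l+1) (a^(l))^T with l = k + 1.
        grads[k] = np.outer(delta, acts[k])
        if k > 0:
            # Propagate to delta^(k+1); skipped for the input layer.
            delta = (thetas[k].T @ delta) * acts[k] * (1 - acts[k])
    return cost, grads
```

A common way to gain confidence in such code is to compare each entry of `grads` with a central finite difference of the cost; the two should agree to several decimal places.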
Gradient of the loss function $J(\theta)$ with respect to $\theta$
For every $t \in \mathbb{N}$ with $1 \le t \le m$, let
$$
a^{(t,1)} = X^{(t)},
$$
$$
Z^{(t,l+1)} = \theta^{(l)} a^{(t,l)}, \quad 1 \le l \le L-1,
$$
$$
a^{(t,l)} = g(Z^{(t,l)}), \quad 2 \le l \le L,
$$
Then
$$
a^{(t,L)} = h_\theta(X^{(t)})
$$
Let
$$
\delta^{(t,l)} = \frac{\partial}{\partial Z^{(t,l)}}\operatorname{cost}(\theta; X^{(t)}, Y^{(t)}) = \begin{pmatrix} \dfrac{\partial}{\partial z^{(t,l)}_1}\operatorname{cost}(\theta; X^{(t)}, Y^{(t)}) \\ \vdots \\ \dfrac{\partial}{\partial z^{(t,l)}_{s_l}}\operatorname{cost}(\theta; X^{(t)}, Y^{(t)}) \end{pmatrix}, \quad 2 \le l \le L,
$$
Then
$$
\delta^{(t,l)} = \begin{cases} a^{(t,L)} - Y^{(t)}, & l = L, \\ \left(\theta^{(l)}\right)^{\top}\delta^{(t,l+1)} \mathbin{.\ast} a^{(t,l)} \mathbin{.\ast} \left(1 - a^{(t,l)}\right), & 2 \le l \le L-1, \end{cases}
$$
and hence
$$
\frac{\partial}{\partial \theta^{(l)}_{i,j}}\operatorname{cost}(\theta; X^{(t)}, Y^{(t)}) = \delta^{(t,l+1)}_i\, a^{(t,l)}_j, \quad 1 \le l \le L-1
$$
Therefore
$$
\begin{aligned}
\frac{\partial}{\partial \theta^{(l)}_{i,j}} J(\theta) &= \frac{1}{m}\sum_{t=1}^{m}\frac{\partial}{\partial \theta^{(l)}_{i,j}}\operatorname{cost}(\theta; X^{(t)}, Y^{(t)}) + \frac{\lambda}{m}\theta^{(l)}_{i,j} \\
&= \frac{1}{m}\sum_{t=1}^{m}\delta^{(t,l+1)}_i\, a^{(t,l)}_j + \frac{\lambda}{m}\theta^{(l)}_{i,j}, \quad 1 \le l \le L-1
\end{aligned}
$$
Corollary
$$
\frac{\partial}{\partial \theta^{(l)}} J(\theta) = \frac{1}{m}\sum_{t=1}^{m}\delta^{(t,l+1)}\left(a^{(t,l)}\right)^{\top} + \frac{\lambda}{m}\theta^{(l)}, \quad 1 \le l \le L-1
$$
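The sum over samples can be carried out with matrix products by stacking the samples as rows. The following NumPy sketch is one possible vectorization under stated assumptions: `loss_and_grads` is a hypothetical name, `X` has shape $(m, s_1)$, `Y` has shape $(m, K)$, `thetas[k]` stands for $\theta^{(k+1)}$, and every weight is regularized because the derivation uses no bias units.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(thetas, X, Y, lam):
    """Regularized loss J(theta) and its gradient w.r.t. each theta^(l)."""
    m = X.shape[0]
    # Forward pass for all samples: row t of acts[k] is a^(t, k+1).
    acts = [X]
    for theta in thetas:
        acts.append(sigmoid(acts[-1] @ theta.T))
    H = acts[-1]
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    J += lam / (2 * m) * sum(np.sum(t ** 2) for t in thetas)

    # Backward pass: row t of delta is delta^(t, L) = a^(t, L) - Y^(t).
    delta = H - Y
    grads = [None] * len(thetas)
    for k in range(len(thetas) - 1, -1, -1):
        # delta.T @ acts[k] sums the outer products delta^(t,l+1) (a^(t,l))^T over t.
        grads[k] = delta.T @ acts[k] / m + lam / m * thetas[k]
        if k > 0:
            delta = (delta @ thetas[k]) * acts[k] * (1 - acts[k])
    return J, grads
```

The returned gradients can be passed to any gradient-based optimizer, for example a plain update `theta -= alpha * grad` with a step size `alpha` chosen by the user.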