Preface
Another blogger gives a more detailed derivation: https://blog.csdn.net/chaipp0607/article/details/101946040
1. Derivative of the Cross-Entropy Function
- softmax: let the final output for one sample be $[z_1,z_2,z_3,z_4,\dots,z_{10}]$, where the output layer has 10 neurons:
$$p_i=\frac{e^{z_i}}{\sum_{j=1}^{10} e^{z_j}}$$
- cross_entropy:
$$L=-\sum_{i=1}^{10} y_i \times \log(p_i)$$
- chain rule:
$$\frac{\partial L}{\partial z_i}=\sum_{j=1}^{10}\frac{\partial L}{\partial p_j}\frac{\partial p_j}{\partial z_i}$$
- taking the factors one at a time:
  - $\frac{\partial L}{\partial p_j}=-y_j\times\frac{1}{p_j}$
  - $\frac{\partial p_j}{\partial z_i}$ has two cases (the intermediate chain-rule steps are skipped here):
    - Case 1, $i \neq j$:
$$\frac{\partial p_j}{\partial z_i}=-p_i\times p_j,\qquad \frac{\partial L}{\partial p_j}\frac{\partial p_j}{\partial z_i}=p_i\times y_j$$
    - Case 2, $i = j$:
$$\frac{\partial p_j}{\partial z_i}=p_i\times (1-p_i),\qquad \frac{\partial L}{\partial p_j}\frac{\partial p_j}{\partial z_i}=(p_i-1)\times y_i$$
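The two cases above combine into the single Jacobian expression $\frac{\partial p_j}{\partial z_i}=p_i(\delta_{ij}-p_j)$. A quick numerical sketch (the 4-class toy vector is made up for illustration) checks the analytic Jacobian against central finite differences:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.2, 2.0, 0.3])
p = softmax(z)

# analytic Jacobian J[i, j] = dp_j/dz_i = p_i * (1[i==j] - p_j),
# which covers both the i != j and i == j cases above
J = np.diag(p) - np.outer(p, p)

# central finite-difference approximation of the same Jacobian
eps = 1e-6
J_num = np.empty_like(J)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    J_num[i] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-8))  # True
```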
Putting the two cases together:
$$\frac{\partial L}{\partial z_i}=\sum_{j=1}^{10}\frac{\partial L}{\partial p_j}\frac{\partial p_j}{\partial z_i}=\sum_{j\neq i}\frac{\partial L}{\partial p_j}\frac{\partial p_j}{\partial z_i}+(p_i-1)\times y_i=\sum_{j\neq i}p_i\times y_j+(p_i-1)\times y_i$$
Since $\sum_{j=1}^{10} y_j=1$ (the labels are one-hot), this simplifies to
$$\frac{\partial L}{\partial z_i}=p_i-y_i$$
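The closed form $\frac{\partial L}{\partial z_i}=p_i-y_i$ can be verified numerically. The sketch below (random logits and a one-hot label are made up for illustration) compares it with a finite-difference gradient of the loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    # L = -sum_i y_i * log(p_i), as defined above
    return -np.sum(y * np.log(softmax(z)))

z = np.random.default_rng(0).normal(size=10)
y = np.zeros(10)
y[3] = 1.0  # one-hot label over 10 classes

grad = softmax(z) - y  # the closed form: dL/dz_i = p_i - y_i

# central finite-difference check of each component
eps = 1e-6
grad_num = np.array([
    (cross_entropy(z + eps * np.eye(10)[i], y)
     - cross_entropy(z - eps * np.eye(10)[i], y)) / (2 * eps)
    for i in range(10)
])
print(np.allclose(grad, grad_num, atol=1e-8))  # True
```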
2. Z and y as Matrices of n Samples
$$z= \begin{bmatrix} z_{10},z_{11},\dots,z_{19}\\ \dots\\ z_{n0},z_{n1},\dots,z_{n9} \end{bmatrix}$$
$$y= \begin{bmatrix} y_{10},y_{11},\dots,y_{19}\\ \dots\\ y_{n0},y_{n1},\dots,y_{n9} \end{bmatrix}$$
$$\frac{\partial L}{\partial z}=\mathrm{softmax}(z)-y$$
(numpy's broadcasting mechanism is used here, for the row-wise softmax.)
Written out entry by entry, with $\mathrm{softmax}(z)_{ij}$ denoting the $j$-th component of the softmax of row $i$:
$$\frac{\partial L}{\partial z}=\begin{bmatrix} \mathrm{softmax}(z)_{10}-y_{10},\dots,\mathrm{softmax}(z)_{19}-y_{19}\\ \dots\\ \mathrm{softmax}(z)_{n0}-y_{n0},\dots,\mathrm{softmax}(z)_{n9}-y_{n9} \end{bmatrix}$$
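The batched matrix form above can be sketched directly in numpy (the sample count n=5 and random data are made up for illustration). The row max and row sum have shape (n, 1), and broadcasting expands them across the 10 columns:

```python
import numpy as np

def softmax_rows(z):
    # broadcasting: the (n, 1) row max / row sum expand across 10 columns
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n = 5
z = rng.normal(size=(n, 10))             # n samples, 10 classes
y = np.eye(10)[rng.integers(0, 10, n)]   # one-hot labels, one row per sample

# the matrix form derived above: entry (i, j) is softmax(z)_{ij} - y_{ij}
dL_dz = softmax_rows(z) - y
print(dL_dz.shape)  # (5, 10)
```

As a sanity check, each row of `dL_dz` sums to zero, since each softmax row sums to 1 and each one-hot label row also sums to 1.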