参考链接:https://blog.csdn.net/qian99/article/details/78046329
Cross-entropy
For a classification network $f$, the output is $z=f(x;\theta)$, $z=[z_{0},z_{1},\cdots,z_{C-1}]$, where $z$ is the logits vector, $C$ is the number of classes, and $y$ is the one-hot label of $x$. The probabilities are obtained by softmax normalization:
$$p_{i}=\frac{\exp{z_{i}}}{\sum_{j}\exp{z_{j}}}$$
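The softmax normalization above can be sketched in a few lines of NumPy; subtracting $\max(z)$ before exponentiating is a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(z):
    """p_i = exp(z_i) / sum_j exp(z_j).
    Subtracting max(z) avoids overflow without changing the output."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # example logits
p = softmax(z)
print(p, p.sum())  # probabilities are positive and sum to 1
```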
The cross-entropy loss is:
$$\mathcal{L}=-\sum_{i}y_{i}\log{p_{i}}$$
The gradient of the loss with respect to the probabilities is:
$$\frac{\partial \mathcal{L}}{\partial p_{i}}=-y_{i}\frac{1}{p_{i}}$$
Next, compute $\frac{\partial p_{i}}{\partial z_{k}}$ for $k=0,1,\dots,C-1$:
(1) When $k=i$:
$$\frac{\partial p_{i}}{\partial z_{i}}=\frac{\partial}{\partial z_{i}}\left(\frac{\exp{z_{i}}}{\sum_{j}\exp{z_{j}}}\right)=\frac{\exp{z_{i}}\sum_{j}\exp{z_{j}}-(\exp{z_{i}})^{2}}{(\sum_{j}\exp{z_{j}})^{2}}=\left(\frac{\exp{z_{i}}}{\sum_{j}\exp{z_{j}}}\right)\left(1-\frac{\exp{z_{i}}}{\sum_{j}\exp{z_{j}}}\right)=p_{i}(1-p_{i})$$
(2) When $k\neq i$:
$$\frac{\partial p_{i}}{\partial z_{k}}=\frac{\partial}{\partial z_{k}}\left(\frac{\exp{z_{i}}}{\sum_{j}\exp{z_{j}}}\right)=\frac{-\exp{z_{i}}\exp{z_{k}}}{(\sum_{j}\exp{z_{j}})^{2}}=-p_{i}p_{k}$$
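The two cases together give the softmax Jacobian $J = \operatorname{diag}(p) - pp^{\top}$, i.e. $\partial p_{i}/\partial z_{k} = p_{i}(\delta_{ik}-p_{k})$. A minimal NumPy sketch checking the analytic Jacobian against central finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.5, -1.2, 2.0])
p = softmax(z)

# Analytic Jacobian: dp_i/dz_k = p_i * (delta_ik - p_k)
J = np.diag(p) - np.outer(p, p)

# Central finite-difference approximation of the same Jacobian
eps = 1e-6
J_num = np.zeros((3, 3))
for k in range(3):
    dz = np.zeros(3)
    dz[k] = eps
    J_num[:, k] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(J - J_num)))  # should be very small
```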
By the chain rule:
$$\begin{aligned}\frac{\partial \mathcal{L}}{\partial z_{k}}&=\sum_{j}\frac{\partial \mathcal{L}}{\partial p_{j}}\frac{\partial p_{j}}{\partial z_{k}}=\sum_{j\neq k}\frac{\partial \mathcal{L}}{\partial p_{j}}\frac{\partial p_{j}}{\partial z_{k}}+\frac{\partial \mathcal{L}}{\partial p_{k}}\frac{\partial p_{k}}{\partial z_{k}}\\&=\sum_{j\neq k}\left(-y_{j}\frac{1}{p_{j}}\right)(-p_{j}p_{k})+\left(-y_{k}\frac{1}{p_{k}}\right)p_{k}(1-p_{k})\\&=\sum_{j\neq k}y_{j}p_{k}-y_{k}+y_{k}p_{k}=p_{k}\sum_{j}y_{j}-y_{k}\end{aligned}$$
Since $y$ is a one-hot encoding, $\sum_{j}y_{j}=1$, i.e.,
$$\frac{\partial \mathcal{L}}{\partial z_{k}}=p_{k}-y_{k}$$
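The closed form $\partial\mathcal{L}/\partial z_{k}=p_{k}-y_{k}$ can be verified numerically. A minimal sketch, assuming NumPy and the example logits/label below:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    # L = -sum_i y_i * log p_i, with p = softmax(z)
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, 2.0, -0.5])
y = np.array([0.0, 1.0, 0.0])  # one-hot label

grad_analytic = softmax(z) - y  # p_k - y_k

# Central finite differences on the loss
eps = 1e-6
grad_num = np.array([
    (cross_entropy(z + eps * np.eye(3)[k], y)
     - cross_entropy(z - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])
print(np.max(np.abs(grad_analytic - grad_num)))  # should be very small
```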
Relative entropy (KL divergence)
Let $p$ be the predicted probability distribution and $q$ the true distribution. The KL divergence is:
$$\mathcal{L}=\mathrm{KL}(q\,\|\,p)=\sum_{k}q_{k}\log{\frac{q_{k}}{p_{k}}}$$
The gradient with respect to the probability $p_{k}$ is:
$$\frac{\partial \mathcal{L}}{\partial p_{k}}=-\frac{q_{k}}{p_{k}}$$
The gradient with respect to the logit $z_{k}$ is:
$$\begin{aligned}\frac{\partial \mathcal{L}}{\partial z_{k}}&=\sum_{j}\frac{\partial \mathcal{L}}{\partial p_{j}}\frac{\partial p_{j}}{\partial z_{k}}=\sum_{j\neq k}\frac{\partial \mathcal{L}}{\partial p_{j}}\frac{\partial p_{j}}{\partial z_{k}}+\frac{\partial \mathcal{L}}{\partial p_{k}}\frac{\partial p_{k}}{\partial z_{k}}\\&=\sum_{j\neq k}\left(-\frac{q_{j}}{p_{j}}\right)(-p_{j}p_{k})+\left(-\frac{q_{k}}{p_{k}}\right)p_{k}(1-p_{k})\\&=\sum_{j\neq k}q_{j}p_{k}+q_{k}p_{k}-q_{k}=p_{k}\sum_{j}q_{j}-q_{k}\end{aligned}$$

Since $q$ is a normalized distribution, $\sum_{j}q_{j}=1$, so again $\frac{\partial \mathcal{L}}{\partial z_{k}}=p_{k}-q_{k}$.
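The KL gradient $p_{k}\sum_{j}q_{j}-q_{k}$ can be checked the same way as the cross-entropy case. A minimal sketch, assuming NumPy and an example soft (non-one-hot) target distribution $q$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def kl_loss(z, q):
    # KL(q || p) = sum_k q_k * log(q_k / p_k), with p = softmax(z)
    p = softmax(z)
    return np.sum(q * np.log(q / p))

z = np.array([0.3, -0.7, 1.5])
q = np.array([0.2, 0.5, 0.3])  # true distribution, sums to 1

# p_k * sum_j q_j - q_k; reduces to p - q since q is normalized
grad_analytic = softmax(z) * q.sum() - q

# Central finite differences on the loss
eps = 1e-6
grad_num = np.array([
    (kl_loss(z + eps * np.eye(3)[k], q)
     - kl_loss(z - eps * np.eye(3)[k], q)) / (2 * eps)
    for k in range(3)
])
print(np.max(np.abs(grad_analytic - grad_num)))  # should be very small
```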