The expansions below use summation notation, which applies to discrete distributions; for continuous distributions, replace the sums with integrals.
Entropy
Entropy, also known as Shannon entropy: the entropy of a distribution $P$ is denoted $H(P)$ and computed as:

$$H(P)=\mathbb{E}_{x\sim P(x)}\left[-\log P(x)\right]=\sum_{i=1}^n P(x_i)\log\frac{1}{P(x_i)}$$
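As a concrete check, the formula can be evaluated directly for a discrete distribution. This is a minimal NumPy sketch (the helper name `entropy` is our own, not from any particular library):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) = sum_i P(x_i) * log(1 / P(x_i)), in nats (natural log)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(1.0 / p)))

# A fair coin is maximally uncertain among two outcomes: H = log 2 ≈ 0.693 nats.
print(entropy([0.5, 0.5]))
# A heavily skewed coin is less uncertain, so its entropy is smaller.
print(entropy([0.9, 0.1]))
```

With base-2 logarithms the same quantity is measured in bits; natural logs are used throughout these sketches.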
Cross entropy
The cross entropy of two distributions $P$ and $Q$ is denoted $H(P,Q)$ and computed as:

$$H(P,Q)=\mathbb{E}_{x\sim P(x)}\left[-\log Q(x)\right]=\sum_{i=1}^n P(x_i)\log\frac{1}{Q(x_i)}$$
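Continuing the same style of sketch (NumPy, helper name our own), cross entropy is the expectation under $P$ of $-\log Q$:

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = sum_i P(x_i) * log(1 / Q(x_i)), natural log; assumes q > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(1.0 / q)))

p = [0.5, 0.5]
q = [0.9, 0.1]
# H(P, Q) >= H(P), with equality exactly when Q = P.
print(cross_entropy(p, q))
print(cross_entropy(p, p))  # equals H(P) = log 2
```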
KL divergence
KL divergence (Kullback–Leibler divergence), also known as relative entropy: the KL divergence of two distributions $P$ and $Q$ is denoted $D_{KL}(P\|Q)$ and computed as:

$$D_{KL}(P\|Q)=\mathbb{E}_{x\sim P(x)}\left[\log\frac{P(x)}{Q(x)}\right]=\sum_{i=1}^n P(x_i)\log\frac{P(x_i)}{Q(x_i)}$$
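A direct implementation (again a NumPy sketch with our own helper name) also makes it easy to see that KL divergence is not symmetric:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(x_i) * log(P(x_i) / Q(x_i)); assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # D_KL(P||Q)
print(kl_divergence(q, p))  # D_KL(Q||P): a different value, so KL is not a metric
print(kl_divergence(p, p))  # 0.0: the divergence is zero iff the distributions coincide
```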
From the formulas for entropy, cross entropy, and KL divergence, we obtain the relationship among the three:
$$
\begin{aligned}
D_{KL}(P\|Q) &= \sum_{i=1}^n P(x_i)\log\frac{P(x_i)}{Q(x_i)} \\
&= \sum_{i=1}^n P(x_i)\log\frac{1}{Q(x_i)}-\sum_{i=1}^n P(x_i)\log\frac{1}{P(x_i)} \\
&= H(P,Q)-H(P)
\end{aligned}
$$
Therefore, in machine-learning optimization problems, suppose the target distribution is $P$. If $P$ is fixed during optimization, i.e. $H(P)$ is constant, then minimizing $D_{KL}(P\|Q)$ and minimizing $H(P,Q)$ are equivalent, so we can use the more convenient cross entropy rather than the KL divergence as the loss function.
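The identity $D_{KL}(P\|Q)=H(P,Q)-H(P)$ is also easy to confirm numerically (a NumPy sketch; the helper names are our own):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(1.0 / p)))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(1.0 / q)))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two random distributions over 5 outcomes.
rng = np.random.default_rng(0)
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

# D_KL(P||Q) = H(P, Q) - H(P), up to floating-point error.
print(np.isclose(kl_divergence(p, q), cross_entropy(p, q) - entropy(p)))  # True
```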
JS divergence
The JS divergence (Jensen–Shannon divergence) of two distributions $P$ and $Q$ is denoted $JSD(P\|Q)$ and computed as:

$$JSD(P\|Q)=\frac 12 D_{KL}\!\left(P\,\middle\|\,\frac{P+Q}{2}\right)+\frac 12 D_{KL}\!\left(Q\,\middle\|\,\frac{P+Q}{2}\right)$$
For $n$ distributions $P_1, P_2, \dots, P_n$, the JS divergence is denoted $JSD_{\pi_1,\pi_2,\dots,\pi_n}(P_1,P_2,\dots,P_n)$, where $\pi_1,\pi_2,\dots,\pi_n$ are the weights assigned to the distributions $P_1,P_2,\dots,P_n$. It is computed as:

$$JSD_{\pi_1,\pi_2,\dots,\pi_n}(P_1,P_2,\dots,P_n)=H\!\left(\sum_{i=1}^n\pi_iP_i\right)-\sum_{i=1}^n\pi_iH(P_i)$$
In fact, the two-distribution JS divergence is the special case $n=2$ with $\pi_1=\pi_2=\frac12$, i.e.:

$$JSD(P\|Q)=H\!\left(\frac{P+Q}{2}\right)-\frac{H(P)+H(Q)}{2}$$
This is not hard to verify: substitute the KL divergence formula into the JS divergence formula, expand each KL divergence as a cross entropy minus an entropy, and then combine terms, as follows:
$$
\begin{aligned}
JSD(P\|Q) &= \frac 12 D_{KL}\!\left(P\,\middle\|\,\frac{P+Q}{2}\right)+\frac 12 D_{KL}\!\left(Q\,\middle\|\,\frac{P+Q}{2}\right) \\
&= \frac12\left[\sum_{i=1}^n P(x_i)\log\frac{1}{\frac{P(x_i)+Q(x_i)}{2}}-\sum_{i=1}^n P(x_i)\log\frac{1}{P(x_i)}\right]+\frac12\left[\sum_{i=1}^n Q(x_i)\log\frac{1}{\frac{P(x_i)+Q(x_i)}{2}}-\sum_{i=1}^n Q(x_i)\log\frac{1}{Q(x_i)}\right] \\
&= \frac12\left[\sum_{i=1}^n P(x_i)\log\frac{1}{\frac{P(x_i)+Q(x_i)}{2}}+\sum_{i=1}^n Q(x_i)\log\frac{1}{\frac{P(x_i)+Q(x_i)}{2}}\right]-\frac12\left[\sum_{i=1}^n P(x_i)\log\frac{1}{P(x_i)}+\sum_{i=1}^n Q(x_i)\log\frac{1}{Q(x_i)}\right] \\
&= \sum_{i=1}^n\frac{P(x_i)+Q(x_i)}{2}\log\frac{1}{\frac{P(x_i)+Q(x_i)}{2}}-\sum_{i=1}^n \frac{P(x_i)\log\frac{1}{P(x_i)}+Q(x_i)\log\frac{1}{Q(x_i)}}{2} \\
&= H\!\left(\frac{P+Q}{2}\right)-\frac{H(P)+H(Q)}{2}
\end{aligned}
$$
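The derivation can likewise be checked numerically (a NumPy sketch; helper names our own): the defining average-KL form and the entropy form give the same value, and the result is symmetric in $P$ and $Q$:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(1.0 / p)))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """JSD(P||Q) via its definition: the average KL to the mixture M = (P+Q)/2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

rng = np.random.default_rng(1)
p = rng.random(4); p /= p.sum()
q = rng.random(4); q /= q.sum()

# Entropy form derived above: H((P+Q)/2) - (H(P)+H(Q))/2.
via_entropy = entropy((p + q) / 2) - (entropy(p) + entropy(q)) / 2
print(np.isclose(jsd(p, q), via_entropy))  # True
print(np.isclose(jsd(p, q), jsd(q, p)))    # True: JSD is symmetric, unlike KL
```

Because both KL terms compare a distribution against the mixture $M$, which is strictly positive wherever $P$ or $Q$ is, the JSD is always finite even when $P$ and $Q$ have disjoint support.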
Mutual information
To be filled in later.