Entropy, Cross-Entropy, KL Divergence, JS Divergence, the Generalized JS Divergence Formula, and Mutual Information

The expansions below are written with summation signs, which applies to discrete distributions; for continuous distributions, replace the sums with integrals.

Entropy, also called Shannon entropy. The entropy of a distribution $P$ is denoted $H(P)$ and is computed as:

$$H(P)=\mathbb{E}_{x \sim P(x)}\left[-\log P(x)\right]=\sum_{i=1}^n P(x_i)\log\frac{1}{P(x_i)}$$
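As a minimal sketch (not part of the original post), the discrete sum can be evaluated directly with NumPy; the helper name `entropy`, the example distributions, and the use of natural logarithms are assumptions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) = sum_i P(x_i) * log(1 / P(x_i)), using the natural log."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                # terms with P(x_i) = 0 contribute nothing, by convention
    return float(np.sum(p * np.log(1.0 / p)))

print(entropy([0.5, 0.5]))      # ≈ 0.693 (= log 2, the maximum for two outcomes)
print(entropy([0.9, 0.1]))      # ≈ 0.325 (a more concentrated, lower-entropy distribution)
```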

Cross-Entropy

The cross-entropy of two distributions $P$ and $Q$ is denoted $H(P,Q)$ and is computed as:

$$H(P,Q)=\mathbb{E}_{x \sim P(x)}\left[-\log Q(x)\right]=\sum_{i=1}^n P(x_i)\log\frac{1}{Q(x_i)}$$
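A similarly hedged sketch of the sum above; the helper name `cross_entropy` and the example distributions are assumptions:

```python
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) = sum_i P(x_i) * log(1 / Q(x_i)), using the natural log."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                # terms with P(x_i) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(1.0 / q[mask])))

print(cross_entropy([0.5, 0.5], [0.9, 0.1]))  # ≈ 1.204
print(cross_entropy([0.5, 0.5], [0.5, 0.5]))  # ≈ 0.693, equals H(P) when Q = P
```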

KL Divergence

The KL divergence (Kullback–Leibler divergence), also called relative entropy, between two distributions $P$ and $Q$ is denoted $D_{KL}(P\|Q)$ and is computed as:

$$D_{KL}(P\|Q)=\mathbb{E}_{x \sim P(x)}\left[\log\frac{P(x)}{Q(x)}\right]=\sum_{i=1}^n P(x_i)\log\frac{P(x_i)}{Q(x_i)}$$
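A minimal sketch of the same sum; `kl_divergence` is a hypothetical helper name, and SciPy's `scipy.stats.entropy(p, q)` computes the same quantity if you prefer a library call:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(x_i) * log(P(x_i) / Q(x_i)), using the natural log."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                # terms with P(x_i) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ≈ 0.511
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ≈ 0.368 — not symmetric
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0 when P = Q
```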

From the formulas for entropy, cross-entropy, and KL divergence, we obtain the relationship among the three:
$$\begin{aligned}
D_{KL}(P\|Q) &= \sum_{i=1}^n P(x_i)\log\frac{P(x_i)}{Q(x_i)} \\
&= \sum_{i=1}^n P(x_i)\log\frac{1}{Q(x_i)} - \sum_{i=1}^n P(x_i)\log\frac{1}{P(x_i)} \\
&= H(P,Q) - H(P)
\end{aligned}$$

Therefore, in machine learning optimization problems, suppose the target distribution is $P$. If $P$ is fixed during optimization, i.e. $H(P)$ does not change, then minimizing $D_{KL}(P\|Q)$ and minimizing $H(P,Q)$ are equivalent, so we can use the more convenient cross-entropy rather than the KL divergence as the loss function.
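To make the equivalence concrete, here is a small numerical check, reusing the hypothetical `entropy`, `cross_entropy`, and `kl_divergence` helpers sketched above (the distributions are arbitrary):

```python
p = [0.7, 0.2, 0.1]   # fixed target distribution P
q = [0.5, 0.3, 0.2]   # model distribution Q being optimized

lhs = kl_divergence(p, q)
rhs = cross_entropy(p, q) - entropy(p)
print(lhs, rhs)                  # both ≈ 0.085
assert abs(lhs - rhs) < 1e-12    # D_KL(P||Q) = H(P,Q) - H(P)
```

Since $H(P)$ is the same for every candidate $Q$, the $Q$ that minimizes the cross-entropy also minimizes the KL divergence.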

JS Divergence

The JS divergence (Jensen–Shannon divergence) of two distributions $P$ and $Q$ is denoted $JSD(P\|Q)$ and is computed as:

$$JSD(P\|Q)=\frac{1}{2}D_{KL}\left(P\,\Big\|\,\frac{P+Q}{2}\right)+\frac{1}{2}D_{KL}\left(Q\,\Big\|\,\frac{P+Q}{2}\right)$$
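A direct sketch of this definition, reusing the hypothetical `kl_divergence` helper from above (for reference, SciPy's `scipy.spatial.distance.jensenshannon` returns the square root of this divergence, i.e. the JS distance):

```python
import numpy as np

def jsd(p, q):
    """JSD(P || Q) = 1/2 * D_KL(P || M) + 1/2 * D_KL(Q || M), where M = (P + Q) / 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

print(jsd([0.5, 0.5], [0.9, 0.1]))  # ≈ 0.102
print(jsd([0.9, 0.1], [0.5, 0.5]))  # same value — JSD is symmetric, unlike KL
```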

For $n$ distributions $P_1, P_2, P_3, \ldots, P_n$, the JS divergence is denoted $JSD_{\pi_1, \pi_2, \pi_3, \ldots, \pi_n}(P_1, P_2, P_3, \ldots, P_n)$, where $\pi_1, \pi_2, \pi_3, \ldots, \pi_n$ are the weights assigned to the distributions $P_1, P_2, P_3, \ldots, P_n$ respectively. It is computed as:

$$JSD_{\pi_1, \pi_2, \pi_3, \ldots, \pi_n}(P_1, P_2, P_3, \ldots, P_n)=H\left(\sum_{i=1}^n\pi_i P_i\right)-\sum_{i=1}^n\pi_i H(P_i)$$
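A sketch of this weighted form, reusing the hypothetical `entropy` helper sketched earlier; the function name `generalized_jsd` and the example weights are assumptions:

```python
import numpy as np

def generalized_jsd(dists, weights):
    """JSD_pi(P_1, ..., P_n) = H(sum_i pi_i * P_i) - sum_i pi_i * H(P_i)."""
    dists = [np.asarray(p, dtype=float) for p in dists]
    weights = np.asarray(weights, dtype=float)   # assumed non-negative and summing to 1
    mixture = sum(w * p for w, p in zip(weights, dists))
    return entropy(mixture) - sum(w * entropy(p) for w, p in zip(weights, dists))

dists = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
print(generalized_jsd(dists, [1/3, 1/3, 1/3]))   # entropy of the mixture minus the mean entropy
```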

In fact, the JS divergence of two distributions corresponds to the case $n=2$ with $\pi_1=\pi_2=\frac{1}{2}$, i.e.:

$$JSD(P\|Q)=H\left(\frac{P+Q}{2}\right)-\frac{H(P)+H(Q)}{2}$$

This identity is not hard to verify: substitute the KL divergence formula into the JS divergence formula, expand each KL divergence as a cross-entropy minus an entropy, and then combine terms, as follows:
$$\begin{aligned}
JSD(P\|Q) &= \frac{1}{2}D_{KL}\left(P\,\Big\|\,\frac{P+Q}{2}\right) + \frac{1}{2}D_{KL}\left(Q\,\Big\|\,\frac{P+Q}{2}\right) \\
&= \frac{1}{2}\left[\sum_{i=1}^n P(x_i)\log\frac{1}{\frac{P(x_i)+Q(x_i)}{2}} - \sum_{i=1}^n P(x_i)\log\frac{1}{P(x_i)}\right] + \frac{1}{2}\left[\sum_{i=1}^n Q(x_i)\log\frac{1}{\frac{P(x_i)+Q(x_i)}{2}} - \sum_{i=1}^n Q(x_i)\log\frac{1}{Q(x_i)}\right] \\
&= \frac{1}{2}\left[\sum_{i=1}^n P(x_i)\log\frac{1}{\frac{P(x_i)+Q(x_i)}{2}} + \sum_{i=1}^n Q(x_i)\log\frac{1}{\frac{P(x_i)+Q(x_i)}{2}}\right] - \frac{1}{2}\left[\sum_{i=1}^n P(x_i)\log\frac{1}{P(x_i)} + \sum_{i=1}^n Q(x_i)\log\frac{1}{Q(x_i)}\right] \\
&= \sum_{i=1}^n \frac{P(x_i)+Q(x_i)}{2}\log\frac{1}{\frac{P(x_i)+Q(x_i)}{2}} - \sum_{i=1}^n \frac{P(x_i)\log\frac{1}{P(x_i)} + Q(x_i)\log\frac{1}{Q(x_i)}}{2} \\
&= H\left(\frac{P+Q}{2}\right) - \frac{H(P)+H(Q)}{2}
\end{aligned}$$
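As a quick numerical sanity check of the identity, reusing the hypothetical `jsd` and `entropy` helpers sketched above (the two distributions are arbitrary):

```python
import numpy as np

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

lhs = jsd(p, q)                                                 # definition via the two KL terms
rhs = entropy(0.5 * (p + q)) - 0.5 * (entropy(p) + entropy(q))  # H((P+Q)/2) - (H(P)+H(Q))/2
print(lhs, rhs)                  # both ≈ 0.102
assert abs(lhs - rhs) < 1e-12
```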

Mutual Information

To be filled in later.
