Mutual Information: Principles, Computation, and Applications
Background
Entropy
In information theory, the entropy of a distribution is the expected code length under the optimal encoding for that distribution[1]:
$$\begin{aligned} H(x)&=-\mathbb{E}_x[\log{p(x)}]\\ &=-\sum_x{p(x)\log{p(x)}}\\ &=-\int{p(x)\log{p(x)}} \end{aligned}$$
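As a quick sanity check of the discrete form, a minimal sketch in numpy (the example distributions are made up for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(x) = -sum_x p(x) log p(x), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log(p)))

print(entropy([0.5, 0.5]))       # fair coin: log 2 ≈ 0.693 nats
print(entropy([1.0, 0.0]))       # deterministic outcome: 0, nothing to encode
print(entropy([0.25] * 4))       # uniform over 4 outcomes: log 4
```

The uniform distribution maximizes entropy over a fixed number of outcomes, which matches the intuition that it is the hardest to compress.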
Cross Entropy
From an information-theoretic perspective, cross entropy is the expected code length when events drawn from a new distribution $p(x)$ are encoded with a code designed for $q(x)$:
$$\begin{aligned} H(p,q)&=-\mathbb{E}_{x\sim p(x)}[\log{q(x)}]\\ &=-\sum_x{p(x)\log{q(x)}}\\ &=-\int_x{p(x)\log{q(x)}} \end{aligned}$$
Clearly, since $p(x)$ is being encoded with a code designed for a different distribution, $H(p,p)\leq H(p,q)$, with equality only when $p=q$. The extra cost incurred by using the mismatched code is defined as the KL-divergence, also called the relative entropy:
$$D_{KL}=H(p,q)-H(p)$$
Conditional Entropy
Conditional entropy is the entropy of a variable $X$ given knowledge of another variable $Z$:
$$\begin{aligned} H(X|Z)&=-\mathbb{E}_{x,z}[\log{p(x|z)}]\\ &=-\sum_{x,z}{p(x,z)\log{p(x|z)}}\\ &=-\int_{x,z}{p(x,z)\log{p(x|z)}} \end{aligned}$$
Clearly $H(X)\geq H(X|Z)$: conditioning never increases entropy.
KL-divergence
$$\begin{aligned} D_{KL}{(p(x)||q(x))} &= H(p,q)-H(p)\\ &=\sum_x{p(x)\log{\frac{p(x)}{q(x)}}}\\ &=\int{p(x)\log{\frac{p(x)}{q(x)}}} \end{aligned}$$
The KL-divergence can be viewed as a measure of discrepancy between two probability distributions. Note that it is not symmetric, and that $D_{KL}(p(x)||q(x))\geq 0$[4] (Appendix A).
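Both properties are easy to check numerically; a small sketch with made-up discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p||q) = sum_x p(x) log(p(x)/q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                 # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(kl_divergence(p, q))   # > 0
print(kl_divergence(q, p))   # a different value: D_KL is not symmetric
print(kl_divergence(p, p))   # 0: the divergence vanishes when q = p
```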
Definition
Mutual information quantifies the dependence between two random variables $X$ and $Z$[2]:
$$\begin{aligned} I(X;Z)&=D_{KL}(\mathbb{P}_{XZ}||\mathbb{P}_X\otimes\mathbb{P}_Z)\\ &=\int_{\mathcal{X}\times\mathcal{Z}}{\log{\frac{d\mathbb{P}_{XZ}}{d\mathbb{P}_X\otimes\mathbb{P}_Z}}\,d\mathbb{P}_{XZ}} \end{aligned}$$
where $\mathbb{P}_{XZ}$ is the joint distribution and $\mathbb{P}_X=\int_{\mathcal{Z}}{d\mathbb{P}_{XZ}}$, $\mathbb{P}_Z=\int_{\mathcal{X}}{d\mathbb{P}_{XZ}}$ are the marginal distributions. In other words, mutual information measures how far the joint distribution is from the product of its marginals.
Mutual information can also be read as the coding resources saved in representing the variable $X$ once $Z$ is known:
$$I(X;Z)=H(X)-H(X|Z)$$
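For discrete variables, both the KL form of the definition and the entropy-difference form can be evaluated directly from a joint probability table. The sketch below (joint table made up for illustration) verifies that the two forms agree:

```python
import numpy as np

# Hypothetical joint distribution p(x, z) over 2 x 3 outcomes.
p_xz = np.array([[0.20, 0.15, 0.05],
                 [0.10, 0.05, 0.45]])
p_x = p_xz.sum(axis=1)                    # marginal of X
p_z = p_xz.sum(axis=0)                    # marginal of Z

# I(X;Z) as D_KL between the joint and the product of marginals.
mi_kl = float(np.sum(p_xz * np.log(p_xz / np.outer(p_x, p_z))))

# I(X;Z) as H(X) - H(X|Z), using p(x|z) = p(x,z) / p(z).
h_x = float(-np.sum(p_x * np.log(p_x)))
h_x_given_z = float(-np.sum(p_xz * np.log(p_xz / p_z)))
mi_ent = h_x - mi_kl * 0 - h_x_given_z    # = H(X) - H(X|Z)

print(mi_kl, mi_ent)   # the two forms coincide
```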
Unlike the correlation coefficient, mutual information also captures nonlinear relationships. In general, however, it is hard to compute, because the underlying distributions of $X$ and $Z$ are rarely available.
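A small illustration of the first point: take $X$ uniform on $\{-1,0,1\}$ and $Z=X^2$. The covariance is exactly zero, yet the mutual information is positive (a sketch; the variables are chosen purely for illustration):

```python
import numpy as np

# X uniform on {-1, 0, 1}; Z = X^2 is a deterministic function of X.
x_vals = np.array([-1.0, 0.0, 1.0])
p_x = np.array([1/3, 1/3, 1/3])
z_vals = x_vals ** 2

# Covariance E[XZ] - E[X]E[Z] vanishes by symmetry.
cov = np.sum(p_x * x_vals * z_vals) - np.sum(p_x * x_vals) * np.sum(p_x * z_vals)

# I(X;Z) from the joint table: pairs (-1,1), (0,0), (1,1), mass 1/3 each.
p_z = {0.0: 1/3, 1.0: 2/3}
pairs = [(-1.0, 1.0), (0.0, 0.0), (1.0, 1.0)]
mi = sum((1/3) * np.log((1/3) / ((1/3) * p_z[z])) for _, z in pairs)

print(cov)   # 0.0: linear correlation misses the dependence
print(mi)    # ≈ 0.637 nats > 0: mutual information detects it
```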
Estimation Methods
Variational approach[3]
Since $D_{KL}(p(x|y)||q(x|y))\geq 0$, we have
$$\sum_x{p(x|y)\log{p(x|y)}-p(x|y)\log{q(x|y)}}\geq0$$
It follows that
$$\begin{aligned} I(x,y)&=H(x)-H(x|y)\\ &=H(x)+\sum_y{p(y)\sum_x{p(x|y)\log{p(x|y)}}}\\ &\geq H(x)+\sum_y{p(y)\sum_x{p(x|y)\log{q(x|y)}}}\\ &=H(x)+\mathbb{E}_{x,y\sim p(x,y)}[\log{q(x|y)}]\\ &\overset{\rm{def}}{=}\tilde{I}(x,y) \end{aligned}$$
By continually tightening the bound $\tilde{I}(x,y)$, we obtain an estimate of the mutual information. Suppose the model being fitted is $p(y|x,\theta)$; analogous to the EM algorithm, the IM algorithm alternates the following two steps:
- Fix $q(x|y)$ and set $\theta^{new}=\arg\max_{\theta}\tilde{I}(x,y)$
- Fix $\theta$ and set $q^{new}=\arg\max_{q(x|y)\in Q}\tilde{I}(x,y)$

where $Q$ is a restricted family of distributions chosen to make $\mathbb{E}_{x,y\sim p(x,y)}[\log{q(x|y)}]$ tractable, usually the Gaussian family.
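For jointly Gaussian variables the optimal $q(x|y)$ is itself Gaussian and the bound is tight, which allows a Monte-Carlo sanity check. A sketch (assuming standard normal marginals with correlation rho; $H(x)$ and the true MI are known in closed form here):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 200_000

# Sample (x, y) jointly Gaussian, both standard normal, correlation rho.
y = rng.standard_normal(n)
x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal(n)

h_x = 0.5 * np.log(2 * np.pi * np.e)   # entropy of a standard normal, in nats

# Variational family member: q(x|y) = N(rho * y, 1 - rho^2), the optimal choice.
var = 1 - rho**2
log_q = -0.5 * np.log(2 * np.pi * var) - (x - rho * y) ** 2 / (2 * var)

i_tilde = h_x + log_q.mean()            # the lower bound \tilde{I}(x, y)
i_true = -0.5 * np.log(1 - rho**2)      # analytic MI of this Gaussian pair
print(i_tilde, i_true)                  # nearly equal: the bound is tight here
```

With a suboptimal $q$ (wrong mean or variance), `i_tilde` drops below `i_true`, which is exactly what the alternating maximization over $q$ corrects.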
Mutual Information Neural Estimation, MINE[5]
According to the Donsker-Varadhan representation of the KL-divergence,
$$D_{KL}(\mathbb{P}||\mathbb{Q})=\sup_{T:\Omega\rightarrow\mathbb{R}}{\mathbb{E}_{\mathbb{P}}[T]-\log{\mathbb{E}_{\mathbb{Q}}[e^T]}}$$
Continually pushing up this lower bound yields an estimate of the KL-divergence (Appendix B). In MINE the function $T$ is parameterized by a neural network $T_{\theta}$, where $\theta$ are the network parameters, and the bound is maximized over $\theta$ by gradient ascent.
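For discrete distributions the Donsker-Varadhan representation can be checked directly: any function $T$ gives a lower bound, and $T^{\ast}=\log(p/q)+C$ attains it (a sketch with made-up distributions):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
kl = float(np.sum(p * np.log(p / q)))

def dv_bound(t):
    """E_P[T] - log E_Q[e^T], with T given by its values on the sample space."""
    return float(np.sum(p * t) - np.log(np.sum(q * np.exp(t))))

# Any T yields a lower bound on D_KL(P||Q) ...
rng = np.random.default_rng(1)
for _ in range(5):
    assert dv_bound(rng.standard_normal(3)) <= kl + 1e-12

# ... and T* = log(p/q) + C attains it, for any constant C.
t_star = np.log(p / q) + 3.0
print(dv_bound(t_star), kl)   # equal: the supremum is reached at T*
```

MINE replaces the exhaustive choice of $T$ with a neural network and the exact expectations with minibatch averages.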
DEEP INFOMAX[6]
- Optimize the parameters $\omega$ of the MI-estimation network and the parameters $\psi$ of the encoder jointly:
$$(\hat{\omega},\hat{\psi})=\arg\max_{\omega,\psi}\hat{I}_{\omega}(X,E_{\psi}(X))$$
Because $\omega$ and $\psi$ are optimized together, the encoder $E$ and the MI-estimation network $T$ can share layers: $E_{\psi}=f_{\psi}\circ C_{\psi}$ and $T_{\omega,\psi}=D_{\omega}\circ g\circ(C_{\psi},E_{\psi})$.
- The mutual information can also be estimated with non-KL divergences, e.g. the Jensen-Shannon estimator
$$\hat{I}^{(\rm JSD)}_{\omega,\psi}:=\mathbb{E}_{\mathbb{P}}[-{\rm sp}(-T_{\omega,\psi}(x,E_{\psi}(x)))]-\mathbb{E}_{\mathbb{P}\times\tilde{\mathbb{P}}}[{\rm sp}(T_{\omega,\psi}(x^{\prime},E_{\psi}(x)))]$$
where ${\rm sp}(z)=\log(1+e^z)$, $x\sim\mathbb{P}$, $x^{\prime}\sim\tilde{\mathbb{P}}$, or the Noise-Contrastive Estimation (InfoNCE) objective
$$\hat{I}^{({\rm infoNCE})}_{\omega,\psi}:=\mathbb{E}_{\mathbb{P}}\left[T_{\omega,\psi}(x,E_{\psi}(x))-\mathbb{E}_{\tilde{\mathbb{P}}}\left[\log{\sum_{x^{\prime}}e^{T_{\omega,\psi}(x^{\prime},E_{\psi}(x))}}\right]\right]$$
Applications
Transfer Learning
The student is trained not only on the teacher network's logits and the labels, but also on the mutual information between intermediate layers[7]. By the variational approach,
$$\begin{aligned} I(t;s)&=H(t)-H(t|s)\\ &\geq H(t)+\mathbb{E}_{t,s}[\log{q(t|s)}] \end{aligned}$$
The model's loss function then becomes
$$\tilde{\mathcal{L}}=\mathcal{L}_S-\sum_{k=1}^{K}\lambda_k\,\mathbb{E}_{t^{(k)},s^{(k)}}[\log{q(t^{(k)}|s^{(k)})}]$$
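If $q(t|s)$ is taken to be a Gaussian whose mean is predicted from the student feature, each added term is a Gaussian negative log-likelihood. A minimal sketch of one such term (function name and shapes are hypothetical, not the paper's code):

```python
import numpy as np

def vid_loss(t, mu, sigma2):
    """-E[log q(t|s)] for q(t|s) = N(mu(s), sigma2), averaged over a batch.

    t      : teacher features, shape (batch, dim)   -- hypothetical shapes
    mu     : student's prediction of t, shape (batch, dim)
    sigma2 : per-dimension variance, shape (dim,)
    """
    nll = 0.5 * (np.log(2 * np.pi * sigma2) + (t - mu) ** 2 / sigma2)
    return float(nll.sum(axis=1).mean())

rng = np.random.default_rng(0)
t = rng.standard_normal((4, 8))
good = t + 0.1 * rng.standard_normal((4, 8))          # accurate prediction
print(vid_loss(t, good, np.full(8, 0.5)))             # lower loss
print(vid_loss(t, np.zeros((4, 8)), np.full(8, 0.5))) # worse prediction, higher loss
```

Learning `sigma2` jointly with the student lets the objective down-weight teacher dimensions the student cannot predict.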
Reinforcement Learning + Transfer Learning[8]
One obstacle to transfer in reinforcement learning is that the action and state spaces of different tasks may not match. Mutual information can be used to map between the spaces of the two tasks, enabling transfer of reinforcement-learning policies.
The final loss function consists of three parts:
$$\begin{aligned} \mathcal{L}_{\rm coupling}&=-\frac{1}{N_{\pi}}\sum_{j=1}^{N_{\pi}}{\log{p_{\theta}^j}}-\frac{1}{N_V}\sum_{j=1}^{N_V}{\log{p_{\psi}^j}}\\ \mathcal{L}_{\rm PPO}&=\mathcal{L}_{\rm PPO}^{\theta}+\mathcal{L}_{\rm PPO}^{\psi}\\ \mathcal{L}_{{\rm MI}(\phi,\omega)}&=-\mathbb{E}_{s\sim \rho_{\pi_{\theta}}}[\log{q_{\omega}(s|\phi(s))}] \end{aligned}$$
Self-Supervised Learning[6]
Using the mutual information between positive and negative samples together with the spatial structure of images, the network learns deep representations of images without supervision.
References
1. http://colah.github.io/posts/2015-09-Visual-Information/
2. Mutual Information Neural Estimation. https://arxiv.org/pdf/1801.04062.pdf
3. The IM Algorithm: A variational approach to Information Maximization. http://aivalley.com/Papers/MI_NIPS_final.pdf
4. https://zhuanlan.zhihu.com/p/39682125
5. Mutual Information Neural Estimation. https://arxiv.org/pdf/1801.04062.pdf
6. Learning Deep Representations by Mutual Information Estimation and Maximization. https://arxiv.org/pdf/1808.06670.pdf
7. Variational Information Distillation for Knowledge Transfer. https://openaccess.thecvf.com/content_CVPR_2019/papers/Ahn_Variational_Information_Distillation_for_Knowledge_Transfer_CVPR_2019_paper.pdf
8. Mutual Information Based Knowledge Transfer Under State-Action Dimension Mismatch. https://arxiv.org/pdf/2006.07041.pdf
Appendix
A.
$$D_{KL}(p(x)||q(x))\geq 0$$
Let $f(x)\geq 0$ with $\int_x{f(x)}=1$, let $g$ be any measurable real-valued function, and let $\varphi$ be convex. Jensen's inequality states
$$\varphi\left(\int_x{g(x)f(x)}\right)\leq\int_x{\varphi(g(x))f(x)}$$
Note that $-\ln x$ is strictly convex, and that $q(x)\geq 0$ with $\int_x{q(x)}=1$. Setting $\varphi(x)=-\ln x$, $g(x)=\frac{q(x)}{p(x)}$, and $f(x)=p(x)$ gives
$$D_{KL}(p||q)=\int_x{p(x)}\left[-\ln{\frac{q(x)}{p(x)}}\right]\geq-\ln{\int_x{q(x)}}=0$$
B.
$$D_{KL}(\mathbb{P}||\mathbb{Q})=\sup_{T:\Omega\rightarrow\mathbb{R}}{\mathbb{E}_{\mathbb{P}}[T]-\log{\mathbb{E}_{\mathbb{Q}}[e^T]}}$$
A simple proof goes as follows. For a given function $T$, consider the Gibbs distribution $\mathbb{G}$ defined by $d\mathbb{G}=\frac{1}{Z}e^T\,d\mathbb{Q}$, where $Z=\mathbb{E}_{\mathbb{Q}}[e^T]$. By construction,
$$\mathbb{E}_{\mathbb{P}}[T]-\log{Z}=\mathbb{E}_{\mathbb{P}}\left[\log{\frac{d\mathbb{G}}{d\mathbb{Q}}}\right]\qquad(1)$$
Let $\Delta$ be the gap,
$$\Delta:=D_{KL}(\mathbb{P}||\mathbb{Q})-\left(\mathbb{E}_{\mathbb{P}}[T]-\log{\mathbb{E}_{\mathbb{Q}}[e^T]}\right)\qquad(2)$$
Using (1), we can write $\Delta$ as a KL-divergence:
$$\Delta=\mathbb{E}_{\mathbb{P}}\left[\log{\frac{d\mathbb{P}}{d\mathbb{Q}}}-\log{\frac{d\mathbb{G}}{d\mathbb{Q}}}\right]=\mathbb{E}_{\mathbb{P}}\left[\log{\frac{d\mathbb{P}}{d\mathbb{G}}}\right]=D_{KL}(\mathbb{P}||\mathbb{G})\qquad(3)$$
The positivity of the KL-divergence gives $\Delta\geq 0$. We have thus shown that for any $T$,
$$D_{KL}(\mathbb{P}||\mathbb{Q})\geq\mathbb{E}_{\mathbb{P}}[T]-\log{\mathbb{E}_{\mathbb{Q}}[e^T]}$$
and the inequality is preserved upon taking the supremum of the right-hand side. Finally, the identity (3) also shows that the bound is tight whenever $\mathbb{G}=\mathbb{P}$, namely for optimal functions $T^{\ast}$ of the form $T^{\ast}=\log{\frac{d\mathbb{P}}{d\mathbb{Q}}}+C$ for some constant $C\in\mathbb{R}$.