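Definition of KL divergence:
In probability theory and information theory, the KL divergence (Kullback-Leibler divergence), also called relative entropy, is a way of describing the difference between two probability distributions P and Q. The KL divergence is asymmetric: in general D(P || Q) ≠ D(Q || P). In information theory, D(P || Q) represents the information lost when the distribution Q is used to approximate the true distribution P, where P is the true distribution and Q is the approximation of P.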
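Some people call the KL divergence the "KL distance", but it does not actually satisfy the definition of a distance, because:
1) the KL divergence is asymmetric
2) the KL divergence does not satisfy the triangle inequality
For two probability distributions P and Q of a discrete or a continuous random variable, the KL divergence is defined as follows: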
- Discrete random variable
$$D(P||Q)=\sum_{i\in X}P(i)\log\left(\frac{P(i)}{Q(i)}\right)$$
- Continuous random variable
$$D(P||Q)=\int_{x}P(x)\log\left(\frac{P(x)}{Q(x)}\right)dx$$
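As a small illustration of the discrete definition and of the asymmetry noted earlier, here is a minimal sketch in plain Python. The function name `kl_divergence` and the two example distributions are made up for this illustration; the natural logarithm is used, so the values are in nats.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) for two discrete distributions given as equal-length lists.

    Uses the natural logarithm, so the result is in nats.
    Terms with p_i == 0 contribute 0 (the usual convention 0 * log 0 = 0).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total

# Illustrative distributions (assumed values, not from the text).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print("D(P||Q) =", d_pq)
print("D(Q||P) =", d_qp)  # differs from D(P||Q): the divergence is asymmetric
```

Swapping the arguments changes the result, which is one reason the KL divergence is not a true distance.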
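Relationship between KL divergence and information entropy
The physical meaning of the KL divergence in information theory:
- Information content
Shannon, the founder of information theory, held that "information is that which eliminates random uncertainty"; in other words, the amount of information in a message is measured by how much uncertainty it removes.
The information content of an event decreases as its probability increases: the more probable the event, the less information it carries, and the less probable, the more.
If an event occurs with probability P(x), its information content is:
$$I(x)=-\log(P(x))=\log\left(\frac{1}{P(x)}\right)$$
where $I(x)$ denotes the information content and $\log$ is the natural logarithm (base e), which measures information in nats; taking base 2 instead measures it in bits.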
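- In information theory the KL divergence has a clear physical meaning: it measures the average number of extra bits needed to encode samples from the distribution P using a code based on the distribution Q. In machine learning it is used to measure how similar or how close two functions are, and it also appears frequently in functional analysis.
- In the formulas below, the ${\color{green}\text{green}}$ and ${\color{red}\text{red}}$ parts denote information content.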
In Shannon's information theory, when samples from P are encoded with a code based on P, the average number of bits required by the optimal code (that is, the entropy of this character set) is:
$$H(x)=\sum_{x\in X}{\color{blue}\underbrace{P(x)}_{\text{frequency of each character in }P}}\cdot{\color{green}\underbrace{\log\left(\frac{1}{P(x)}\right)}_{\text{code length of this character under }P}}$$
When samples from P are instead encoded with a code based on Q, the number of bits required becomes:
$$H^{\prime}(x)=\sum_{x\in X}{\color{blue}\underbrace{P(x)}_{\text{frequency of each character in }P}}\cdot{\color{red}\underbrace{\log\left(\frac{1}{Q(x)}\right)}_{\text{code length is now set by }Q\text{, independent of }P}}$$
The KL divergence between P and Q then follows:
$$\begin{aligned} D(P||Q)&=H^{\prime}(x)-H(x)=\sum_{x\in X}P(x)\log\left(\frac{1}{Q(x)}\right)-\sum_{x\in X}P(x)\log\left(\frac{1}{P(x)}\right)\\ &=\sum_{x\in X}P(x)\log\left(\frac{P(x)}{Q(x)}\right) \end{aligned}$$
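The identity D(P||Q) = H′ − H can be checked numerically. The sketch below is only an illustration: the helper names and the two distributions are assumptions, and base-2 logarithms are used so that the quantities read as bits. The probabilities are powers of 1/2, so every value is exact.

```python
import math

def entropy(p):
    """H: average optimal code length (bits) for samples from P coded with P."""
    return sum(p_i * math.log2(1.0 / p_i) for p_i in p if p_i > 0)

def cross_entropy(p, q):
    """H': average code length (bits) for samples from P coded with Q.

    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(p_i * math.log2(1.0 / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def kl_divergence(p, q):
    """D(P || Q) in bits, computed directly from the definition."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Illustrative distributions (assumed values).
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

h = entropy(p)                 # 1.5 bits
h_prime = cross_entropy(p, q)  # 1.75 bits
kl = kl_divergence(p, q)       # 0.25 bits

print(h, h_prime, kl)
print(math.isclose(h_prime - h, kl))  # True: the extra bits equal D(P||Q)
```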
Proof that the KL divergence is non-negative
- Jensen's inequality can be used to prove that the KL divergence between P and Q is always non-negative:
Jensen's inequality (for the concave function $\log$):
$$\log\sum_{i}\lambda_{i}y_{i}\ge\sum_{i}\lambda_{i}\log y_{i}\quad\quad\text{where }\lambda_{i}\ge0,\ \sum_{i}\lambda_{i}=1$$
Applying it with $\lambda_{i}=P(x)$ and $y_{i}=\frac{Q(x)}{P(x)}$:
$$\begin{aligned} D(P||Q)&=\sum_{x\in X}P(x)\log\left(\frac{P(x)}{Q(x)}\right)\\ &=\underset{x\sim P(x)}{E}\left[\log\left(\frac{P(x)}{Q(x)}\right)\right]\\ &=-\underset{x\sim P(x)}{E}\left[\log\left(\frac{Q(x)}{P(x)}\right)\right]\\ &\ge-\log\left(\sum_{x\in X}P(x)\cdot\frac{Q(x)}{P(x)}\right)=-\log\left(\sum_{x\in X}Q(x)\right)=0 \end{aligned}$$
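The non-negativity result can also be sanity-checked empirically. The following sketch draws random pairs of discrete distributions and records the smallest divergence seen; the helper names, the seed, and the sample sizes are arbitrary choices for illustration, and a numerical check of this kind is not a substitute for the proof.

```python
import math
import random

def kl_divergence(p, q):
    """D(P || Q) in nats for discrete distributions with strictly positive q."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def random_distribution(n, rng):
    """Draw n positive weights and normalise them so they sum to 1."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)  # fixed seed so the run is reproducible
min_kl = math.inf
for _ in range(1000):
    p = random_distribution(5, rng)
    q = random_distribution(5, rng)
    min_kl = min(min_kl, kl_divergence(p, q))

print("smallest D(P||Q) over 1000 random pairs:", min_kl)
print("D(P||P) =", kl_divergence(p, p))  # exactly 0 when the distributions coincide
```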
KL divergence of Gaussian distributions
KL divergence of univariate Gaussian distributions
Consider two Gaussian distributions of a single continuous variable:
$$P(x)\sim\mathcal N(\mu_{1},\sigma_{1}^{2}),\quad Q(x)\sim\mathcal N(\mu_{2},\sigma_{2}^{2})$$
From the definition of the KL divergence for continuous random variables:
$$\begin{aligned} KL(P||Q)&=KL(\mathcal N(\mu_{1},\sigma_{1}^{2})||\mathcal N(\mu_{2},\sigma_{2}^{2}))\\ &=\int_{x}\frac{1}{\sigma_{1}\sqrt{2\pi}}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}\log\left(\frac{\frac{1}{\sigma_{1}\sqrt{2\pi}}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}}{\frac{1}{\sigma_{2}\sqrt{2\pi}}e^{-\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}}}\right)dx\\ &=\int_{x}\frac{1}{\sigma_{1}\sqrt{2\pi}}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}\left[\log\frac{\sigma_{2}}{\sigma_{1}}-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}+\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\right]dx \end{aligned}$$
Split the expression above into three terms and evaluate each one separately:
First term:
$$\log\frac{\sigma_{2}}{\sigma_{1}}\int_{x}\frac{1}{\sigma_{1}\sqrt{2\pi}}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx=\log\frac{\sigma_{2}}{\sigma_{1}}$$
For the second term, note that the integral is the variance of P scaled by $\frac{1}{2\sigma_{1}^{2}}$; it can be evaluated explicitly with the substitution $t=\frac{x-\mu_{1}}{\sigma_{1}\sqrt{2}}$:
$$\begin{aligned} -\frac{1}{\sigma_{1}\sqrt{2\pi}}\int_{x}\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx&=-\frac{1}{\sigma_{1}\sqrt{2\pi}}\int_{x}\left(\frac{x-\mu_{1}}{\sigma_{1}\sqrt{2}}\right)^{2}e^{-\left(\frac{x-\mu_{1}}{\sigma_{1}\sqrt{2}}\right)^{2}}dx\\ &=-\frac{1}{\sqrt{\pi}}\int_{x}\left(\frac{x-\mu_{1}}{\sigma_{1}\sqrt{2}}\right)^{2}e^{-\left(\frac{x-\mu_{1}}{\sigma_{1}\sqrt{2}}\right)^{2}}d\left(\frac{x-\mu_{1}}{\sigma_{1}\sqrt{2}}\right)\\ &=-\frac{1}{\sqrt{\pi}}\int_{t}t^{2}e^{-t^{2}}dt\\ &=-\frac{1}{\sqrt{\pi}}\cdot\frac{\sqrt{\pi}}{2}\\ &=-\frac{1}{2} \end{aligned}$$
--------------- Derivation: ---------------
$$\int_{-\infty}^{+\infty}x^{2}e^{-x^{2}}dx=2\int_{0}^{+\infty}x^{2}e^{-x^{2}}dx\xlongequal{t=x^{2}}\int_{0}^{+\infty}\sqrt{t}\,e^{-t}dt$$
The $\Gamma$ function is defined as:
$$\Gamma(s)=\int_{0}^{+\infty}x^{s-1}e^{-x}dx$$
The $\Gamma$ function has the following properties:
$$\Gamma(s+1)=s\Gamma(s)$$
$$\Gamma(1)=1\quad\quad\Gamma\left(\frac{1}{2}\right)=\sqrt{\pi}\quad\quad\Gamma(n+1)=n!$$
$$\Gamma\left(\frac{3}{2}\right)=\int_{0}^{+\infty}\sqrt{x}\,e^{-x}dx=\Gamma\left(\frac{1}{2}+1\right)=\frac{1}{2}\cdot\Gamma\left(\frac{1}{2}\right)=\frac{\sqrt{\pi}}{2}$$
Therefore:
$$\int_{-\infty}^{+\infty}x^{2}e^{-x^{2}}dx=\frac{\sqrt{\pi}}{2}$$
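The value of this integral can be cross-checked numerically. The sketch below compares a composite Simpson's rule estimate against $\frac{\sqrt{\pi}}{2}$ and against `math.gamma(1.5)` from the Python standard library. The helper `simpson`, the truncation of the integration range to [-10, 10], and the number of subintervals are choices made for this illustration.

```python
import math

def integrand(x):
    """The function x^2 * exp(-x^2) that appears in the second term."""
    return x * x * math.exp(-x * x)

def simpson(f, a, b, n):
    """Composite Simpson's rule on [a, b] with n subintervals (n must be even)."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

# The integrand is negligible beyond |x| = 10, so the infinite range is truncated.
numeric = simpson(integrand, -10.0, 10.0, 2000)
closed_form = math.sqrt(math.pi) / 2
via_gamma = math.gamma(1.5)  # Gamma(3/2)

print(numeric, closed_form, via_gamma)
```

All three printed values should agree to many decimal places.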
-------------------------------------------
For the third term, after expanding the square, the three pieces inside the integral give the mean square, the mean, and a constant, so:
$$\begin{aligned} \int_{x}\frac{1}{\sigma_{1}\sqrt{2\pi}}\cdot\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\cdot e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx&=\frac{1}{2\sigma_{1}\sigma_{2}^{2}\sqrt{2\pi}}\int_{x}\left(x^{2}-2x\mu_{2}+\mu_{2}^{2}\right)e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx\\ &=\frac{\sigma_{1}^{2}+\mu_{1}^{2}-2\mu_{1}\mu_{2}+\mu_{2}^{2}}{2\sigma_{2}^{2}}\\ &=\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}} \end{aligned}$$
--------------- Calculation: ---------------
Equivalently, expand $(x-\mu_{2})^{2}$ around $\mu_{1}$. The first term is then the variance, the second term is odd about $\mu_{1}$ so its integral over the whole line is 0, and the third term is a constant that can be pulled out as a coefficient:
$$\begin{aligned} \int_{x}\frac{1}{\sigma_{1}\sqrt{2\pi}}\cdot\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\cdot e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx&=\frac{1}{2\sigma_{2}^{2}}\int_{x}\left[(x-\mu_{1})^{2}+2(\mu_{1}-\mu_{2})(x-\mu_{1})+(\mu_{1}-\mu_{2})^{2}\right]\cdot\frac{1}{\sigma_{1}\sqrt{2\pi}}\cdot e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx\\ &=\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}} \end{aligned}$$
-------------------------------------------
Collecting the three terms gives the final result:
$$\begin{aligned} KL(P||Q)&=KL(\mathcal N(\mu_{1},\sigma_{1}^{2})||\mathcal N(\mu_{2},\sigma_{2}^{2}))\\ &=\log\frac{\sigma_{2}}{\sigma_{1}}+\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}}-\frac{1}{2} \end{aligned}$$
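To build confidence in the closed form, it can be compared with a direct numerical integration of the defining integral. This is a minimal sketch: the function names, the midpoint rule, the integration window of 12 standard deviations around $\mu_{1}$, and the example parameters are all assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def kl_gaussian_closed_form(mu1, sigma1, mu2, sigma2):
    """log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2, in nats."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

def kl_gaussian_numeric(mu1, sigma1, mu2, sigma2, half_width=12.0, n=20000):
    """Midpoint-rule estimate of the integral of p(x) * log(p(x) / q(x))."""
    a = mu1 - half_width * sigma1
    b = mu1 + half_width * sigma1
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * h
        p = gaussian_pdf(x, mu1, sigma1)
        # The log-ratio is written out analytically (the bracket in the
        # derivation) so that no log(0) occurs far out in the tails.
        log_ratio = (math.log(sigma2 / sigma1)
                     - (x - mu1) ** 2 / (2 * sigma1 ** 2)
                     + (x - mu2) ** 2 / (2 * sigma2 ** 2))
        total += p * log_ratio * h
    return total

# Illustrative parameters: P = N(0, 1^2), Q = N(1, 2^2).
closed = kl_gaussian_closed_form(0.0, 1.0, 1.0, 2.0)
numeric = kl_gaussian_numeric(0.0, 1.0, 1.0, 2.0)
print(closed, numeric)
```

For these parameters the closed form is $\log 2+\frac{2}{8}-\frac{1}{2}\approx 0.4431$, and the numerical estimate should match it closely.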
KL divergence of multivariate Gaussian distributions
For $\mathrm{x}\in\mathbb{R}^{d}$, the multivariate Gaussian density is:
$$\mathcal{N}(\mathrm{x}\mid\mu,\Sigma)=\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu)^{T}\Sigma^{-1}(\mathrm{x}-\mu)}$$
Let:
$$P(\mathrm{x})\sim\mathcal{N}(\mathrm{x}\mid\mu_{1},\Sigma_{1})\quad\quad\quad Q(\mathrm{x})\sim\mathcal{N}(\mathrm{x}\mid\mu_{2},\Sigma_{2})$$
$$\begin{aligned} &\mathrm{KL}\left(\mathcal{N}(\mathrm{x}\mid\mu_{1},\Sigma_{1})||\mathcal{N}(\mathrm{x}\mid\mu_{2},\Sigma_{2})\right)\\ =&\int_{x_{1}}\cdots\int_{x_{d}}\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_{1}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})}\log\frac{\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_{1}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})}}{\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_{2}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu_{2})^{T}\Sigma_{2}^{-1}(\mathrm{x}-\mu_{2})}}dx_{1}\cdots dx_{d}\\ =&\int_{x_{1}}\cdots\int_{x_{d}}\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_{1}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})}\left[\frac{1}{2}\log\frac{|\Sigma_{2}|}{|\Sigma_{1}|}-\frac{1}{2}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})+\frac{1}{2}(\mathrm{x}-\mu_{2})^{T}\Sigma_{2}^{-1}(\mathrm{x}-\mu_{2})\right]dx_{1}\cdots dx_{d} \end{aligned}$$
As before, evaluate the three terms separately:
First term:
$$\frac{1}{2}\log\frac{|\Sigma_{2}|}{|\Sigma_{1}|}\int_{x_{1}}\cdots\int_{x_{d}}\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_{1}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})}dx_{1}\cdots dx_{d}=\frac{1}{2}\log\frac{|\Sigma_{2}|}{|\Sigma_{1}|}$$
Second term:
$$-\frac{1}{2}\int_{x_{1}}\cdots\int_{x_{d}}\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_{1}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})\,dx_{1}\cdots dx_{d}$$
$\Sigma_{1}$ is a symmetric positive-definite matrix (it must be invertible for $\Sigma_{1}^{-1}$ to exist). Let $\Sigma_{1}^{-1}=U^{T}U$ and $\mathrm{y}=U(\mathrm{x}-\mu_{1})$. Since the matrix of a linear transformation is its own Jacobian matrix:
$$dy_{1}\cdots dy_{d}=|U|\,dx_{1}\cdots dx_{d}$$
From $|\Sigma_{1}^{-1}|=|U^{T}U|=|U|^{2}$ it follows that $|\Sigma_{1}^{-\frac{1}{2}}|=|\Sigma_{1}|^{-\frac{1}{2}}=|U|$, where $|U|$ is taken as the absolute value of the determinant. Therefore the second term becomes:
$$\begin{aligned} &-\frac{1}{2|\Sigma_{1}|^{\frac{1}{2}}}\int_{y_{1}}\cdots\int_{y_{d}}\frac{1}{(2\pi)^{\frac{d}{2}}}e^{-\frac{1}{2}\mathrm{y}^{T}\mathrm{y}}\,\mathrm{y}^{T}\mathrm{y}\,|U|^{-1}dy_{1}\cdots dy_{d}\\ =&-\frac{1}{2|\Sigma_{1}|^{\frac{1}{2}}}|\Sigma_{1}|^{\frac{1}{2}}\cdot d=-\frac{d}{2} \end{aligned}$$
The last step uses $|U|^{-1}=|\Sigma_{1}|^{\frac{1}{2}}$ and the fact that the components of $\mathrm{y}$ are independent standard normal variables, so the expectation of $\mathrm{y}^{T}\mathrm{y}$ is $d$.
Third term:
A small trick is needed here:
$$\mathrm{x}^{T}A\mathrm{x}=\operatorname{tr}\left(A\mathrm{x}\mathrm{x}^{T}\right)$$
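The trace identity is easy to verify on a concrete example. The sketch below uses plain Python lists for a 2×2 matrix and a 2-vector; the helper names and the numbers are made up for illustration, and the matrix is deliberately not symmetric to show that the identity does not depend on symmetry.

```python
def quadratic_form(a, x):
    """x^T A x for a square matrix a (list of rows) and a vector x."""
    n = len(x)
    return sum(x[i] * a[i][j] * x[j] for i in range(n) for j in range(n))

def outer(x, y):
    """Outer product x y^T as a list of rows."""
    return [[x_i * y_j for y_j in y] for x_i in x]

def matmul(a, b):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def trace(m):
    """Sum of the diagonal entries of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))

# Illustrative values (assumed).
a = [[2.0, 1.0],
     [0.5, 3.0]]
x = [1.0, -2.0]

lhs = quadratic_form(a, x)            # x^T A x
rhs = trace(matmul(a, outer(x, x)))   # tr(A x x^T)
print(lhs, rhs)  # both are 11.0
```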
$$\begin{aligned} &\frac{1}{2}\int_{x_{1}}\cdots\int_{x_{d}}\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_{1}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})}(\mathrm{x}-\mu_{2})^{T}\Sigma_{2}^{-1}(\mathrm{x}-\mu_{2})\,dx_{1}\cdots dx_{d}\\ =&\frac{1}{2}\int_{x_{1}}\cdots\int_{x_{d}}\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_{1}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})}\operatorname{tr}\left[\Sigma_{2}^{-1}(\mathrm{x}-\mu_{2})(\mathrm{x}-\mu_{2})^{T}\right]dx_{1}\cdots dx_{d}\\ =&\frac{1}{2}\operatorname{tr}\left[\Sigma_{2}^{-1}\int_{x_{1}}\cdots\int_{x_{d}}\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_{1}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})}(\mathrm{x}-\mu_{2})(\mathrm{x}-\mu_{2})^{T}dx_{1}\cdots dx_{d}\right]\\ =&\frac{1}{2}\operatorname{tr}\left[\Sigma_{2}^{-1}\int_{x_{1}}\cdots\int_{x_{d}}\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_{1}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})}\left(\mathrm{x}\mathrm{x}^{T}-\mu_{2}\mathrm{x}^{T}-\mathrm{x}\mu_{2}^{T}+\mu_{2}\mu_{2}^{T}\right)dx_{1}\cdots dx_{d}\right] \end{aligned}$$
After integration, the first term gives the second moment $E[\mathrm{x}\mathrm{x}^{T}]=\Sigma_{1}+\mu_{1}\mu_{1}^{T}$, the second and third terms give the mean, and the fourth term is a constant:
$$\begin{aligned} &\frac{1}{2}\operatorname{tr}\left[\Sigma_{2}^{-1}\int_{x_{1}}\cdots\int_{x_{d}}\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_{1}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\mathrm{x}-\mu_{1})^{T}\Sigma_{1}^{-1}(\mathrm{x}-\mu_{1})}\left(\mathrm{x}\mathrm{x}^{T}-\mu_{2}\mathrm{x}^{T}-\mathrm{x}\mu_{2}^{T}+\mu_{2}\mu_{2}^{T}\right)dx_{1}\cdots dx_{d}\right]\\ =&\frac{1}{2}\operatorname{tr}\left[\Sigma_{2}^{-1}\left(\Sigma_{1}+\mu_{1}\mu_{1}^{T}-\mu_{2}\mu_{1}^{T}-\mu_{1}\mu_{2}^{T}+\mu_{2}\mu_{2}^{T}\right)\right]\\ =&\frac{1}{2}\left[\operatorname{tr}\left(\Sigma_{2}^{-1}\Sigma_{1}\right)+\operatorname{tr}\left(\Sigma_{2}^{-1}(\mu_{1}-\mu_{2})(\mu_{1}-\mu_{2})^{T}\right)\right]\\ =&\frac{1}{2}\left[\operatorname{tr}\left(\Sigma_{2}^{-1}\Sigma_{1}\right)+(\mu_{1}-\mu_{2})^{T}\Sigma_{2}^{-1}(\mu_{1}-\mu_{2})\right] \end{aligned}$$
Collecting the three terms, the KL divergence between two Gaussian distributions is (with $d$ the dimension of $\mathrm{x}$):
$$\mathrm{KL}\left(\mathcal{N}(\mathrm{x}\mid\mu_{1},\Sigma_{1})||\mathcal{N}(\mathrm{x}\mid\mu_{2},\Sigma_{2})\right)=\frac{1}{2}\left[\log\frac{|\Sigma_{2}|}{|\Sigma_{1}|}-d+\operatorname{tr}\left(\Sigma_{2}^{-1}\Sigma_{1}\right)+(\mu_{1}-\mu_{2})^{T}\Sigma_{2}^{-1}(\mu_{1}-\mu_{2})\right]$$
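One way to sanity-check the multivariate formula is the diagonal case: when both covariance matrices are diagonal, the components are independent and the divergence should equal the sum of the univariate divergences. The sketch below does this for d = 2 in plain Python. Restricting to 2×2 matrices, the helper names, and the example parameters are all assumptions made to keep the illustration dependency-free.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as a list of rows."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    """Inverse of a 2x2 matrix (assumes a non-zero determinant)."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]

def kl_gaussian_2d(mu1, cov1, mu2, cov2):
    """KL(N(mu1, cov1) || N(mu2, cov2)) for d = 2, in nats."""
    d = 2
    inv_cov2 = inv2(cov2)
    log_det_ratio = math.log(det2(cov2) / det2(cov1))
    # tr(cov2^{-1} cov1)
    trace_term = sum(inv_cov2[i][k] * cov1[k][i] for i in range(d) for k in range(d))
    # (mu1 - mu2)^T cov2^{-1} (mu1 - mu2)
    diff = [mu1[i] - mu2[i] for i in range(d)]
    quad_term = sum(diff[i] * inv_cov2[i][j] * diff[j] for i in range(d) for j in range(d))
    return 0.5 * (log_det_ratio - d + trace_term + quad_term)

def kl_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """Univariate closed form from the previous section."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

# Diagonal covariances: standard deviations (1, 2) for P and (2, 3) for Q.
mu_p, cov_p = [0.0, 1.0], [[1.0, 0.0], [0.0, 4.0]]
mu_q, cov_q = [1.0, -1.0], [[4.0, 0.0], [0.0, 9.0]]

kl_2d = kl_gaussian_2d(mu_p, cov_p, mu_q, cov_q)
kl_sum = kl_gaussian_1d(0.0, 1.0, 1.0, 2.0) + kl_gaussian_1d(1.0, 2.0, -1.0, 3.0)
print(kl_2d, kl_sum)  # the two values should agree

# A non-diagonal example (assumed values); the result should still be non-negative.
kl_full = kl_gaussian_2d([0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]],
                         [1.0, 0.0], [[1.0, -0.3], [-0.3, 2.0]])
print(kl_full)
```

In practice a linear-algebra library would be used for general d; the explicit 2×2 inverse here only serves to keep the example self-contained.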