KL Divergence Between Two Multivariate Gaussian Distributions
A Gaussian distribution is a continuous probability distribution defined on $\mathbb{R}^n$, with probability density function:
$$p(x)=\frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}}\exp\left\{ -\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}$$
Here $x,\mu\in\mathbb{R}^n$, and $\Sigma\in\mathbb{R}^{n\times n}$ is the covariance matrix, which must be symmetric positive definite. When $\mu=0$ and $\Sigma=I$, this is the standard normal distribution.
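The density can be evaluated numerically. Below is a minimal NumPy sketch; the helper name `gaussian_logpdf` is hypothetical (it does not come from the text above), and the choice to work in log space with a Cholesky factor of $\Sigma$ is an assumption made here for numerical stability, not something the definition requires.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma) at x; sigma must be symmetric positive definite."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n = mu.shape[0]
    # Cholesky factor with sigma = chol @ chol.T; NumPy raises LinAlgError
    # if sigma is not positive definite.
    chol = np.linalg.cholesky(sigma)
    # Solve chol @ z = (x - mu), so that z @ z = (x - mu)^T sigma^{-1} (x - mu).
    z = np.linalg.solve(chol, x - mu)
    # log det(sigma) = 2 * sum(log(diag(chol))).
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + z @ z)

# Standard normal in two dimensions at the origin: log p(0) = -log(2*pi).
print(gaussian_logpdf(np.zeros(2), np.zeros(2), np.eye(2)))
```

Taking `np.exp` of the result gives the density itself; staying in log space avoids underflow when $n$ is large.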
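Symmetric positive definite:
If $\Sigma$ is a symmetric positive definite matrix, then:
(1) Symmetry: $\Sigma=\Sigma^T$
(2) Positive definiteness: for any nonzero $\xi\in\mathbb{R}^n$, we have $\xi^T\Sigma\xi>0$
The inverse of a positive definite matrix is also positive definite, and the sum of two positive definite matrices is also positive definite.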
Some properties of the normal distribution:
- $E_x[x]=\mu$
- $E_x[(x-\mu)(x-\mu)^T]=\Sigma$
- $E_x[xx^T]=\mu\mu^T+E_x[(x-\mu)(x-\mu)^T]=\mu\mu^T+\Sigma$
- Entropy:
$$\mathcal{H}=E_x[-\log p(x)]=\frac{n}{2}(1+\log 2\pi)+\frac{1}{2}\log \det (\Sigma)$$
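The entropy formula is straightforward to evaluate. The sketch below is illustrative only: the function name `gaussian_entropy` is hypothetical, and it uses `np.linalg.slogdet` rather than `log(det(...))` as an assumed safeguard against overflow or underflow of the determinant.

```python
import numpy as np

def gaussian_entropy(sigma):
    """Differential entropy (in nats) of N(mu, sigma); it does not depend on mu."""
    sigma = np.asarray(sigma, dtype=float)
    n = sigma.shape[0]
    # slogdet returns (sign, log|det|); the sign is +1 for a positive definite matrix.
    sign, log_det = np.linalg.slogdet(sigma)
    if sign <= 0:
        raise ValueError("sigma must be positive definite")
    return 0.5 * n * (1.0 + np.log(2.0 * np.pi)) + 0.5 * log_det

# Entropy of the 3-D standard normal: 1.5 * (1 + log(2*pi)).
print(gaussian_entropy(np.eye(3)))
# Stretching one axis to variance 4 adds 0.5 * log(4) relative to the 2-D standard normal.
print(gaussian_entropy(np.diag([4.0, 1.0])) - gaussian_entropy(np.eye(2)))
```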
KL divergence
For $p(x)=\mathcal{N}(\mu_p,\Sigma_p)$ and $q(x)=\mathcal{N}(\mu_q,\Sigma_q)$, the result is:
$$\mathrm{KL}(p(x)\,\|\,q(x))=\frac{1}{2}\left[(\mu_p-\mu_q)^T\Sigma_q^{-1}(\mu_p-\mu_q)-\log \det(\Sigma_q^{-1}\Sigma_p)+\mathrm{Tr}(\Sigma_q^{-1}\Sigma_p)-n \right]$$
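The closed form translates almost term by term into code. The following is a sketch under stated assumptions: the function name `kl_gaussians` and the test values are made up for illustration, and solving linear systems instead of forming $\Sigma_q^{-1}$ explicitly is a design choice made here, not part of the formula.

```python
import numpy as np

def kl_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL(N(mu_p, sigma_p) || N(mu_q, sigma_q)) in nats."""
    mu_p = np.asarray(mu_p, dtype=float)
    mu_q = np.asarray(mu_q, dtype=float)
    sigma_p = np.asarray(sigma_p, dtype=float)
    sigma_q = np.asarray(sigma_q, dtype=float)
    n = mu_p.shape[0]
    diff = mu_p - mu_q
    # (mu_p - mu_q)^T Sigma_q^{-1} (mu_p - mu_q)
    mahalanobis = diff @ np.linalg.solve(sigma_q, diff)
    # Tr(Sigma_q^{-1} Sigma_p)
    trace_term = np.trace(np.linalg.solve(sigma_q, sigma_p))
    # log det(Sigma_q^{-1} Sigma_p) = log det(Sigma_p) - log det(Sigma_q)
    log_det_term = np.linalg.slogdet(sigma_p)[1] - np.linalg.slogdet(sigma_q)[1]
    return 0.5 * (mahalanobis - log_det_term + trace_term - n)

mu_p, sigma_p = np.array([1.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu_q, sigma_q = np.array([0.0, 0.0]), np.eye(2)

print(kl_gaussians(mu_p, sigma_p, mu_p, sigma_p))  # identical distributions: 0 up to rounding
print(kl_gaussians(mu_p, sigma_p, mu_q, sigma_q))  # KL(p || q)
print(kl_gaussians(mu_q, sigma_q, mu_p, sigma_p))  # KL(q || p), a different value
```

The last two calls give different values, which illustrates that the KL divergence is not symmetric in its arguments.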
In particular, when $q$ is the standard normal distribution, the result simplifies to:
$$\mathrm{KL}(p(x)\,\|\,q(x))=\frac{1}{2}\left[\|\mu_p\|^2+\mathrm{Tr}(\Sigma_p)-\log \det (\Sigma_p)-n \right]$$
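A short sketch of this special case follows. Both function names are hypothetical. The second function additionally assumes a diagonal covariance $\Sigma_p=\mathrm{diag}(v)$, a case the text does not discuss; under that assumption the trace and log-determinant become sums over coordinates.

```python
import numpy as np

def kl_to_standard_normal(mu_p, sigma_p):
    """KL(N(mu_p, sigma_p) || N(0, I)) in nats."""
    mu_p = np.asarray(mu_p, dtype=float)
    sigma_p = np.asarray(sigma_p, dtype=float)
    n = mu_p.shape[0]
    log_det = np.linalg.slogdet(sigma_p)[1]
    return 0.5 * (mu_p @ mu_p + np.trace(sigma_p) - log_det - n)

def kl_diag_to_standard_normal(mu_p, var_p):
    """Same quantity, assuming sigma_p = diag(var_p) (an assumption of this sketch)."""
    mu_p = np.asarray(mu_p, dtype=float)
    var_p = np.asarray(var_p, dtype=float)
    # Tr(diag(v)) = sum(v) and log det(diag(v)) = sum(log(v)).
    return 0.5 * np.sum(mu_p**2 + var_p - np.log(var_p) - 1.0)

mu = np.array([1.0, -2.0])
var = np.array([0.5, 2.0])
print(kl_to_standard_normal(mu, np.diag(var)))
print(kl_diag_to_standard_normal(mu, var))  # agrees with the line above
```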
Derivation:
$$\mathrm{KL}(p(x)\,\|\,q(x))=E_{x\sim p(x)}\left[\log\frac{p(x)}{q(x)}\right]=E_{x\sim p(x)}[\log p(x)]+E_{x\sim p(x)}[-\log q(x)]$$
First compute $E_{x\sim p(x)}[-\log q(x)]$:
$$\begin{aligned}
E_{x\sim p(x)}[-\log q(x)]&=E_{x\sim p(x)}\left[\frac{n}{2}\log (2\pi)+\frac{1}{2}\log \det(\Sigma_q)+\frac{1}{2}(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q) \right]\\
&=\frac{n}{2}\log (2\pi)+\frac{1}{2}\log \det(\Sigma_q)+\frac{1}{2}E_{x\sim p(x)}\left[(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q) \right]
\end{aligned}$$
Frobenius inner product:
For $m\times n$ matrices $A,B$, the Frobenius inner product is defined as:
$$\langle A,B\rangle_F=\sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij}$$
The Frobenius inner product has the following property:
$$\langle A,B\rangle_F=\mathrm{Tr}(A^TB)=\mathrm{Tr}(BA^T)=\mathrm{Tr}(AB^T)=\mathrm{Tr}(B^TA)$$
Using the trace property of the Frobenius inner product, together with the fact that a scalar equals its own trace:
$$\begin{aligned}
E_{x\sim p(x)}\left[(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q) \right]&=E_{x\sim p(x)}\left[\mathrm{Tr}\left((x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q)\right) \right]\\
&=E_{x\sim p(x)}\left[\mathrm{Tr}\left(\Sigma_q^{-1}(x-\mu_q)(x-\mu_q)^T\right) \right]\\
&=\mathrm{Tr}\left(\Sigma_q^{-1} E_{x\sim p(x)}\left[(x-\mu_q)(x-\mu_q)^T\right] \right)\\
&=\mathrm{Tr}\left(\Sigma_q^{-1} E_{x\sim p(x)}\left[xx^T-x\mu_q^T-\mu_qx^T+\mu_q\mu_q^T\right] \right)\\
&=\mathrm{Tr}\left(\Sigma_q^{-1}\left(\Sigma_p+\mu_p\mu_p^T-\mu_p\mu_q^T-\mu_q\mu_p^T+\mu_q\mu_q^T\right)\right)\\
&=\mathrm{Tr}\left(\Sigma_q^{-1}\left(\Sigma_p+(\mu_p-\mu_q)(\mu_p-\mu_q)^T\right)\right)\\
&=\mathrm{Tr}(\Sigma_q^{-1}\Sigma_p)+(\mu_p-\mu_q)^T\Sigma_q^{-1}(\mu_p-\mu_q)
\end{aligned}$$
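The trace manipulation and the resulting expectation can be checked numerically. The sketch below is only a sanity check: the means, covariances, sample size, and random seed are arbitrary values chosen for illustration, and the Monte Carlo estimate matches the closed form only up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_p, sigma_p = np.array([1.0, -1.0]), np.array([[2.0, 0.3], [0.3, 0.5]])
mu_q, sigma_q = np.array([0.0, 0.5]), np.array([[1.0, -0.2], [-0.2, 1.5]])
sigma_q_inv = np.linalg.inv(sigma_q)

# Trace trick for a single vector: v^T A v = Tr(A v v^T).
v = rng.normal(size=2)
quad = v @ sigma_q_inv @ v
trace_form = np.trace(sigma_q_inv @ np.outer(v, v))

# Closed form: Tr(Sigma_q^{-1} Sigma_p) + (mu_p - mu_q)^T Sigma_q^{-1} (mu_p - mu_q).
diff = mu_p - mu_q
closed_form = np.trace(sigma_q_inv @ sigma_p) + diff @ sigma_q_inv @ diff

# Monte Carlo estimate of E_{x~p}[(x - mu_q)^T Sigma_q^{-1} (x - mu_q)].
x = rng.multivariate_normal(mu_p, sigma_p, size=200_000)
centered = x - mu_q
estimate = np.mean(np.einsum("ij,jk,ik->i", centered, sigma_q_inv, centered))

print(quad, trace_form)       # equal up to floating-point rounding
print(closed_form, estimate)  # close, up to Monte Carlo error
```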
As for $E_{x\sim p(x)}[\log p(x)]$, it is the negative of the entropy given above. The final result is therefore:
$$\begin{aligned}
\mathrm{KL}(p(x)\,\|\,q(x))&=E_{x\sim p(x)}[\log p(x)]+E_{x\sim p(x)}[-\log q(x)] \\
&=\left[-\frac{n}{2}(1+\log 2\pi)-\frac{1}{2}\log \det (\Sigma_p)\right]\\
&\quad+\frac{n}{2}\log (2\pi)+\frac{1}{2}\log \det(\Sigma_q)+\frac{1}{2}\left[\mathrm{Tr}(\Sigma_q^{-1}\Sigma_p)+(\mu_p-\mu_q)^T\Sigma_q^{-1}(\mu_p-\mu_q)\right]\\
&=\frac{1}{2}\left[\mathrm{Tr}(\Sigma_q^{-1}\Sigma_p)+(\mu_p-\mu_q)^T\Sigma_q^{-1}(\mu_p-\mu_q)-n-\log\det (\Sigma_q^{-1}\Sigma_p) \right]
\end{aligned}$$
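As a quick sanity check on the final formula, consider the univariate case $n=1$. The symbols $\sigma_p^2$ and $\sigma_q^2$ are introduced here only for illustration, standing for the scalar variances $\Sigma_p=\sigma_p^2$ and $\Sigma_q=\sigma_q^2$. Substituting them into the result gives:

$$\begin{aligned}
\mathrm{KL}(p(x)\,\|\,q(x))&=\frac{1}{2}\left[\frac{\sigma_p^2}{\sigma_q^2}+\frac{(\mu_p-\mu_q)^2}{\sigma_q^2}-1-\log\frac{\sigma_p^2}{\sigma_q^2}\right]\\
&=\log\frac{\sigma_q}{\sigma_p}+\frac{\sigma_p^2+(\mu_p-\mu_q)^2}{2\sigma_q^2}-\frac{1}{2}
\end{aligned}$$

When $\mu_p=\mu_q$ and $\sigma_p=\sigma_q$, every term cancels and the divergence is zero, as expected.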