KL Divergence Between Multivariate t-Distributions
The multivariate Student's t-distribution (multivariate t-distribution for short) is defined by the density

$$
f(\mathbf{x})=C_n(\det \Sigma)^{-1/2}\left[1+\frac{1}{\nu}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right]^{-(\nu+n)/2}
$$
where $x\in \mathbb{R}^n$ is the random vector, $\mu\in \mathbb{R}^n$ is the mean (which exists for $\nu>1$), $\Sigma\in \mathbb{R}^{n\times n}$ is the correlation matrix, also called the scale matrix, $\nu$ is the degrees of freedom, $n$ is the dimension of $x$, and $C_n$ is the normalizing constant, defined as

$$
C_n=(\pi\nu)^{-n/2}\,\Gamma[(\nu+n)/2]/\Gamma(\nu/2)
$$

where $\Gamma(\cdot)$ is the Gamma function.
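- Note that the correlation matrix $\Sigma$ is not the covariance matrix in the usual statistical sense, although the two are related; the relation is given later.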
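The density above can be evaluated numerically in log form, which avoids overflow in the Gamma functions. The following is a minimal sketch, assuming NumPy is available; the function name `mvt_logpdf` is hypothetical and chosen only for illustration.

```python
import math
import numpy as np

def mvt_logpdf(x, mu, Sigma, nu):
    """Log-density of St(x; mu, Sigma, nu), following the definition above."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    Sigma = np.atleast_2d(np.asarray(Sigma, dtype=float))
    n = mu.shape[0]
    # log C_n = -(n/2) log(pi * nu) + log Gamma((nu + n)/2) - log Gamma(nu/2)
    log_c = (-0.5 * n * math.log(math.pi * nu)
             + math.lgamma(0.5 * (nu + n)) - math.lgamma(0.5 * nu))
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must have a positive determinant")
    d = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), without forming the inverse
    quad = float(d @ np.linalg.solve(Sigma, d))
    return log_c - 0.5 * logdet - 0.5 * (nu + n) * math.log1p(quad / nu)

# n = 1, nu = 1 is the standard Cauchy density, whose value at 0 is 1/pi
print(mvt_logpdf([0.0], [0.0], [[1.0]], 1.0), -math.log(math.pi))
```

For $n=1$ and $\nu=1$ the density reduces to the standard Cauchy density, which gives a quick sanity check on the normalizing constant.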
Consider two multivariate t-distributions $p(x)$ and $q(x)$. Assume that $p(x)$ is the known, true multivariate t-distribution and that $q(x)$ is an unknown multivariate t-distribution used to approximate $p(x)$. The two distributions are written as

$$
\begin{aligned}
p(x)&=St(x;\mu_1,\Sigma_1,\nu_1)\\
q(x)&=St(x;\mu_2,\Sigma_2,\nu_2)
\end{aligned}
$$
By the definition of the KL divergence, $D_{KL}(p(x)\|q(x))$ can be written as

$$
\begin{aligned}
&\quad D_{KL}(p(x)\|q(x))=\mathbb{E}_{p(x)}[\log p(x)-\log q(x)]\\
&=\mathbb{E}_{p(x)}\Big\{ \Big\{\log \Gamma\left(\frac{\nu_1+n}{2}\right)-\log \Gamma\left(\frac{\nu_1}{2}\right)-\frac{1}{2}\log (\det \Sigma_1)-\frac{n}{2}\log(\nu_1\pi)\\
&\quad -\frac{\nu_1+n}{2}\log\left[1+\frac{1}{\nu_1}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right]\Big\}\\
&\quad -\Big\{\log \Gamma\left(\frac{\nu_2+n}{2}\right)-\log \Gamma\left(\frac{\nu_2}{2}\right)-\frac{1}{2}\log (\det \Sigma_2)-\frac{n}{2}\log(\nu_2\pi)\\
&\quad -\frac{\nu_2+n}{2}\log\left[1+\frac{1}{\nu_2}(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right]\Big\} \Big\}\\
&=\frac{1}{2}\log \frac{\det \Sigma_2}{\det \Sigma_1}+\frac{n}{2}\log \frac{\nu_2}{\nu_1}+\log \Gamma\left(\frac{\nu_1+n}{2}\right)-\log \Gamma\left(\frac{\nu_1}{2}\right)-\log \Gamma\left(\frac{\nu_2+n}{2}\right)+\log \Gamma\left(\frac{\nu_2}{2}\right)\\
&\quad -\frac{\nu_1+n}{2}\mathbb{E}_{p(x)}\left\{\log\left[1+\frac{1}{\nu_1}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right]\right\}\\
&\quad +\frac{\nu_2+n}{2}\mathbb{E}_{p(x)}\left\{\log\left[1+\frac{1}{\nu_2}(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right]\right\}
\end{aligned}
$$
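Because the divergence is an expectation under $p(x)$, it can be approximated by averaging $\log p(x)-\log q(x)$ over samples drawn from $p(x)$. The sketch below is one way to do this, assuming NumPy; the names `mvt_logpdf`, `sample_mvt`, and `kl_monte_carlo` are hypothetical. Sampling uses the standard representation $x=\mu+z/\sqrt{g/\nu}$ with $z\sim N(0,\Sigma)$ and $g\sim\chi^2_\nu$, which is not part of the derivation above.

```python
import math
import numpy as np

def mvt_logpdf(X, mu, Sigma, nu):
    """Row-wise log-density of St(x; mu, Sigma, nu) for samples X of shape (m, n)."""
    n = mu.shape[0]
    log_c = (-0.5 * n * math.log(math.pi * nu)
             + math.lgamma(0.5 * (nu + n)) - math.lgamma(0.5 * nu))
    _, logdet = np.linalg.slogdet(Sigma)
    D = X - mu
    # Quadratic form for every row of D
    quad = np.einsum("ij,ij->i", D, np.linalg.solve(Sigma, D.T).T)
    return log_c - 0.5 * logdet - 0.5 * (nu + n) * np.log1p(quad / nu)

def sample_mvt(rng, mu, Sigma, nu, m):
    """Draw m samples as mu + z / sqrt(g / nu), z ~ N(0, Sigma), g ~ chi-square(nu)."""
    n = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    z = rng.standard_normal((m, n)) @ L.T
    g = rng.chisquare(nu, size=m)
    return mu + z / np.sqrt(g / nu)[:, None]

def kl_monte_carlo(mu1, Sigma1, nu1, mu2, Sigma2, nu2, m=100_000, seed=0):
    """Sample average of log p(x) - log q(x) with x drawn from p."""
    rng = np.random.default_rng(seed)
    X = sample_mvt(rng, mu1, Sigma1, nu1, m)
    diff = mvt_logpdf(X, mu1, Sigma1, nu1) - mvt_logpdf(X, mu2, Sigma2, nu2)
    return float(np.mean(diff))

mu = np.zeros(2)
S = np.eye(2)
print(kl_monte_carlo(mu, S, 5.0, mu + 3.0, S, 5.0))
```

Such an estimate is noisy, but it provides an independent reference value against which a closed-form expression or bound can be compared.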
From the maximum-entropy derivation of the multivariate t-distribution, it can be shown that

$$
\mathbb{E}_{p(x)}\left\{\log\left[1+\frac{1}{\nu_1}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right]\right\}=w\left(\frac{n+\nu_1}{2};\frac{n}{2}\right)
$$

Here $w(x;\alpha)=\psi(x)-\psi(x-\alpha)$ for $x>\alpha$ (in this definition $x$ is a scalar argument, not the random vector), and $\psi(\cdot)$ denotes the digamma function, defined as

$$
\psi(t)=\mathrm{d}\log \Gamma(t)/\mathrm{d}t
$$
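The identity can be checked numerically in the univariate case. The sketch below uses only the Python standard library; `digamma` is a hand-written approximation (recurrence plus asymptotic series), and `expected_log_term_1d` is a hypothetical helper that integrates the expectation for $n=1$, $\mu=0$, $\Sigma=1$ by the midpoint rule. Neither comes from the original derivation.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0: shift x upward by the recurrence, then use the asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 * (1 / 252 - inv2 / 240)))
    return acc + math.log(x) - 0.5 / x - series

def w(x, alpha):
    """w(x; alpha) = psi(x) - psi(x - alpha), defined for x > alpha."""
    if not x > alpha:
        raise ValueError("w(x; alpha) requires x > alpha")
    return digamma(x) - digamma(x - alpha)

def expected_log_term_1d(nu, steps=20000):
    """Midpoint-rule value of E{log[1 + x^2 / nu]} for a univariate t-distribution."""
    # With x = sqrt(nu) * tan(theta) the expectation becomes
    # c * integral over (-pi/2, pi/2) of -2 log(cos theta) * cos(theta)^(nu - 1),
    # where c = Gamma((nu + 1)/2) / (sqrt(pi) * Gamma(nu/2)).
    c = math.exp(math.lgamma(0.5 * (nu + 1)) - math.lgamma(0.5 * nu)) / math.sqrt(math.pi)
    h = math.pi / steps
    total = 0.0
    for k in range(steps):
        cos_t = math.cos(-0.5 * math.pi + (k + 0.5) * h)
        total += -2.0 * math.log(cos_t) * cos_t ** (nu - 1.0)
    return c * total * h

# n = 1, nu = 3: both values should agree with 2*log(2) - 1
print(w(2.0, 0.5), expected_log_term_1d(3.0), 2 * math.log(2) - 1)
```

For $n=1$ and $\nu_1=3$ the right-hand side is $\psi(2)-\psi(3/2)=2\log 2-1\approx 0.3863$, and the quadrature reproduces this value.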
In addition, since the natural logarithm $\log(\cdot)$ is a concave function, Jensen's inequality yields the following very useful inequality:

$$
\mathbb{E}_{p(x)}\{\log(\cdot)\} \leq \log\{\mathbb{E}_{p(x)}(\cdot)\}
$$
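A small numerical illustration of the inequality, using an empirical average in place of the expectation. The helper name `jensen_sides` and the sample values are made up for the example.

```python
import math

def jensen_sides(values):
    """Return (mean of log y, log of mean y) for positive values y."""
    mean_log = sum(math.log(v) for v in values) / len(values)
    log_mean = math.log(sum(values) / len(values))
    return mean_log, log_mean

# Values standing in for samples of a positive quantity
lhs, rhs = jensen_sides([1.0, 2.0, 4.0])
print(lhs, rhs)   # lhs = log 2, rhs = log(7/3)
```

The left side equals $\log 2$ and the right side equals $\log(7/3)$, so the inequality holds with a visible gap; equality would require all values to be identical.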
Therefore, writing $x-\mu_2=(x-\mu_1)+(\mu_1-\mu_2)$ and noting that the two cross terms vanish in expectation because $\mathbb{E}_{p(x)}[x-\mu_1]=0$,

$$
\begin{aligned}
&\quad \mathbb{E}_{p(x)}\left\{\log\left[1+\frac{1}{\nu_2}(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right]\right\}\\
&\leq \log\left\{\mathbb{E}_{p(x)}\left[1+\frac{1}{\nu_2}(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right]\right\}\\
&=\log\left\{\mathbb{E}_{p(x)}\left[1+\frac{1}{\nu_2}(x-\mu_1+\mu_1-\mu_2)^T\Sigma_2^{-1}(x-\mu_1+\mu_1-\mu_2)\right]\right\}\\
&=\log\Big\{\mathbb{E}_{p(x)}\Big[1+\frac{1}{\nu_2}(x-\mu_1)^T\Sigma_2^{-1}(x-\mu_1)+\frac{1}{\nu_2}(\mu_1-\mu_2)^T\Sigma_2^{-1}(\mu_1-\mu_2)\\
&\quad +\frac{1}{\nu_2}(x-\mu_1)^T\Sigma_2^{-1}(\mu_1-\mu_2)+\frac{1}{\nu_2}(\mu_1-\mu_2)^T\Sigma_2^{-1}(x-\mu_1)\Big]\Big\}\\
&=\log\left\{\mathbb{E}_{p(x)}\left\{ 1+\frac{1}{\nu_2}\mathrm{tr}\left[\Sigma_2^{-1}(x-\mu_1)(x-\mu_1)^T\right]+\frac{1}{\nu_2}\mathrm{tr}\left[\Sigma_2^{-1}(\mu_1-\mu_2)(\mu_1-\mu_2)^T\right] \right\}\right\}\\
&=\log\left\{1+\frac{1}{\nu_2}\mathrm{tr}(\Sigma_2^{-1}\tilde\Sigma_1)+\frac{1}{\nu_2}\mathrm{tr}\left[\Sigma_2^{-1}(\mu_1-\mu_2)(\mu_1-\mu_2)^T\right] \right\}
\end{aligned}
$$
where $\tilde\Sigma_1$ denotes the covariance matrix of the multivariate t-distribution $p(x)$. It is related to the correlation matrix by

$$
\tilde\Sigma_1=\frac{\nu_1}{\nu_1-2}\Sigma_1
$$

This relation requires the degrees of freedom $\nu_1$ of $p(x)$ to satisfy

$$
\nu_1>2
$$
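The Jensen term can be computed directly from the parameters. The following is a minimal sketch, assuming NumPy; `scale_to_covariance` and `jensen_log_term` are hypothetical names. It uses the fact that $\mathrm{tr}[\Sigma_2^{-1}dd^T]=d^T\Sigma_2^{-1}d$ for $d=\mu_1-\mu_2$.

```python
import math
import numpy as np

def scale_to_covariance(Sigma, nu):
    """Covariance nu / (nu - 2) * Sigma of a multivariate t; defined only for nu > 2."""
    if nu <= 2:
        raise ValueError("the covariance exists only for nu > 2")
    return nu / (nu - 2.0) * np.asarray(Sigma, dtype=float)

def jensen_log_term(mu1, Sigma1, nu1, mu2, Sigma2, nu2):
    """log{1 + [tr(Sigma2^{-1} cov1) + d^T Sigma2^{-1} d] / nu2} with d = mu1 - mu2."""
    cov1 = scale_to_covariance(Sigma1, nu1)
    Sigma2 = np.asarray(Sigma2, dtype=float)
    d = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    trace_term = np.trace(np.linalg.solve(Sigma2, cov1))
    mean_term = d @ np.linalg.solve(Sigma2, d)
    return math.log1p((trace_term + mean_term) / nu2)

I2 = np.eye(2)
# Identical parameters with nu = 4: cov1 = 2*I, trace term = 4, result = log(1 + 4/4)
print(jensen_log_term([0, 0], I2, 4.0, [0, 0], I2, 4.0), math.log(2))
```

Raising an error for $\nu_1\le 2$ mirrors the condition stated above: without a finite covariance the expectation inside the logarithm does not exist.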
This gives an upper bound on the KL divergence between two multivariate t-distributions:

$$
\begin{aligned}
&\quad D_{KL}(p(x)\|q(x))=\mathbb{E}_{p(x)}[\log p(x)-\log q(x)]\\
&=\frac{1}{2}\log \frac{\det \Sigma_2}{\det \Sigma_1}+\frac{n}{2}\log \frac{\nu_2}{\nu_1}+\log \Gamma\left(\frac{\nu_1+n}{2}\right)-\log \Gamma\left(\frac{\nu_1}{2}\right)-\log \Gamma\left(\frac{\nu_2+n}{2}\right)+\log \Gamma\left(\frac{\nu_2}{2}\right)\\
&\quad -\frac{\nu_1+n}{2}\mathbb{E}_{p(x)}\left\{\log\left[1+\frac{1}{\nu_1}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right]\right\}\\
&\quad +\frac{\nu_2+n}{2}\mathbb{E}_{p(x)}\left\{\log\left[1+\frac{1}{\nu_2}(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right]\right\}\\
&\leq \frac{1}{2}\log \frac{\det \Sigma_2}{\det \Sigma_1}+\frac{n}{2}\log \frac{\nu_2}{\nu_1}+\log \Gamma\left(\frac{\nu_1+n}{2}\right)-\log \Gamma\left(\frac{\nu_1}{2}\right)-\log \Gamma\left(\frac{\nu_2+n}{2}\right)+\log \Gamma\left(\frac{\nu_2}{2}\right)\\
&\quad -\frac{\nu_1+n}{2}\left[\psi\left(\frac{\nu_1+n}{2}\right)-\psi\left(\frac{\nu_1}{2}\right)\right]\\
&\quad +\frac{\nu_2+n}{2}\log\left\{1+\frac{1}{\nu_2}\mathrm{tr}(\Sigma_2^{-1}\tilde\Sigma_1)+\frac{1}{\nu_2}\mathrm{tr}\left[\Sigma_2^{-1}(\mu_1-\mu_2)(\mu_1-\mu_2)^T\right] \right\}
\end{aligned}
$$
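The bound can be evaluated term by term. The sketch below assumes NumPy for the linear algebra; `kl_upper_bound` is a hypothetical name, and `digamma` is a hand-written approximation (recurrence plus asymptotic series) rather than a library routine.

```python
import math
import numpy as np

def digamma(x):
    """Approximate psi(x) for x > 0 via the recurrence and the asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 * (1 / 252 - inv2 / 240)))
    return acc + math.log(x) - 0.5 / x - series

def kl_upper_bound(mu1, Sigma1, nu1, mu2, Sigma2, nu2):
    """Upper bound on D_KL(p || q) for p = St(mu1, Sigma1, nu1), q = St(mu2, Sigma2, nu2)."""
    if nu1 <= 2:
        raise ValueError("the bound requires nu1 > 2")
    mu1 = np.atleast_1d(np.asarray(mu1, dtype=float))
    mu2 = np.atleast_1d(np.asarray(mu2, dtype=float))
    Sigma1 = np.atleast_2d(np.asarray(Sigma1, dtype=float))
    Sigma2 = np.atleast_2d(np.asarray(Sigma2, dtype=float))
    n = mu1.shape[0]
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    # Terms that do not involve an expectation
    const = (0.5 * (logdet2 - logdet1) + 0.5 * n * math.log(nu2 / nu1)
             + math.lgamma(0.5 * (nu1 + n)) - math.lgamma(0.5 * nu1)
             - math.lgamma(0.5 * (nu2 + n)) + math.lgamma(0.5 * nu2))
    # Exact expectation under p of the log term of p
    exact = digamma(0.5 * (nu1 + n)) - digamma(0.5 * nu1)
    # Jensen bound on the expectation under p of the log term of q
    cov1 = nu1 / (nu1 - 2.0) * Sigma1
    d = mu1 - mu2
    inner = np.trace(np.linalg.solve(Sigma2, cov1)) + d @ np.linalg.solve(Sigma2, d)
    jensen = math.log1p(inner / nu2)
    return float(const - 0.5 * (nu1 + n) * exact + 0.5 * (nu2 + n) * jensen)

# p = q with n = 1, nu = 3: the true divergence is 0, the bound is 2 * (1 - log 2)
print(kl_upper_bound([0.0], [[1.0]], 3.0, [0.0], [[1.0]], 3.0))
```

The example shows that the bound is not tight: for identical $p$ and $q$ with $n=1$ and $\nu=3$ the true divergence is $0$, while the bound evaluates to $2(1-\log 2)\approx 0.614$, the gap being entirely due to the Jensen step.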
References:
[1]: A Novel Kullback-Leilber Divergence Minimization-Based Adaptive Student's t-Filter. https://www.researchgate.net/publication/335580775_A_Novel_Kullback-Leilber_Divergence_Minimization-Based_Adaptive_Student%27s_t-Filter
[2]: Kotz S, Nadarajah S. Multivariate t Distributions and Their Applications. Cambridge University Press, 2004.