Preliminaries
Matrix inner product
Given two $m\times n$ matrices $\mathbf A$ and $\mathbf B$, their matrix inner product (also called the Frobenius inner product) is defined as:
$$\langle\mathbf A,\mathbf B\rangle=\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}b_{ij}=\mathrm{tr}(\mathbf A^T\mathbf B)$$
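As a quick numerical check that the elementwise sum and the trace form agree, here is a minimal MATLAB sketch. The matrices `A` and `B` are arbitrary example values chosen for illustration; they do not come from the text.

```matlab
% Two arbitrary 2x3 example matrices (illustrative values only)
A = [1 2 3; 4 5 6];
B = [0.5 -1 2; 3 0 -2];

% Elementwise definition: sum over all i, j of a_ij * b_ij
ip_sum = sum(sum(A .* B));

% Trace form: tr(A' * B)
ip_trace = trace(A' * B);

fprintf('sum form: %g, trace form: %g\n', ip_sum, ip_trace);
```

For these values both forms evaluate to 4.5.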
Distributions and expectations
Given a random vector $\mathbf x\sim\mathcal{N}(\boldsymbol\mu,\boldsymbol\Sigma)$ and a square matrix $\mathbf A$ of matching size, the following identities hold:
- $E[\mathbf x\mathbf x^T]=\boldsymbol\Sigma+\boldsymbol\mu\boldsymbol\mu^T$
- $E[\mathbf x^T\mathbf A\mathbf x]=\mathrm{tr}(\mathbf A\boldsymbol\Sigma)+\boldsymbol\mu^T\mathbf A\boldsymbol\mu$
  Proof: a scalar equals its own trace, and the trace is invariant under cyclic permutation, so
  $$E[\mathbf x^T\mathbf A\mathbf x]=E[\mathrm{tr}(\mathbf x^T\mathbf A\mathbf x)]=E[\mathrm{tr}(\mathbf A\mathbf x\mathbf x^T)]=\mathrm{tr}(\mathbf AE[\mathbf x\mathbf x^T])=\mathrm{tr}(\mathbf A(\boldsymbol\Sigma+\boldsymbol\mu\boldsymbol\mu^T))=\mathrm{tr}(\mathbf A\boldsymbol\Sigma)+\boldsymbol\mu^T\mathbf A\boldsymbol\mu$$
- $E[(\mathbf x-\boldsymbol\mu_1)^T\mathbf A(\mathbf x-\boldsymbol\mu_1)]=\mathrm{tr}(\mathbf A\boldsymbol\Sigma)+(\boldsymbol\mu-\boldsymbol\mu_1)^T\mathbf A(\boldsymbol\mu-\boldsymbol\mu_1)$ for any fixed vector $\boldsymbol\mu_1$
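The third identity is stated without proof. One way to see it is to shift the variable and reuse the second identity. This is a sketch; the auxiliary variable $\mathbf y$ is introduced here for illustration and does not appear in the original.

```latex
\begin{aligned}
\mathbf y &= \mathbf x-\boldsymbol\mu_1 \sim \mathcal N(\boldsymbol\mu-\boldsymbol\mu_1,\ \boldsymbol\Sigma) \\
E[(\mathbf x-\boldsymbol\mu_1)^T\mathbf A(\mathbf x-\boldsymbol\mu_1)]
  &= E[\mathbf y^T\mathbf A\mathbf y] \\
  &= \mathrm{tr}(\mathbf A\boldsymbol\Sigma)
     +(\boldsymbol\mu-\boldsymbol\mu_1)^T\mathbf A(\boldsymbol\mu-\boldsymbol\mu_1)
\end{aligned}
```

In the special case $\boldsymbol\mu_1=\boldsymbol\mu$ and $\mathbf A=\boldsymbol\Sigma^{-1}$, the right-hand side reduces to $\mathrm{tr}(\mathbf I)=n$, where $n$ is the dimension of $\mathbf x$.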
Definition of KLD
Given two continuous probability distributions with probability density functions $p(x)$ and $q(x)$, their Kullback-Leibler divergence (KLD) is defined as:
$$D_{KL}(P\|Q)=\int p(x)\log\left(\frac{p(x)}{q(x)}\right)dx$$
For discrete variables, given two probability distributions $P(x)$ and $Q(x)$, the KLD is defined as:
$$D_{KL}(P\|Q)=\sum_x P(x)\log\left(\frac{P(x)}{Q(x)}\right)$$
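To make the discrete definition concrete, the following MATLAB sketch evaluates it in both directions for two small distributions. The probability vectors `P` and `Q` are made-up example values, and the natural logarithm is assumed, so the result is in nats.

```matlab
% Two example discrete distributions over the same three outcomes
% (illustrative values; each sums to 1 and Q > 0 wherever P > 0)
P = [0.5 0.3 0.2];
Q = [0.4 0.4 0.2];

% D_KL(P||Q) = sum_x P(x) * log(P(x) / Q(x)), natural log
kl_pq = sum(P .* log(P ./ Q));

% Swapping the arguments gives D_KL(Q||P)
kl_qp = sum(Q .* log(Q ./ P));

fprintf('D_KL(P||Q) = %.6f\n', kl_pq);
fprintf('D_KL(Q||P) = %.6f\n', kl_qp);
```

Both values are positive but not equal, which illustrates that the KLD is not symmetric in its arguments.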
Univariate Gaussian distributions
Assume both continuous distributions are Gaussian, where $P$ has mean $\mu_1$ and variance $\sigma_1^2$, and $Q$ has mean $\mu_2$ and variance $\sigma_2^2$. The corresponding KLD can then be derived as follows, where the expectations are taken under $p$, so that $\mathrm{Var}(x)=\sigma_1^2$ and $E[(x-\mu_2)^2]=\mathrm{Var}(x)+(\mu_1-\mu_2)^2$:
$$
\begin{aligned}
D_{KL}(P\|Q)&=\int p(x)\log\left(\frac{p(x)}{q(x)}\right)dx \\
&=\int \frac{1}{\sqrt{2\pi\sigma_1^2}}e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}\left[\log\left(\frac{\sigma_2}{\sigma_1}\right)-\frac{(x-\mu_1)^2}{2\sigma_1^2}+\frac{(x-\mu_2)^2}{2\sigma_2^2}\right]dx \\
&=\log\left(\frac{\sigma_2}{\sigma_1}\right)-\frac{\mathrm{Var}(x)}{2\sigma_1^2}+\frac{\mathrm{Var}(x)+(\mu_1-\mu_2)^2}{2\sigma_2^2} \\
&=\log\left(\frac{\sigma_2}{\sigma_1}\right)-\frac{1}{2}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}
\end{aligned}
$$
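The closed form for the univariate case can be checked against a direct numerical evaluation of the defining integral. The sketch below is one way to do this in MATLAB; the parameter values, the grid width of 12 standard deviations, and the grid size are assumptions chosen for illustration.

```matlab
% Example parameters (illustrative): means and standard deviations of P and Q
mu1 = 0.5; s1 = 1.2;
mu2 = 0.0; s2 = 0.9;

% Closed form: log(s2/s1) - 1/2 + (s1^2 + (mu1 - mu2)^2) / (2 * s2^2)
kl_closed = log(s2 / s1) - 0.5 + (s1^2 + (mu1 - mu2)^2) / (2 * s2^2);

% Numerical integration of p(x) * (log p(x) - log q(x)) on a wide grid.
% The log-densities are written out directly to avoid log(0) in the tails.
x = linspace(mu1 - 12 * s1, mu1 + 12 * s1, 200001);
logp = -0.5 * log(2 * pi * s1^2) - (x - mu1).^2 / (2 * s1^2);
logq = -0.5 * log(2 * pi * s2^2) - (x - mu2).^2 / (2 * s2^2);
kl_numeric = trapz(x, exp(logp) .* (logp - logq));

fprintf('closed form: %.6f, numeric: %.6f\n', kl_closed, kl_numeric);
```

Working with log-densities rather than the ratio `p ./ q` keeps the integrand finite far from the mean, where both densities underflow to zero.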
Multivariate Gaussian distributions
For an $n$-dimensional random vector $\mathbf x$, assume $P$ and $Q$ follow $\mathcal{N}(\boldsymbol\mu_1,\boldsymbol\Sigma_1)$ and $\mathcal{N}(\boldsymbol\mu_2,\boldsymbol\Sigma_2)$ respectively. The KLD is then derived as follows, where the expectations are taken under $p$ and evaluated with the quadratic-form identities from the preliminaries:
$$
\begin{aligned}
&D_{KL}(P\|Q)\\
=&\int_{\mathbb{R}^n} p(\mathbf x)\log\left(\frac{p(\mathbf x)}{q(\mathbf x)}\right)d\mathbf x \\
=&\int p(\mathbf x)\log(p(\mathbf x))d\mathbf x-\int p(\mathbf x)\log(q(\mathbf x))d\mathbf x \\
=&-\frac{1}{2}\left(\log((2\pi)^n|\boldsymbol\Sigma_1|)+E[(\mathbf x-\boldsymbol\mu_1)^T\boldsymbol\Sigma_1^{-1}(\mathbf x-\boldsymbol\mu_1)]\right)+\frac{1}{2}\left(\log((2\pi)^n|\boldsymbol\Sigma_2|)+E[(\mathbf x-\boldsymbol\mu_2)^T\boldsymbol\Sigma_2^{-1}(\mathbf x-\boldsymbol\mu_2)]\right) \\
=&\frac{1}{2}\left[\log\left(\frac{|\boldsymbol\Sigma_2|}{|\boldsymbol\Sigma_1|}\right)-n+\mathrm{tr}(\boldsymbol\Sigma_2^{-1}\boldsymbol\Sigma_1)+(\boldsymbol\mu_1-\boldsymbol\mu_2)^T\boldsymbol\Sigma_2^{-1}(\boldsymbol\mu_1-\boldsymbol\mu_2)\right]\\
=&\frac{1}{2}\left[\langle\boldsymbol\Sigma_2^{-1},\boldsymbol\Sigma_1\rangle+\|\boldsymbol\mu_1-\boldsymbol\mu_2\|^2_{\boldsymbol\Sigma_2^{-1}}-\log(|\boldsymbol\Sigma_2|^{-1}|\boldsymbol\Sigma_1|)-n\right]
\end{aligned}
$$
Here $\|\mathbf v\|^2_{\mathbf M}=\mathbf v^T\mathbf M\mathbf v$ denotes the squared norm weighted by $\mathbf M$, and $\langle\boldsymbol\Sigma_2^{-1},\boldsymbol\Sigma_1\rangle=\mathrm{tr}(\boldsymbol\Sigma_2^{-1}\boldsymbol\Sigma_1)$ because $\boldsymbol\Sigma_2^{-1}$ is symmetric.
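As a consistency check, the multivariate result should reduce to the univariate formula when $n=1$. The following worked step is a sketch under the substitution $\boldsymbol\Sigma_1=\sigma_1^2$, $\boldsymbol\Sigma_2=\sigma_2^2$ (scalars), with $\mu_1$, $\mu_2$ the scalar means:

```latex
\begin{aligned}
D_{KL}(P\|Q)
&=\frac{1}{2}\left[\log\left(\frac{\sigma_2^2}{\sigma_1^2}\right)-1
  +\frac{\sigma_1^2}{\sigma_2^2}+\frac{(\mu_1-\mu_2)^2}{\sigma_2^2}\right] \\
&=\log\left(\frac{\sigma_2}{\sigma_1}\right)-\frac{1}{2}
  +\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}
\end{aligned}
```

The last line matches the univariate Gaussian result term by term.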
Test and verification
% Generate sample data.
% Note: cal_KLD squares sigma_p and sigma_q elementwise, so the diagonal
% entries below are standard deviations, not variances.
% case 1: identical distributions
mu_p = [0.5, 1.0]';
sigma_p = diag([1.2, 0.8]);
mu_q = [0.5, 1.0]';
sigma_q = diag([1.2, 0.8]);
% case 2: different means and standard deviations
% mu_p = [0.5, 1.0]';
% sigma_p = diag([1.2, 0.8]);
% mu_q = [0.0, 1.5]';
% sigma_q = diag([0.9, 1.1]);
% Calculate KL divergence
kld = cal_KLD(mu_p, sigma_p, mu_q, sigma_q);
% Print the result
disp(['KL divergence: ', num2str(kld)]);
% case 1 output: 0
% case 2 output: 0.44175
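function kld = cal_KLD(mu_p, sigma_p, mu_q, sigma_q)
% KL divergence D_KL(P||Q) between two Gaussians with diagonal covariance.
% sigma_p and sigma_q are diagonal matrices of standard deviations.
tiny = 1e-8;                            % guards log() against a zero determinant
cov_p = sigma_p .^ 2;                   % elementwise square: valid for diagonal input only
cov_q = sigma_q .^ 2;
delta_u = mu_q - mu_p;
term1 = trace(cov_q \ cov_p);           % tr(Sigma_q^{-1} * Sigma_p)
term2 = delta_u' * (cov_q \ delta_u);   % (mu_p - mu_q)' * Sigma_q^{-1} * (mu_p - mu_q)
term3 = -length(mu_p);                  % -n
term4 = log(det(cov_q) + tiny) - log(det(cov_p) + tiny);   % approx. log(|Sigma_q| / |Sigma_p|)
kld = 0.5 * (term1 + term2 + term3 + term4);
kld = max(kld, 0);                      % clamp small negative round-off
end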
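The `cal_KLD` function squares its inputs elementwise, so it only applies when the covariance is diagonal and given as standard deviations. The sketch below is a hypothetical variant, not part of the original code, that takes full covariance matrices directly and computes the log-determinants from Cholesky factors. All numeric values are illustrative assumptions.

```matlab
% Full (non-diagonal) covariance example; all values are illustrative
mu_p = [0.5; 1.0];  S_p = [1.44 0.30; 0.30 0.64];
mu_q = [0.0; 1.5];  S_q = [0.81 -0.10; -0.10 1.21];

% Cholesky factors: S = R' * R, so log|S| = 2 * sum(log(diag(R)))
R_p = chol(S_p);
R_q = chol(S_q);
logdet_p = 2 * sum(log(diag(R_p)));
logdet_q = 2 * sum(log(diag(R_q)));

% 0.5 * [tr(S_q^{-1} S_p) + d' S_q^{-1} d - n + log(|S_q| / |S_p|)]
d = mu_p - mu_q;
kld = 0.5 * (trace(S_q \ S_p) + d' * (S_q \ d) - numel(mu_p) + logdet_q - logdet_p);

fprintf('KL divergence (full covariance): %.6f\n', kld);
```

Taking the log-determinant from the Cholesky diagonal avoids forming `det` explicitly, which can underflow or overflow in higher dimensions; `chol` also raises an error if a matrix is not positive definite, which serves as an input check.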