Kullback–Leibler ($\mathrm{KL}$) loss
For discrete probability distributions $F(x)$ and $G(x)$ on $n$ support points $x_1,\dots,x_n$, the Kullback–Leibler ($\mathrm{KL}$) loss from $F(x)$ to $G(x)$ is defined[5] to be

$$\mathrm{KL}\{F(x)\|G(x)\} = \sum_{i=1}^{n} F(x_i)\log\frac{F(x_i)}{G(x_i)}.$$
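The discrete definition translates directly into code. Below is a minimal sketch (the function name and example vectors are illustrative, not from the text), assuming the two distributions are given as probability vectors over the same $n$ support points:

```python
import numpy as np

def kl_discrete(F, G):
    """KL loss from F to G: sum of F(x_i) * log(F(x_i) / G(x_i))."""
    F = np.asarray(F, dtype=float)
    G = np.asarray(G, dtype=float)
    # Terms with F(x_i) = 0 contribute nothing, by the usual
    # convention 0 * log(0 / g) = 0.
    mask = F > 0
    return float(np.sum(F[mask] * np.log(F[mask] / G[mask])))

print(kl_discrete([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
```

Note that the loss is zero when the two vectors coincide and, as discussed below, positive otherwise.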
For distributions $F(x)$ and $G(x)$ of a continuous random variable, the Kullback–Leibler ($\mathrm{KL}$) loss is defined to be

$$\mathrm{KL}\{F(x)\|G(x)\} = \int_{-\infty}^{\infty} f(x)\log\frac{f(x)}{g(x)}\,dx$$
where $f(x)$ and $g(x)$ are the density functions of $F(x)$ and $G(x)$, respectively.
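For a continuous pair the integral can be evaluated numerically. A sketch, assuming two univariate normal densities (the parameters are illustrative); for this family a closed form is also known, which gives a check on the quadrature:

```python
import numpy as np
from scipy.integrate import quad

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl_continuous(mu0, s0, mu1, s1):
    """KL{F||G} by numerical integration of f(x) * log(f(x)/g(x))."""
    def integrand(x):
        f = normal_pdf(x, mu0, s0)
        g = normal_pdf(x, mu1, s1)
        # Guard against underflow of f far in the tails: the term is 0 there.
        return f * np.log(f / g) if f > 0 else 0.0
    val, _ = quad(integrand, -np.inf, np.inf)
    return val

def kl_normal_closed(mu0, s0, mu1, s1):
    """Known closed form for two univariate normals, for comparison."""
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1) ** 2) / (2 * s1**2) - 0.5

print(kl_continuous(0.0, 1.0, 1.0, 2.0))
print(kl_normal_closed(0.0, 1.0, 1.0, 2.0))
```

The two values agree to quadrature precision, which is a useful sanity check when the closed form is available.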
The Kullback–Leibler loss is always non-negative, that is,

$$\mathrm{KL}\{F(x)\|G(x)\} \geqslant 0.$$
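Non-negativity can be illustrated empirically (a spot check, not a proof): draw random probability vectors and evaluate the discrete sum directly.

```python
import numpy as np

# Draw random pairs of probability mass functions from a Dirichlet
# distribution and record the KL loss for each pair.
rng = np.random.default_rng(0)
kls = []
for _ in range(1000):
    F = rng.dirichlet(np.ones(5))
    G = rng.dirichlet(np.ones(5))
    kls.append(float(np.sum(F * np.log(F / G))))

# Even the smallest observed value is non-negative.
print(min(kls))
```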
The Kullback–Leibler ($\mathrm{KL}$) loss $\mathrm{KL}\{F(x)\|G(x)\}$ is convex in the pair of probability mass functions $(f,g)$, i.e. if $(f_1,g_1)$ and $(f_2,g_2)$ are two pairs of probability mass functions, then

$$\mathrm{KL}\{\lambda f_{1}+(1-\lambda)f_{2}\,\|\,\lambda g_{1}+(1-\lambda)g_{2}\} \leq \lambda\,\mathrm{KL}(f_{1}\|g_{1})+(1-\lambda)\,\mathrm{KL}(f_{2}\|g_{2})$$

for $0\leq\lambda\leq 1$.
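The convexity inequality can likewise be checked numerically on random probability mass functions (again a spot check under illustrative inputs, not a proof):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    """Discrete KL loss between two strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

# Two random pairs of probability mass functions on 4 points.
f1, g1 = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
f2, g2 = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))

for lam in np.linspace(0.0, 1.0, 11):
    lhs = kl(lam * f1 + (1 - lam) * f2, lam * g1 + (1 - lam) * g2)
    rhs = lam * kl(f1, g1) + (1 - lam) * kl(f2, g2)
    # The mixture's KL loss never exceeds the mixed KL losses
    # (small tolerance for floating-point error).
    assert lhs <= rhs + 1e-12

print("convexity inequality holds for all sampled lambda")
```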
Example: multivariate normal distributions

Suppose that we have two multivariate normal distributions, with means $\mu_0, \mu_1$ and (nonsingular) covariance matrices $\Sigma_0, \Sigma_1$. If the two distributions have the same dimension $k$, then the Kullback–Leibler ($\mathrm{KL}$) loss between the distributions is as follows:

$$\mathrm{KL}(\mathcal{N}_{0}\|\mathcal{N}_{1}) = \frac{1}{2}\left\{\mathrm{tr}\left(\Sigma_{1}^{-1}\Sigma_{0}\right)+\left(\mu_{1}-\mu_{0}\right)^{\mathrm{T}}\Sigma_{1}^{-1}\left(\mu_{1}-\mu_{0}\right)-k+\log\left(\frac{\det\Sigma_{1}}{\det\Sigma_{0}}\right)\right\}.$$
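The closed form above is a direct translation into linear algebra. A sketch in numpy, with illustrative parameters (for small $k$; a Cholesky-based version would be preferred for large or ill-conditioned covariances):

```python
import numpy as np

def kl_mvn(mu0, S0, mu1, S1):
    """Closed-form KL loss between N(mu0, S0) and N(mu1, S1),
    term by term as in the formula above."""
    k = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0)              # tr(S1^{-1} S0)
                  + diff @ S1_inv @ diff             # quadratic term
                  - k                                # dimension
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu0, S0 = np.array([0.0, 0.0]), np.eye(2)
mu1, S1 = np.array([1.0, 0.0]), 2.0 * np.eye(2)
print(kl_mvn(mu0, S0, mu1, S1))
```

With these inputs the four terms are $\mathrm{tr}(0.5 I)=1$, quadratic term $0.5$, $-k=-2$, and $\log 4$, giving $\tfrac{1}{2}(\log 4 - 0.5)$.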