Maximizing likelihood is equivalent to minimizing KL-Divergence
As the title states, here is the proof:
Suppose we have two distributions, $p(x|\theta^*)$ and $p(x|\theta)$, representing the true data distribution and our estimated data distribution respectively. Then:
- KL Divergence

$$
\begin{aligned}
D_{KL}\big(p(x|\theta^*)\,\|\,p(x|\theta)\big)
&= E_{x\sim p(x|\theta^*)}\left[\log\frac{p(x|\theta^*)}{p(x|\theta)}\right] \\
&= E_{x\sim p(x|\theta^*)}[\log p(x|\theta^*)] - E_{x\sim p(x|\theta^*)}[\log p(x|\theta)] \\
&= -H\big(p(x|\theta^*)\big) - E_{x\sim p(x|\theta^*)}[\log p(x|\theta)]
\end{aligned}
$$
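This decomposition is easy to check numerically. Below is a minimal sketch; the two discrete distributions `p_true` and `p_est` are hypothetical examples standing in for $p(x|\theta^*)$ and $p(x|\theta)$:

```python
import numpy as np

p_true = np.array([0.1, 0.2, 0.3, 0.4])      # stands in for p(x|theta*)
p_est  = np.array([0.25, 0.25, 0.25, 0.25])  # stands in for p(x|theta)

# Direct definition: D_KL(p_true || p_est) = E_{x~p_true}[log p_true/p_est]
kl_direct = np.sum(p_true * np.log(p_true / p_est))

# Decomposed form: -H(p_true) - E_{x~p_true}[log p_est]
neg_entropy   = np.sum(p_true * np.log(p_true))   # -H(p(x|theta*))
cross_term    = -np.sum(p_true * np.log(p_est))   # -E_{x~p_true}[log p(x|theta)]
kl_decomposed = neg_entropy + cross_term

print(kl_direct, kl_decomposed)               # the two values agree
assert np.isclose(kl_direct, kl_decomposed)
```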
Looking at the decomposition, the first term does not depend on the parameter $\theta$, so:
$$
\begin{aligned}
\text{minimize KL Divergence}
&= \text{minimize}\;\left\{-H\big(p(x|\theta^*)\big) - E_{x\sim p(x|\theta^*)}[\log p(x|\theta)]\right\} \\
&= \text{minimize}\;\left[-E_{x\sim p(x|\theta^*)}[\log p(x|\theta)]\right] \\
&= \text{maximize}\;\left[E_{x\sim p(x|\theta^*)}[\log p(x|\theta)]\right] \\
&= \text{maximize likelihood}
\end{aligned}
$$
Here the last step holds because, in practice, the expectation $E_{x\sim p(x|\theta^*)}[\log p(x|\theta)]$ is approximated by the average of $\log p(x_i|\theta)$ over samples $x_i$ drawn from the true distribution, which (up to a constant factor $1/N$) is exactly the log-likelihood of the data. This completes the proof.
In this way, when we perform maximum likelihood estimation, we effectively obtain an approximation to the true data distribution.
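As a sanity check, here is a minimal numerical sketch of the conclusion. The unit-variance Gaussian family, the true mean of 2.0, and the grid of candidate means are hypothetical choices, not part of the proof: the mean that maximizes the average log-likelihood of samples drawn from the true distribution is essentially the same one that minimizes the closed-form KL divergence.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = 2.0
samples = rng.normal(loc=true_mu, scale=1.0, size=10_000)  # x ~ p(x|theta*)

mus = np.linspace(0.0, 4.0, 401)  # candidate values of theta (here, the mean)

# Average log-likelihood per sample: (1/N) sum_i log p(x_i|mu), up to constants
avg_loglik = np.array([-0.5 * np.mean((samples - mu) ** 2) for mu in mus])

# Closed-form KL between unit-variance Gaussians: KL = (mu - true_mu)^2 / 2
kl = 0.5 * (mus - true_mu) ** 2

mu_mle = mus[np.argmax(avg_loglik)]  # parameter that maximizes likelihood
mu_kl  = mus[np.argmin(kl)]          # parameter that minimizes KL divergence

print(mu_mle, mu_kl)  # both close to 2.0; they coincide as the sample size grows
```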