The origin of the EM algorithm: the previous article described the concrete steps of the EM algorithm and worked through a detailed example, but it did not give a proof of why the EM algorithm works. Here we derive that justification. Suppose we have observations $X$ and want to estimate a parameter $\theta$. Maximum likelihood estimation (MLE) solves

$$\arg\max_{\theta}L(\theta;X).$$

If there is a latent variable $Z$, and we have a model for $Z$, the maximum likelihood problem can be rewritten as
$$
\begin{aligned}
\arg\max_{\theta}L(\theta;X)&=\arg\max_{\theta}\ln P(X|\theta)\\
&=\arg\max_{\theta}\ln\Big[\sum_{Z}P(X|Z_i,\theta)\,P(Z_i|\theta)\Big]
\end{aligned}
$$
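As a concrete illustration (a hypothetical instance, not part of the derivation itself): if $Z_i$ picks the component of a $K$-component mixture with weights $\pi_k$ and component densities $p(X|\phi_k)$, so that $\theta=(\pi_1,\dots,\pi_K,\phi_1,\dots,\phi_K)$, then $P(Z_i=k|\theta)=\pi_k$, $P(X|Z_i=k,\theta)=p(X|\phi_k)$, and the objective becomes

$$
\arg\max_{\theta}\,\ln\Big[\sum_{k=1}^{K}\pi_k\,p(X|\phi_k)\Big].
$$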
This objective is awkward to maximize directly: the sum over the latent variable sits inside the logarithm, and enumerating $P(Z|\theta)$ and $P(X|Z,\theta)$ over all configurations of $Z$ is not easy. So we resort to an iterative approximation. Suppose we currently have a parameter estimate $\theta^t$. Let us first compute the improvement over the current estimate:
$$
\begin{aligned}
L(\theta)-L(\theta^t)&=\ln\Big[\sum_{Z}P(X|Z_i,\theta)\,P(Z_i|\theta)\Big]-\ln P(X|\theta^t)\\
&=\ln\Big[\sum_{Z}P(Z_i|X,\theta^t)\,\frac{P(X|Z_i,\theta)\,P(Z_i|\theta)}{P(Z_i|X,\theta^t)}\Big]-\ln P(X|\theta^t)\\
&\geqslant \sum_{Z}P(Z_i|X,\theta^t)\,\ln\Big[\frac{P(X|Z_i,\theta)\,P(Z_i|\theta)}{P(Z_i|X,\theta^t)}\Big]-\ln P(X|\theta^t)\quad\text{(Jensen's inequality)}\\
&=\sum_{Z}P(Z_i|X,\theta^t)\,\ln\Big[\frac{P(X|Z_i,\theta)\,P(Z_i|\theta)}{P(Z_i|X,\theta^t)\,P(X|\theta^t)}\Big]
\end{aligned}
$$
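The inequality in the third line is Jensen's inequality for the concave function $\ln$: for weights $\lambda_i\geqslant 0$ with $\sum_i\lambda_i=1$,

$$
\ln\Big(\sum_i \lambda_i x_i\Big)\;\geqslant\;\sum_i \lambda_i \ln x_i,
$$

applied here with $\lambda_i=P(Z_i|X,\theta^t)$, which is non-negative and sums to one over $Z$. The last line then pulls $\ln P(X|\theta^t)$ inside the sum using $\sum_{Z}P(Z_i|X,\theta^t)=1$.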
Define $B(\theta,\theta^t)$ as
$$
B(\theta,\theta^t)=\sum_{Z}P(Z_i|X,\theta^t)\,\ln\Big[\frac{P(X|Z_i,\theta)\,P(Z_i|\theta)}{P(Z_i|X,\theta^t)\,P(X|\theta^t)}\Big]+L(\theta^t)
$$
From this we see that $L(\theta)\geqslant B(\theta,\theta^t)$. In other words, $B(\theta,\theta^t)$ is a lower bound on $L(\theta)$: if we can solve $\arg\max_{\theta}B(\theta,\theta^t)$, we push this lower bound up and thereby drive $L(\theta)$ upward.
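A short check, using only the quantities already defined, shows why maximizing the bound also improves the likelihood (this is the standard monotonicity argument for EM): at $\theta=\theta^t$ the fraction inside the logarithm equals one, because $P(X|Z_i,\theta^t)P(Z_i|\theta^t)=P(Z_i|X,\theta^t)P(X|\theta^t)$, so the bound is tight,

$$
B(\theta^t,\theta^t)=\sum_{Z}P(Z_i|X,\theta^t)\,\ln 1+L(\theta^t)=L(\theta^t).
$$

Hence, taking $\theta^{t+1}=\arg\max_{\theta}B(\theta,\theta^t)$ gives $L(\theta^{t+1})\geqslant B(\theta^{t+1},\theta^t)\geqslant B(\theta^t,\theta^t)=L(\theta^t)$: each iteration can only increase (or keep) the likelihood.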
$$
\begin{aligned}
\arg\max_{\theta}B(\theta,\theta^t)&=\arg\max_{\theta}\sum_{Z}P(Z_i|X,\theta^t)\,\ln\Big[\frac{P(X|Z_i,\theta)\,P(Z_i|\theta)}{P(Z_i|X,\theta^t)\,P(X|\theta^t)}\Big]+L(\theta^t)\\
&=\arg\max_{\theta}\sum_{Z}P(Z_i|X,\theta^t)\,\ln\big[P(X|Z_i,\theta)\,P(Z_i|\theta)\big]\\
&=\arg\max_{\theta}\sum_{Z}P(Z_i|X,\theta^t)\,\ln\big[P(X,Z_i|\theta)\big]
\end{aligned}
$$

The denominator $P(Z_i|X,\theta^t)P(X|\theta^t)$ and the term $L(\theta^t)$ do not depend on $\theta$, so they drop out of the $\arg\max$; the last line uses $P(X|Z_i,\theta)P(Z_i|\theta)=P(X,Z_i|\theta)$.
The sum that remains is exactly the $Q$ function,

$$
Q(\theta,\theta^t)=\sum_{Z}P(Z_i|X,\theta^t)\,\ln P(X,Z_i|\theta).
$$
Computing this expectation $Q(\theta,\theta^t)$ under the posterior $P(Z_i|X,\theta^t)$ is exactly the E step; maximizing it over $\theta$ to obtain $\theta^{t+1}$ is the M step. Iterating the two steps finally gives us the estimate of $\theta$.
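To make the E step / M step split concrete, here is a minimal runnable sketch of one EM iteration for a hypothetical 1-D Gaussian mixture; the model, the variable names (`pi`, `mu`, `sigma`), and the closed-form M step are assumptions for illustration, not taken from the derivation above.

```python
import numpy as np

def em_step(x, pi, mu, sigma):
    """One EM iteration for a 1-D Gaussian mixture (illustrative sketch).

    x:     (n,) observed data X
    pi:    (k,) current mixing weights  P(Z_i = k | theta^t)
    mu:    (k,) current component means
    sigma: (k,) current component standard deviations
    """
    # E step: responsibilities r[n, k] = P(Z_i = k | x_n, theta^t)
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    joint = pi * dens                      # P(x_n | Z=k, theta^t) * P(Z=k | theta^t)
    r = joint / joint.sum(axis=1, keepdims=True)

    # M step: closed-form maximizer of
    # Q(theta, theta^t) = sum_n sum_k r[n, k] * ln P(x_n, Z=k | theta)
    nk = r.sum(axis=0)
    pi_new = nk / len(x)
    mu_new = (r * x[:, None]).sum(axis=0) / nk
    sigma_new = np.sqrt((r * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk)

    # Observed-data log-likelihood L(theta^t) = sum_n ln sum_k P(x_n, Z=k | theta^t)
    loglik = np.log(joint.sum(axis=1)).sum()
    return pi_new, mu_new, sigma_new, loglik
```

Iterating `em_step` until `loglik` stops improving realizes $\theta^{t+1}=\arg\max_{\theta}Q(\theta,\theta^t)$ at every step, and by the lower-bound argument above `loglik` never decreases.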