The EM Algorithm
Notes organized from Li Hang's book.
The EM algorithm is an iterative algorithm for maximum likelihood estimation (or maximum a posteriori estimation) of the parameters of probabilistic models with latent variables. Each iteration consists of two steps: the E-step, which computes an expectation, and the M-step, which performs a maximization. Only maximum likelihood estimation is discussed below.
The EM algorithm is introduced through an example.
Suppose there are three coins, labeled A, B, and C, whose probabilities of landing heads are $\pi$, $p$, and $q$ respectively. Conduct the following coin-tossing experiment: first toss coin A; if it comes up heads, toss coin B next, and if tails, toss coin C. Record the result of the second toss, writing heads as 1 and tails as 0. Repeat the experiment independently $n$ times (here $n=10$), with the following results:
$$1,1,0,1,0,0,1,0,1,1$$
Suppose we can only observe the outcome of each trial, not the tossing process itself; estimate $\pi$, $p$, and $q$.
The three-coin model can be written as
$$P\left( y\mid\theta \right) = \sum_z P\left( y,z\mid\theta \right) = \sum_z P\left( z\mid\theta \right)P\left( y\mid z,\theta \right) = \pi p^{y}\left( 1-p \right)^{1-y} + \left( 1-\pi \right)q^{y}\left( 1-q \right)^{1-y}$$
Here the random variable $y$ is the observed variable, the 1-or-0 result of a single trial; the random variable $z$ is the latent variable, the unobserved result of tossing coin A; and $\theta = \left( \pi ,p,q \right)$ is the model parameter.
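The generative process described by this model can be simulated directly. A minimal sketch in Python (the function name and seed handling are my own; the parameter values are only illustrative):

```python
import random

def toss_three_coins(pi, p, q, n, seed=0):
    """Simulate n trials of the three-coin experiment: toss A (heads
    probability pi); on heads toss B (heads probability p), on tails
    toss C (heads probability q); record the second toss as 1/0."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        if rng.random() < pi:        # coin A came up heads -> use coin B
            results.append(1 if rng.random() < p else 0)
        else:                        # coin A came up tails -> use coin C
            results.append(1 if rng.random() < q else 0)
    return results

sample = toss_three_coins(pi=0.5, p=0.6, q=0.6, n=10)
print(sample)  # a list of ten 0/1 outcomes
```

Only the list of 0/1 outcomes is retained, mirroring the assumption that the result of coin A is never observed.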
We then seek the maximum likelihood estimate of the model parameter $\theta = \left( \pi ,p,q \right)$, i.e.
$$\widehat{\theta} = \arg\max_{\theta} \log P\left( Y\mid\theta \right)$$
This problem has no analytical solution and can only be solved iteratively. The EM algorithm is one such iterative algorithm for this class of problems.
Steps of the EM algorithm:
Input: observed variable data $Y$, latent variable data $Z$, joint distribution $P\left( Y,Z\mid\theta \right)$, conditional distribution $P\left( Z\mid Y,\theta \right)$;
Output: model parameter $\theta$.
1) Choose an initial value $\theta^{(0)}$ for the parameter and start iterating. The initial value may be chosen arbitrarily, but the EM algorithm is sensitive to it.
2) E-step: let $\theta^{(i)}$ denote the estimate of the parameter $\theta$ from the $i$-th iteration. In the E-step of the $(i+1)$-th iteration, compute
$$Q\left( \theta ,\theta^{(i)} \right) = E_{Z}\left[ \log P\left( Y,Z\mid\theta \right)\mid Y,\theta^{(i)} \right] = \sum_Z \log P\left( Y,Z\mid\theta \right)P\left( Z\mid Y,\theta^{(i)} \right)$$
The function $Q\left( \theta ,\theta^{(i)} \right)$ is the core of the EM algorithm and is called the Q function.
3) M-step: find the $\theta$ that maximizes $Q\left( \theta ,\theta^{(i)} \right)$, and take it as the parameter estimate $\theta^{(i+1)}$ of the $(i+1)$-th iteration:
$$\theta^{(i+1)} = \arg\max_{\theta} Q\left( \theta ,\theta^{(i)} \right)$$
4) Repeat steps 2 and 3 until convergence.
Definition of the Q function: the expectation of the complete-data log-likelihood $\log P\left( Y,Z\mid\theta \right)$ with respect to the conditional distribution $P\left( Z\mid Y,\theta^{(i)} \right)$ of the unobserved data $Z$, given the observed data $Y$ and the current parameter $\theta^{(i)}$, is called the Q function.
Solving the three-coin problem above:
1) Choose the initial parameter value $\theta^{(0)} = \left( 0.5,0.5,0.5 \right)$.
2) E-step:
$$Q\left( \theta ,\theta^{(i)} \right) = \sum_Z \log P\left( Y,Z\mid\theta \right)P\left( Z\mid Y,\theta^{(i)} \right) = \sum_{j=1}^{10}\sum_{k=1}^{2}\log P\left( y_j,z_k\mid\theta \right)P\left( z_k\mid y_j,\theta^{(i)} \right)$$
$$= \sum_{j=1}^{10}\left[ \log\left( \pi^{(i+1)}\left( p^{(i+1)} \right)^{y_j}\left( 1-p^{(i+1)} \right)^{1-y_j} \right)\frac{\pi^{(i)}\left( p^{(i)} \right)^{y_j}\left( 1-p^{(i)} \right)^{1-y_j}}{\pi^{(i)}\left( p^{(i)} \right)^{y_j}\left( 1-p^{(i)} \right)^{1-y_j}+\left( 1-\pi^{(i)} \right)\left( q^{(i)} \right)^{y_j}\left( 1-q^{(i)} \right)^{1-y_j}} + \log\left( \left( 1-\pi^{(i+1)} \right)\left( q^{(i+1)} \right)^{y_j}\left( 1-q^{(i+1)} \right)^{1-y_j} \right)\frac{\left( 1-\pi^{(i)} \right)\left( q^{(i)} \right)^{y_j}\left( 1-q^{(i)} \right)^{1-y_j}}{\pi^{(i)}\left( p^{(i)} \right)^{y_j}\left( 1-p^{(i)} \right)^{1-y_j}+\left( 1-\pi^{(i)} \right)\left( q^{(i)} \right)^{y_j}\left( 1-q^{(i)} \right)^{1-y_j}} \right]$$
3) M-step: find the $\theta$ that maximizes $Q\left( \theta ,\theta^{(i)} \right)$. Let
$$\mu_j^{(i+1)} = \frac{\pi^{(i)}\left( p^{(i)} \right)^{y_j}\left( 1-p^{(i)} \right)^{1-y_j}}{\pi^{(i)}\left( p^{(i)} \right)^{y_j}\left( 1-p^{(i)} \right)^{1-y_j}+\left( 1-\pi^{(i)} \right)\left( q^{(i)} \right)^{y_j}\left( 1-q^{(i)} \right)^{1-y_j}}$$
i.e. the posterior probability, under the current parameters, that observation $y_j$ came from coin B.
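This responsibility $\mu_j^{(i+1)}$ can be computed directly from the formula above. A minimal sketch in Python (the function name is my own):

```python
def responsibility(y_j, pi, p, q):
    """Posterior probability that observation y_j was generated by coin B,
    given current parameters (pi, p, q) -- the mu_j of the E-step."""
    num = pi * p**y_j * (1 - p)**(1 - y_j)
    den = num + (1 - pi) * q**y_j * (1 - q)**(1 - y_j)
    return num / den

# With the symmetric initial value theta^(0) = (0.5, 0.5, 0.5),
# every observation is equally likely to have come from B or C:
print(responsibility(1, 0.5, 0.5, 0.5))  # 0.5
print(responsibility(0, 0.5, 0.5, 0.5))  # 0.5
```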
$$\frac{\partial Q}{\partial \pi^{(i+1)}} = \sum_{j=1}^{10}\left( \frac{\mu_j^{(i+1)}}{\pi^{(i+1)}} - \frac{1-\mu_j^{(i+1)}}{1-\pi^{(i+1)}} \right) = 0$$
$$\Rightarrow \sum_{j=1}^{10}\left[ \mu_j^{(i+1)}\left( 1-\pi^{(i+1)} \right) - \left( 1-\mu_j^{(i+1)} \right)\pi^{(i+1)} \right] = 0$$
$$\Rightarrow \pi^{(i+1)} = \frac{1}{10}\sum_{j=1}^{10}\mu_j^{(i+1)}$$
$$\frac{\partial Q}{\partial p^{(i+1)}} = \sum_{j=1}^{10}\frac{\mu_j^{(i+1)}\left( y_j\left( p^{(i+1)} \right)^{y_j-1}\left( 1-p^{(i+1)} \right)^{1-y_j} - \left( 1-y_j \right)\left( p^{(i+1)} \right)^{y_j}\left( 1-p^{(i+1)} \right)^{-y_j} \right)}{\left( p^{(i+1)} \right)^{y_j}\left( 1-p^{(i+1)} \right)^{1-y_j}} = 0$$
$$\Rightarrow \sum_{j=1}^{10}\mu_j^{(i+1)}\left[ y_j\left( 1-p^{(i+1)} \right) - \left( 1-y_j \right)p^{(i+1)} \right] = 0$$
$$\Rightarrow \sum_{j=1}^{10}\mu_j^{(i+1)}y_j = p^{(i+1)}\sum_{j=1}^{10}\mu_j^{(i+1)} \Rightarrow p^{(i+1)} = \frac{\sum_{j=1}^{10}\mu_j^{(i+1)}y_j}{\sum_{j=1}^{10}\mu_j^{(i+1)}}$$
Similarly, setting $\frac{\partial Q}{\partial q^{(i+1)}} = 0$ gives
$$q^{(i+1)} = \frac{\sum_{j=1}^{10}\left( 1-\mu_j^{(i+1)} \right)y_j}{\sum_{j=1}^{10}\left( 1-\mu_j^{(i+1)} \right)}$$
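Combining the E-step responsibility with the three closed-form updates gives a complete EM iteration for this data set. A minimal sketch in Python (function names are my own), started from $\theta^{(0)} = (0.5, 0.5, 0.5)$ as above:

```python
def em_step(y, pi, p, q):
    """One EM iteration for the three-coin model."""
    # E-step: posterior probability mu_j that each y_j came from coin B
    mu = [pi * p**yj * (1 - p)**(1 - yj) /
          (pi * p**yj * (1 - p)**(1 - yj) + (1 - pi) * q**yj * (1 - q)**(1 - yj))
          for yj in y]
    # M-step: the closed-form updates derived above
    pi_new = sum(mu) / len(y)
    p_new = sum(m * yj for m, yj in zip(mu, y)) / sum(mu)
    q_new = sum((1 - m) * yj for m, yj in zip(mu, y)) / sum(1 - m for m in mu)
    return pi_new, p_new, q_new

y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
theta = (0.5, 0.5, 0.5)
for _ in range(10):
    theta = em_step(y, *theta)
print(theta)  # (0.5, 0.6, 0.6)
```

From this symmetric initial value every $\mu_j$ equals 0.5, so the iteration reaches the fixed point $(\pi, p, q) = (0.5, 0.6, 0.6)$ after a single step; a different initial value generally leads to a different fixed point, illustrating the initial-value sensitivity mentioned earlier.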
Convergence of the EM Algorithm
Theorem 1: Let $P\left( Y\mid\theta \right)$ be the likelihood function of the observed data, let $\theta^{(i)}\ \left( i=1,2,\cdots \right)$ be the sequence of parameter estimates produced by the EM algorithm, and let $P\left( Y\mid\theta^{(i)} \right)\ \left( i=1,2,\cdots \right)$ be the corresponding sequence of likelihood values. Then $P\left( Y\mid\theta^{(i)} \right)$ is monotonically non-decreasing, i.e.
$$P\left( Y\mid\theta^{(i+1)} \right) \geqslant P\left( Y\mid\theta^{(i)} \right)$$
Theorem 2: Let $L\left( \theta \right) = \log P\left( Y\mid\theta \right)$ be the log-likelihood of the observed data, let $\theta^{(i)}\ \left( i=1,2,\cdots \right)$ be the sequence of parameter estimates produced by the EM algorithm, and let $L\left( \theta^{(i)} \right)\ \left( i=1,2,\cdots \right)$ be the corresponding sequence of log-likelihood values.
1) If $\log P\left( Y\mid\theta \right)$ is bounded above, then $L\left( \theta^{(i)} \right) = \log P\left( Y\mid\theta^{(i)} \right)$ converges to some value $L^{*}$;
2) Under certain conditions on the functions $Q\left( \theta ,\theta' \right)$ and $L\left( \theta \right)$, the limit $\theta^{*}$ of the parameter estimate sequence $\theta^{(i)}$ produced by the EM algorithm is a stationary point of $L\left( \theta \right)$.
The convergence of the EM algorithm involves two distinct aspects: convergence of the log-likelihood sequence $L\left( \theta^{(i)} \right)$ and convergence of the parameter estimate sequence $\theta^{(i)}$; the former does not imply the latter. The theorems only guarantee convergence to a stationary point of the log-likelihood, not to a maximum point.
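Theorem 1 can be checked numerically on the three-coin example above: the observed-data log-likelihood never decreases across EM iterations. A minimal sketch in Python (function names are my own; the asymmetric starting point is arbitrary):

```python
import math

def log_likelihood(y, pi, p, q):
    """Incomplete-data log-likelihood log P(Y | theta) for the three-coin model."""
    return sum(math.log(pi * p**yj * (1 - p)**(1 - yj)
                        + (1 - pi) * q**yj * (1 - q)**(1 - yj)) for yj in y)

def em_step(y, pi, p, q):
    """One EM iteration: E-step responsibilities, then closed-form M-step."""
    mu = [pi * p**yj * (1 - p)**(1 - yj) /
          (pi * p**yj * (1 - p)**(1 - yj) + (1 - pi) * q**yj * (1 - q)**(1 - yj))
          for yj in y]
    return (sum(mu) / len(y),
            sum(m * yj for m, yj in zip(mu, y)) / sum(mu),
            sum((1 - m) * yj for m, yj in zip(mu, y)) / sum(1 - m for m in mu))

y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
theta = (0.4, 0.6, 0.7)   # an arbitrary asymmetric starting point
lls = []
for _ in range(20):
    lls.append(log_likelihood(y, *theta))
    theta = em_step(y, *theta)
assert all(b >= a - 1e-12 for a, b in zip(lls, lls[1:]))  # monotone non-decreasing
print(lls[0], lls[-1])
```

This only demonstrates the monotonicity of the likelihood sequence; as noted above, it says nothing about whether the parameter sequence itself converges to a maximum.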
Next post: "Training Hidden Markov Models".