Expectation-Maximization Algorithm (EM)
The Expectation-Maximization algorithm, closely related to the Minorize-Maximization (MM) framework, finds maximum likelihood or maximum a posteriori estimates of parameters in probabilistic models, where the model depends on unobservable latent variables.
The EM algorithm alternates between an E-step and an M-step until a convergence criterion is met, so it is an iterative algorithm.
EM is suited to situations where data are missing, i.e. the dataset is incomplete. It is also widely used to fit machine-learning models such as GMM (Gaussian mixture models) and HMM (hidden Markov models).
1. General Steps of the EM Algorithm
E-step: use the available data to estimate (guess) the values of the latent variables;
M-step: use the estimates produced in the E-step to update the parameters on the completed data.
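As a concrete illustration, the two alternating steps can be sketched for a two-component one-dimensional Gaussian mixture; the data, initial values, and variable names below are all illustrative, not part of the original text:

```python
import math
import random

random.seed(0)

# Illustrative data: two 1-D Gaussian clusters (true means -2 and 3).
data = [random.gauss(-2, 1) for _ in range(200)] + [random.gauss(3, 1) for _ in range(200)]

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Initial guess theta^(0): mixing weight, means, standard deviations.
pi_1, mu1, mu2, s1, s2 = 0.5, -1.0, 1.0, 1.0, 1.0

for _ in range(50):
    # E-step: responsibilities P(Z=1 | x, theta^(t)) under the current parameters.
    resp = []
    for x in data:
        p1 = pi_1 * normal_pdf(x, mu1, s1)
        p2 = (1 - pi_1) * normal_pdf(x, mu2, s2)
        resp.append(p1 / (p1 + p2))
    # M-step: closed-form maximizer of E_{Z|X,theta^(t)}[log P(X, Z | theta)].
    n1 = sum(resp)
    n2 = len(data) - n1
    mu1 = sum(r * x for r, x in zip(resp, data)) / n1
    mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / n2
    s1 = math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1)
    s2 = math.sqrt(sum((1 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2)
    pi_1 = n1 / len(data)

print(sorted([round(mu1, 1), round(mu2, 1)]))  # estimated means, near the true -2 and 3
```

Each pass through the loop is one EM iteration; the closed-form M-step updates are the standard weighted-average maximizers for a Gaussian mixture.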
2. The EM Formula
Here maximum likelihood estimation is taken as the criterion:

$$\hat\theta_{MLE} = \arg\max_\theta \log P(X|\theta)$$
The EM iteration:

$$\theta^{(t+1)} = \arg\max_\theta \int_Z \log P(X,Z|\theta)\, P(Z|X,\theta^{(t)})\, dZ$$
The integral $\int_Z \log P(X,Z|\theta)\, P(Z|X,\theta^{(t)})\, dZ$ can also be written as

$$E_{Z|X,\theta^{(t)}}\left[\log P(X,Z|\theta)\right]$$

or, for discrete $Z$,

$$\sum\limits_Z \log P(X,Z|\theta)\, P(Z|X,\theta^{(t)})$$
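For discrete $Z$, the sum form can be evaluated directly. Below is a hypothetical two-coin sketch (all numbers made up): each record counts heads in 10 flips of one of two coins, the latent variable says which coin was used, and a uniform prior over the coin choice is assumed. The M-step maximizer should never decrease the expected complete-data log-likelihood:

```python
import math

flips = 10
heads = [9, 8, 7, 2, 1, 3]   # observed data X (made-up counts)
pA, pB = 0.6, 0.4            # current guess theta^(t): head probabilities of the two coins

def binom(n, k, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def Q(pa, pb, pa_t, pb_t):
    """Q(theta, theta^(t)) = sum_Z log P(X, Z | theta) P(Z | X, theta^(t)),
    assuming a uniform 0.5/0.5 prior over which coin is used."""
    total = 0.0
    for h in heads:
        wa = binom(flips, h, pa_t)
        wb = binom(flips, h, pb_t)
        ra = wa / (wa + wb)      # posterior P(Z = A | x, theta^(t))
        total += ra * math.log(0.5 * binom(flips, h, pa))
        total += (1 - ra) * math.log(0.5 * binom(flips, h, pb))
    return total

# M-step closed form: responsibility-weighted head fractions.
ra_list = []
for h in heads:
    wa = binom(flips, h, pA)
    wb = binom(flips, h, pB)
    ra_list.append(wa / (wa + wb))
pA_new = sum(r * h for r, h in zip(ra_list, heads)) / (flips * sum(ra_list))
pB_new = sum((1 - r) * h for r, h in zip(ra_list, heads)) / (flips * sum(1 - r for r in ra_list))

# The maximizer cannot give a smaller Q than the current parameters.
assert Q(pA_new, pB_new, pA, pB) >= Q(pA, pB, pA, pB)
print(round(pA_new, 3), round(pB_new, 3))
```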
3. Proof of Convergence
The proof here is not fully rigorous. To show convergence, it suffices to show that each update does not decrease the likelihood, i.e. that moving from $\theta^{(t)}$ to $\theta^{(t+1)}$ gives $P(X|\theta^{(t)}) \le P(X|\theta^{(t+1)})$. The proof is as follows.
Proof:
$$\log P(X|\theta) = \log P(X,Z|\theta) - \log P(Z|X,\theta)$$
Take the expectation of both sides with respect to the distribution $Z|X,\theta^{(t)}$:
Left-hand side:

$$\begin{aligned} E_{Z|X,\theta^{(t)}}[\log P(X|\theta)] &= \int_Z \log P(X|\theta)\, P(Z|X,\theta^{(t)})\, dZ\\ &= \log P(X|\theta) \int_Z P(Z|X,\theta^{(t)})\, dZ\\ &= \log P(X|\theta) \end{aligned}$$
Right-hand side:

$$\begin{aligned} &E_{Z|X,\theta^{(t)}}\left[\log P(X,Z|\theta) - \log P(Z|X,\theta)\right]\\ &= \int_Z \log P(X,Z|\theta)\, P(Z|X,\theta^{(t)})\, dZ - \int_Z \log P(Z|X,\theta)\, P(Z|X,\theta^{(t)})\, dZ \end{aligned}$$
Define

$$Q(\theta,\theta^{(t)}) = \int_Z \log P(X,Z|\theta)\, P(Z|X,\theta^{(t)})\, dZ,$$

$$H(\theta,\theta^{(t)}) = \int_Z \log P(Z|X,\theta)\, P(Z|X,\theta^{(t)})\, dZ.$$
Then

$$E_{Z|X,\theta^{(t)}}\left[\log P(X,Z|\theta) - \log P(Z|X,\theta)\right] = Q(\theta,\theta^{(t)}) - H(\theta,\theta^{(t)})$$

and equating the two sides gives $\log P(X|\theta) = Q(\theta,\theta^{(t)}) - H(\theta,\theta^{(t)})$.
By the definition of the EM update, $\theta^{(t+1)}$ is chosen as the $\theta$ that maximizes $Q(\theta,\theta^{(t)})$, so for every $\theta$:

$$Q(\theta^{(t+1)},\theta^{(t)}) \ge Q(\theta,\theta^{(t)})$$

In particular, taking $\theta = \theta^{(t)}$:

$$Q(\theta^{(t+1)},\theta^{(t)}) \ge Q(\theta^{(t)},\theta^{(t)})$$
For $H(\theta,\theta^{(t)})$:

$$\begin{aligned} &H(\theta^{(t+1)},\theta^{(t)}) - H(\theta^{(t)},\theta^{(t)})\\ &= \int_Z \log P(Z|X,\theta^{(t+1)})\, P(Z|X,\theta^{(t)})\, dZ - \int_Z \log P(Z|X,\theta^{(t)})\, P(Z|X,\theta^{(t)})\, dZ\\ &= \int_Z P(Z|X,\theta^{(t)}) \log \frac{P(Z|X,\theta^{(t+1)})}{P(Z|X,\theta^{(t)})}\, dZ\\ &= -KL\left(P(Z|X,\theta^{(t)}) \,\|\, P(Z|X,\theta^{(t+1)})\right)\\ &\le 0 \end{aligned}$$

That is,

$$H(\theta^{(t+1)},\theta^{(t)}) \le H(\theta^{(t)},\theta^{(t)})$$
Combining the two results: since $\log P(X|\theta) = Q(\theta,\theta^{(t)}) - H(\theta,\theta^{(t)})$, the $Q$ term does not decrease while the $H$ term does not increase, so $P(X|\theta^{(t)}) \le P(X|\theta^{(t+1)})$. $\blacksquare$
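The monotonicity just proved can be checked numerically. A minimal sketch, assuming an equal-weight, unit-variance two-component Gaussian mixture with synthetic data (all names and values illustrative), asserting that the log-likelihood never decreases across iterations:

```python
import math
import random

random.seed(1)
# Illustrative data drawn from two unit-variance Gaussians (means 0 and 4).
xs = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(4, 1) for _ in range(100)]

def pdf(x, mu):
    return math.exp(-((x - mu) ** 2) / 2) / math.sqrt(2 * math.pi)

def loglik(m1, m2):
    # log P(X | theta) for an equal-weight, unit-variance two-component mixture
    return sum(math.log(0.5 * pdf(x, m1) + 0.5 * pdf(x, m2)) for x in xs)

mu1, mu2 = -1.0, 1.0
prev = loglik(mu1, mu2)
for _ in range(30):
    # E-step: responsibilities under theta^(t).
    r = [0.5 * pdf(x, mu1) / (0.5 * pdf(x, mu1) + 0.5 * pdf(x, mu2)) for x in xs]
    # M-step: weighted means maximize Q(theta, theta^(t)).
    mu1 = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
    mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    cur = loglik(mu1, mu2)
    assert cur >= prev - 1e-9   # P(X|theta^(t)) <= P(X|theta^(t+1))
    prev = cur
print("log-likelihood was non-decreasing at every iteration")
```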
4. Derivation Method 1
Notation:
X: observed data, $X = \{x_1, x_2, \cdots, x_N\}$
Z: unobserved (latent) data, $Z = \{z_i\}_{i=1}^K$
(X, Z): complete data
$\theta$: parameter
4.1 The E and M Steps
E-step:

$$P(Z|X,\theta^{(t)}) \to E_{Z|X,\theta^{(t)}}\left[\log P(X,Z|\theta)\right]$$

M-step:

$$\theta^{(t+1)} = \arg\max_\theta E_{Z|X,\theta^{(t)}}\left[\log P(X,Z|\theta)\right]$$
4.2 Derivation
$$\log P(X|\theta) = \log P(X,Z|\theta) - \log P(Z|X,\theta)$$
Equivalently, introducing a distribution $q(Z)$ (with $q(Z) \ne 0$):

$$\log P(X|\theta) = \log \frac{P(X,Z|\theta)}{q(Z)} - \log \frac{P(Z|X,\theta)}{q(Z)}$$
Take the expectation of both sides with respect to the distribution $q(Z)$.
Left-hand side:

$$\begin{aligned} E_{q(Z)}[\log P(X|\theta)] &= \int_Z \log P(X|\theta)\, q(Z)\, dZ\\ &= \log P(X|\theta) \int_Z q(Z)\, dZ\\ &= \log P(X|\theta) \end{aligned}$$
Right-hand side:

$$\begin{aligned} &E_{q(Z)}\left[\log \frac{P(X,Z|\theta)}{q(Z)} - \log \frac{P(Z|X,\theta)}{q(Z)}\right]\\ &= \int_Z \log \frac{P(X,Z|\theta)}{q(Z)}\, q(Z)\, dZ - \int_Z \log \frac{P(Z|X,\theta)}{q(Z)}\, q(Z)\, dZ\\ &= ELBO + KL\left(q(Z) \,\|\, P(Z|X,\theta)\right) \end{aligned}$$

where $P(Z|X,\theta)$ is the posterior and ELBO is the evidence lower bound.
$$\begin{aligned} &\therefore\ \log P(X|\theta) = ELBO + KL(q\|P)\\ &\because\ KL(q\|P) \ge 0\\ &\therefore\ \log P(X|\theta) \ge ELBO \end{aligned}$$
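The decomposition $\log P(X|\theta) = ELBO + KL(q\|P)$ can be verified numerically on a toy discrete latent variable; the joint-probability table and the choice of $q$ below are made up for illustration:

```python
import math

# Tiny discrete model: Z in {0, 1}; table gives P(x, Z=z | theta) for the observed x.
joint = {0: 0.12, 1: 0.28}                  # made-up values
px = sum(joint.values())                    # evidence P(x | theta)
post = {z: joint[z] / px for z in joint}    # posterior P(Z | x, theta)

q = {0: 0.7, 1: 0.3}                        # an arbitrary distribution q(Z)

# ELBO = E_q[log P(x, Z | theta) / q(Z)],  KL = KL(q || posterior)
elbo = sum(q[z] * math.log(joint[z] / q[z]) for z in q)
kl = sum(q[z] * math.log(q[z] / post[z]) for z in q)

assert abs(math.log(px) - (elbo + kl)) < 1e-12   # log P(x|theta) = ELBO + KL
assert kl >= 0                                    # so log P(x|theta) >= ELBO
print(round(elbo, 4), round(kl, 4))
```

The identity holds for any valid $q$, not just this one; changing `q` moves probability mass between the two terms while their sum stays fixed at $\log P(x|\theta)$.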
Maximizing $\log P(X|\theta)$ is then equivalent to maximizing the ELBO, so:

$$\hat\theta^{(t+1)} = \arg\max_\theta ELBO = \arg\max_\theta \int_Z \log \frac{P(X,Z|\theta)}{q(Z)}\, q(Z)\, dZ$$
$\because$ when $q = P$, $KL = 0$ and equality holds in $\log P(X|\theta) \ge ELBO$. So take $q(Z) = P(Z|X,\theta^{(t)})$, the posterior from the previous step.
$$\begin{aligned} \therefore\ \hat\theta^{(t+1)} &= \arg\max_\theta \int_Z \log \frac{P(X,Z|\theta)}{P(Z|X,\theta^{(t)})}\, P(Z|X,\theta^{(t)})\, dZ\\ &= \arg\max_\theta \int_Z \left[\log P(X,Z|\theta) - \log P(Z|X,\theta^{(t)})\right] P(Z|X,\theta^{(t)})\, dZ \end{aligned}$$
Here $\theta^{(t)}$ is the parameter from the previous iteration, so it is a constant; the subtracted term does not involve the target $\theta$ and can be dropped from the maximization. Then:
$$\hat\theta^{(t+1)} = \arg\max_\theta \int_Z \log P(X,Z|\theta)\, P(Z|X,\theta^{(t)})\, dZ$$

This completes the derivation.
5. Derivation Method 2 (via Jensen's Inequality)
5.1 Jensen's inequality
For a concave function $f$ and $t \in [0,1]$, $f\left(t \cdot a + (1-t) \cdot b\right) \ge t f(a) + (1-t) f(b)$. In particular, with $t = \frac{1}{2}$, $f\left(\frac{a+b}{2}\right) \ge \frac{f(a)+f(b)}{2}$. In terms of expectations this reads $f(E[Y]) \ge E[f(Y)]$: taking the expectation first and then applying the function dominates applying the function first and then taking the expectation.
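A quick numeric check of $f(E[Y]) \ge E[f(Y)]$ for the concave function $\log$; the sample values are arbitrary positive numbers chosen for illustration:

```python
import math
import random

random.seed(2)
ys = [random.uniform(0.5, 5.0) for _ in range(1000)]   # arbitrary positive values

mean = sum(ys) / len(ys)
f_of_E = math.log(mean)                                # f(E[Y])
E_of_f = sum(math.log(y) for y in ys) / len(ys)        # E[f(Y)]

# Jensen's inequality for the concave function log:
assert f_of_E >= E_of_f
print(round(f_of_E, 4), ">=", round(E_of_f, 4))
```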
5.2 Interpreting the E and M steps
E-step: $\theta^{(t)}$ is the parameter found in the previous iteration and is a constant in this step. Take the expectation of the complete-data log-likelihood $\log P(X,Z|\theta)$ under the posterior $Z|X,\theta^{(t)}$; this yields a function of $\theta$.
M-step: maximize the expectation built in the E-step over $\theta$, and take the maximizer as the current parameter $\theta^{(t+1)}$.
5.3 Derivation

$$\log P(X|\theta) = \log \int_Z P(X,Z|\theta)\, dZ$$

Introduce a distribution $q(Z)$:
$$\begin{aligned} \log P(X|\theta) &= \log \int_Z \frac{P(X,Z|\theta)}{q(Z)} \cdot q(Z)\, dZ\\ &= \log \left[E_{q(Z)}\left(\frac{P(X,Z|\theta)}{q(Z)}\right)\right] \end{aligned}$$
By Jensen's inequality ($\log$ is concave):

$$\log P(X|\theta) \ge E_{q(Z)}\left[\log \frac{P(X,Z|\theta)}{q(Z)}\right] = ELBO$$
Equality holds when $\frac{P(X,Z|\theta)}{q(Z)} = c$ for some constant $c$. Then $q(Z) = \frac{1}{c}P(X,Z|\theta)$; integrating both sides over $Z$:
$$\int_Z q(Z)\, dZ = \int_Z \frac{1}{c} P(X,Z|\theta)\, dZ \quad\Rightarrow\quad 1 = \frac{1}{c} P(X|\theta)$$
$$\therefore\ q(Z) = \frac{P(X,Z|\theta)}{P(X|\theta)} = P(Z|X,\theta)$$

The remaining steps are the same as in Method 1.
6. Generalized EM
① Ordinary (narrow-sense) EM is a special case of generalized EM;
② In generative models where $Z$ is highly complex, the posterior $P(Z|X,\theta)$ is hard to obtain (intractable). In models such as GMM and HMM, however, $Z$ is structured and relatively simple, so ordinary EM can be used for optimization.
$$\begin{aligned} \log P(X|\theta) &= ELBO + KL(q\|P)\\ &= E_{q(Z)}\left[\log \frac{P(X,Z|\theta)}{q(Z)}\right] - E_{q(Z)}\left[\log \frac{P(Z|X,\theta)}{q(Z)}\right] \end{aligned}$$
Generalized EM steps:

$$\left\{ \begin{aligned} &1.\ \text{fix } \theta:\ \hat q = \arg\min_q KL(q\|P) = \arg\max_q ELBO(q,\theta)\\ &2.\ \text{fix } \hat q:\ \theta = \arg\max_\theta ELBO(\hat q,\theta) \end{aligned} \right.$$

Correspondingly:

$$\left\{ \begin{aligned} &\text{E-step}:\ q^{(t+1)} = \arg\max_q ELBO(q,\theta^{(t)})\\ &\text{M-step}:\ \theta^{(t+1)} = \arg\max_\theta ELBO(q^{(t+1)},\theta) \end{aligned} \right.$$
$$ELBO(q,\theta) = E_{q(Z)}\left[\log P(X,Z|\theta)\right] - E_{q(Z)}\left[\log q(Z)\right]$$

where $-E_{q(Z)}[\log q(Z)]$ is the entropy $H[q(Z)]$, so

$$ELBO(q,\theta) = E_{q(Z)}\left[\log P(X,Z|\theta)\right] + H[q(Z)]$$
Generalized EM fixes one argument while optimizing the other, so it can be viewed through the lens of coordinate ascent.
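The E-step of generalized EM, $q^{(t+1)} = \arg\max_q ELBO(q,\theta^{(t)})$, can be illustrated on a toy discrete model (the joint table is made up): in the entropy form, the ELBO is maximized exactly when $q$ equals the posterior, where it attains $\log P(x|\theta)$:

```python
import math

# Tiny discrete model: table of P(x, Z=z | theta) for the observed x (made-up values).
joint = {0: 0.12, 1: 0.28}
px = sum(joint.values())
post = {z: joint[z] / px for z in joint}   # posterior P(Z | x, theta)

def elbo(q):
    # ELBO(q, theta) = E_q[log P(x, Z | theta)] + H[q]
    e = sum(q[z] * math.log(joint[z]) for z in q)
    h = -sum(q[z] * math.log(q[z]) for z in q)
    return e + h

# Choosing q = posterior attains the maximum, which equals log P(x | theta)...
best = elbo(post)
assert abs(best - math.log(px)) < 1e-12

# ...and any other q gives a smaller ELBO (the gap is exactly KL(q || posterior)).
for a in (0.1, 0.5, 0.9):
    q = {0: a, 1: 1 - a}
    assert elbo(q) <= best + 1e-12
print(round(best, 4))
```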
7. Improvements of the EM Algorithm
① Variational Bayesian EM, abbreviated VBEM/VIEM/VEM: three different names for essentially the same method;
② Monte Carlo EM, MCEM.
The material above is summarized from an uploader's white-board derivation series on Bilibili, together with Zhou Zhihua's watermelon book. The series is highly recommended: clear, accessible, and reasonably current. Link: 机器学习白板推导系列.
This is my first post; corrections and discussion are very welcome.