The EM algorithm is an iterative algorithm, systematized by Dempster et al. in 1977, for maximum likelihood estimation, or maximum a posteriori estimation, of the parameters of probability models containing hidden variables. Each iteration consists of two steps: the E-step, which computes an expectation, and the M-step, which performs a maximization. Hence the method is called the expectation-maximization (EM) algorithm.
1. Introduction to the EM Algorithm
The EM Algorithm
Incomplete data: the observed random variable $Y$.
Complete data: the observed random variable $Y$ together with the hidden random variable $Z$.
$Q$ function: the expectation of the complete-data log-likelihood $\log P \left( Y, Z | \theta \right)$ with respect to the conditional distribution $P \left( Z | Y, \theta^{\left( i \right)} \right)$ of the unobserved data $Z$, given the observed data $Y$ and the current parameter $\theta^{\left( i \right)}$:
$$\begin{aligned} Q \left( \theta, \theta^{\left( i \right)} \right) = E_{Z} \left[ \log P \left( Y, Z | \theta \right) | Y, \theta^{\left( i \right)} \right] \end{aligned}$$
Derivation of the EM Algorithm
For a probability model with hidden variable $Z$, the goal is to maximize the log-likelihood of the observed variable $Y$ with respect to the parameter $\theta$:
$$\begin{aligned} \max_{\theta} L \left( \theta \right) &= \log P \left( Y | \theta \right) \\ &= \log \sum_{Z} P \left( Y, Z | \theta \right) \\ &= \log \left( \sum_{Z} P \left( Y | Z, \theta \right) P \left( Z | \theta \right) \right) \end{aligned}$$
Consider the difference between the log-likelihood $L \left( \theta \right)$ and its value $L \left( \theta^{\left( i \right)} \right)$ after the $i$-th iteration. Since $\log$ is concave and $P \left( Z | Y, \theta^{\left( i \right)} \right)$ is a probability distribution over $Z$, Jensen's inequality gives
$$\begin{aligned} L \left( \theta \right) - L \left( \theta^{\left( i \right)} \right) &= \log \left( \sum_{Z} P \left( Y | Z, \theta \right) P \left( Z | \theta \right) \right) - \log P \left( Y | \theta^{\left( i \right)} \right) \\ &= \log \left( \sum_{Z} P \left( Z | Y, \theta^{\left( i \right)} \right) \dfrac{P \left( Y | Z, \theta \right) P \left( Z | \theta \right)}{P \left( Z | Y, \theta^{\left( i \right)} \right)} \right) - \log P \left( Y | \theta^{\left( i \right)} \right) \\ &\geq \sum_{Z} P \left( Z | Y, \theta^{\left( i \right)} \right) \log \dfrac{P \left( Y | Z, \theta \right) P \left( Z | \theta \right)}{P \left( Z | Y, \theta^{\left( i \right)} \right)} - \log P \left( Y | \theta^{\left( i \right)} \right) \\ &= \sum_{Z} P \left( Z | Y, \theta^{\left( i \right)} \right) \log \dfrac{P \left( Y | Z, \theta \right) P \left( Z | \theta \right)}{P \left( Z | Y, \theta^{\left( i \right)} \right) P \left( Y | \theta^{\left( i \right)} \right)} \end{aligned}$$
Let
$$\begin{aligned} B \left( \theta, \theta^{\left( i \right)} \right) = L \left( \theta^{\left( i \right)} \right) + \sum_{Z} P \left( Z | Y, \theta^{\left( i \right)} \right) \log \dfrac{P \left( Y | Z, \theta \right) P \left( Z | \theta \right)}{P \left( Z | Y, \theta^{\left( i \right)} \right) P \left( Y | \theta^{\left( i \right)} \right)} \end{aligned}$$
Then
$$L \left( \theta \right) \geq B \left( \theta, \theta^{\left( i \right)} \right)$$
That is, $B \left( \theta, \theta^{\left( i \right)} \right)$ is a lower bound of $L \left( \theta \right)$. Choose $\theta^{\left( i+1 \right)}$ to maximize $B \left( \theta, \theta^{\left( i \right)} \right)$; dropping the terms that are constant with respect to $\theta$,
$$\begin{aligned} \theta^{\left( i+1 \right)} &= \arg \max_{\theta} B \left( \theta, \theta^{\left( i \right)} \right) \\ &= \arg \max_{\theta} \left( L \left( \theta^{\left( i \right)} \right) + \sum_{Z} P \left( Z | Y, \theta^{\left( i \right)} \right) \log \dfrac{P \left( Y | Z, \theta \right) P \left( Z | \theta \right)}{P \left( Z | Y, \theta^{\left( i \right)} \right) P \left( Y | \theta^{\left( i \right)} \right)} \right) \\ &= \arg \max_{\theta} \left( \sum_{Z} P \left( Z | Y, \theta^{\left( i \right)} \right) \log \left( P \left( Y | Z, \theta \right) P \left( Z | \theta \right) \right) \right) \\ &= \arg \max_{\theta} \left( \sum_{Z} P \left( Z | Y, \theta^{\left( i \right)} \right) \log P \left( Y, Z | \theta \right) \right) \\ &= \arg \max_{\theta} Q \left( \theta, \theta^{\left( i \right)} \right) \end{aligned}$$
The EM algorithm:
- Input: observed variable data $Y$, hidden variable data $Z$, joint distribution $P \left( Y, Z | \theta \right)$, conditional distribution $P \left( Z | Y, \theta \right)$;
- Output: model parameter $\theta$.
1. Choose an initial value $\theta^{\left( 0 \right)}$;
2. E-step:
$$Q \left( \theta, \theta^{\left( i \right)} \right) = E_{Z} \left[ \log P \left( Y, Z | \theta \right) | Y, \theta^{\left( i \right)} \right] = \sum_{Z} \log P \left( Y, Z | \theta \right) \cdot P \left( Z | Y, \theta^{\left( i \right)} \right)$$
3. M-step:
$$\theta^{\left( i+1 \right)} = \arg \max_{\theta} Q \left( \theta, \theta^{\left( i \right)} \right)$$
4. Repeat steps 2 and 3 until convergence.
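As a concrete illustration of these two steps, consider a hypothetical mixture of two biased coins chosen with equal probability (an invented example, not from the text above): each observation is the number of heads in $n$ tosses of one coin, and the coin's identity is the hidden variable $Z$. A minimal sketch, assuming equal mixing weights, in which the E-step computes the responsibilities $P(Z | y_j, \theta^{(i)})$ and the M-step maximizes $Q$ in closed form:

```python
def em_two_coins(counts, n, p=0.6, q=0.4, iters=50):
    """EM for an equal-weight mixture of two coins with head
    probabilities p and q; counts[j] = heads seen in n tosses."""
    for _ in range(iters):
        # E-step: responsibility mu_j = P(coin 1 | y_j, current p, q)
        mu = []
        for h in counts:
            a = p ** h * (1 - p) ** (n - h)   # likelihood under coin 1
            b = q ** h * (1 - q) ** (n - h)   # likelihood under coin 2
            mu.append(a / (a + b))
        # M-step: maximizing Q gives responsibility-weighted
        # head frequencies as the new estimates
        p = sum(m * h for m, h in zip(mu, counts)) / (n * sum(mu))
        q = sum((1 - m) * h for m, h in zip(mu, counts)) / (n * sum(1 - m for m in mu))
    return p, q
```

With clearly separated data such as `[9, 8, 9, 1, 2]` heads out of 10 tosses, the two estimates converge to roughly the head frequencies within each group of observations.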
2. Generalization of the EM Algorithm
The Maximization-Maximization Algorithm for the F Function
$F$ function: let $\tilde{P} \left( Z \right)$ be a probability distribution over the hidden variable $Z$. Define the following function of the distribution $\tilde{P}$ and the parameter $\theta$:
$$\begin{aligned} F \left( \tilde{P}, \theta \right) = E_{\tilde{P}} \left[ \log P \left( Y, Z | \theta \right) \right] + H \left( \tilde{P} \right) \end{aligned}$$
where $H \left( \tilde{P} \right) = - E_{\tilde{P}} \left[ \log \tilde{P} \left( Z \right) \right]$ is the entropy of the distribution $\tilde{P} \left( Z \right)$.
For fixed $\theta$, maximize the $F$ function:
$$\begin{aligned} & \max_{\tilde{P}} F \left( \tilde{P}, \theta \right) \\ & \text{s.t.} \quad \sum_{Z} \tilde{P} \left( Z \right) = 1 \end{aligned}$$
Introduce a Lagrange multiplier $\lambda$ and form the Lagrangian
$$\begin{aligned} L &= E_{\tilde{P}} \left[ \log P \left( Y, Z | \theta \right) \right] - E_{\tilde{P}} \left[ \log \tilde{P} \left( Z \right) \right] + \lambda \left( 1 - \sum_{Z} \tilde{P} \left( Z \right) \right) \\ &= \sum_{Z} \tilde{P} \left( Z \right) \log P \left( Y, Z | \theta \right) - \sum_{Z} \tilde{P} \left( Z \right) \log \tilde{P} \left( Z \right) + \lambda - \lambda \sum_{Z} \tilde{P} \left( Z \right) \end{aligned}$$
Taking the partial derivative with respect to $\tilde{P} \left( Z \right)$ gives
$$\begin{aligned} \dfrac{\partial L}{\partial \tilde{P} \left( Z \right)} = \log P \left( Y, Z | \theta \right) - \log \tilde{P} \left( Z \right) - 1 - \lambda \end{aligned}$$
Setting it to zero and writing $\tilde{P}_{\theta}$ for the maximizer, we get
$$\begin{aligned} & 1 + \lambda = \log P \left( Y, Z | \theta \right) - \log \tilde{P}_{\theta} \left( Z \right) \\ & \dfrac{P \left( Y, Z | \theta \right)}{\tilde{P}_{\theta} \left( Z \right)} = e^{1 + \lambda} \\ & \sum_{Z} P \left( Y, Z | \theta \right) = e^{1 + \lambda} \sum_{Z} \tilde{P}_{\theta} \left( Z \right) \end{aligned}$$
Since $\sum_{Z} \tilde{P}_{\theta} \left( Z \right) = 1$, it follows that
$$P \left( Y | \theta \right) = e^{1 + \lambda}$$
Substituting back, we obtain
$$\tilde{P}_{\theta} \left( Z \right) = P \left( Z | Y, \theta \right)$$
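This conclusion can be checked numerically on a toy model (the joint table below is invented for illustration): for a binary hidden variable, sweep over candidate distributions $\tilde{P}(Z)$ and confirm that $F(\tilde{P}, \theta)$ peaks at the posterior $P(Z | Y, \theta)$, where its value equals $\log P(Y | \theta)$:

```python
import math

# Hypothetical joint probabilities P(Y=y, Z=z | theta) for one observed y
# and a binary hidden variable z, at some fixed theta.
p_yz = [0.12, 0.28]                    # P(y, z=0 | theta), P(y, z=1 | theta)
p_y = sum(p_yz)                        # P(y | theta) = 0.4
posterior = [v / p_y for v in p_yz]    # P(z | y, theta) = [0.3, 0.7]

def F(pt):
    """F(P~, theta) = E_{P~}[log P(y, Z | theta)] + H(P~)."""
    return sum(q * (math.log(p) - math.log(q))
               for q, p in zip(pt, p_yz) if q > 0)

# Sweep P~(z=0) over a grid; the maximum of F sits at the posterior.
grid = [i / 1000 for i in range(1, 1000)]
best = max(F([q, 1 - q]) for q in grid)
```

Here `best` matches `F(posterior)`, which equals `math.log(p_y)`, exactly as derived above.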
Then, at $\tilde{P} = \tilde{P}_{\theta}$,
$$\begin{aligned} F \left( \tilde{P}, \theta \right) &= E_{\tilde{P}} \left[ \log P \left( Y, Z | \theta \right) \right] + H \left( \tilde{P} \right) \\ &= \sum_{Z} \tilde{P} \left( Z \right) \log P \left( Y, Z | \theta \right) - \sum_{Z} \tilde{P} \left( Z \right) \log \tilde{P} \left( Z \right) \\ &= \sum_{Z} \tilde{P} \left( Z \right) \log \dfrac{P \left( Y, Z | \theta \right)}{\tilde{P} \left( Z \right)} \\ &= \sum_{Z} \tilde{P} \left( Z \right) \log \dfrac{P \left( Z | Y, \theta \right) P \left( Y | \theta \right)}{\tilde{P} \left( Z \right)} \\ &= \log P \left( Y | \theta \right) \sum_{Z} \tilde{P} \left( Z \right) \\ &= \log P \left( Y | \theta \right) = L \left( \theta \right) \end{aligned}$$
Hence, for the parameter $\theta^{*}$ that maximizes $F \left( \tilde{P}, \theta \right)$,
$$L \left( \theta^{*} \right) = F \left( \tilde{P}_{\theta^{*}}, \theta^{*} \right) = F \left( \tilde{P}^{*}, \theta^{*} \right)$$
That is, if $F \left( \tilde{P}, \theta \right)$ attains a local maximum (or global maximum) at $\tilde{P}^{*}, \theta^{*}$, then $L \left( \theta \right)$ attains a local maximum (or global maximum) at $\theta^{*}$.
By $\tilde{P}_{\theta} \left( Z \right) = P \left( Z | Y, \theta \right)$, for fixed $\theta^{\left( i \right)}$,
$$\tilde{P}^{\left( i+1 \right)} \left( Z \right) = \tilde{P}_{\theta^{\left( i \right)}} \left( Z \right) = P \left( Z | Y, \theta^{\left( i \right)} \right)$$
maximizes $F \left( \tilde{P}, \theta^{\left( i \right)} \right)$.
Then
$$\begin{aligned} F \left( \tilde{P}^{\left( i+1 \right)}, \theta \right) &= E_{\tilde{P}^{\left( i+1 \right)}} \left[ \log P \left( Y, Z | \theta \right) \right] + H \left( \tilde{P}^{\left( i+1 \right)} \right) \\ &= \sum_{Z} \log P \left( Y, Z | \theta \right) P \left( Z | Y, \theta^{\left( i \right)} \right) + H \left( \tilde{P}^{\left( i+1 \right)} \right) \\ &= Q \left( \theta, \theta^{\left( i \right)} \right) + H \left( \tilde{P}^{\left( i+1 \right)} \right) \end{aligned}$$
Fixing $\tilde{P}^{\left( i+1 \right)}$, choose $\theta^{\left( i+1 \right)}$ to maximize $F \left( \tilde{P}^{\left( i+1 \right)}, \theta \right)$; since the entropy term does not depend on $\theta$,
$$\theta^{\left( i+1 \right)} = \arg \max_{\theta} F \left( \tilde{P}^{\left( i+1 \right)}, \theta \right) = \arg \max_{\theta} Q \left( \theta, \theta^{\left( i \right)} \right)$$
That is, the parameter estimate sequences $\theta^{\left( i \right)}, i = 1, 2, \cdots$, obtained by the EM algorithm and by the maximization-maximization algorithm for the $F$ function are identical.
The GEM Algorithm
The GEM algorithm:
- Input: observed data $Y$, the $F$ function;
- Output: model parameter $\theta$.
1. Choose an initial value $\theta^{\left( 0 \right)}$;
2. At iteration $i+1$, first step: with $\theta^{\left( i \right)}$ the current estimate of the parameter $\theta$ and $\tilde{P}^{\left( i \right)}$ the current estimate of the distribution $\tilde{P}$, find $\tilde{P}^{\left( i+1 \right)}$ that maximizes $F \left( \tilde{P}, \theta^{\left( i \right)} \right)$ over $\tilde{P}$;
3. Second step: find $\theta^{\left( i+1 \right)}$ that maximizes $F \left( \tilde{P}^{\left( i+1 \right)}, \theta \right)$ over $\theta$;
4. Repeat steps 2 and 3 until convergence.
3. Application of the EM Algorithm to Learning Gaussian Mixture Models
Gaussian Mixture Models
A Gaussian mixture model is
$$P \left( y | \theta \right) = \sum_{k=1}^{K} \alpha_{k} \phi \left( y | \theta_{k} \right)$$
where $\alpha_{k}$ are the mixing coefficients with $\alpha_{k} \geq 0$ and $\sum_{k=1}^{K} \alpha_{k} = 1$, and $\phi \left( y | \theta_{k} \right)$ is the Gaussian density with $\theta_{k} = \left( \mu_{k}, \sigma_{k}^{2} \right)$:
$$\phi \left( y | \theta_{k} \right) = \dfrac{1}{\sqrt{2 \pi} \sigma_{k}} \exp \left( - \dfrac{\left( y - \mu_{k} \right)^{2}}{2 \sigma_{k}^{2}} \right)$$
which is called the $k$-th component.
EM Algorithm for Parameter Estimation in Gaussian Mixture Models
Suppose the observed data $\left( y_{1}, y_{2}, \cdots, y_{N} \right)$ are generated by the Gaussian mixture model
$$P \left( y | \theta \right) = \sum_{k=1}^{K} \alpha_{k} \phi \left( y | \theta_{k} \right)$$
where $\theta = \left( \alpha_{1}, \alpha_{2}, \cdots, \alpha_{K}; \theta_{1}, \theta_{2}, \cdots, \theta_{K} \right)$ are the model parameters.
The hidden variable $\gamma_{jk}$ is a 0-1 variable indicating whether observation $y_{j}$ comes from the $k$-th component:
$$\gamma_{jk} = \begin{cases} 1, & \text{observation } j \text{ comes from component } k \\ 0, & \text{otherwise} \end{cases} \quad \left( j = 1, 2, \cdots, N; \ k = 1, 2, \cdots, K \right)$$
The complete data are
$$\left( y_{j}, \gamma_{j1}, \gamma_{j2}, \cdots, \gamma_{jK} \right), \quad j = 1, 2, \cdots, N$$
The complete-data likelihood function is
$$\begin{aligned} P \left( y, \gamma | \theta \right) &= \prod_{j=1}^{N} P \left( y_{j}, \gamma_{j1}, \gamma_{j2}, \cdots, \gamma_{jK} | \theta \right) \\ &= \prod_{k=1}^{K} \prod_{j=1}^{N} \left[ \alpha_{k} \phi \left( y_{j} | \theta_{k} \right) \right]^{\gamma_{jk}} \\ &= \prod_{k=1}^{K} \alpha_{k}^{n_{k}} \prod_{j=1}^{N} \left[ \phi \left( y_{j} | \theta_{k} \right) \right]^{\gamma_{jk}} \\ &= \prod_{k=1}^{K} \alpha_{k}^{n_{k}} \prod_{j=1}^{N} \left[ \dfrac{1}{\sqrt{2 \pi} \sigma_{k}} \exp \left( - \dfrac{\left( y_{j} - \mu_{k} \right)^{2}}{2 \sigma_{k}^{2}} \right) \right]^{\gamma_{jk}} \end{aligned}$$
where $n_{k} = \sum_{j=1}^{N} \gamma_{jk}$. The complete-data log-likelihood is
$$\log P \left( y, \gamma | \theta \right) = \sum_{k=1}^{K} \left\{ \sum_{j=1}^{N} \gamma_{jk} \log \alpha_{k} + \sum_{j=1}^{N} \gamma_{jk} \left[ \log \left( \dfrac{1}{\sqrt{2 \pi}} \right) - \log \sigma_{k} - \dfrac{1}{2 \sigma_{k}^{2}} \left( y_{j} - \mu_{k} \right)^{2} \right] \right\}$$
The $Q \left( \theta, \theta^{\left( i \right)} \right)$ function:
$$\begin{aligned} Q \left( \theta, \theta^{\left( i \right)} \right) &= E \left[ \log P \left( y, \gamma | \theta \right) | y, \theta^{\left( i \right)} \right] \\ &= E \left\{ \sum_{k=1}^{K} \left\{ \sum_{j=1}^{N} \gamma_{jk} \log \alpha_{k} + \sum_{j=1}^{N} \gamma_{jk} \left[ \log \left( \dfrac{1}{\sqrt{2 \pi}} \right) - \log \sigma_{k} - \dfrac{1}{2 \sigma_{k}^{2}} \left( y_{j} - \mu_{k} \right)^{2} \right] \right\} \right\} \\ &= \sum_{k=1}^{K} \left\{ \sum_{j=1}^{N} E \left( \gamma_{jk} \right) \log \alpha_{k} + \sum_{j=1}^{N} E \left( \gamma_{jk} \right) \left[ \log \left( \dfrac{1}{\sqrt{2 \pi}} \right) - \log \sigma_{k} - \dfrac{1}{2 \sigma_{k}^{2}} \left( y_{j} - \mu_{k} \right)^{2} \right] \right\} \\ &= \sum_{k=1}^{K} \left\{ \sum_{j=1}^{N} \hat{\gamma}_{jk} \log \alpha_{k} + \sum_{j=1}^{N} \hat{\gamma}_{jk} \left[ \log \left( \dfrac{1}{\sqrt{2 \pi}} \right) - \log \sigma_{k} - \dfrac{1}{2 \sigma_{k}^{2}} \left( y_{j} - \mu_{k} \right)^{2} \right] \right\} \end{aligned}$$
where the responsibility $\hat{\gamma}_{jk}$ of component $k$ for observation $y_{j}$ is the probability, under the current model parameters, that the $j$-th observation comes from the $k$-th component:
$$\begin{aligned} \hat{\gamma}_{jk} &= E \left( \gamma_{jk} | y, \theta \right) = P \left( \gamma_{jk} = 1 | y, \theta \right) \\ &= \dfrac{P \left( \gamma_{jk} = 1, y_{j} | \theta \right)}{\sum_{k=1}^{K} P \left( \gamma_{jk} = 1, y_{j} | \theta \right)} \\ &= \dfrac{\alpha_{k} \phi \left( y_{j} | \theta_{k} \right)}{\sum_{k=1}^{K} \alpha_{k} \phi \left( y_{j} | \theta_{k} \right)} \quad \left( j = 1, 2, \cdots, N; \ k = 1, 2, \cdots, K \right) \end{aligned}$$
Maximize the $Q \left( \theta, \theta^{\left( i \right)} \right)$ function with respect to $\theta$:
$$\theta^{\left( i+1 \right)} = \arg \max_{\theta} Q \left( \theta, \theta^{\left( i \right)} \right)$$
which gives
$$\begin{aligned} & \hat{\mu}_{k} = \dfrac{\sum_{j=1}^{N} \hat{\gamma}_{jk} y_{j}}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}, \quad k = 1, 2, \cdots, K \\ & \hat{\sigma}_{k}^{2} = \dfrac{\sum_{j=1}^{N} \hat{\gamma}_{jk} \left( y_{j} - \hat{\mu}_{k} \right)^{2}}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}, \quad k = 1, 2, \cdots, K \\ & \hat{\alpha}_{k} = \dfrac{\sum_{j=1}^{N} \hat{\gamma}_{jk}}{N}, \quad k = 1, 2, \cdots, K \end{aligned}$$
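For instance, the update for $\hat{\mu}_{k}$ follows by differentiating $Q$ and setting the derivative to zero (the $\hat{\alpha}_{k}$ update additionally needs a Lagrange multiplier for the constraint $\sum_{k=1}^{K} \alpha_{k} = 1$):
$$\dfrac{\partial Q}{\partial \mu_{k}} = \sum_{j=1}^{N} \hat{\gamma}_{jk} \dfrac{y_{j} - \mu_{k}}{\sigma_{k}^{2}} = 0 \quad \Longrightarrow \quad \hat{\mu}_{k} = \dfrac{\sum_{j=1}^{N} \hat{\gamma}_{jk} y_{j}}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}$$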
The EM algorithm for Gaussian mixture model parameter estimation:
- Input: observed data $y_{1}, y_{2}, \cdots, y_{N}$, the Gaussian mixture model;
- Output: the Gaussian mixture model parameters.
1. Choose initial parameter values and begin iterating;
2. E-step: compute the responsibility of component $k$ for each observation $y_{j}$:
$$\hat{\gamma}_{jk} = \dfrac{\alpha_{k} \phi \left( y_{j} | \theta_{k} \right)}{\sum_{k=1}^{K} \alpha_{k} \phi \left( y_{j} | \theta_{k} \right)}, \quad j = 1, 2, \cdots, N; \ k = 1, 2, \cdots, K$$
3. M-step: compute the updated model parameters:
$$\begin{aligned} & \hat{\mu}_{k} = \dfrac{\sum_{j=1}^{N} \hat{\gamma}_{jk} y_{j}}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}, \quad k = 1, 2, \cdots, K \\ & \hat{\sigma}_{k}^{2} = \dfrac{\sum_{j=1}^{N} \hat{\gamma}_{jk} \left( y_{j} - \hat{\mu}_{k} \right)^{2}}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}, \quad k = 1, 2, \cdots, K \\ & \hat{\alpha}_{k} = \dfrac{\sum_{j=1}^{N} \hat{\gamma}_{jk}}{N}, \quad k = 1, 2, \cdots, K \end{aligned}$$
4. Repeat steps 2 and 3 until convergence.
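The two steps above can be sketched in NumPy as follows (a minimal one-dimensional version; the quantile-based initialization and the fixed iteration count are simplifying assumptions, not part of the algorithm):

```python
import numpy as np

def gmm_em(y, K, n_iter=100):
    """EM for a 1-D Gaussian mixture; returns (alpha, mu, sigma2, ll_history)."""
    y = np.asarray(y, dtype=float)
    N = y.size
    alpha = np.full(K, 1.0 / K)
    mu = np.quantile(y, np.linspace(0.1, 0.9, K))  # spread-out initial means
    sigma2 = np.full(K, y.var() + 1e-6)
    history = []
    for _ in range(n_iter):
        # E-step: responsibilities gamma[j, k] of component k for y_j
        dens = (np.exp(-((y[:, None] - mu) ** 2) / (2 * sigma2))
                / np.sqrt(2 * np.pi * sigma2))
        weighted = alpha * dens                             # alpha_k * phi(y_j | theta_k)
        history.append(np.log(weighted.sum(axis=1)).sum())  # log P(y | theta)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: the closed-form updates derived above
        nk = gamma.sum(axis=0)
        mu = (gamma * y[:, None]).sum(axis=0) / nk
        sigma2 = (gamma * (y[:, None] - mu) ** 2).sum(axis=0) / nk
        alpha = nk / N
    return alpha, mu, sigma2, history
```

The recorded log-likelihood history is non-decreasing, reflecting the monotonicity property of EM; in practice one stops when its change falls below a tolerance rather than after a fixed number of iterations.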
4. Summary
1. The EM algorithm is an iterative algorithm for maximum likelihood estimation, or maximum a posteriori estimation, of probability models with hidden variables. The data of such a model are represented as $P \left( Y, Z | \theta \right)$, where $Y$ is the observed-variable data, $Z$ is the hidden-variable data, and $\theta$ is the model parameter. The EM algorithm performs maximum likelihood estimation by iteratively maximizing the log-likelihood of the observed data, $L \left( \theta \right) = \log P \left( Y | \theta \right)$. Each iteration consists of two steps.
E-step: compute the expectation of $\log P \left( Y, Z | \theta \right)$ with respect to $P \left( Z | Y, \theta^{\left( i \right)} \right)$:
$$Q \left( \theta, \theta^{\left( i \right)} \right) = \sum_{Z} \log P \left( Y, Z | \theta \right) P \left( Z | Y, \theta^{\left( i \right)} \right)$$
which is called the $Q$ function, where $\theta^{\left( i \right)}$ is the current parameter estimate.
M-step: maximize the $Q$ function to obtain the new parameter estimate:
$$\theta^{\left( i+1 \right)} = \arg \max_{\theta} Q \left( \theta, \theta^{\left( i \right)} \right)$$
In constructing a concrete EM algorithm, the key step is defining the $Q$ function. In each iteration, the EM algorithm increases the log-likelihood $L \left( \theta \right)$ by maximizing the $Q$ function.
2. After each iteration the EM algorithm does not decrease the likelihood of the observed data, i.e.
$$P \left( Y | \theta^{\left( i+1 \right)} \right) \geqslant P \left( Y | \theta^{\left( i \right)} \right)$$
Under general conditions the EM algorithm converges, but convergence to a global optimum is not guaranteed.
3. The EM algorithm is very widely applicable, chiefly to learning probability models with hidden variables. Parameter estimation for Gaussian mixture models is one important application; the unsupervised learning of hidden Markov models, introduced in the next chapter, is another.
4. The EM algorithm can also be interpreted as a maximization-maximization algorithm for the $F$ function. The EM algorithm has many variants, such as the GEM algorithm. The GEM algorithm is characterized by increasing the value of the $F$ function at each iteration (not necessarily maximizing it), thereby increasing the likelihood.
4.EM算法还可以解释为 F F F函数的极大-极大算法。EM算法有许多变形,如GEM算法。GEM算法的特点是每次迭代增加 F F F函数值(并不一定是极大化 F F F函数),从而增加似然函数值。