[Statistical Learning Methods] Chapter 9: The EM Algorithm and Its Extensions

The EM algorithm is an iterative algorithm, consolidated and proposed by Dempster et al. in 1977, for maximum likelihood estimation, or maximum a posteriori estimation, of the parameters of probabilistic models containing hidden variables. Each iteration of the EM algorithm consists of two steps: the E-step, which computes an expectation, and the M-step, which performs a maximization. The algorithm is therefore called the expectation-maximization (EM) algorithm.

1. Introduction to the EM Algorithm

The EM Algorithm

Incomplete data: the observed random variable $Y$.
Complete data: the observed random variable $Y$ together with the hidden random variable $Z$.

$Q$ function: the expectation of the complete-data log-likelihood $\log P(Y, Z \mid \theta)$ with respect to the conditional distribution $P(Z \mid Y, \theta^{(i)})$ of the unobserved data $Z$ given the observed data $Y$ and the current parameter estimate $\theta^{(i)}$:
$$Q(\theta, \theta^{(i)}) = E_Z\left[\log P(Y, Z \mid \theta) \,\middle|\, Y, \theta^{(i)}\right]$$

Derivation of the EM Algorithm

For a probabilistic model with hidden variable $Z$, the goal is to maximize the log-likelihood of the observed variable $Y$ with respect to the parameter $\theta$:
$$\max_\theta L(\theta) = \log P(Y \mid \theta) = \log \sum_Z P(Y, Z \mid \theta) = \log \left( \sum_Z P(Y \mid Z, \theta)\, P(Z \mid \theta) \right)$$

Consider the difference between the log-likelihood $L(\theta)$ and its value $L(\theta^{(i)})$ at the estimate after the $i$-th iteration:
$$\begin{aligned} L(\theta) - L(\theta^{(i)}) &= \log \left( \sum_Z P(Y \mid Z, \theta)\, P(Z \mid \theta) \right) - \log P(Y \mid \theta^{(i)}) \\ &= \log \left( \sum_Z P(Z \mid Y, \theta^{(i)})\, \frac{P(Y \mid Z, \theta)\, P(Z \mid \theta)}{P(Z \mid Y, \theta^{(i)})} \right) - \log P(Y \mid \theta^{(i)}) \\ &\geq \sum_Z P(Z \mid Y, \theta^{(i)}) \log \frac{P(Y \mid Z, \theta)\, P(Z \mid \theta)}{P(Z \mid Y, \theta^{(i)})} - \log P(Y \mid \theta^{(i)}) \\ &= \sum_Z P(Z \mid Y, \theta^{(i)}) \log \frac{P(Y \mid Z, \theta)\, P(Z \mid \theta)}{P(Z \mid Y, \theta^{(i)})\, P(Y \mid \theta^{(i)})} \end{aligned}$$
where the last step uses $\sum_Z P(Z \mid Y, \theta^{(i)}) = 1$.
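The inequality in the third line is Jensen's inequality for the concave logarithm: for weights $\lambda_j \geq 0$ with $\sum_j \lambda_j = 1$,

$$\log \sum_j \lambda_j y_j \geq \sum_j \lambda_j \log y_j$$

applied here with weights $P(Z \mid Y, \theta^{(i)})$, which are nonnegative and sum to 1 over $Z$.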

Define
$$B(\theta, \theta^{(i)}) = L(\theta^{(i)}) + \sum_Z P(Z \mid Y, \theta^{(i)}) \log \frac{P(Y \mid Z, \theta)\, P(Z \mid \theta)}{P(Z \mid Y, \theta^{(i)})\, P(Y \mid \theta^{(i)})}$$

so that

$$L(\theta) \geq B(\theta, \theta^{(i)})$$

That is, the function $B(\theta, \theta^{(i)})$ is a lower bound on $L(\theta)$. Choosing $\theta^{(i+1)}$ to maximize $B(\theta, \theta^{(i)})$ gives
$$\begin{aligned} \theta^{(i+1)} &= \arg\max_\theta B(\theta, \theta^{(i)}) \\ &= \arg\max_\theta \left( L(\theta^{(i)}) + \sum_Z P(Z \mid Y, \theta^{(i)}) \log \frac{P(Y \mid Z, \theta)\, P(Z \mid \theta)}{P(Z \mid Y, \theta^{(i)})\, P(Y \mid \theta^{(i)})} \right) \\ &= \arg\max_\theta \left( \sum_Z P(Z \mid Y, \theta^{(i)}) \log \big( P(Y \mid Z, \theta)\, P(Z \mid \theta) \big) \right) \\ &= \arg\max_\theta \left( \sum_Z P(Z \mid Y, \theta^{(i)}) \log P(Y, Z \mid \theta) \right) \\ &= \arg\max_\theta Q(\theta, \theta^{(i)}) \end{aligned}$$
where terms that do not depend on $\theta$ have been dropped.

The EM algorithm:

  • Input: observed-variable data $Y$, hidden-variable data $Z$, joint distribution $P(Y, Z \mid \theta)$, conditional distribution $P(Z \mid Y, \theta)$;
  • Output: model parameter $\theta$.
  1. Choose an initial value $\theta^{(0)}$;

  2. E-step: compute
$$Q(\theta, \theta^{(i)}) = E_Z\left[\log P(Y, Z \mid \theta) \,\middle|\, Y, \theta^{(i)}\right] = \sum_Z \log P(Y, Z \mid \theta) \cdot P(Z \mid Y, \theta^{(i)})$$

  3. M-step: compute
$$\theta^{(i+1)} = \arg\max_\theta Q(\theta, \theta^{(i)})$$

  4. Repeat steps 2 and 3 until convergence.
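To make the two steps concrete, here is a minimal sketch of the E/M loop on an assumed toy model, a mixture of two biased coins observed through head counts; the model, the initialization, and the fixed iteration count are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: each of N experiments picks a biased coin
# Z ~ Bernoulli(pi), tosses it m times, and records the head count Y.
m, N = 10, 500
true_pi, true_p = 0.6, np.array([0.2, 0.8])
z = rng.random(N) < true_pi
y = rng.binomial(m, np.where(z, true_p[1], true_p[0]))

pi, p = 0.5, np.array([0.3, 0.6])  # initial guess theta^(0)
for _ in range(100):
    # E-step: responsibilities w_j = P(Z_j = 1 | y_j, theta^(i)).
    # The binomial coefficient cancels between numerator and denominator.
    lik1 = pi * p[1] ** y * (1 - p[1]) ** (m - y)
    lik0 = (1 - pi) * p[0] ** y * (1 - p[0]) ** (m - y)
    w = lik1 / (lik1 + lik0)
    # M-step: closed-form arg max of Q(theta, theta^(i)).
    pi = w.mean()
    p = np.array([((1 - w) * y).sum() / (m * (1 - w).sum()),
                  (w * y).sum() / (m * w.sum())])

print(pi, p)  # close to (0.6, [0.2, 0.8]) up to component relabeling
```

Because the likelihood of a mixture model is generally non-convex, different initial values $\theta^{(0)}$ can lead the iteration to different stationary points.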

2. Extensions of the EM Algorithm

The Maximization-Maximization Algorithm of the F Function

$F$ function: let $\tilde{P}(Z)$ be a probability distribution over the hidden variable $Z$, and define the following function of the distribution $\tilde{P}$ and the parameter $\theta$:
$$F(\tilde{P}, \theta) = E_{\tilde{P}}\left[\log P(Y, Z \mid \theta)\right] + H(\tilde{P})$$
where $H(\tilde{P}) = -E_{\tilde{P}}\left[\log \tilde{P}(Z)\right]$ is the entropy of the distribution $\tilde{P}(Z)$.

For fixed $\theta$, maximize the $F$ function:
$$\max_{\tilde{P}} F(\tilde{P}, \theta) \qquad \text{s.t.} \quad \sum_Z \tilde{P}(Z) = 1$$

Introduce a Lagrange multiplier $\lambda$ and form the Lagrangian
$$\begin{aligned} L &= E_{\tilde{P}}\left[\log P(Y, Z \mid \theta)\right] - E_{\tilde{P}}\left[\log \tilde{P}(Z)\right] + \lambda \left( 1 - \sum_Z \tilde{P}(Z) \right) \\ &= \sum_Z \log P(Y, Z \mid \theta)\, \tilde{P}(Z) - \sum_Z \log \tilde{P}(Z)\, \tilde{P}(Z) + \lambda - \lambda \sum_Z \tilde{P}(Z) \end{aligned}$$

Taking the partial derivative with respect to $\tilde{P}(Z)$ gives
$$\frac{\partial L}{\partial \tilde{P}(Z)} = \log P(Y, Z \mid \theta) - \log \tilde{P}(Z) - 1 - \lambda$$

Setting this to zero yields
$$\lambda = \log P(Y, Z \mid \theta) - \log \tilde{P}(Z) - 1$$
and hence, writing $\tilde{P}_\theta$ for the maximizing distribution,
$$\frac{P(Y, Z \mid \theta)}{\tilde{P}_\theta(Z)} = e^{1 + \lambda}, \qquad \sum_Z P(Y, Z \mid \theta) = e^{1 + \lambda} \sum_Z \tilde{P}_\theta(Z)$$

Since $\sum_Z \tilde{P}_\theta(Z) = 1$ and $\sum_Z P(Y, Z \mid \theta) = P(Y \mid \theta)$, this gives
$$P(Y \mid \theta) = e^{1 + \lambda}$$

Substituting back, we obtain
$$\tilde{P}_\theta(Z) = \frac{P(Y, Z \mid \theta)}{P(Y \mid \theta)} = P(Z \mid Y, \theta)$$

With this choice of $\tilde{P}$,
$$\begin{aligned} F(\tilde{P}, \theta) &= E_{\tilde{P}}\left[\log P(Y, Z \mid \theta)\right] + H(\tilde{P}) \\ &= \sum_Z \log P(Y, Z \mid \theta)\, \tilde{P}(Z) - \sum_Z \log \tilde{P}(Z)\, \tilde{P}(Z) \\ &= \sum_Z \tilde{P}(Z) \log \frac{P(Y, Z \mid \theta)}{\tilde{P}(Z)} \\ &= \sum_Z \tilde{P}(Z) \log \frac{P(Z \mid Y, \theta)\, P(Y \mid \theta)}{\tilde{P}(Z)} \\ &= \log P(Y \mid \theta) \sum_Z \tilde{P}(Z) \\ &= \log P(Y \mid \theta) = L(\theta) \end{aligned}$$
where the second-to-last step uses $\tilde{P}(Z) = P(Z \mid Y, \theta)$.

Consequently, for the parameter $\theta^*$ that maximizes $F(\tilde{P}, \theta)$,
$$L(\theta^*) = F(\tilde{P}_{\theta^*}, \theta^*) = F(\tilde{P}^*, \theta^*)$$

That is, if $F(\tilde{P}, \theta)$ attains a local maximum (or the global maximum) at $(\tilde{P}^*, \theta^*)$, then $L(\theta)$ also attains a local maximum (or the global maximum) at $\theta^*$.

P ~ θ ( Z ) = P ( Z ∣ Y , θ ) \tilde{P}_{\theta} \left( Z \right) = P \left( Z | Y, \theta \right) P~θ(Z)=P(ZY,θ),对固定的 θ ( i ) \theta^{\left( i \right) } θ(i) P ~ ( i + 1 ) ( Z ) = P ~ θ ( i ) ( Z ) = P ( Z ∣ Y , θ ( i ) ) \begin{aligned} \tilde{P}^{\left( i + 1 \right)} \left( Z \right) = \tilde{P}_{\theta^{\left( i \right)}} \left( Z \right) = P \left( Z | Y, \theta^{\left( i \right) } \right)\end{aligned} P~(i+1)(Z)=P~θ(i)(Z)=P(ZY,θ(i))

使 F ( P ~ , θ ( i ) ) F \left( \tilde{P}, \theta^{\left( i \right)} \right) F(P~,θ(i))极大化,
F ( P ~ ( i + 1 ) , θ ) = E P ~ ( i + 1 ) [ log ⁡ P ( Y , Z ∣ θ ) ] + H ( P ~ ( i + 1 ) ) = ∑ Z l o g P ( Y , Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) + H ( P ~ ( i + 1 ) ) = Q ( θ , θ ( i ) ) + H ( P ~ ( i + 1 ) ) \begin{aligned} & F \left( \tilde{P}^{\left( i + 1 \right)}, \theta \right) = E_{\tilde{P}^{\left( i + 1 \right)}} \left[ \log P \left( Y, Z | \theta \right)\right] + H \left( \tilde{P}^{\left( i + 1 \right)} \right) \\ & = \sum_{Z} log P \left(Y , Z | \theta \right) P \left( Z | Y, \theta^{\left( i \right)} \right) + H \left( \tilde{P}^{\left( i + 1 \right)} \right) \\ & =Q \left( \theta, \theta^{\left( i \right)} \right) + H \left( \tilde{P}^{\left( i + 1 \right)} \right)\end{aligned} F(P~(i+1),θ)EP~(i+1)[logP(Y,Zθ)]+H(P~(i+1))=ZlogP(Y,Zθ)P(ZY,θ(i))+H(P~(i+1))=Q(θ,θ(i))+H(P~(i+1))

Fixing $\tilde{P}^{(i+1)}$ and finding $\theta^{(i+1)}$ to maximize $F(\tilde{P}^{(i+1)}, \theta)$ gives
$$\theta^{(i+1)} = \arg\max_\theta F(\tilde{P}^{(i+1)}, \theta) = \arg\max_\theta Q(\theta, \theta^{(i)})$$
since $H(\tilde{P}^{(i+1)})$ does not depend on $\theta$.

That is, the sequences of parameter estimates $\theta^{(i)}, i = 1, 2, \cdots$ obtained by the EM algorithm and by the maximization-maximization algorithm of the $F$ function coincide.

The GEM Algorithm

GEM algorithm:

  • Input: observed data $Y$, the $F$ function;
  • Output: model parameter $\theta$.
  1. Choose an initial value $\theta^{(0)}$;
  2. At the $(i+1)$-th iteration, step 1: with $\theta^{(i)}$ the current estimate of the parameter $\theta$ and $\tilde{P}^{(i)}$ the current estimate of the distribution $\tilde{P}$, find $\tilde{P}^{(i+1)}$ that maximizes $F(\tilde{P}, \theta^{(i)})$ over $\tilde{P}$;
  3. Step 2: find $\theta^{(i+1)}$ that maximizes $F(\tilde{P}^{(i+1)}, \theta)$ over $\theta$;
  4. Repeat steps 2 and 3 until convergence.

3. Application of the EM Algorithm to Learning Gaussian Mixture Models

Gaussian Mixture Models

A Gaussian mixture model is
$$P(y \mid \theta) = \sum_{k=1}^K \alpha_k\, \phi(y \mid \theta_k)$$

where the $\alpha_k$ are mixing coefficients with $\alpha_k \geq 0$ and $\sum_{k=1}^K \alpha_k = 1$, and $\phi(y \mid \theta_k)$ is the Gaussian density with $\theta_k = (\mu_k, \sigma_k^2)$:
$$\phi(y \mid \theta_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left( -\frac{(y - \mu_k)^2}{2\sigma_k^2} \right)$$

called the $k$-th component model.
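As a quick numeric illustration, the mixture density can be evaluated directly from this formula; the function name and the two-component parameters below are made up for the example:

```python
import numpy as np

def gmm_density(y, alpha, mu, sigma):
    # P(y | theta) = sum_k alpha_k * phi(y | mu_k, sigma_k)
    y = np.asarray(y, dtype=float)[..., None]  # broadcast over components k
    phi = np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return (alpha * phi).sum(axis=-1)

# e.g. a two-component mixture evaluated at two points:
print(gmm_density([0.0, 5.0],
                  alpha=np.array([0.4, 0.6]),
                  mu=np.array([0.0, 5.0]),
                  sigma=np.array([1.0, 2.0])))
```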

EM Algorithm for Estimating the Parameters of a Gaussian Mixture Model

Suppose the observed data $y_1, y_2, \cdots, y_N$ are generated by a Gaussian mixture model
$$P(y \mid \theta) = \sum_{k=1}^K \alpha_k\, \phi(y \mid \theta_k)$$
where $\theta = (\alpha_1, \alpha_2, \cdots, \alpha_K; \theta_1, \theta_2, \cdots, \theta_K)$ are the model parameters.

  1. The hidden variable $\gamma_{jk}$ is a 0-1 variable indicating whether observation $y_j$ comes from the $k$-th component model:
$$\gamma_{jk} = \begin{cases} 1, & \text{the } j\text{-th observation comes from the } k\text{-th component} \\ 0, & \text{otherwise} \end{cases} \qquad (j = 1, 2, \cdots, N;\ k = 1, 2, \cdots, K)$$

    The complete data are
$$\left( y_j, \gamma_{j1}, \gamma_{j2}, \cdots, \gamma_{jK} \right), \qquad j = 1, 2, \cdots, N$$

    The complete-data likelihood function is
$$\begin{aligned} P(y, \gamma \mid \theta) &= \prod_{j=1}^N P(y_j, \gamma_{j1}, \gamma_{j2}, \cdots, \gamma_{jK} \mid \theta) \\ &= \prod_{k=1}^K \prod_{j=1}^N \left[ \alpha_k\, \phi(y_j \mid \theta_k) \right]^{\gamma_{jk}} \\ &= \prod_{k=1}^K \alpha_k^{n_k} \prod_{j=1}^N \left[ \phi(y_j \mid \theta_k) \right]^{\gamma_{jk}} \\ &= \prod_{k=1}^K \alpha_k^{n_k} \prod_{j=1}^N \left[ \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left( -\frac{(y_j - \mu_k)^2}{2\sigma_k^2} \right) \right]^{\gamma_{jk}} \end{aligned}$$
    where $n_k = \sum_{j=1}^N \gamma_{jk}$.

    The complete-data log-likelihood is
$$\log P(y, \gamma \mid \theta) = \sum_{k=1}^K \left\{ n_k \log \alpha_k + \sum_{j=1}^N \gamma_{jk} \left[ \log \frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2} (y_j - \mu_k)^2 \right] \right\}$$

  2. The $Q(\theta, \theta^{(i)})$ function:
$$\begin{aligned} Q(\theta, \theta^{(i)}) &= E\left[ \log P(y, \gamma \mid \theta) \,\middle|\, y, \theta^{(i)} \right] \\ &= E\left\{ \sum_{k=1}^K \left\{ \sum_{j=1}^N \gamma_{jk} \log \alpha_k + \sum_{j=1}^N \gamma_{jk} \left[ \log \frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2} (y_j - \mu_k)^2 \right] \right\} \right\} \\ &= \sum_{k=1}^K \left\{ \sum_{j=1}^N E(\gamma_{jk}) \log \alpha_k + \sum_{j=1}^N E(\gamma_{jk}) \left[ \log \frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2} (y_j - \mu_k)^2 \right] \right\} \\ &= \sum_{k=1}^K \left\{ \sum_{j=1}^N \hat{\gamma}_{jk} \log \alpha_k + \sum_{j=1}^N \hat{\gamma}_{jk} \left[ \log \frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2} (y_j - \mu_k)^2 \right] \right\} \end{aligned}$$

    Here $\hat{\gamma}_{jk}$, the responsibility of component $k$ for observation $y_j$, is the probability under the current model parameters that the $j$-th observation comes from the $k$-th component:
$$\begin{aligned} \hat{\gamma}_{jk} &= E\left( \gamma_{jk} \mid y, \theta \right) = P\left( \gamma_{jk} = 1 \mid y, \theta \right) \\ &= \frac{P(\gamma_{jk} = 1, y_j \mid \theta)}{\sum_{k=1}^K P(\gamma_{jk} = 1, y_j \mid \theta)} \\ &= \frac{\alpha_k\, \phi(y_j \mid \theta_k)}{\sum_{k=1}^K \alpha_k\, \phi(y_j \mid \theta_k)}, \qquad j = 1, 2, \cdots, N;\ k = 1, 2, \cdots, K \end{aligned}$$

  3. Maximizing the $Q(\theta, \theta^{(i)})$ function with respect to $\theta$:
$$\theta^{(i+1)} = \arg\max_\theta Q(\theta, \theta^{(i)})$$

    The maximizing parameters are (derived just below)
$$\begin{aligned} \hat{\mu}_k &= \frac{\sum_{j=1}^N \hat{\gamma}_{jk}\, y_j}{\sum_{j=1}^N \hat{\gamma}_{jk}}, \qquad k = 1, 2, \cdots, K \\ \hat{\sigma}_k^2 &= \frac{\sum_{j=1}^N \hat{\gamma}_{jk} \left( y_j - \hat{\mu}_k \right)^2}{\sum_{j=1}^N \hat{\gamma}_{jk}}, \qquad k = 1, 2, \cdots, K \\ \hat{\alpha}_k &= \frac{\sum_{j=1}^N \hat{\gamma}_{jk}}{N}, \qquad k = 1, 2, \cdots, K \end{aligned}$$
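The updates for $\hat{\mu}_k$ and $\hat{\sigma}_k^2$ follow by setting the corresponding partial derivatives of $Q$ to zero; for example, for $\mu_k$:

$$\frac{\partial Q}{\partial \mu_k} = \sum_{j=1}^N \hat{\gamma}_{jk}\, \frac{y_j - \mu_k}{\sigma_k^2} = 0 \quad\Longrightarrow\quad \hat{\mu}_k = \frac{\sum_{j=1}^N \hat{\gamma}_{jk}\, y_j}{\sum_{j=1}^N \hat{\gamma}_{jk}}$$

The update for $\hat{\alpha}_k$ comes from maximizing $\sum_{k=1}^K \big( \sum_{j=1}^N \hat{\gamma}_{jk} \big) \log \alpha_k$ subject to $\sum_{k=1}^K \alpha_k = 1$ with a Lagrange multiplier.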

EM algorithm for Gaussian mixture model parameter estimation:

  • Input: observed data $y_1, y_2, \cdots, y_N$, the Gaussian mixture model;
  • Output: the parameters of the Gaussian mixture model.
  1. Choose initial values for the parameters and start the iteration;
  2. E-step: compute the responsibility of component $k$ for each observation $y_j$:
$$\hat{\gamma}_{jk} = \frac{\alpha_k\, \phi(y_j \mid \theta_k)}{\sum_{k=1}^K \alpha_k\, \phi(y_j \mid \theta_k)}, \qquad j = 1, 2, \cdots, N;\ k = 1, 2, \cdots, K$$
  3. M-step: compute the model parameters for the new iteration:
$$\begin{aligned} \hat{\mu}_k &= \frac{\sum_{j=1}^N \hat{\gamma}_{jk}\, y_j}{\sum_{j=1}^N \hat{\gamma}_{jk}}, \qquad k = 1, 2, \cdots, K \\ \hat{\sigma}_k^2 &= \frac{\sum_{j=1}^N \hat{\gamma}_{jk} \left( y_j - \hat{\mu}_k \right)^2}{\sum_{j=1}^N \hat{\gamma}_{jk}}, \qquad k = 1, 2, \cdots, K \\ \hat{\alpha}_k &= \frac{\sum_{j=1}^N \hat{\gamma}_{jk}}{N}, \qquad k = 1, 2, \cdots, K \end{aligned}$$
  4. Repeat steps 2 and 3 until convergence.
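Putting steps 1-4 together, the following is a sketch of this EM iteration in NumPy; the function name `gmm_em`, the initialization scheme, and the fixed iteration count are assumptions made for illustration, while the E- and M-step updates are exactly the formulas above:

```python
import numpy as np

def gmm_em(y, K, n_iter=200, seed=0):
    # EM for a one-dimensional Gaussian mixture, following the steps above.
    rng = np.random.default_rng(seed)
    N = len(y)
    alpha = np.full(K, 1.0 / K)                # equal mixing weights
    mu = rng.choice(y, size=K, replace=False)  # crude init: K distinct points
    sigma2 = np.full(K, y.var())               # start from the overall variance
    for _ in range(n_iter):
        # E-step: responsibilities gamma[j, k] of component k for y_j.
        phi = np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma2)) \
            / np.sqrt(2 * np.pi * sigma2)
        gamma = alpha * phi
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for mu_k, sigma_k^2, alpha_k.
        nk = gamma.sum(axis=0)
        mu = (gamma * y[:, None]).sum(axis=0) / nk
        sigma2 = (gamma * (y[:, None] - mu) ** 2).sum(axis=0) / nk
        alpha = nk / N
    return alpha, mu, sigma2

# Usage: recover two components from synthetic data.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 600), rng.normal(5.0, 2.0, 400)])
print(gmm_em(y, K=2))
```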

4. Summary

1. The EM algorithm is an iterative algorithm for maximum likelihood estimation, or maximum a posteriori estimation, of probabilistic models containing hidden variables. The data of such a model are represented by $P(Y, Z \mid \theta)$, where $Y$ is the observed-variable data, $Z$ is the hidden-variable data, and $\theta$ is the model parameter. The EM algorithm carries out maximum likelihood estimation by iteratively maximizing the log-likelihood of the observed data, $L(\theta) = \log P(Y \mid \theta)$. Each iteration consists of two steps:

E-step: compute the expectation of $\log P(Y, Z \mid \theta)$ with respect to $P(Z \mid Y, \theta^{(i)})$:

$$Q(\theta, \theta^{(i)}) = \sum_Z \log P(Y, Z \mid \theta)\, P(Z \mid Y, \theta^{(i)})$$

which is called the $Q$ function, where $\theta^{(i)}$ is the current parameter estimate;

M-step: maximize the $Q$ function to obtain the new parameter estimate:

$$\theta^{(i+1)} = \arg\max_\theta Q(\theta, \theta^{(i)})$$

When constructing a concrete EM algorithm, the key step is to define the $Q$ function. In each iteration, the EM algorithm increases the log-likelihood $L(\theta)$ by maximizing the $Q$ function.

2. The EM algorithm raises (or at least does not decrease) the likelihood of the observed data at every iteration, i.e.

$$P(Y \mid \theta^{(i+1)}) \geq P(Y \mid \theta^{(i)})$$

Under general conditions the EM algorithm converges, but it is not guaranteed to converge to the global optimum.
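This monotonicity is easy to observe empirically. Assuming the `gmm_em` sketch above, a helper like the one below (illustrative, not from the text) can track $L(\theta^{(i)}) = \log P(Y \mid \theta^{(i)})$ across iterations:

```python
import numpy as np

def log_likelihood(y, alpha, mu, sigma2):
    # L(theta) = sum_j log( sum_k alpha_k * phi(y_j | mu_k, sigma_k^2) )
    phi = np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma2)) \
        / np.sqrt(2 * np.pi * sigma2)
    return np.log((alpha * phi).sum(axis=1)).sum()

# Evaluated after every E/M update, the sequence L(theta^(0)), L(theta^(1)), ...
# is nondecreasing (up to floating-point error), though it may plateau at a
# local maximum rather than the global one.
```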

3. The EM algorithm is applied extremely widely, mainly to learning probabilistic models with hidden variables. Parameter estimation for Gaussian mixture models is one important application of the EM algorithm; the unsupervised learning of hidden Markov models, introduced in the next chapter, is another.

4. The EM algorithm can also be interpreted as a maximization-maximization algorithm of the $F$ function. There are many variants of the EM algorithm, such as the GEM algorithm. The characteristic of the GEM algorithm is that each iteration increases the value of the $F$ function (without necessarily maximizing it), and thereby increases the likelihood.
