[Machine Learning Algorithm Derivations] Gaussian Mixture Models (GMM) and the EM Algorithm

1. Gaussian Mixture Models

Maximum likelihood estimation (MLE) is a widely used method for parameter estimation. Given height data from a single region and the knowledge that height follows a Gaussian distribution, MLE lets us estimate the two Gaussian parameters $\mu$ and $\sigma$.
What if we have several groups of data and several models? Suppose we now have height data from many provinces across the country, but we do not know which province each sample comes from; we only know that each province follows its own Gaussian distribution. The resulting model is called a Gaussian Mixture Model (GMM):
$$P(x;\theta)=\sum_{k=1}^{K}p_k\,N(x;\mu_k,\Sigma_k) \qquad (1)$$
Plain MLE no longer recovers the parameters easily, because the model contains a latent variable: the label of each sample. Denote it by $Z$; the log-likelihood then takes the form
$$L(\theta)=\sum_{i=1}^m\log p(x_i;\theta)=\sum_{i=1}^m\log \int p(x_i,z;\theta)\,dz \qquad (2)$$
Differentiating (2) directly, as plain MLE would require, means differentiating expressions of the form $\log (y_1+y_2+\dots)$, which quickly becomes intractable.
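As a quick sanity check on Eqs. (1) and (2), here is a minimal sketch (plain numpy; the two-component parameters are made-up illustrative values, not estimated from any data) that evaluates a one-dimensional GMM density and the log-likelihood of a few samples. Note how the sum over components sits inside the logarithm, which is exactly what makes direct differentiation awkward.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(x; mu, sigma^2) for scalar or array x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def gmm_density(x, weights, mus, sigmas):
    """Eq. (1): p(x; theta) = sum_k p_k * N(x; mu_k, sigma_k^2)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# Made-up parameters for two "provinces" (illustrative only).
weights = [0.4, 0.6]
mus     = [165.0, 175.0]
sigmas  = [5.0, 6.0]

x = np.array([160.0, 170.0, 180.0])
# Eq. (2): the log of a *sum* of component densities.
log_likelihood = np.sum(np.log(gmm_density(x, weights, mus, sigmas)))
print(log_likelihood)
```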

2. Jensen's Inequality

To solve this we turn to the EM algorithm. Before introducing EM, let us first recall Jensen's inequality, defined as follows:
If $X$ is a random variable and $f$ is a convex function (second derivative $\ge 0$), then
$$E[f(X)] \ge f(E[X])$$
If $f$ is concave, the inequality is reversed:
$$E[f(X)] \le f(E[X])$$
Equality holds if and only if $X$ is a constant.
(Figure: a convex curve $f$ with two points $x=a$ and $x=b$, illustrating Jensen's inequality.)
The figure gives an intuitive picture of Jensen's inequality. For the two points $x=a$ and $x=b$, $E[f(X)]$ can be read as $\frac{1}{2}(f(a)+f(b))$, while $f(E[X])$ is $f\!\left(\frac{a+b}{2}\right)$, the function value at the midpoint of $a$ and $b$. Since the curve is convex, it is easy to see that $E[f(X)] \ge f(E[X])$.
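A quick numerical check of the concave case (a minimal sketch; the lognormal samples are an arbitrary choice of positive random variable): with $f=\log$, Jensen's inequality says $E[\log X] \le \log E[X]$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # any positive random variable works

lhs = np.mean(np.log(x))     # E[f(X)] with f = log (concave)
rhs = np.log(np.mean(x))     # f(E[X])
print(lhs, rhs, lhs <= rhs)  # Jensen: E[log X] <= log E[X]
```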

3. The EM Algorithm and Its Derivation

Here Jensen's inequality comes in. Since $\log$ is concave, Eq. (2) can be rewritten as
$$\begin{aligned} \sum_{i=1}^m\log \int p(x_i,z;\theta)\,dz &=\sum_{i=1}^m\log \int Q(z)\,\frac{p(x_i,z;\theta)}{Q(z)}\,dz\\ &\ge \sum_{i=1}^m\int Q(z)\log \frac{p(x_i,z;\theta)}{Q(z)}\,dz \qquad (3) \end{aligned}$$
Since $Q(z)$ is a probability distribution, it satisfies
$$\sum_z Q(z)=1 \qquad (4)$$
By Jensen's inequality, equality holds when
$$\frac{p(x_i,z;\theta)}{Q(z)}=C \qquad (5)$$
Combining (4) and (5),
$$\sum_{z}Q(z)=\sum_{z}\frac{p(x_i,z;\theta)}{C}=1 \qquad (6)$$
$$Q(z)=\frac{p(x_i,z;\theta)}{C}=\frac{p(x_i,z;\theta)}{\sum_{z}p(x_i,z;\theta)}=\frac{p(x_i,z;\theta)}{p(x_i)}=p(z|x_i;\theta) \qquad (7)$$
Substituting (7) back into (3) gives
$$\sum_{i=1}^m\log\int p(x_i,z;\theta)\,dz\ \ge\ \sum_{i=1}^m\int p(z|x_i;\theta)\log \frac{p(x_i,z;\theta)}{p(z|x_i;\theta)}\,dz$$
At each iteration the posterior $p(z|x_i;\theta_t)$ is computed with the current parameters $\theta_t$ and then held fixed, so the denominator contributes only a constant and can be dropped from the maximization. Setting
$$\hat \theta=\arg \max_\theta\sum_{i=1}^m\int p(z|x_i;\theta_t)\log \frac{p(x_i,z;\theta)}{p(z|x_i;\theta_t)}\,dz=\arg \max_\theta\sum_{i=1}^m\int p(z|x_i;\theta_t)\log p(x_i,z;\theta)\,dz$$
at every step keeps increasing the likelihood. The EM algorithm is therefore, in essence, an iterative procedure that keeps raising a lower bound and, with it, the likelihood itself.
The EM algorithm thus proceeds as follows:

1. Initialize the parameters $\theta_0$ and start iterating.
2. E-step: compute $Q(\theta,\theta_i)=E_{z|x,\theta_i}[\log P(x,z|\theta)]=\int\log P(x,z|\theta)\,P(z|x,\theta_i)\,dz$
3. M-step: find the $\theta$ that maximizes $Q(\theta,\theta_i)$, which becomes the new parameter $\theta_{i+1}$
4. Repeat steps 2 and 3 until convergence (a minimal code sketch of this loop follows)
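A minimal sketch of this loop for a model with a discrete latent variable $z\in\{0,\dots,K-1\}$. The two helpers `complete_data_log_p` and `maximize_Q` are placeholders you would supply for a concrete model (e.g. the GMM of section 7); they are assumptions of this sketch, not part of the derivation above.

```python
import numpy as np
from scipy.special import logsumexp

def em(x, theta0, K, complete_data_log_p, maximize_Q, n_iter=100, tol=1e-8):
    """Generic EM loop.

    complete_data_log_p(x, z, theta) -> log P(x_i, z | theta) for every sample, shape (N,)
    maximize_Q(x, posterior)         -> theta maximizing sum_i sum_z posterior[i, z] * log P(x_i, z | theta)
    """
    theta, prev_ll = theta0, -np.inf
    for _ in range(n_iter):
        # E-step: posterior P(z | x_i, theta) via Bayes' rule, Eq. (7).
        log_joint = np.stack([complete_data_log_p(x, z, theta) for z in range(K)], axis=1)  # (N, K)
        log_evidence = logsumexp(log_joint, axis=1, keepdims=True)                          # log p(x_i; theta)
        posterior = np.exp(log_joint - log_evidence)
        # M-step: maximize Q(theta, theta_old) with the posterior held fixed.
        theta = maximize_Q(x, posterior)
        # The observed-data log-likelihood, Eq. (2); stop once it stalls.
        ll = float(log_evidence.sum())
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta
```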

4. Validity of the EM Algorithm

Why does EM approximately perform maximum likelihood estimation on the observed data? We need to show that the log-likelihood keeps increasing across iterations. Suppose we are at iteration $i$ with parameters $\theta_i$, and we want the new estimate $\theta$ to increase $L(\theta)$, i.e. $L(\theta)\gt L(\theta_i)$. Then:
$$\begin{aligned} L(\theta)-L(\theta_{i}) &=\log \Big(\sum_Z p(x|z,\theta)p(z|\theta)\Big)-\log p(x|\theta_{i}) \\ &=\log \sum_Z p(z|x,\theta_{i})\,\frac{p(x|z,\theta)p(z|\theta)}{p(z|x,\theta_{i})}-\log p(x|\theta_{i})\\ &\ge \sum_Z p(z|x,\theta_{i})\log\frac{p(x|z,\theta)p(z|\theta)}{p(z|x,\theta_{i})}-\log p(x|\theta_{i})\\ &= \sum_Z p(z|x,\theta_{i})\log\frac{p(x|z,\theta)p(z|\theta)}{p(z|x,\theta_{i})\,p(x|\theta_{i})} \qquad (8) \end{aligned}$$
Let
$$B(\theta,\theta_{i})=L(\theta_{i})+\sum_Z p(z|x,\theta_{i})\log\frac{p(x|z,\theta)p(z|\theta)}{p(z|x,\theta_{i})\,p(x|\theta_{i})} \qquad (9)$$
Then $L(\theta)\ge B(\theta,\theta_{i})$, so $B(\theta,\theta_{i})$ is a lower bound on $L(\theta)$, and moreover $L(\theta_{i})=B(\theta_{i},\theta_{i})$. To make $L(\theta)$ as large as possible, we solve
$$\begin{aligned} \arg \max_\theta B(\theta,\theta_{i}) &=\arg \max_\theta\Big[ L(\theta_{i})+ \sum_Z p(z|x,\theta_{i})\log\frac{p(x|z,\theta)p(z|\theta)}{p(z|x,\theta_{i})\,p(x|\theta_{i})}\Big] \\ &= \arg \max_\theta \sum_Z p(z|x,\theta_{i})\log \big(p(x|z,\theta)p(z|\theta)\big) \\ &= \arg \max_\theta \sum_Z p(z|x,\theta_{i})\log p(x,z|\theta)\\ &= \arg \max_\theta Q(\theta,\theta_i) \qquad (10) \end{aligned}$$
This is exactly one EM iteration, which shows that EM approaches the maximum of the log-likelihood by repeatedly maximizing a lower bound.

5. Convergence of the EM Algorithm

Convergence can be shown with a monotone-and-bounded argument. $P(x|\theta)$ is the likelihood of the observed data, and we want to prove monotonicity, i.e.
$$P(x|\theta_{i+1}) \ge P(x| \theta_i) \qquad (11)$$
Proof:
$$P(x|\theta)=\frac{P(x,z|\theta)}{P(z|x,\theta)}\iff \log P(x|\theta) = \log P(x,z|\theta) - \log P(z|x,\theta) \qquad (12)$$
Let
$$H(\theta,\theta_i)=\sum_z\log P(z|x,\theta)\,P(z|x,\theta_i) \qquad (13)$$
and, by the definition of the Q function,
$$Q(\theta,\theta_i)=\sum_z\log P(x,z|\theta)\,P(z|x,\theta_i)$$
Multiplying both sides of (12) by $P(z|x,\theta_i)$ and summing over $z$ (the left side is unchanged since $\sum_z P(z|x,\theta_i)=1$), the log-likelihood can be written as
$$\log P(x|\theta) =Q(\theta,\theta_i)-H(\theta,\theta_i) \qquad (14)$$
Evaluating this identity at $\theta_{i+1}$ and $\theta_i$ and subtracting,
$$\log P(x|\theta_{i+1})- \log P(x|\theta_i) =[Q(\theta_{i+1},\theta_i)-Q(\theta_i,\theta_i)]-[H(\theta_{i+1},\theta_i)-H(\theta_i,\theta_i)] \qquad (15)$$
Since $\theta_{i+1}$ maximizes $Q(\theta,\theta_i)$,
$$Q(\theta_{i+1},\theta_i)-Q(\theta_i,\theta_i) \ge 0 \qquad (16)$$
And from (13) together with Jensen's inequality,
$$\begin{aligned} H(\theta_{i+1},\theta_i)-H(\theta_i,\theta_i) &=\sum_z\left(\log \frac{P(z|x,\theta_{i+1})}{P(z|x,\theta_i)}\right)P(z|x,\theta_i) \\ &\le \log \left(\sum_z \frac{P(z|x,\theta_{i+1})}{P(z|x,\theta_i)}P(z|x,\theta_i)\right) \\ &=\log \sum_z P(z|x,\theta_{i+1})=\log 1=0 \qquad (17) \end{aligned}$$
Hence
$$\log P(x|\theta_{i+1})- \log P(x|\theta_i) \ge 0 \qquad (18)$$
Furthermore, since $P(x|\theta_i) \le 1$ the sequence is bounded above, so $L(\theta_i)=\log P(x|\theta_i)$ must converge.
Because EM is an iterative scheme, it is guaranteed to converge to a stationary point, but not necessarily to the global maximum: it is a locally optimal algorithm. Of course, if the objective $B(\theta,\theta_i)$ is convex (so the problem has a single optimum), EM does reach the global maximum, just as iterative methods such as gradient descent do. The choice of initial values therefore matters a great deal; a common practice is to try several initializations and keep the best result, as sketched below.
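A sketch of that restart strategy, assuming a hypothetical fitting routine `gmm_em(X, K, rng)` that returns the fitted parameters together with the per-iteration log-likelihood history (one such routine is sketched in section 7 below). The assertion simply checks the monotonicity established in Eq. (18).

```python
import numpy as np

def fit_with_restarts(X, K, n_restarts=10, seed=0):
    """Run EM from several random initializations and keep the best run."""
    best_params, best_ll = None, -np.inf
    for r in range(n_restarts):
        rng = np.random.default_rng(seed + r)
        params, ll_history = gmm_em(X, K, rng)  # hypothetical EM routine, see section 7
        # Eq. (18): within a run the log-likelihood never decreases.
        assert all(b >= a - 1e-9 for a, b in zip(ll_history, ll_history[1:]))
        if ll_history[-1] > best_ll:
            best_params, best_ll = params, ll_history[-1]
    return best_params, best_ll
```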

6. An Alternative Derivation of EM

Starting again from the identity (12), for an arbitrary distribution $q(z)$:
$$\begin{aligned} \log P(x|\theta) &=\log P(x,z|\theta)-\log P(z|x,\theta) \\ &=\log \frac{P(x,z|\theta)}{q(z)}-\log \frac{P(z|x,\theta)}{q(z)} \end{aligned}$$
Multiply both sides by $q(z)$ and integrate over $z$. The left-hand side gives
$$\mathrm{left}=\int \log P(x|\theta)\, q(z)\,dz=\log P(x|\theta)\int q(z)\,dz= \log P(x|\theta)$$
and the right-hand side gives
$$\mathrm{right}=\int_z q(z)\log \frac{P(x,z|\theta)}{q(z)}\,dz-\int_z q(z)\log \frac{P(z|x,\theta)}{q(z)}\,dz$$
By the definition of relative entropy (KL divergence), for two distributions $p$ and $q$ the relative entropy of $p$ with respect to $q$ is
$$D(p\,\|\,q)=\sum_{i=1}^n p(x_i)\log \frac{p(x_i)}{q(x_i)}$$
Notice that the second term of `right`, together with its minus sign, is exactly the relative entropy of $q(z)$ with respect to $P(z|x,\theta)$, so
$$\log P(x|\theta) =\int_z q(z)\log \frac{P(x,z|\theta)}{q(z)}\,dz+D\big(q(z)\,\|\,p(z|x,\theta)\big)$$
Relative entropy has two properties: it is not symmetric, i.e. $D(p\|q)\ne D(q\|p)$, and $D(p\|q) \ge 0$, with equality exactly when $p$ and $q$ are the same distribution. In other words, the first term is a lower bound on $\log p(x|\theta)$, called the ELBO (evidence lower bound). At the current estimate $\theta_i$ we choose $q(z)=P(z|x,\theta_i)$, which makes the KL term vanish so the bound is tight; maximizing the ELBO over $\theta$ then increases $\log p(x|\theta)$:
$$\begin{aligned} \hat \theta &=\arg \max_\theta \int_z q(z)\log \frac{P(x,z|\theta)}{q(z)}\,dz \\ &=\arg \max_\theta \int_z P(z|x,\theta_i)\log \frac{P(x,z|\theta)}{P(z|x,\theta_i)}\,dz \\ &=\arg \max_\theta \int_z P(z|x,\theta_i)\log P(x,z|\theta)\,dz \end{aligned}$$
which recovers the same update rule for $\theta$.
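The decomposition $\log P(x|\theta) = \mathrm{ELBO}(q) + D(q\,\|\,p(z|x,\theta))$ is easy to verify numerically for a tiny discrete model. A minimal sketch (the joint probability table and the choice of $q$ are arbitrary):

```python
import numpy as np

# Arbitrary joint P(x, z | theta) for one observed x and K = 3 latent states.
p_xz = np.array([0.10, 0.25, 0.05])    # P(x, z) for z = 0, 1, 2
p_x = p_xz.sum()                        # evidence P(x | theta)
posterior = p_xz / p_x                  # P(z | x, theta)

q = np.array([0.2, 0.5, 0.3])           # any distribution q(z)

elbo = np.sum(q * np.log(p_xz / q))     # ELBO: sum of q(z) log[P(x, z | theta) / q(z)]
kl = np.sum(q * np.log(q / posterior))  # D(q || p(z | x, theta)) >= 0

print(np.log(p_x), elbo + kl)           # the two numbers coincide
print(elbo <= np.log(p_x))              # ELBO is a lower bound, tight when q equals the posterior
```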

7. Solving a GMM with the EM Algorithm

Let us now apply EM to the GMM. The model is
$$p(x|\theta)=\sum_{k=1}^K p_k\,N(x|\mu_k,\Sigma_k) \qquad (19)$$
The joint probability is
$$p(x,z)=p(z)\,p(x|z)=p_z\,N(x|\mu_z,\Sigma_z)$$
and the posterior is
$$p(z|x)=\frac{p(x,z)}{p(x)}=\frac{p_z\,N(x|\mu_z,\Sigma_z)}{\sum_{k=1}^K p_k\,N(x|\mu_k,\Sigma_k)}$$
In the EM algorithm we have
$$\begin{aligned} Q(\theta,\theta_t) &=\int_z\log P(x,z|\theta)\,P(z|x,\theta_t)\,dz\\ &=\sum_z\Big[\log \prod_{i=1}^N P(x_i,z_i|\theta)\Big]\prod_{i=1}^N P(z_i|x_i,\theta_t)\\ &=\sum_{z_1,z_2,\dots,z_N}\Big[\sum_{i=1}^N\log P(x_i,z_i|\theta)\Big]\prod_{i=1}^N P(z_i|x_i,\theta_t)\\ &=\sum_{z_1,z_2,\dots,z_N}\big(\log P(x_1,z_1|\theta)+\log P(x_2,z_2|\theta)+\dots+\log P(x_N,z_N|\theta)\big)\prod_{i=1}^N P(z_i|x_i,\theta_t) \end{aligned}$$
Consider one of these terms:
$$\begin{aligned} &\sum_{z_1,z_2,\dots,z_N}\log P(x_1,z_1|\theta)\prod_{i=1}^N P(z_i|x_i,\theta_t)\\ &=\sum_{z_1,z_2,\dots,z_N}\log P(x_1,z_1|\theta)\,P(z_1|x_1,\theta_t)\prod_{i=2}^N P(z_i|x_i,\theta_t) \\ &=\sum_{z_1}\log P(x_1,z_1|\theta)\,P(z_1|x_1,\theta_t)\sum_{z_2,\dots,z_N}\prod_{i=2}^N P(z_i|x_i,\theta_t) \\ &=\sum_{z_1}\log P(x_1,z_1|\theta)\,P(z_1|x_1,\theta_t)\sum_{z_2}P(z_2|x_2,\theta_t)\sum_{z_3}P(z_3|x_3,\theta_t)\cdots\sum_{z_N}P(z_N|x_N,\theta_t) \\ &=\sum_{z_1}\log P(x_1,z_1|\theta)\,P(z_1|x_1,\theta_t) \end{aligned}$$
Therefore
$$\begin{aligned} Q(\theta,\theta_t) &=\sum_{z_1}\log P(x_1,z_1|\theta)P(z_1|x_1,\theta_t)+\dots+\sum_{z_N}\log P(x_N,z_N|\theta)P(z_N|x_N,\theta_t) \\ &=\sum_{i=1}^N\sum_{z_i}\log P(x_i,z_i|\theta)\,P(z_i|x_i,\theta_t) \\ &=\sum_{i=1}^N\sum_{z_i}\log\big(p_{z_i}\,N(x_i|\mu_{z_i},\Sigma_{z_i})\big)\,P(z_i|x_i,\theta_t)\\ &=\sum_{k=1}^K\sum_{i=1}^N\log\big(p_k\,N(x_i|\mu_{k},\Sigma_{k})\big)\,P(z_i=c_k|x_i,\theta_t)\\ &=\sum_{k=1}^K\sum_{i=1}^N\big(\log p_k+\log N(x_i|\mu_{k},\Sigma_{k})\big)\,P(z_i=c_k|x_i,\theta_t) \end{aligned}$$
Next we solve for $\theta_{t+1}=(p_k,\mu_k,\Sigma_k)$:
$$p_k^{t+1}=\arg \max_{p_k}\sum_{k=1}^K\sum_{i=1}^N\log p_k\;P(z_i=c_k|x_i,\theta_t),\quad \text{s.t.}\ \sum_{k=1}^K p_k=1$$
$$\mu_k^{t+1}=\arg \max_{\mu_k}\sum_{k=1}^K\sum_{i=1}^N\log N(x_i|\mu_{k},\Sigma_{k})\;P(z_i=c_k|x_i,\theta_t)$$
$$\Sigma_k^{t+1}=\arg \max_{\Sigma_k}\sum_{k=1}^K\sum_{i=1}^N\log N(x_i|\mu_{k},\Sigma_{k})\;P(z_i=c_k|x_i,\theta_t)$$
Expanding these objectives and setting the derivatives to zero yields the optima; for $p_k^{t+1}$ the constraint is handled with a Lagrange multiplier. Repeat the E-step and M-step above until the log-likelihood stops changing appreciably.
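Putting everything together: the E-step evaluates the posterior responsibilities $P(z_i=c_k|x_i,\theta_t)$ given above, and the three maximizations have closed-form solutions: $p_k$ becomes the average responsibility, $\mu_k$ a responsibility-weighted mean, and $\Sigma_k$ a responsibility-weighted covariance. Below is a self-contained sketch in plain numpy; the initialization scheme, the small ridge added to the covariances, and the helper names are my own illustrative choices rather than anything prescribed by the derivation.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """log N(x | mu, Sigma) for each row of X, shape (N, d)."""
    d = X.shape[1]
    diff = X - mu
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, diff.T)                 # L^{-1} (x - mu)
    maha = np.sum(sol ** 2, axis=0)                  # Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2 * np.pi) + log_det + maha)

def gmm_em(X, K, rng, n_iter=200, tol=1e-8):
    N, d = X.shape
    # Crude initialization: uniform weights, K random points as means, shared covariance.
    p = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.atleast_2d(np.cov(X.T)) + 1e-6 * np.eye(d) for _ in range(K)])

    ll_history = []
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] = P(z_i = c_k | x_i, theta_t).
        log_w = np.log(p) + np.stack([log_gaussian(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        log_norm = np.logaddexp.reduce(log_w, axis=1, keepdims=True)
        gamma = np.exp(log_w - log_norm)

        # M-step: closed-form maximizers of Q(theta, theta_t).
        Nk = gamma.sum(axis=0)                        # effective number of points per component
        p = Nk / N                                    # p_k^{t+1} (Lagrange-multiplier solution)
        mu = (gamma.T @ X) / Nk[:, None]              # mu_k^{t+1}: responsibility-weighted means
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)

        ll_history.append(float(log_norm.sum()))      # observed-data log-likelihood L(theta_t)
        if len(ll_history) > 1 and ll_history[-1] - ll_history[-2] < tol:
            break
    return (p, mu, Sigma), ll_history
```

The recorded log-likelihood values are non-decreasing across iterations, matching Eq. (18), and returning the history lets the restart wrapper from section 5 compare runs.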
