《统计学习方法》 Chapter 9: The EM Algorithm and Its Extensions (Reading Notes)

9. The EM Algorithm and Its Extensions

  • The EM algorithm (expectation maximization algorithm)
  • is an unsupervised learning method
  • is a maximum likelihood estimation method for the parameters of probabilistic models with latent (missing) data
  • Each iteration of the algorithm has two steps:
    • E: compute an expectation
    • M: maximize it

In fact, k-means clustering and Gaussian mixture model estimation can both be viewed as instances of EM.
A clustering model tries to infer, from the internal structure of the data, how many groups the data fall into and which group each sample belongs to.

9.1 Concepts
  • Incomplete data: the observed data $X$ (the data obtained by observing the random variable)
  • Complete data: the observed data $X$ together with the latent random variable data $Z$
    My understanding: for a clustering model, the latent random variable is the class each sample belongs to.
9.1.1 Model Interpretation

Let the model parameter to be estimated be $\theta$.
For example, for k-means, $\theta$ is the cluster centers $\mu_1,\cdots,\mu_k$, and the latent variable $Z$ is the final assignment among the $k$ clusters $1,\cdots,k$.
For a Gaussian mixture, $\theta$ is the parameters of the Gaussian components $\alpha_i,\mu_i,\Sigma_i$, and the latent variable $Z$ is which of the $K$ components $1,\cdots,K$ generated the sample.

  • The true class $z_i$ of each sample $x_i$ is a latent random variable and is unknown, so the EM algorithm proceeds as follows (see the k-means sketch after this list):
    • Initialize $\theta^{(0)}$
    • E step: compute $E(z_i)$, which can be expressed in terms of $\theta^{(n)}$; that is, compute the expectation $E(z_i)$ of each sample's true class given the current parameter value $\theta^{(n)}$.
      For k-means, this step computes the likely assignment $\hat z_i$ of each sample, given the current cluster centers $\mu_1^{(n)},\cdots,\mu_k^{(n)}$.
    • M step: substitute $E(z_i)$ for $z_i$ in $L(\theta)$, and find the $\theta$ that maximizes the likelihood in this iteration, i.e.
      $\theta^{(n+1)}=\arg \max_{\theta}L(\theta)$
      For k-means, this means partitioning the samples by the previous round's cluster centers, then updating each center to the centroid of its cluster.
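As a concrete illustration of this E/M alternation, here is a minimal k-means sketch written in that form (plain NumPy; the function and variable names are my own, not from the book):

```python
import numpy as np

def kmeans_em(X, k, n_iter=100, seed=0):
    """Minimal k-means in EM form: the E step assigns clusters, the M step recomputes centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initialize theta^(0)
    for _ in range(n_iter):
        # E step: hard-assign each sample to its nearest center (its estimated z_i)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # M step: move each center to the centroid of the samples assigned to it
        new_centers = np.array([X[z == j].mean(axis=0) if np.any(z == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # stop once the centers no longer move
            break
        centers = new_centers
    return centers, z
```

Here the E step uses a hard assignment rather than a full expectation, which is why k-means is often described as a degenerate (hard-assignment) form of EM.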
9.1.2 The Likelihood Function

When the data are complete:

  • The joint probability distribution of $X$ and $Z$ is $P(x,z|\theta)$
  • The likelihood function:
    $P(X,Z|\theta) = \prod_{i=1}^m P(x_i,z_i|\theta)$
  • The log-likelihood function:
    $L(\theta)=\log P(X,Z|\theta)=\log \prod_{i=1}^m P(x_i,z_i|\theta) = \sum_{i=1}^m\log P(x_i,z_i|\theta)$

When the data are incomplete:

  • The likelihood function, assuming the data set contains m samples:
    $l = P(X|\theta) = \prod_{i=1}^m P(x_i|\theta) =\prod_{i=1}^m\Big( \sum_{z_i=1}^k P(x_i,z_i|\theta)\Big)$
    where the inner sum runs over the possible values $z_i=1,\cdots,k$
  • The log-likelihood function (a small numeric sketch follows this list):
    $L(\theta)=\log P(X|\theta)= \sum_{i=1}^m\log P(x_i|\theta) = \sum_{i=1}^m\log \sum_{z_i=1}^k P(x_i,z_i|\theta)$
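To make the incomplete-data log-likelihood concrete, here is a minimal sketch that evaluates it for a hypothetical two-component 1-D Gaussian mixture (all parameter values below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 2-component mixture: P(x_i, z_i=j | theta) = alpha_j * phi(x_i | mu_j, sigma_j)
alphas = np.array([0.4, 0.6])
mus    = np.array([0.0, 5.0])
sigmas = np.array([1.0, 2.0])

x = np.array([0.5, 4.2, -1.0, 6.3])  # toy observed samples

# L(theta) = sum_i log sum_j alpha_j * phi(x_i | theta_j)
joint = alphas * norm.pdf(x[:, None], loc=mus, scale=sigmas)  # shape (m, k)
log_lik = np.log(joint.sum(axis=1)).sum()
print(log_lik)
```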
9.2 The EM Algorithm
9.2.1 Derivation

EM maximizes the likelihood step by step by iterating on the value of $\theta$.
Suppose the value at the n-th iteration is $\theta^{(n)}$. When updating $\theta$, we want $L(\theta) - L(\theta^{(n)})>0$, so that the likelihood increases at each step.
Consider the difference:
$$
\begin{aligned}
L(\theta)-L(\theta^{(n)}) &= \sum_{i=1}^m\log \sum_{z_i=1}^k P(x_i,z_i|\theta) - \sum_{i=1}^m\log P(x_i|\theta^{(n)})\\
&=\sum_{i=1}^m\log\Big[\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\frac{P(x_i,z_i|\theta)}{P(z_i|x_i,\theta^{(n)})}\Big] -\sum_{i=1}^m\log P(x_i|\theta^{(n)})
\end{aligned}
$$

  • By Jensen's inequality, when $f(x)$ is concave, $f(\sum_i \alpha_i x_i) \geqslant \sum_i \alpha_i f(x_i)$ for weights $\alpha_i \geqslant 0$ with $\sum_i \alpha_i = 1$; here $f(x)=\log x$ is concave, so the inequality applies
  • $\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})=1$, so these conditional probabilities can serve as the weights

Using these two properties, we get:
$$
\begin{aligned}
L(\theta)-L(\theta^{(n)}) &= \sum_{i=1}^m\log\Big[\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\frac{P(x_i,z_i|\theta)}{P(z_i|x_i,\theta^{(n)})}\Big] -\sum_{i=1}^m\log P(x_i|\theta^{(n)})\\
&\geqslant \sum_{i=1}^m \Big[\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta)}{P(z_i|x_i,\theta^{(n)})} - \sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\log P(x_i|\theta^{(n)})\Big]\\
&= \sum_{i=1}^m \Big[\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta)}{P(z_i|x_i,\theta^{(n)})P(x_i|\theta^{(n)})} \Big]\\
&= \sum_{i=1}^m \Big[\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta)}{P(x_i,z_i|\theta^{(n)})} \Big]
\end{aligned}
$$
Let
$$
B(\theta,\theta^{(n)}) = L(\theta^{(n)}) + \sum_{i=1}^m \Big[\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta)}{P(x_i,z_i|\theta^{(n)})} \Big]
$$
Then $L(\theta) \geqslant B(\theta,\theta^{(n)})$.
Equality holds when $\theta$ takes the value $\theta^{(n)}$. Proof:
$$
\begin{aligned}
B(\theta^{(n)},\theta^{(n)}) & = L(\theta^{(n)}) + \sum_{i=1}^m \Big[\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta^{(n)})}{P(x_i,z_i|\theta^{(n)})} \Big]\\
&= L(\theta^{(n)})
\end{aligned}
$$
Thus $B(\theta,\theta^{(n)})$ is a lower bound of $L(\theta)$: if we can maximize $B(\theta,\theta^{(n)})$, we can also increase $L(\theta)$.
Update $\theta^{(n+1)}$ to the value that maximizes $B(\theta,\theta^{(n)})$:
$$
\begin{aligned}
\theta^{(n+1)} &=\arg \max_{\theta} B(\theta,\theta^{(n)}) \\
&=\arg \max_{\theta} \Big\{ L(\theta^{(n)}) + \sum_{i=1}^m \Big[\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\log\frac{P(x_i,z_i|\theta)}{P(x_i,z_i|\theta^{(n)})} \Big] \Big\}\\
&= \arg \max_{\theta} \sum_{i=1}^m \Big[\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\log P(x_i,z_i|\theta)\Big]+\mathrm{constant} \\
&= \arg \max_{\theta} Q(\theta,\theta^{(n)})
\end{aligned}
$$
where $Q(\theta,\theta^{(n)})=\sum_{i=1}^m \Big[\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\log P(x_i,z_i|\theta)\Big] = \sum_{i=1}^m E\big(\log P(x_i,z_i|\theta)\,\big|\,x_i,\theta^{(n)}\big)$,
and $P(z_i|x_i,\theta^{(n)})$ is the conditional distribution of the latent variable $z_i$ given the observed data $x_i$ and the current parameter $\theta^{(n)}$.
The book's figure (omitted here) shows the relationship between the incomplete-data log-likelihood $L(\theta)$ and the lower bound $B(\theta,\theta^{(n)})$:

at $\theta = \theta^{(n)}$, construct the curve $B(\theta,\theta^{(n)})$ and find its maximizer $\theta^{(n+1)}$; then construct the curve $B(\theta,\theta^{(n+1)})$ at $\theta^{(n+1)}$, and continue the cycle.
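The lower-bound property is easy to verify numerically. The sketch below checks $L(\theta) \geqslant B(\theta,\theta^{(n)})$ on a hypothetical two-component mixture, varying only $\mu_1$ (all values are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

# Toy check of L(theta) >= B(theta, theta^(n)) for a 2-component mixture.
x = np.array([0.5, 4.2, -1.0, 6.3])
alphas, sigmas = np.array([0.4, 0.6]), np.array([1.0, 2.0])

def log_lik(mus):
    return np.log((alphas * norm.pdf(x[:, None], mus, sigmas)).sum(axis=1)).sum()

mu_n = np.array([0.0, 5.0])                           # current theta^(n) (means only)
joint_n = alphas * norm.pdf(x[:, None], mu_n, sigmas)
post_n = joint_n / joint_n.sum(axis=1, keepdims=True)  # P(z_i | x_i, theta^(n))

def B(mus):
    joint = alphas * norm.pdf(x[:, None], mus, sigmas)
    return log_lik(mu_n) + (post_n * np.log(joint / joint_n)).sum()

for m1 in np.linspace(-3, 3, 13):                     # vary mu_1, keep mu_2 fixed
    mus = np.array([m1, 5.0])
    assert log_lik(mus) >= B(mus) - 1e-9              # the lower bound holds everywhere
```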

9.2.2 The Algorithm

(With different initial values, the EM algorithm may produce different parameter estimates.)

Input: observed variable data X, latent variable data Z, the joint distribution $P(X,Z|\theta)$, and the conditional distribution $P(Z|X,\theta)$ (the conditional is used to take the expectation over Z, which is substituted into the joint distribution so that the unknown parameters can be maximized)
Output: model parameter $\theta$

  • 1) Choose an initial parameter value $\theta^{(0)}$ and start iterating;
  • 2) E step: with $\theta^{(n)}$ the parameter estimate from the n-th iteration, in the E step of iteration n+1 compute
    $Q(\theta,\theta^{(n)})=\sum_{i=1}^m \Big[\sum_{z_i=1}^k P(z_i|x_i,\theta^{(n)})\log P(x_i,z_i|\theta)\Big]$
    Here $P(z_i|x_i,\theta^{(n)})$ is the conditional probability distribution of the latent variable $z_i$ given the observed data $x_i$ and the current parameter estimate $\theta^{(n)}$
  • 3) M step: find the $\theta$ that maximizes $Q(\theta,\theta^{(n)})$, which becomes the (n+1)-th parameter estimate $\theta^{(n+1)}$:
    $\theta^{(n+1)} = \arg \max_{\theta}Q(\theta,\theta^{(n)})$
  • 4) Repeat steps 2 and 3 until convergence (using a stopping criterion such as $\|\theta^{(n+1)} - \theta^{(n)}\| \leqslant \epsilon_1$); a generic sketch of this loop follows the list
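The loop structure of steps 1) to 4) can be written down generically. Below is a minimal sketch that assumes the caller supplies two hypothetical model-specific callables, `e_step` and `m_step` (neither name comes from the book):

```python
import numpy as np

def em(X, theta0, e_step, m_step, eps=1e-6, max_iter=200):
    """Generic EM loop: alternate E and M steps until theta stops moving.

    e_step(X, theta) -> posterior P(z|x, theta) for each sample (the E step);
    m_step(X, posterior) -> argmax_theta Q(theta, theta_n) (the M step).
    Both are model-specific and assumed to be provided by the caller.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        posterior = e_step(X, theta)           # E step: condition on the current theta
        new_theta = m_step(X, posterior)       # M step: maximize Q(theta, theta_n)
        if np.linalg.norm(new_theta - theta) <= eps:  # ||theta^(n+1) - theta^(n)|| <= eps
            return new_theta
        theta = new_theta
    return theta
```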
9.2.3 Convergence of the Algorithm
  • The observed-data likelihood $P(X|\theta)$ increases monotonically over the iterations of $\theta$
  • The $\theta$ obtained by EM is a local optimum of the observed-data log-likelihood $L(\theta)=\log P(X|\theta)$
9.3 EM Applied to Gaussian Mixture Models

Gaussian mixture model: each sample comes from one of several Gaussian distributions, with different probabilities.
$$
\begin{aligned}
P(x|\theta) &= \sum_z p(x,z|\theta) = \sum_z p(z|\theta)\,p(x|z,\theta) \\
&=\sum_{k=1}^K\alpha_k \phi(x|\theta_k)
\end{aligned}
$$

  • where the latent data $z=1,\cdots,K$ indicates which Gaussian the sample came from, with $p(z=k|\theta)=\alpha_k$ the weight of the k-th component, $\alpha_k\geqslant 0$, $\sum_{k=1}^K \alpha_k=1$

  • $p(x|z=k,\theta)= \phi(x|\theta_k)$, where $\phi(x|\theta_k)$ is a Gaussian density with $\theta_k=(\mu_k,\sigma_k^2)$; the k-th component is
    $\phi(x|\theta_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\Big(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\Big)$

  • $\theta=(\alpha_1,\cdots,\alpha_K;\theta_1,\cdots,\theta_K)$, where each $\theta_k=(\mu_k,\sigma_k^2)$

  • Observed data $x_j$, $j = 1,2,\cdots,N$

  • Latent variable $\gamma_{jk}$:
    $\gamma_{jk}=\begin{cases}1,& \text{observation } j \text{ comes from component } k\\0,& \text{otherwise}\end{cases}$
    hence $E\gamma_{jk} = P(\gamma_{jk}=1)$
    (the sample contains only the observed data $x_j$; which component generated each observation is not known)

  • The unobserved data corresponding to observation $x_j$ is $\gamma_j=(\gamma_{j1},\gamma_{j2},\cdots,\gamma_{jK})$, whose possible values are $(1,0,\cdots,0),(0,1,\cdots,0),\cdots,(0,0,\cdots,1)$: exactly one entry is 1 and the rest are 0 (a small simulation of this generative process is sketched below)
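The generative story above is easy to simulate: draw the latent component indicator first, then draw the observation from that component. A small sketch with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.3, 0.7])   # component weights, sum to 1
mus    = np.array([0.0, 5.0])   # component means
sigmas = np.array([1.0, 2.0])   # component standard deviations

def sample_gmm(n):
    """Draw n samples: first the latent z (which Gaussian), then x given z."""
    z = rng.choice(len(alphas), size=n, p=alphas)   # latent indicator gamma_j
    x = rng.normal(mus[z], sigmas[z])               # x_j ~ N(mu_z, sigma_z^2)
    return x, z

x, z = sample_gmm(1000)  # in real data only x is observed; z stays hidden
```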

9.3.1 Derivation Using the General EM Result

The Q function of the EM algorithm:
$$
Q(\theta,\theta^{(n)})=\sum_{i=1}^m \Big[\sum_{z_i=1}^K P(z_i|x_i,\theta^{(n)})\log P(x_i,z_i|\theta)\Big]
$$
For the Gaussian mixture model (where $\theta_{z_i}$ is the parameter of the $z_i$-th Gaussian within the overall parameter $\theta$, and $\alpha_{z_i}$ is the weight of the $z_i$-th Gaussian):
$$
\begin{aligned}
P(x_i,z_i|\theta) &= P(x_i|z_i,\theta)P(z_i|\theta)= \alpha_{z_i}\phi(x_i|\theta_{z_i})\\
P(z_i|x_i,\theta^{(n)}) &= \frac{\alpha_{z_i}^{(n)}\phi(x_i|\theta_{z_i}^{(n)})}{\sum_{k=1}^K \alpha_k^{(n)}\phi(x_i|\theta_k^{(n)})}\\
Q(\theta,\theta^{(n)})&=\sum_{i=1}^m\Big[ \sum_{z_i=1}^K \frac{\alpha_{z_i}^{(n)}\phi(x_i|\theta_{z_i}^{(n)})}{\sum_{k=1}^K \alpha_k^{(n)}\phi(x_i|\theta_k^{(n)})}\log \alpha_{z_i}\phi(x_i|\theta_{z_i})\Big]
\end{aligned}
$$

9.3.2 The Book's Derivation

In the formulas below, for each sample $x_j$, exactly one of $\gamma_{j1},\gamma_{j2},\cdots,\gamma_{jK}$ equals 1 and the rest are 0.
The probability of each sample can then be written as
$$
p(x_j,\gamma_j|\theta)=\prod_{k=1}^K \big[\alpha_k^{\gamma_{jk}}\phi(x_j|\theta_k)^{\gamma_{jk}}\big]
$$
The likelihood function over all samples is
$$
\begin{aligned}
P(x,\gamma|\theta) &= \prod_{j=1}^N\prod_{k=1}^K\big[\alpha_k^{\gamma_{jk}}\phi(x_j|\theta_k)^{\gamma_{jk}}\big] \\
\prod_{j=1}^N\prod_{k=1}^K\alpha_k^{\gamma_{jk}} &= \prod_{k=1}^K\prod_{j=1}^N\alpha_k^{\gamma_{jk}} = \prod_{k=1}^K\alpha_k^{\sum_{j=1}^N\gamma_{jk}}
=\prod_{k=1}^K\alpha_k^{n_k}\\
P(x,\gamma|\theta)&= \prod_{k=1}^K\Big[\alpha_k^{n_k}\prod_{j=1}^N\phi(x_j|\theta_k)^{\gamma_{jk}}\Big] \\
&= \prod_{k=1}^K\Big[\alpha_k^{n_k}\prod_{j=1}^N\Big[\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\Big(-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big]^{\gamma_{jk}} \Big]
\end{aligned}
$$
where $n_k = \sum_{j=1}^N\gamma_{jk}$ is the number of samples generated by component $k$, and $\sum_{k=1}^K n_k = N$.

The log-likelihood:
$$
\log P(x,\gamma|\theta)= \sum_{k=1}^K \Big\{ n_k\log \alpha_k + \sum_{j=1}^N \gamma_{jk}\Big[\log\frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2}(x_j-\mu_k)^2\Big] \Big\}
$$
E step of the algorithm:
$$
\begin{aligned}
Q(\theta,\theta^{(n)}) &= E_{\gamma}\big[\log P(x,\gamma|\theta)\,\big|\,x,\theta^{(n)}\big] \\
&= E_{\gamma} \Big\{ \sum_{k=1}^K \Big[n_k\log \alpha_k + \sum_{j=1}^N \gamma_{jk}\Big[\log\frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2}(x_j-\mu_k)^2\Big] \Big]\Big\} \\
&= \sum_{k=1}^K\Big[E(n_k)\log \alpha_k + \sum_{j=1}^N E(\gamma_{jk}) \Big[\log\frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2}(x_j-\mu_k)^2\Big]\Big]
\end{aligned}
$$
Compute the expectations:
$$
\begin{aligned}
E(n_k) &= E\Big(\sum_{j=1}^N\gamma_{jk}\Big) = \sum_{j=1}^N E(\gamma_{jk}) \\
E(\gamma_{jk}) &= E(\gamma_{jk}|x,\theta^{(n)}) = P(\gamma_{jk}=1|x,\theta^{(n)}) = \frac{P(\gamma_{jk}=1,x_j|\theta^{(n)})}{P(x_j|\theta^{(n)})} \\
&=\frac{P(\gamma_{jk}=1,x_j|\theta^{(n)})}{\sum_{l=1}^K P(\gamma_{jl}=1,x_j|\theta^{(n)})} \\
&= \frac{P(\gamma_{jk}=1|\theta^{(n)})\,P(x_j|\gamma_{jk}=1,\theta^{(n)})}{\sum_{l=1}^K P(\gamma_{jl}=1|\theta^{(n)})\,P(x_j|\gamma_{jl}=1,\theta^{(n)})} \\
&= \frac{\alpha_k^{(n)}\phi(x_j|\theta_k^{(n)})}{\sum_{l=1}^K\alpha_l^{(n)}\phi(x_j|\theta_l^{(n)})}
\end{aligned}
$$
Substituting back into the Q function gives
$$
Q(\theta,\theta^{(n)}) = \sum_{k=1}^K\Big[\sum_{j=1}^N E(\gamma_{jk})\log \alpha_k + \sum_{j=1}^N E(\gamma_{jk}) \Big[\log\frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2}(x_j-\mu_k)^2\Big]\Big]
$$
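The quantity $E(\gamma_{jk})$ is the responsibility of component k for observation j. A minimal NumPy sketch of this computation for 1-D data (the variable names are my own):

```python
import numpy as np
from scipy.stats import norm

def responsibilities(x, alphas, mus, sigmas):
    """E(gamma_jk) = alpha_k * phi(x_j|theta_k) / sum_l alpha_l * phi(x_j|theta_l)."""
    # weighted component densities, shape (N, K)
    weighted = alphas * norm.pdf(x[:, None], loc=mus, scale=sigmas)
    return weighted / weighted.sum(axis=1, keepdims=True)
```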


The two derivations give equivalent results:
$$
\begin{aligned}
Q(\theta,\theta^{(n)}) &= \sum_{k=1}^K\Big[\sum_{j=1}^N E(\gamma_{jk})\log \alpha_k + \sum_{j=1}^N E(\gamma_{jk}) \Big[\log\frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2}(x_j-\mu_k)^2\Big]\Big] \\
&= \sum_{k=1}^K\Big[\sum_{j=1}^N E(\gamma_{jk}) \log \alpha_k \phi(x_j|\theta_k) \Big] \\
&= \sum_{j=1}^N \Big[\sum_{k=1}^K \frac{\alpha_k^{(n)}\phi(x_j|\theta_k^{(n)})}{\sum_{l=1}^K\alpha_l^{(n)}\phi(x_j|\theta_l^{(n)})} \log \alpha_k \phi(x_j|\theta_k)\Big]
\end{aligned}
$$


M step of the algorithm:

  • For $\mu_k,\sigma_k^2$ there are no constraints, so differentiate the Q function directly:
    $$
    \begin{aligned}
    \frac{\partial Q(\theta,\theta^{(n)})}{\partial \mu_k} & = \sum_{j=1}^N E(\gamma_{jk})\Big[\frac{1}{\sigma_k^2}(x_j-\mu_k)\Big]=0 \\
    \frac{\partial Q(\theta,\theta^{(n)})}{\partial \sigma_k^2}&= \sum_{j=1}^N E(\gamma_{jk})\Big[-\frac{1}{2\sigma_k^2} + \frac{1}{2\sigma_k^4}(x_j-\mu_k)^2\Big]=0
    \end{aligned}
    $$
    which gives
    $$
    \begin{aligned}
    \mu_k &=\frac{\sum_{j=1}^N E(\gamma_{jk})\,x_j}{\sum_{j=1}^N E(\gamma_{jk})} \\
    \sigma_k^2 &= \frac{\sum_{j=1}^N E(\gamma_{jk})(x_j-\mu_k)^2}{\sum_{j=1}^N E(\gamma_{jk})}
    \end{aligned}
    $$
  • For $\alpha_k$, because of the constraint $\sum_{k=1}^K\alpha_k=1$, form the Lagrangian (where $G(\cdot)$ collects the terms that do not involve $\alpha$):
    $$
    \begin{aligned}
    F(\alpha) &= Q(\alpha) + \beta\Big(\sum_{k=1}^K\alpha_k-1\Big)= \sum_{k=1}^K\sum_{j=1}^N E(\gamma_{jk})\log \alpha_k + G(\cdot) + \beta\Big(\sum_{k=1}^K\alpha_k-1\Big) \\
    \frac{\partial F(\alpha)}{\partial \alpha_k} &= \frac{\sum_{j=1}^N E(\gamma_{jk})}{\alpha_k} + \beta = 0 \ \rightarrow\ \alpha_k = -\frac{\sum_{j=1}^N E(\gamma_{jk})}{\beta} \\
    \sum_{k=1}^K\alpha_k &= -\frac{\sum_{k=1}^K\sum_{j=1}^N E(\gamma_{jk})}{\beta}=1 \ \rightarrow\ \beta=-\sum_{k=1}^K\sum_{j=1}^N E(\gamma_{jk})=-N
    \end{aligned}
    $$
    which gives
    $$
    \alpha_k = \frac{\sum_{j=1}^N E(\gamma_{jk})}{N}
    $$
9.3.3 The Derivation in the Watermelon Book (西瓜书)

The Watermelon Book does not use the Q function; it differentiates the Lagrangian of the incomplete-data log-likelihood directly:
$$
\begin{aligned}
P(x_j|\theta)&= \sum_{k=1}^K\alpha_k\phi(x_j|\theta_k)\\
L &=\sum_{j=1}^N\log \sum_{k=1}^K\alpha_k\phi(x_j|\theta_k) \\
s.t. &\quad \sum_{k=1}^K \alpha_k=1
\end{aligned}
$$
The Lagrangian is
$$
l = \sum_{j=1}^N\log \sum_{k=1}^K\alpha_k\phi(x_j|\theta_k) + \lambda\Big(\sum_{k=1}^K \alpha_k -1\Big)
$$
Differentiating with respect to $\alpha_k,\mu_k,\Sigma_k$ yields the same results as the EM algorithm.

9.3.4 The EM Algorithm for Estimating Gaussian Mixture Parameters

Input: observed data $x_1,\cdots,x_N$, the Gaussian mixture model
Output: the Gaussian mixture model parameters

  • 1) Choose initial parameter values and start iterating
  • 2) E step: based on the current model parameters, compute the responsibility of component k for each observation:
    $\hat{\gamma}_{jk} = E(\gamma_{jk})=\frac{\alpha_k^{(n)}\phi(x_j|\theta_k^{(n)})}{\sum_{l=1}^K\alpha_l^{(n)}\phi(x_j|\theta_l^{(n)})}, \quad j=1,2,\cdots,N;\ k=1,2,\cdots,K$
  • 3) M step: compute the new round of model parameters:
    $\mu_k =\frac{\sum_{j=1}^N E(\gamma_{jk})\,x_j}{\sum_{j=1}^N E(\gamma_{jk})}$
    $\sigma_k^2 = \frac{\sum_{j=1}^N E(\gamma_{jk})(x_j-\mu_k)^2}{\sum_{j=1}^N E(\gamma_{jk})}$
    $\alpha_k = \frac{\sum_{j=1}^N E(\gamma_{jk})}{N}$
  • 4) Repeat steps 2 and 3 until convergence (a full sketch in code follows this list)
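Putting the E and M steps together, the following is a self-contained sketch of the full procedure for 1-D data; it implements the update formulas above but is my own code, not the book's:

```python
import numpy as np
from scipy.stats import norm

def gmm_em(x, K, n_iter=100, tol=1e-6, seed=0):
    """EM for a 1-D Gaussian mixture, following steps 1)-4) above."""
    rng = np.random.default_rng(seed)
    # 1) initialize: equal weights, random means drawn from the data, overall std
    alphas = np.full(K, 1.0 / K)
    mus = rng.choice(x, size=K, replace=False).astype(float)
    sigmas = np.full(K, x.std())
    prev_ll = -np.inf
    for _ in range(n_iter):
        # 2) E step: responsibilities gamma_hat[j, k]
        weighted = alphas * norm.pdf(x[:, None], loc=mus, scale=sigmas)  # (N, K)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # 3) M step: the closed-form updates derived above
        Nk = gamma.sum(axis=0)                                  # sum_j E(gamma_jk)
        mus = (gamma * x[:, None]).sum(axis=0) / Nk
        sigmas = np.sqrt((gamma * (x[:, None] - mus) ** 2).sum(axis=0) / Nk)
        alphas = Nk / len(x)
        # 4) stop when the observed-data log-likelihood stops increasing
        ll = np.log(weighted.sum(axis=1)).sum()
        if abs(ll - prev_ll) <= tol:
            break
        prev_ll = ll
    return alphas, mus, sigmas
```

Fitting this to samples from the earlier simulation snippet should roughly recover the weights, means, and standard deviations used there.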