Deriving the EM Algorithm (with a Convergence Proof and an Application to GMMs)

1. Motivation for the EM Algorithm

Suppose you have a dataset that looks like this:

[Figure: sample dataset forming several clusters (picture source)]

A single Gaussian clearly fits such data poorly; this is a typical use case for a Gaussian mixture model (GMM):

$$p(X)=\sum_{l=1}^{k}\alpha_l N(X\mid\mu_l,\Sigma_l),\qquad \sum_{l=1}^{k}\alpha_l=1$$

(where $\alpha_l$ can be interpreted as the weight of each Gaussian component)

Let $\Theta=\{\alpha_1,\ldots,\alpha_k,\mu_1,\ldots,\mu_k,\Sigma_1,\ldots,\Sigma_k\}$ collect all the parameters.
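To make the generative story concrete, here is a minimal sampling sketch (assuming numpy; all parameter values are hypothetical, chosen only for illustration): draw a component index $z_i$ with probabilities $\alpha$, then draw $x_i$ from the corresponding Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for a k=2 mixture in 2-D.
alpha = np.array([0.3, 0.7])                    # component weights, sum to 1
mu = np.array([[0.0, 0.0], [4.0, 4.0]])         # component means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances

def sample_gmm(n):
    """Draw n samples: z_i ~ Categorical(alpha), then x_i ~ N(mu_{z_i}, Sigma_{z_i})."""
    z = rng.choice(len(alpha), size=n, p=alpha)
    X = np.array([rng.multivariate_normal(mu[zi], Sigma[zi]) for zi in z])
    return X, z

X, z = sample_gmm(500)
```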

The maximum-likelihood estimate is then:

$$\Theta_{MLE}=\mathop{\arg\max}_{\Theta}L(\Theta\mid X)=\mathop{\arg\max}_{\Theta}\left(\sum_{i=1}^{n}\log\sum_{l=1}^{k}\alpha_l N(x_i\mid\mu_l,\Sigma_l)\right)$$

This expression contains the logarithm of a sum (or integral), so unlike the single-Gaussian case we cannot simply take derivatives and set them to zero. Instead, we use the EM algorithm, which maximizes $L(\Theta\mid X)$ approximately, through iteration.
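Numerically, the log-likelihood above is easy to evaluate but hard to maximize in closed form, precisely because the sum sits inside the logarithm. A sketch (assuming scipy, and continuing the sampling sketch from above):

```python
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, alpha, mu, Sigma):
    # dens[i, l] = alpha_l * N(x_i | mu_l, Sigma_l)
    dens = np.column_stack([
        a * multivariate_normal.pdf(X, mean=m, cov=S)
        for a, m, S in zip(alpha, mu, Sigma)
    ])
    # The log applies to a sum over components, so it does not decompose per component.
    return np.sum(np.log(dens.sum(axis=1)))

print(gmm_log_likelihood(X, alpha, mu, Sigma))
```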

[Figure (picture source)]

2. Deriving the EM Algorithm

We first state Jensen's inequality. For a convex function $f$:

$$f(t\,x_1+(1-t)\,x_2)\leq t\,f(x_1)+(1-t)\,f(x_2)$$

Extending to a convex combination of $k$ points, with $\sum_{i=1}^{k}p_i=1,\ p_i\geq 0$:

$$f(p_1 x_1+\cdots+p_k x_k)\leq p_1 f(x_1)+\cdots+p_k f(x_k),\quad\text{i.e.}\quad f\left(\sum_{i=1}^{k}p_i x_i\right)\leq \sum_{i=1}^{k}p_i f(x_i)$$

Substituting $\phi$ for $f$ and $f(x)$ for $x$, we get:

$$\phi\left(\sum_{i=1}^{k}p_i f(x_i)\right)\leq \sum_{i=1}^{k}p_i\,\phi(f(x_i))$$

Hence, for a convex function $\phi$, we have the following:

$$\phi(E[f(x)])\leq E[\phi(f(x))]$$

Likewise, for a concave function $\phi$, the inequality is reversed:

$$\phi(E[f(x)])\geq E[\phi(f(x))]$$
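As a quick numerical sanity check of the concave case with $\phi=\log$ (a sketch assuming numpy; the numbers are arbitrary):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # weights p_i, summing to 1
fx = np.array([1.0, 2.0, 4.0])  # positive values f(x_i)

lhs = np.log(np.sum(p * fx))    # phi(E[f(x)]) with phi = log (concave)
rhs = np.sum(p * np.log(fx))    # E[phi(f(x))]
assert lhs >= rhs               # Jensen for a concave phi
print(lhs, rhs)                 # 0.875... >= 0.762...
```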

We now introduce a latent variable $Z$ in order to maximize the log-likelihood of the observed data $X$ with respect to the parameters $\theta$. For any distribution $Q(Z)$ over $Z$:

$$L(\theta)=\log P(X\mid\theta)=\log\left(\frac{P(X,Z\mid\theta)}{P(Z\mid X,\theta)}\right)=\log\left(\frac{P(X,Z\mid\theta)}{Q(Z)}\cdot\frac{Q(Z)}{P(Z\mid X,\theta)}\right)=\log\left(\frac{P(X,Z\mid\theta)}{Q(Z)}\right)+\log\left(\frac{Q(Z)}{P(Z\mid X,\theta)}\right)$$

Taking the expectation of both sides under $Q(Z)$ (the left side does not depend on $Z$):

$$\log P(X\mid\theta)=\int_{Z}\log\left(\frac{P(X,Z\mid\theta)}{Q(Z)}\right)Q(Z)\,dZ+\int_{Z}\log\left(\frac{Q(Z)}{P(Z\mid X,\theta)}\right)Q(Z)\,dZ=r(X\mid\theta)+KL\left(Q(Z)\,\|\,P(Z\mid X,\theta)\right)$$

Since $KL(\cdot)\geq 0$, it follows that $\log P(X\mid\theta)\geq r(X\mid\theta)$. The same bound can also be obtained from Jensen's inequality above, with $f(Z)=\frac{P(X,Z\mid\theta)}{Q(Z)}$:

$$\log P(X\mid\theta)=\log\int_{Z}P(X,Z\mid\theta)\,dZ=\log\int_{Z}\frac{P(X,Z\mid\theta)}{Q(Z)}\,Q(Z)\,dZ=\log E_{Q(Z)}[f(Z)]\geq E_{Q(Z)}[\log f(Z)]=\int_{Z}\log\left(\frac{P(X,Z\mid\theta)}{Q(Z)}\right)Q(Z)\,dZ$$
Moreover, when $Q(Z)=P(Z\mid X,\Theta^{(g)})$, we have $KL(\cdot)=0$, and therefore:

$$\log P(X\mid\Theta^{(g)})=r(X\mid\Theta^{(g)})$$
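Both facts, $\log P(X\mid\theta)=r(X\mid\theta)+KL$ for any $Q$ and equality of the bound exactly at $Q(Z)=P(Z\mid X,\theta)$, can be checked on a toy model with one observation and a binary latent variable (a sketch assuming numpy; all probabilities are hypothetical):

```python
import numpy as np

p_z = np.array([0.4, 0.6])          # P(z | theta), z in {0, 1}
p_x_given_z = np.array([0.9, 0.2])  # P(x = x_obs | z, theta)

p_xz = p_z * p_x_given_z            # joint P(x_obs, z | theta)
p_x = p_xz.sum()                    # evidence P(x_obs | theta)
posterior = p_xz / p_x              # P(z | x_obs, theta)

for q in (np.array([0.5, 0.5]), posterior):
    r = np.sum(q * np.log(p_xz / q))        # the lower bound r(X | theta)
    kl = np.sum(q * np.log(q / posterior))  # KL(Q || posterior)
    print(np.log(p_x), r + kl, r)           # log P(x) == r + kl >= r
```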
So $r(X\mid\Theta)$ is a lower-bound function of $L(\Theta)$, and we approach the maximum of the log-likelihood by repeatedly maximizing this lower bound:

$$\Theta^{(g+1)}=\mathop{\arg\max}_{\Theta}\int_{Z}\log\left(\frac{P(X,Z\mid\Theta)}{P(Z\mid X,\Theta^{(g)})}\right)P(Z\mid X,\Theta^{(g)})\,dZ=\mathop{\arg\max}_{\Theta}\int_{Z}\log\left(P(X,Z\mid\Theta)\right)P(Z\mid X,\Theta^{(g)})\,dZ$$

(the second equality holds because the denominator $P(Z\mid X,\Theta^{(g)})$ does not depend on $\Theta$)

Each EM iteration consists of two steps: the E step computes an expectation, and the M step maximizes it. Define:

$$Q(\Theta,\Theta^{(g)})=\int_{Z}\log\left(P(X,Z\mid\Theta)\right)P(Z\mid X,\Theta^{(g)})\,dZ$$
The full algorithm is as follows:

The EM algorithm:
Input: observed data $X$, latent data $Z$, joint distribution $P(X,Z\mid\Theta)$, conditional distribution $P(Z\mid X,\Theta)$
Output: model parameters $\Theta$
(1) Choose initial parameters $\Theta^{(0)}$;
(2) E step: writing $\Theta^{(i)}$ for the estimate of $\Theta$ at iteration $i$, compute $Q(\Theta,\Theta^{(i)})$ at iteration $i+1$;
(3) M step: obtain the estimate $\Theta^{(i+1)}$ for iteration $i+1$ as:
$$\Theta^{(i+1)}=\mathop{\arg\max}_{\Theta}\ Q(\Theta,\Theta^{(i)})$$
(4) Repeat steps (2) and (3) until convergence.
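In code, steps (2)-(4) amount to a simple loop. A generic sketch (assuming Python; here e_step, m_step, and log_likelihood stand for the model-specific functions, such as the GMM versions derived in Part 4):

```python
def run_em(X, theta0, e_step, m_step, log_likelihood, tol=1e-6, max_iter=200):
    """Generic EM loop: alternate E and M steps until the log-likelihood stops improving."""
    theta = theta0
    prev_ll = log_likelihood(X, theta)
    for _ in range(max_iter):
        expectations = e_step(X, theta)  # step (2): expectations under P(Z | X, theta)
        theta = m_step(X, expectations)  # step (3): maximize Q(theta, theta_old)
        ll = log_likelihood(X, theta)
        if ll - prev_ll < tol:           # Part 3 shows ll never decreases
            break
        prev_ll = ll
    return theta
```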

The figure below gives an intuitive picture of the EM algorithm:

[Figure: the log-likelihood and its lower bound, touching at $\theta=\theta^{(g)}$]

As the figure shows, the two functions are equal at $\theta=\theta^{(g)}$. By step (3) of the EM algorithm, we obtain the next point $\theta^{(g+1)}$ by maximizing the lower-bound function; since the lower bound increases, the log-likelihood is guaranteed to increase at each iteration as well. At $\theta^{(g+1)}$ the algorithm recomputes $Q(\Theta,\Theta^{(g+1)})$ and proceeds to the next iteration. The log-likelihood keeps growing during the iterations, but, as the figure also suggests, EM is not guaranteed to find the global optimum.

3. Convergence of the EM Algorithm

Starting from

$$P(X\mid\theta)=\frac{P(X,Z\mid\theta)}{P(Z\mid X,\theta)}$$

and taking logarithms:

$$\log P(X\mid\theta)=\log P(X,Z\mid\theta)-\log P(Z\mid X,\theta)$$

Define:

$$Q(\theta,\theta^{(g)})=\int_{Z}\log\left(P(X,Z\mid\theta)\right)P(Z\mid X,\theta^{(g)})\,dZ,\qquad H(\theta,\theta^{(g)})=\int_{Z}\log\left(P(Z\mid X,\theta)\right)P(Z\mid X,\theta^{(g)})\,dZ$$

Taking the expectation of both sides under $P(Z\mid X,\theta^{(g)})$ (the left side does not depend on $Z$), the log-likelihood can be written as:

$$\log P(X\mid\theta)=Q(\theta,\theta^{(g)})-H(\theta,\theta^{(g)})$$
Hence the following identity holds:

$$\log P(X\mid\theta^{(g+1)})-\log P(X\mid\theta^{(g)})=\left[Q(\theta^{(g+1)},\theta^{(g)})-Q(\theta^{(g)},\theta^{(g)})\right]-\left[H(\theta^{(g+1)},\theta^{(g)})-H(\theta^{(g)},\theta^{(g)})\right]$$

For the first term on the right, since $\theta^{(g+1)}$ maximizes $Q(\theta,\theta^{(g)})$, we have:

$$Q(\theta^{(g+1)},\theta^{(g)})-Q(\theta^{(g)},\theta^{(g)})\geq 0$$

For the second term, by Jensen's inequality:

$$H(\theta^{(g+1)},\theta^{(g)})-H(\theta^{(g)},\theta^{(g)})=\int_{Z}\log\left(\frac{P(Z\mid X,\theta^{(g+1)})}{P(Z\mid X,\theta^{(g)})}\right)P(Z\mid X,\theta^{(g)})\,dZ\leq \log\int_{Z}\left(\frac{P(Z\mid X,\theta^{(g+1)})}{P(Z\mid X,\theta^{(g)})}\,P(Z\mid X,\theta^{(g)})\right)dZ=\log\left(\int_{Z}P(Z\mid X,\theta^{(g+1)})\,dZ\right)=0$$

Combining the two, we obtain:

$$\log P(X\mid\theta^{(g+1)})\geq \log P(X\mid\theta^{(g)})$$

4. The EM Algorithm for GMMs

Part 1 of this article introduced the Gaussian mixture model:

$$p(X)=\sum_{l=1}^{k}\alpha_l N(X\mid\mu_l,\Sigma_l),\qquad \sum_{l=1}^{k}\alpha_l=1,\qquad \Theta=\{\alpha_1,\ldots,\alpha_k,\mu_1,\ldots,\mu_k,\Sigma_1,\ldots,\Sigma_k\}$$

and Part 2 derived the EM update:

$$\Theta^{(g+1)}=\mathop{\arg\max}_{\Theta}\int_{Z}\log\left(P(X,Z\mid\Theta)\right)P(Z\mid X,\Theta^{(g)})\,dZ$$

E step:

We need to work out the two quantities $\log P(X,Z\mid\Theta)$ and $P(Z\mid X,\Theta)$. For the observed data:

$$P(X\mid\Theta)=\prod_{i=1}^{n}p(x_i\mid\Theta)=\prod_{i=1}^{n}\sum_{l=1}^{k}\alpha_l N(x_i\mid\mu_l,\Sigma_l)$$
Letting the latent label $z_i\in\{1,\ldots,k\}$ indicate which component generated $x_i$, we can accordingly define:

$$P(X,Z\mid\Theta)=\prod_{i=1}^{n}p(x_i,z_i\mid\Theta)=\prod_{i=1}^{n}p(x_i\mid z_i,\Theta)\,p(z_i\mid\Theta)=\prod_{i=1}^{n}\alpha_{z_i}\,N(x_i\mid\mu_{z_i},\Sigma_{z_i})$$
By Bayes' rule, we have:

$$P(Z\mid X,\Theta)=\prod_{i=1}^{n}p(z_i\mid x_i,\Theta)=\prod_{i=1}^{n}\frac{\alpha_{z_i}\,N(x_i\mid\mu_{z_i},\Sigma_{z_i})}{\sum_{l=1}^{k}\alpha_l\,N(x_i\mid\mu_l,\Sigma_l)}$$
Combining the two expressions gives:

$$Q(\Theta,\Theta^{(g)})=\int_{Z}\log\left(P(X,Z\mid\Theta)\right)P(Z\mid X,\Theta^{(g)})\,dZ=\int_{z_1}\cdots\int_{z_n}\left(\sum_{i=1}^{n}\left[\log\alpha_{z_i}+\log N(x_i\mid\mu_{z_i},\Sigma_{z_i})\right]\right)\cdot\prod_{i=1}^{n}p(z_i\mid x_i,\Theta^{(g)})\,dz_1\cdots dz_n$$

(since the $z_i$ are discrete, the integrals here are really sums; we keep the integral notation for uniformity)
Writing $f(z_i)=\log\alpha_{z_i}+\log N(x_i\mid\mu_{z_i},\Sigma_{z_i})$ and $p(z_1,\ldots,z_n)=\prod_{i=1}^{n}p(z_i\mid x_i,\Theta^{(g)})$, this becomes:

$$Q(\Theta,\Theta^{(g)})=\int_{z_1}\cdots\int_{z_n}\left(\sum_{i=1}^{n}f(z_i)\right)\cdot p(z_1,\ldots,z_n)\,dz_1\cdots dz_n$$

Looking at the first term of the sum, it simplifies as follows:

$$\int_{z_1}\cdots\int_{z_n}f(z_1)\,p(z_1,\ldots,z_n)\,dz_1\cdots dz_n=\int_{z_1}f(z_1)\int_{z_2}\cdots\int_{z_n}p(z_1,\ldots,z_n)\,dz_2\cdots dz_n\,dz_1=\int_{z_1}f(z_1)\,p(z_1)\,dz_1$$
Simplifying every term in the same way, we obtain:

$$Q(\Theta,\Theta^{(g)})=\sum_{i=1}^{n}\int_{z_i}f(z_i)\,p(z_i)\,dz_i=\sum_{i=1}^{n}\int_{z_i}\left(\log\alpha_{z_i}+\log N(x_i\mid\mu_{z_i},\Sigma_{z_i})\right)p(z_i\mid x_i,\Theta^{(g)})\,dz_i=\sum_{z_i=1}^{k}\sum_{i=1}^{n}\left(\log\alpha_{z_i}+\log N(x_i\mid\mu_{z_i},\Sigma_{z_i})\right)p(z_i\mid x_i,\Theta^{(g)})=\sum_{l=1}^{k}\sum_{i=1}^{n}\left(\log\alpha_l+\log N(x_i\mid\mu_l,\Sigma_l)\right)p(l\mid x_i,\Theta^{(g)})$$
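In other words, the E step for a GMM boils down to computing the responsibilities $p(l\mid x_i,\Theta^{(g)})$ for every sample and component. A sketch (assuming numpy/scipy, in the notation of the earlier snippets):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alpha, mu, Sigma):
    """gamma[i, l] = p(l | x_i, Theta): responsibility of component l for sample x_i."""
    weighted = np.column_stack([
        a * multivariate_normal.pdf(X, mean=m, cov=S)  # alpha_l * N(x_i | mu_l, Sigma_l)
        for a, m, S in zip(alpha, mu, Sigma)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)  # normalize per sample (Bayes' rule)
```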

M step:

$$Q(\Theta,\Theta^{(g)})=\sum_{l=1}^{k}\sum_{i=1}^{n}\left(\log\alpha_l+\log N(x_i\mid\mu_l,\Sigma_l)\right)p(l\mid x_i,\Theta^{(g)})=\sum_{l=1}^{k}\sum_{i=1}^{n}\log\alpha_l\cdot p(l\mid x_i,\Theta^{(g)})+\sum_{l=1}^{k}\sum_{i=1}^{n}\log\left[N(x_i\mid\mu_l,\Sigma_l)\right]\cdot p(l\mid x_i,\Theta^{(g)})$$

The first term involves only the parameters $\alpha$ and the second only $\mu,\Sigma$, so the two terms can be maximized independently.
(1) Maximizing over $\alpha$:

$$\frac{\partial\sum_{l=1}^{k}\sum_{i=1}^{n}\log\alpha_l\cdot p(l\mid x_i,\Theta^{(g)})}{\partial\alpha_1,\ldots,\partial\alpha_k}=[0,\ldots,0]\qquad \text{s.t. }\sum_{l=1}^{k}\alpha_l=1$$

This is a constrained optimization problem, which we solve with a Lagrange multiplier:

$$L(\alpha_1,\ldots,\alpha_k,\lambda)=\sum_{l=1}^{k}\log(\alpha_l)\left(\sum_{i=1}^{n}p(l\mid x_i,\Theta^{(g)})\right)-\lambda\left(\sum_{l=1}^{k}\alpha_l-1\right)$$

Setting the derivatives to zero:

$$\Rightarrow\ \frac{\partial L}{\partial\alpha_l}=\frac{1}{\alpha_l}\left(\sum_{i=1}^{n}p(l\mid x_i,\Theta^{(g)})\right)-\lambda=0,\qquad \frac{\partial L}{\partial\lambda}=\sum_{l=1}^{k}\alpha_l-1=0$$

Multiplying the first equation by $\alpha_l$ and summing over $l$ gives $\lambda=\sum_{l=1}^{k}\sum_{i=1}^{n}p(l\mid x_i,\Theta^{(g)})=n$, hence:

$$\Rightarrow\ \alpha_l=\frac{1}{n}\sum_{i=1}^{n}p(l\mid x_i,\Theta^{(g)})$$

The figure below gives an intuitive picture (here $k=2$, and at each sample point $a$ and $b$ denote the weighted densities of the two components):
$\alpha_1$ is obtained by adding up $\frac{a}{a+b}$ over all sample points and dividing by the sample size $n$, i.e. it is the mean of $\frac{a}{a+b}$ over the samples;
$\alpha_2$ is obtained by adding up $\frac{b}{a+b}$ over all sample points and dividing by the sample size $n$, i.e. it is the mean of $\frac{b}{a+b}$ over the samples.

[Figure: two weighted Gaussian components evaluated at the sample points]

(2) Maximizing over $\mu,\Sigma$:

$$\frac{\partial\sum_{l=1}^{k}\sum_{i=1}^{n}\log\left[N(x_i\mid\mu_l,\Sigma_l)\right]\cdot p(l\mid x_i,\Theta^{(g)})}{\partial\mu_1,\ldots,\partial\mu_k,\partial\Sigma_1,\ldots,\partial\Sigma_k}=[0,\ldots,0]$$

After some algebra, this yields:

$$\mu_l=\frac{\sum_{i=1}^{n}x_i\,p(l\mid x_i,\Theta^{(g)})}{\sum_{i=1}^{n}p(l\mid x_i,\Theta^{(g)})}$$

$$\Sigma_l=\frac{\sum_{i=1}^{n}(x_i-\mu_l)(x_i-\mu_l)^T\,p(l\mid x_i,\Theta^{(g)})}{\sum_{i=1}^{n}p(l\mid x_i,\Theta^{(g)})}$$
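Collecting the three update formulas, the M step might look like the following sketch (assuming numpy, with gamma produced by the e_step above):

```python
def m_step(X, gamma):
    """Update (alpha, mu, Sigma) from the responsibilities gamma[i, l]."""
    n, _ = X.shape
    Nl = gamma.sum(axis=0)             # effective sample count of each component
    alpha = Nl / n                     # alpha_l = (1/n) * sum_i p(l | x_i, Theta)
    mu = (gamma.T @ X) / Nl[:, None]   # weighted sample means
    Sigma = np.stack([
        ((X - m).T * g) @ (X - m) / N  # weighted scatter of component l
        for m, g, N in zip(mu, gamma.T, Nl)
    ])
    return alpha, mu, Sigma
```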

5. Python Demos

Demo 1:
[Demo 1 figure]
Demo 2:
[Demo 2 figure]

—————————— code link ——————————
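A minimal end-to-end sketch tying the earlier snippets together (assuming numpy/scipy; it reuses sample_gmm, gmm_log_likelihood, e_step, and m_step from above, with hypothetical initialization); printing the log-likelihood at each iteration illustrates the monotone increase proved in Part 3:

```python
import numpy as np

X, _ = sample_gmm(500)  # data from the Part 1 sketch

# Deliberately rough initialization (hypothetical values).
alpha = np.array([0.5, 0.5])
init = np.random.default_rng(2).choice(len(X), size=2, replace=False)
mu = X[init]            # two random samples as initial means
Sigma = np.stack([np.cov(X.T), np.cov(X.T)])

for it in range(50):
    gamma = e_step(X, alpha, mu, Sigma)                 # E step
    alpha, mu, Sigma = m_step(X, gamma)                 # M step
    print(it, gmm_log_likelihood(X, alpha, mu, Sigma))  # non-decreasing (Part 3)
```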

6. References


[1] Li Hang. Statistical Learning Methods (统计学习方法).
[2] Professor Richard Yi Da Xu (徐亦达)'s video lectures.
[3] Richard Xu. machine-learning-notes (em.pdf).
