EM算法
一、EM算法的引入
概率模型有时既含有观测变量(observable variable),又含有隐变量或潜在变量(latent variable)。如果概率模型的变量都是观测变量,那么给定数据,可以直接用极大似然估计法,或贝叶斯估计法估计模型参数。但是,当模型含有隐变量时,就不能简单地使用这些估计方法。EM算法就是含有隐变量的概率模型参数的极大似然估计法,或极大后验概率估计法。
1. 为什么要使用EM算法
首先我们来看一个使用EM算法的例子。
例1(三硬币模型) 假设有3枚硬币,分别记作 A A A, B B B, C C C。这些硬币正面出现的概率分别是 π \pi π, p p p和 q q q。进行如下掷硬币试验:先掷硬币 A A A,根据其结果选出硬币 B B B或硬币 C C C,正面选硬币 B B B,反面选硬币 C C C;然后掷选出的硬币,掷硬币的结果,出现正面记作1,出现反面记作0;独立地重复 n n n次实验(这里 n = 10 n=10 n=10)观测结果如下:
1 , 1 , 0 , 1 , 0 , 0 , 1 , 0 , 1 , 1 1,1,0,1,0,0,1,0,1,1 1,1,0,1,0,0,1,0,1,1
假设只能观测到掷硬币的结果,不能观测掷硬币的过程。问如何估计三枚硬币正面出现的概率,即三硬币模型的参数。
解 三硬币模型可写作P ( y ∣ θ ) = ∑ z P ( y , z ∣ θ ) = ∑ z P ( z ∣ θ ) P ( y ∣ z , θ ) = π p y ( 1 − p ) 1 − y + ( 1 − π ) q y ( 1 − q ) 1 − y (1.1) \begin{aligned} P(y \mid \bm{\theta}) &= \sum_{z} P(y, z \mid \bm{\theta}) = \sum_{z} P(z \mid \bm{\theta}) P(y \mid z, \bm{\theta}) \\ &= \pi {p} ^ {y} {(1 - p)} ^ {1 - y} + (1 - \pi) {q} ^ {y} {(1 - q)} ^ {1 - y} \end{aligned} \tag{1.1} P(y∣θ)=z∑P(y,z∣θ)=z∑P(z∣θ)P(y∣z,θ)=πpy(1−p)1−y+(1−π)qy(1−q)1−y(1.1)
这里,随机变量 y y y是观测变量,表示一次试验观测的结果是 1 1 1和 0 0 0;随机变量 z z z是隐变量,表示未观测到的掷硬币 A A A的结果; θ = ( π , p , q ) \bm{\theta} = (\pi, p, q) θ=(π,p,q)是模型参数。这一模型是以上数据的生成模型。注意,随机变量 y y y的数据可以观测,随机变量 z z z的数据不可观测。
将观测数据表示为 Y = ( Y 1 Y 2 ⋯ Y n ) T \bm{Y} = \begin{pmatrix} Y_1 & Y_2 & \cdots & Y_n \end{pmatrix}^T Y=(Y1Y2⋯Yn)T,未观测数据表示为 Z = ( Z 1 Z 2 ⋯ Z n ) T \bm{Z} = \begin{pmatrix} Z_1 & Z_2 & \cdots & Z_n \end{pmatrix}^T Z=(Z1Z2⋯Zn)T,则观测数据的似然函数为
P ( Y ∣ θ ) = ∑ Z P ( Z ∣ θ ) P ( Y ∣ Z , θ ) (1.2) P(\bm{Y} \mid \bm{\theta}) = \sum_{\bm{Z}} P(\bm{Z} \mid \bm{\theta}) P(\bm{Y} \mid \bm{Z}, \bm{\theta}) \tag{1.2} P(Y∣θ)=Z∑P(Z∣θ)P(Y∣Z,θ)(1.2)
即
P ( Y ∣ θ ) = ∏ j = 1 n [ π p y i ( 1 − p ) 1 − y i + ( 1 − π ) q y i ( 1 − q ) 1 − y i ] (1.3) P(\bm{Y} \mid \bm{\theta}) = \prod_{j = 1}^n \left[ \pi {p} ^ {y_i} {(1 - p)} ^ {1 - y_i} + (1 - \pi) {q} ^ {y_i} {(1 - q)} ^ {1 - y_i} \right] \tag{1.3} P(Y∣θ)=j=1∏n[πpyi(1−p)1−yi+(1−π)qyi(1−q)1−yi](1.3)
考虑求模型参数 θ = ( π , p , q ) \bm{\theta} = (\pi, p, q) θ=(π,p,q)的极大似然估计,即
θ ^ = arg max θ log P ( Y ∣ θ ) (1.4) \hat{\bm{\theta}} = \mathop{\arg} \mathop{\max}\limits_{\bm{\theta}} \log P(\bm{Y} \mid \bm{\theta}) \tag{1.4} θ^=argθmaxlogP(Y∣θ)(1.4)
这个问题没有解析解,只有通过迭代的方法求解。EM算法就是可以用于求解这个问题的一种迭代算法。下面我们就给出上面问题的EM算法。
EM算法首先呢选取参数的初值,记作 θ ( 0 ) = ( π ( 0 ) , p ( 0 ) , q ( 0 ) ) \bm{\theta}^{(0)} = ({\pi}^{(0)}, {p}^{(0)}, {q}^{(0)}) θ(0)=(π(0),p(0),q(0)),然后通过下面的步骤迭代计算参数的估计值,直至收敛为止。第 i i i次迭代参数的估计值为 θ ( i ) = ( π ( i ) , p ( i ) , q ( i ) ) \bm{\theta}^{(i)} = ({\pi}^{(i)}, {p}^{(i)}, {q}^{(i)}) θ(i)=(π(i),p(i),q(i))。EM算法的第 i + 1 i+1 i+1次迭代如下。
E步:计算在模型参数 π ( i ) {\pi}^{(i)} π(i), p ( i ) {p}^{(i)} p(i), q ( i ) {q}^{(i)} q(i)下观测数据 y j y_j yj来自掷硬币 B B B的概率
μ j ( i + 1 ) = π ( i ) ( p ( i ) ) y i ( 1 − p ( i ) ) 1 − y i π ( i ) ( p ( i ) ) y i ( 1 − p ( i ) ) 1 − y i + ( 1 − π ( i ) ) ( q ( i ) ) y i ( 1 − q ( i ) ) 1 − y i (1.5) \mu_j^{(i+1)} = \cfrac{\pi^{(i)} {(p^{(i)})}^{y_i} {(1 - p^{(i)})}^{1 - y_i}}{\pi^{(i)} {(p^{(i)})}^{y_i} {(1 - p^{(i)})}^{1 - y_i} + (1 - \pi^{(i)}) {(q^{(i)})}^{y_i} {(1 - q^{(i)})}^{1 - y_i}} \tag{1.5} μj(i+1)=π(i)(p(i))yi(1−p(i))1−yi+(1−π(i))(q(i))yi(1−q(i))1−yiπ(i)(p(i))yi(1−p(i))1−yi(1.5)
M步:计算模型参数的新估计值
π ( i + 1 ) = 1 n ∑ j = 1 n μ j ( i + 1 ) (1.6) \pi^{(i + 1)} = \cfrac{1}{n} \sum_{j = 1}^n \mu_j^{(i + 1)} \tag{1.6} π(i+1)=n1j=1∑nμj(i+1)(1.6)
p ( i + 1 ) = ∑ j = 1 n μ j ( i + 1 ) y j ∑ j = 1 n μ j ( i + 1 ) (1.7) p^{(i + 1)} = \cfrac{\sum\limits_{j = 1}^n \mu_j^{(i + 1)} y_j}{\sum\limits_{j = 1}^n \mu_j^{(i + 1)}} \tag{1.7} p(i+1)=j=1∑nμj(i+1)j=1∑nμj(i+1)yj(1.7)
q ( i + 1 ) = ∑ j = 1 n ( 1 − μ j ( i + 1 ) ) y j ∑ j = 1 n ( 1 − μ j ( i + 1 ) ) (1.8) q^{(i + 1)} = \cfrac{\sum\limits_{j = 1}^n (1 - \mu_j^{(i + 1)}) y_j}{\sum\limits_{j = 1}^n (1 - \mu_j^{(i + 1)})} \tag{1.8} q(i+1)=j=1∑n(1−μj(i+1))j=1∑n(1−μj(i+1))yj(1.8)
进行数值计算。假设模型参数的初值取为
π ( 0 ) = 0.5 , p ( 0 ) = 0.5 , q ( 0 ) = 0.5 \pi^{(0)} = 0.5,\ p^{(0)} = 0.5,\ q^{(0)} = 0.5 π(0)=0.5, p(0)=0.5, q(0)=0.5
由式 ( 1.5 ) (1.5) (1.5),对 y j = 1 y_j = 1 yj=1和 y j = 0 y_j = 0 yj=0均有 μ j ( 1 ) = 0.5 \mu_j^{(1)} = 0.5 μj(1)=0.5.
利用迭代公式 ( 1.6 ) (1.6) (1.6)~公式 ( 1.8 ) (1.8) (1.8),得到
π ( 1 ) = 0.5 , p ( 1 ) = 0.6 , q ( 1 ) = 0.6 \pi^{(1)} = 0.5,\ p^{(1)} = 0.6,\ q^{(1)} = 0.6 π(1)=0.5, p(1)=0.6, q(1)=0.6
由式 ( 1.5 ) (1.5) (1.5),
μ j ( 2 ) = 0.5 , j = 1 , 2 , ⋯ , 10 \mu_j^{(2)} = 0.5,\ j = 1, 2, \cdots , 10 μj(2)=0.5, j=1,2,⋯,10
继续迭代,得
π ( 2 ) = 0.5 , p ( 2 ) = 0.6 , q ( 2 ) = 0.6 \pi^{(2)} = 0.5,\ p^{(2)} = 0.6,\ q^{(2)} = 0.6 π(2)=0.5, p(2)=0.6, q(2)=0.6
于是得到模型参数 θ \bm{\theta} θ的极大似然估计:
π ^ = 0.5 , p ^ = 0.6 , q ^ = 0.6 \hat{\pi} = 0.5,\ \hat{p} = 0.6,\ \hat{q} = 0.6 π^=0.5, p^=0.6, q^=0.6
π = 0.5 \pi = 0.5 π=0.5表示硬币 A A A是均匀的,这一结果容易理解。
如果取初值 π ( 0 ) = 0.4 , p ( 0 ) = 0.6 , q ( 0 ) = 0.7 \pi^{(0)} = 0.4,\ p^{(0)} = 0.6,\ q^{(0)} = 0.7 π(0)=0.4, p(0)=0.6, q(0)=0.7,那么得到的模型参数的极大似然估计是 π ^ = 0.4064 , p ^ = 0.5368 , q ^ = 0.6432 \hat{\pi} = 0.4064,\ \hat{p} = 0.5368,\ \hat{q} = 0.6432 π^=0.4064, p^=0.5368, q^=0.6432。这就是说,EM算法与初值的选择有关,选择不同的初值可能得到不同的参数估计值。
一般地,用 Y \bm{Y} Y表示观测随机变量的数据, Z \bm{Z} Z表示隐随机变量的数据。 Y \bm{Y} Y和 Z \bm{Z} Z连在一起称为完全数据(complete-data),观测数据 Y \bm{Y} Y又称为不完全数据(incomplete-data)。假设给定观测数据 Y \bm{Y} Y,其概率分布是 P ( Y ∣ θ ) P(\bm{Y} \mid \bm{\theta}) P(Y∣θ),其中 θ \bm{\theta} θ是需要估计的模型参数,那么不完全数据 Y Y Y的似然函数是 P ( Y ∣ θ ) P(\bm{Y} \mid \bm{\theta}) P(Y∣θ),对数似然函数 L ( θ ) = log P ( Y ∣ θ ) L(\bm{\theta}) = \log P(\bm{Y} \mid \bm{\theta}) L(θ)=logP(Y∣θ);假设 Y \bm{Y} Y和 Z \bm{Z} Z的联合概率分布是 P ( Y , Z ∣ θ ) P(\bm{Y}, \bm{Z} \mid \bm{\theta}) P(Y,Z∣θ),那么完全数据的对数似然函数是 log P ( Y , Z ∣ θ ) \log P(\bm{Y}, \bm{Z} \mid \bm{\theta}) logP(Y,Z∣θ)。
2. EM算法的推导
我们面对一个含有隐变量的概率模型,目标是极大化观测数据(不完全数据) Y Y Y关于参数 θ \theta θ的对数似然函数,即极大化
L ( θ ) = log P ( Y ∣ θ ) = log ∑ Z P ( Y , Z ∣ θ ) = log ( ∑ Z P ( Y ∣ Z , θ ) P ( Z ∣ θ ) ) (1.9) \begin{aligned} L(\bm{\theta}) &= \log P(\bm{Y} \mid \bm{\theta}) = \log \sum_{\bm{Z}} P(\bm{Y}, \bm{Z} \mid \bm{\theta}) \\ &= \log \left(\sum_{\bm{Z}} P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta}) \right) \end{aligned} \tag{1.9} L(θ)=logP(Y∣θ)=logZ∑P(Y,Z∣θ)=log(Z∑P(Y∣Z,θ)P(Z∣θ))(1.9)
注意到这一极大化的主要困难是式 ( 1.9 ) (1.9) (1.9)中有未观测数据并有包含和(或积分)的对数。
事实上,EM算法是通过迭代逐步近似极大化 L ( θ ) L(\bm{\theta}) L(θ)的。假设在第 i i i次迭代后 θ \bm{\theta} θ的估计值是 θ ( i ) \bm{\theta}^{(i)} θ(i)。我们希望新估计的 θ \bm{\theta} θ能使 L ( θ ) L(\bm{\theta}) L(θ)增加,即 L ( θ ) > L ( θ ( i ) ) L(\bm{\theta}) > L(\bm{\theta}^{(i)}) L(θ)>L(θ(i)),并逐步达到极大值。因此,我们考虑两者的差:
L ( θ ) − L ( θ ( i ) ) = log ( ∑ Z P ( Y ∣ Z , θ ) P ( Z ∣ θ ) ) − log P ( Y ∣ θ ( i ) ) (1.10) L(\bm{\theta}) - L(\bm{\theta}^{(i)}) = \log \left(\sum_{\bm{Z}} P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta}) \right) - \log P(\bm{Y} \mid \bm{\theta}^{(i)}) \tag{1.10} L(θ)−L(θ(i))=log(Z∑P(Y∣Z,θ)P(Z∣θ))−logP(Y∣θ(i))(1.10)
引理(琴生不等式(Jensen inequality)) 若 f ( x ) f(x) f(x)是区间 ( a , b ) \left( a, b \right) (a,b)上的凸函数,则对于任意 x 1 , x 2 , ⋯ , x n ∈ ( a , b ) x_1, x_2, \cdots, x_n \in \left( a, b \right) x1,x2,⋯,xn∈(a,b), λ 1 , λ 2 , ⋯ , λ n ∈ R + \lambda_1, \lambda_2, \cdots, \lambda_n \in \mathbb{R} ^ {+} λ1,λ2,⋯,λn∈R+且 ∑ i = 1 n λ i = 1 \displaystyle\sum_{i = 1}^n \lambda_i = 1 i=1∑nλi=1,有
f ( ∑ i = 1 n λ i x i ) ⩽ ∑ i = 1 n λ i f ( x i ) (1.11) f \left(\sum\limits_{i = 1}^n \lambda_i x_i \right) \leqslant \sum\limits_{i = 1}^n \lambda_i f \left( x_i \right) \tag{1.11} f(i=1∑nλixi)⩽i=1∑nλif(xi)(1.11)
若 f ( x ) f(x) f(x)是区间 ( a , b ) \left( a, b \right) (a,b)上的凹函数,则对于任意 x 1 , x 2 , ⋯ , x n ∈ ( a , b ) x_1, x_2, \cdots, x_n \in \left( a, b \right) x1,x2,⋯,xn∈(a,b), λ 1 , λ 2 , ⋯ , λ n ∈ R + \lambda_1, \lambda_2, \cdots, \lambda_n \in \mathbb{R} ^ {+} λ1,λ2,⋯,λn∈R+且 ∑ i = 1 n λ i = 1 \displaystyle\sum_{i = 1}^n \lambda_i = 1 i=1∑nλi=1,有
f ( ∑ i = 1 n λ i x i ) ⩾ ∑ i = 1 n λ i f ( x i ) (1.12) f \left(\sum\limits_{i = 1}^n \lambda_i x_i \right) \geqslant \sum\limits_{i = 1}^n \lambda_i f \left( x_i \right) \tag{1.12} f(i=1∑nλixi)⩾i=1∑nλif(xi)(1.12)
证明 以 f ( x ) f(x) f(x)是凸函数为例,采用数学归纳法来证明:
当 n = 1 n = 1 n=1时, λ 1 = 1 \lambda_1 = 1 λ1=1,所以有
f ( x 1 ) ⩽ f ( x 1 ) (1.13) f(x_1) \leqslant f(x_1) \tag{1.13} f(x1)⩽f(x1)(1.13)
恒成立。
当 n = 2 n = 2 n=2时, λ 1 + λ 2 = 1 \lambda_1 + \lambda_2 = 1 λ1+λ2=1,构造函数
F ( x ) = λ 1 f ( x 1 ) + λ 2 f ( x ) − f ( λ 1 x 1 + λ 2 x ) (1.14) F(x) = \lambda_1 f(x_1) + \lambda_2 f(x) - f \left( \lambda_1 x_1 + \lambda_2 x \right) \tag{1.14} F(x)=λ1f(x1)+λ2f(x)−f(λ1x1+λ2x)(1.14)
显然有
F ( x 1 ) = λ 1 f ( x 1 ) + λ 2 f ( x 1 ) − f ( λ 1 x 1 + λ 2 x 1 ) = ( λ 1 + λ 2 ) f ( x 1 ) − f ( ( λ 1 + λ 2 ) x 1 ) = f ( x 1 ) − f ( x 1 ) = 0 (1.15) \begin{aligned} F(x_1) &= \lambda_1 f(x_1) + \lambda_2 f(x_1) - f \left( \lambda_1 x_1 + \lambda_2 x_1 \right) \\ &= \left(\lambda_1 + \lambda_2 \right) f(x_1) - f \left( \left( \lambda_1 + \lambda_2 \right) x_1 \right) \\ &= f(x_1) - f(x_1) \\ &= 0 \end{aligned} \tag{1.15} F(x1)=λ1f(x1)+λ2f(x1)−f(λ1x1+λ2x1)=(λ1+λ2)f(x1)−f((λ1+λ2)x1)=f(x1)−f(x1)=0(1.15)
对 F ( x ) F(x) F(x)求导,有
F ′ ( x ) = λ 2 f ′ ( x ) − λ 2 f ′ ( λ 1 x 1 + λ 2 x ) = λ 2 [ f ′ ( x ) − f ′ ( λ 1 x 1 + λ 2 x ) ] (1.16) \begin{aligned} F ^ {'} (x) &= \lambda_2 f ^ {'}(x) - \lambda_2 f ^ {'} (\lambda_1 x_1 + \lambda_2 x) \\ &= \lambda_2 \left[ f ^ {'} (x) - f ^ {'} (\lambda_1 x_1 + \lambda_2 x) \right] \\ \end{aligned} \tag{1.16} F′(x)=λ2f′(x)−λ2f′(λ1x1+λ2x)=λ2[f′(x)−f′(λ1x1+λ2x)](1.16)
因为 f ( x ) f(x) f(x)是凸函数,所以恒有 f ′ ′ ( x ) ⩾ 0 f ^ {''}(x) \geqslant 0 f′′(x)⩾0,这说明 f ′ ( x ) f ^ {'}(x) f′(x)单调递增。
所以
x − ( λ 1 x 1 + λ 2 x ) = ( 1 − λ 2 ) x − λ 1 x 1 = λ 1 ( x − x 1 ) (1.17) x - (\lambda_1 x_1 + \lambda_2 x) = (1 - \lambda_2) x - \lambda_1 x_1 = \lambda_1 (x - x_1) \tag{1.17} x−(λ1x1+λ2x)=(1−λ2)x−λ1x1=λ1(x−x1)(1.17)
当 x > x 1 x > x_1 x>x1时, x > ( λ 1 x 1 + λ 2 x ) x > (\lambda_1 x_1 + \lambda_2 x) x>(λ1x1+λ2x),则 f ′ ( x ) > f ′ ( λ 1 x 1 + λ 2 x ) f ^ {'} (x) > f ^ {'} (\lambda_1 x_1 + \lambda_2 x) f′(x)>f′(λ1x1+λ2x), F ′ ( x ) > 0 F ^ {'} (x) > 0 F′(x)>0;当 x < x 1 x < x_1 x<x1时, x < ( λ 1 x 1 + λ 2 x ) x < (\lambda_1 x_1 + \lambda_2 x) x<(λ1x1+λ2x),则 f ′ ( x ) < f ′ ( λ 1 x 1 + λ 2 x ) f ^ {'} (x) < f ^ {'} (\lambda_1 x_1 + \lambda_2 x) f′(x)<f′(λ1x1+λ2x), F ′ ( x ) < 0 F ^ {'} (x) < 0 F′(x)<0。
所以 F ( x ) F(x) F(x)在 ( a , x 1 ) (a, x_1) (a,x1)上单调递减,在 ( x 1 , b ) (x_1, b) (x1,b)上单调递增。那么恒有
F ( x ) ⩾ F ( x 1 ) = 0 (1.18) F(x) \geqslant F(x_1) = 0 \tag{1.18} F(x)⩾F(x1)=0(1.18)
即
f ( λ 1 x 1 + λ 2 x 2 ) ⩾ λ 1 f ( x 1 ) + λ 2 f ( x 2 ) (1.19) f \left( \lambda_1 x_1 + \lambda_2 x_2 \right) \geqslant \lambda_1 f \left( x_1 \right) + \lambda_2 f \left( x_2 \right) \tag{1.19} f(λ1x1+λ2x2)⩾λ1f(x1)+λ2f(x2)(1.19)
成立。
现在我们假设当 n = k n = k n=k时,不等式成立,即
f ( ∑ i = 1 k λ i x i ) ⩾ ∑ i = 1 k λ i f ( x i ) (1.20) f \left(\sum\limits_{i = 1}^k \lambda_i x_i \right) \geqslant \sum\limits_{i = 1}^k \lambda_i f \left( x_i \right) \tag{1.20} f(i=1∑kλixi)⩾i=1∑kλif(xi)(1.20)
则当 n = k + 1 n = k + 1 n=k+1时,由 ∑ i = 1 k + 1 λ i = 1 \displaystyle\sum_{i = 1}^{k + 1} \lambda_i = 1 i=1∑k+1λi=1可得 ∑ i = 1 k λ i = 1 − λ k + 1 \displaystyle\sum_{i = 1}^{k} \lambda_i = 1 - \lambda_{k + 1} i=1∑kλi=1−λk+1,从而有
1 1 − λ k + 1 ∑ i = 1 k λ i = 1 (1.21) \cfrac{1}{1 - \lambda_{k + 1}} \displaystyle\sum_{i = 1}^{k} \lambda_i = 1 \tag{1.21} 1−λk+11i=1∑kλi=1(1.21)
那么
∑ i = 1 k + 1 λ i f ( x i ) = ∑ i = 1 k λ i f ( x i ) + λ k + 1 f ( x k + 1 ) = ( 1 − λ k + 1 ) 1 1 − λ k + 1 ∑ i = 1 k λ i f ( x i ) + λ k + 1 f ( x k + 1 ) ⩾ ( 1 − λ k + 1 ) f ( ∑ i = 1 k λ i x i 1 − λ k + 1 ) + λ k + 1 f ( x k + 1 ) ⩾ f ( ( 1 − λ k + 1 ) ∑ i = 1 k λ i x i 1 − λ k + 1 + λ k + 1 x k + 1 ) = f ( ∑ i = 1 k + 1 λ i x i ) (1.22) \begin{aligned} \sum_{i = 1} ^ {k + 1} \lambda_i f(x_i) &= \sum_{i = 1} ^ {k} \lambda_i f(x_i) + \lambda_{k +1} f(x_{k +1}) \\ &= (1 - \lambda_{k + 1}) \cfrac{1}{1 - \lambda_{k + 1}} \sum_{i = 1} ^ {k} \lambda_i f(x_i) + \lambda_{k +1} f(x_{k +1}) \\ &\geqslant (1 - \lambda_{k + 1}) f \left( \cfrac{\sum\limits_{i = 1} ^ {k} \lambda_i x_i}{1 - \lambda_{k + 1}} \right) + \lambda_{k +1} f(x_{k +1}) \\ &\geqslant f \left( (1 - \lambda_{k + 1}) \cfrac{\sum\limits_{i = 1} ^ {k} \lambda_i x_i}{1 - \lambda_{k + 1}} + \lambda_{k+1} x_{k +1} \right) \\ &= f \left( \sum\limits_{i = 1}^{k + 1} \lambda_i x_i \right) \end{aligned} \tag{1.22} i=1∑k+1λif(xi)=i=1∑kλif(xi)+λk+1f(xk+1)=(1−λk+1)1−λk+11i=1∑kλif(xi)+λk+1f(xk+1)⩾(1−λk+1)f⎝⎜⎜⎜⎛1−λk+1i=1∑kλixi⎠⎟⎟⎟⎞+λk+1f(xk+1)⩾f⎝⎜⎜⎜⎛(1−λk+1)1−λk+1i=1∑kλixi+λk+1xk+1⎠⎟⎟⎟⎞=f(i=1∑k+1λixi)(1.22)
综上所述, f ( ∑ i = 1 n λ i x i ) ⩽ ∑ i = 1 n λ i f ( x i ) f \left(\displaystyle\sum_{i = 1}^n \lambda_i x_i \right) \leqslant \displaystyle\sum_{i = 1}^n \lambda_i f \left( x_i \right) f(i=1∑nλixi)⩽i=1∑nλif(xi)成立。
同理,若 f ( x ) f(x) f(x)是区间 ( a , b ) \left( a, b \right) (a,b)上的凹函数,有 f ( ∑ i = 1 n λ i x i ) ⩾ ∑ i = 1 n λ i f ( x i ) f \left( \displaystyle\sum_{i = 1}^n \lambda_i x_i \right) \geqslant \displaystyle\sum_{i = 1}^n \lambda_i f \left( x_i \right) f(i=1∑nλixi)⩾i=1∑nλif(xi)成立。
那么我们注意到 ∑ Z P ( Z ∣ Y , θ ( i ) ) \displaystyle\sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) Z∑P(Z∣Y,θ(i))实际上就代表的是隐变量在一组数据下取遍所有可能的概率,所以有
∑ Z P ( Z ∣ Y , θ ( i ) ) = 1 (1.23) \displaystyle\sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) = 1 \tag{1.23} Z∑P(Z∣Y,θ(i))=1(1.23)
那么我们就可以利用琴生不等式(Jensen inequality)来得到式 ( 1.10 ) (1.10) (1.10)的下界:
L ( θ ) − L ( θ ( i ) ) = log ( ∑ Z P ( Z ∣ Y , θ ( i ) ) P ( Y ∣ Z , θ ) P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) ) − log P ( Y ∣ θ ( i ) ) ⩾ ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y ∣ Z , θ ) P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) − log P ( Y ∣ θ ( i ) ) = ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y ∣ Z , θ ) P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) − ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y ∣ θ ( i ) ) = ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y ∣ Z , θ ) P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) P ( Y ∣ θ ( i ) ) (1.24) \begin{aligned} L(\bm{\theta}) - L(\bm{\theta}^{(i)}) &= \log \left(\sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \cfrac{P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)})} \right) - \log P(\bm{Y} \mid \bm{\theta}^{(i)}) \\ &\geqslant \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \cfrac{P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)})} - \log P(\bm{Y} \mid \bm{\theta}^{(i)}) \\ &= \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \cfrac{P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)})} - \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Y} \mid \bm{\theta}^{(i)}) \\ &= \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \cfrac{P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) P(\bm{Y} \mid \bm{\theta}^{(i)})} \\ \end{aligned} \tag{1.24} L(θ)−L(θ(i))=log(Z∑P(Z∣Y,θ(i))P(Z∣Y,θ(i))P(Y∣Z,θ)P(Z∣θ))−logP(Y∣θ(i))⩾Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i))P(Y∣Z,θ)P(Z∣θ)−logP(Y∣θ(i))=Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i))P(Y∣Z,θ)P(Z∣θ)−Z∑P(Z∣Y,θ(i))logP(Y∣θ(i))=Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i))P(Y∣θ(i))P(Y∣Z,θ)P(Z∣θ)(1.24)
令
B ( θ , θ ( i ) ) = ^ L ( θ ( i ) ) + ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y ∣ Z , θ ) P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) P ( Y ∣ θ ( i ) ) (1.25) B(\bm{\theta}, \bm{\theta} ^ {(i)}) \hat{=} L(\bm{\theta} ^ {(i)}) + \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \cfrac{P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) P(\bm{Y} \mid \bm{\theta}^{(i)})} \tag{1.25} B(θ,θ(i))=^L(θ(i))+Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i))P(Y∣θ(i))P(Y∣Z,θ)P(Z∣θ)(1.25)
则
L ( θ ) ⩾ B ( θ , θ ( i ) ) (1.26) L(\bm{\theta}) \geqslant B(\bm{\theta}, \bm{\theta} ^ {(i)}) \tag{1.26} L(θ)⩾B(θ,θ(i))(1.26)
即函数 B ( θ , θ ( i ) ) B(\bm{\theta}, \bm{\theta} ^ {(i)}) B(θ,θ(i))是 L ( θ ) L(\bm{\theta}) L(θ)的一个下界,而且由式 ( 1.25 ) (1.25) (1.25)可知,
L ( θ ( i ) ) = B ( θ ( i ) , θ ( i ) ) (1.27) L(\bm{\theta} ^ {(i)}) = B(\bm{\theta} ^ {(i)}, \bm{\theta} ^ {(i)}) \tag{1.27} L(θ(i))=B(θ(i),θ(i))(1.27)
因此,任何可以使 B ( θ , θ ( i ) ) B(\bm{\theta}, \bm{\theta} ^ {(i)}) B(θ,θ(i))增大的 θ \bm{\theta} θ,也可以使 L ( θ ) L(\bm{\theta}) L(θ)增大。为了使 L ( θ ) L(\bm{\theta}) L(θ)有尽可能大的增长,选择 θ ( i + 1 ) \bm{\theta} ^ {(i + 1)} θ(i+1)使 B ( θ , θ ( i ) ) B(\bm{\theta}, \bm{\theta} ^ {(i)}) B(θ,θ(i))达到极大,即
θ ( i + 1 ) = arg max θ B ( θ , θ ( i ) ) (1.28) \bm{\theta} ^ {(i + 1)} = \mathop{\arg} \mathop{\max}\limits_{\bm{\theta}} B(\bm{\theta}, \bm{\theta} ^ {(i)}) \tag{1.28} θ(i+1)=argθmaxB(θ,θ(i))(1.28)
将式 ( 1.25 ) (1.25) (1.25)代入,有
θ ( i + 1 ) = arg max θ ( L ( θ ( i ) ) + ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y ∣ Z , θ ) P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) P ( Y ∣ θ ( i ) ) ) = arg max θ ( log P ( Y ∣ θ ( i ) ) + ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y ∣ Z , θ ) P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) P ( Y ∣ θ ( i ) ) ) = arg max θ ( ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y ∣ θ ( i ) ) + ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y ∣ Z , θ ) P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) P ( Y ∣ θ ( i ) ) ) = arg max θ ( ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y ∣ Z , θ ) P ( Z ∣ θ ) P ( Z ∣ Y , θ ( i ) ) ) = arg max θ ( ∑ Z P ( Z ∣ Y , θ ( i ) ) log [ P ( Y ∣ Z , θ ) P ( Z ∣ θ ) ] − ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Z ∣ Y , θ ( i ) ) ) (1.29) \begin{aligned} \bm{\theta} ^ {(i + 1)} &= \mathop{\arg} \mathop{\max}\limits_{\bm{\theta}} \left( L(\bm{\theta} ^ {(i)}) + \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \cfrac{P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) P(\bm{Y} \mid \bm{\theta}^{(i)})} \right) \\ &= \mathop{\arg} \mathop{\max}\limits_{\bm{\theta}} \left( \log P(\bm{Y} \mid \bm{\theta}^{(i)}) + \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \cfrac{P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) P(\bm{Y} \mid \bm{\theta}^{(i)})} \right) \\ &= \mathop{\arg} \mathop{\max}\limits_{\bm{\theta}} \left( \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Y} \mid \bm{\theta}^{(i)}) + \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \cfrac{P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) P(\bm{Y} \mid \bm{\theta}^{(i)})} \right) \\ &= \mathop{\arg} \mathop{\max}\limits_{\bm{\theta}} \left( \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \cfrac{P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)})} \right) \\ &= \mathop{\arg} \mathop{\max}\limits_{\bm{\theta}} \left( \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \left[ P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta}) \right] - \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \right) \\ \end{aligned} \tag{1.29} θ(i+1)=argθmax(L(θ(i))+Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i))P(Y∣θ(i))P(Y∣Z,θ)P(Z∣θ))=argθmax(logP(Y∣θ(i))+Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i))P(Y∣θ(i))P(Y∣Z,θ)P(Z∣θ))=argθmax(Z∑P(Z∣Y,θ(i))logP(Y∣θ(i))+Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i))P(Y∣θ(i))P(Y∣Z,θ)P(Z∣θ))=argθmax(Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i))P(Y∣Z,θ)P(Z∣θ))=argθmax(Z∑P(Z∣Y,θ(i))log[P(Y∣Z,θ)P(Z∣θ)]−Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i)))(1.29)
注意到, ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Z ∣ Y , θ ( i ) ) \displaystyle\sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i))对 θ \bm{\theta} θ而言是常数,换言之,想让 ∑ Z P ( Z ∣ Y , θ ( i ) ) log [ P ( Y ∣ Z , θ ) P ( Z ∣ θ ) ] − ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Z ∣ Y , θ ( i ) ) \displaystyle\sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \left[ P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta}) \right] - \displaystyle\sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) Z∑P(Z∣Y,θ(i))log[P(Y∣Z,θ)P(Z∣θ)]−Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i))取最大值,其实就是让 ∑ Z P ( Z ∣ Y , θ ( i ) ) log [ P ( Y ∣ Z , θ ) P ( Z ∣ θ ) ] \displaystyle\sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \left[ P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta}) \right] Z∑P(Z∣Y,θ(i))log[P(Y∣Z,θ)P(Z∣θ)]取最大值,即 ( 1.29 ) (1.29) (1.29)可变换为
θ ( i + 1 ) = arg max θ ( ∑ Z P ( Z ∣ Y , θ ( i ) ) log [ P ( Y ∣ Z , θ ) P ( Z ∣ θ ) ] − ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Z ∣ Y , θ ( i ) ) ) = arg max θ ( ∑ Z P ( Z ∣ Y , θ ( i ) ) log [ P ( Y ∣ Z , θ ) P ( Z ∣ θ ) ] ) = arg max θ ( ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y , Z ∣ θ ) ) (1.30) \begin{aligned} \bm{\theta} ^ {(i + 1)} &= \mathop{\arg} \mathop{\max}\limits_{\bm{\theta}} \left( \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \left[ P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta}) \right] - \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \right) \\ &= \mathop{\arg} \mathop{\max}\limits_{\bm{\theta}} \left( \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log \left[ P(\bm{Y} \mid \bm{Z}, \bm{\theta}) P(\bm{Z} \mid \bm{\theta}) \right] \right) \\ &= \mathop{\arg} \mathop{\max}\limits_{\bm{\theta}} \left( \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Y}, \bm{Z} \mid \bm{\theta}) \right) \\ \end{aligned} \tag{1.30} θ(i+1)=argθmax(Z∑P(Z∣Y,θ(i))log[P(Y∣Z,θ)P(Z∣θ)]−Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ(i)))=argθmax(Z∑P(Z∣Y,θ(i))log[P(Y∣Z,θ)P(Z∣θ)])=argθmax(Z∑P(Z∣Y,θ(i))logP(Y,Z∣θ))(1.30)
我们记
Q ( θ , θ ( i ) ) = ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y , Z ∣ θ ) (1.31) Q(\bm{\theta}, \bm{\theta} ^ {(i)}) = \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Y}, \bm{Z} \mid \bm{\theta}) \tag{1.31} Q(θ,θ(i))=Z∑P(Z∣Y,θ(i))logP(Y,Z∣θ)(1.31)
如此,我们如下定义EM算法的核心: Q Q Q函数。
定义9.1( Q \bm Q Q函数) 完全数据的对数似然函数 log P ( Y , Z ∣ θ ) \log P(\bm{Y}, \bm{Z} \mid \bm{\theta}) logP(Y,Z∣θ)关于在给定观测数据 Y \bm{Y} Y和当前参数 θ ( i ) \bm{\theta} ^ {(i)} θ(i)下对未观测数据 Z \bm{Z} Z的条件概率分布 log P ( Y , Z ∣ θ ( i ) ) \log P(\bm{Y}, \bm{Z} \mid \bm{\theta} ^ {(i)}) logP(Y,Z∣θ(i))的期望称为 Q Q Q函数,即
Q ( θ , θ ( i ) ) = E Z [ log P ( Y , Z ∣ θ ) ∣ Y , θ ( i ) ] (1.32) Q(\bm{\theta}, \bm{\theta} ^ {(i)}) = E_\bm{Z} \left[ \log P(\bm{Y}, \bm{Z} \mid \bm{\theta}) \mid \bm{Y}, \bm{\theta} ^ {(i)} \right] \tag{1.32} Q(θ,θ(i))=EZ[logP(Y,Z∣θ)∣Y,θ(i)](1.32)
于是式 ( 1.30 ) (1.30) (1.30)等价于EM算法的一次迭代,即求 Q Q Q函数极其极大化。EM算法是通过不断求解下界的极大化逼近求解对数似然函数极大化的算法。
图1.1给出了EM算法的直观解释。图中上方曲线为 L ( θ ) L(\bm{\theta}) L(θ),下方曲线为 B ( θ , θ ( i ) ) B(\bm{\theta}, \bm{\theta} ^ {(i)}) B(θ,θ(i))。由式 ( 1.26 ) (1.26) (1.26),两个函数在点 θ = θ ( i ) \bm{\theta} = \bm{\theta} ^ {(i)} θ=θ(i)处相等。由式 ( 1.28 ) (1.28) (1.28),式 ( 1.29 ) (1.29) (1.29)和式 ( 1.30 ) (1.30) (1.30),EM算法找到下一个点 θ ( i + 1 ) \bm{\theta} ^ {(i + 1)} θ(i+1)使函数 B ( θ , θ ( i ) ) B(\bm{\theta}, \bm{\theta} ^ {(i)}) B(θ,θ(i))极大化,也使函数 Q ( θ , θ ( i ) ) Q(\bm{\theta}, \bm{\theta} ^ {(i)}) Q(θ,θ(i))极大化。这时由于 L ( θ ) ⩾ B ( θ , θ ( i ) ) L(\bm{\theta}) \geqslant B(\bm{\theta}, \bm{\theta} ^ {(i)}) L(θ)⩾B(θ,θ(i)),函数 B ( θ , θ ( i ) ) B(\bm{\theta}, \bm{\theta} ^ {(i)}) B(θ,θ(i))的增加,保证对数似然函数 L ( θ ) L(\bm{\theta}) L(θ)在每次迭代中也是增加的。EM算法在点 θ ( i + 1 ) \bm{\theta} ^ {(i +1)} θ(i+1)重新计算 Q Q Q函数值,进行下一次迭代。在这个过程中,对数似然函数 L ( θ ) L(\bm{\theta}) L(θ)不断增大。从图中可以推断出EM算法不能保证找到全局最优值。
二、EM算法的收敛性
EM算法提供一种近似计算含有隐变量概率模型的极大似然估计的方法。EM算法的最大优点是简单性和普适性。那么EM算法得到的估计序列是否收敛?如果收敛,是否收敛到全局最大值或局部极大值?
定理2.1 设 P ( Y ∣ θ ) P(\bm{Y} \mid \bm{\theta}) P(Y∣θ)为观测数据的似然函数, θ ( i ) ( i = 1 , 2 , ⋯ ) \bm{\theta} ^ {(i)} (i = 1, 2, \cdots) θ(i)(i=1,2,⋯)为EM算法得到的参数估计序列, P ( Y ∣ θ ( i ) ) ( i = 1 , 2 , ⋯ ) P(\bm{Y} \mid \bm{\theta} ^ {(i)}) (i = 1, 2, \cdots) P(Y∣θ(i))(i=1,2,⋯)为对应的似然函数序列,则 P ( Y ∣ θ ( i ) ) P(\bm{Y} \mid \bm{\theta} ^ {(i)}) P(Y∣θ(i))是单调递增的,即
P ( Y ∣ θ ( i + 1 ) ) ⩾ P ( Y ∣ θ ( i ) ) (2.1) P(\bm{Y} \mid \bm{\theta} ^ {(i + 1)}) \geqslant P(\bm{Y} \mid \bm{\theta} ^ {(i)}) \tag{2.1} P(Y∣θ(i+1))⩾P(Y∣θ(i))(2.1)
证明 由于
P ( Y ∣ θ ) = P ( Y , Z ∣ θ ) P ( Z ∣ Y , θ ) P(\bm{Y} \mid \bm{\theta}) = \cfrac{P(\bm{Y}, \bm{Z} \mid \bm{\theta})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta})} P(Y∣θ)=P(Z∣Y,θ)P(Y,Z∣θ)
取对数有
log P ( Y ∣ θ ) = log P ( Y , Z ∣ θ ) − log P ( Z ∣ Y , θ ) \log P(\bm{Y} \mid \bm{\theta}) = \log P(\bm{Y}, \bm{Z} \mid \bm{\theta}) - \log P(\bm{Z} \mid \bm{Y}, \bm{\theta}) logP(Y∣θ)=logP(Y,Z∣θ)−logP(Z∣Y,θ)
由式 ( 1.32 ) (1.32) (1.32)
Q ( θ , θ ( i ) ) = ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y , Z ∣ θ ) Q(\bm{\theta}, \bm{\theta} ^ {(i)}) = \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Y}, \bm{Z} \mid \bm{\theta}) Q(θ,θ(i))=Z∑P(Z∣Y,θ(i))logP(Y,Z∣θ)
令
H ( θ , θ ( i ) ) = ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Z ∣ Y , θ ) (2.2) H(\bm{\theta}, \bm{\theta} ^ {(i)}) = \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Z} \mid \bm{Y}, \bm{\theta}) \tag{2.2} H(θ,θ(i))=Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ)(2.2)
注意到 ∑ Z P ( Z ∣ Y , θ ( i ) ) = 1 \displaystyle\sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) = 1 Z∑P(Z∣Y,θ(i))=1,且对数似然函数不含有隐变量 Z \bm{Z} Z,于是
log P ( Y ∣ θ ) = ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y ∣ θ ) = ∑ Z P ( Z ∣ Y , θ ( i ) ) ( log P ( Y , Z ∣ θ ) − log P ( Z ∣ Y , θ ) ) = ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y , Z ∣ θ ) − ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Z ∣ Y , θ ) = Q ( θ , θ ( i ) ) − H ( θ , θ ( i ) ) (2.3) \begin{aligned} \log P(\bm{Y} \mid \bm{\theta}) &= \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Y} \mid \bm{\theta}) \\ &= \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \left( \log P(\bm{Y}, \bm{Z} \mid \bm{\theta}) - \log P(\bm{Z} \mid \bm{Y}, \bm{\theta}) \right) \\ &= \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Y}, \bm{Z} \mid \bm{\theta}) - \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Z} \mid \bm{Y}, \bm{\theta}) \\ &= Q(\bm{\theta}, \bm{\theta} ^ {(i)}) - H(\bm{\theta}, \bm{\theta} ^ {(i)}) \\ \end{aligned} \tag{2.3} logP(Y∣θ)=Z∑P(Z∣Y,θ(i))logP(Y∣θ)=Z∑P(Z∣Y,θ(i))(logP(Y,Z∣θ)−logP(Z∣Y,θ))=Z∑P(Z∣Y,θ(i))logP(Y,Z∣θ)−Z∑P(Z∣Y,θ(i))logP(Z∣Y,θ)=Q(θ,θ(i))−H(θ,θ(i))(2.3)
在式 ( 2.3 ) (2.3) (2.3)中分别取 θ \bm{\theta} θ为 θ ( i ) \bm{\theta} ^ {(i)} θ(i)和 θ ( i + 1 ) \bm{\theta} ^ {(i + 1)} θ(i+1)并相减,有
log P ( Y ∣ θ ( i + 1 ) ) − log P ( Y ∣ θ ( i ) ) = [ Q ( θ ( i + 1 ) , θ ( i ) ) − Q ( θ ( i ) , θ ( i ) ) ] − [ H ( θ ( i + 1 ) , θ ( i ) ) − H ( θ ( i ) , θ ( i ) ) ] (2.4) \log P(\bm{Y} \mid \bm{\theta} ^ {(i + 1)}) - \log P(\bm{Y} \mid \bm{\theta} ^ {(i)}) = \left[ Q(\bm{\theta} ^ {(i + 1)}, \bm{\theta} ^ {(i)}) - Q(\bm{\theta} ^ {(i)}, \bm{\theta} ^ {(i)}) \right] - \left[ H(\bm{\theta} ^ {(i + 1)}, \bm{\theta} ^ {(i)}) - H(\bm{\theta} ^ {(i)}, \bm{\theta} ^ {(i)}) \right] \tag{2.4} logP(Y∣θ(i+1))−logP(Y∣θ(i))=[Q(θ(i+1),θ(i))−Q(θ(i),θ(i))]−[H(θ(i+1),θ(i))−H(θ(i),θ(i))](2.4)
为证式 ( 2.1 ) (2.1) (2.1),只需证式 ( 2.4 ) (2.4) (2.4)右端是非负的。式 ( 2.4 ) (2.4) (2.4)右端的第 1 1 1项,由于 θ ( i + 1 ) \bm{\theta} ^ {(i + 1)} θ(i+1)使 Q ( θ , θ ( i ) ) Q(\bm{\theta}, \bm{\theta} ^ {(i)}) Q(θ,θ(i))达到极大,所以有
Q ( θ ( i + 1 ) , θ ( i ) ) − Q ( θ ( i ) , θ ( i ) ) ⩾ 0 (2.5) Q(\bm{\theta} ^ {(i + 1)}, \bm{\theta} ^ {(i)}) - Q(\bm{\theta} ^ {(i)}, \bm{\theta} ^ {(i)}) \geqslant 0 \tag{2.5} Q(θ(i+1),θ(i))−Q(θ(i),θ(i))⩾0(2.5)
其第 2 2 2项,由式 ( 2.2 ) (2.2) (2.2)可得:
H ( θ ( i + 1 ) , θ ( i ) ) − H ( θ ( i ) , θ ( i ) ) = ∑ Z P ( Z ∣ Y , θ ( i ) ) ( log P ( Z ∣ Y , θ ( i + 1 ) ) P ( Z ∣ Y , θ ( i ) ) ) ⩽ log ( ∑ Z P ( Z ∣ Y , θ ( i ) ) P ( Z ∣ Y , θ ( i + 1 ) ) P ( Z ∣ Y , θ ( i ) ) ) = log ( ∑ Z P ( Z ∣ Y , θ ( i + 1 ) ) ) = 0 (2.6) \begin{aligned} H(\bm{\theta} ^ {(i + 1)}, \bm{\theta} ^ {(i)}) - H(\bm{\theta} ^ {(i)}, \bm{\theta} ^ {(i)}) &= \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \left( \log \frac{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i + 1)})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)})} \right) \\ &\leqslant \log \left( \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \frac{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i + 1)})}{P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)})} \right) \\ &= \log \left( \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i + 1)}) \right) \\ &= 0 \\ \end{aligned} \tag{2.6} H(θ(i+1),θ(i))−H(θ(i),θ(i))=Z∑P(Z∣Y,θ(i))(logP(Z∣Y,θ(i))P(Z∣Y,θ(i+1)))⩽log(Z∑P(Z∣Y,θ(i))P(Z∣Y,θ(i))P(Z∣Y,θ(i+1)))=log(Z∑P(Z∣Y,θ(i+1)))=0(2.6)
这里的不等号由上面提到的琴生不等式得到。
由式 ( 2.5 ) (2.5) (2.5)和式 ( 2.6 ) (2.6) (2.6)即知式 ( 2.3 ) (2.3) (2.3)右端是非负的。
定理2.2 设 L ( θ ) = P ( Y ∣ θ ) L(\bm{\theta}) = P(\bm{Y} \mid \bm{\theta}) L(θ)=P(Y∣θ)为观测数据的对数似然函数, θ ( i ) ( i = 1 , 2 , ⋯ ) \bm{\theta} ^ {(i)} (i = 1, 2, \cdots) θ(i)(i=1,2,⋯)为EM算法得到的参数估计序列, L ( θ ( i ) ) ( i = 1 , 2 , ⋯ ) L(\bm{\theta} ^ {(i)}) (i = 1, 2, \cdots) L(θ(i))(i=1,2,⋯)为对应的对数似然函数序列。
(1)如果 P ( Y ∣ θ ) P(\bm{Y} \mid \bm{\theta}) P(Y∣θ)有上界,则 L ( θ ( i ) ) = log P ( Y ∣ θ ( i ) ) L(\bm{\theta} ^ {(i)}) = \log P(\bm{Y} \mid \bm{\theta} ^ {(i)}) L(θ(i))=logP(Y∣θ(i))收敛到某一值 L ∗ L^* L∗;
(2)在函数 Q ( θ , θ ′ ) Q(\bm{\theta}, \bm{\theta} ^ {'}) Q(θ,θ′)与 L ( θ ) L(\bm{\theta}) L(θ)满足一定条件下,由EM算法得到的参数估计序列 θ ( i ) \bm{\theta} ^ {(i)} θ(i)的收敛值 θ ∗ \bm{\theta} ^ {*} θ∗是 L ( θ ) L(\bm{\theta}) L(θ)的稳定点。
证明 (1)由 L ( θ ) = log P ( Y ∣ θ ) L(\bm{\theta}) = \log P(\bm{Y} \mid \bm{\theta}) L(θ)=logP(Y∣θ)的单调性及 P ( Y ∣ θ ) P(\bm{Y} \mid \bm{\theta}) P(Y∣θ)的有界性立即得到。
(2)参见论文《On the Convergence Properties of the EM Algorithm》。
定理2.2关于函数 Q ( θ , θ ′ ) Q(\bm{\theta}, \bm{\theta} ^ {'}) Q(θ,θ′)与 L ( θ ) L(\bm{\theta}) L(θ)的条件在大多数情况下都是满足的。EM算法的收敛性包含关于对数似然函数序列 L ( θ ( i ) ) L(\bm{\theta} ^ {(i)}) L(θ(i))的收敛性和关于参数估计序列 θ ( i ) \bm{\theta} ^ {(i)} θ(i)的收敛性两层意思,前者并不蕴涵后者。此外,定理只能保证参数估计序列收敛到对数似然函数序列的稳定点,不能保证收敛到极大值点。所以在应用中,初值的选择变得非常重要,常用的办法是选取几个不同的初值进行迭代,然后对得到的各个估计值加以比较,从中选择最好的。
三、EM算法
输入:观测变量数据 Y \bm{Y} Y,隐变量数据 Z \bm{Z} Z,联合分布 P ( Y , Z ∣ θ ) P(\bm{Y}, \bm{Z} \mid \bm{\theta}) P(Y,Z∣θ),条件分布 P ( Z ∣ Y , θ ) P(\bm{Z} \mid \bm{Y}, \bm{\theta}) P(Z∣Y,θ);
输出:观测模型参数 θ \bm{\theta} θ。
(1)选择参数的初值 θ ( 0 ) \bm{\theta} ^ {(0)} θ(0),开始迭代。
(2)E步:记 θ ( i ) \bm{\theta} ^ {(i)} θ(i)为第 i i i次迭代参数 θ \bm{\theta} θ的估计值,在第 i + 1 i + 1 i+1次迭代的E步,计算
Q ( θ , θ ( i ) ) = E Z [ log P ( Y , Z ∣ θ ) ∣ Y , θ ( i ) ] = ∑ Z P ( Z ∣ Y , θ ( i ) ) log P ( Y , Z ∣ θ ) (3.1) \begin{aligned} Q(\bm{\theta}, \bm{\theta} ^ {(i)}) &= E_\bm{Z} \left[ \log P(\bm{Y}, \bm{Z} \mid \bm{\theta}) \mid \bm{Y}, \bm{\theta} ^ {(i)} \right] \\ &= \sum_{\bm{Z}} P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) \log P(\bm{Y}, \bm{Z} \mid \bm{\theta}) \end{aligned} \tag{3.1} Q(θ,θ(i))=EZ[logP(Y,Z∣θ)∣Y,θ(i)]=Z∑P(Z∣Y,θ(i))logP(Y,Z∣θ)(3.1)
这里, P ( Z ∣ Y , θ ( i ) ) P(\bm{Z} \mid \bm{Y}, \bm{\theta} ^ {(i)}) P(Z∣Y,θ(i))是在给定观测数据 Y \bm{Y} Y和当前的参数估计 θ ( i ) \bm{\theta} ^ {(i)} θ(i)下隐变量数据 Z \bm{Z} Z的条件概率分布;
(3)M步:求使 Q ( θ , θ ( i ) ) Q(\bm{\theta}, \bm{\theta} ^ {(i)}) Q(θ,θ(i))极大化的 θ \bm{\theta} θ,确定第 i + 1 i + 1 i+1次迭代的参数的估计值 θ ( i + 1 ) \bm{\theta} ^ {(i + 1)} θ(i+1)
θ ( i + 1 ) = arg max θ Q ( θ , θ ( i ) ) (3.2) \bm{\theta} ^ {(i + 1)} = \mathop{\arg} \mathop{\max}\limits_{\bm{\theta}} Q(\bm{\theta}, \bm{\theta} ^ {(i)}) \tag{3.2} θ(i+1)=argθmaxQ(θ,θ(i))(3.2)
(4)重复第(2)步和第(3)步,直到收敛。
下面关于EM算法作几点说明:
步骤(1) 参数的初值 可以任意选择,但需注意EM算法对初值是敏感的。
步骤(2) E步求 Q ( θ , θ ( i ) ) Q(\bm{\theta}, \bm{\theta} ^ {(i)}) Q(θ,θ(i))。 Q Q Q函数式中 Z \bm{Z} Z是未观测数据, Y \bm{Y} Y是观测数据。注意, Q ( θ , θ ( i ) ) Q(\bm{\theta}, \bm{\theta} ^ {(i)}) Q(θ,θ(i))的第1个变元表示要极大化的参数,第2个变元表示参数的当前估计值。每次迭代实际在求 Q Q Q函数及其极大。
步骤(3) M步求 Q ( θ , θ ( i ) ) Q(\bm{\theta}, \bm{\theta} ^ {(i)}) Q(θ,θ(i))的极大化,得到 θ ( i + 1 ) \bm{\theta} ^ {(i + 1)} θ(i+1),完成一次迭代 θ ( i ) → θ ( i + 1 ) \bm{\theta} ^ {(i)} \rightarrow \bm{\theta} ^ {(i + 1)} θ(i)→θ(i+1)。后面将证明每次迭代使似然函数增大或达到局部极值。
步骤(4) 给出停止迭代的条件,一般是对较小的正数 ε 1 \varepsilon_1 ε1, ε 2 \varepsilon_2 ε2,若满足
∥ θ ( i + 1 ) − θ ( i ) ∥ < ε 1 或 ∥ Q ( θ ( i + 1 ) , θ ( i ) ) − Q ( θ ( i ) , θ ( i ) ) ∥ < ε 2 \lVert \bm{\theta} ^ {(i + 1)} - \bm{\theta} ^ {(i)} \rVert < \varepsilon_1或\lVert Q(\bm{\theta} ^ {(i + 1)}, \bm{\theta} ^ {(i)}) - Q(\bm{\theta} ^ {(i)}, \bm{\theta} ^ {(i)}) \rVert < \varepsilon_2 ∥θ(i+1)−θ(i)∥<ε1或∥Q(θ(i+1),θ(i))−Q(θ(i),θ(i))∥<ε2
则停止迭代。
四、EM算法在高斯混合模型学习中的应用
EM算法的一个重要应用是高斯混合模型(Gaussian mixture model)的参数估计。高斯混合模型应用广泛,在许多情况下,EM算法是学习高斯混合模型的有效方法。
1. 高斯混合模型
定义4.1(高斯混合模型) 高斯混合模型是指具有如下形式的概率分布模型:
P ( y ∣ θ ) = ∑ k = 1 K α k ϕ ( y ∣ θ k ) (4.1) P(y \mid \theta) = \displaystyle\sum_{k = 1} ^ {K} \alpha_k \phi \left( y \mid \theta_k \right) \tag{4.1} P(y∣θ)=k=1∑Kαkϕ(y∣θk)(4.1)
其中, α k \alpha_k αk是系数, α k ⩾ 0 \alpha_k \geqslant 0 αk⩾0, ∑ k = 1 K α k = 1 \displaystyle\sum_{k = 1} ^ {K} \alpha_k = 1 k=1∑Kαk=1; ϕ ( y ∣ θ k ) \phi \left( y \mid \theta_k \right) ϕ(y∣θk)是高斯分布密度, θ k = ( μ k , σ k 2 ) \theta_k = \left( \mu_k, \sigma_k ^ {2} \right) θk=(μk,σk2),
ϕ ( y ∣ θ k ) = 1 2 π σ k exp ( − ( y − μ k ) 2 2 σ k 2 ) (4.2) \phi \left( y \mid \theta_k \right) = \frac{1}{\sqrt{2\pi}\sigma_k} \exp \left( - \frac{\left( y - \mu_k \right) ^ {2}}{2 \sigma_k ^ {2}} \right) \tag{4.2} ϕ(y∣θk)=2πσk1exp(−2σk2(y−μk)2)(4.2)
称为第 k k k个分模型。
一般混合模型可以由任意概率分布密度代替式 ( 4.2 ) (4.2) (4.2)中的高斯分布密度,例如:01分布,二项分布,泊松分布等等。
高斯混合模型本质上就是将多个高斯分布产生的数据放在一起,第 k k k个模型产生的数据的数量占总数据数量的比例就是 α k \alpha_k αk,现在这些数据我们都不知道是由哪一个高斯分布产生的,EM算法的作用就是进行划分,分清楚每一个数据到底是由哪一个高斯分布产生的。
2. 高斯混合模型参数估计的EM算法
假设观测数据 y 1 , y 2 , ⋯ , y N y_1, y_2, \cdots, y_N y1,y2,⋯,yN由高斯混合模型生成,那么
P ( y j ∣ θ ) = ∑ k = 1 K α k ϕ ( y j ∣ θ k ) , j = 1 , 2 , ⋯ , N (4.3) P(y_j \mid \theta) = \displaystyle\sum_{k = 1} ^ {K} \alpha_k \phi \left( y_j \mid \theta_k \right),\ j = 1, 2, \cdots, N \tag{4.3} P(yj∣θ)=k=1∑Kαkϕ(yj∣θk), j=1,2,⋯,N(4.3)
其中, θ = ( α 1 , α 2 , ⋯ , α K ; θ 1 , θ 2 , ⋯ , θ K ) \theta = \left( \alpha_1, \alpha_2, \cdots, \alpha_K; \theta_1, \theta_2, \cdots, \theta_K \right) θ=(α1,α2,⋯,αK;θ1,θ2,⋯,θK)。我们用EM算法估计高斯混合模型的参数 θ \theta θ。
- 明确隐变量,写出完全数据的对数似然函数
可以设想观测数据 y j y_j yj, j = 1 , 2 , ⋯ , N j = 1, 2, \cdots, N j=1,2,⋯,N,是这样产生的:首先依概率 α k \alpha_k αk选择第 k k k个高斯分布分模型 ϕ ( y ∣ θ k ) \phi \left( y \mid \theta_k \right) ϕ(y∣θk),然后依第 k k k个分模型的概率分布 ϕ ( y ∣ θ k ) \phi \left( y \mid \theta_k \right) ϕ(y∣θk)生成观测数据 y j y_j yj。这时观测数据 y j y_j yj, j = 1 , 2 , ⋯ , N j = 1, 2, \cdots, N j=1,2,⋯,N是已知的;反映观测数据 y j y_j yj来自第 k k k个分模型的数据是未知的, k = 1 , 2 , ⋯ , K k = 1, 2, \cdots, K k=1,2,⋯,K,以隐变量 γ j k \gamma_{jk} γjk表示,其定义如下:
γ j k = { 1 , 第 j 个观测来自第 k 个分模量 0 , 否则 j = 1 , 2 , ⋯ , N ; k = 1 , 2 , ⋯ , K (4.4) \gamma_{jk} = \begin{cases} 1 &,\ \text{第$j$个观测来自第$k$个分模量} \\ 0 &,\ \text{否则} \end{cases} \\ j = 1, 2, \cdots, N;\ k = 1, 2, \cdots, K \tag{4.4} γjk={10, 第j个观测来自第k个分模量, 否则j=1,2,⋯,N; k=1,2,⋯,K(4.4)
γ j k \gamma_{jk} γjk是0-1随机变量。
有了观测数据 y j y_j yj及未观测数据 γ j k \gamma_{jk} γjk,那么完全数据是
( y j , γ j 1 , γ j 2 , ⋯ , γ j K ) , j = 1 , 2 , ⋯ , N \left( y_j, \gamma_{j1}, \gamma_{j2}, \cdots, \gamma_{jK} \right),\ j = 1, 2, \cdots, N (yj,γj1,γj2,⋯,γjK), j=1,2,⋯,N
于是,可以写出完全数据的似然函数:
P ( y , γ ∣ θ ) = ∏ j = 1 N P ( y j , γ j 1 , γ j 2 , ⋯ , γ j K ∣ θ ) = ∏ k = 1 K ∏ j = 1 N [ α k ϕ ( y j ∣ θ k ) ] γ j k = ∏ k = 1 K α k n k ∏ j = 1 N [ ϕ ( y j ∣ θ k ) ] γ j k = ∏ k = 1 K α k n k ∏ j = 1 N [ 1 2 π σ k exp ( − ( y j − μ k ) 2 2 σ k 2 ) ] γ j k \begin{aligned} P(y, \gamma \mid \theta) &= \prod_{j = 1} ^ {N} P(y_j, \gamma_{j1}, \gamma_{j2}, \cdots, \gamma_{jK} \mid \theta) \\ &= \prod_{k = 1} ^ {K} \prod_{j = 1} ^ {N} \left[ \alpha_k \phi \left( y_j \mid \theta_k \right) \right] ^ {\gamma_{jk}} \\ &= \prod_{k = 1} ^ {K} \alpha_k ^ {n_k} \prod_{j = 1} ^ {N} \left[ \phi \left( y_j \mid \theta_k \right) \right] ^ {\gamma_{jk}} \\ &= \prod_{k = 1} ^ {K} \alpha_k ^ {n_k} \prod_{j = 1} ^ {N} \left[ \frac{1}{\sqrt{2\pi}\sigma_k} \exp \left( - \frac{\left( y_j - \mu_k \right) ^ {2}}{2 \sigma_k ^ {2}} \right) \right] ^ {\gamma_{jk}} \end{aligned} P(y,γ∣θ)=j=1∏NP(yj,γj1,γj2,⋯,γjK∣θ)=k=1∏Kj=1∏N[αkϕ(yj∣θk)]γjk=k=1∏Kαknkj=1∏N[ϕ(yj∣θk)]γjk=k=1∏Kαknkj=1∏N[2πσk1exp(−2σk2(yj−μk)2)]γjk
式中, n k = ∑ j = 1 N γ j k n_k = \displaystyle\sum_{j = 1} ^ {N} \gamma_{jk} nk=j=1∑Nγjk,由于对于每个数据,其实只能是由一个模型生成的,所以对于 γ j k \gamma_{jk} γjk,相同的 j j j下,只能有一个 k k k使 γ j k = 1 \gamma_{jk} = 1 γjk=1,其余的都为 0 0 0,所以有 ∑ k = 1 K γ j k = 1 \displaystyle\sum_{k = 1} ^ {K} \gamma_{jk} = 1 k=1∑Kγjk=1,那么 ∑ k = 1 K n k = ∑ k = 1 K ∑ j = 1 N γ j k = ∑ j = 1 N ∑ k = 1 K γ j k = N \displaystyle\sum_{k = 1} ^ {K} n_k = \displaystyle\sum_{k = 1} ^ {K} \displaystyle\sum_{j = 1} ^ {N} \gamma_{jk} = \displaystyle\sum_{j = 1} ^ {N} \displaystyle\sum_{k = 1} ^ {K} \gamma_{jk} = N k=1∑Knk=k=1∑Kj=1∑Nγjk=j=1∑Nk=1∑Kγjk=N。
那么,完全数据的对数似然函数为
log P ( y , γ ∣ θ ) = ∑ k = 1 K { n k log α k + ∑ j = 1 N γ j k [ log ( 1 2 π ) − log σ k − 1 2 σ k 2 ( y j − μ k ) 2 ] } \log P(y, \gamma \mid \theta) = \displaystyle\sum_{k = 1} ^ {K} \left\{ n_k \log \alpha_k + \displaystyle\sum_{j = 1} ^ {N} \gamma_{jk} \left[ \log \left( \frac{1}{\sqrt{2\pi}} \right) - \log \sigma_k - \frac{1}{2\sigma_k ^ {2}} \left( y_j - \mu_k \right) ^ {2} \right] \right\} logP(y,γ∣θ)=k=1∑K{nklogαk+j=1∑Nγjk[log(2π1)−logσk−2σk21(yj−μk)2]}
- EM算法的E步:确定 Q Q Q函数
Q ( θ , θ ( i ) ) = E [ log P ( y , γ ∣ θ ) ∣ y , θ ( i ) ] = E { ∑ k = 1 K { n k log α k + ∑ j = 1 N γ j k [ log ( 1 2 π ) − log σ k − 1 2 σ k 2 ( y j − μ k ) 2 ] } } = ∑ k = 1 K { ∑ j = 1 N ( E γ j k ) log α k + ∑ j = 1 N ( E γ j k ) [ log ( 1 2 π ) − log σ k − 1 2 σ k 2 ( y j − μ k ) 2 ] } (4.5) \begin{aligned} Q( \theta, \theta ^ {(i)} ) &= E\left[ \log P(y, \gamma \mid \theta) \mid y, \theta ^ {(i)} \right] \\ &= E\left\{ \displaystyle\sum_{k = 1} ^ {K} \left\{ n_k \log \alpha_k + \displaystyle\sum_{j = 1} ^ {N} \gamma_{jk} \left[ \log \left( \frac{1}{\sqrt{2\pi}} \right) - \log \sigma_k - \frac{1}{2\sigma_k ^ {2}} \left( y_j - \mu_k \right) ^ {2} \right] \right\} \right\} \\ &= \displaystyle\sum_{k = 1} ^ {K} \left\{ \displaystyle\sum_{j = 1} ^ {N} \left( E_{\gamma_{jk}} \right) \log \alpha_k + \displaystyle\sum_{j = 1} ^ {N} \left( E_{\gamma_{jk}} \right) \left[ \log \left( \frac{1}{\sqrt{2\pi}} \right) - \log \sigma_k - \frac{1}{2\sigma_k ^ {2}} \left( y_j - \mu_k \right) ^ {2} \right] \right\} \end{aligned} \tag{4.5} Q(θ,θ(i))=E[logP(y,γ∣θ)∣y,θ(i)]=E{k=1∑K{nklogαk+j=1∑Nγjk[log(2π1)−logσk−2σk21(yj−μk)2]}}=k=1∑K{j=1∑N(Eγjk)logαk+j=1∑N(Eγjk)[log(2π1)−logσk−2σk21(yj−μk)2]}(4.5)
这里需要计算 E ( γ j k ∣ y , θ ) E \left( \gamma_{jk} \mid y, \theta \right) E(γjk∣y,θ),记为 γ ^ j k \hat{\gamma}_{jk} γ^jk。
γ ^ j k = E ( γ j k ∣ y , θ ) = P ( γ j k = 1 ∣ y , θ ) = P ( γ j k = 1 , y j ∣ θ ) ∑ k = 1 K P ( γ j k = 1 , y j ∣ θ ) = P ( y j ∣ γ j k = 1 , θ ) P ( γ j k = 1 ∣ θ ) ∑ k = 1 K P ( y j ∣ γ j k = 1 , θ ) P ( γ j k = 1 ∣ θ ) = α k ϕ ( y j ∣ θ k ) ∑ k = 1 K α k ϕ ( y j ∣ θ k ) , j = 1 , 2 , ⋯ , N ; k = 1 , 2 , ⋯ , K \begin{aligned} \hat{\gamma}_{jk} &= E \left( \gamma_{jk} \mid y, \theta \right) = P \left( \gamma_{jk} = 1 \mid y, \theta \right) \\ &= \frac{P \left( \gamma_{jk} = 1, y_j \mid \theta \right)}{\displaystyle\sum_{k = 1} ^ {K} P\left( \gamma_{jk} = 1, y_j \mid \theta \right)} \\ &= \frac{P \left( y_j \mid \gamma_{jk} = 1, \theta \right) P \left( \gamma_{jk} = 1 \mid \theta \right)}{\displaystyle\sum_{k = 1} ^ {K} P \left( y_j \mid \gamma_{jk} = 1, \theta \right) P \left( \gamma_{jk} = 1 \mid \theta \right)} \\ &= \frac{\alpha_k \phi \left( y_j \mid \theta_k \right)}{\displaystyle\sum_{k = 1} ^ {K} \alpha_k \phi \left( y_j \mid \theta_k \right)},\ j = 1, 2, \cdots, N;\ k = 1, 2, \cdots, K \end{aligned} γ^jk=E(γjk∣y,θ)=P(γjk=1∣y,θ)=k=1∑KP(γjk=1,yj∣θ)P(γjk=1,yj∣θ)=k=1∑KP(yj∣γjk=1,θ)P(γjk=1∣θ)P(yj∣γjk=1,θ)P(γjk=1∣θ)=k=1∑Kαkϕ(yj∣θk)αkϕ(yj∣θk), j=1,2,⋯,N; k=1,2,⋯,K
γ ^ j k \hat{\gamma}_{jk} γ^jk是在当前模型参数下第 j j j个观测数据来自第 k k k个分模型的概率,称为分模型 k k k对观测数据 y j y_j yj的响应度。
将 γ ^ j k = E γ j k \hat{\gamma}_{jk} = E_{\gamma_{jk}} γ^jk=Eγjk及 n k = ∑ j = 1 N E γ j k n_k = \displaystyle\sum_{j = 1} ^ {N} E_{\gamma_{jk}} nk=j=1∑NEγjk代入式 ( 4.5 ) (4.5) (4.5),即得
Q ( θ , θ ( i ) ) = ∑ k = 1 K { n k log α k + ∑ j = 1 N γ ^ j k [ log ( 1 2 π ) − log σ k − 1 2 σ k 2 ( y j − μ k ) 2 ] } (4.6) Q( \theta, \theta ^ {(i)} ) = \displaystyle\sum_{k = 1} ^ {K} \left\{ n_k \log \alpha_k + \displaystyle\sum_{j = 1} ^ {N} \hat{\gamma}_{jk} \left[ \log \left( \frac{1}{\sqrt{2\pi}} \right) - \log \sigma_k - \frac{1}{2\sigma_k ^ {2}} \left( y_j - \mu_k \right) ^ {2} \right] \right\} \tag{4.6} Q(θ,θ(i))=k=1∑K{nklogαk+j=1∑Nγ^jk[log(2π1)−logσk−2σk21(yj−μk)2]}(4.6)
- 确定EM算法的M步
迭代的M步是求函数 Q ( θ , θ ( i ) ) Q( \theta, \theta ^ {(i)} ) Q(θ,θ(i))对 θ \theta θ的极大值,即求新一轮迭代的模型参数:
θ ( i + 1 ) = arg max θ Q ( θ , θ ( i ) ) \theta ^ {(i + 1)} = \mathop{\arg} \mathop{\max}\limits_{\theta} Q(\theta, \theta ^ {(i)}) θ(i+1)=argθmaxQ(θ,θ(i))
用 μ ^ k \hat{\mu}_k μ^k, σ ^ k 2 \hat{\sigma}_k ^ {2} σ^k2及 α ^ k \hat{\alpha}_k α^k, k = 1 , 2 , ⋯ , K k = 1, 2, \cdots, K k=1,2,⋯,K,表示 θ ( i + 1 ) \theta ^ {(i + 1)} θ(i+1)的各参数。求 μ ^ k \hat{\mu}_k μ^k, σ ^ k 2 \hat{\sigma}_k ^ {2} σ^k2只需将式 ( 4.6 ) (4.6) (4.6)分别对 μ ^ k \hat{\mu}_k μ^k, σ ^ k 2 \hat{\sigma}_k ^ {2} σ^k2求偏导数并令其为 0 0 0,即可得到;求 α ^ k \hat{\alpha}_k α^k是在 ∑ k = 1 K α k = 1 \displaystyle\sum_{k = 1} ^ {K} \alpha_k = 1 k=1∑Kαk=1条件下求偏导数并令其为 0 0 0得到的。j结果如下:
μ ^ k = ∑ j = 1 N γ ^ j k y j ∑ j = 1 N γ ^ j k , k = 1 , 2 , ⋯ , K (4.7) \hat{\mu}_k = \frac{\displaystyle\sum_{j = 1} ^ {N} \hat{\gamma}_{jk} y_j}{\displaystyle\sum_{j = 1} ^ {N} \hat{\gamma}_{jk}}, \ k = 1, 2, \cdots, K \tag{4.7} μ^k=j=1∑Nγ^jkj=1∑Nγ^jkyj, k=1,2,⋯,K(4.7)
σ ^ k 2 = ∑ j = 1 N γ ^ j k ( y j − μ k ) 2 ∑ j = 1 N γ ^ j k , k = 1 , 2 , ⋯ , K (4.8) \hat{\sigma}_k ^ {2} = \frac{\displaystyle\sum_{j = 1} ^ {N} \hat{\gamma}_{jk} \left( y_j - \mu_k \right) ^ {2}}{\displaystyle\sum_{j = 1} ^ {N} \hat{\gamma}_{jk}}, \ k = 1, 2, \cdots, K \tag{4.8} σ^k2=j=1∑Nγ^jkj=1∑Nγ^jk(yj−μk)2, k=1,2,⋯,K(4.8)
α ^ k = n k N = ∑ j = 1 N γ ^ j k N , k = 1 , 2 , ⋯ , K (4.9) \hat{\alpha}_k = \frac{n_k}{N} = \frac{\displaystyle\sum_{j = 1} ^ {N} \hat{\gamma}_{jk}}{N}, \ k = 1, 2, \cdots, K \tag{4.9} α^k=Nnk=Nj=1∑Nγ^jk, k=1,2,⋯,K(4.9)
重复以上计算,直到对数似然函数值不再有明显的变化为止。
如此我们便得到了高斯混合模型参数估计的EM算法:
算法4.1(高斯混合模型参数估计的EM算法)
输入:观测数据 y 1 , y 2 , ⋯ , y N y_1, y_2, \cdots, y_N y1,y2,⋯,yN,高斯混合模型;
输出:高斯混合模型参数。
(1)取参数的初始值开始迭代;
(2)E步:依据当前模型参数,计算分模型 k k k对观测数据 y j y_j yj的响应度
γ ^ j k = α k ϕ ( y j ∣ θ k ) ∑ k = 1 K α k ϕ ( y j ∣ θ k ) , j = 1 , 2 , ⋯ , N ; k = 1 , 2 , ⋯ , K \hat{\gamma}_{jk} = \frac{\alpha_k \phi \left( y_j \mid \theta_k \right)}{\displaystyle\sum_{k = 1} ^ {K} \alpha_k \phi \left( y_j \mid \theta_k \right)},\ j = 1, 2, \cdots, N;\ k = 1, 2, \cdots, K γ^jk=k=1∑Kαkϕ(yj∣θk)αkϕ(yj∣θk), j=1,2,⋯,N; k=1,2,⋯,K
(3)M步:计算新一轮迭代的模型参数
μ ^ k = ∑ j = 1 N γ ^ j k y j ∑ j = 1 N γ ^ j k , k = 1 , 2 , ⋯ , K \hat{\mu}_k = \frac{\displaystyle\sum_{j = 1} ^ {N} \hat{\gamma}_{jk} y_j}{\displaystyle\sum_{j = 1} ^ {N} \hat{\gamma}_{jk}}, \ k = 1, 2, \cdots, K μ^k=j=1∑Nγ^jkj=1∑Nγ^jkyj, k=1,2,⋯,K
σ ^ k 2 = ∑ j = 1 N γ ^ j k ( y j − μ k ) 2 ∑ j = 1 N γ ^ j k , k = 1 , 2 , ⋯ , K \hat{\sigma}_k ^ {2} = \frac{\displaystyle\sum_{j = 1} ^ {N} \hat{\gamma}_{jk} \left( y_j - \mu_k \right) ^ {2}}{\displaystyle\sum_{j = 1} ^ {N} \hat{\gamma}_{jk}}, \ k = 1, 2, \cdots, K σ^k2=j=1∑Nγ^jkj=1∑Nγ^jk(yj−μk)2, k=1,2,⋯,K
α ^ k = n k N = ∑ j = 1 N γ ^ j k N , k = 1 , 2 , ⋯ , K \hat{\alpha}_k = \frac{n_k}{N} = \frac{\displaystyle\sum_{j = 1} ^ {N} \hat{\gamma}_{jk}}{N}, \ k = 1, 2, \cdots, K α^k=Nnk=Nj=1∑Nγ^jk, k=1,2,⋯,K
(4)重复第(2)步和第(3)步,直到收敛。
五、代码实现
该代码实现了高斯混合模型参数估计的EM算法,选取了三个二维高斯分布模型 μ 1 = ( 2 1 ) \bm{\mu}_1 = \begin{pmatrix}2 \\ 1\end{pmatrix} μ1=(21), σ 1 2 = ( 0.5 0 0 0.2 ) \bm{\sigma}_1 ^ 2 = \begin{pmatrix}0.5 & 0 \\ 0 & 0.2\end{pmatrix} σ12=(0.5000.2); μ 2 = ( 5 9 ) \bm{\mu}_2 = \begin{pmatrix}5 \\ 9\end{pmatrix} μ2=(59), σ 2 2 = ( 0.3 0 0 0.3 ) \bm{\sigma}_2 ^ 2 = \begin{pmatrix}0.3 & 0 \\ 0 & 0.3\end{pmatrix} σ22=(0.3000.3); μ 3 = ( 10 6 ) \bm{\mu}_3 = \begin{pmatrix}10 \\ 6\end{pmatrix} μ3=(106), σ 3 2 = ( 0.3 0 0 0.4 ) \bm{\sigma}_3 ^ 2 = \begin{pmatrix}0.3 & 0 \\ 0 & 0.4\end{pmatrix} σ32=(0.3000.4),每个模型都随机生成100个点,然后用EM算法对这300个点进行划分和聚类,其中初值设定为
α = ( 0.3 0.3 0.4 ) , μ = ( 2.0 + r a n d o m 1.0 + r a n d o m 5.0 + r a n d o m 9.0 + r a n d o m 10.0 + r a n d o m 6.0 + r a n d o m ) , σ 2 = ( 0.5 + r a n d o m 0.2 + r a n d o m 0.3 + r a n d o m 0.3 + r a n d o m 0.3 + r a n d o m 0.4 + r a n d o m ) \alpha = \begin{pmatrix} 0.3 \\ 0.3 \\ 0.4 \end{pmatrix}, \ \mu = \begin{pmatrix} 2.0 + random & 1.0 + random \\ 5.0 + random & 9.0 + random \\ 10.0 + random & 6.0 + random \end{pmatrix}, \ \sigma ^ {2} = \begin{pmatrix} 0.5 + random & 0.2 + random \\ 0.3 + random & 0.3 + random \\ 0.3 + random & 0.4 + random \end{pmatrix} α=⎝⎛0.30.30.4⎠⎞, μ=⎝⎛2.0+random5.0+random10.0+random1.0+random9.0+random6.0+random⎠⎞, σ2=⎝⎛0.5+random0.3+random0.3+random0.2+random0.3+random0.4+random⎠⎞
然后进行迭代,最终得到结果如下:
完整代码如下:
import numpy as np
import random
import math
import matplotlib.pyplot as plt
# 创造数据
def create_data(mu1, mu2, sigma1, sigma2, size):
mean = np.array([mu1, mu2])
cov = np.array([[sigma1, 0], [0, sigma2]])
Y = np.random.multivariate_normal(mean, cov, size=size)
return Y
# 高斯分布模型
def Gau(y, mu, sigma):
tmp1 = -0.5 * ((y[0] - mu[0]) ** 2 / sigma[0] + (y[1] - mu[1]) ** 2 / sigma[1])
tmp2 = 1.0 / (2 * np.pi * math.sqrt(sigma[0]) * math.sqrt(sigma[1]))
res = tmp2 * math.exp(tmp1)
return res
# EM算法
def EM(y, K):
N = len(y)
alpha = np.array([0.3, 0.3, 0.4])
mu = np.array([[2.0 + random.random(), 1.0 + random.random()],
[5.0 + random.random(), 9.0 + random.random()],
[10.0 + random.random(), 6.0 + random.random()]])
sigma = np.array([[0.5 + random.random(), 0.2 + random.random()],
[0.3 + random.random(), 0.3 + random.random()],
[0.3 + random.random(), 0.4 + random.random()]])
gama = []
for i in range(N):
gama.append([0.0, 0.0, 0.0])
gama = np.array(gama)
gama1 = gama
t = 1
while True:
# E步
for j in range(N):
tmp = 0.0
for k in range(K):
tmp = tmp + alpha[k] * Gau(y[j], mu[k], sigma[k])
for k in range(K):
gama[j, k] = alpha[k] * Gau(y[j], mu[k], sigma[k]) / tmp
# 判断
if t == 2:
sum = 0.0
for j in range(N):
sum += np.linalg.norm(gama[j] - gama1[j])
if sum < 0.0000001:
break
# M步
tmp_gama = np.array([0.0, 0.0, 0.0])
for k in range(K):
for j in range(N):
tmp_gama[k] = tmp_gama[k] + gama[j, k]
for k in range(K):
tmp = np.array([0.0, 0.0])
for j in range(N):
tmp = tmp + gama[j, k] * y[j]
mu[k] = tmp / tmp_gama[k]
for k in range(K):
tmp = np.array([0.0, 0.0])
for j in range(N):
tmp = tmp + gama[j, k] * (y[j] - mu[k]) ** 2
sigma[k] = tmp / tmp_gama[k]
for k in range(K):
alpha[k] = tmp_gama[k] / N
gama1 = gama
t = 2
return N, alpha, mu, sigma, gama
if __name__ == '__main__':
# K = 3
# n = 100, N = 300
s = np.array([3, 100, 2, 1, 0.5, 0.2, 5, 9, 0.3, 0.3, 10, 6, 0.3, 0.4])
lens = len(s)
for i in range(2, lens):
s[i] = np.float_(s[i])
K = np.int_(s[0])
size = np.int_(s[1])
Y = []
for i in range(2, 14, 4):
tmp_Y = create_data(s[i], s[i + 1], s[i + 2], s[i + 3], size)
Y.extend(tmp_Y)
Y = np.array(Y)
N, alpha, mu, sigma, gama = EM(Y, K)
print("α =", alpha)
print("μ =", mu)
print("σ^2 =", sigma)
for j in range(N):
index = 0
big = gama[j, 0]
if big < gama[j, 1]:
index = 1
big = gama[j, 1]
if big < gama[j, 2]:
index = 2
if index == 0:
c = 'r'
elif index == 1:
c = 'g'
else:
c = 'b'
plt.scatter(Y[j, 0], Y[j, 1], c=c, marker='+')
plt.xlabel("X1")
plt.ylabel("X2")
plt.show()