1. Gaussian Mixture Models
Maximum likelihood estimation (MLE) is a widely used method for parameter estimation. Given height data from some region, and knowing that height follows a Gaussian distribution, MLE can estimate the two parameters $\mu$ and $\sigma$ of that Gaussian.
What if there are multiple groups of data and multiple models? Suppose we now have height data from several provinces across the country, but we do not know which province each sample belongs to; we only know that each province follows a different Gaussian distribution. The resulting model is called a Gaussian mixture model (GMM), whose formula is
$$P(x;\theta)=\sum_{k=1}^K p_k N(x;\mu_k,\Sigma_k) \tag{1}$$
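As a quick illustration (not part of the original derivation), equation (1) can be evaluated directly in numpy for a one-dimensional mixture; the component parameters below are made-up numbers:

```python
import numpy as np

def gmm_pdf(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture: sum_k p_k * N(x; mu_k, sigma_k)."""
    x = np.asarray(x, dtype=float)[..., None]   # shape (..., 1) for broadcasting
    comp = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return comp @ weights                       # weighted sum over components

weights = np.array([0.4, 0.6])                  # p_k, must sum to 1
means   = np.array([165.0, 172.0])              # hypothetical provincial means
stds    = np.array([5.0, 6.0])
print(gmm_pdf([160.0, 170.0, 180.0], weights, means, stds))
```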
With this model, MLE alone can no longer estimate the parameters well, because the model contains a latent variable: the label of each sample (which component it came from). Denoting it by $Z$, the likelihood function takes the form:
$$L(\theta)=\sum_{i=1}^m \log p(x_i;\theta)=\sum_{i=1}^m \log \int p(x_i,z;\theta)\,dz \tag{2}$$
If we tried to maximize (2) by direct differentiation, as in ordinary MLE, we would have to differentiate expressions of the form $\log(y_1+y_2+\cdots)$, and the computation simply explodes.
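As an aside, while differentiating (2) is painful, merely evaluating it for a GMM is easy and numerically stable with the log-sum-exp trick; a minimal sketch (function name is my own):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def gmm_loglik(x, weights, means, stds):
    """L(theta) = sum_i log sum_k p_k N(x_i; mu_k, sigma_k), computed in log space."""
    log_comp = np.log(weights) + norm.logpdf(x[:, None], means, stds)  # (N, K)
    return logsumexp(log_comp, axis=1).sum()
```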
2. Jensen's Inequality
To solve this we need the EM algorithm. Before introducing EM, let us first look at an inequality, Jensen's inequality, defined as follows:
If $X$ is a random variable and $f(X)$ is a convex function (second derivative greater than or equal to 0), then:
$$E[f(X)] \ge f\big(E[X]\big)$$
If $f(X)$ is concave, the inequality reverses:
$$E[f(X)] \le f\big(E[X]\big)$$
Equality holds if and only if $X$ is a constant.
A figure makes Jensen's inequality intuitive. For two points $x=a$ and $x=b$, $E[f(x)]$ can be viewed as $\frac{1}{2}(f(a)+f(b))$, while $f(E[x])$ can be viewed as $f\big(\frac{1}{2}(a+b)\big)$, the function value at the midpoint of $a$ and $b$. Since the curve is convex, it is easy to see that $E[f(x)] \ge f\big(E[x]\big)$.
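A small numerical check of Jensen's inequality (my own example, not from the article): sample a random variable and compare $E[f(X)]$ with $f(E[X])$ for a convex and a concave $f$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)

f = np.square                       # convex: second derivative >= 0
print(f(x).mean() >= f(x.mean()))   # True: E[f(X)] >= f(E[X])

y = rng.uniform(0.5, 2.0, size=100_000)
g = np.log                          # concave on (0, inf): inequality reverses
print(g(y).mean() <= g(y.mean()))   # True: E[g(Y)] <= g(E[Y])
```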
3. The EM Algorithm and Its Derivation
Using Jensen's inequality and the fact that $\log$ is concave, equation (2) can be rewritten as:
$$\begin{aligned} \sum_{i=1}^m\log \int p(x_i,z;\theta)\,dz &=\sum_{i=1}^m\log \int Q(z)\frac{p(x_i,z;\theta)}{Q(z)}\,dz\\ &\ge \sum_{i=1}^m\int Q(z)\log \frac{p(x_i,z;\theta)}{Q(z)}\,dz \end{aligned} \tag{3}$$
Since $Q(z)$ is a probability distribution, it satisfies
$$\sum_z Q(z)=1 \tag{4}$$
By Jensen's inequality, equality holds when
$$\frac{p(x_i,z;\theta)}{Q(z)}=C \tag{5}$$
Combining (4) and (5),
$$\sum_{z}Q(z)=\sum_{z}\frac{p(x_i,z;\theta)}{C}=1 \tag{6}$$
$$Q(z)=\frac{p(x_i,z;\theta)}{C}=\frac{p(x_i,z;\theta)}{\sum_{z}p(x_i,z;\theta)}=\frac{p(x_i,z;\theta)}{p(x_i)}=p(z|x_i;\theta) \tag{7}$$
Substituting (7) back into (3) gives
$$\sum_{i=1}^m\log\int p(x_i,z;\theta)\,dz\ge \sum_{i=1}^m\int p(z|x_i;\theta)\log \frac{p(x_i,z;\theta)}{p(z|x_i;\theta)}\,dz$$
At each iteration, holding the posterior fixed at the current parameters $\theta_i$, we set
$$\hat \theta=\arg \max_\theta\int p(z|x_i;\theta_i)\log \frac{p(x_i,z;\theta)}{p(z|x_i;\theta_i)}\,dz=\arg \max_\theta\int p(z|x_i;\theta_i)\log p(x_i,z;\theta)\,dz$$
so the iterations keep increasing the likelihood. EM is therefore, at its core, an iterative algorithm that keeps raising a lower bound, and with it the likelihood itself.
The EM algorithm thus proceeds as follows (a generic sketch of the loop follows the list):
- Initialize the parameters $\theta_0$ and start iterating.
- E-Step: compute $Q(\theta,\theta_i)=E_{z|x,\theta_i}[\log P(x,z|\theta)]=\int\log P(x,z|\theta)\,P(z|x,\theta_i)\,dz$.
- M-Step: using the computed $Q(\theta,\theta_i)$, find the $\theta$ that maximizes it, yielding the new parameters $\theta_{i+1}$.
- Repeat the E-step and M-step until convergence.
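The loop itself is model-independent; here is a minimal sketch, where `e_step` and `m_step` are placeholders to be supplied for a concrete model (the names and the stopping rule are my own choices, not prescribed by the derivation above):

```python
import numpy as np

def em(x, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop. e_step returns the posterior q(z) = P(z|x, theta_i);
    m_step maximizes Q(theta, theta_i) and returns (new theta, log-likelihood)."""
    theta, prev_ll = theta0, -np.inf
    for _ in range(max_iter):
        q = e_step(x, theta)             # E-step: posterior under current theta
        theta, ll = m_step(x, q)         # M-step: maximize the lower bound
        if abs(ll - prev_ll) < tol:      # stop once the likelihood stalls
            break
        prev_ll = ll
    return theta
```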
4. Feasibility of the EM Algorithm
Why can EM approximately carry out maximum likelihood estimation on the observed data? We need to show that the log-likelihood keeps increasing over the iterations. Suppose we are at iteration $i$ with parameters $\theta_i$, and we want the new estimate $\theta$ to increase $L(\theta)$, i.e. $L(\theta) > L(\theta_i)$. Then:
$$\begin{aligned} L(\theta)-L(\theta_{i}) &=\log \Big(\sum_Z p(x|z,\theta)p(z|\theta)\Big)-\log p(x|\theta_{i}) \\ &=\log \sum_Z p(z|x,\theta_{i})\frac{p(x|z,\theta)p(z|\theta)}{p(z|x,\theta_{i})}-\log p(x|\theta_{i})\\ &\ge \sum_Z p(z|x,\theta_{i})\log\frac{p(x|z,\theta)p(z|\theta)}{p(z|x,\theta_{i})}-\log p(x|\theta_{i})\\ &= \sum_Z p(z|x,\theta_{i})\log\frac{p(x|z,\theta)p(z|\theta)}{p(z|x,\theta_{i})\,p(x|\theta_{i})} \end{aligned} \tag{8}$$
Define
$$B(\theta,\theta_{i})=L(\theta_{i})+\sum_Z p(z|x,\theta_{i})\log\frac{p(x|z,\theta)p(z|\theta)}{p(z|x,\theta_{i})\,p(x|\theta_{i})} \tag{9}$$
Then $L(\theta)\ge B(\theta,\theta_{i})$, so $B(\theta,\theta_{i})$ can be regarded as a lower bound of $L(\theta)$; moreover $L(\theta_{i})=B(\theta_{i},\theta_{i})$. To make $L(\theta)$ as large as possible, we solve:
$$\begin{aligned} \arg \max_\theta B(\theta,\theta_{i}) &=\arg \max_\theta \Big[L(\theta_{i})+ \sum_Z p(z|x,\theta_{i})\log\frac{p(x|z,\theta)p(z|\theta)}{p(z|x,\theta_{i})\,p(x|\theta_{i})}\Big] \\ &= \arg \max_\theta \sum_Z p(z|x,\theta_{i})\log \big(p(x|z,\theta)p(z|\theta)\big) \\ &= \arg \max_\theta \sum_Z p(z|x,\theta_{i})\log p(x,z|\theta)\\ &= \arg \max_\theta Q(\theta,\theta_i) \end{aligned} \tag{10}$$
This is exactly one EM iteration, which shows that EM approaches the maximum of the log-likelihood by repeatedly maximizing a lower bound.
5. Convergence of the EM Algorithm
Convergence of EM can be shown with a monotone-bounded argument. First, $P(x|\theta)$ is the likelihood of the observed data; we want to prove monotonicity, i.e.
$$P(x|\theta_{i+1}) \ge P(x| \theta_i) \tag{11}$$
The proof is as follows:
$$P(x|\theta)=\frac{P(x,z|\theta)}{P(z|x,\theta)}\iff \log P(x|\theta) = \log P(x,z|\theta) - \log P(z|x,\theta) \tag{12}$$
Define
$$H(\theta,\theta_i)=\sum_z P(z|x,\theta_i)\log P(z|x,\theta) \tag{13}$$
By the definition of the $Q$ function,
$$Q(\theta,\theta_i)=\sum_z P(z|x,\theta_i)\log P(x,z|\theta)$$
Multiplying both sides of (12) by $P(z|x,\theta_i)$ and summing over $z$ (the left side does not depend on $z$), the log-likelihood can be written as
$$\log P(x|\theta) =Q(\theta,\theta_i)-H(\theta,\theta_i) \tag{14}$$
Evaluating (14) at $\theta_{i+1}$ and at $\theta_i$ and subtracting, we get
$$\log P(x|\theta_{i+1})- \log P(x|\theta_i) =[Q(\theta_{i+1},\theta_i)-Q(\theta_i,\theta_i)]-[H(\theta_{i+1},\theta_i)-H(\theta_i,\theta_i)] \tag{15}$$
Since $\theta_{i+1}$ maximizes $Q(\theta,\theta_i)$, we have
$$Q(\theta_{i+1},\theta_i)-Q(\theta_i,\theta_i) \ge 0 \tag{16}$$
And by (13) together with Jensen's inequality,
$$\begin{aligned} H(\theta_{i+1},\theta_i)-H(\theta_i,\theta_i) &=\sum_z\left(\log \frac{P(z|x,\theta_{i+1})}{P(z|x,\theta_i)}\right)P(z|x,\theta_i) \\ &\le \log \left(\sum_z \frac{P(z|x,\theta_{i+1})}{P(z|x,\theta_i)}P(z|x,\theta_i)\right) \\ &=\log \sum_z P(z|x,\theta_{i+1})=\log 1=0 \end{aligned} \tag{17}$$
Hence
$$\log P(x|\theta_{i+1})- \log P(x|\theta_i) \ge 0 \tag{18}$$
Moreover, since $P(x|\theta_i) \le 1$, the sequence is bounded, so $L(\theta_i)=\log P(x|\theta_i)$ must converge.
Because it proceeds by iteration, EM is guaranteed to converge to a stationary point, but not necessarily to the global maximum, so it is a locally optimal algorithm; in this respect it is like other iterative methods such as gradient descent. Of course, if the optimization objective $B(\theta,\theta_i)$ is concave, EM is guaranteed to reach the global maximum. The choice of initial values therefore matters a great deal; a common practice is to try several initializations and keep the best result, as sketched below.
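For example, scikit-learn's `GaussianMixture` exposes exactly this multi-start strategy through its `n_init` parameter, which runs EM from several random initializations and keeps the solution with the best likelihood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 2, 700)]).reshape(-1, 1)

# 10 random restarts; the run with the best likelihood bound is kept.
gm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(x)
print(gm.weights_, gm.means_.ravel())
```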
6. Another Derivation of EM
$$\begin{aligned} \log P(x|\theta) &=\log P(x,z|\theta)-\log P(z|x,\theta) \\ &=\log \frac{P(x,z|\theta)}{q(z)}-\log \frac{P(z|x,\theta)}{q(z)} \end{aligned}$$
左右两边同时对z积分,此时有
$$\text{left}=\int \log P(x|\theta)\, q(z)\,dz=\log P(x|\theta)\int q(z)\,dz= \log P(x|\theta)$$
while the right side gives
$$\text{right}=\int_z q(z)\log \frac{P(x,z|\theta)}{q(z)}\,dz-\int_z q(z)\log \frac{P(z|x,\theta)}{q(z)}\,dz$$
By the definition of relative entropy (KL divergence), for two distributions $p(x)$ and $q(x)$, the relative entropy of $p$ with respect to $q$ is
$$D(p\|q)=\sum_{i=1}^n p(x_i)\log \frac{p(x_i)}{q(x_i)}$$
Observe that the second term on the right is exactly the relative entropy of $q(z)$ with respect to $P(z|x,\theta)$, so
$$\log P(x|\theta) =\int_z q(z)\log \frac{P(x,z|\theta)}{q(z)}\,dz+D\big(q(z)\,\|\,p(z|x,\theta)\big)$$
Relative entropy has two properties: it is not symmetric, i.e. $D(p\|q)\ne D(q\|p)$, and it is non-negative, $D(p\|q) \ge 0$, with equality exactly when $p$ and $q$ are the same distribution. In other words, the first term is a lower bound of $\log p(x|\theta)$, called the ELBO (evidence lower bound). We want to raise this lower bound so as to raise $\log p(x|\theta)$. Choosing $q(z)=P(z|x,\theta_i)$ makes the KL term vanish at the current parameters, and then
$$\begin{aligned} \hat \theta &=\arg \max_\theta \int_z q(z)\log \frac{P(x,z|\theta)}{q(z)}\,dz \\ &=\arg \max_\theta \int_z P(z|x,\theta_i)\log \frac{P(x,z|\theta)}{P(z|x,\theta_i)}\,dz \\ &=\arg \max_\theta \int_z P(z|x,\theta_i)\log P(x,z|\theta)\,dz \end{aligned}$$
This recovers the iteration formula for $\theta$.
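The decomposition $\log P(x|\theta)=\mathrm{ELBO}+\mathrm{KL}$ is easy to verify numerically on a tiny discrete model (a toy of my own construction, with a binary latent $z$ and a binary observation $x$):

```python
import numpy as np

p_z = np.array([0.3, 0.7])             # prior p(z)
p_x1_given_z = np.array([0.9, 0.2])    # p(x=1 | z)
p_xz = p_z * p_x1_given_z              # joint p(x=1, z)
log_px = np.log(p_xz.sum())            # evidence log p(x=1)

q = np.array([0.5, 0.5])               # an arbitrary q(z)
elbo = (q * np.log(p_xz / q)).sum()    # E_q[log p(x,z)/q(z)]
post = p_xz / p_xz.sum()               # posterior p(z | x=1)
kl = (q * np.log(q / post)).sum()      # D(q || p(z|x))

print(log_px, elbo + kl)               # the two numbers agree
```

Setting `q = post` makes the KL term vanish and the ELBO touch $\log p(x|\theta)$, which is exactly the E-step choice above.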
7. Solving GMM with the EM Algorithm
We now use the EM algorithm to fit the GMM. The model is:
$$p(x|\theta)=\sum_{k=1}^K p_k N(x|\mu_k,\Sigma_k) \tag{19}$$
The joint probability is
$$p(x,z)=p(z)\,p(x|z)=p_z\, N(x|\mu_z,\Sigma_z)$$
and the posterior is
$$p(z|x)=\frac{p(x,z)}{p(x)}=\frac{p_z\, N(x|\mu_z,\Sigma_z)}{\sum_{k=1}^K p_k\, N(x|\mu_k,\Sigma_k)}$$
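This posterior is the per-sample "responsibility" computed in the E-step. A sketch using `scipy.stats.multivariate_normal` (the function and variable names are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """gamma[i, k] = P(z_i = c_k | x_i, theta_t) for data X of shape (N, d)."""
    N, K = X.shape[0], len(weights)
    gamma = np.empty((N, K))
    for k in range(K):
        gamma[:, k] = weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize across components
    return gamma
```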
In the EM algorithm we have
$$\begin{aligned} Q(\theta,\theta_t) &=\int_z\log P(x,z|\theta)\,P(z|x,\theta_t)\,dz\\ &=\sum_z\Big[\log \prod_{i=1}^N P(x_i,z_i|\theta)\Big]\prod_{i=1}^N P(z_i|x_i,\theta_t)\\ &=\sum_{z_1,z_2,\dots,z_N}\sum_{i=1}^N\log P(x_i,z_i|\theta)\prod_{i=1}^N P(z_i|x_i,\theta_t)\\ &=\sum_{z_1,z_2,\dots,z_N}\big(\log P(x_1,z_1|\theta)+\log P(x_2,z_2|\theta)+\cdots+\log P(x_N,z_N|\theta)\big)\prod_{i=1}^N P(z_i|x_i,\theta_t) \end{aligned}$$
Consider one of these terms:
$$\begin{aligned} &\sum_{z_1,z_2,\dots,z_N}\log P(x_1,z_1|\theta)\prod_{i=1}^N P(z_i|x_i,\theta_t)\\ &=\sum_{z_1,z_2,\dots,z_N}\log P(x_1,z_1|\theta)\,P(z_1|x_1,\theta_t)\prod_{i=2}^N P(z_i|x_i,\theta_t) \\ &=\sum_{z_1}\log P(x_1,z_1|\theta)\,P(z_1|x_1,\theta_t)\sum_{z_2,\dots,z_N}\prod_{i=2}^N P(z_i|x_i,\theta_t) \\ &=\sum_{z_1}\log P(x_1,z_1|\theta)\,P(z_1|x_1,\theta_t)\sum_{z_2}P(z_2|x_2,\theta_t)\sum_{z_3}P(z_3|x_3,\theta_t)\cdots\sum_{z_N}P(z_N|x_N,\theta_t) \\ &=\sum_{z_1}\log P(x_1,z_1|\theta)\,P(z_1|x_1,\theta_t) \end{aligned}$$
Therefore
$$\begin{aligned} Q(\theta,\theta_t) &=\sum_{z_1,z_2,\dots,z_N}\big(\log P(x_1,z_1|\theta)+\cdots+\log P(x_N,z_N|\theta)\big)\prod_{i=1}^N P(z_i|x_i,\theta_t)\\ &=\sum_{z_1}\log P(x_1,z_1|\theta)\,P(z_1|x_1,\theta_t)+\cdots+\sum_{z_N}\log P(x_N,z_N|\theta)\,P(z_N|x_N,\theta_t) \\ &=\sum_{i=1}^N\sum_{z_i}\log P(x_i,z_i|\theta)\,P(z_i|x_i,\theta_t) \\ &=\sum_{i=1}^N\sum_{z_i}\log\big(p_{z_i}\, N(x_i|\mu_{z_i},\Sigma_{z_i})\big)\,P(z_i|x_i,\theta_t)\\ &=\sum_{k=1}^K\sum_{i=1}^N\log\big(p_k\, N(x_i|\mu_{k},\Sigma_{k})\big)\,P(z_i=c_k|x_i,\theta_t)\\ &=\sum_{k=1}^K\sum_{i=1}^N\big(\log p_k+\log N(x_i|\mu_{k},\Sigma_{k})\big)\,P(z_i=c_k|x_i,\theta_t) \end{aligned}$$
Next we solve for $\theta_{t+1}=(p_k,\mu_k,\Sigma_k)$:
$$p_k^{t+1}=\arg \max_{p_k}\sum_{k=1}^K\sum_{i=1}^N\log p_k\cdot P(z_i=c_k|x_i,\theta_t),\quad \text{s.t.}\ \sum_{k=1}^K p_k=1$$
$$\mu_k^{t+1}=\arg \max_{\mu_k}\sum_{k=1}^K\sum_{i=1}^N\log N(x_i|\mu_{k},\Sigma_{k})\cdot P(z_i=c_k|x_i,\theta_t)$$
$$\Sigma_k^{t+1}=\arg \max_{\Sigma_k}\sum_{k=1}^K\sum_{i=1}^N\log N(x_i|\mu_{k},\Sigma_{k})\cdot P(z_i=c_k|x_i,\theta_t)$$
Expanding these and setting the derivatives to zero gives the optimal solutions; for $p_k^{t+1}$, the constraint is handled with a Lagrange multiplier. Repeat the E-step and M-step above until the log-likelihood no longer changes appreciably.
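Carrying out those maximizations yields the standard closed-form GMM updates: each parameter becomes a responsibility-weighted average. A compact end-to-end sketch under those standard results (the initialization scheme and the small regularization term are my own choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm(X, K, n_iter=200, seed=0):
    """EM for a Gaussian mixture on data X of shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    weights = np.full(K, 1.0 / K)
    means = X[rng.choice(N, size=K, replace=False)]      # init means from data
    covs = np.stack([np.cov(X.T).reshape(d, d) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iter):
        # E-step: gamma[i, k] = P(z_i = c_k | x_i, theta_t)
        gamma = np.stack([w * multivariate_normal.pdf(X, m, c)
                          for w, m, c in zip(weights, means, covs)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizers of Q(theta, theta_t)
        Nk = gamma.sum(axis=0)                           # effective counts
        weights = Nk / N                                 # p_k (via Lagrange multiplier)
        means = (gamma.T @ X) / Nk[:, None]              # responsibility-weighted means
        for k in range(K):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return weights, means, covs
```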