Expectation Maximization Algorithm
Fitting data with a single Gaussian model
When the data are distributed as shown in the figure below, a single Gaussian model suffices to fit them:
![](https://i-blog.csdnimg.cn/blog_migrate/105c52dbdfe389120f08855d4cd78b96.png)
The log-likelihood is:
$$\begin{aligned}\mathcal{L}(\theta|\bar{X})&=\log[P(\bar{X}|\theta)]\\&=\sum_{i=1}^N\log P(x_i|\theta)\\&=\sum_{i=1}^N\log\mathcal{N}(x_i|\mu,\sigma)\end{aligned}$$
We solve:
$$\hat{\theta}=\arg\underset{\theta}{\max}\,\mathcal{L}(\theta|\bar{X})$$
First, $\mu_{MLE}$ is obtained by setting the partial derivative with respect to $\mu$ to zero:
$$\frac{\partial\mathcal{L}(\mu,\sigma|\bar{X})}{\partial\mu}=0\;\Rightarrow\;\mu_{MLE}$$
Then $\sigma^2_{MLE}$ is obtained by substituting $\mu_{MLE}$ and setting the derivative with respect to $\sigma$ to zero:
$$\frac{\partial\mathcal{L}(\mu_{MLE},\sigma|\bar{X})}{\partial\sigma}=0\;\Rightarrow\;\sigma^2_{MLE}$$
Solving gives:
$$\begin{aligned}\mu_{MLE}&=\frac{1}{N}\sum_{i=1}^N x_i\\\sigma^2_{MLE}&=\frac{1}{N}\sum_{i=1}^N(x_i-\mu_{MLE})^2\end{aligned}$$
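As a quick sanity check, the closed-form MLE above can be computed directly in a few lines of NumPy. This is a minimal sketch; the data array `x` is illustrative (drawn from a Gaussian with known parameters so the estimates can be compared against the truth):

```python
import numpy as np

# illustrative 1-D data from a single Gaussian with mean 2.0 and std 1.5
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

N = x.size
mu_mle = x.sum() / N                         # (1/N) * sum_i x_i
sigma2_mle = ((x - mu_mle) ** 2).sum() / N   # (1/N) * sum_i (x_i - mu_MLE)^2

print(mu_mle, sigma2_mle)  # close to the true mean 2.0 and variance 1.5^2 = 2.25
```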
In practice, however, the data may be distributed like this:
![](https://img-blog.csdnimg.cn/8af7d2fc29b644238d4be1ace58e53bf.png#pic_center=70%x#pic_center)
A single Gaussian can no longer fit such data; a mixture of two or more Gaussians is needed.
Gaussian mixture model
The Gaussian mixture model is defined as:
$$p(X|\theta)=\sum_{l=1}^k\alpha_l\mathcal{N}(X|\mu_l,\sigma_l)\qquad \text{s.t.}\quad\sum_{l=1}^k\alpha_l=1$$
where $\alpha_l$ is the normalized weight of the $l$-th Gaussian component and $\mathcal{N}(X|\mu_l,\sigma_l)$ is the Gaussian density. The full parameter set is
$$\theta=\{\theta_1=[\alpha_1,\mu_1,\sigma_1]\,\dots\,\theta_k=[\alpha_k,\mu_k,\sigma_k]\}$$
Next, we attempt a maximum-likelihood estimate of the parameters $\theta$:
$$\begin{aligned}\theta_{MLE}&=\arg\underset{\theta}{\max}\,\mathcal{L}(\theta|X)\\&=\arg\underset{\theta}{\max}\sum_{i=1}^n\log\sum_{l=1}^k\alpha_l\mathcal{N}(x_i|\mu_l,\sigma_l)\end{aligned}$$
Taking partial derivatives with respect to all of $\mu_1,\dots,\mu_k$ and $\sigma_1,\dots,\sigma_k$ and solving directly is intractable, because the logarithm of a sum does not decompose; instead, the EM algorithm solves the problem iteratively.
EM algorithm update formula
The EM parameter update is defined as:
$$\theta^{(g+1)}=\arg\underset{\theta}{\max}\int_Z\log[p(X,Z|\theta)]\cdot p(Z|X,\theta^{(g)})\,\mathrm{d}Z\tag{1}$$
where $Z$ is a latent (auxiliary) variable.
Requirements for introducing a latent variable
1. It should simplify the solution;
2. It must not change the marginal distribution of the data (we prove this second point here).
Preserving the marginal distribution means that Eq. (2) holds after introducing the latent variable $Z$:
$$p(x_i)=\int_{z_i}p_\theta(x_i|z_i)\cdot p_\theta(z_i)\,\mathrm{d}z_i\qquad z_i\in\{1,2,\dots,k\}\tag{2}$$
How should the latent variable $Z$ in a Gaussian mixture be understood? $Z$ simply indicates which Gaussian component a data point belongs to, as shown below:
![](https://img-blog.csdnimg.cn/ed2d43ee740643a2bc4535012e1c2933.png#pic_center=50%x.png#pic_center)
Conditioned on $Z$, the mixture effectively reduces to a single Gaussian, which greatly simplifies the solution.
Before any data are observed, does $z_i$ belong to component $\theta_1$ or component $\theta_2$? That is:
$$p(z_i)=\,?$$
In fact, $p(z_i)$ is given by the mixture weights of the Gaussian mixture model:
$$p(z_i)=\alpha_{z_i}$$
Meanwhile,
$$p_\theta(x_i|z_i)=\mathcal{N}(x_i|\mu_{z_i},\sigma_{z_i})$$
Substituting $p(z_i)$ and $p_\theta(x_i|z_i)$ into Eq. (2) gives:
$$\begin{aligned}p(x_i)&=\int_{z_i}p_\theta(x_i|z_i)\cdot p_\theta(z_i)\,\mathrm{d}z_i\\&=\sum_{z_i=1}^k\alpha_{z_i}\mathcal{N}(x_i|\mu_{z_i},\sigma_{z_i})\end{aligned}$$
which is exactly the mixture density, so the marginal distribution of $x_i$ is unchanged. Condition 2 is proved.
Convergence proof (local convergence)
If the EM algorithm converges, the likelihood must be non-decreasing across iterations: each update's likelihood is at least the previous one. We therefore need to show:
$$\log P(X|\theta^{(g+1)})=\mathcal{L}(\theta^{(g+1)})\geqslant\mathcal{L}(\theta^{(g)})=\log P(X|\theta^{(g)})\tag{3}$$
Proof:
By the definition of conditional probability,
$$P(X)=\frac{P(X,Z)}{P(Z|X)}$$
so
$$\log P(X|\theta)=\log P(X,Z|\theta)-\log P(Z|X,\theta)\tag{4}$$
Taking the expectation of both sides with respect to $p(Z|X,\theta^{(g)})$:
$$\underset{p(Z|X,\theta^{(g)})}{E}[\log P(X|\theta)]=\underset{p(Z|X,\theta^{(g)})}{E}[\log P(X,Z|\theta)-\log P(Z|X,\theta)]$$
Left-hand side (the integrand does not depend on $Z$):
$$\begin{aligned}\underset{p(Z|X,\theta^{(g)})}{E}[\log P(X|\theta)]&=\int_Z\log[P(X|\theta)]\cdot P(Z|X,\theta^{(g)})\,\mathrm{d}Z\\&=\log P(X|\theta)\end{aligned}$$
Right-hand side:
$$\underset{p(Z|X,\theta^{(g)})}{E}[\log P(X,Z|\theta)-\log P(Z|X,\theta)]=\int_Z\log[P(X,Z|\theta)]\cdot P(Z|X,\theta^{(g)})\,\mathrm{d}Z-\int_Z\log[P(Z|X,\theta)]\cdot P(Z|X,\theta^{(g)})\,\mathrm{d}Z$$
Define:
$$\begin{aligned}Q(\theta,\theta^{(g)})&=\int_Z\log[P(X,Z|\theta)]\cdot P(Z|X,\theta^{(g)})\,\mathrm{d}Z\\H(\theta,\theta^{(g)})&=\int_Z\log[P(Z|X,\theta)]\cdot P(Z|X,\theta^{(g)})\,\mathrm{d}Z\end{aligned}$$
Then Eq. (4) can be rewritten as
$$\log P(X|\theta)=Q(\theta,\theta^{(g)})-H(\theta,\theta^{(g)})$$
where $\theta^{(g)}$ is a constant and $\theta$ is the variable. Note that $Q(\theta,\theta^{(g)})$ is exactly the objective maximized in the EM update formula (1).
Evaluating at $\theta=\theta^{(g+1)}$ and at $\theta=\theta^{(g)}$ and subtracting:
$$\log P(X|\theta^{(g+1)})-\log P(X|\theta^{(g)})=[Q(\theta^{(g+1)},\theta^{(g)})-Q(\theta^{(g)},\theta^{(g)})]-[H(\theta^{(g+1)},\theta^{(g)})-H(\theta^{(g)},\theta^{(g)})]\tag{5}$$
Because the EM update chooses $\theta^{(g+1)}$ to maximize $Q(\theta,\theta^{(g)})$ over $\theta$, iterating from $\theta^{(g)}$ to $\theta^{(g+1)}$ guarantees:
$$Q(\theta^{(g+1)},\theta^{(g)})-Q(\theta^{(g)},\theta^{(g)})\ge 0$$
so the first bracket on the right of Eq. (5) is non-negative.
Since $f(x)=\log x$ is concave, Jensen's inequality gives:
$$\begin{aligned}H(\theta^{(g+1)},\theta^{(g)})-H(\theta^{(g)},\theta^{(g)})&=\int_Z\Big(\log[P(Z|X,\theta^{(g+1)})]-\log[P(Z|X,\theta^{(g)})]\Big)\cdot P(Z|X,\theta^{(g)})\,\mathrm{d}Z\\&=\int_Z\log\left[\frac{P(Z|X,\theta^{(g+1)})}{P(Z|X,\theta^{(g)})}\right]\cdot P(Z|X,\theta^{(g)})\,\mathrm{d}Z\\&\le\log\left[\int_Z\frac{P(Z|X,\theta^{(g+1)})}{P(Z|X,\theta^{(g)})}\cdot P(Z|X,\theta^{(g)})\,\mathrm{d}Z\right]\\&=\log\left[\int_Z P(Z|X,\theta^{(g+1)})\,\mathrm{d}Z\right]\\&=\log 1\\&=0\end{aligned}\tag{6}$$
Combining Eqs. (5) and (6) yields Eq. (7):
$$\log P(X|\theta^{(g+1)})-\log P(X|\theta^{(g)})\ge 0\tag{7}$$
which is exactly Eq. (3). This proves that EM monotonically increases the likelihood, i.e. the (local) convergence of the EM algorithm.
Applying EM to the Gaussian mixture model
The Gaussian mixture model:
$$P(X|\theta)=\sum_{l=1}^k\alpha_l\mathcal{N}(X|\mu_l,\sigma_l)\qquad \text{s.t.}\quad\sum_{l=1}^k\alpha_l=1$$
For the dataset $X=\{x_1,x_2,\dots,x_n\}$, introduce latent variables $Z=\{z_1,z_2,\dots,z_n\}$, where each $z_i$ indicates which Gaussian component the data point $x_i$ belongs to.
The EM update becomes:
$$\begin{aligned}\theta^{(g+1)}&=\arg\underset{\theta}{\max}\int_Z\log[p(X,Z|\theta)]\cdot p(Z|X,\theta^{(g)})\,\mathrm{d}Z\\&=\arg\underset{\theta}{\max}\,Q(\theta,\theta^{(g)})\end{aligned}$$
E-step: compute $Q(\theta,\theta^{(g)})$
This requires $p(X,Z|\theta)$ and $p(Z|X,\theta)$.
Computing $p(X,Z|\theta)$:
$$\begin{aligned}p(X,Z|\theta)&=\prod_{i=1}^n p(x_i,z_i|\theta)\\&=\prod_{i=1}^n p(x_i|z_i,\theta)\,p(z_i|\theta)\end{aligned}$$
Here $p(z_i|\theta)$ is the probability of the $z_i$-th Gaussian before seeing any data $X$, i.e. the mixture weight $\alpha_{z_i}$, and $p(x_i|z_i,\theta)$ is the density of $x_i$ under the $z_i$-th Gaussian, i.e. $\mathcal{N}(x_i|\mu_{z_i},\sigma_{z_i})$. Therefore:
$$p(X,Z|\theta)=\prod_{i=1}^n\alpha_{z_i}\mathcal{N}(x_i|\mu_{z_i},\sigma_{z_i})\tag{8}$$
Computing $p(Z|X,\theta)$:
$$p(Z|X,\theta)=\prod_{i=1}^n p(z_i|x_i,\theta)$$
An intuitive picture of $p(z_i|x_i,\theta)$ is shown below:
![](https://img-blog.csdnimg.cn/152aebfe34da48db8cff55625b56a763.png#pic_center=50%x#pic_center=50%x#pic_center)
For the red data point, with $a$ and $b$ denoting the two component densities at that point (as in the figure):
$$p(z_i=\theta_1|x_i,\theta)=\frac{a}{a+b}\qquad p(z_i=\theta_2|x_i,\theta)=\frac{b}{a+b}$$
In general, by Bayes' rule:
$$\begin{aligned}p(z_i|x_i,\theta)&=\frac{p(x_i,z_i|\theta)}{p(x_i|\theta)}\\&=\frac{\alpha_{z_i}\mathcal{N}(x_i|\mu_{z_i},\sigma_{z_i})}{\sum_{l=1}^k\alpha_l\mathcal{N}(x_i|\mu_l,\sigma_l)}\end{aligned}$$
Therefore:
$$\begin{aligned}p(Z|X,\theta)&=\prod_{i=1}^n p(z_i|x_i,\theta)\\&=\prod_{i=1}^n\frac{p(x_i,z_i|\theta)}{p(x_i|\theta)}\\&=\prod_{i=1}^n\frac{\alpha_{z_i}\mathcal{N}(x_i|\mu_{z_i},\sigma_{z_i})}{\sum_{l=1}^k\alpha_l\mathcal{N}(x_i|\mu_l,\sigma_l)}\end{aligned}\tag{9}$$
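The posterior $p(z_i=l\mid x_i,\theta)$ (the "responsibility" of component $l$ for point $x_i$) is easy to compute numerically. A minimal NumPy sketch for the 1-D case; the parameters and the helper names `normal_pdf`/`responsibilities` are illustrative, not from the original text:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """1-D Gaussian density N(x | mu, sigma); sigma is the standard deviation."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def responsibilities(x, alpha, mu, sigma):
    """gamma[i, l] = p(z_i = l | x_i, theta): rows index points, columns components."""
    # dens[i, l] = alpha_l * N(x_i | mu_l, sigma_l)
    dens = alpha[None, :] * normal_pdf(x[:, None], mu[None, :], sigma[None, :])
    return dens / dens.sum(axis=1, keepdims=True)  # normalize over components

# illustrative parameters for a 2-component mixture
alpha = np.array([0.4, 0.6])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 1.0])
x = np.array([-2.0, 0.5, 3.0])

gamma = responsibilities(x, alpha, mu, sigma)
print(gamma)  # each row sums to 1; x=-2 is dominated by component 0, x=3 by component 1
```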
Substituting Eqs. (8) and (9) into the definition of $Q$ gives:
$$\begin{aligned}Q(\theta,\theta^{(g)})&=\int_Z\ln[p(X,Z|\theta)]\cdot p(Z|X,\theta^{(g)})\,\mathrm{d}Z\\&=\int_{z_1}\!\!\dots\!\int_{z_n}\bigg(\sum_{i=1}^n\ln p(z_i,x_i|\theta)\bigg)\prod_{i=1}^n p(z_i|x_i,\theta^{(g)})\,\mathrm{d}z_1\dots\mathrm{d}z_n\end{aligned}$$
Identity used to simplify $Q(\theta,\theta^{(g)})$
The following identity holds:
$$\int_{y_1}\!\!\dots\!\int_{y_n}\bigg(\sum_{i=1}^n f_i(y_i)\bigg)P(Y)\,\mathrm{d}Y=\sum_{i=1}^n\bigg(\int_{y_i}f_i(y_i)P_i(y_i)\,\mathrm{d}y_i\bigg)$$
where $P(Y)$ is the joint distribution $P(y_1,\dots,y_n)$ of $y_1,\dots,y_n$ and $P_i(y_i)$ is the marginal distribution of $y_i$.
Derivation of this identity: let
$$F(Y)=f_1(y_1)+\dots+f_n(y_n)=\sum_{i=1}^n f_i(y_i)$$
Then:
$$\int_Y F(Y)\,P(Y)\,\mathrm{d}Y=\int_{y_1}\!\!\dots\!\int_{y_n}\bigg(\sum_{i=1}^n f_i(y_i)\bigg)P(Y)\,\mathrm{d}y_1\dots\mathrm{d}y_n$$
Expanding $\sum_{i=1}^n f_i(y_i)$ in the integrand gives:
$$\begin{aligned}\int_{y_1}\!\!\dots\!\int_{y_n}&\bigg(\sum_{i=1}^n f_i(y_i)\bigg)P(Y)\,\mathrm{d}y_1\dots\mathrm{d}y_n\\&=\int_{y_1}\!\!\dots\!\int_{y_n}[f_1(y_1)+f_2(y_2)+\dots+f_n(y_n)]\,P(y_1,\dots,y_n)\,\mathrm{d}y_1\dots\mathrm{d}y_n\\&=\int_{y_1}\!\!\dots\!\int_{y_n}f_1(y_1)\,P(y_1,\dots,y_n)\,\mathrm{d}y_1\dots\mathrm{d}y_n\\&\quad+\int_{y_1}\!\!\dots\!\int_{y_n}f_2(y_2)\,P(y_1,\dots,y_n)\,\mathrm{d}y_1\dots\mathrm{d}y_n\\&\quad+\dots+\int_{y_1}\!\!\dots\!\int_{y_n}f_n(y_n)\,P(y_1,\dots,y_n)\,\mathrm{d}y_1\dots\mathrm{d}y_n\end{aligned}$$
Focus on the first term:
$$\int_{y_1}\!\!\dots\!\int_{y_n}f_1(y_1)\,P(y_1,\dots,y_n)\,\mathrm{d}y_1\dots\mathrm{d}y_n$$
Because $f_1(y_1)$ does not depend on $y_2,\dots,y_n$, it is a constant with respect to those variables and can be moved outside the corresponding integrals:
$$\int_{y_1}\!\!\dots\!\int_{y_n}f_1(y_1)\,P(y_1,\dots,y_n)\,\mathrm{d}y_1\dots\mathrm{d}y_n=\int_{y_1}f_1(y_1)\bigg(\int_{y_2}\!\!\dots\!\int_{y_n}P(y_1,\dots,y_n)\,\mathrm{d}y_2\dots\mathrm{d}y_n\bigg)\mathrm{d}y_1$$
By the marginalization formula
$$P(x)=\int_y P(x,y)\,\mathrm{d}y$$
we have:
$$\begin{aligned}P(y_1,y_2,\dots,y_{n-1})&=\int_{y_n}P(y_1,y_2,\dots,y_{n-1},y_n)\,\mathrm{d}y_n\\P(y_1,y_2,\dots,y_{n-2})&=\int_{y_{n-1}}P(y_1,y_2,\dots,y_{n-2},y_{n-1})\,\mathrm{d}y_{n-1}\\&\ \ \vdots\\P(y_1)&=\int_{y_2}P(y_1,y_2)\,\mathrm{d}y_2\end{aligned}$$
Each application of the marginalization formula removes one layer of integration, so the first term finally becomes:
$$\int_{y_1}f_1(y_1)\bigg(\int_{y_2}\!\!\dots\!\int_{y_n}P(y_1,\dots,y_n)\,\mathrm{d}y_2\dots\mathrm{d}y_n\bigg)\mathrm{d}y_1=\int_{y_1}f_1(y_1)\,P(y_1)\,\mathrm{d}y_1$$
Applying the same argument to every term, the full identity becomes:
$$\begin{aligned}\int_{y_1}\!\!\dots\!\int_{y_n}&\bigg(\sum_{i=1}^n f_i(y_i)\bigg)P(Y)\,\mathrm{d}y_1\dots\mathrm{d}y_n\\&=\int_{y_1}\!\!\dots\!\int_{y_n}f_1(y_1)\,P(y_1,\dots,y_n)\,\mathrm{d}y_1\dots\mathrm{d}y_n\\&\quad+\int_{y_1}\!\!\dots\!\int_{y_n}f_2(y_2)\,P(y_1,\dots,y_n)\,\mathrm{d}y_1\dots\mathrm{d}y_n\\&\quad+\dots+\int_{y_1}\!\!\dots\!\int_{y_n}f_n(y_n)\,P(y_1,\dots,y_n)\,\mathrm{d}y_1\dots\mathrm{d}y_n\\&=\int_{y_1}f_1(y_1)\,P(y_1)\,\mathrm{d}y_1+\int_{y_2}f_2(y_2)\,P(y_2)\,\mathrm{d}y_2+\dots+\int_{y_n}f_n(y_n)\,P(y_n)\,\mathrm{d}y_n\\&=\sum_{i=1}^n\bigg(\int_{y_i}f_i(y_i)\,P(y_i)\,\mathrm{d}y_i\bigg)\end{aligned}$$
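This identity can be checked numerically in the discrete case, where the integrals become sums. A small sketch with an arbitrary (made-up) joint distribution over two variables:

```python
import numpy as np

rng = np.random.default_rng(1)

# arbitrary joint distribution P(y1, y2) over {0,1,2} x {0,1,2}
P = rng.random((3, 3))
P /= P.sum()

# arbitrary functions f1(y1), f2(y2), tabulated by value
f1 = np.array([1.0, -2.0, 0.5])
f2 = np.array([3.0, 0.0, -1.0])

# left side: sum over the joint of (f1(y1) + f2(y2)) * P(y1, y2)
lhs = sum((f1[i] + f2[j]) * P[i, j] for i in range(3) for j in range(3))

# right side: each f_i summed against its marginal
P1, P2 = P.sum(axis=1), P.sum(axis=0)
rhs = (f1 * P1).sum() + (f2 * P2).sum()

print(lhs, rhs)  # equal up to floating-point error
```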
Simplifying $Q(\theta,\theta^{(g)})$
Treating $f_i(y_i)$ as $\ln p(z_i,x_i|\theta)$ and $P_i(y_i)$ as $p(z_i|x_i,\theta^{(g)})$, we obtain:
$$\begin{aligned}Q(\theta,\theta^{(g)})&=\int_{z_1}\!\!\dots\!\int_{z_n}\bigg(\sum_{i=1}^n\ln p(z_i,x_i|\theta)\bigg)\prod_{i=1}^n p(z_i|x_i,\theta^{(g)})\,\mathrm{d}z_1\dots\mathrm{d}z_n\\&=\sum_{i=1}^n\bigg(\int_{z_i}\ln p(z_i,x_i|\theta)\,p(z_i|x_i,\theta^{(g)})\,\mathrm{d}z_i\bigg)\end{aligned}$$
Since $z_i$ is a discrete random variable with $z_i\in\{1,\dots,k\}$, the integral should be written as a sum:
$$\begin{aligned}Q(\theta,\theta^{(g)})&=\sum_{i=1}^n\bigg(\int_{z_i}\ln p(z_i,x_i|\theta)\,p(z_i|x_i,\theta^{(g)})\,\mathrm{d}z_i\bigg)\\&=\sum_{i=1}^n\bigg(\sum_{z_i=1}^k\ln p(z_i,x_i|\theta)\,p(z_i|x_i,\theta^{(g)})\bigg)\end{aligned}$$
Writing $l$ for the value of $z_i$, we finally obtain:
$$\begin{aligned}Q(\theta,\theta^{(g)})&=\sum_{l=1}^k\sum_{i=1}^n\ln p(l,x_i|\theta)\,p(l|x_i,\theta^{(g)})\\&=\sum_{l=1}^k\sum_{i=1}^n\ln[\alpha_l\mathcal{N}(x_i|\mu_l,\sigma_l)]\,p(l|x_i,\theta^{(g)})\\&=\sum_{l=1}^k\sum_{i=1}^n\ln(\alpha_l)\,p(l|x_i,\theta^{(g)})+\sum_{l=1}^k\sum_{i=1}^n\ln[\mathcal{N}(x_i|\mu_l,\sigma_l)]\,p(l|x_i,\theta^{(g)})\end{aligned}$$
M-step: maximize $Q(\theta,\theta^{(g)})$
Each update finds the $\{\alpha_1,\dots,\alpha_k,\mu_1,\dots,\mu_k,\sigma_1,\dots,\sigma_k\}$ that maximize $Q(\theta,\theta^{(g)})$. Since the first term above involves only $\alpha$ and the second only $\mu$ and $\sigma$, the two terms can be maximized independently.
Maximizing $\alpha$
Result:
$$\alpha_l=\frac{1}{n}\sum_{i=1}^n p(l|x_i,\theta^{(g)})$$
Derivation. The constrained optimization problem is:
$$\frac{\partial\sum_{l=1}^k\sum_{i=1}^n\ln(\alpha_l)\,p(l|x_i,\theta^{(g)})}{\partial\alpha_1,\dots,\partial\alpha_k}=[0\dots0]\qquad \text{s.t.}\ \sum_{l=1}^k\alpha_l=1$$
Form the Lagrangian $\mathbb{LM}=\sum_{l=1}^k\sum_{i=1}^n\ln(\alpha_l)\,p(l|x_i,\theta^{(g)})+\lambda\big(\sum_{l=1}^k\alpha_l-1\big)$. Since $\sum_{i=1}^n p(l|x_i,\theta^{(g)})$ does not involve $\alpha$:
$$\frac{\partial\mathbb{LM}}{\partial\alpha_l}=\frac{1}{\alpha_l}\bigg(\sum_{i=1}^n p(l|x_i,\theta^{(g)})\bigg)+\lambda=0$$
so:
$$\alpha_l=-\frac{1}{\lambda}\bigg(\sum_{i=1}^n p(l|x_i,\theta^{(g)})\bigg)$$
Since
$$\sum_{l=1}^k\alpha_l=1$$
we get:
$$-\sum_{l=1}^k\frac{1}{\lambda}\bigg(\sum_{i=1}^n p(l|x_i,\theta^{(g)})\bigg)=1$$
Then:
$$\begin{aligned}\lambda&=-\sum_{l=1}^k\bigg(\sum_{i=1}^n p(l|x_i,\theta^{(g)})\bigg)\\&=-\sum_{i=1}^n\bigg(\sum_{l=1}^k p(l|x_i,\theta^{(g)})\bigg)\\&=-\sum_{i=1}^n 1\\&=-n\end{aligned}$$
Therefore:
$$\alpha_l=\frac{1}{n}\sum_{i=1}^n p(l|x_i,\theta^{(g)})$$
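Numerically, this α update is just a column average of the responsibility matrix, and the new weights automatically sum to 1. A minimal sketch; the responsibility matrix `gamma` below is made up for illustration:

```python
import numpy as np

# made-up responsibility matrix: gamma[i, l] = p(l | x_i, theta_g); each row sums to 1
gamma = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
    [0.7, 0.3],
])

n = gamma.shape[0]
alpha_new = gamma.sum(axis=0) / n  # alpha_l = (1/n) * sum_i p(l | x_i, theta_g)

print(alpha_new)  # [0.575 0.425], which sums to 1
```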
Maximizing $\mu$
Result:
$$\mu_l=\frac{\sum_{i=1}^n x_i\,p(l|x_i,\theta^{(g)})}{\sum_{i=1}^n p(l|x_i,\theta^{(g)})}$$
Derivation. The optimization problem is:
$$\frac{\partial\sum_{l=1}^k\sum_{i=1}^n\ln[\mathcal{N}(x_i|\mu_l,\sigma_l)]\,p(l|x_i,\theta^{(g)})}{\partial\mu_1,\dots,\partial\mu_k,\partial\sigma_1,\dots,\partial\sigma_k}=[0\dots0]$$
Expanding the log-density (here $d$ is the data dimension and $\sigma_l$ is treated as a covariance matrix):
$$\begin{aligned}\sum_{l=1}^k\sum_{i=1}^n&\ln[\mathcal{N}(x_i|\mu_l,\sigma_l)]\,p(l|x_i,\theta^{(g)})\\&=\sum_{l=1}^k\sum_{i=1}^n\ln\bigg(\frac{1}{\sqrt{(2\pi)^d|\sigma_l|}}\,e^{-\frac{1}{2}(x_i-\mu_l)^\top\sigma_l^{-1}(x_i-\mu_l)}\bigg)p(l|x_i,\theta^{(g)})\\&=\sum_{l=1}^k\sum_{i=1}^n\bigg(-\frac{1}{2}\ln\Big((2\pi)^d|\sigma_l|\Big)-\frac{1}{2}(x_i-\mu_l)^\top\sigma_l^{-1}(x_i-\mu_l)\bigg)p(l|x_i,\theta^{(g)})\end{aligned}$$
Differentiating with respect to $\mu_l$ and setting the derivative to zero:
$$\sum_{i=1}^n\sigma_l^{-1}(x_i-\mu_l)\,p(l|x_i,\theta^{(g)})=0$$
Therefore:
$$\mu_l=\frac{\sum_{i=1}^n x_i\,p(l|x_i,\theta^{(g)})}{\sum_{i=1}^n p(l|x_i,\theta^{(g)})}$$
Maximizing $\sigma$
An analogous derivation (differentiating with respect to $\sigma_l$ and setting the result to zero) gives:
$$\sigma_l=\frac{\sum_{i=1}^n(x_i-\mu_l)(x_i-\mu_l)^\top p(l|x_i,\theta^{(g)})}{\sum_{i=1}^n p(l|x_i,\theta^{(g)})}$$
Updating the parameters $\theta^{(g)}\rightarrow\theta^{(g+1)}$
$$\begin{aligned}\alpha_l^{(g+1)}&=\frac{1}{N}\sum_{i=1}^N p(l|x_i,\theta^{(g)})\\\mu_l^{(g+1)}&=\frac{\sum_{i=1}^N x_i\,p(l|x_i,\theta^{(g)})}{\sum_{i=1}^N p(l|x_i,\theta^{(g)})}\\\sigma_l^{(g+1)}&=\frac{\sum_{i=1}^N[x_i-\mu_l^{(g+1)}][x_i-\mu_l^{(g+1)}]^\top p(l|x_i,\theta^{(g)})}{\sum_{i=1}^N p(l|x_i,\theta^{(g)})}\end{aligned}$$