The EM Algorithm
The EM algorithm is a maximum likelihood estimation method for the parameters of probabilistic models that contain latent variables.
Let $Y$ denote the observed data, $Z$ the latent data, and $\theta$ the parameters to be estimated. Together, $Y$ and $Z$ are called the complete data, while the observed data $Y$ alone are called the incomplete data. If the distribution of $Y$ is $P(Y\mid\theta)$, the log-likelihood of the incomplete data is $\log P(Y\mid\theta)$; if the joint distribution of $Y$ and $Z$ is $P(Y,Z\mid\theta)$, the log-likelihood of the complete data is $\log P(Y,Z\mid\theta)$.
For a probabilistic model with latent variables, the goal is to maximize the log-likelihood of the observed data $Y$ with respect to the model parameters $\theta$, i.e. to maximize

$$L(\theta)=\log P(Y\mid\theta)=\log\Big(\sum_Z P(Y\mid Z,\theta)P(Z\mid\theta)\Big)$$
The difficulty with this expression is that it involves unobserved data and contains the logarithm of a sum. The EM algorithm maximizes $L(\theta)$ gradually through iteration. Suppose the estimate of $\theta$ after the $i$-th iteration is $\theta^{(i)}$. We want a new estimate $\theta$ that increases $L(\theta)$, i.e. $L(\theta)>L(\theta^{(i)})$, and approaches the maximum step by step. To this end, consider the difference

$$L(\theta)-L(\theta^{(i)})=\log\Big(\sum_Z P(Y\mid Z,\theta)P(Z\mid\theta)\Big)-\log P(Y\mid\theta^{(i)})$$
Applying Jensen's inequality,

$$\begin{aligned} L(\theta)-L(\theta^{(i)}) &=\log\Big(\sum_Z P(Z\mid Y,\theta^{(i)})\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})}\Big)-\log P(Y\mid\theta^{(i)})\\ &\geq\sum_Z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})}-\log P(Y\mid\theta^{(i)})\\ &=\sum_Z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})}-\sum_Z P(Z\mid Y,\theta^{(i)})\log P(Y\mid\theta^{(i)})\\ &=\sum_Z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})P(Y\mid\theta^{(i)})} \end{aligned}$$

where the third line uses $\sum_Z P(Z\mid Y,\theta^{(i)})=1$ to move the second term inside the sum.
Let

$$B(\theta,\theta^{(i)})=L(\theta^{(i)})+\sum_Z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})P(Y\mid\theta^{(i)})}$$
so that $L(\theta)\geq B(\theta,\theta^{(i)})$, and moreover $L(\theta^{(i)})=B(\theta^{(i)},\theta^{(i)})$. Therefore any $\theta$ that increases $B(\theta,\theta^{(i)})$ must also increase $L(\theta)$. To make the increase in $L(\theta)$ as large as possible, we choose $\theta^{(i+1)}$ to maximize $B(\theta,\theta^{(i)})$:
$$\begin{aligned} \theta^{(i+1)} &=\arg\max_{\theta}B(\theta,\theta^{(i)})\\ &=\arg\max_{\theta}\Big(L(\theta^{(i)})+\sum_Z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})P(Y\mid\theta^{(i)})}\Big)\\ &=\arg\max_{\theta}\sum_Z P(Z\mid Y,\theta^{(i)})\log\big(P(Y\mid Z,\theta)P(Z\mid\theta)\big)\\ &=\arg\max_{\theta}\sum_Z P(Z\mid Y,\theta^{(i)})\log P(Y,Z\mid\theta)\\ &=\arg\max_{\theta}E_Z[\log P(Y,Z\mid\theta)\mid Y,\theta^{(i)}] \end{aligned}$$

where the third line drops the terms that do not depend on $\theta$.
Define

$$Q(\theta,\theta^{(i)})=E_Z[\log P(Y,Z\mid\theta)\mid Y,\theta^{(i)}]$$

i.e. the expectation of the complete-data log-likelihood $\log P(Y,Z\mid\theta)$ over the unobserved data $Z$, given the observed data and the current parameters $\theta^{(i)}$. Therefore

$$\theta^{(i+1)}=\arg\max_{\theta}Q(\theta,\theta^{(i)})$$
The EM algorithm:
Input: observed data $Y$, latent data $Z$, joint distribution $P(Y,Z\mid\theta)$, conditional distribution $P(Z\mid Y,\theta)$
Output: model parameters $\theta$
(1) Choose an initial value $\theta^{(0)}$ and start iterating
(2) E-step: compute $Q(\theta,\theta^{(i)})=\sum_Z P(Z\mid Y,\theta^{(i)})\log P(Y,Z\mid\theta)$
(3) M-step: compute $\theta^{(i+1)}=\arg\max\limits_{\theta}Q(\theta,\theta^{(i)})$
(4) Repeat steps (2) and (3) until convergence
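The iteration above can be sketched in a few lines of Python. The toy model here is an assumption for illustration, not from the text: a two-component mixture $\alpha N(0,1)+(1-\alpha)N(4,1)$ with known means and unit variances, so that only the mixing weight $\alpha$ is estimated and the M-step has a closed form.

```python
import math

def phi(x, mu, sigma):
    # Gaussian density phi(x | mu, sigma^2)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em(e_step, m_step, theta0, max_iter=200, tol=1e-8):
    # Generic EM loop: the E-step builds the posterior of the latent
    # variable under theta^(i); the M-step maximizes Q(theta, theta^(i)).
    theta = theta0
    for _ in range(max_iter):
        posterior = e_step(theta)
        theta_new = m_step(posterior)
        if abs(theta_new - theta) < tol:  # theta is a scalar in this toy model
            return theta_new
        theta = theta_new
    return theta

# Hypothetical data: four points near 0 and four points near 4
data = [-0.5, 0.1, 0.3, 0.2, 3.6, 4.2, 4.1, 3.9]

def e_step(alpha):
    # Posterior probability that each observation came from component 1
    return [alpha * phi(y, 0, 1) / (alpha * phi(y, 0, 1) + (1 - alpha) * phi(y, 4, 1))
            for y in data]

def m_step(posterior):
    # For this model, maximizing Q over alpha gives the mean responsibility
    return sum(posterior) / len(posterior)

alpha_hat = em(e_step, m_step, theta0=0.3)
```

Since half the points lie near each component, the estimate converges to roughly $\alpha\approx 0.5$ regardless of the starting value.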
Gaussian Mixture Models
A Gaussian mixture distribution is a probability distribution of the form

$$P(x\mid\theta)=\sum_{k=1}^K\alpha_k\phi(x\mid\theta_k)$$
where the $\alpha_k$ are mixing coefficients with $\sum_{k=1}^K\alpha_k=1$ and $\alpha_k\geq 0$, and $\phi(x\mid\theta_k)$ is the Gaussian density with $\theta_k=(\mu_k,\sigma_k^2)$:
$$\phi(x\mid\theta_k)=\frac{1}{\sqrt{2\pi}\sigma_k}\exp\Big(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\Big)$$
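As a quick sanity check, the mixture density can be evaluated directly; the component parameters below are arbitrary illustrative values.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    # phi(x | theta_k) with theta_k = (mu_k, sigma_k^2)
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def mixture_pdf(x, alphas, thetas):
    # P(x | theta) = sum_k alpha_k * phi(x | theta_k); the alphas must sum to 1
    return sum(a * gaussian_pdf(x, mu, s2) for a, (mu, s2) in zip(alphas, thetas))

density = mixture_pdf(0.0, [0.3, 0.7], [(0.0, 1.0), (5.0, 2.0)])
```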
Assume each sample is generated by the Gaussian mixture as follows: first a mixture component is selected according to $\alpha_1,\alpha_2,\dots,\alpha_K$, then the observation is generated from the selected component. Here the observations are known, but which Gaussian each observation came from is not; this is represented by the latent variable $\gamma_{jk}$, defined as

$$\gamma_{jk}=\begin{cases}1,&\text{observation }j\text{ comes from component }k\\0,&\text{otherwise}\end{cases}\qquad j=1,2,\dots,N;\;k=1,2,\dots,K$$
Then the complete data are

$$(x_j,\gamma_{j1},\gamma_{j2},\dots,\gamma_{jK}),\quad j=1,2,\dots,N$$
and the complete-data likelihood is

$$\begin{aligned} P(\gamma,x\mid\theta) &=\prod_{j=1}^{N}P(x_j,\gamma_{j1},\gamma_{j2},\dots,\gamma_{jK}\mid\theta)\\ &=\prod_{j=1}^{N}\prod_{k=1}^{K}[\alpha_k\phi(x_j\mid\theta_k)]^{\gamma_{jk}}\\ &=\prod_{k=1}^{K}\alpha_k^{n_k}\prod_{j=1}^{N}[\phi(x_j\mid\theta_k)]^{\gamma_{jk}}\\ &=\prod_{k=1}^{K}\alpha_k^{n_k}\prod_{j=1}^{N}\Big[\frac{1}{\sqrt{2\pi}\sigma_k}\exp\Big(-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big]^{\gamma_{jk}} \end{aligned}$$
where $n_k=\sum_{j=1}^{N}\gamma_{jk}$ and $\sum_{k=1}^{K}n_k=N$. The complete-data log-likelihood is then

$$\log P(\gamma,x\mid\theta)=\sum_{k=1}^{K}n_k\log\alpha_k+\sum_{k=1}^{K}\sum_{j=1}^{N}\gamma_{jk}\Big(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\Big)$$
The $Q$ function to be maximized is

$$\begin{aligned} Q(\theta,\theta^{(i)}) &=E_\gamma[\log P(\gamma,x\mid\theta)\mid x,\theta^{(i)}]\\ &=E_\gamma\Big\{\sum_{k=1}^{K}n_k\log\alpha_k+\sum_{k=1}^{K}\sum_{j=1}^{N}\gamma_{jk}\Big(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big\}\\ &=\sum_{k=1}^{K}\Big\{\sum_{j=1}^{N}(E\gamma_{jk})\log\alpha_k+\sum_{j=1}^{N}(E\gamma_{jk})\Big(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big\} \end{aligned}$$
This requires computing $E\gamma_{jk}=E(\gamma_{jk}\mid x,\theta)$:
$$\begin{aligned} E(\gamma_{jk}\mid x,\theta) &=P(\gamma_{jk}=1\mid x,\theta)\\ &=\frac{P(\gamma_{jk}=1,x_j\mid\theta)}{P(x_j\mid\theta)}\\ &=\frac{P(\gamma_{jk}=1,x_j\mid\theta)}{\sum_{k=1}^{K}P(\gamma_{jk}=1,x_j\mid\theta)}\\ &=\frac{P(x_j\mid\gamma_{jk}=1,\theta)P(\gamma_{jk}=1\mid\theta)}{\sum_{k=1}^{K}P(x_j\mid\gamma_{jk}=1,\theta)P(\gamma_{jk}=1\mid\theta)}\\ &=\frac{\alpha_k\phi(x_j\mid\theta_k)}{\sum_{k=1}^{K}\alpha_k\phi(x_j\mid\theta_k)} \end{aligned}$$
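The final expression is exactly the E-step in computational form. A minimal sketch, with made-up parameter values, that computes the full matrix of responsibilities:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    # phi(x | theta_k) with theta_k = (mu_k, sigma_k^2)
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def responsibilities(xs, alphas, thetas):
    # gamma_hat[j][k] = alpha_k phi(x_j|theta_k) / sum_k' alpha_k' phi(x_j|theta_k')
    gamma = []
    for xj in xs:
        weighted = [a * gaussian_pdf(xj, mu, s2) for a, (mu, s2) in zip(alphas, thetas)]
        total = sum(weighted)
        gamma.append([w / total for w in weighted])
    return gamma

gamma = responsibilities([0.1, 5.2], [0.5, 0.5], [(0.0, 1.0), (5.0, 1.0)])
```

Each row of `gamma` is a probability distribution over the $K$ components, so it sums to one by construction.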
$E(\gamma_{jk}\mid x,\theta)$ is the probability, under the current parameters, that the $j$-th observation comes from the $k$-th mixture component; it is denoted $\hat{\gamma}_{jk}$. Putting everything together,
$$Q(\theta,\theta^{(i)})=\sum_{k=1}^{K}\Big\{n_k\log\alpha_k+\sum_{j=1}^{N}\hat{\gamma}_{jk}\Big(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big\}$$
where $n_k=\sum_{j=1}^{N}\hat{\gamma}_{jk}$. Next, $Q(\theta,\theta^{(i)})$ is maximized by taking its partial derivative with respect to each parameter.
The partial derivative of $Q(\theta,\theta^{(i)})$ with respect to $\mu_k$:
$$\frac{\partial Q(\theta,\theta^{(i)})}{\partial\mu_k}=\sum_{j=1}^{N}\frac{\hat{\gamma}_{jk}(x_j-\mu_k)}{\sigma^2_k}$$
Setting it to zero gives

$$\hat{\mu}_k=\frac{\sum_{j=1}^{N}\hat{\gamma}_{jk}x_j}{\sum_{j=1}^{N}\hat{\gamma}_{jk}}$$
The partial derivative of $Q(\theta,\theta^{(i)})$ with respect to $\sigma_k^2$:
$$\frac{\partial Q(\theta,\theta^{(i)})}{\partial\sigma_k^2}=-\frac{1}{2}\sum_{j=1}^{N}\hat{\gamma}_{jk}\Big(\frac{1}{\sigma^2_k}-\frac{(x_j-\mu_k)^2}{\sigma^4_k}\Big)$$
Setting it to zero gives

$$\hat{\sigma}_k^2=\frac{\sum_{j=1}^{N}\hat{\gamma}_{jk}(x_j-\mu_k)^2}{\sum_{j=1}^{N}\hat{\gamma}_{jk}}$$
For the partial derivative with respect to $\alpha_k$, the constraint $\sum_{k=1}^{K}\alpha_k=1$ must be respected, so consider the Lagrangian of $Q(\theta,\theta^{(i)})$:
$$L(\theta,\theta^{(i)})=Q(\theta,\theta^{(i)})+\lambda\Big(\sum_{k=1}^K\alpha_k-1\Big)$$
Taking the partial derivative,

$$\frac{\partial L(\theta,\theta^{(i)})}{\partial\alpha_k}=\frac{\sum_{j=1}^N\hat{\gamma}_{jk}}{\alpha_k}+\lambda$$
Setting it to zero and multiplying through by $\alpha_k$,

$$\sum_{j=1}^N\hat{\gamma}_{jk}+\lambda\alpha_k=0$$
To solve for $\lambda$, sum over all components:

$$\sum_{k=1}^K\sum_{j=1}^N\hat{\gamma}_{jk}+\lambda\sum_{k=1}^K\alpha_k=0$$
which gives $\lambda=-N$ (since the responsibilities over $k$ sum to one for each $j$, and $\sum_{k=1}^K\alpha_k=1$), so

$$\hat{\alpha}_k=\frac{\sum_{j=1}^{N}\hat{\gamma}_{jk}}{N}$$
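Alternating the E-step with these three closed-form updates gives the complete EM procedure for a one-dimensional Gaussian mixture. The sketch below uses hypothetical data with two well-separated groups; a production implementation would also monitor the log-likelihood for convergence and guard against collapsing variances.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    # phi(x | theta_k) with theta_k = (mu_k, sigma_k^2)
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gmm_em(xs, alphas, mus, sigma2s, n_iter=50):
    # Alternate the E-step (responsibilities gamma_hat_jk) with the
    # closed-form M-step updates for mu_k, sigma_k^2 and alpha_k.
    K, N = len(alphas), len(xs)
    for _ in range(n_iter):
        # E-step: gamma[j][k] = alpha_k phi(x_j) / normalizer over components
        gamma = []
        for xj in xs:
            w = [alphas[k] * gaussian_pdf(xj, mus[k], sigma2s[k]) for k in range(K)]
            s = sum(w)
            gamma.append([wk / s for wk in w])
        # M-step: plug the responsibilities into the update formulas
        for k in range(K):
            nk = sum(gamma[j][k] for j in range(N))
            mus[k] = sum(gamma[j][k] * xs[j] for j in range(N)) / nk
            sigma2s[k] = sum(gamma[j][k] * (xs[j] - mus[k]) ** 2 for j in range(N)) / nk
            alphas[k] = nk / N
    return alphas, mus, sigma2s

# Hypothetical observations: two clusters, around 0 and around 5
data = [-0.2, 0.1, 0.4, -0.3, 0.0, 4.8, 5.1, 5.3, 4.9, 5.2]
alphas, mus, sigma2s = gmm_em(data, [0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
```

With this data the fitted means settle near 0 and 5 and the mixing weights near one half each.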
The multivariate Gaussian case is analogous: the update formulas are nearly identical, but the computation is more involved, requiring matrix calculus in the derivation.
The results of fitting a Gaussian mixture model also yield a clustering method: each sample is assigned to a class according to the responsibilities $\hat{\gamma}_{jk}$. For a sample $x_j$, its class is

$$\arg\max_{k}\hat{\gamma}_{jk}$$
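A minimal sketch of this assignment rule, where `gamma` holds the responsibilities $\hat{\gamma}_{jk}$ (the numeric values below are made up):

```python
def assign_clusters(gamma):
    # Each sample x_j is assigned to the component k with the
    # largest responsibility gamma_hat_jk
    return [max(range(len(row)), key=lambda k: row[k]) for row in gamma]

labels = assign_clusters([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])  # -> [0, 1, 0]
```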