Contents
Prerequisite
- A good mood
Model
Driving Example
Coin tossing:
(Image: Coin 1 and Coin 2; source: Bilibili)
Let the probabilities of heads for Coin 1 and Coin 2 be $p$ and $q$, respectively.
Coin 1 is tossed on the odd-numbered trials and Coin 2 on the even-numbered trials; ten tosses in a row give:
$$1,1,0,1,0,0,1,0,1,1$$
Now estimate $p$ and $q$ by maximum likelihood.
Solution:
From the setup, the likelihood function is:
$$L(p, q)=P(\mathbf x\vert p,q)=p(x_1\vert p,q)\cdot p(x_2\vert p,q)\cdots p(x_{10}\vert p,q)$$
That is:
$$\begin{aligned}L(p, q)&=p\cdot q\cdot(1-p)\cdot q\cdot(1-p)\cdot(1-q)\cdot p\cdot(1-q)\cdot p\cdot q\\ &=p^3\,q^3\,(1-p)^2\,(1-q)^2\end{aligned}$$
To achieve $\max L(p,q)$, it suffices to maximize $p^3(1-p)^2$ and $q^3(1-q)^2$ separately.
Clearly (setting $\frac{d}{dp}\,p^3(1-p)^2=p^2(1-p)(3-5p)$ to zero, and likewise for $q$) $p=q=0.6$.
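As a quick sanity check (my own addition, not part of the original solution), the maximizer can be confirmed numerically:

```python
import numpy as np

# maximize p^3 * (1-p)^2 on a fine grid; the calculus answer is p = 3/5
p = np.linspace(0.001, 0.999, 99_999)
f = p**3 * (1 - p)**2
print(p[f.argmax()])  # ~0.6
```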
The example above is simple, but legend has it that a certain meddler went about it like this:
The meddler tosses coins:
(Image: Coin 1, Coin 2, and Coin 3)
Let the probabilities of heads for Coin 1, Coin 2, and Coin 3 be $\pi, p, q$, respectively.
Coin 1 is tossed first; depending on its outcome, Coin 2 (if Coin 1 is heads) or Coin 3 (if tails) is tossed next. Ten repetitions give:
$$1,1,0,1,0,0,1,0,1,1$$
The meddler says:
Suppose only the final results of the tosses can be observed, not the tossing process. How do we estimate the probability of heads for each of the three coins?
Solution:
Compared with the odd/even scheme above, all that has changed is that the choice of coin is now made by another coin. The awkward part is that the meddler will not let us see Coin 1's outcome, so we cannot build the likelihood directly as $p(x_1\vert p,q)\cdot p(x_2\vert p,q)\cdots$, because we have no information about Coin 1. So let us build a three-parameter model that brings Coin 1's result into the picture.
$$\begin{aligned} P(y\vert\theta) &=\sum_{z}P(y,z\vert\theta)=\sum_{z}P(z\vert\theta)P(y\vert z,\theta) \\ &=\pi p^{y}(1-p)^{1-y}+(1-\pi)q^{y}(1-q)^{1-y} \end{aligned}$$
Here $y$ is the observed variable (a single toss result, 1 or 0); $z$ is the hidden variable, the outcome of Coin 1, which cannot be observed; and $\theta=(\pi,p,q)$ are the model parameters. Writing the observed data as $Y=(Y_1,Y_2,\dots,Y_n)^{T}$ and the unobserved data as $Z=(Z_1,Z_2,\dots,Z_n)^{T}$, the likelihood of the observed data is:
$$P(Y\vert\theta)=\sum_{Z}P(Z\vert\theta)P(Y\vert Z,\theta)$$
or, equivalently, by independence of the tosses:
$$P(Y\vert\theta)=\prod_{i=1}^n P(y_i\vert \theta)=\prod_{i=1}^n \big( \pi p^{y_i}(1-p)^{1-y_{i}}+(1-\pi)q^{y_{i}}(1-q)^{1-y_{i}}\big)$$
Note that $Z=(Z_1,Z_2,\dots,Z_n)^{T}$ is a vector, not a single one of its values!
Then the maximum likelihood estimate of $\theta=(\pi,p,q)$ is:
$$\hat{\theta}=\arg\max_\theta \log P(Y\vert \theta)$$
That's it? Just fire up LINGO or MATLAB and call it a day?!
Unfortunately, this problem has no analytic solution, so we must look for another way.
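To see why, here is a small sketch (my own illustration) that evaluates $\log P(Y\vert\theta)$ on the observed sequence. Each factor equals $r^{y_i}(1-r)^{1-y_i}$ with $r=\pi p+(1-\pi)q$, so every $(\pi,p,q)$ with $\pi p+(1-\pi)q=0.6$ attains the same maximal likelihood, and the data alone cannot pin the three parameters down:

```python
import numpy as np

y = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])

def log_lik(pi, p, q):
    """log P(Y | theta) for the three-coin model."""
    per_toss = pi * p**y * (1 - p)**(1 - y) + (1 - pi) * q**y * (1 - q)**(1 - y)
    return np.log(per_toss).sum()

# two very different parameter triples with pi*p + (1-pi)*q = 0.6
print(log_lik(0.5, 0.6, 0.6))          # -6.7301...
print(log_lik(0.3, 0.9, 0.33 / 0.7))   # identical value
```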
The EM Algorithm:
E: expectation; M: maximization
$L(\theta)-L(\theta^{(i)})$:
As noted above, we have the log-likelihood:
$$\begin{aligned}L(\theta)&=\log P(Y\vert \theta)=\log\sum_Z P(Y,Z\vert \theta)\\ &=\log\Big(\sum_Z P(Y\vert Z,\theta)P(Z\vert \theta)\Big) \end{aligned}$$
Suppose we start from a random initial value $\theta^{(i)}$. Our task is now to find a $\theta$ with $L(\theta)>L(\theta^{(i)})$, which is to say:
$$L(\theta)-L(\theta^{(i)})=\log\Big(\sum_Z P(Y\vert Z,\theta)P(Z\vert \theta)\Big)-\log P(Y\vert \theta^{(i)}) > 0$$
We now want to rework this expression to find a lower bound on $L(\theta)$: if some lower bound exceeds $L(\theta^{(i)})$, the difference above is positive. And the higher we can push that lower bound, the larger $L(\theta)$ will tend to become. Note, however, that when the lower bound attains its maximum, $L(\theta)$ has not necessarily attained its own maximum; we return to this point below.
$$\begin{aligned}L(\theta)-L(\theta^{(i)})&=\log\Big(\sum_Z P(Z\vert Y,\theta^{(i)})\frac{P(Y\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert Y,\theta^{(i)})}\Big)-\log P(Y\vert \theta^{(i)})\\ &\quad\text{by concavity of }\log\text{ (Jensen): }\log\sum_j\lambda_j y_j\geq \sum_j\lambda_j\log y_j\text{ for }\lambda_j\ge 0,\ \sum_j\lambda_j=1\text{, so}\\ &\ge \sum_Z P(Z\vert Y,\theta^{(i)})\log\frac{P(Y\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert Y,\theta^{(i)})}-\log P(Y\vert \theta^{(i)})\\ &\quad\text{since }\textstyle\sum_Z P(Z\vert Y,\theta^{(i)})=1\text{, the last term can be folded in:}\\ &=\sum_Z P(Z\vert Y,\theta^{(i)})\log\frac{P(Y\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert Y,\theta^{(i)})}-\sum_Z P(Z\vert Y,\theta^{(i)})\log P(Y\vert \theta^{(i)})\\ &=\sum_Z P(Z\vert Y,\theta^{(i)})\log\frac{P(Y\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert Y,\theta^{(i)})P(Y\vert \theta^{(i)})} \end{aligned}$$
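The only inequality used is the concavity (Jensen) step; a two-line numerical check with weights and values chosen arbitrarily by me:

```python
import numpy as np

lam = np.array([0.2, 0.5, 0.3])  # lambda_j >= 0, summing to 1
yv = np.array([1.0, 4.0, 9.0])   # arbitrary positive values
print(np.log(lam @ yv))          # log of the average: ~1.589
print(lam @ np.log(yv))          # average of the logs: ~1.352 (never larger)
```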
$B(\theta,\theta^{(i)})$:
Define:
$$B(\theta,\theta^{(i)})\triangleq L(\theta^{(i)})+\sum_Z P(Z\vert Y,\theta^{(i)})\log\frac{P(Y\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert Y,\theta^{(i)})P(Y\vert \theta^{(i)})}$$
Then:
$$L(\theta)\ge B(\theta,\theta^{(i)})$$
Setting $\theta=\theta^{(i)}$, we get
$$L(\theta^{(i)})=B(\theta^{(i)},\theta^{(i)})$$
As argued above, maximizing $B(\theta,\theta^{(i)})$ also pushes $L(\theta)$ up as far as we can.
So our task becomes:
$$\theta^{(i+1)}=\arg\max_\theta B(\theta,\theta^{(i)})$$
Dropping terms that are constant with respect to $\theta$:
$$\begin{aligned}\theta^{(i+1)}&=\arg\max_\theta \Big(L(\theta^{(i)})+\sum_Z P(Z\vert Y,\theta^{(i)})\log\frac{P(Y\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert Y,\theta^{(i)})P(Y\vert \theta^{(i)})}\Big)\\ &=\arg\max_\theta \Big(\sum_Z P(Z\vert Y,\theta^{(i)}) \log\big(P(Y\vert Z,\theta)P(Z\vert \theta)\big)-\sum_Z P(Z\vert Y,\theta^{(i)})\log\big(P(Y\vert \theta^{(i)})P(Z\vert Y,\theta^{(i)})\big)\Big)\\ &=\arg\max_\theta \sum_Z P(Z\vert Y,\theta^{(i)}) \log\big(P(Y\vert Z,\theta)P(Z\vert \theta)\big)\\ &=\arg\max_\theta \sum_Z P(Z\vert Y,\theta^{(i)}) \log P(Y,Z\vert \theta)\\ &=\arg\max_\theta Q(\theta,\theta^{(i)})\qquad\text{key point: the }Q\text{ function!} \end{aligned}$$
Optimality:
Returning to the question of whether EM can find the optimal solution, see the figure below:
(Figure source: Li Hang, Statistical Learning Methods)
Observe that although we can make $B(\theta,\theta^{(i)})$ maximal, because of the inequality in the derivation, $L(\theta^{(i+1)})$ is not necessarily maximal. Hence the EM algorithm cannot guarantee a global optimum.
The Q Function
Definition:
The expectation of the complete-data log-likelihood $\log P(Y,Z\vert \theta)$, taken with respect to the conditional distribution $P(Z\vert Y,\theta^{(i)})$ of the unobserved data $Z$ given the observed data $Y$ and the current parameters $\theta^{(i)}$, is called the Q function. (From Li Hang, Statistical Learning Methods.)
That is:
$$Q(\theta,\theta^{(i)})=E_Z[\log P(Y,Z\vert \theta)\mid Y,\theta^{(i)}]$$
Pseudocode:
Initialize $\theta$. (Note that the EM algorithm is fairly sensitive to the initial value.)
for $i=1$ to $k$:
$\quad$ E step: $Q(\theta,\theta^{(i)})=E_Z[\log P(Y,Z\vert \theta)\mid Y,\theta^{(i)}]$
$\quad$ M step: $\theta^{(i+1)}=\arg\max_\theta Q(\theta,\theta^{(i)})$
until $\Vert\theta^{(i+1)}-\theta^{(i)}\Vert<\epsilon_1$ or $\Vert Q(\theta^{(i+1)},\theta^{(i)})-Q(\theta^{(i)},\theta^{(i)})\Vert<\epsilon_2$
The procedure above also makes it clear why the algorithm is called EM. A runnable sketch for the three-coin problem follows.
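Here is a minimal sketch of this loop for the three-coin model (my own illustration; the closed-form E and M updates below come from maximizing the Q function, and the function name and tolerance are my choices):

```python
import numpy as np

def three_coin_em(y, pi, p, q, n_iter=100, tol=1e-8):
    """EM for the three-coin model; (pi, p, q) is the initial guess."""
    y = np.asarray(y, dtype=float)
    for _ in range(n_iter):
        # E step: posterior probability that Coin 1 came up heads on toss j
        a = pi * p**y * (1 - p)**(1 - y)
        b = (1 - pi) * q**y * (1 - q)**(1 - y)
        mu = a / (a + b)
        # M step: closed-form argmax of Q(theta, theta^(i))
        pi_new, p_new = mu.mean(), (mu * y).sum() / mu.sum()
        q_new = ((1 - mu) * y).sum() / (1 - mu).sum()
        done = max(abs(pi_new - pi), abs(p_new - p), abs(q_new - q)) < tol
        pi, p, q = pi_new, p_new, q_new
        if done:
            break
    return pi, p, q

y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
print(three_coin_em(y, 0.5, 0.5, 0.5))  # (0.5, 0.6, 0.6)
print(three_coin_em(y, 0.4, 0.6, 0.7))  # a different fixed point: init sensitivity
```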
Convergence:
Please see: https://blog.csdn.net/sinat_26376671/article/details/44313885?depth_1-utm_source=distribute.pc_relevant_right.none-task-blog-BlogCommendFromBaidu-3&utm_source=distribute.pc_relevant_right.none-task-blog-BlogCommendFromBaidu-3
Parameter Estimation for Gaussian Mixture Models
Definition:
A Gaussian mixture model is a probability distribution of the form:
$$P(y\vert\theta)=\sum_{k=1}^{K}\alpha_k\,\phi(y\vert\theta_{k}),\qquad \alpha_k>0,\ \sum_k \alpha_k =1,\ \theta_k=(\mu_k,\sigma_k^2)$$
$$\phi(y\vert\theta_{k})=\frac{1}{\sqrt{2\pi}\,\sigma_{k}}\exp\Big(-\frac{(y-\mu_{k})^2}{2 \sigma_{k}^{2}}\Big)\ \text{is called the }k\text{-th component model.}$$
Consider:
Let the Gaussian mixture model generate samples $Y=(y_1,y_2,\cdots,y_N)^T$. How should we understand this generative process?
Since $\sum_k\alpha_k=1$, we can view the $j$-th draw as first selecting a component $k$ with probability $\alpha_k$, and then using that component to generate $y_j$.
This is very much like the coin tosses: first determine an unobservable "selector", then use the corresponding generator to produce the sample. So our hidden variable (the selector) is settled. A small sampling sketch follows.
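A sketch of this generative process (the weights, means, and variances here are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.3, 0.7])   # mixing weights, sum to 1
mu = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

# first draw the hidden selector z_j according to alpha,
# then draw y_j from the selected Gaussian component
z = rng.choice(len(alpha), size=1000, p=alpha)  # hidden variable
y = rng.normal(mu[z], sigma[z])                 # observed variable
```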
Here we take the hidden variable to be $Z=(z_1,z_2,\cdots,z_N)^T$, where each component records which component model $\phi(y\vert\theta_k)$ was used for that draw, i.e. $z_j\in\{1,2,\cdots,K\}$.
E step:
From the above, we know the observed variable $Y$ and the hidden variable $Z$. By the EM algorithm for maximum likelihood estimation we have:
$$\begin{aligned}\theta^{(i+1)}&=\arg\max_\theta Q(\theta,\theta^{(i)}) \\ &=\arg\max_\theta E_Z[\log P(Y,Z\vert \theta)\mid Y,\theta^{(i)}]\\ &=\arg\max_\theta \sum_Z P(Z\vert Y,\theta^{(i)}) \log P(Y,Z\vert \theta)\end{aligned}$$
By the independence of the samples:
$$\begin{aligned}&=\arg\max_\theta \sum_{z_1,z_2,\cdots ,z_N}P(z_1\vert y_1,\theta^{(i)})\cdots P(z_N\vert y_N,\theta^{(i)})\,\log\big(P(y_1,z_1\vert\theta) \cdots P(y_N,z_N\vert \theta)\big)\\ &=\arg\max_\theta\sum_{z_1,z_2,\cdots ,z_N}\prod_{j=1}^N P(z_j\vert y_j,\theta^{(i)})\sum_{j=1}^N\log P(y_j,z_j\vert \theta)\\ &=\arg\max_\theta\sum_{z_1,z_2,\cdots ,z_N}\prod_{j=1}^N P(z_j\vert y_j,\theta^{(i)})\Big(\log P(y_1,z_1\vert \theta)+\sum_{j=2}^N\log P(y_j,z_j\vert \theta)\Big)\end{aligned}$$
Taking the first summand:
$$\begin{aligned}&\sum_{z_1,z_2,\cdots ,z_N}\prod_{j=1}^N P(z_j\vert y_j,\theta^{(i)})\log P(y_1,z_1\vert \theta)\\ &=\sum_{z_1}\log P(y_1,z_1\vert \theta)P(z_1\vert y_1,\theta^{(i)})\sum_{z_2,\cdots ,z_N}\prod_{j=2}^N P(z_j\vert y_j,\theta^{(i)})\\ &=\sum_{z_1}\log P(y_1,z_1\vert \theta)P(z_1\vert y_1,\theta^{(i)})\sum_{z_2}P(z_2\vert y_2,\theta^{(i)})\sum_{z_3,\cdots ,z_N}\prod_{j=3}^N P(z_j\vert y_j,\theta^{(i)})\end{aligned}$$
Noticing that $\sum_{z_2}P(z_2\vert y_2,\theta^{(i)})=1$ (and likewise for every other $z_j$), the trailing sums collapse one by one, so:
$$\sum_{z_1,z_2,\cdots ,z_N}\prod_{j=1}^N P(z_j\vert y_j,\theta^{(i)})\log P(y_1,z_1\vert \theta)=\sum_{z_1}\log P(y_1,z_1\vert \theta)P(z_1\vert y_1,\theta^{(i)})$$
and the same argument applies to each summand, i.e.:
$$\sum_{z_1,z_2,\cdots ,z_N}\prod_{j'=1}^N P(z_{j'}\vert y_{j'},\theta^{(i)})\log P(y_j,z_j\vert \theta)=\sum_{z_j}\log P(y_j,z_j\vert \theta)P(z_j\vert y_j,\theta^{(i)})$$
Therefore:
$$\theta^{(i+1)}=\arg\max_\theta \sum_{j=1}^N\sum_{z_j}\log P(y_j,z_j\vert \theta)\,P(z_j\vert y_j,\theta^{(i)})$$
Moreover, writing $\alpha_{z_j}$ for the mixing weight selected by $z_j$:
$$P(y_j,z_j\vert \theta)=P(z_j\vert \theta)P(y_j\vert z_j,\theta)=\alpha_{z_j}\phi(y_j\vert \theta_{z_j})$$
$$P(z_j\vert y_j,\theta^{(i)})=\frac{P(z_j,y_j\vert \theta^{(i)})}{P(y_j\vert \theta^{(i)})}=\frac{\alpha_{z_j}^{(i)}\phi(y_j\vert\theta_{z_j}^{(i)})}{\sum\limits_{k=1}^K\alpha_k^{(i)}\phi(y_j\vert\theta_k^{(i)})}$$
Finally:
$$\theta^{(i+1)}=\arg\max_\theta \sum_{j=1}^N\sum_{z_j}\log\big(\alpha_{z_j}\phi(y_j\vert \theta_{z_j})\big)\,\frac{\alpha_{z_j}^{(i)}\phi(y_j\vert\theta_{z_j}^{(i)})}{\sum\limits_{k=1}^K\alpha_k^{(i)}\phi(y_j\vert\theta_k^{(i)})}$$
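In practice, the posterior $P(z_j\vert y_j,\theta^{(i)})$ (the "responsibility" of component $k$ for sample $j$) is all the E step needs to compute. A sketch (my own, reusing the `alpha`, `mu`, `sigma` arrays from the sampling example above):

```python
import numpy as np
from scipy.stats import norm

def responsibilities(y, alpha, mu, sigma):
    """gamma[j, k] = P(z_j = k | y_j, theta^(i)) for a 1-D Gaussian mixture."""
    dens = norm.pdf(y[:, None], loc=mu, scale=sigma)  # phi(y_j | theta_k), shape (N, K)
    weighted = alpha * dens                           # alpha_k * phi(y_j | theta_k)
    return weighted / weighted.sum(axis=1, keepdims=True)
```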
M step:
From the derivation above, we obtain:
$$\begin{aligned}\theta^{(i+1)}&=\arg\max_\theta Q(\theta,\theta^{(i)})\\ &=\arg\max_\theta \sum_{j=1}^N\sum_{z_j}\log\big(\alpha_{z_j}\phi(y_j\vert \theta_{z_j})\big)\,\frac{\alpha_{z_j}^{(i)}\phi(y_j\vert\theta_{z_j}^{(i)})}{\sum\limits_{k=1}^K\alpha_k^{(i)}\phi(y_j\vert\theta_k^{(i)})} \\ &=\arg\max_\theta \sum_{j=1}^N\sum_{z_j}\big(\log\alpha_{z_j}+\log\phi(y_j\vert \theta_{z_j})\big)\,\frac{\alpha_{z_j}^{(i)}\phi(y_j\vert\theta_{z_j}^{(i)})}{\sum\limits_{k=1}^K\alpha_k^{(i)}\phi(y_j\vert\theta_k^{(i)})}\\ &=\arg\max_\theta\sum_{j=1}^N\sum_{z_j}\bigg(\log\alpha_{z_j}\,\frac{\alpha_{z_j}^{(i)}\phi(y_j\vert\theta_{z_j}^{(i)})}{\sum\limits_{k=1}^K\alpha_k^{(i)}\phi(y_j\vert\theta_k^{(i)})}+\log\phi(y_j\vert \theta_{z_j})\,\frac{\alpha_{z_j}^{(i)}\phi(y_j\vert\theta_{z_j}^{(i)})}{\sum\limits_{k=1}^K\alpha_k^{(i)}\phi(y_j\vert\theta_k^{(i)})}\bigg)\end{aligned}$$
Notice that to attain the maximum, it suffices to maximize the two addends separately; that is:
$$\begin{aligned}\alpha&=\arg\max_{\alpha}\sum_{j=1}^N\sum_{z_j}\log\alpha_{z_j}\,\frac{\alpha_{z_j}^{(i)}\phi(y_j\vert\theta_{z_j}^{(i)})}{\sum\limits_{k=1}^K\alpha_k^{(i)}\phi(y_j\vert\theta_k^{(i)})}\\ (\mu,\sigma^2)&=\arg\max_{\mu,\sigma^2}\sum_{j=1}^N\sum_{z_j}\log\phi(y_j\vert \theta_{z_j})\,\frac{\alpha_{z_j}^{(i)}\phi(y_j\vert\theta_{z_j}^{(i)})}{\sum\limits_{k=1}^K\alpha_k^{(i)}\phi(y_j\vert\theta_k^{(i)})} \end{aligned}$$
This way, the mixing weights can be solved for separately from the Gaussian parameters. Since the full solution process touches on the origin of the universe and space here is limited, we will not spell it out; the resulting closed-form updates are sketched after the link below.
For details, please see: https://blog.csdn.net/qq_37334135/article/details/85493330
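For reference, these are the standard closed-form updates (e.g. Li Hang's Algorithm 9.2): a Lagrange multiplier handles the constraint $\sum_k\alpha_k=1$, and setting derivatives with respect to $\mu_k$ and $\sigma_k^2$ to zero gives the rest. Writing $\hat\gamma_{jk}$ for the responsibility $P(z_j=k\vert y_j,\theta^{(i)})$:
$$\mu_k^{(i+1)}=\frac{\sum_{j=1}^N\hat\gamma_{jk}\,y_j}{\sum_{j=1}^N\hat\gamma_{jk}},\qquad \big(\sigma_k^2\big)^{(i+1)}=\frac{\sum_{j=1}^N\hat\gamma_{jk}\,\big(y_j-\mu_k^{(i+1)}\big)^2}{\sum_{j=1}^N\hat\gamma_{jk}},\qquad \alpha_k^{(i+1)}=\frac{\sum_{j=1}^N\hat\gamma_{jk}}{N}$$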
Summary:
- The EM algorithm, in short: write down the Q function, find the parameters that maximize it, and iterate until the result is satisfactory.
- EM is not tied to maximum likelihood estimation; it can also be used with maximum a posteriori estimation.
- Distributions other than the Gaussian can also be used for the component models.
- The sklearn mixture module's documentation covers this quite well, Bayesian estimation included; worth a look (a short usage sketch follows the link):
- https://scikit-learn.org/stable/modules/mixture.html#gmm
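A minimal usage sketch (assuming the 1-D samples `y` generated earlier; `GaussianMixture` fits by EM):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# y: 1-D samples, e.g. from the generative sketch above
gm = GaussianMixture(n_components=2, n_init=5, random_state=0)
gm.fit(np.asarray(y).reshape(-1, 1))
print(gm.weights_, gm.means_.ravel(), np.sqrt(gm.covariances_).ravel())
```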
Reference:
Li Hang, Statistical Learning Methods (2nd Edition)
Links:
https://www.bilibili.com/video/BV1aE411o7qd?p=68
https://github.com/datawhalechina/team-learning/tree/master/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E7%AE%97%E6%B3%95%E5%9F%BA%E7%A1%80