期望最大化(EM)算法
1.前言
概率模型有时候既含有观测变量,又含有隐变量。只含有观测变量的情况下,直接对观测值进行极大似然估计便能够求出参数;比如抛一枚不均匀硬币n次,极大似然估计能够求解出正反面分别出现的概率。在含有隐变量的情况下,无法通过极大似然估计求得;比如手中有三枚不均匀硬币,先从中选取一枚硬币,然后再抛,得到的正反面为观测值;如果直接用极大似然估计,无法体现选择硬币的过程,错误地将三枚硬币视为同一枚硬币。本文要介绍的期望最大化算法便是用来求解含有隐变量的概率模型。
2.算法导出
EM算法实质是极大似然估计的变型,是由极大化似然函数导出的。回到三硬币的例子,将选择硬币的事件记为隐变量Z,所求参数记为
θ
\theta
θ,观测值硬币正反面记为
Y
Y
Y,于是似然函数为
L
(
θ
)
=
log
P
(
Y
∣
θ
)
L(\theta)=\log P(Y|\theta)
L(θ)=logP(Y∣θ)
上式中不包含隐变量Z,于是引入隐变量Z可得
L
(
θ
)
=
log
P
(
Y
∣
θ
)
=
log
∑
Z
P
(
Y
,
Z
∣
θ
)
L(\theta) = \log P(Y|\theta) = \log \sum_Z P(Y,Z|\theta)
L(θ)=logP(Y∣θ)=logZ∑P(Y,Z∣θ)
上式极大化无法直接通过偏导求出,可采用迭代法求解,假设当前参数值为
θ
i
\theta_i
θi,迭代法希望每一步
θ
\theta
θ都能够使得
L
(
θ
)
−
L
(
θ
i
)
L(\theta) - L(\theta_i)
L(θ)−L(θi)最大化,即
L
(
θ
)
−
L
(
θ
i
)
=
log
∑
Z
P
(
Y
,
Z
∣
θ
)
−
log
P
(
Y
∣
θ
i
)
=
log
∑
Z
P
(
Y
,
Z
∣
θ
)
P
(
Y
∣
θ
i
)
=
log
∑
Z
P
(
Z
∣
Y
,
θ
i
)
P
(
Y
,
Z
∣
θ
)
P
(
Y
∣
θ
i
)
P
(
Z
∣
Y
,
θ
i
)
\begin{aligned} L(\theta) - L(\theta_i) &= \log \sum_Z P(Y,Z|\theta) - \log P(Y|\theta_i)\\ & = \log \sum_Z \frac{P(Y,Z|\theta)}{P(Y|\theta_i)}\\ & = \log \sum_Z P(Z|Y,\theta_i) \frac{ P(Y,Z|\theta)}{P(Y|\theta_i)P(Z|Y,\theta_i)} \end{aligned}
L(θ)−L(θi)=logZ∑P(Y,Z∣θ)−logP(Y∣θi)=logZ∑P(Y∣θi)P(Y,Z∣θ)=logZ∑P(Z∣Y,θi)P(Y∣θi)P(Z∣Y,θi)P(Y,Z∣θ)
又因为log函数是凹函数并且
∑
Z
P
(
Z
∣
Y
,
θ
i
)
=
1
\sum_Z P(Z|Y,\theta_i)=1
∑ZP(Z∣Y,θi)=1,由琴生不等式可得
L
(
θ
)
−
L
(
θ
i
)
=
log
∑
Z
P
(
Z
∣
Y
,
θ
i
)
P
(
Y
,
Z
∣
θ
)
P
(
Y
∣
θ
i
)
P
(
Z
∣
Y
,
θ
i
)
≥
∑
Z
P
(
Z
∣
Y
,
θ
i
)
log
P
(
Y
,
Z
∣
θ
)
P
(
Y
∣
θ
i
)
P
(
Z
∣
Y
,
θ
i
)
=
∑
Z
P
(
Z
∣
Y
,
θ
i
)
log
P
(
Y
,
Z
∣
θ
)
P
(
Y
,
Z
∣
θ
i
)
=
∑
Z
P
(
Z
∣
Y
,
θ
i
)
log
P
(
Y
,
Z
∣
θ
)
−
∑
Z
P
(
Z
∣
Y
,
θ
i
)
log
P
(
Y
,
Z
∣
θ
i
)
=
△
B
(
θ
,
θ
i
)
\begin{aligned} L(\theta) - L(\theta_i) &= \log \sum_Z P(Z|Y,\theta_i) \frac{ P(Y,Z|\theta)}{P(Y|\theta_i)P(Z|Y,\theta_i)}\\ & \ge \sum_Z P(Z|Y,\theta_i) \log \frac{ P(Y,Z|\theta)}{P(Y|\theta_i)P(Z|Y,\theta_i)}\\ & = \sum_Z P(Z|Y,\theta_i) \log \frac{ P(Y,Z|\theta)}{P(Y,Z|\theta_i)}\\ & = \sum_Z P(Z|Y,\theta_i) \log P(Y,Z|\theta) - \sum_Z P(Z|Y,\theta_i) \log P(Y,Z|\theta_i) \\ & \overset \triangle = B(\theta,\theta_i) \end{aligned}
L(θ)−L(θi)=logZ∑P(Z∣Y,θi)P(Y∣θi)P(Z∣Y,θi)P(Y,Z∣θ)≥Z∑P(Z∣Y,θi)logP(Y∣θi)P(Z∣Y,θi)P(Y,Z∣θ)=Z∑P(Z∣Y,θi)logP(Y,Z∣θi)P(Y,Z∣θ)=Z∑P(Z∣Y,θi)logP(Y,Z∣θ)−Z∑P(Z∣Y,θi)logP(Y,Z∣θi)=△B(θ,θi)
所以,
B
(
θ
,
θ
i
)
B(\theta,\theta_i)
B(θ,θi)是
L
(
θ
)
−
L
(
θ
i
)
L(\theta) - L(\theta_i)
L(θ)−L(θi)的下界,即最大化下界等价于最大化原函数
θ
=
arg
max
θ
L
(
θ
)
−
L
(
θ
i
)
=
arg
max
θ
B
(
θ
,
θ
i
)
=
arg
max
θ
∑
Z
P
(
Z
∣
Y
,
θ
i
)
log
P
(
Y
,
Z
∣
θ
)
−
∑
Z
P
(
Z
∣
Y
,
θ
i
)
log
P
(
Y
,
Z
∣
θ
i
)
\begin{aligned} \theta &= \arg \max_\theta L(\theta)-L(\theta_i)\\ & = \arg \max_\theta B(\theta, \theta_i)\\ & = \arg \max_\theta \sum_Z P(Z|Y,\theta_i) \log P(Y,Z|\theta) - \sum_Z P(Z|Y,\theta_i) \log P(Y,Z|\theta_i) \end{aligned}
θ=argθmaxL(θ)−L(θi)=argθmaxB(θ,θi)=argθmaxZ∑P(Z∣Y,θi)logP(Y,Z∣θ)−Z∑P(Z∣Y,θi)logP(Y,Z∣θi)
又因为上式中,后项与因变量
θ
\theta
θ无关,可舍去,即
θ
=
arg
max
θ
∑
Z
P
(
Z
∣
Y
,
θ
i
)
log
P
(
Y
,
Z
∣
θ
)
=
arg
max
θ
E
Z
[
log
P
(
Y
,
Z
∣
θ
)
∣
Y
,
θ
i
]
=
△
arg
max
θ
Q
(
θ
,
θ
i
)
\begin{aligned} \theta &= \arg \max_\theta \sum_Z P(Z|Y,\theta_i) \log P(Y,Z|\theta)\\ & = \arg \max_\theta E_Z[\log P(Y,Z|\theta)|Y,\theta_i]\\ & \overset \triangle = \arg \max_\theta Q(\theta, \theta_i) \end{aligned}
θ=argθmaxZ∑P(Z∣Y,θi)logP(Y,Z∣θ)=argθmaxEZ[logP(Y,Z∣θ)∣Y,θi]=△argθmaxQ(θ,θi)
于是极大似然估计等价于最大化
Q
(
θ
,
θ
i
)
Q(\theta, \theta_i)
Q(θ,θi)函数。
综上所述,导出期望最大化算法:
-
第一步(E步):求解期望Q函数
Q ( θ , θ i ) = E Z [ log P ( Y , Z ∣ θ ) ∣ Y , θ i ] Q(\theta, \theta_i) = E_Z[\log P(Y,Z|\theta)|Y,\theta_i] Q(θ,θi)=EZ[logP(Y,Z∣θ)∣Y,θi] -
第二步(M步):最大化Q函数
3.高斯混合模型
接下来举例介绍下EM算法应该如何使用,高斯分布是常用的概率分布,但是在实际问题中有些概率模型并不是单一高斯分布能够表示。比如说身高问题,我们可以假设身高满足单一高斯分布状态,两头少中间多,但身高跟性别有一定关系,更好地假设是,男生女生身高各满足一个不同的高斯分布。假设男女生出现的概率是 α i , i = 0 , 1 \alpha_i, i=0,1 αi,i=0,1,并且身高分别满足高斯分布 φ ( μ i , σ i ) , i = 0 , 1 \varphi(\mu_i,\sigma_i),i=0,1 φ(μi,σi),i=0,1,那么一个人身高为x的概率是 P ( x ) = ∑ i = 0 1 α i φ ( μ i , σ i ) P(x)=\sum_{i=0}^1 \alpha_i \varphi(\mu_i,\sigma_i) P(x)=∑i=01αiφ(μi,σi)。这样的概率分布就是混合高斯分布。
混合高斯分布的一般形式是
P
(
Y
∣
α
,
μ
,
σ
)
=
∑
k
=
1
K
α
k
φ
(
μ
k
,
σ
k
)
P(Y|\alpha,\mu, \sigma) = \sum_{k=1}^K \alpha_k \varphi(\mu_k, \sigma_k)
P(Y∣α,μ,σ)=k=1∑Kαkφ(μk,σk)
假设观测值集合为Y,使用EM算法求解高斯混合模型:
1.求解Q函数(E步)
混合高斯分布可以视为两步,第一步以一定概率选择某一高斯分布,第二步根据该高斯分布确定概率。观测值记为Y,隐变量为选择哪一种高斯分布,记为
γ
\gamma
γ,且定义
γ
j
k
=
{
1
,
第
j
个
样
本
选
择
第
k
个
高
斯
分
布
0
,
否
则
\gamma_{jk} = \begin{cases} 1, &第j个样本选择第k个高斯分布\\ 0, &否则 \end{cases}
γjk={1,0,第j个样本选择第k个高斯分布否则
完整似然函数为
P
(
Y
,
Γ
∣
θ
)
=
∏
j
=
1
N
P
(
y
j
,
γ
j
1
,
γ
j
2
,
…
γ
j
K
∣
θ
)
=
∏
j
=
1
N
∏
k
=
1
K
[
α
k
φ
(
μ
k
,
σ
k
)
]
γ
j
k
=
∏
j
=
1
N
∏
k
=
1
K
[
α
k
1
2
π
δ
k
e
−
(
y
j
−
μ
k
)
2
2
δ
k
2
]
γ
j
k
\begin{aligned} P(Y,\Gamma | \theta) &= \prod_{j=1}^N P(y_j, \gamma_{j1}, \gamma_{j2}, \dots \gamma_{jK} | \theta)\\ & = \prod_{j=1}^N \prod_{k=1}^K [\alpha_k \varphi(\mu_k, \sigma_k)]^{\gamma_{jk}}\\ & = \prod_{j=1}^N \prod_{k=1}^K [\alpha_k \frac{1}{\sqrt{2\pi}\delta_k} e^{-\frac{(y_j-\mu_k)^2}{2\delta_k^2}}]^{\gamma_{jk}} \end{aligned}
P(Y,Γ∣θ)=j=1∏NP(yj,γj1,γj2,…γjK∣θ)=j=1∏Nk=1∏K[αkφ(μk,σk)]γjk=j=1∏Nk=1∏K[αk2πδk1e−2δk2(yj−μk)2]γjk
取对数得
log
P
(
Y
,
Γ
∣
θ
)
=
∑
j
=
1
N
∑
k
=
1
K
γ
j
k
log
[
α
k
1
2
π
δ
k
e
−
(
y
j
−
μ
k
)
2
2
δ
k
2
]
=
∑
j
=
1
N
∑
k
=
1
K
γ
j
k
[
log
α
k
−
log
(
2
π
)
−
log
δ
k
−
(
y
j
−
μ
k
)
2
2
δ
k
2
]
\begin{aligned} \log P(Y,\Gamma | \theta) &= \sum_{j=1}^N \sum_{k=1}^K \gamma_{jk} \log [\alpha_k \frac{1}{\sqrt{2\pi}\delta_k} e^{-\frac{(y_j-\mu_k)^2}{2\delta_k^2}}]\\ & = \sum_{j=1}^N \sum_{k=1}^K \gamma_{jk} [ \log \alpha_k - \log(\sqrt{2\pi}) - \log \delta_k - \frac{(y_j-\mu_k)^2}{2\delta_k^2}] \end{aligned}
logP(Y,Γ∣θ)=j=1∑Nk=1∑Kγjklog[αk2πδk1e−2δk2(yj−μk)2]=j=1∑Nk=1∑Kγjk[logαk−log(2π)−logδk−2δk2(yj−μk)2]
于是,
Q
(
θ
,
θ
i
)
=
E
Γ
[
log
P
(
Y
,
Γ
∣
θ
)
∣
Y
,
θ
i
]
=
∑
j
=
1
N
∑
k
=
1
K
E
[
γ
j
k
∣
Y
,
θ
i
]
[
log
α
k
−
log
(
2
π
)
−
log
δ
k
−
(
y
i
−
μ
k
)
2
2
δ
k
2
]
\begin{aligned} Q(\theta, \theta_i) &= E_{\Gamma} [\log P(Y,\Gamma | \theta) | Y,\theta_i]\\ & = \sum_{j=1}^N \sum_{k=1}^K E[\gamma_{jk}|Y,\theta_i] [ \log \alpha_k - \log(\sqrt{2\pi}) - \log \delta_k - \frac{(y_i-\mu_k)^2}{2\delta_k^2}] \end{aligned}
Q(θ,θi)=EΓ[logP(Y,Γ∣θ)∣Y,θi]=j=1∑Nk=1∑KE[γjk∣Y,θi][logαk−log(2π)−logδk−2δk2(yi−μk)2]
上式中需要计算
E
[
γ
j
k
∣
Y
,
θ
i
]
E[\gamma_{jk}|Y,\theta_i]
E[γjk∣Y,θi],记为
γ
^
j
k
\hat{\gamma}_{jk}
γ^jk,即有
γ
^
j
k
=
E
[
γ
j
k
∣
Y
,
θ
i
]
=
P
(
γ
j
k
=
1
∣
Y
,
θ
i
)
\begin{aligned} \hat{\gamma}_{jk} &= E[\gamma_{jk}|Y,\theta_i]\\ & = P(\gamma_{jk}=1|Y,\theta_i)\\ \end{aligned}
γ^jk=E[γjk∣Y,θi]=P(γjk=1∣Y,θi)
由贝叶斯公式得
P
(
γ
j
k
=
1
∣
Y
,
θ
i
)
=
P
(
y
j
∣
γ
j
k
=
1
,
θ
i
)
P
(
γ
j
k
=
1
∣
θ
i
)
∑
k
=
1
K
P
(
y
j
∣
γ
j
k
=
1
,
θ
i
)
P
(
γ
j
k
=
1
∣
θ
i
)
=
φ
(
y
j
∣
θ
k
)
α
k
∑
k
=
1
K
φ
(
y
j
∣
θ
k
)
α
k
\begin{aligned} P(\gamma_{jk}=1|Y,\theta_i) &= \frac{P( y_j|\gamma_{jk}=1,\theta_i)P(\gamma_{jk}=1|\theta_i)}{\sum_{k=1}^K P( y_j|\gamma_{jk}=1,\theta_i)P(\gamma_{jk}=1|\theta_i)}\\ & = \frac{\varphi(y_j|\theta_k) \alpha_k}{\sum_{k=1}^K \varphi(y_j|\theta_k) \alpha_k} \end{aligned}
P(γjk=1∣Y,θi)=∑k=1KP(yj∣γjk=1,θi)P(γjk=1∣θi)P(yj∣γjk=1,θi)P(γjk=1∣θi)=∑k=1Kφ(yj∣θk)αkφ(yj∣θk)αk
于是
γ
j
k
^
\hat{\gamma_{jk}}
γjk^表示响应度,即对于第j个样本数据,属于第k个高斯分布的概率,可以由上一步参数来进行计算。
于是有,
Q
(
θ
,
θ
i
)
=
∑
j
=
1
N
∑
k
=
1
K
γ
j
k
^
[
log
α
k
−
log
(
2
π
)
−
log
δ
k
−
(
y
j
−
μ
k
)
2
2
δ
k
2
]
\begin{aligned} Q(\theta, \theta_i) &= \sum_{j=1}^N \sum_{k=1}^K \hat{\gamma_{jk}} [ \log \alpha_k - \log(\sqrt{2\pi}) - \log \delta_k - \frac{(y_j-\mu_k)^2}{2\delta_k^2}] \end{aligned}
Q(θ,θi)=j=1∑Nk=1∑Kγjk^[logαk−log(2π)−logδk−2δk2(yj−μk)2]
2.求最大化操作(M步)
Q函数对
μ
,
δ
\mu,\delta
μ,δ分别求偏导可得
∂
Q
∂
μ
k
=
∑
j
=
1
N
γ
^
j
k
y
j
−
μ
k
δ
k
2
\frac{\partial Q}{\partial \mu_k} = \sum_{j=1}^N \hat{\gamma}_{jk} \frac{y_j-\mu_k}{\delta_k^2}
∂μk∂Q=j=1∑Nγ^jkδk2yj−μk
∂ Q ∂ δ k = ∑ j = 1 N γ ^ j k [ − 1 δ k + ( y j − μ k ) 2 δ k 3 ] \frac{\partial Q}{\partial \delta_k} = \sum_{j=1}^N \hat{\gamma}_{jk} [-\frac{1}{\delta_k} + \frac{(y_j-\mu_k)^2}{\delta_k^3}] ∂δk∂Q=j=1∑Nγ^jk[−δk1+δk3(yj−μk)2]
将以上两式得零可以推导出
μ
^
k
=
∑
j
=
1
N
γ
^
j
k
⋅
y
j
∑
j
=
1
N
γ
^
j
k
,
k
=
1
,
2
,
…
,
K
\hat{\mu}_k = \frac{\sum_{j=1}^N \hat{\gamma}_{jk} \cdot y_j}{\sum_{j=1}^N \hat{\gamma}_{jk}}, k=1,2,\dots,K
μ^k=∑j=1Nγ^jk∑j=1Nγ^jk⋅yj,k=1,2,…,K
δ ^ k 2 = ∑ j = 1 N γ ^ j k ⋅ ( y j − μ k ) 2 ∑ j = 1 N γ ^ j k , k = 1 , 2 , … , K \hat{\delta}_k^2 = \frac{\sum_{j=1}^N \hat{\gamma}_{jk} \cdot (y_j - \mu_k)^2}{\sum_{j=1}^N \hat{\gamma}_{jk}}, k=1,2,\dots,K δ^k2=∑j=1Nγ^jk∑j=1Nγ^jk⋅(yj−μk)2,k=1,2,…,K
结合约束条件
∑
k
=
1
K
α
k
=
1
\sum_{k=1}^K \alpha_k = 1
∑k=1Kαk=1,求Q函数拉格朗日函数有
L
(
λ
,
α
)
=
Q
(
α
)
+
λ
⋅
(
∑
k
=
1
K
α
k
−
1
)
L(\lambda, \alpha) = Q(\alpha) + \lambda \cdot (\sum_{k=1}^K \alpha_k - 1)
L(λ,α)=Q(α)+λ⋅(k=1∑Kαk−1)
分别对
α
,
λ
\alpha,\lambda
α,λ求偏导得
∂
L
∂
α
k
=
∑
j
=
1
N
γ
^
j
k
1
α
k
+
λ
,
k
=
1
,
2
,
…
,
K
\frac{\partial L}{\partial \alpha_k} = \sum_{j=1}^N \hat{\gamma}_{jk} \frac{1}{\alpha_k} + \lambda, \space k=1,2,\dots,K
∂αk∂L=j=1∑Nγ^jkαk1+λ, k=1,2,…,K
∂
L
∂
λ
=
∑
k
=
1
K
α
k
−
1
\frac{\partial L}{\partial \lambda} = \sum_{k=1}^K \alpha_k - 1
∂λ∂L=k=1∑Kαk−1
令上两式为零,可得
α
^
k
=
∑
j
=
1
N
γ
^
j
k
∑
j
=
1
N
∑
k
=
1
K
γ
^
j
k
=
∑
j
=
1
N
γ
^
j
k
N
,
k
=
1
,
2
,
…
,
K
\hat{\alpha}_k = \frac{\sum_{j=1}^N \hat{\gamma}_{jk}}{\sum_{j=1}^N \sum_{k=1}^K \hat{\gamma}_{jk}} = \frac{\sum_{j=1}^N \hat{\gamma}_{jk}}{N}, k=1,2,\dots,K
α^k=∑j=1N∑k=1Kγ^jk∑j=1Nγ^jk=N∑j=1Nγ^jk,k=1,2,…,K
综上所述,高斯混合模型(GMM)模型算法流程如下:
-
1.初始化各参数
-
2.按照上一次参数更新响应度
γ ^ j k = φ ( y j ∣ θ k ) α k ∑ k = 1 K φ ( y j ∣ θ k ) α k , k = 1 , 2 , … , K ; j = 1 , 2 , … , N \hat{\gamma}_{jk} = \frac{\varphi(y_j|\theta_k) \alpha_k}{\sum_{k=1}^K \varphi(y_j|\theta_k) \alpha_k} , k=1,2,\dots,K ;j=1,2,\dots,N γ^jk=∑k=1Kφ(yj∣θk)αkφ(yj∣θk)αk,k=1,2,…,K;j=1,2,…,N -
3.更新模型参数
μ ^ k = ∑ j = 1 N γ ^ j k ⋅ y j ∑ j = 1 N γ ^ j k , k = 1 , 2 , … , K \hat{\mu}_k = \frac{\sum_{j=1}^N \hat{\gamma}_{jk} \cdot y_j}{\sum_{j=1}^N \hat{\gamma}_{jk}}, k=1,2,\dots,K μ^k=∑j=1Nγ^jk∑j=1Nγ^jk⋅yj,k=1,2,…,K
δ ^ k 2 = ∑ j = 1 N γ ^ j k ⋅ ( y j − μ k ) 2 ∑ j = 1 N γ ^ j k , k = 1 , 2 , … , K \hat{\delta}_k^2 = \frac{\sum_{j=1}^N \hat{\gamma}_{jk} \cdot (y_j - \mu_k)^2}{\sum_{j=1}^N \hat{\gamma}_{jk}}, k=1,2,\dots,K δ^k2=∑j=1Nγ^jk∑j=1Nγ^jk⋅(yj−μk)2,k=1,2,…,K
α ^ k = ∑ j = 1 N γ ^ j k ∑ j = 1 N ∑ k = 1 K γ ^ j k = ∑ j = 1 N γ ^ j k N , k = 1 , 2 , … , K \hat{\alpha}_k = \frac{\sum_{j=1}^N \hat{\gamma}_{jk}}{\sum_{j=1}^N \sum_{k=1}^K \hat{\gamma}_{jk}} = \frac{\sum_{j=1}^N \hat{\gamma}_{jk}}{N}, k=1,2,\dots,K α^k=∑j=1N∑k=1Kγ^jk∑j=1Nγ^jk=N∑j=1Nγ^jk,k=1,2,…,K
-
4.重复上述过程直至收敛
4.参考资料
- 统计学习方法 - 李航