Algorithm Idea
Most models we have met so far are fit by solving $\hat\theta=\arg\max_\theta P(X,y|\theta)$. In models with latent variables, however, this expression has no tractable closed-form solution, so we solve it with an iterative procedure instead; that iterative procedure is the EM algorithm.
Algorithm Flow
- Choose an initial parameter $\theta^{(0)}$
- E-step: compute $Q(\theta,\theta^{(i)})$:
$$Q(\theta, \theta^{(i)}) = \mathbb E_{P(Z|X,\theta^{(i)})}[\log P(X,Z|\theta)]$$
- M-step: solve $\theta^{(i+1)}=\arg\max_\theta Q(\theta,\theta^{(i)})$
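These two steps can be sketched on the classic two-coin mixture, a hypothetical toy example not taken from this text: each trial flips one of two coins (unknown head probabilities) ten times, and the latent $Z$ is which coin was used. With the mixture weights fixed at $1/2$ for simplicity, the E-step computes the posterior over $Z$ and the M-step is a responsibility-weighted maximum-likelihood update:

```python
import numpy as np

# Hypothetical two-coin EM sketch: 200 trials, each records the number of
# heads in 10 flips of one of two coins with unknown head probabilities.
rng = np.random.default_rng(0)
true_p = np.array([0.8, 0.3])
which = rng.integers(0, 2, size=200)          # latent Z (hidden from EM)
heads = rng.binomial(10, true_p[which])       # observed heads per trial

p = np.array([0.6, 0.4])                      # initial guess theta^(0)
for _ in range(100):
    # E-step: posterior P(Z=k | x, theta^(i)) from binomial likelihoods
    like = np.stack([pk**heads * (1 - pk)**(10 - heads) for pk in p])
    resp = like / like.sum(axis=0)
    # M-step: maximize Q -> weighted MLE of each coin's head probability
    p = (resp * heads).sum(axis=1) / (resp * 10).sum(axis=1)

print(p)  # should end up near true_p (up to a label swap)
```

The mixture weights are held fixed here only to keep the sketch short; the full GMM implementation at the end of this post updates them as well.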
Proof of Convergence
We only need to show that $P(X|\theta^{(i+1)}) \ge P(X|\theta^{(i)})$.
$$\begin{aligned} P(Z|X,\theta)\,P(X|\theta) &= P(X,Z|\theta)\\ \log P(X|\theta) &= \log P(X,Z|\theta) - \log P(Z|X,\theta)\\ \mathbb E_{P(Z|X,\theta^{(i)})} \log P(X|\theta) &= \mathbb E_{P(Z|X,\theta^{(i)})} \log P(X,Z|\theta) - \mathbb E_{P(Z|X,\theta^{(i)})} \log P(Z|X,\theta)\\ \log P(X|\theta) &= \mathbb E_{P(Z|X,\theta^{(i)})} \log P(X,Z|\theta) - \mathbb E_{P(Z|X,\theta^{(i)})} \log P(Z|X,\theta)\\ \log P(X|\theta) &= Q(\theta, \theta^{(i)}) - \mathbb E_{P(Z|X,\theta^{(i)})} \log P(Z|X,\theta) \end{aligned}$$
(The left-hand side does not depend on $Z$, so taking the expectation leaves it unchanged.)
Therefore the difference
$$D\_value = \log P(X|\theta^{(i+1)})-\log P(X|\theta^{(i)})$$
evaluates to:
$$\begin{aligned} D\_value &= Q(\theta^{(i+1)},\theta^{(i)}) - Q(\theta^{(i)},\theta^{(i)}) + \mathbb E_{P(Z|X,\theta^{(i)})}\log \frac{P(Z|X, \theta^{(i)})}{P(Z|X, \theta^{(i+1)})}\\ &= Q(\theta^{(i+1)},\theta^{(i)}) - Q(\theta^{(i)},\theta^{(i)}) + D_{KL}\left(P(Z|X,\theta^{(i)})\,\|\, P(Z|X,\theta^{(i+1)})\right) \end{aligned}$$
Since $Q(\theta^{(i+1)},\theta^{(i)})=\max_\theta Q(\theta,\theta^{(i)}) \ge Q(\theta^{(i)},\theta^{(i)})$ and the KL divergence is non-negative, we get $D\_value\ge 0$: the likelihood never decreases, so the algorithm converges.
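This argument is easy to check numerically: run EM and watch the log-likelihood, which should never decrease between iterations. Below is a self-contained sketch on a small 1-D Gaussian mixture (with made-up parameters, independent of the implementation in the last section):

```python
import numpy as np

# Sample a two-component 1-D Gaussian mixture (made-up parameters).
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 700)])

mus, sigmas, alphas = np.array([1.0, 2.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
ll_history = []
for _ in range(50):
    # Weighted component densities at every data point, shape (2, n).
    dens = alphas[:, None] * np.exp(-(data - mus[:, None])**2 / (2 * sigmas[:, None]**2)) \
           / (np.sqrt(2 * np.pi) * sigmas[:, None])
    ll_history.append(np.log(dens.sum(axis=0)).sum())  # log P(X|theta^(i))
    resp = dens / dens.sum(axis=0)                     # E-step: posteriors
    nk = resp.sum(axis=1)                              # M-step: weighted MLE updates
    mus = (resp * data).sum(axis=1) / nk
    sigmas = np.sqrt((resp * (data - mus[:, None])**2).sum(axis=1) / nk)
    alphas = nk / len(data)

diffs = np.diff(ll_history)
print(diffs.min())  # D_value at every step; should be >= 0 up to float error
```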
Derivation of the Algorithm
Although we now know the EM procedure, we have not yet seen how to derive it.
Following the derivation used for Bayesian neural networks, assume $Z\sim q(Z)$ and take the expectation of the identity $\log P(X|\theta) = \log P(X,Z|\theta) - \log P(Z|X,\theta)$ with respect to $q(Z)$:
$$\begin{aligned} \log P(X|\theta) &=\log P(X,Z|\theta) - \log P(Z|X,\theta)\\ &= \log \frac{P(X,Z|\theta)}{q(Z)} - \log \frac{P(Z|X,\theta)}{q(Z)}\\ &= \mathbb E_{q(Z)}\log \frac{P(X,Z|\theta)}{q(Z)}-D_{KL} (q(Z)\|P(Z|X,\theta)) \end{aligned}$$
We know that when $q(Z)=P(Z|X,\theta)$, the KL term vanishes: $D_{KL}=0$. Setting $q(Z)=P(Z|X,\theta)$ gives:
$$\log P(X|\theta) =\mathbb E_{P(Z|X,\theta)}\log \frac{P(X,Z|\theta)}{P(Z|X,\theta)}$$
But this step is still intractable, because the posterior on the right-hand side depends on the unknown $\theta$, so we adopt an iterative strategy:
$$\log P(X|\theta) \ge \mathbb E_{P(Z|X,\theta^{(i)})}\log \frac{P(X,Z|\theta)}{P(Z|X,\theta^{(i)})}$$
with equality at $\theta=\theta^{(i)}$, since freezing the posterior at the fixed distribution $P(Z|X,\theta^{(i)})$ turns the identity into the lower bound from the decomposition above. The denominator inside the $\log$ does not depend on $\theta$, so it can be dropped from the maximization, leaving exactly $Q(\theta,\theta^{(i)})$. This derives the EM algorithm.
Code Implementation
import numpy as np
import random

def load_data(mus, sigmas, alphas, length):
    """Sample `length` points from a 1-D Gaussian mixture."""
    assert len(mus) == len(sigmas) and len(mus) == len(alphas)
    lengths = [round(length * alpha) for alpha in alphas]
    data = []
    for mu, sigma, n in zip(mus, sigmas, lengths):
        data.extend(np.random.normal(mu, sigma, n))
    random.shuffle(data)
    return np.array(data)

def gaussian(data, mu, sigma):
    """Normal density N(mu, sigma^2) evaluated at each element of `data`."""
    return np.exp(-(data - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def train(data, mus, sigmas, alphas, epochs):
    for epoch in range(epochs):
        # E-step: responsibilities gamma_k = alpha_k N(x|mu_k, sigma_k) / sum_j (...)
        gammas, gamma_sum = [], 0
        for alpha, mu, sigma in zip(alphas, mus, sigmas):
            gamma = alpha * gaussian(data, mu, sigma)
            gammas.append(gamma)
            gamma_sum += gamma
        # M-step: responsibility-weighted maximum-likelihood updates
        for i in range(len(mus)):
            gammas[i] /= gamma_sum
            mus[i] = np.dot(gammas[i], data) / np.sum(gammas[i])
            sigmas[i] = np.sqrt(np.dot(gammas[i], (data - mus[i])**2) / np.sum(gammas[i]))
            alphas[i] = np.sum(gammas[i]) / len(gammas[i])
    return mus, sigmas, alphas

if __name__ == '__main__':
    mus, sigmas, alphas = [0.7, 0.5], [1, 2], [0.3, 0.7]
    data = load_data(mus, sigmas, alphas, 10000)
    mus, sigmas, alphas = train(data, [0, 1], [1, 1], [0.5, 0.5], 10000)
    print(mus, sigmas, alphas)