Introduction
A probabilistic model aims to maximize the conditional distribution of the labels given the features, $P(y \mid X; \theta)$. In general, given sample labels $y$ and features $X$, the model parameters $\theta$ can be estimated directly by maximum likelihood estimation or Bayesian estimation.
However, when the label $y$ is an unobservable hidden variable, direct maximum likelihood or Bayesian estimation no longer applies. Instead, the expectation-maximization (EM) algorithm is used to obtain the maximum likelihood estimate of the model parameters.
Derivation of the EM Algorithm
For a probabilistic model with hidden variables, the goal becomes maximizing the log-likelihood of the observed (incomplete) data $X$ with respect to the model parameters $\theta$:
$$\hat{\theta} = \argmax_{\theta} L(\theta) \tag{1}$$
$$L(\theta) = \log P(X \mid \theta) = \log\Big[ \sum_{Y} P(X, Y \mid \theta) \Big] = \log\Big[ \sum_{Y} P(X \mid Y, \theta)\,P(Y \mid \theta) \Big] \tag{2}$$
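As a concrete illustration of Eq. (2), the marginal log-likelihood can be computed numerically for a small model. The sketch below assumes a hypothetical two-component 1D Gaussian mixture with latent label $Y \in \{0, 1\}$; all names and numbers are illustrative, not taken from the derivation above.

```python
import numpy as np

# Hypothetical toy model (an illustration, not from the derivation):
# a two-component 1D Gaussian mixture where Y in {0, 1} is the latent label.
# Eq. (2): L(theta) = log P(X|theta) = log sum_Y P(X|Y,theta) P(Y|theta)

def log_likelihood(X, weights, means, sigmas):
    # P(X | Y=k, theta): Gaussian density of each point under component k
    dens = np.exp(-0.5 * ((X[:, None] - means) / sigmas) ** 2) \
           / (sigmas * np.sqrt(2.0 * np.pi))          # shape (n, K)
    # Marginalize out the latent label: sum_k P(X | Y=k, theta) P(Y=k | theta)
    marginal = dens @ weights                         # shape (n,)
    return np.log(marginal).sum()

X = np.array([-1.2, -0.8, 0.9, 1.1])
ll = log_likelihood(X, np.array([0.5, 0.5]),
                    np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
print(ll)
```

Summing the latent variable out *inside* the logarithm is exactly what makes Eq. (2) hard to maximize directly, which motivates the iterative scheme below.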
The EM algorithm maximizes $L(\theta)$ approximately through iteration. Suppose the estimate of $\theta$ after the $i$-th iteration is $\theta^{(i)}$; the change in the log-likelihood between two successive iterations, $\Delta L(\theta)$, is:
$$
\begin{aligned}
\Delta L(\theta) = L(\theta) - L(\theta^{(i)}) &= \log\Big[ \sum_{Y} P(X \mid Y, \theta)\,P(Y \mid \theta) \Big] - \log P(X \mid \theta^{(i)}) \\
&= \log\Big[ \sum_{Y} P(X \mid Y, \theta)\,P(Y \mid \theta) \Big] - \log\Big[ \sum_{Y} P(X, Y \mid \theta^{(i)}) \Big] \\
&= \log\Big[ \sum_{Y} P(X \mid Y, \theta)\,P(Y \mid \theta) \Big] - \log\Big[ \sum_{Y} P(Y \mid X, \theta^{(i)})\,P(X \mid \theta^{(i)}) \Big]
\end{aligned} \tag{3}
$$
Applying Jensen's inequality¹ yields a lower bound on $\Delta L(\theta)$:
$$
\begin{aligned}
L(\theta) - L(\theta^{(i)}) &= \log\Big[ \sum_{Y} P(X \mid Y, \theta)\,P(Y \mid \theta) \Big] - \log\Big[ \sum_{Y} P(Y \mid X, \theta^{(i)})\,P(X \mid \theta^{(i)}) \Big] \\
&\ge \sum_{Y}\Big[ P(Y \mid X, \theta^{(i)}) \cdot \log \frac{P(X \mid Y, \theta)\,P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)})} \Big] - \sum_{Y}\Big[ P(Y \mid X, \theta^{(i)}) \cdot \log P(X \mid \theta^{(i)}) \Big]
\end{aligned} \tag{4}
$$
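The Jensen step above relies on $\log\big[\sum_j \lambda_j y_j\big] \ge \sum_j \lambda_j \log y_j$ for weights $\lambda_j \ge 0$ with $\sum_j \lambda_j = 1$, with $\lambda_j$ played by $P(Y \mid X, \theta^{(i)})$. A quick numerical sanity check (the random weights and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = rng.dirichlet(np.ones(5))        # lambda_j >= 0 and sum to 1
y = rng.uniform(0.1, 10.0, size=5)     # positive y_j so the log is defined

lhs = np.log(np.sum(lam * y))          # log of the convex combination
rhs = np.sum(lam * np.log(y))          # convex combination of the logs
assert lhs >= rhs                      # Jensen's inequality for the concave log
print(lhs, rhs)
```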
After combining like terms, the right-hand side of Eq. (4),
$$\sum_{Y}\Big[ P(Y \mid X, \theta^{(i)}) \cdot \log \frac{P(X \mid Y, \theta)\,P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)})} \Big] - \sum_{Y}\Big[ P(Y \mid X, \theta^{(i)}) \cdot \log P(X \mid \theta^{(i)}) \Big],$$
can be simplified to:
$$\sum_{Y} P(Y \mid X, \theta^{(i)}) \cdot \log \frac{P(X \mid Y, \theta)\,P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)})\,P(X \mid \theta^{(i)})} \tag{5}$$
Thus Eq. (4) can be rewritten as:
$$L(\theta) - L(\theta^{(i)}) \ge \sum_{Y} P(Y \mid X, \theta^{(i)}) \cdot \log \frac{P(X \mid Y, \theta)\,P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)})\,P(X \mid \theta^{(i)})} \tag{6}$$
Let
$$B(\theta, \theta^{(i)}) = L(\theta^{(i)}) + \sum_{Y}\Big[ P(Y \mid X, \theta^{(i)}) \cdot \log \frac{P(X \mid Y, \theta)\,P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)})\,P(X \mid \theta^{(i)})} \Big] \tag{7}$$
Then
$$L(\theta) \ge B(\theta, \theta^{(i)}) \tag{8}$$
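Eq. (8) can be checked numerically on a toy model. The sketch below assumes a hypothetical one-parameter mixture $0.5\,N(0,1) + 0.5\,N(\theta,1)$ (my own illustrative choice) and verifies both $L(\theta) \ge B(\theta, \theta^{(i)})$ over a grid of $\theta$ values and the equality $B(\theta^{(i)}, \theta^{(i)}) = L(\theta^{(i)})$:

```python
import numpy as np

# Hypothetical one-parameter toy model (illustration only): an equal-weight
# mixture of N(0, 1) and N(theta, 1); Y is the latent component label.

def gauss(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

def L(X, theta):
    # Observed-data log-likelihood, Eq. (2)
    return np.log(0.5 * gauss(X, 0.0) + 0.5 * gauss(X, theta)).sum()

def B(X, theta, theta_i):
    # Lower bound of Eq. (7); for i.i.d. data the sum over Y factorizes
    # into a per-point sum over the two component labels.
    joint_i = np.stack([0.5 * gauss(X, 0.0), 0.5 * gauss(X, theta_i)])
    marg_i = joint_i.sum(axis=0)             # P(x_n | theta^(i))
    resp = joint_i / marg_i                  # P(Y_n | x_n, theta^(i))
    joint = np.stack([0.5 * gauss(X, 0.0), 0.5 * gauss(X, theta)])
    return L(X, theta_i) + (resp * np.log(joint / (resp * marg_i))).sum()

X = np.array([-0.3, 0.1, 1.8, 2.2, 2.5])
theta_i = 1.0
for theta in np.linspace(-2.0, 4.0, 25):
    assert L(X, theta) >= B(X, theta, theta_i) - 1e-9       # Eq. (8)
assert abs(L(X, theta_i) - B(X, theta_i, theta_i)) < 1e-9   # bound is tight at theta^(i)
```

The second assertion is the tangency property used in the next step: the bound touches $L$ at the current estimate, so pushing the bound up pushes the likelihood up.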
Note that $B(\theta^{(i)}, \theta^{(i)}) = L(\theta^{(i)})$, since the logarithm's argument equals 1 at $\theta = \theta^{(i)}$. Therefore, any $\theta$ that increases $B(\theta, \theta^{(i)})$ above this value also increases $L(\theta)$ above $L(\theta^{(i)})$, so the parameter estimation can be recast as:
$$
\begin{aligned}
\hat{\theta} = \theta^{(i+1)} &= \argmax_{\theta} B(\theta, \theta^{(i)}) \\
&= \argmax_{\theta} \Big\{ L(\theta^{(i)}) + \sum_{Y}\Big[ P(Y \mid X, \theta^{(i)}) \cdot \log \frac{P(X \mid Y, \theta)\,P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)})\,P(X \mid \theta^{(i)})} \Big] \Big\}
\end{aligned} \tag{9}
$$
This expression is rather unwieldy, so we drop the terms that are constant with respect to $\theta$, namely $L(\theta^{(i)})$ and the factors $P(Y \mid X, \theta^{(i)})$ and $P(X \mid \theta^{(i)})$ inside the logarithm, since they do not affect the maximization. The maximum likelihood estimate of the parameters then simplifies to:
$$
\begin{aligned}
\hat{\theta} = \theta^{(i+1)} &= \argmax_{\theta} \sum_{Y}\Big[ P(Y \mid X, \theta^{(i)}) \cdot \log\big( P(X \mid Y, \theta)\,P(Y \mid \theta) \big) \Big] \\
&= \argmax_{\theta} \sum_{Y}\Big[ P(Y \mid X, \theta^{(i)}) \cdot \log P(X, Y \mid \theta) \Big]
\end{aligned} \tag{10}
$$
The sum $\sum_{Y}\big[ P(Y \mid X, \theta^{(i)}) \cdot \log P(X, Y \mid \theta) \big]$ in Eq. (10) is defined as the $Q$ function. It is the expectation of the complete-data log-likelihood $\log P(X, Y \mid \theta)$ with respect to the conditional distribution $P(Y \mid X, \theta^{(i)})$ of the unobserved data $Y$, given the observed data $X$ and the current parameters $\theta^{(i)}$, i.e. $E_{Y}\big[ \log P(X, Y \mid \theta) \mid X, \theta^{(i)} \big]$. Hence:
$$Q(\theta, \theta^{(i)}) = E_{Y}\big[ \log P(X, Y \mid \theta) \mid X, \theta^{(i)} \big] = \sum_{Y}\Big[ P(Y \mid X, \theta^{(i)}) \cdot \log P(X, Y \mid \theta) \Big] \tag{11}$$
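Putting the pieces together: one EM iteration computes $P(Y \mid X, \theta^{(i)})$ (the E-step) and then maximizes $Q(\theta, \theta^{(i)})$ over $\theta$ (the M-step). The sketch below implements this for a hypothetical two-component 1D Gaussian mixture with unit variances, where the M-step has a closed form specific to that model; the data and initial values are arbitrary illustrations:

```python
import numpy as np

# Hypothetical model (illustration only): two-component 1D Gaussian mixture
# with unit variances; theta = (mixing weights w, component means mu).

def loglik(X, w, mu):
    # Observed-data log-likelihood L(theta), Eq. (2)
    joint = np.stack([w[k] * np.exp(-0.5 * (X - mu[k]) ** 2)
                      / np.sqrt(2.0 * np.pi) for k in range(2)])
    return np.log(joint.sum(axis=0)).sum()

def em_step(X, w, mu):
    # E-step: responsibilities P(Y | X, theta^(i)) from Eq. (11)
    joint = np.stack([w[k] * np.exp(-0.5 * (X - mu[k]) ** 2)
                      / np.sqrt(2.0 * np.pi) for k in range(2)])
    resp = joint / joint.sum(axis=0)
    # M-step: argmax_theta Q(theta, theta^(i)) is closed-form for this model
    w_new = resp.sum(axis=1) / len(X)
    mu_new = (resp * X).sum(axis=1) / resp.sum(axis=1)
    return w_new, mu_new

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
w, mu = np.array([0.5, 0.5]), np.array([-0.5, 0.5])
prev = loglik(X, w, mu)
for _ in range(30):
    w, mu = em_step(X, w, mu)
    cur = loglik(X, w, mu)
    assert cur >= prev - 1e-9   # an EM iteration never decreases L(theta)
    prev = cur
print(mu)  # the estimated means should move toward the data's two modes
```

The monotonicity assertion inside the loop is exactly the guarantee derived above: each M-step increases $B(\theta, \theta^{(i)})$, hence $L(\theta)$ cannot decrease.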
1. Jensen's inequality: $\log\Big[ \sum_{j} \lambda_{j} y_{j} \Big] \ge \sum_{j} \lambda_{j} \log y_{j}$, where $\lambda_{j} \ge 0$ and $\sum_{j} \lambda_{j} = 1$. For more details, see the article: EM算法(Expectation Maximization).