Original by hxj7
This article presents the derivations of the EM algorithm and the Baum-Welch algorithm.
In general, the parameters of a probabilistic model can be estimated by maximum likelihood, i.e., by finding
$$\hat{\theta} = \mathrm{argmax}_\theta \, P(x|\theta)$$
Sometimes, to simplify the computation, the maximum log-likelihood is used in place of the maximum likelihood, i.e.,
$$\hat{\theta} = \mathrm{argmax}_\theta \, \log P(x|\theta)$$
Often this optimization has no analytical solution, or the analytical solution is too costly to compute; in that case an iterative method can be used: a stopping condition is set in advance, and the iteration halts once that condition is met.
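As a toy illustration of this iterate-until-convergence pattern, here is a minimal Python sketch that maximizes a Bernoulli log-likelihood by gradient ascent and stops once the improvement drops below a preset tolerance; the data, learning rate, and tolerance are all hypothetical choices, and the result can be checked against the closed-form MLE $k/n$:

```python
import numpy as np

# Hypothetical Bernoulli observations; the closed-form MLE is k/n.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
k, n = x.sum(), len(x)

def loglik(theta):
    return k * np.log(theta) + (n - k) * np.log(1 - theta)

theta, lr, tol = 0.5, 0.01, 1e-10
prev = loglik(theta)
while True:
    grad = k / theta - (n - k) / (1 - theta)      # d/dtheta of log-likelihood
    theta = np.clip(theta + lr * grad, 1e-6, 1 - 1e-6)
    cur = loglik(theta)
    if abs(cur - prev) < tol:                     # preset stopping condition
        break
    prev = cur

print(theta, k / n)  # iterative estimate vs. closed-form MLE
```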
The EM Algorithm
The EM algorithm is one such iterative algorithm; it estimates the parameters (or parameter sets; below we simply say parameters) of a probabilistic model in the presence of missing data. Its outline is:
E-step: compute the Q function (defined in equation (5) below).
M-step: maximize $Q(\theta|\theta^t)$ with respect to $\theta$.
The detailed derivation is as follows. The ultimate goal is to find the parameter that maximizes the log-likelihood:
$$\hat{\theta} = \mathrm{argmax}_\theta \, \log P(x|\theta) \tag{1}$$
Let $y$ denote the missing data. We first have:
$$\log P(x|\theta) = \log P(x,y|\theta) - \log P(y|x,\theta) \tag{2}$$
Equation (2) is easy to derive: since
$$P(x,y|\theta) = P(x|\theta)P(y|x,\theta) \tag{2.1}$$
we have
$$P(x|\theta) = \frac{P(x,y|\theta)}{P(y|x,\theta)} \tag{2.2}$$
Taking the logarithm of both sides yields equation (2).
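Equation (2) is easy to sanity-check numerically; the sketch below does so on a hypothetical 2×2 joint distribution $P(x,y|\theta)$:

```python
import numpy as np

# A hypothetical joint distribution; rows index x, columns index y.
P_xy = np.array([[0.10, 0.25],
                 [0.30, 0.35]])        # entries sum to 1
P_x = P_xy.sum(axis=1, keepdims=True)  # marginal P(x|theta)
P_y_given_x = P_xy / P_x               # conditional P(y|x,theta), as in (2.1)

# Check: log P(x|theta) = log P(x,y|theta) - log P(y|x,theta) for all (x, y)
lhs = np.log(P_x) * np.ones_like(P_xy)
rhs = np.log(P_xy) - np.log(P_y_given_x)
assert np.allclose(lhs, rhs)
```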
During the iterative search for the maximum log-likelihood, suppose step $t$ produced the parameter $\theta^t$, with corresponding log-likelihood $\log P(x|\theta^t)$. The log-likelihood at step $t+1$ should then be no smaller than $\log P(x|\theta^t)$, i.e.,
$$\log P(x|\theta^{t+1}) - \log P(x|\theta^t) \geq 0 \tag{3}$$
To obtain such a $\theta^{t+1}$, we first transform the log-likelihood into:
$$\begin{aligned} \log P(x|\theta) = & \sum_y P(y|x,\theta^t)\log P(x,y|\theta)\\ & -\sum_y P(y|x,\theta^t)\log P(y|x,\theta) \end{aligned} \tag{4}$$
Equation (4) can be derived as follows: multiply both sides of equation (2) by $P(y|x,\theta^t)$ and sum over all $y$, which gives
$$\begin{aligned} \sum_y P(y|x,\theta^t)\log P(x|\theta) = & \sum_y P(y|x,\theta^t)\log P(x,y|\theta)\\ & -\sum_y P(y|x,\theta^t)\log P(y|x,\theta) \end{aligned} \tag{4.1}$$
The left-hand side is easily seen to equal:
$$\begin{aligned} \sum_y P(y|x,\theta^t)\log P(x|\theta) & = \log P(x|\theta)\sum_y P(y|x,\theta^t) \\ & = \log P(x|\theta) \end{aligned} \tag{4.2}$$
Combining (4.1) and (4.2) yields equation (4).
If we define
$$Q(\theta|\theta^t) = \sum_y P(y|x,\theta^t)\log P(x,y|\theta) \tag{5}$$
then we obtain:
$$\begin{aligned} & \log P(x|\theta) - \log P(x|\theta^t)\\ &= Q(\theta|\theta^t) - Q(\theta^t|\theta^t) + \sum_y P(y|x,\theta^t)\log \frac{P(y|x,\theta^t)}{P(y|x,\theta)} \end{aligned} \tag{6}$$
Equation (6) is also easy to derive:
$$\begin{aligned}
&\log P(x|\theta) - \log P(x|\theta^t)\\
&=\bigg[\sum_y P(y|x,\theta^t)\log P(x,y|\theta) - \sum_y P(y|x,\theta^t)\log P(y|x,\theta)\bigg]\\
&\quad -\bigg[\sum_y P(y|x,\theta^t)\log P(x,y|\theta^t) - \sum_y P(y|x,\theta^t)\log P(y|x,\theta^t)\bigg]\\
&=\bigg[Q(\theta|\theta^t) - \sum_y P(y|x,\theta^t)\log P(y|x,\theta)\bigg] -\bigg[Q(\theta^t|\theta^t) - \sum_y P(y|x,\theta^t)\log P(y|x,\theta^t)\bigg]\\
&=\bigg[Q(\theta|\theta^t) - Q(\theta^t|\theta^t)\bigg] +\bigg[\sum_y P(y|x,\theta^t)\log P(y|x,\theta^t) - \sum_y P(y|x,\theta^t)\log P(y|x,\theta)\bigg]\\
&=Q(\theta|\theta^t) - Q(\theta^t|\theta^t) +\sum_y P(y|x,\theta^t)\log \frac{P(y|x,\theta^t)}{P(y|x,\theta)}
\end{aligned} \tag{6.1}$$
Here the term $\sum_y P(y|x,\theta^t)\log \frac{P(y|x,\theta^t)}{P(y|x,\theta)}$ is the relative entropy of $P(y|x,\theta^t)$ with respect to $P(y|x,\theta)$, and is therefore nonnegative (proved below). So as long as $Q(\theta|\theta^t)$ is no smaller than $Q(\theta^t|\theta^t)$, inequality (3) is guaranteed to hold. We can therefore take
$$\theta^{t+1}=\mathrm{argmax}_\theta \, Q(\theta|\theta^t) \tag{7}$$
We define the relative entropy of a probability distribution $P(x)$ with respect to a probability distribution $Q(x)$ as:
$$H(P\|Q) = \sum_i P(x_i) \log \frac{P(x_i)}{Q(x_i)} \tag{7.1}$$
Then $H(P\|Q) \geq 0$, which is proved as follows. We know that:
$$\log x \leq x - 1, \qquad \text{when } x > 0 \tag{7.2}$$
so that:
$$-\log x \geq 1 - x, \qquad \text{when } x > 0 \tag{7.3}$$
Therefore:
$$\begin{aligned} \log \frac{P(x_i)}{Q(x_i)} &= -\log \frac{Q(x_i)}{P(x_i)} \\ &\geq 1 - \frac{Q(x_i)}{P(x_i)} \end{aligned} \tag{7.4}$$
and hence:
$$\begin{aligned} H(P\|Q) &= \sum_i P(x_i) \log \frac{P(x_i)}{Q(x_i)}\\ &\geq \sum_i P(x_i) \bigg(1 - \frac{Q(x_i)}{P(x_i)}\bigg)\\ &= \sum_i P(x_i) - \sum_i Q(x_i)\\ &= 1-1\\ &= 0 \end{aligned} \tag{7.5}$$
Equality holds if and only if $P(x_i) = Q(x_i)$ for all $i$.
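This nonnegativity is easy to confirm numerically; a quick check on random distributions (the dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return np.sum(p * np.log(p / q))   # relative entropy, eq. (7.1)

for _ in range(1000):
    p = rng.random(5); p /= p.sum()    # random distribution P
    q = rng.random(5); q /= q.sum()    # random distribution Q
    assert kl(p, q) >= 0               # H(P||Q) >= 0, eq. (7.5)
assert np.isclose(kl(p, p), 0.0)       # equality iff P == Q
```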
The Baum-Welch Algorithm
The Baum-Welch algorithm is a special case of the EM algorithm that estimates the probability parameters of an HMM. Here the missing data are the hidden paths $\pi$, so the Baum-Welch algorithm solves:
$$Q(\theta|\theta^t) = \sum_\pi P(\pi|x,\theta^t)\log P(x,\pi|\theta) \tag{8}$$
In an HMM, for a given path $\pi$, each model parameter (such as a transition probability $a_{kl}$ or an emission probability $e_k(b)$) appears multiple times in the expression for $\log P(x,\pi|\theta)$. Suppose $a_{kl}$ appears $A_{kl}(\pi)$ times and $e_k(b)$ appears $E_k(b,\pi)$ times. Then for a path $\pi$:
$$P(x,\pi|\theta) = \prod_{k=0}^M \prod_{l=1}^M a_{kl}^{A_{kl}(\pi)} \prod_{k=1}^M \prod_{b}\big[e_k(b)\big]^{E_k(b,\pi)} \tag{9}$$
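Equation (9) can be verified numerically; the sketch below compares the sequential path probability with the count-based product on a hypothetical 2-state HMM (state 0 is the silent begin state; the parameter values, symbols, and path are all invented for illustration):

```python
import numpy as np
from collections import Counter

a = {(0,1):0.6,(0,2):0.4,(1,1):0.7,(1,2):0.3,(2,1):0.2,(2,2):0.8}  # transitions
e = {(1,'A'):0.9,(1,'B'):0.1,(2,'A'):0.25,(2,'B'):0.75}           # emissions

x  = ['A','B','B','A']   # observed symbols
pi = [1, 2, 2, 1]        # a given hidden path

# Direct computation: walk along the path
p_direct, prev = 1.0, 0
for sym, state in zip(x, pi):
    p_direct *= a[(prev, state)] * e[(state, sym)]
    prev = state

# Count-based computation, as in (9): A_kl(pi) and E_k(b, pi)
A = Counter(zip([0] + pi[:-1], pi))   # transition usage counts
E = Counter(zip(pi, x))               # emission usage counts
p_counts = np.prod([a[kl]**n for kl, n in A.items()]) * \
           np.prod([e[kb]**n for kb, n in E.items()])

assert np.isclose(p_direct, p_counts)
```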
Taking logarithms of (9), therefore:
$$\log P(x,\pi|\theta) = \sum_{k=0}^M \sum_{l=1}^M A_{kl}(\pi) \log a_{kl} + \sum_{k=1}^M \sum_{b} E_k(b,\pi) \log e_k(b) \tag{10}$$
With this, equation (8) becomes:
$$Q(\theta|\theta^t) = \sum_\pi P(\pi|x,\theta^t) \bigg[\sum_{k=0}^M \sum_{l=1}^M A_{kl}(\pi) \log a_{kl} + \sum_{k=1}^M \sum_{b} E_k(b,\pi) \log e_k(b)\bigg] \tag{11}$$
Exchanging the order of summation in (11), we obtain:
$$Q(\theta|\theta^t) = \sum_{k=0}^M \sum_{l=1}^M A_{kl} \log a_{kl} + \sum_{k=1}^M \sum_{b} E_k(b) \log e_k(b) \tag{12}$$
where:
$$\begin{aligned} A_{kl} &= \sum_\pi P(\pi|x,\theta^t) A_{kl}(\pi) \\ E_k(b) &= \sum_\pi P(\pi|x,\theta^t) E_k(b,\pi) \end{aligned} \tag{12.1}$$
Equation (12) can be derived as follows: carrying out the sum over $\pi$ first in equation (11) gives:
$$\begin{aligned}
Q(\theta|\theta^t) & = \sum_{k=0}^M \sum_{l=1}^M \sum_\pi P(\pi|x,\theta^t) A_{kl}(\pi) \log a_{kl} + \sum_{k=1}^M \sum_{b} \sum_\pi P(\pi|x,\theta^t) E_k(b,\pi) \log e_k(b)\\
& = \sum_{k=0}^M \sum_{l=1}^M \log a_{kl} \sum_\pi P(\pi|x,\theta^t) A_{kl}(\pi) + \sum_{k=1}^M \sum_{b} \log e_k(b) \sum_\pi P(\pi|x,\theta^t) E_k(b,\pi)\\
& = \sum_{k=0}^M \sum_{l=1}^M A_{kl} \log a_{kl} + \sum_{k=1}^M \sum_{b} E_k(b) \log e_k(b)
\end{aligned} \tag{12.2}$$
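For a small model, the expected counts in (12.1) can be computed literally, by enumerating every path $\pi$; the sketch below does so for the same hypothetical 2-state HMM as above (practical implementations use the forward-backward algorithm instead of enumeration):

```python
import itertools
from collections import Counter, defaultdict

a = {(0,1):0.6,(0,2):0.4,(1,1):0.7,(1,2):0.3,(2,1):0.2,(2,2):0.8}
e = {(1,'A'):0.9,(1,'B'):0.1,(2,'A'):0.25,(2,'B'):0.75}
x = ['A','B','B','A']

def joint(pi):                     # P(x, pi | theta), as in (9)
    p, prev = 1.0, 0
    for sym, state in zip(x, pi):
        p *= a[(prev, state)] * e[(state, sym)]
        prev = state
    return p

paths = list(itertools.product([1, 2], repeat=len(x)))
px = sum(joint(pi) for pi in paths)   # P(x|theta) = sum_pi P(x,pi|theta)

A = defaultdict(float)                # expected transition counts A_kl
E = defaultdict(float)                # expected emission counts E_k(b)
for pi in paths:
    post = joint(pi) / px             # posterior P(pi|x,theta^t)
    for kl, n in Counter(zip((0,) + pi[:-1], pi)).items():
        A[kl] += post * n             # eq. (12.1), transition part
    for kb, n in Counter(zip(pi, x)).items():
        E[kb] += post * n             # eq. (12.1), emission part
```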
The key question now is how to choose the values of $a_{kl}$ and $e_k(b)$ so as to maximize $Q(\theta|\theta^t)$. If we set
$$a_{kl}^0 = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad e_k^0(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')} \tag{13}$$
then the corresponding $Q(\theta|\theta^t)$ attains its maximum. This claim is proved as follows. Let $Q^0(\theta|\theta^t)$ denote the value of $Q(\theta|\theta^t)$ obtained with $a_{kl}^0$ and $e_k^0(b)$, and write $Q(\theta|\theta^t)$ for the value obtained with any other $a_{kl}$ and $e_k(b)$. Then:
$$\begin{aligned}
& Q^0(\theta|\theta^t) - Q(\theta|\theta^t) \\
&= \bigg[\sum_{k=0}^M \sum_{l=1}^M A_{kl} \log a_{kl}^0 + \sum_{k=1}^M \sum_{b} E_k(b) \log e_k^0(b)\bigg] - \bigg[\sum_{k=0}^M \sum_{l=1}^M A_{kl} \log a_{kl} + \sum_{k=1}^M \sum_{b} E_k(b) \log e_k(b)\bigg]\\
&= \sum_{k=0}^M \sum_{l=1}^M A_{kl} \log \frac{a_{kl}^0}{a_{kl}} + \sum_{k=1}^M \sum_{b} E_k(b) \log \frac{e_k^0(b)}{e_k(b)}\\
&= \sum_{k=0}^M \bigg[\sum_{l'}A_{kl'}\bigg] \sum_{l=1}^M \frac{A_{kl}}{\sum_{l'}A_{kl'}} \log \frac{a_{kl}^0}{a_{kl}} + \sum_{k=1}^M \bigg[\sum_{b'}E_k(b')\bigg] \sum_{b} \frac{E_k(b)}{\sum_{b'}E_k(b')} \log \frac{e_k^0(b)}{e_k(b)}\\
&= \sum_{k=0}^M \bigg[\sum_{l'}A_{kl'}\bigg] \sum_{l=1}^M a_{kl}^0 \log \frac{a_{kl}^0}{a_{kl}} + \sum_{k=1}^M \bigg[\sum_{b'}E_k(b')\bigg] \sum_{b} e_k^0(b) \log \frac{e_k^0(b)}{e_k(b)}\\
&\geq 0
\end{aligned} \tag{13.1}$$
In the last inequality, $\sum_{l=1}^M a_{kl}^0 \log \frac{a_{kl}^0}{a_{kl}}$ can be viewed as the relative entropy of $a_{kl}^0$ with respect to $a_{kl}$, and $\sum_b e_k^0(b) \log \frac{e_k^0(b)}{e_k(b)}$ as the relative entropy of $e_k^0(b)$ with respect to $e_k(b)$; both are therefore nonnegative.
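Putting the pieces together, the sketch below (repeating the hypothetical 2-state toy model so it runs on its own) performs one full Baum-Welch update via (12.1) and (13), and checks that the likelihood does not decrease, as (3) promises:

```python
import itertools
from collections import Counter, defaultdict

x = ['A','B','B','A']
a = {(0,1):0.6,(0,2):0.4,(1,1):0.7,(1,2):0.3,(2,1):0.2,(2,2):0.8}
e = {(1,'A'):0.9,(1,'B'):0.1,(2,'A'):0.25,(2,'B'):0.75}

def joint(pi, a, e):                      # P(x, pi | theta), eq. (9)
    p, prev = 1.0, 0
    for sym, state in zip(x, pi):
        p *= a[(prev, state)] * e[(state, sym)]
        prev = state
    return p

def expected_counts(a, e):                # A_kl and E_k(b), eq. (12.1)
    paths = list(itertools.product([1, 2], repeat=len(x)))
    px = sum(joint(pi, a, e) for pi in paths)
    A, E = defaultdict(float), defaultdict(float)
    for pi in paths:
        post = joint(pi, a, e) / px       # P(pi|x,theta^t)
        for kl, n in Counter(zip((0,) + pi[:-1], pi)).items():
            A[kl] += post * n
        for kb, n in Counter(zip(pi, x)).items():
            E[kb] += post * n
    return A, E, px

A, E, px_old = expected_counts(a, e)
# M-step, eq. (13): normalize the expected counts into new parameters
a_new = {kl: A[kl] / sum(A[(kl[0], l)] for l in (1, 2)) for kl in a}
e_new = {kb: E[kb] / sum(E[(kb[0], b)] for b in ('A', 'B')) for kb in e}

_, _, px_new = expected_counts(a_new, e_new)
assert px_new >= px_old                   # the likelihood never decreases
print(px_old, px_new)
```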
Summary
The derivations in this article are based on Chapter 11 of *Biological Sequence Analysis*; the author's contribution is to expand the book's brief derivations into full detail.
(WeChat official account: 生信了)