Training a Hidden Markov Model
The Baum-Welch Algorithm
This note organizes the relevant material from Li Hang's book.
A hidden Markov model is a probabilistic model with latent variables:
$$P(\mathbf{x} \mid \lambda) = \sum_{\mathbf{y}} P(\mathbf{x} \mid \mathbf{y}, \lambda)\, P(\mathbf{y} \mid \lambda)$$
so its parameters can be learned with the EM algorithm.
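To make the marginal concrete, here is a minimal brute-force sketch (not the book's algorithm) that evaluates $P(\mathbf{x} \mid \lambda)$ by enumerating every hidden path; the function and variable names are illustrative. It is only feasible for tiny $N$ and $T$, which is exactly why the forward algorithm exists.

```python
import itertools
import numpy as np

def marginal_likelihood(x, pi, A, B):
    """P(x | lambda) = sum over all hidden paths y of P(x | y, lambda) P(y | lambda).

    x:  observation indices, length T
    pi: (N,) initial state distribution
    A:  (N, N) transition matrix, A[i, j] = a_ij
    B:  (N, M) emission matrix,   B[j, k] = b_{j o_k}
    """
    N, T = len(pi), len(x)
    total = 0.0
    for y in itertools.product(range(N), repeat=T):   # every hidden path
        p = pi[y[0]] * B[y[0], x[0]]                  # pi_{y_1} b_{y_1 x_1}
        for t in range(1, T):
            p *= A[y[t - 1], y[t]] * B[y[t], x[t]]    # a_{y_{t-1} y_t} b_{y_t x_t}
        total += p
    return total
```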
1) Write all observed data as $\mathbf{x} = (x_1, x_2, \cdots, x_T)$, all hidden data as $\mathbf{y} = (y_1, y_2, \cdots, y_T)$, and the complete data as $(\mathbf{x}, \mathbf{y}) = (x_1, x_2, \cdots, x_T, y_1, y_2, \cdots, y_T)$. The complete-data log-likelihood is $\log P(\mathbf{x}, \mathbf{y} \mid \lambda)$.
2) E-step of EM: compute the Q function $Q(\lambda, \bar\lambda)$:
$$Q(\lambda, \bar\lambda) = \sum_{\mathbf{y}} \log P(\mathbf{x}, \mathbf{y} \mid \lambda)\, P(\mathbf{x}, \mathbf{y} \mid \bar\lambda)$$
Note: by the definition of the Q function,
$$Q(\lambda, \bar\lambda) = E_{\mathbf{y}}\big[\log P(\mathbf{x}, \mathbf{y} \mid \lambda) \,\big|\, \mathbf{x}, \bar\lambda\big] = \sum_{\mathbf{y}} \log P(\mathbf{x}, \mathbf{y} \mid \lambda)\, P(\mathbf{y} \mid \mathbf{x}, \bar\lambda)$$
The first form drops the factor $1/P(\mathbf{x} \mid \bar\lambda)$, which is constant with respect to $\lambda$, since $P(\mathbf{y} \mid \mathbf{x}, \bar\lambda) = P(\mathbf{x}, \mathbf{y} \mid \bar\lambda) / P(\mathbf{x} \mid \bar\lambda)$.
Here $\bar\lambda$ is the current estimate of the hidden Markov model's parameters, and $\lambda$ is the parameter to be maximized.
The complete-data likelihood factorizes as
$$P(\mathbf{x}, \mathbf{y} \mid \lambda) = \pi_{y_1} b_{y_1 x_1} a_{y_1 y_2} b_{y_2 x_2} \cdots a_{y_{T-1} y_T} b_{y_T x_T}$$
so the Q function splits into three terms:
$$\begin{aligned} Q(\lambda, \bar\lambda) &= \sum_{\mathbf{y}} \log \pi_{y_1}\, P(\mathbf{x}, \mathbf{y} \mid \bar\lambda) \\ &\quad + \sum_{\mathbf{y}} \Big( \sum_{t=1}^{T-1} \log a_{y_t y_{t+1}} \Big) P(\mathbf{x}, \mathbf{y} \mid \bar\lambda) + \sum_{\mathbf{y}} \Big( \sum_{t=1}^{T} \log b_{y_t x_t} \Big) P(\mathbf{x}, \mathbf{y} \mid \bar\lambda) \end{aligned}$$
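As a sanity check, the Q function can be evaluated straight from its definition by enumerating hidden paths. The sketch below is illustrative only (the names are mine, and it assumes strictly positive parameters so every logarithm is finite); a real implementation never enumerates paths.

```python
import itertools
import numpy as np

def joint_prob(x, y, pi, A, B):
    """P(x, y | lambda) = pi_{y_1} b_{y_1 x_1} a_{y_1 y_2} b_{y_2 x_2} ... b_{y_T x_T}."""
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

def q_function(x, params, params_bar):
    """Q(lambda, lambda_bar) = sum_y log P(x, y | lambda) * P(x, y | lambda_bar)."""
    pi, A, B = params
    pi_b, A_b, B_b = params_bar
    N = len(pi)
    return sum(
        np.log(joint_prob(x, y, pi, A, B)) * joint_prob(x, y, pi_b, A_b, B_b)
        for y in itertools.product(range(N), repeat=len(x))
    )
```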
3) M-step of EM: maximize each of the three terms separately.

First term:
$$\sum_{\mathbf{y}} \log \pi_{y_1}\, P(\mathbf{x}, \mathbf{y} \mid \bar\lambda) = \sum_{i=1}^{N} \log \pi_{s_i}\, P(\mathbf{x}, y_1 = s_i \mid \bar\lambda)$$
Noting the constraint $\sum_{i=1}^{N} \pi_{s_i} = 1$, write the Lagrangian:
$$\sum_{i=1}^{N} \log \pi_{s_i}\, P(\mathbf{x}, y_1 = s_i \mid \bar\lambda) + \gamma \Big( \sum_{i=1}^{N} \pi_{s_i} - 1 \Big)$$
Setting its partial derivative with respect to $\pi_{s_i}$ to zero:
$$\frac{P(\mathbf{x}, y_1 = s_i \mid \bar\lambda)}{\pi_{s_i}} + \gamma = 0 \;\Rightarrow\; P(\mathbf{x}, y_1 = s_i \mid \bar\lambda) + \pi_{s_i} \gamma = 0 \qquad (1)$$
Summing (1) over $i$ gives
$$-\gamma = \sum_{i=1}^{N} P(\mathbf{x}, y_1 = s_i \mid \bar\lambda) = P(\mathbf{x} \mid \bar\lambda)$$
Substituting back into (1):
$$\pi_{s_i} = \frac{P(\mathbf{x}, y_1 = s_i \mid \bar\lambda)}{P(\mathbf{x} \mid \bar\lambda)} = \gamma_1(i)$$
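A minimal sketch of this update, reusing the hypothetical joint_prob helper (and imports) from the sketch above; it again substitutes brute-force enumeration where a real implementation would use the forward-backward recursions:

```python
def update_pi(x, pi_bar, A_bar, B_bar):
    """New pi_{s_i} = P(x, y_1 = s_i | lambda_bar) / P(x | lambda_bar) = gamma_1(i)."""
    N = len(pi_bar)
    joint = np.array([
        sum(joint_prob(x, (i,) + rest, pi_bar, A_bar, B_bar)
            for rest in itertools.product(range(N), repeat=len(x) - 1))
        for i in range(N)
    ])                              # joint[i] = P(x, y_1 = s_i | lambda_bar)
    return joint / joint.sum()      # dividing by P(x | lambda_bar) normalizes
```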
Second term:
$$\sum_{\mathbf{y}} \Big( \sum_{t=1}^{T-1} \log a_{y_t y_{t+1}} \Big) P(\mathbf{x}, \mathbf{y} \mid \bar\lambda) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \log a_{ij}\, P(\mathbf{x}, y_t = s_i, y_{t+1} = s_j \mid \bar\lambda)$$
By analogy with the first term, note the constraint $\sum_{j=1}^{N} a_{ij} = 1$ and write the Lagrangian:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \log a_{ij}\, P(\mathbf{x}, y_t = s_i, y_{t+1} = s_j \mid \bar\lambda) + \gamma \Big( \sum_{j=1}^{N} a_{ij} - 1 \Big)$$
Setting its partial derivative with respect to $a_{ij}$ to zero:
$$\frac{\sum_{t=1}^{T-1} P(\mathbf{x}, y_t = s_i, y_{t+1} = s_j \mid \bar\lambda)}{a_{ij}} + \gamma = 0 \qquad (2)$$
Multiplying (2) by $a_{ij}$ and summing over $j$ gives
$$-\gamma = \sum_{j=1}^{N} \sum_{t=1}^{T-1} P(\mathbf{x}, y_t = s_i, y_{t+1} = s_j \mid \bar\lambda) = \sum_{t=1}^{T-1} P(\mathbf{x}, y_t = s_i \mid \bar\lambda)$$
Substituting back into (2):
$$a_{ij} = \frac{\sum_{t=1}^{T-1} P(\mathbf{x}, y_t = s_i, y_{t+1} = s_j \mid \bar\lambda)}{\sum_{t=1}^{T-1} P(\mathbf{x}, y_t = s_i \mid \bar\lambda)} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$
This matches the practical meaning of $a_{ij}$: the probability that $y_{t+1} = s_j$ given $y_t = s_i$.
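In code the update is one line, assuming $\xi_t(i,j)$ and $\gamma_t(i)$ have already been obtained (in practice from the forward-backward algorithm; here they are simply arrays handed in, with shapes of my own convention):

```python
def update_A(xi, gamma):
    """New a_ij = sum_{t=1}^{T-1} xi_t(i, j) / sum_{t=1}^{T-1} gamma_t(i).

    xi:    (T-1, N, N), xi[t, i, j] = xi_{t+1}(i, j)
    gamma: (T, N),      gamma[t, i] = gamma_{t+1}(i)
    """
    return xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
```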
Third term:
$$\sum_{\mathbf{y}} \Big( \sum_{t=1}^{T} \log b_{y_t x_t} \Big) P(\mathbf{x}, \mathbf{y} \mid \bar\lambda) = \sum_{j=1}^{N} \sum_{t=1}^{T} \log b_{j x_t}\, P(\mathbf{x}, y_t = s_j \mid \bar\lambda)$$
Similarly there is the constraint $\sum_{k=1}^{M} b_{j o_k} = 1$, so write the Lagrangian:
$$\sum_{j=1}^{N} \sum_{t=1}^{T} \log b_{j x_t}\, P(\mathbf{x}, y_t = s_j \mid \bar\lambda) + \gamma \Big( \sum_{k=1}^{M} b_{j o_k} - 1 \Big)$$
Setting its partial derivative with respect to $b_{j o_k}$ to zero (only the terms with $x_t = o_k$ involve $b_{j o_k}$):
$$\frac{\sum_{t=1}^{T} P(\mathbf{x}, y_t = s_j \mid \bar\lambda)\, I(x_t = o_k)}{b_{j o_k}} + \gamma = 0 \qquad (3)$$
Note: $I(\mathrm{true}) = 1$, $I(\mathrm{false}) = 0$.
Multiplying (3) by $b_{j o_k}$ and summing over $k$ gives
$$-\gamma = \sum_{k=1}^{M} \sum_{t=1}^{T} P(\mathbf{x}, y_t = s_j \mid \bar\lambda)\, I(x_t = o_k) = \sum_{t=1}^{T} P(\mathbf{x}, y_t = s_j \mid \bar\lambda)$$
Substituting back into (3):
$$b_{j o_k} = \frac{\sum_{t=1}^{T} P(\mathbf{x}, y_t = s_j \mid \bar\lambda)\, I(x_t = o_k)}{\sum_{t=1}^{T} P(\mathbf{x}, y_t = s_j \mid \bar\lambda)} = \frac{\sum_{t=1,\, x_t = o_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
This matches the practical meaning of $b_{j o_k}$: the probability that $x_t = o_k$ given $y_t = s_j$.
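Under the same assumptions (a precomputed gamma array; the names are my own), a minimal sketch of the emission update:

```python
def update_B(x, gamma, M):
    """New b_{j o_k} = sum over {t : x_t = o_k} of gamma_t(j) / sum_t gamma_t(j)."""
    T, N = gamma.shape
    B = np.zeros((N, M))
    for t in range(T):
        B[:, x[t]] += gamma[t]                 # numerator: only terms with x_t = o_k
    return B / gamma.sum(axis=0)[:, None]      # denominator: sum_t gamma_t(j)
```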
The Prediction Algorithm
The Viterbi Algorithm
The Viterbi algorithm solves the HMM prediction (decoding) problem with dynamic programming: it finds the state path of maximum probability (the optimal path). The key property of the optimal path is this: if it passes through node $i_t^*$ at time $t$, then its sub-path from $i_t^*$ to the terminal node $i_T^*$ must be optimal among all possible sub-paths from $i_t^*$ to $i_T^*$. Based on this property, we start at $t = 1$ and recursively compute, for each state $i$, the maximum probability over all partial paths ending in state $i$ at time $t$, until we reach $t = T$. The largest of these probabilities at $t = T$ is the probability $P^*$ of the optimal path, and the terminal node $i_T^*$ is obtained at the same time. Then, starting from $i_T^*$, we trace backwards to recover the nodes $i_{T-1}^*, \ldots, i_1^*$, which yields the optimal path $I^* = (i_1^*, i_2^*, \ldots, i_T^*)$.
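A minimal sketch of the recursion and backtracking, using the same illustrative pi, A, B convention as the earlier sketches:

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Return (optimal path I*, its probability P*) for observation indices x."""
    N, T = len(pi), len(x)
    delta = np.zeros((T, N))           # delta[t, i]: max prob of partial paths ending in i at t
    psi = np.zeros((T, N), dtype=int)  # psi[t, j]: best predecessor of state j at time t
    delta[0] = pi * B[:, x[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A         # trans[i, j] = delta[t-1, i] * a_ij
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, x[t]]
    path = [int(delta[T - 1].argmax())]           # terminal node i*_T
    for t in range(T - 1, 0, -1):                 # backtrack i*_{T-1}, ..., i*_1
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[T - 1].max())
```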