Training a Hidden Markov Model
The Baum-Welch Algorithm
These notes are organized from the relevant content of Li Hang's book.
A hidden Markov model is a probabilistic model with latent variables,

$$P(\mathbf{x} \mid \boldsymbol{\lambda}) = \sum_{\mathbf{y}} P(\mathbf{x} \mid \mathbf{y}, \boldsymbol{\lambda})\, P(\mathbf{y} \mid \boldsymbol{\lambda}),$$

so its parameters can be learned with the EM algorithm.
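As a concrete (if very inefficient) illustration of this sum over hidden state sequences, the marginal likelihood can be computed by enumerating every path. Below is a minimal Python sketch; the array names `pi`, `A`, `B`, `x` for the initial, transition, emission parameters and the observation sequence are illustrative, not from the original notes.

```python
import itertools
import numpy as np

def marginal_likelihood(pi, A, B, x):
    """P(x | lambda) by brute-force enumeration of all hidden paths y.

    pi: (N,)   initial distribution, pi[i] = P(y_1 = s_i)
    A:  (N, N) transitions, A[i, j] = P(y_{t+1} = s_j | y_t = s_i)
    B:  (N, M) emissions,   B[i, k] = P(x_t = o_k | y_t = s_i)
    x:  list of observation indices of length T
    """
    N, T = len(pi), len(x)
    total = 0.0
    for y in itertools.product(range(N), repeat=T):   # every hidden path
        p = pi[y[0]] * B[y[0], x[0]]                   # pi_{y_1} b_{y_1 x_1}
        for t in range(1, T):
            p *= A[y[t - 1], y[t]] * B[y[t], x[t]]     # a_{y_{t-1} y_t} b_{y_t x_t}
        total += p
    return total
```

In practice the forward algorithm computes the same quantity in $O(N^2 T)$ time rather than $O(N^T)$.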
1) Write all observed data as $\mathbf{x} = (x_1, x_2, \cdots, x_T)$ and all hidden data as $\mathbf{y} = (y_1, y_2, \cdots, y_T)$; the complete data are $(\mathbf{x}, \mathbf{y}) = (x_1, x_2, \cdots, x_T, y_1, y_2, \cdots, y_T)$. The complete-data log-likelihood is $\log P(\mathbf{x}, \mathbf{y} \mid \boldsymbol{\lambda})$.
2) E-step of EM: compute the Q function $Q(\boldsymbol{\lambda}, \overline{\boldsymbol{\lambda}})$:

$$Q(\boldsymbol{\lambda}, \overline{\boldsymbol{\lambda}}) = \sum_{\mathbf{y}} \log P(\mathbf{x}, \mathbf{y} \mid \boldsymbol{\lambda})\, P(\mathbf{x}, \mathbf{y} \mid \overline{\boldsymbol{\lambda}}).$$

Note: by the definition of the Q function,

$$Q(\boldsymbol{\lambda}, \overline{\boldsymbol{\lambda}}) = E_{\mathbf{y}}\big[\log P(\mathbf{x}, \mathbf{y} \mid \boldsymbol{\lambda}) \,\big|\, \mathbf{x}, \overline{\boldsymbol{\lambda}}\big] = \sum_{\mathbf{y}} \log P(\mathbf{x}, \mathbf{y} \mid \boldsymbol{\lambda})\, P(\mathbf{y} \mid \mathbf{x}, \overline{\boldsymbol{\lambda}}).$$

The first expression omits the factor $1/P(\mathbf{x} \mid \overline{\boldsymbol{\lambda}})$, which is a constant with respect to $\boldsymbol{\lambda}$ (since $P(\mathbf{y} \mid \mathbf{x}, \overline{\boldsymbol{\lambda}}) = P(\mathbf{x}, \mathbf{y} \mid \overline{\boldsymbol{\lambda}}) / P(\mathbf{x} \mid \overline{\boldsymbol{\lambda}})$).
Here $\overline{\boldsymbol{\lambda}}$ is the current estimate of the HMM parameters and $\boldsymbol{\lambda}$ is the HMM parameter to be maximized. Since

$$P(\mathbf{x}, \mathbf{y} \mid \boldsymbol{\lambda}) = \pi_{y_1} b_{y_1 x_1} a_{y_1 y_2} b_{y_2 x_2} \cdots a_{y_{T-1} y_T} b_{y_T x_T},$$

the Q function becomes

$$\begin{aligned} Q(\boldsymbol{\lambda}, \overline{\boldsymbol{\lambda}}) &= \sum_{\mathbf{y}} \log \pi_{y_1}\, P(\mathbf{x}, \mathbf{y} \mid \overline{\boldsymbol{\lambda}}) + \sum_{\mathbf{y}} \left( \sum_{t=1}^{T-1} \log a_{y_t y_{t+1}} \right) P(\mathbf{x}, \mathbf{y} \mid \overline{\boldsymbol{\lambda}}) \\ &\quad + \sum_{\mathbf{y}} \left( \sum_{t=1}^{T} \log b_{y_t x_t} \right) P(\mathbf{x}, \mathbf{y} \mid \overline{\boldsymbol{\lambda}}). \end{aligned}$$
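In practice, the E-step amounts to computing the posteriors that appear in the M-step below, $\gamma_t(i) = P(y_t = s_i \mid \mathbf{x}, \overline{\boldsymbol{\lambda}})$ and $\xi_t(i,j) = P(y_t = s_i, y_{t+1} = s_j \mid \mathbf{x}, \overline{\boldsymbol{\lambda}})$, with the forward-backward algorithm. A minimal unscaled NumPy sketch (only suitable for short sequences; the parameter layout follows the hypothetical `pi`, `A`, `B`, `x` used earlier):

```python
import numpy as np

def e_step(pi, A, B, x):
    """Forward-backward pass: returns gamma (T, N) and xi (T-1, N, N)."""
    N, T = len(pi), len(x)
    x = np.asarray(x)

    # Forward: alpha[t, i] = P(x_1..x_t, y_t = s_i | lambda_bar)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]

    # Backward: beta[t, i] = P(x_{t+1}..x_T | y_t = s_i, lambda_bar)
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

    px = alpha[-1].sum()                      # P(x | lambda_bar)
    gamma = alpha * beta / px                 # gamma[t, i] = P(y_t = s_i | x)
    # xi[t, i, j] = alpha_t(i) a_{ij} b_{j x_{t+1}} beta_{t+1}(j) / P(x)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, x[1:]].T * beta[1:])[:, None, :]) / px
    return gamma, xi
```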
3) M-step of EM: maximize the three terms above separately.

First term:

$$\sum_{\mathbf{y}} \log \pi_{y_1}\, P(\mathbf{x}, \mathbf{y} \mid \overline{\boldsymbol{\lambda}}) = \sum_{i=1}^{N} \log \pi_{s_i}\, P(\mathbf{x}, y_1 = s_i \mid \overline{\boldsymbol{\lambda}}).$$

Noting the constraint $\sum_{i=1}^{N} \pi_{s_i} = 1$, use a Lagrange multiplier and write the Lagrangian

$$\sum_{i=1}^{N} \log \pi_{s_i}\, P(\mathbf{x}, y_1 = s_i \mid \overline{\boldsymbol{\lambda}}) + \gamma \left( \sum_{i=1}^{N} \pi_{s_i} - 1 \right).$$

Setting the partial derivative with respect to $\pi_{s_i}$ to zero,

$$\frac{P(\mathbf{x}, y_1 = s_i \mid \overline{\boldsymbol{\lambda}})}{\pi_{s_i}} + \gamma = 0 \;\Rightarrow\; P(\mathbf{x}, y_1 = s_i \mid \overline{\boldsymbol{\lambda}}) + \gamma \pi_{s_i} = 0. \qquad (1)$$

Summing (1) over $i$ gives

$$-\gamma = \sum_{i=1}^{N} P(\mathbf{x}, y_1 = s_i \mid \overline{\boldsymbol{\lambda}}) = P(\mathbf{x} \mid \overline{\boldsymbol{\lambda}}).$$

Substituting back into (1),

$$\pi_{s_i} = \frac{P(\mathbf{x}, y_1 = s_i \mid \overline{\boldsymbol{\lambda}})}{P(\mathbf{x} \mid \overline{\boldsymbol{\lambda}})} = \gamma_1(i),$$

where $\gamma_t(i) = P(y_t = s_i \mid \mathbf{x}, \overline{\boldsymbol{\lambda}})$.

Second term:

$$\sum_{\mathbf{y}} \left( \sum_{t=1}^{T-1} \log a_{y_t y_{t+1}} \right) P(\mathbf{x}, \mathbf{y} \mid \overline{\boldsymbol{\lambda}}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \log a_{ij}\, P(\mathbf{x}, y_t = s_i, y_{t+1} = s_j \mid \overline{\boldsymbol{\lambda}}).$$

By analogy with the first term, for each fixed $i$ the constraint $\sum_{j=1}^{N} a_{ij} = 1$ holds, so write the Lagrangian

$$\sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \log a_{ij}\, P(\mathbf{x}, y_t = s_i, y_{t+1} = s_j \mid \overline{\boldsymbol{\lambda}}) + \gamma \left( \sum_{j=1}^{N} a_{ij} - 1 \right).$$

Setting the partial derivative with respect to $a_{ij}$ to zero,

$$\frac{\sum_{t=1}^{T-1} P(\mathbf{x}, y_t = s_i, y_{t+1} = s_j \mid \overline{\boldsymbol{\lambda}})}{a_{ij}} + \gamma = 0. \qquad (2)$$

Multiplying (2) by $a_{ij}$ and summing over $j$ gives

$$-\gamma = \sum_{j=1}^{N} \sum_{t=1}^{T-1} P(\mathbf{x}, y_t = s_i, y_{t+1} = s_j \mid \overline{\boldsymbol{\lambda}}) = \sum_{t=1}^{T-1} P(\mathbf{x}, y_t = s_i \mid \overline{\boldsymbol{\lambda}}).$$

Substituting back into (2),

$$a_{ij} = \frac{\sum_{t=1}^{T-1} P(\mathbf{x}, y_t = s_i, y_{t+1} = s_j \mid \overline{\boldsymbol{\lambda}})}{\sum_{t=1}^{T-1} P(\mathbf{x}, y_t = s_i \mid \overline{\boldsymbol{\lambda}})} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},$$

where $\xi_t(i,j) = P(y_t = s_i, y_{t+1} = s_j \mid \mathbf{x}, \overline{\boldsymbol{\lambda}})$. This matches the intuitive meaning of $a_{ij}$: the probability that $y_{t+1} = s_j$ given $y_t = s_i$.
Third term:

$$\sum_{\mathbf{y}} \left( \sum_{t=1}^{T} \log b_{y_t x_t} \right) P(\mathbf{x}, \mathbf{y} \mid \overline{\boldsymbol{\lambda}}) = \sum_{j=1}^{N} \sum_{t=1}^{T} \log b_{j x_t}\, P(\mathbf{x}, y_t = s_j \mid \overline{\boldsymbol{\lambda}}).$$

Writing $b_{jk}$ for the probability of emitting $o_k$ in state $s_j$, for each fixed $j$ we similarly have the constraint $\sum_{k=1}^{M} b_{jk} = 1$, so the Lagrangian is

$$\sum_{j=1}^{N} \sum_{t=1}^{T} \log b_{j x_t}\, P(\mathbf{x}, y_t = s_j \mid \overline{\boldsymbol{\lambda}}) + \gamma \left( \sum_{k=1}^{M} b_{jk} - 1 \right).$$

Setting the partial derivative with respect to $b_{jk}$ to zero (the term $\log b_{j x_t}$ involves $b_{jk}$ only when $x_t = o_k$),

$$\frac{\sum_{t=1}^{T} P(\mathbf{x}, y_t = s_j \mid \overline{\boldsymbol{\lambda}})\, I(x_t = o_k)}{b_{jk}} + \gamma = 0, \qquad (3)$$

where the indicator satisfies $I(\text{true}) = 1$, $I(\text{false}) = 0$. Multiplying (3) by $b_{jk}$ and summing over $k$ gives

$$-\gamma = \sum_{k=1}^{M} \sum_{t=1}^{T} P(\mathbf{x}, y_t = s_j \mid \overline{\boldsymbol{\lambda}})\, I(x_t = o_k) = \sum_{t=1}^{T} P(\mathbf{x}, y_t = s_j \mid \overline{\boldsymbol{\lambda}}).$$

Substituting back into (3),

$$b_{jk} = \frac{\sum_{t=1}^{T} P(\mathbf{x}, y_t = s_j \mid \overline{\boldsymbol{\lambda}})\, I(x_t = o_k)}{\sum_{t=1}^{T} P(\mathbf{x}, y_t = s_j \mid \overline{\boldsymbol{\lambda}})} = \frac{\sum_{t=1,\, x_t = o_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.$$

This matches the intuitive meaning of $b_{jk}$: the probability that $x_t = o_k$ given $y_t = s_j$.
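Collecting the three re-estimation formulas, one Baum-Welch iteration for a single observation sequence can be sketched as follows, reusing the hypothetical `e_step` above. A practical implementation would add scaling or log-space arithmetic, support for multiple sequences, and a convergence check.

```python
import numpy as np

def baum_welch_step(pi, A, B, x):
    """One EM iteration: returns updated (pi, A, B)."""
    gamma, xi = e_step(pi, A, B, x)   # posteriors under the current parameters
    x = np.asarray(x)
    M = B.shape[1]

    new_pi = gamma[0]                                          # pi_i = gamma_1(i)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # sum_t xi_t(i,j) / sum_t gamma_t(i)

    # b_{jk}: mass of gamma_t(j) at the times where o_k was observed
    new_B = np.zeros_like(B)
    for k in range(M):
        new_B[:, k] = gamma[x == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B
```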
The Prediction Algorithm
The Viterbi Algorithm
The Viterbi algorithm solves the HMM prediction (decoding) problem with dynamic programming, i.e., it uses dynamic programming to find the path of maximum probability (the optimal path). The key property of the optimal path: if the optimal path passes through node $i_t^*$ at time $t$, then its partial path from $i_t^*$ to the terminal node $i_T^*$ must be optimal among all possible partial paths from $i_t^*$ to $i_T^*$. Based on this property, we start at $t = 1$ and recursively compute, for each state $i$, the maximum probability of a partial path ending in state $i$ at time $t$, until at $t = T$ we have the maximum probability of the paths ending in each state $i$. The largest of these at $t = T$ is the probability $P^*$ of the optimal path, and its terminal node $i_T^*$ is obtained at the same time. Then, starting from $i_T^*$, we trace backwards to find the nodes $i_{T-1}^*, \ldots, i_1^*$, obtaining the optimal path $I^* = (i_1^*, i_2^*, \ldots, i_T^*)$.
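A minimal dynamic-programming sketch of this procedure (same hypothetical `pi`, `A`, `B`, `x` layout as above; working in log space avoids underflow on long sequences):

```python
import numpy as np

def viterbi(pi, A, B, x):
    """Most probable state path for the observation sequence x.

    Returns (path, log_prob): path[t] is the 0-based index of the optimal
    state at time t+1, and log_prob is log P* of the optimal path.
    """
    N, T = len(pi), len(x)
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)

    delta = np.zeros((T, N))            # delta[t, i]: best log-prob of a partial path ending in state i at time t
    psi = np.zeros((T, N), dtype=int)   # psi[t, i]: best predecessor of state i at time t

    delta[0] = log_pi + log_B[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A        # scores[i, j] = delta[t-1, i] + log a_{ij}
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, x[t]]

    # Termination and backtracking: pick the terminal node, then follow psi backwards
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()
```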