Hidden Markov Model
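A hidden Markov model (HMM) is a probabilistic graphical model. Machine learning models can be approached from either the frequentist or the Bayesian viewpoint. Frequentist methods center on optimization problems, whereas Bayesian methods center on integration problems, which has led to a family of integration techniques such as variational inference and MCMC.
The most basic probabilistic graphical models are directed graphs (Bayesian networks) and undirected graphs (Markov random fields). If the samples are correlated, they can be viewed as carrying temporal information, so they are no longer independent and identically distributed. Such a model is called a dynamic model: the hidden variable changes over time, and the observed variable changes with it.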
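Dynamic models can be classified by the nature of the state variable:
- HMM: the state variable (hidden variable) is discrete; there is no restriction on the observed variable
- Kalman filter: the state variable is continuous and the model is linear
- Particle filter: the state variable is continuous and the model is nonlinear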
Model definition
An HMM can be drawn as a probabilistic graph.
The figure shows how the hidden variable evolves over three time steps. The model is denoted $\color{blue}\lambda=(\pi,A,B)$, where $\color{blue}\pi$ is the initial probability distribution, $\color{blue}A$ is the state transition matrix, and $\color{blue}B$ is the emission matrix.
- $\color{blue}o_t$ is the observed variable, $\color{blue}O$ is the observation sequence, and ${\color{blue}V}=\{v_1,v_2,\cdots,v_M\}$ is the set of values an observation can take.
- $\color{blue}i_t$ is the state variable, $\color{blue}I$ is the state sequence, and ${\color{blue}Q}=\{q_1,q_2,\cdots,q_N\}$ is the set of values a state can take.
- ${\color{blue}\pi}=\{\pi(1),\pi(2),\cdots,\pi(N)\}$ is the initial state distribution, with $\sum^N_{i=1}\pi(i)=1$. The formulas below also write $\pi(i)$ as $\pi_i$.
- ${\color{blue}A}=(a_{ij}=p(i_{t+1}=q_j|i_t=q_i))$ is the state transition matrix.
- ${\color{blue}B}=(b_j(k)=p(o_t=v_k|i_t=q_j))$ is the emission matrix. The formulas below write $b_j(o_t)$ for the probability that state $q_j$ emits the observation $o_t$.
- ${\color{blue}\lambda^{(t)}}=(\pi^{(t)},A^{(t)},B^{(t)})$ denotes the parameters at step $t$. In the Learning section this superscript is written $\lambda^t$ and indexes the EM iteration, not a time step of the sequence.
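To make the notation concrete, the sketch below stores $\lambda=(\pi,A,B)$ as plain Python lists and draws one state sequence $I$ and one observation sequence $O$ from the model. The numerical values, the function name `sample_hmm`, and the zero-based indexing of states and observation symbols are illustrative assumptions, not part of the original text.

```python
import random

# Hypothetical toy parameters: N = 2 states, M = 3 observation symbols.
pi = [0.6, 0.4]                      # initial distribution, pi[i] = p(i_1 = q_i)
A = [[0.7, 0.3],                     # A[i][j] = p(i_{t+1} = q_j | i_t = q_i)
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],                # B[j][k] = p(o_t = v_k | i_t = q_j)
     [0.1, 0.3, 0.6]]

def sample_hmm(pi, A, B, T, rng):
    """Draw a state sequence and an observation sequence of length T."""
    states, observations = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(T):
        states.append(state)
        # Emit an observation that depends only on the current state.
        observations.append(rng.choices(range(len(B[state])), weights=B[state])[0])
        # Move to the next state, which depends only on the current state.
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return states, observations

rng = random.Random(0)
I, O = sample_hmm(pi, A, B, T=5, rng=rng)
print("states:", I)
print("observations:", O)
```

Each row of `A` and `B` sums to 1, which mirrors the normalization constraint stated for $\pi$.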
Two basic assumptions
- Homogeneous Markov assumption: $p(i_{t+1}|i_t,i_{t-1},\cdots,i_1,o_t,o_{t-1},\cdots,o_1)=p(i_{t+1}|i_t)$
- Observation independence assumption: $p(o_t|i_t,i_{t-1},\cdots,i_1,o_{t-1},\cdots,o_1)=p(o_t|i_t)$
Three tasks
The three tasks for an HMM are:
- 1. Evaluation: compute $p(O|\lambda)$, using the Forward-Backward algorithm
- 2. Learning: $\lambda=\mathop{\mathrm{argmax}}\limits_{\lambda}p(O|\lambda)$, using the EM algorithm (Baum-Welch)
- 3. Decoding: $I=\mathop{\mathrm{argmax}}\limits_{I}p(I|O,\lambda)$, using the Viterbi algorithm

Two related problems are:
$$\begin{aligned}&\text{Prediction: }p(i_{t+1}|o_1,o_2,\cdots,o_t)\\&\text{Filtering: }p(i_t|o_1,o_2,\cdots,o_t)\end{aligned}$$
Evaluation
The Evaluation problem can be written as:
$$p(O|\lambda)=\sum\limits_{I}p(I,O|\lambda)=\sum\limits_{I}p(O|I,\lambda)p(I|\lambda)$$
In this expression, the chain rule gives:
$$p(I|\lambda)=p(i_1,i_2,\cdots,i_T|\lambda)=p(i_T|i_1,i_2,\cdots,i_{T-1},\lambda)p(i_1,i_2,\cdots,i_{T-1}|\lambda)$$
By the homogeneous Markov assumption, for every $t$:
$$p(i_t|i_1,i_2,\cdots,i_{t-1},\lambda)=p(i_t|i_{t-1})=a_{i_{t-1}i_t}$$
Therefore:
$$p(I|\lambda)=\pi_{i_1}\prod\limits_{t=2}^Ta_{i_{t-1},i_t}$$
Moreover, by the observation independence assumption:
$$p(O|I,\lambda)=\prod\limits_{t=1}^Tb_{i_t}(o_t)$$
Hence:
$$\begin{aligned} p(O|\lambda)&=\sum\limits_{I}\pi_{i_1}\prod\limits_{t=2}^Ta_{i_{t-1},i_t}\prod\limits_{t=1}^Tb_{i_t}(o_t)\\ &=\underbrace{\sum\limits_{i_1}\cdots\sum\limits_{i_T}}_{\color{blue}\text{complexity }O(N^T)} \pi_{i_1}\prod\limits_{t=2}^Ta_{i_{t-1},i_t}\prod\limits_{t=1}^Tb_{i_t}(o_t) \end{aligned}$$
Here $I$ is the state sequence of length $T$, and each state variable $i_t$ can take $N$ values, so the sum has $N^T$ terms and the complexity is $O(N^T)$. This is far too expensive, so a cheaper algorithm is needed to compute $p(O|\lambda)$.
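The direct summation can be written down literally, which makes the exponential cost visible. The following is a minimal sketch that enumerates all $N^T$ state sequences; the toy parameters and the function name `likelihood_brute_force` are assumptions made for illustration, and observations are encoded as zero-based symbol indices.

```python
from itertools import product

# Hypothetical toy parameters: N = 2 states, M = 3 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

def likelihood_brute_force(pi, A, B, obs):
    """Compute p(O | lambda) by summing p(O, I | lambda) over every state sequence I."""
    N, T = len(pi), len(obs)
    total = 0.0
    for path in product(range(N), repeat=T):       # N**T candidate sequences
        p = pi[path[0]] * B[path[0]][obs[0]]       # pi_{i_1} * b_{i_1}(o_1)
        for t in range(1, T):
            # a_{i_{t-1}, i_t} * b_{i_t}(o_t)
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

obs = [0, 1, 2]
p_obs = likelihood_brute_force(pi, A, B, obs)
print(p_obs)
```

This is only practical for very short sequences, but it is useful as a reference against which faster algorithms can be checked.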
Forward algorithm
Define $\alpha_t(i)=p(o_1,o_2,\cdots,o_t,i_t=q_i|\lambda)$, so that $\alpha_T(i)=p(O,i_T=q_i|\lambda)$. Then:
$$p(O|\lambda)={\color{blue}\sum\limits_{i=1}^N}p(O,{\color{blue}i_T=q_i}|\lambda)=\sum\limits_{i=1}^N\alpha_T(i)$$
Here the state variable $i_T$ is marginalized out by summing over the $N$ values in its state space.
The recursion between $\alpha_{t+1}(j)$ and $\alpha_{t}(i)$ is derived as follows:
$$\begin{aligned} {\color{blue}\alpha_{t+1}(j)}&=p(o_1,o_2,\cdots,o_{t+1},i_{t+1}=q_j|\lambda)\\ &={\color{blue}\sum\limits_{i=1}^N}p(o_1,o_2,\cdots,o_{t+1},i_{t+1}=q_j,{\color{blue}i_t=q_i}|\lambda)\\ &=\sum\limits_{i=1}^Np(o_{t+1}|o_1,\cdots,o_t,i_{t+1}=q_j,i_t=q_i,\lambda)p(o_1,\cdots,o_t,i_t=q_i,i_{t+1}=q_j|\lambda) \\ &=\sum\limits_{i=1}^N{\color{blue}p(o_{t+1}|i_{t+1}=q_j)}p(o_1,\cdots,o_t,i_t=q_i,i_{t+1}=q_j|\lambda)\quad\color{blue}\text{observation independence}\\ &=\sum\limits_{i=1}^N\underbrace{p(o_{t+1}|i_{t+1}=q_j)}_{\color{blue}b_{j}(o_{t+1})}\underbrace{p({\color{blue}i_{t+1}=q_j}|o_1,\cdots,o_t,{\color{blue}i_t=q_i},\lambda)}_{\color{blue}p(i_{t+1}=q_j|i_t=q_i,\lambda)=a_{ij}} \underbrace{p(o_1,\cdots,o_t,i_t=q_i|\lambda)}_{\color{blue}\alpha_t(i)}\\ &=\sum\limits_{i=1}^Nb_{j}(o_{t+1})a_{ij}{\color{blue}\alpha_t(i)} \end{aligned}$$
The homogeneous Markov assumption reduces the middle factor to $a_{ij}$ and yields the recursion. Starting from $\alpha_1(i)=\pi_ib_i(o_1)$, this procedure is called the forward algorithm.
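The recursion translates almost line by line into code. The following is a minimal sketch, assuming the same hypothetical toy parameters as before and zero-based indices; the names `forward` and `likelihood_forward` are illustrative.

```python
# Hypothetical toy parameters: N = 2 states, M = 3 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

def forward(pi, A, B, obs):
    """Return the table alpha, where alpha[t][i] corresponds to alpha_{t+1}(i) in the text."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        # Recursion: alpha_{t+1}(j) = b_j(o_{t+1}) * sum_i a_ij * alpha_t(i)
        alpha.append([B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

def likelihood_forward(pi, A, B, obs):
    """p(O | lambda) = sum_i alpha_T(i)."""
    return sum(forward(pi, A, B, obs)[-1])

print(likelihood_forward(pi, A, B, [0, 1]))
```

Each time step costs $N^2$ multiplications, so the whole table is filled in $O(TN^2)$ operations.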
Backward algorithm
Define:
$$\begin{aligned}\beta_t(i)&=p(o_{t+1},\cdots,o_T|{\color{blue}i_t}=q_i,\lambda)\\ &\ \ \vdots\\ \beta_1(i)&=p(o_{2},\cdots,o_T|{\color{blue}i_1}=q_i,\lambda)\end{aligned}$$
Then:
$$\begin{aligned} {\color{blue}p(O|\lambda)}&=p(o_1,\cdots,o_T|\lambda)\\ &=\sum\limits_{i=1}^Np(o_1,o_2,\cdots,o_T,i_1=q_i|\lambda)\qquad\color{blue}\text{introduce $i_1$}\\ &=\sum\limits_{i=1}^Np(o_1,o_2,\cdots,o_T|i_1=q_i,\lambda)\underbrace{p(i_1=q_i|\lambda)}_{\color{blue}\text{initial distribution }\pi_i}\\ &=\sum\limits_{i=1}^Np(o_1,o_2,\cdots,o_T|i_1=q_i,\lambda)\pi_i\\ &=\sum\limits_{i=1}^N\underbrace{p({\color{blue}o_1}|o_2,\cdots,o_T,{\color{blue}i_1=q_i},\lambda)}_{\color{blue}\text{observation independence}}\underbrace{p(o_2,\cdots,o_T|i_1=q_i,\lambda)}_{\color{blue}\beta_1(i)}\pi_i\\ &=\sum\limits_{i=1}^Nb_i(o_1)\pi_i{\color{blue}\beta_1(i) } \end{aligned}$$
$\beta_1(i)$ can likewise be obtained from a recursion (the dependence on $\lambda$ is omitted for brevity):
$$\begin{aligned} {\color{blue}\beta_t(i)}&=p(o_{t+1},\cdots,o_T|i_t=q_i)\\ &=\sum\limits_{j=1}^Np(o_{t+1},o_{t+2},\cdots,o_T,i_{t+1}=q_j|i_t=q_i)\\ &=\sum\limits_{j=1}^Np(o_{t+1},\cdots,o_T|i_{t+1}=q_j,i_t=q_i)p(i_{t+1}=q_j|i_t=q_i)\\ &=\sum\limits_{j=1}^Np(o_{t+1},\cdots,o_T|i_{t+1}=q_j)a_{ij}\\ &=\sum\limits_{j=1}^Np(o_{t+1}|o_{t+2},\cdots,o_T,i_{t+1}=q_j)p(o_{t+2},\cdots,o_T|i_{t+1}=q_j)a_{ij}\\ &=\sum\limits_{j=1}^Nb_j(o_{t+1})a_{ij}{\color{blue}\beta_{t+1}(j) } \end{aligned}$$
The recursion runs backward in time, starting from the convention $\beta_T(i)=1$.
Both algorithms have complexity $O(TN^2)$, which is a drastic reduction compared with the $O(N^T)$ direct summation.
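The backward recursion can be sketched in the same style. The code below assumes the boundary condition $\beta_T(i)=1$ and the same hypothetical toy parameters and zero-based indices as the earlier examples; the function names are illustrative.

```python
# Hypothetical toy parameters: N = 2 states, M = 3 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

def backward(A, B, obs):
    """Return the table beta, where beta[t][i] corresponds to beta_{t+1}(i) in the text."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N]                          # boundary condition: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        nxt = beta[0]
        # Recursion: beta_t(i) = sum_j b_j(o_{t+1}) * a_ij * beta_{t+1}(j)
        row = [sum(B[j][obs[t + 1]] * A[i][j] * nxt[j] for j in range(N))
               for i in range(N)]
        beta.insert(0, row)
    return beta

def likelihood_backward(pi, A, B, obs):
    """p(O | lambda) = sum_i b_i(o_1) * pi_i * beta_1(i)."""
    beta = backward(A, B, obs)
    return sum(B[i][obs[0]] * pi[i] * beta[0][i] for i in range(len(pi)))

print(likelihood_backward(pi, A, B, [0, 1]))
```

For the observation sequence `[0, 1]` this gives the same likelihood as the forward recursion, which is a convenient consistency check when implementing both.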
Learning
To learn the optimal parameter values we use maximum likelihood estimation (MLE): $\lambda_{MLE}=\mathop{\mathrm{argmax}}\limits_\lambda p(O|\lambda)$
This problem is hard to solve directly, so it is solved iteratively with the EM algorithm (called the Baum-Welch algorithm in this setting). The EM update is: $\theta^{t+1}=\mathop{\mathrm{argmax}}\limits_{\theta}\int_z\log p(X,Z|\theta)p(Z|X,\theta^t)dz$
In the EM update, $X$ is the observed variable, $Z$ is the sequence of hidden variables, and $\theta$ is the model parameter. They correspond to $\color{blue}O,I,\lambda$ here, so:
$$\begin{aligned} \lambda^{t+1}&=\mathop{\mathrm{argmax}}\limits_\lambda\sum\limits_I\log p(O,I|\lambda)p(I|O,\lambda^t)\\ &=\mathop{\mathrm{argmax}}\limits_\lambda\sum\limits_I\log p(O,I|\lambda)\frac{p(O,I|\lambda^t)}{\boxed{p(O|\lambda^t)}} \quad{\color{blue}\text{the boxed term does not depend on $\lambda$}}\\ &=\mathop{\mathrm{argmax}}\limits_\lambda\sum\limits_I\log p(O,I|\lambda)p(O,I|\lambda^t) \end{aligned}$$
Since $p(O|\lambda^t)$ does not depend on $\lambda$, dropping it does not change the maximizer.
From the derivation in the Evaluation section:
$$\begin{aligned}Q(\lambda,\lambda^{t}) &=\sum\limits_I\log p(O,I|\lambda)p(O,I|\lambda^t)\\ &=\sum\limits_I\Big[\log \pi_{i_1}+\sum\limits_{t=2}^T\log a_{i_{t-1},i_t}+\sum\limits_{t=1}^T\log b_{i_t}(o_t)\Big]p(O,I|\lambda^t) \end{aligned}$$
Keeping only the terms that involve $\color{blue}\boxed\pi$:
$$\begin{aligned}\pi^{t+1}&=\mathop{\mathrm{argmax}}\limits_\pi\sum\limits_I[\log \pi_{i_1}p(O,I|\lambda^t)]\\ &=\mathop{\mathrm{argmax}}\limits_\pi\sum\limits_I[\log \pi_{i_1}\cdot p(O,i_1,i_2,\cdots,i_T|\lambda^t)] \\ &=\mathop{\mathrm{argmax}}\limits_\pi\sum\limits_{i_1} \underbrace{ \sum\limits_{i_2}\cdots\sum\limits_{i_T}[\log \pi_{i_1}\cdot p(O,i_1,i_2,\cdots,i_T|\lambda^t)] }_{\color{blue}\text{marginalizes out $i_2,i_3,\cdots,i_T$}}\\ &=\mathop{\mathrm{argmax}}\limits_\pi\sum\limits_{i_1}[\log \pi_{i_1}\cdot p(O,i_1|\lambda^t)] \end{aligned}$$
Here $i_1$ ranges over $N$ states, $i_1=q_i$, and $\pi$ is subject to the constraint $\sum\limits_i\pi_i=1$. Define the Lagrangian:
$$L(\pi,\eta)=\sum\limits_{i=1}^N\log \pi_i\cdot p(O,i_1=q_i|\lambda^t)+\eta\Big(\sum\limits_{i=1}^N\pi_i-1\Big)$$
Setting the partial derivative with respect to $\pi_i$ to zero:
$$\frac{\partial L}{\partial\pi_i}=\frac{1}{\pi_i}p(O,i_1=q_i|\lambda^t)+\eta=0$$
Multiplying this equation by $\pi_i$ and summing over $i$, using $\sum_i\pi_i=1$:
$$\sum\limits_{i=1}^N\big[p(O,i_1=q_i|\lambda^t)+\pi_i\eta\big]=0\quad\Rightarrow\quad\eta=-p(O|\lambda^t)$$
Substituting $\eta$ back gives:
$${\color{blue}\pi_i^{t+1}}=\frac{p(O,i_1=q_i|\lambda^t)}{p(O|\lambda^t)}$$
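The update for $\pi$ can be evaluated with the backward quantities, since $p(O,i_1=q_i|\lambda^t)=\pi_ib_i(o_1)\beta_1(i)$ by the backward derivation. The sketch below performs one such update. It covers only $\pi$ (the source does not derive the updates for $A$ and $B$), and the toy parameters and function names are assumptions.

```python
# Hypothetical current parameters lambda^t: N = 2 states, M = 3 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

def beta_first(A, B, obs):
    """Run the backward recursion down to t = 1 and return beta_1 as a list."""
    N = len(A)
    beta = [1.0] * N                            # beta_T(i) = 1
    for t in range(len(obs) - 2, -1, -1):
        beta = [sum(B[j][obs[t + 1]] * A[i][j] * beta[j] for j in range(N))
                for i in range(N)]
    return beta

def update_pi(pi, A, B, obs):
    """One EM update of the initial distribution: pi_i <- p(O, i_1=q_i) / p(O)."""
    b1 = beta_first(A, B, obs)
    # joint[i] = p(O, i_1 = q_i | lambda^t) = pi_i * b_i(o_1) * beta_1(i)
    joint = [pi[i] * B[i][obs[0]] * b1[i] for i in range(len(pi))]
    p_obs = sum(joint)                          # p(O | lambda^t)
    return [x / p_obs for x in joint]

new_pi = update_pi(pi, A, B, [0, 1])
print(new_pi)
```

The result is automatically normalized, which matches the constraint enforced by the Lagrange multiplier.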
Decoding
The Decoding problem is:
$$I=\mathop{\mathrm{argmax}}\limits_{I}p(I|O,\lambda)$$
That is, find the sequence of states $q_i$ with the highest probability. This can be solved with dynamic programming.
Define:
$$\delta_{t}(i)=\max\limits_{i_1,\cdots,i_{t-1}}p(o_1,\cdots,o_t,i_1,\cdots,i_{t-1},i_t=q_i)$$
Then: $\delta_{t+1}(j)=\max\limits_{1\le i\le N}\delta_t(i)a_{ij}\color{blue}b_j(o_{t+1})$
Each step takes the maximum over the transitions from the previous step. To recover the path, record for each time step the index of the best predecessor state $q_i$:
$$\psi_{t+1}(j)=\mathop{\mathrm{argmax}}\limits_{1\le i\le N}\delta_t(i)a_{ij}$$
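A minimal sketch of the resulting Viterbi procedure follows. The initialization $\delta_1(i)=\pi_ib_i(o_1)$ and the final backtracking through $\psi$ are the standard completion of the recursion and are not spelled out in the text above; the toy parameters and the function name `viterbi` are assumptions.

```python
# Hypothetical toy parameters: N = 2 states, M = 3 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

def viterbi(pi, A, B, obs):
    """Return the most probable state path and its joint probability p(O, I)."""
    N, T = len(pi), len(obs)
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]   # delta_1(i)
    psi = [[0] * N]                                      # psi_1 is unused
    for t in range(1, T):
        d_row, p_row = [], []
        for j in range(N):
            # psi_{t+1}(j) = argmax_i delta_t(i) * a_ij
            best_i = max(range(N), key=lambda i: delta[-1][i] * A[i][j])
            p_row.append(best_i)
            # delta_{t+1}(j) = max_i delta_t(i) * a_ij * b_j(o_{t+1})
            d_row.append(delta[-1][best_i] * A[best_i][j] * B[j][obs[t]])
        delta.append(d_row)
        psi.append(p_row)
    # Backtrack from the best final state.
    last = max(range(N), key=lambda i: delta[-1][i])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.insert(0, psi[t][path[0]])
    return path, delta[-1][last]

path, prob = viterbi(pi, A, B, [0, 1, 2])
print(path, prob)
```

For long sequences the products underflow, so practical implementations usually work with log probabilities and replace the products by sums.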
Summary
An HMM is a dynamic model that combines a mixture model with a time dimension (similar to GMM + Time). For state space models like the HMM, besides the learning task (solved with EM) there are inference tasks, which include:
- Decoding: $p(z_1,z_2,\cdots,z_t|x_1,x_2,\cdots,x_t)$
- Likelihood: $p(X|\theta)$
- Filtering (online): $p(z_t|x_1,\cdots,x_t)$, with $$p(z_t|x_{1:t})=\frac{p(x_{1:t},z_t)}{p(x_{1:t})}=C\alpha_t(z_t)$$
- Smoothing (offline): $p(z_t|x_1,\cdots,x_T)$, with $$p(z_t|x_{1:T})=\frac{p(x_{1:T},z_t)}{p(x_{1:T})}=\frac{\alpha_t(z_t)p(x_{t+1:T}|x_{1:t},z_t)}{p(x_{1:T})}$$ By the conditional independence properties of the graph: $$p(z_t|x_{1:T})=\frac{\alpha_t(z_t)p(x_{t+1:T}|z_t)}{p(x_{1:T})}=C\alpha_t(z_t)\beta_t(z_t)$$ This is the forward-backward algorithm.
- Prediction: $p(z_{t+1},z_{t+2}|x_1,\cdots,x_t)$ and $p(x_{t+1},x_{t+2}|x_1,\cdots,x_t)$. For one step ahead: $$p(z_{t+1}|x_{1:t})=\sum\limits_{z_t}p(z_{t+1},z_t|x_{1:t})=\sum\limits_{z_t}p(z_{t+1}|z_t)p(z_t|x_{1:t})$$ $$p(x_{t+1}|x_{1:t})=\sum\limits_{z_{t+1}}p(x_{t+1},z_{t+1}|x_{1:t})=\sum\limits_{z_{t+1}}p(x_{t+1}|z_{t+1})p(z_{t+1}|x_{1:t})$$
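Filtering, smoothing and one-step prediction all reduce to normalizing the forward and backward tables. The sketch below uses the same hypothetical toy parameters and zero-based indices as the earlier examples; the normalizing constant $C$ is obtained by dividing each vector by its sum, and all function and variable names are illustrative.

```python
# Hypothetical toy parameters: N = 2 states, M = 3 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2]
N, M = len(pi), len(B[0])

def forward(pi, A, B, obs):
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

def backward(A, B, obs):
    beta = [[1.0] * N]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(B[j][obs[t + 1]] * A[i][j] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)

# Filtering: p(z_t | x_{1:t}) is proportional to alpha_t(z_t).
filtered = [normalize(a) for a in alpha]
# Smoothing: p(z_t | x_{1:T}) is proportional to alpha_t(z_t) * beta_t(z_t).
smoothed = [normalize([a * b for a, b in zip(ar, br)]) for ar, br in zip(alpha, beta)]
# Prediction: p(z_{T+1} | x_{1:T}) = sum_z p(z_{T+1} | z) * p(z | x_{1:T}).
predicted_state = [sum(filtered[-1][i] * A[i][j] for i in range(N)) for j in range(N)]
# Prediction: p(x_{T+1} | x_{1:T}) = sum_z p(x_{T+1} | z) * p(z | x_{1:T}).
predicted_obs = [sum(predicted_state[j] * B[j][k] for j in range(N)) for k in range(M)]

print(filtered[-1], smoothed[0], predicted_state, predicted_obs)
```

At the last time step the filtered and smoothed distributions coincide, because no future observations remain.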