When estimating the parameters of an HMM, the choice of algorithm depends on whether the state sequence corresponding to each observation sequence is known: with labeled state sequences the parameters can be estimated by a supervised learning algorithm, otherwise by an unsupervised learning algorithm.
1. Supervised estimation of the HMM parameters
Suppose the training data consist of $n$ observation sequences with their corresponding state sequences (sequence lengths may differ): $\{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$. When both the states and the observations are known, each model parameter can be estimated directly from its definition by counting frequencies in every sample and then averaging the counts over all samples of the training set.
1.1 Estimating the transition probabilities $a_{ij}$
Let $A_{ij}$ be the number of times, in the $k$-th sample, that the chain is in state $s_i$ at time $t$ and in state $s_j$ at time $t+1$. The estimate of the transition probability $a_{ij}$ is then:
$$\hat{a}_{ij} = \frac{ \sum_{k=1}^{n} A_{ij} }{ \sum_{k=1}^{n} \sum_{j=1}^{N} A_{ij} }, \qquad i = 1, 2, \cdots, N; \quad j = 1, 2, \cdots, N \tag{1.1}$$
1.2 Estimating the emission probabilities $b_j(o_l)$
Let $B_{jl}$ be the number of times, in the $k$-th sample, that the state is $s_j$ and the observation is $o_l$. The estimate of the emission probability $b_j(o_l)$ is then:
$$\hat{b}_{j}(o_l) = \frac{ \sum_{k=1}^{n} B_{jl} }{ \sum_{k=1}^{n} \sum_{l=1}^{M} B_{jl} }, \qquad j = 1, 2, \cdots, N; \quad l = 1, 2, \cdots, M \tag{1.2}$$
1.3 Estimating the initial state probabilities $\pi_i$
The initial state probability $\pi_i$ is estimated by the frequency with which the $n$ samples start in state $s_i$:
$$\hat{\pi}_i = \frac{1}{n}\sum_{k=1}^{n} I(y_1 = s_i), \qquad i = 1, 2, \cdots, N \tag{1.3}$$

where $I(\cdot)$ is the indicator function and $y_1$ denotes the first state of the $k$-th sample.
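Since all three estimates are plain frequency counts, they map directly onto code. Below is a minimal NumPy sketch under stated assumptions: states and observations are encoded as integer indices, and the helper name `supervised_hmm_mle` is an illustrative choice, not notation from the text. It omits smoothing, so a state that never occurs in the data would cause a division by zero.

```python
import numpy as np

def supervised_hmm_mle(pairs, N, M):
    """Estimate (A, B, pi) by frequency counting on labeled data.

    pairs: list of (X, Y) tuples, where X is a sequence of observation
           indices in {0, ..., M-1} and Y the aligned sequence of state
           indices in {0, ..., N-1}.  Sequence lengths may differ.
    Returns row-normalized A (N x N), B (N x M) and pi (N,).
    """
    A = np.zeros((N, N))   # A[i, j]: count of transitions s_i -> s_j
    B = np.zeros((N, M))   # B[j, l]: count of state s_j emitting o_l
    pi = np.zeros(N)       # pi[i]: count of sequences starting in s_i

    for X, Y in pairs:
        pi[Y[0]] += 1
        for t in range(len(Y) - 1):
            A[Y[t], Y[t + 1]] += 1          # eq. (1.1) numerator counts
        for x_t, y_t in zip(X, Y):
            B[y_t, x_t] += 1                # eq. (1.2) numerator counts

    # Normalize: eq. (1.1) and (1.2) denominators, and eq. (1.3).
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    pi /= pi.sum()
    return A, B, pi
```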
2. Unsupervised estimation of the HMM parameters
Because labeling states is expensive, the more common situation is that only observation sequences are available and the HMM parameters must still be estimated. Suppose the training data consist only of an observation sequence $X$ of length $T$, without the corresponding state sequence $Y$; the goal is to estimate the parameters $\lambda = (A, B, \pi)$ of the hidden Markov model. In this setting the state sequence $Y$ is an unobservable hidden variable, and the HMM is a probabilistic model with hidden variables:
$$P(X \mid \lambda)= \sum_{Y}P(X, Y \mid \lambda) = \sum_{Y}P(X \mid Y, \lambda)P(Y \mid \lambda) \tag{2.1}$$
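For short sequences, the marginalization in Eq. (2.1) can be checked by brute-force enumeration of all state paths. A minimal sketch (the function names are illustrative; `A`, `B`, `pi` are NumPy-style arrays indexed by integer state/observation ids, and the path probability uses the HMM factorization of $P(X, Y \mid \lambda)$ written out further below):

```python
import itertools

def path_joint(X, Y, A, B, pi):
    """P(X, Y | lambda) for one state path Y (HMM factorization)."""
    p = pi[Y[0]] * B[Y[0], X[0]]
    for t in range(1, len(X)):
        p *= A[Y[t - 1], Y[t]] * B[Y[t], X[t]]
    return p

def marginal_likelihood(X, A, B, pi):
    """Eq. (2.1): P(X | lambda) as a sum over all N^T state paths."""
    N = len(pi)
    return sum(path_joint(X, Y, A, B, pi)
               for Y in itertools.product(range(N), repeat=len(X)))
```

In practice $P(X \mid \lambda)$ is computed by the forward algorithm in $O(N^2 T)$ time; the $N^T$-cost enumeration here only mirrors the defining sum, for verification.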
By the EM algorithm, the maximum likelihood estimate of the HMM parameters is:

$$\hat{\lambda} = \argmax_{\lambda} Q(\lambda, \bar{\lambda}) \tag{2.2}$$
$$Q(\lambda, \bar{\lambda}) = \sum_{Y} P(X, Y \mid \bar{\lambda}) \cdot \log P(X, Y \mid \lambda) \tag{2.3}$$
Relative to the usual definition of the $Q$ function, Eq. (2.3) drops the factor $1/P(X \mid \bar{\lambda})$, which is a constant with respect to $\lambda$.
Because, by definition, for an HMM
$$P(X, Y \mid \lambda) = \pi_{i_1}b_{i_1}(x_1) \cdot a_{i_1 i_2}b_{i_2}(x_2) \cdots a_{i_{T-1}i_T}b_{i_T}(x_T)$$
the logarithm of this product splits into a sum, so the terms involving each kind of parameter can be separated and grouped together, and Eq. (2.3) can be rewritten as:
$$Q(\lambda, \bar{\lambda}) = \sum_{Y} \log(\pi_{i_1}) \cdot P(X, Y \mid \bar{\lambda}) + \sum_{Y} \Big[\sum_{t=1}^{T-1} \log(a_{i_t i_{t+1}})\Big] \cdot P(X, Y \mid \bar{\lambda}) + \sum_{Y} \Big[\sum_{t=1}^{T} \log(b_{i_t}(x_t))\Big] \cdot P(X, Y \mid \bar{\lambda})$$
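This rewriting is easy to verify numerically on a tiny model: evaluate Eq. (2.3) path by path and confirm the three groups add up to $Q$. A hedged sketch (names illustrative), assuming strictly positive parameters so that no path has zero probability and $\log 0$ never appears:

```python
import itertools
from math import log

def q_function(X, lam, lam_bar):
    """Brute-force Q(lambda, lambda_bar) of eq. (2.3) and its three-term
    split, to check the decomposition above on a tiny example."""
    A, B, pi = lam            # parameters being optimized
    A_b, B_b, pi_b = lam_bar  # current (fixed) parameters
    N, T = len(pi), len(X)
    q_pi = q_a = q_b = 0.0
    for Y in itertools.product(range(N), repeat=T):
        # weight P(X, Y | lambda_bar), via the factorization above
        w = pi_b[Y[0]] * B_b[Y[0], X[0]]
        for t in range(1, T):
            w *= A_b[Y[t - 1], Y[t]] * B_b[Y[t], X[t]]
        q_pi += w * log(pi[Y[0]])
        q_a += w * sum(log(A[Y[t], Y[t + 1]]) for t in range(T - 1))
        q_b += w * sum(log(B[Y[t], X[t]]) for t in range(T))
    return q_pi + q_a + q_b, (q_pi, q_a, q_b)
```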
Therefore, the maximum likelihood estimation of $\lambda$ can be split into separate maximizations, one for each kind of parameter:
$$\hat{\pi}_i = \argmax_{\pi_i} \sum_{Y} \log(\pi_{i_1}) \cdot P(X, Y \mid \bar{\lambda}) \tag{2.4}$$
$$\hat{a}_{ij} = \argmax_{a_{ij}} \sum_{Y} \Big[\sum_{t=1}^{T-1} \log(a_{i_t i_{t+1}})\Big] \cdot P(X, Y \mid \bar{\lambda}) \tag{2.5}$$
$$\hat{b}_j(k) = \argmax_{b_j(k)} \sum_{Y} \Big[\sum_{t=1}^{T} \log(b_{i_t}(x_t))\Big] \cdot P(X, Y \mid \bar{\lambda}) \tag{2.6}$$
2.1 Estimating the initial state probabilities $\pi_i$
Restating Eq. (2.4), and noting that only the first state $y_1$ of a path enters the $\pi$ term, the sum over full paths $Y$ collapses to a sum over the value of $y_1$:

$$\hat{\pi}_i = \argmax_{\pi_i} \sum_{Y} \log(\pi_{i_1}) \cdot P(X, Y \mid \bar{\lambda}) \tag{2.4}$$

$$\hat{\pi}_i = \argmax_{\pi_i} \sum_{i=1}^{N} \log(\pi_{i}) \cdot P(X, y_1 = s_i \mid \bar{\lambda}) \tag{2.1.1}$$
Note that $\pi_i$ satisfies the constraint $\sum_{i=1}^{N} \pi_i = 1$. Using the method of Lagrange multipliers, the Lagrangian of Eq. (2.1.1) is:
$$\sum_{i=1}^{N} \log(\pi_{i}) \cdot P(X, y_1 = s_i \mid \bar{\lambda}) + \gamma\Big(\sum_{i=1}^{N} \pi_i - 1\Big) \tag{2.1.2}$$
Taking the partial derivative of Eq. (2.1.2) with respect to $\pi_i$ and setting it to zero gives:
$$\frac{P(X, y_1 = s_i \mid \bar{\lambda})}{\pi_{i}} + \gamma = 0$$

$$P(X, y_1 = s_i \mid \bar{\lambda}) + \gamma \pi_{i} = 0 \tag{2.1.3}$$
Summing Eq. (2.1.3) over all possible values of $i$ yields:
$$\sum_{i=1}^{N} \big[P(X, y_1 = s_i \mid \bar{\lambda}) + \gamma \pi_{i}\big] = 0$$
$$\sum_{i=1}^{N} P(X, y_1 = s_i \mid \bar{\lambda}) + \gamma \sum_{i=1}^{N} \pi_{i} = 0 \tag{2.1.4}$$

Since
$$\begin{cases} \sum_{i=1}^{N} P(X, y_1 = s_i \mid \bar{\lambda}) = P(X \mid \bar {\lambda}) \\ \\ \sum_{i=1}^{N} \pi_{i} = 1 \end{cases}$$

it follows that
$$\gamma = -P(X \mid \bar {\lambda}) \tag{2.1.5}$$
Substituting this back into Eq. (2.1.3) gives the maximum likelihood estimate of $\pi_i$:
$$\hat{\pi}_i = \frac{P(X, y_1 = s_i \mid \bar{\lambda})}{P(X \mid \bar {\lambda})} \tag{2.1.6}$$
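Eq. (2.1.6) says the updated $\hat{\pi}_i$ is simply the posterior probability of starting in $s_i$ given the observations. For small models this can be checked by direct enumeration; a sketch (names illustrative), whose output sums to 1 by construction:

```python
import itertools

def pi_hat(X, A_bar, B_bar, pi_bar):
    """Eq. (2.1.6): pi_hat[i] = P(X, y_1 = s_i | lambda_bar) / P(X | lambda_bar)."""
    N, T = len(pi_bar), len(X)
    joint = [0.0] * N        # joint[i] accumulates P(X, y_1 = s_i | lambda_bar)
    for Y in itertools.product(range(N), repeat=T):
        p = pi_bar[Y[0]] * B_bar[Y[0], X[0]]
        for t in range(1, T):
            p *= A_bar[Y[t - 1], Y[t]] * B_bar[Y[t], X[t]]
        joint[Y[0]] += p
    total = sum(joint)       # P(X | lambda_bar), as in eq. (2.1)
    return [j / total for j in joint]
```

In practice this posterior is exactly $\gamma_1(i)$ from the forward-backward algorithm, which avoids the exponential enumeration.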
2.2 Estimating the transition probabilities $a_{ij}$
Restating Eq. (2.5):

$$\hat{a}_{ij} = \argmax_{a_{ij}} \sum_{Y} \Big[\sum_{t=1}^{T-1} \log(a_{i_t i_{t+1}})\Big] \cdot P(X, Y \mid \bar{\lambda}) \tag{2.5}$$
Grouping the paths $Y$ by the pair of states occupied at times $t$ and $t+1$, the path sum becomes:

$$\hat{a}_{ij} = \argmax_{a_{ij}} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \log(a_{ij}) \cdot P(X, y_t = s_i, y_{t+1} = s_j \mid \bar{\lambda}) \tag{2.2.1}$$
As in the previous subsection, applying the method of Lagrange multipliers with the constraint $\sum_{j=1}^{N} a_{ij} = 1$ gives the maximum likelihood estimate of $a_{ij}$:
$$\hat{a}_{ij} = \frac{ \sum_{t=1}^{T-1} P(X, y_t = s_i, y_{t+1} = s_j \mid \bar{\lambda}) }{ \sum_{t=1}^{T-1} P(X, y_t = s_i \mid \bar{\lambda}) } \tag{2.2.2}$$
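The omitted intermediate step mirrors Section 2.1; briefly sketched: fixing the row index $i$, stationarity of the Lagrangian in $a_{ij}$ requires

$$\frac{\sum_{t=1}^{T-1} P(X, y_t = s_i, y_{t+1} = s_j \mid \bar{\lambda})}{a_{ij}} + \gamma = 0$$

and summing over $j$, using $\sum_{j=1}^{N} a_{ij} = 1$ together with $\sum_{j=1}^{N} P(X, y_t = s_i, y_{t+1} = s_j \mid \bar{\lambda}) = P(X, y_t = s_i \mid \bar{\lambda})$, gives

$$\gamma = -\sum_{t=1}^{T-1} P(X, y_t = s_i \mid \bar{\lambda})$$

Substituting back yields Eq. (2.2.2).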
2.3 Estimating the emission probabilities $b_j(k)$
Restating Eq. (2.6):

$$\hat{b}_j(k) = \argmax_{b_j(k)} \sum_{Y} \Big[\sum_{t=1}^{T} \log(b_{i_t}(x_t))\Big] \cdot P(X, Y \mid \bar{\lambda}) \tag{2.6}$$
Grouping the paths by the state occupied at time $t$:

$$\hat{b}_j(k) = \argmax_{b_j(k)} \sum_{j=1}^{N} \Big[\sum_{t=1}^{T} \log(b_{j}(x_t))\Big] \cdot P(X, y_t = s_j \mid \bar{\lambda}) \tag{2.3.1}$$
Lagrange multipliers apply once more, now with the constraint $\sum_{k=1}^{M} b_j(k) = 1$. Note that the partial derivative of $\log b_j(x_t)$ with respect to $b_j(k)$ is nonzero only at the times $t$ where $x_t = o_k$, so only those time steps contribute to the numerator; the maximum likelihood estimate of $b_j(k)$ is:
$$\hat{b}_j(k) = \frac{ \sum_{t=1}^{T} P(X, y_t = s_j \mid \bar{\lambda}) \, I(x_t = o_k) }{ \sum_{t=1}^{T} P(X, y_t = s_j \mid \bar{\lambda}) } \tag{2.3.2}$$
2.4 The Baum-Welch algorithm
Input: an observation sequence of the random process.
Output: maximum likelihood estimates of the hidden Markov model parameters.
(1) Initialization
For $n = 0$, pick any values $a_{ij}^{(0)}, \ b_{j}(k)^{(0)}, \ \pi_i^{(0)}$ within their admissible ranges, giving the initial model parameters $\lambda^{(0)} = (A^{(0)}, B^{(0)}, \pi^{(0)})$;
(2) Iterative training
$$a_{ij}^{(n+1)} = \frac{ \sum_{t=1}^{T-1} \xi_t(i,j \mid X, \lambda^{(n)}) }{ \sum_{t=1}^{T-1} \gamma_t(i \mid X, \lambda^{(n)}) }$$
$$b_j(k)^{(n+1)} = \frac{ \sum_{t=1, \ x_t = o_k}^{T} \gamma_t(j \mid X, \lambda^{(n)}) }{ \sum_{t=1}^{T} \gamma_t(j \mid X, \lambda^{(n)}) }$$
$$\pi_i^{(n+1)} = \gamma_1(i \mid X, \lambda^{(n)})$$
Here $\xi_t(i,j)$ and $\gamma_t(i)$ are derived from the forward and backward algorithms for HMMs; for the detailed derivation see Section 4 of the author's article 隐马尔科夫模型(HMM):计算观测序列的出现概率 (Hidden Markov Models: computing the probability of an observation sequence).
(3) Termination
When $\lambda^{(n+1)}$ barely changes between iterations, i.e. the change falls below a given threshold (convergence), stop training. The maximum likelihood estimates of the model parameters are:

$$\begin{cases} \hat{a}_{ij} = a_{ij}^{(n+1)} \\ \\ \hat{b}_j(k) = b_j(k)^{(n+1)} \\ \\ \hat{\pi}_i = \pi_i^{(n+1)} \end{cases}$$
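Putting steps (1)–(3) together, below is a compact single-sequence sketch of the algorithm in NumPy. It is an illustration under stated assumptions, not a reference implementation: the function names are my own, the forward/backward passes are unscaled (real implementations rescale or work in log space to avoid underflow on long sequences), and nothing guards against states that receive zero posterior mass.

```python
import numpy as np

def forward_backward(X, A, B, pi):
    """Unscaled forward/backward passes (fine for short sequences).
    alpha[t, i] = P(x_1..x_t, y_t = s_i | lambda)
    beta[t, i]  = P(x_{t+1}..x_T | y_t = s_i, lambda)"""
    T, N = len(X), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, X[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, X[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, X[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch(X, N, M, n_iter=100, tol=1e-6, seed=0):
    """Single-sequence Baum-Welch re-estimation of lambda = (A, B, pi)."""
    rng = np.random.default_rng(seed)
    # (1) Initialization: arbitrary row-stochastic starting values.
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(N); pi /= pi.sum()
    X = np.asarray(X)
    for _ in range(n_iter):
        # E-step: posteriors gamma_t(i) and xi_t(i, j) under lambda^(n).
        alpha, beta = forward_backward(X, A, B, pi)
        px = alpha[-1].sum()                     # P(X | lambda^(n))
        gamma = alpha * beta / px
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, X[1:]].T * beta[1:])[:, None, :]) / px
        # (2) M-step: the three update equations above.
        A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B_new = np.stack([gamma[X == k].sum(axis=0) for k in range(M)], axis=1)
        B_new /= gamma.sum(axis=0)[:, None]
        pi_new = gamma[0]
        # (3) Termination: stop once the parameters stop moving.
        delta = max(np.abs(A_new - A).max(), np.abs(B_new - B).max(),
                    np.abs(pi_new - pi).max())
        A, B, pi = A_new, B_new, pi_new
        if delta < tol:
            break
    return A, B, pi
```

For example, `baum_welch([0, 1, 0, 2, 1], N=2, M=3)` fits a 2-state model over a 3-symbol alphabet; because EM only finds a local maximum, the result depends on the random initialization chosen in step (1).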