10.3 The Learning Problem (solving $\lambda_{MLE} = \arg\max_{\lambda} P(O|\lambda)$)
10.3.1 The Supervised Learning Method
Suppose the training data consist of S observation sequences of the same length together with their corresponding state sequences $\{(O_1, I_1), (O_2, I_2), \cdots, (O_S, I_S)\}$. Then the parameters of the hidden Markov model can be estimated by maximum likelihood, as follows:
- Estimate of the transition probability $a_{ij}$:
$$a_{ij} = \frac{A_{ij}}{\sum_{j=1}^{N} A_{ij}} \qquad (10.30)$$
where $A_{ij}$ is the number of times in the samples that the chain is in state $q_i$ at time $t$ and transitions to state $q_j$ at time $t+1$;
- Estimate of the observation probability $b_j(k)$:
$$b_j(k) = \frac{B_{jk}}{\sum_{k=1}^{M} B_{jk}} \qquad (10.31)$$
where $B_{jk}$ is the number of times in the samples that the state is $q_j$ and the corresponding observation is $v_k$;
- The initial state probability $\pi_i$ is estimated as the frequency with which the initial state is $q_i$ among the S samples.
Clearly, the state sequences in such training data usually have to be annotated by hand, which is expensive, so unsupervised methods such as the Baum-Welch algorithm are more practical.
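The counting estimates (10.30) and (10.31) are easy to implement directly. Below is a minimal sketch, not from the original text, assuming the labeled data are given as parallel lists of state-index and observation-index sequences; the names `supervised_mle`, `state_seqs`, and `obs_seqs` are placeholders.

```python
import numpy as np

def supervised_mle(state_seqs, obs_seqs, N, M):
    A_cnt = np.zeros((N, N))    # A_cnt[i, j]: count of transitions q_i -> q_j
    B_cnt = np.zeros((N, M))    # B_cnt[j, k]: count of state q_j emitting v_k
    pi_cnt = np.zeros(N)        # count of initial states
    for states, obs in zip(state_seqs, obs_seqs):
        pi_cnt[states[0]] += 1
        for t in range(len(states) - 1):
            A_cnt[states[t], states[t + 1]] += 1
        for s, o in zip(states, obs):
            B_cnt[s, o] += 1
    # Normalize each row of counts into probabilities, as in (10.30)-(10.31).
    A = A_cnt / A_cnt.sum(axis=1, keepdims=True)
    B = B_cnt / B_cnt.sum(axis=1, keepdims=True)
    pi = pi_cnt / pi_cnt.sum()
    return pi, A, B

# Toy usage with S = 2 labeled sequences (hypothetical data):
pi, A, B = supervised_mle(state_seqs=[[0, 1, 1], [1, 0, 1]],
                          obs_seqs=[[2, 0, 1], [1, 1, 0]], N=2, M=3)
```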
10.3.2 The Baum-Welch Algorithm
If only the observation sequence data $O = (o_1, o_2, \cdots, o_T)$ are available, without the state sequence data $S = (s_1, s_2, \cdots, s_T)$, then the hidden Markov model is a probabilistic model with hidden variables:
$$P(O|\lambda) = \sum_{S} P(O|S,\lambda) P(S|\lambda) \qquad (10.32)$$
Its parameters can be estimated with the EM algorithm; the steps are as follows:
- Determine the complete-data log-likelihood function.
The observed data are $O = (o_1, o_2, \cdots, o_T)$ and the unobserved data are $S = (s_1, s_2, \cdots, s_T)$, so the complete data are $(O, S) = (o_1, o_2, \cdots, o_T, s_1, s_2, \cdots, s_T)$. The complete-data log-likelihood is
$$\log P(O, S|\lambda)$$
where $P(O, S|\lambda) = \pi_{s_1} b_{s_1}(o_1) a_{s_1 s_2} b_{s_2}(o_2) \cdots a_{s_{T-1} s_T} b_{s_T}(o_T)$, which gives
$$\log P(O, S|\lambda) = \log \pi_{s_1} + \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}} + \sum_{t=1}^{T} \log b_{s_t}(o_t) \qquad (10.33^*)$$
- E step of EM: compute the Q function $Q(\lambda, \lambda^{(t)})$
$$Q(\lambda, \lambda^{(t)}) = \sum_{S} P(O, S|\lambda^{(t)}) \log P(O, S|\lambda) \qquad (10.33)$$
where $\lambda^{(t)}$ is the current estimate of the HMM parameters and $\lambda$ is the parameter to be maximized. To simplify the later computation, the Q function can be rewritten by substituting (10.33*):
$$Q(\lambda, \lambda^{(t)}) = \sum_{S} P(O, S|\lambda^{(t)}) \log \pi_{s_1} + \sum_{S} P(O, S|\lambda^{(t)}) \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}} + \sum_{S} P(O, S|\lambda^{(t)}) \sum_{t=1}^{T} \log b_{s_t}(o_t) \qquad (10.34)$$
- M step of EM: maximize the Q function $Q(\lambda, \lambda^{(t)})$ to obtain the model parameters $A, B, \pi$.
(1) Only the first term of (10.34) involves $\pi_{s_1}$, so $\pi$ is updated by maximizing that term alone:
$$\pi^{(t+1)} = \arg\max_{\pi} Q(\lambda, \lambda^{(t)})$$
$$= \arg\max_{\pi} \sum_{S} P(O, S|\lambda^{(t)}) \log \pi_{s_1}$$
$$= \arg\max_{\pi} \sum_{q_1} \sum_{q_2} \cdots \sum_{q_T} P(O, s_1, s_2, \cdots, s_T|\lambda^{(t)}) \log \pi_{s_1}$$
$$= \arg\max_{\pi} \sum_{q_1} P(O, s_1|\lambda^{(t)}) \log \pi_{s_1}$$
$$= \arg\max_{\pi} \sum_{i=1}^{N} P(O, s_1 = q_i|\lambda^{(t)}) \log \pi_i$$
(summing out $s_2, \cdots, s_T$ leaves only the joint probability over $s_1$).
(This maximization is subject to the implicit constraint $\sum_{i=1}^{N} \pi_i = 1$.)
Using the Lagrange multiplier method, first construct $\delta(\pi, \eta_1)$:
$$\delta(\pi, \eta_1) = \sum_{i=1}^{N} P(O, s_1 = q_i|\lambda^{(t)}) \log \pi_i + \eta_1 \left( \sum_{i=1}^{N} \pi_i - 1 \right)$$
Take the partial derivative with respect to $\pi_i$ and set it to zero:
$$\frac{\partial \delta}{\partial \pi_i} = \frac{1}{\pi_i} P(O, s_1 = q_i|\lambda^{(t)}) + \eta_1 = 0 \qquad (10.35)$$
Multiplying through by $\pi_i$:
$$P(O, s_1 = q_i|\lambda^{(t)}) + \eta_1 \pi_i = 0$$
Since $\sum_{i=1}^{N} \pi_i = 1$, sum both sides over $i$:
$$\sum_{i=1}^{N} \left[ P(O, s_1 = q_i|\lambda^{(t)}) + \eta_1 \pi_i \right] = 0$$
$$\sum_{i=1}^{N} P(O, s_1 = q_i|\lambda^{(t)}) + \sum_{i=1}^{N} \eta_1 \pi_i = 0$$
$$P(O|\lambda^{(t)}) + \eta_1 = 0$$
$$\eta_1 = -P(O|\lambda^{(t)}) \qquad (10.35^*)$$
Substituting (10.35*) into (10.35):
$$\frac{1}{\pi_i} P(O, s_1 = q_i|\lambda^{(t)}) - P(O|\lambda^{(t)}) = 0$$
$$\pi_i = \frac{P(O, s_1 = q_i|\lambda^{(t)})}{P(O|\lambda^{(t)})} \qquad (10.36)$$
Since $\pi^{(t+1)} = \arg\max_{\pi} Q(\lambda, \lambda^{(t)})$, the updated value is
$$\pi_i^{(t+1)} = \frac{P(O, s_1 = q_i|\lambda^{(t)})}{P(O|\lambda^{(t)})}$$
which assembles into the full updated initial-state probability vector
$$\pi^{(t+1)} = (\pi_1^{(t+1)}, \pi_2^{(t+1)}, \cdots, \pi_N^{(t+1)})$$
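Since $P(O|\lambda^{(t)}) = \sum_i P(O, s_1 = q_i|\lambda^{(t)})$, the update (10.36) is nothing more than normalizing the joint probabilities over the initial state. A tiny sketch, assuming the vector of joint probabilities has already been computed (for example from the forward-backward variables of section 10.2); the numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical example: joint_s1[i] = P(O, s_1 = q_i | lambda^(t)),
# e.g. obtained as alpha_1(i) * beta_1(i) from a forward-backward pass.
joint_s1 = np.array([0.03, 0.01, 0.02])

# Equation (10.36): pi_i^(t+1) is the joint probability divided by
# P(O | lambda^(t)) = sum_i P(O, s_1 = q_i | lambda^(t)), i.e. a normalization.
pi_next = joint_s1 / joint_s1.sum()

print(pi_next)        # [0.5, 0.1667, 0.3333]
print(pi_next.sum())  # 1.0, so the constraint is satisfied automatically
```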
(2) Only the second term of (10.34) involves $a_{ij}$, so $a_{ij}$ is updated by maximizing that term alone:
$$a_{ij}^{(t+1)} = \arg\max_{a_{ij}} Q(\lambda, \lambda^{(t)})$$
$$= \arg\max_{a_{ij}} \sum_{S} P(O, S|\lambda^{(t)}) \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}}$$
$$= \arg\max_{a_{ij}} \sum_{q_1} \sum_{q_2} \cdots \sum_{q_T} P(O, s_1, s_2, \cdots, s_T|\lambda^{(t)}) \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}}$$
$$= \arg\max_{a_{ij}} \sum_{t=1}^{T-1} \sum_{s_t} \sum_{s_{t+1}} P(O, s_t, s_{t+1}|\lambda^{(t)}) \log a_{s_t s_{t+1}}$$
$$= \arg\max_{a_{ij}} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)}) \log a_{ij}$$
(exchanging the order of summation and marginalizing out all states other than $s_t$ and $s_{t+1}$).
(This maximization is subject to the implicit constraint $\sum_{j=1}^{N} a_{ij} = 1$ for each $i$.)
Using the Lagrange multiplier method, first construct $\delta(a_{ij}, \eta_2)$:
$$\delta(a_{ij}, \eta_2) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)}) \log a_{ij} + \eta_2 \left( \sum_{j=1}^{N} a_{ij} - 1 \right)$$
Take the partial derivative with respect to $a_{ij}$ and set it to zero:
$$\frac{\partial \delta}{\partial a_{ij}} = \frac{1}{a_{ij}} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)}) + \eta_2 = 0 \qquad (10.37)$$
$$\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)}) + \eta_2 a_{ij} = 0$$
Since $\sum_{j=1}^{N} a_{ij} = 1$, sum both sides over $j$:
$$\sum_{j=1}^{N} \left[ \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)}) + \eta_2 a_{ij} \right] = 0$$
$$\sum_{j=1}^{N} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)}) + \sum_{j=1}^{N} \eta_2 a_{ij} = 0$$
$$\sum_{t=1}^{T-1} P(O, s_t = q_i|\lambda^{(t)}) + \eta_2 = 0$$
$$\eta_2 = -\sum_{t=1}^{T-1} P(O, s_t = q_i|\lambda^{(t)}) \qquad (10.37^*)$$
Substituting (10.37*) into (10.37):
$$\frac{1}{a_{ij}} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)}) = \sum_{t=1}^{T-1} P(O, s_t = q_i|\lambda^{(t)})$$
$$a_{ij} = \frac{\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)})}{\sum_{t=1}^{T-1} P(O, s_t = q_i|\lambda^{(t)})}$$
Since $a_{ij}^{(t+1)} = \arg\max_{a_{ij}} Q(\lambda, \lambda^{(t)})$, the updated value is:
$$a_{ij}^{(t+1)} = \frac{\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)})}{\sum_{t=1}^{T-1} P(O, s_t = q_i|\lambda^{(t)})}$$
Finally, the full state-transition matrix A is updated:
$$A^{(t+1)} = \left\{ a_{ij}^{(t+1)} \right\}_{N \times N}$$
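In code, the $a_{ij}$ update is just two array reductions. The sketch below assumes a hypothetical array `xi_joint[t, i, j]` holding $P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)})$ (filled with dummy random numbers here); the denominator is obtained by marginalizing the same array over $j$.

```python
import numpy as np

# Hypothetical example: xi_joint[t, i, j] = P(O, s_t = q_i, s_{t+1} = q_j | lambda^(t))
# for t = 1, ..., T-1 (in practice computed from the forward/backward variables).
T, N = 5, 3
rng = np.random.default_rng(0)
xi_joint = rng.random((T - 1, N, N))

# Numerator: sum over t of the pairwise joint probabilities.
num = xi_joint.sum(axis=0)            # shape (N, N)
# Denominator: sum over t of P(O, s_t = q_i | lambda^(t)), i.e. marginalize over j.
den = xi_joint.sum(axis=(0, 2))       # shape (N,)

A_next = num / den[:, None]           # a_ij^(t+1)
print(A_next.sum(axis=1))             # [1. 1. 1.], each row is a distribution
```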
(3) Only the third term of (10.34) involves $b_j(k)$, so $b_j(k)$ is updated by maximizing that term alone:
$$b_j(k)^{(t+1)} = \arg\max_{b_j(k)} Q(\lambda, \lambda^{(t)})$$
$$= \arg\max_{b_j(k)} \sum_{S} P(O, S|\lambda^{(t)}) \sum_{t=1}^{T} \log b_{s_t}(o_t)$$
$$= \arg\max_{b_j(k)} \sum_{q_1} \sum_{q_2} \cdots \sum_{q_T} P(O, s_1, s_2, \cdots, s_T|\lambda^{(t)}) \sum_{t=1}^{T} \log b_{s_t}(o_t)$$
$$= \arg\max_{b_j(k)} \sum_{t=1}^{T} \sum_{s_t} P(O, s_t|\lambda^{(t)}) \log b_{s_t}(o_t)$$
$$= \arg\max_{b_j(k)} \sum_{j=1}^{N} \sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)}) \log b_j(o_t) \qquad (10.38^*)$$
At this point you may wonder: we are deriving an update for $b_j(k)$, yet the expression only contains $b_j(o_t)$, so where does $k$ come in?
The key observation is that the observation sequence is given, so at each time $t$ only one observation $o_t$ actually occurs; at the same time, the emission probabilities of a state over all possible observations sum to one, i.e. $\sum_{k=1}^{M} b_{s_t}(k) = 1$. Introduce the indicator function $I(o_t = v_k)$, which equals 1 when $o_t = v_k$ and 0 otherwise. With it, $\log b_j(o_t)$ can be rewritten as $\sum_{k=1}^{M} I(o_t = v_k) \log b_j(k)$ (the state $s_t$ has been fixed to $q_j$, and the indicator picks out the observed symbol; this step is worth pausing over). Looking at (10.38*) again, it can then be rewritten in the following form:
$$b_j(k)^{(t+1)} = \arg\max_{b_j(k)} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{M} P(O, s_t = q_j|\lambda^{(t)}) I(o_t = v_k) \log b_j(k) \qquad (10.38^{**})$$
(Now the constraint $\sum_{k=1}^{M} b_j(k) = 1$ can be imposed.)
Using the Lagrange multiplier method, first construct $\delta(b_j(k), \eta_3)$:
$$\delta(b_j(k), \eta_3) = \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{M} P(O, s_t = q_j|\lambda^{(t)}) I(o_t = v_k) \log b_j(k) + \eta_3 \left( \sum_{k=1}^{M} b_j(k) - 1 \right)$$
Take the partial derivative with respect to $b_j(k)$ and set it to zero:
$$\frac{\partial \delta}{\partial b_j(k)} = \frac{1}{b_j(k)} \sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)}) I(o_t = v_k) + \eta_3 = 0 \qquad (10.38)$$
$$\sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)}) I(o_t = v_k) + \eta_3 b_j(k) = 0$$
Since $\sum_{k=1}^{M} b_j(k) = 1$, sum both sides over $k$:
$$\sum_{k=1}^{M} \left[ \sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)}) I(o_t = v_k) + \eta_3 b_j(k) \right] = 0$$
$$\sum_{t=1}^{T} \sum_{k=1}^{M} P(O, s_t = q_j|\lambda^{(t)}) I(o_t = v_k) + \sum_{k=1}^{M} \eta_3 b_j(k) = 0$$
$$\sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)}) + \eta_3 = 0$$
$$\eta_3 = -\sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)}) \qquad (10.38^{***})$$
(using $\sum_{k=1}^{M} I(o_t = v_k) = 1$ and $\sum_{k=1}^{M} b_j(k) = 1$).
Substituting (10.38***) into (10.38):
$$\frac{1}{b_j(k)} \sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)}) I(o_t = v_k) = \sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)})$$
$$b_j(k) = \frac{\sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)}) I(o_t = v_k)}{\sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)})}$$
Since $b_j(k)^{(t+1)} = \arg\max_{b_j(k)} Q(\lambda, \lambda^{(t)})$, the updated value is:
$$b_j(k)^{(t+1)} = \frac{\sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)}) I(o_t = v_k)}{\sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)})}$$
Finally, the full observation probability matrix B is updated:
$$B^{(t+1)} = \left\{ b_j(k)^{(t+1)} \right\}_{N \times M}$$
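The indicator $I(o_t = v_k)$ is simply a boolean mask in code. A sketch of the $b_j(k)$ update under the assumption that `gamma_joint[t, j]` holds $P(O, s_t = q_j|\lambda^{(t)})$ and `obs[t]` is the index of the observed symbol at time $t$ (dummy data here):

```python
import numpy as np

# Hypothetical example: gamma_joint[t, j] = P(O, s_t = q_j | lambda^(t)),
# obs[t] = index k of the observation v_k emitted at time t.
T, N, M = 6, 3, 4
rng = np.random.default_rng(1)
gamma_joint = rng.random((T, N))
obs = rng.integers(0, M, size=T)

# Indicator I(o_t = v_k) as a boolean mask of shape (T, M).
indicator = (obs[:, None] == np.arange(M))

# Numerator: sum gamma_joint only over the times where o_t = v_k;
# denominator: sum gamma_joint over all times.
num = gamma_joint.T @ indicator          # shape (N, M)
den = gamma_joint.sum(axis=0)[:, None]   # shape (N, 1)
B_next = num / den

print(B_next.sum(axis=1))                # each row sums to 1
```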
10.3.3 Parameter Estimation Formulas of the Baum-Welch Algorithm
If the probabilities appearing in equations (10.36) to (10.38) are expressed in terms of $\gamma_t(i)$ and $\xi_t(i,j)$, the corresponding formulas can be written as follows:
(1) For $a_{ij}$:
$$a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad (10.39)$$
$$a_{ij} = \frac{\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)})}{\sum_{t=1}^{T-1} P(O, s_t = q_i|\lambda^{(t)})} \qquad (10.39^*)$$
(10.39*) is shown for comparison with (10.39).
(2) For $b_j(k)$:
$$b_j(k) = \frac{\sum_{t=1, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \qquad (10.40)$$
$$b_j(k)^{(t+1)} = \frac{\sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)}) I(o_t = v_k)}{\sum_{t=1}^{T} P(O, s_t = q_j|\lambda^{(t)})} \qquad (10.40^*)$$
(10.40*) is shown for comparison with (10.40).
(3) For $\pi_i$:
$$\pi_i = \gamma_1(i) \qquad (10.41)$$
$$\pi_i = \frac{P(O, s_1 = q_i|\lambda^{(t)})}{P(O|\lambda^{(t)})} \qquad (10.41^*)$$
(10.41*) is shown for comparison with (10.41).
(4) To summarize $\gamma_t(i)$ and $\xi_t(i,j)$:
$$\gamma_t(i) = \frac{P(O, s_t = q_i|\lambda^{(t)})}{P(O|\lambda^{(t)})}$$
$$\xi_t(i,j) = \frac{P(O, s_t = q_i, s_{t+1} = q_j|\lambda^{(t)})}{P(O|\lambda^{(t)})}$$
These are what $\gamma_t(i)$ and $\xi_t(i,j)$ really are.
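In practice $\gamma_t(i)$ and $\xi_t(i,j)$ are computed from the forward and backward variables of section 10.2, as $\gamma_t(i) = \alpha_t(i)\beta_t(i)/P(O|\lambda)$ and $\xi_t(i,j) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)/P(O|\lambda)$. A minimal sketch, assuming `alpha` and `beta` are already available as `(T, N)` arrays:

```python
import numpy as np

def gamma_xi(alpha, beta, A, B, obs):
    # alpha[t, i], beta[t, i]: forward/backward variables; A: transition matrix;
    # B[j, k] = b_j(v_k); obs[t]: observation index at time t.
    prob_O = alpha[-1].sum()                   # P(O | lambda)
    gamma = alpha * beta / prob_O              # gamma_t(i), shape (T, N)
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # xi_t(i, j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
    xi /= prob_O
    return gamma, xi
```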
Algorithm 10.4 (Baum-Welch algorithm)
Input: observation data $O = (o_1, o_2, \cdots, o_T)$
Output: the HMM model parameters $\lambda$
(1) Initialization. For $n = 0$, choose $a_{ij}^{(0)}, b_j(k)^{(0)}, \pi_i^{(0)}$, giving the initial model $\lambda^{(0)} = (A^{(0)}, B^{(0)}, \pi^{(0)})$.
(2) Recursion. For $n = 1, 2, \cdots$:
$$a_{ij}^{(n+1)} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$
$$b_j(k)^{(n+1)} = \frac{\sum_{t=1, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
$$\pi_i^{(n+1)} = \gamma_1(i)$$
where the quantities on the right-hand side are computed under the current model $\lambda^{(n)}$.
(3) Termination. This yields the model parameters $\lambda^{(n+1)} = (a_{ij}^{(n+1)}, b_j(k)^{(n+1)}, \pi_i^{(n+1)})$.
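Putting the pieces together, here is a compact sketch of Algorithm 10.4 in NumPy. It is an illustrative implementation under stated assumptions, not a reference one: it uses random Dirichlet initialization, runs a fixed number of iterations instead of a convergence test, and does no numerical scaling, so it only suits short toy sequences.

```python
import numpy as np

def forward(pi, A, B, obs):
    # alpha[t, i] = P(o_1, ..., o_t, s_t = q_i | lambda)
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    # beta[t, i] = P(o_{t+1}, ..., o_T | s_t = q_i, lambda)
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch(obs, N, M, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(N))
    A = rng.dirichlet(np.ones(N), size=N)
    B = rng.dirichlet(np.ones(M), size=N)
    for _ in range(n_iter):
        # E step: gamma_t(i) and xi_t(i, j) from the forward/backward variables.
        alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
        prob_O = alpha[-1].sum()
        gamma = alpha * beta / prob_O
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / prob_O
        # M step: updates (10.39), (10.40), (10.41).
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        mask = (np.array(obs)[:, None] == np.arange(M))   # I(o_t = v_k)
        B = (gamma.T @ mask) / gamma.sum(axis=0)[:, None]
        pi = gamma[0]
    return pi, A, B

# Toy usage with a hypothetical observation sequence over M = 2 symbols:
pi, A, B = baum_welch(obs=[0, 1, 0, 0, 1, 1, 0], N=2, M=2)
```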
References
The references for this HMM series of articles:
- Li Hang, Statistical Learning Methods (《统计学习方法》)
- YouTube: shuhuai008's video course on HMM
- YouTube: Xu Yida's machine learning lectures on HMM and EM
- https://www.huaxiaozhuan.com/%E7%BB%9F%E8%AE%A1%E5%AD%A6%E4%B9%A0/chapters/15_HMM.html : Hidden Markov Models
- https://sm1les.com/2019/04/10/hidden-markov-model/ : The Hidden Markov Model (HMM) and Its Three Basic Problems
- For worked examples, see https://www.cnblogs.com/skyme/p/4651331.html : Understanding HMMs in One Article
- https://www.zhihu.com/question/55974064 : the answer by 南屏晚钟
Thanks to the above authors for their contributions to this article; any infringing content will be removed upon request.