Learning in a hidden Markov model can be done by supervised learning or by unsupervised learning, depending on whether the training data contain both the observation sequences and the state sequences, or only the observation sequences. Since supervised learning requires labeled training data, and manual annotation is often expensive, the unsupervised approach is frequently used instead: the Baum-Welch algorithm (that is, the EM algorithm applied to HMMs). Before introducing the learning algorithm, we first go through some probability and expectation computations; these form the basis of the Baum-Welch update formulas.
### Computing Some Probabilities and Expected Values
Using the forward and backward probabilities, we can derive formulas for the probability of a single state and of a pair of states.
2. Given the model $\lambda$ and the observation sequence $O$, the probability of being in state $q_i$ at time $t$, denoted
$$\gamma_t(i) = P(i_t = q_i \mid O, \lambda)$$
First decompose it into a fraction:
$$\gamma_t(i) = \frac{P(i_t = q_i, O \mid \lambda)}{P(O \mid \lambda)} \tag{1}$$
By the definition of the forward probability, we have
$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, i_t = q_i \mid \lambda) = P(i_t = q_i \mid \lambda)\, P(o_1, o_2, \ldots, o_t \mid i_t = q_i, \lambda)$$
The backward probability is defined as
$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid i_t = q_i, \lambda)$$
Multiplying the two gives
$$\alpha_t(i)\,\beta_t(i) = P(i_t = q_i, O \mid \lambda) \tag{2}$$
This result also makes sense directly from the two definitions: $\alpha_t(i)$ accounts for the observations up to time $t$ jointly with the state $i_t = q_i$, while $\beta_t(i)$ accounts for the remaining observations conditioned on that same state; since, given $i_t$, the future observations are independent of the past, their product covers the whole observation sequence $O$ together with $i_t = q_i$.
Summing over $i = 1, 2, \ldots, N$:
$$\sum_{i=1}^N P(i_t = q_i, O \mid \lambda) = P(O \mid \lambda) \tag{3}$$
Substituting (2) and (3) into (1) gives
$$\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^N \alpha_t(j)\,\beta_t(j)} \tag{4}$$
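Formula (4) translates almost literally into NumPy. Here is a small sketch (the helper name and the `(N, T)` array layout for `alpha`/`beta` are my own assumptions, matching the implementation at the end of this post):

```python
import numpy as np

def compute_gamma(alpha, beta):
    """Formula (4): gamma[i, t] = alpha_t(i) * beta_t(i), normalized over states."""
    prod = alpha * beta                            # elementwise alpha_t(i) * beta_t(i)
    return prod / prod.sum(axis=0, keepdims=True)  # divide by sum_j alpha_t(j) * beta_t(j)
```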
3. Given the model $\lambda$ and the observation sequence $O$, the probability of being in state $q_i$ at time $t$ and in state $q_j$ at time $t+1$, denoted
$$\xi_t(i,j) = P(i_t = q_i, i_{t+1} = q_j \mid O, \lambda)$$
It can be computed from the forward and backward probabilities:
$$\xi_t(i,j) = \frac{P(i_t = q_i, i_{t+1} = q_j, O \mid \lambda)}{P(O \mid \lambda)} = \frac{P(i_t = q_i, i_{t+1} = q_j, O \mid \lambda)}{\sum_{i=1}^N \sum_{j=1}^N P(i_t = q_i, i_{t+1} = q_j, O \mid \lambda)}$$
The numerator can be expressed in terms of the forward and backward probabilities:
$$P(i_t = q_i, i_{t+1} = q_j, O \mid \lambda) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$
Thus $\xi_t(i,j)$ can be written as
$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^N \sum_{j=1}^N \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$$
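Again as a sketch, under the same `(N, T)` layout as before and with `A` the $N \times N$ transition matrix, `B` the $N \times M$ emission matrix, and `obs` an array of observed symbol indices (the helper is hypothetical, not from the original post):

```python
import numpy as np

def compute_xi(A, B, obs, alpha, beta):
    """xi[i, j, t] = P(i_t = q_i, i_{t+1} = q_j | O, lambda), for t = 0 .. T-2."""
    N, T = alpha.shape
    xi = np.zeros((N, N, T - 1))
    for t in range(T - 1):
        # numerator of the formula: alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        numer = alpha[:, t, None] * A * B[:, obs[t + 1]][None, :] * beta[:, t + 1][None, :]
        xi[:, :, t] = numer / numer.sum()  # the double sum over i and j
    return xi
```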
4. Summing $\gamma_t(i)$ and $\xi_t(i,j)$ over all time steps yields several useful expected values (a short NumPy sketch of all three follows the list).
(1) Under observation $O$, the expected number of times state $i$ occurs:
$$\sum_{t=1}^T \gamma_t(i)$$
This adds up the probability of being in state $i$ at each time step.
(2) Under observation $O$, the expected number of transitions out of state $i$:
$$\sum_{t=1}^{T-1} \gamma_t(i)$$
A transition out of state $i$ can only happen at times $1, 2, \ldots, T-1$, so this sum omits time $T$ compared with the previous one.
(3) Under observation $O$, the expected number of transitions from state $i$ to state $j$:
$$\sum_{t=1}^{T-1} \xi_t(i,j)$$
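Continuing the sketches above (with `gamma` of shape `(N, T)` and `xi` of shape `(N, N, T-1)`), the three expected values reduce to plain sums:

```python
# Item (1): expected number of times state i occurs, sum_{t=1}^{T} gamma_t(i)
expected_visits = gamma.sum(axis=1)               # shape (N,)

# Item (2): expected number of transitions out of state i, sum_{t=1}^{T-1} gamma_t(i)
expected_departures = gamma[:, :-1].sum(axis=1)   # shape (N,)

# Item (3): expected number of i -> j transitions, sum_{t=1}^{T-1} xi_t(i, j)
expected_transitions = xi.sum(axis=2)             # shape (N, N)
```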
### Baum-Welch Model Parameter Estimation Formulas
I do not yet fully understand the derivation, in particular the Lagrange-duality step, so for now I will present the training procedure, the formulas, and the code directly. The Baum-Welch algorithm is the concrete form the EM algorithm takes when learning hidden Markov models; it was proposed by Baum and Welch.
(1) Initialization
For $n = 0$, choose initial values $a_{ij}^{0}, b_j(k)^{0}, \pi_i^{0}$, giving the model
$$\lambda^0 = (a_{ij}^{0}, b_j(k)^{0}, \pi_i^{0})$$
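The text leaves the choice of initial values open. One common practical choice (my own sketch, not from the original) is random row-stochastic matrices, for example via Dirichlet draws:

```python
import numpy as np

def random_hmm(n_states, n_symbols, seed=0):
    """Random row-stochastic initial model lambda^0 = (A, B, pi)."""
    rng = np.random.default_rng(seed)
    A = rng.dirichlet(np.ones(n_states), size=n_states)    # (N, N) transition matrix
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)   # (N, M) emission matrix
    pi = rng.dirichlet(np.ones(n_states))                  # (N,) initial distribution
    return A, B, pi
```

Random rather than perfectly uniform rows matter here: with uniform parameters all states look identical, and the EM updates cannot break that symmetry.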
(2) Recursion. For $n = 1, 2, \ldots$:
$$a_{ij}^{n+1} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$
$$b_j(k)^{n+1} = \frac{\sum_{t=1,\, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
$$\pi_i^{n+1} = \gamma_1(i)$$
The right-hand sides are evaluated using the observation sequence $O = (o_1, o_2, \ldots, o_T)$ and the current model $\lambda^n = (a_{ij}^{n}, b_j(k)^{n}, \pi_i^{n})$.
These update formulas are fairly intuitive. $a_{ij}^{n+1}$ is a transition probability: the denominator is the expected number of transitions out of state $i$ under observation $O$, and the numerator is the expected number of transitions from state $i$ to state $j$ under $O$; dividing the two gives $a_{ij}^{n+1}$. Likewise, $b_j(k)^{n+1}$ divides the expected number of times the model is in state $j$ while emitting symbol $v_k$ by the expected total number of times it is in state $j$. A compact NumPy version of all three updates appears after step (3).
(3) Termination. The result is the model
$$\lambda^{n+1} = (a_{ij}^{n+1}, b_j(k)^{n+1}, \pi_i^{n+1})$$
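As promised above, here is how the three updates look in NumPy once `gamma` (shape `(N, T)`) and `xi` (shape `(N, N, T-1)`) are available; `obs` is assumed to be an integer array of symbol indices. This is only a sketch of a single iteration; the full implementation below wraps it in a convergence loop:

```python
import numpy as np

def reestimate(gamma, xi, obs, n_symbols):
    """One Baum-Welch re-estimation step (a sketch mirroring the formulas above)."""
    new_A = xi.sum(axis=2) / gamma[:, :-1].sum(axis=1)[:, None]   # a_ij^{n+1}
    new_pi = gamma[:, 0]                                          # pi_i^{n+1} = gamma_1(i)
    new_B = np.zeros((gamma.shape[0], n_symbols))
    for k in range(n_symbols):
        # b_j(k)^{n+1}: time steps where v_k was observed, over all time steps
        new_B[:, k] = gamma[:, obs == k].sum(axis=1) / gamma.sum(axis=1)
    return new_A, new_B, new_pi
```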
### Python Implementation of the Baum-Welch Algorithm
```python
import numpy as np

def baum_welch_train(self, observations, criterion=0.05):
    """EM loop: re-estimate A, B, pi until the parameters stop changing."""
    observations = np.asarray(observations)  # so `observations == lev` gives a boolean mask
    n_states = self.A.shape[0]
    n_samples = len(observations)
    done = False
    while not done:
        # alpha_t(i) = P(o_1 ... o_t, i_t = q_i | lambda), shape (n_states, n_samples)
        alpha = self._forward(observations)
        # beta_t(i) = P(o_{t+1} ... o_T | i_t = q_i, lambda), shape (n_states, n_samples)
        beta = self._backward(observations)

        # E-step: xi[i, j, t] = P(i_t = q_i, i_{t+1} = q_j | O, lambda)
        xi = np.zeros((n_states, n_states, n_samples - 1))
        for t in range(n_samples - 1):
            denom = np.dot(np.dot(alpha[:, t], self.A) * self.B[:, observations[t + 1]],
                           beta[:, t + 1])
            for i in range(n_states):
                numer = alpha[i, t] * self.A[i, :] * self.B[:, observations[t + 1]] * beta[:, t + 1]
                xi[i, :, t] = numer / denom

        # gamma_t(i) = P(i_t = q_i | O, lambda): sum xi over the destination state j
        gamma = np.sum(xi, axis=1)
        # gamma at the final time step has to come from alpha * beta directly
        prod = (alpha[:, n_samples - 1] * beta[:, n_samples - 1]).reshape((-1, 1))
        gamma = np.hstack((gamma, prod / np.sum(prod)))

        # M-step: the re-estimation formulas for pi, A and B
        newpi = gamma[:, 0]
        newA = np.sum(xi, 2) / np.sum(gamma[:, :-1], axis=1).reshape((-1, 1))
        newB = np.copy(self.B)
        num_levels = self.B.shape[1]
        sumgamma = np.sum(gamma, axis=1)
        for lev in range(num_levels):
            mask = observations == lev
            newB[:, lev] = np.sum(gamma[:, mask], axis=1) / sumgamma

        # stop once no parameter moved by more than `criterion`
        if np.max(abs(self.pi - newpi)) < criterion and \
           np.max(abs(self.A - newA)) < criterion and \
           np.max(abs(self.B - newB)) < criterion:
            done = True
        self.A[:], self.B[:], self.pi[:] = newA, newB, newpi
```
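The method above relies on `self.A`, `self.B`, `self.pi` and the `_forward`/`_backward` passes, which are not shown in this post. A minimal toy wrapper to make it runnable might look like the following; the `HMM` class and the `_forward`/`_backward` bodies are my own sketch of the standard recursions, not code from the original source:

```python
import numpy as np

class HMM:
    def __init__(self, A, B, pi):
        # A: (N, N) transitions, B: (N, M) emissions, pi: (N,) initial distribution
        self.A = np.asarray(A, dtype=float)
        self.B = np.asarray(B, dtype=float)
        self.pi = np.asarray(pi, dtype=float)

    def _forward(self, obs):
        # alpha[i, t] = P(o_1 ... o_t, i_t = q_i | lambda)
        T = len(obs)
        alpha = np.zeros((self.A.shape[0], T))
        alpha[:, 0] = self.pi * self.B[:, obs[0]]
        for t in range(1, T):
            alpha[:, t] = (alpha[:, t - 1] @ self.A) * self.B[:, obs[t]]
        return alpha

    def _backward(self, obs):
        # beta[i, t] = P(o_{t+1} ... o_T | i_t = q_i, lambda), with beta[:, T-1] = 1
        T = len(obs)
        beta = np.ones((self.A.shape[0], T))
        for t in range(T - 2, -1, -1):
            beta[:, t] = self.A @ (self.B[:, obs[t + 1]] * beta[:, t + 1])
        return beta

# attach the training method defined above
HMM.baum_welch_train = baum_welch_train

obs = np.array([0, 1, 0, 2, 1, 0, 0, 2])
model = HMM([[0.6, 0.4], [0.3, 0.7]],
            [[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]],
            [0.5, 0.5])
model.baum_welch_train(obs, criterion=0.01)
print(model.A); print(model.B); print(model.pi)
```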