Parameters of the HMM
Transition probabilities
The transition probability $P(i_{t+1}=q_j \mid i_{t}=q_i)$ is the probability that the hidden state is $q_i$ at time $t$ and transitions to $q_j$ at time $t+1$, where the hidden states $Q_{hidden}$ take values in $Q=\{q_1, \cdots, q_N\}$.
The figure above shows the transition probability matrix, whose dimension is $N \times N$.
Let this matrix be $A$, and let $A_{ij}$ denote the entry in row $i$, column $j$; then
$A_{ij}=P(i_{t+1}= q_j \mid i_{t} = q_i), \quad q_i \in Q_{hidden}$
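As a concrete sketch, a transition matrix for $N=3$ hidden states might look as follows (the values are illustrative only); each row is a probability distribution over the next hidden state:

```python
import numpy as np

# Illustrative transition matrix for N = 3 hidden states
# A[i, j] = P(i_{t+1} = q_j | i_t = q_i)
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

row_sums = A.sum(axis=1)  # every row sums to 1
```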
Emission probabilities
The emission probability $P(o_t=v_k \mid i_t=q_j)$ is the probability of generating observation $v_k$ when the hidden state at time $t$ is $q_j$, where the observation variables $V_{obs}$ take values in $V=\{v_1,\cdots,v_M\}$.
The figure above shows the emission probability matrix, whose dimension is $N \times M$.
Let this matrix be $B$, and let $B_{jk}$ denote the entry in row $j$, column $k$; then
$B_{jk}=P(o_{t}= v_k \mid i_{t} = q_j), \quad q_j \in Q_{hidden}, \ v_k \in V_{obs}$
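Similarly, an emission matrix for $N=3$ hidden states and $M=2$ observations can be sketched as below (illustrative values). A column $B[:, k]$ collects $P(o_t=v_k \mid i_t=q_j)$ for every hidden state, which is the lookup the Viterbi decoder needs at each step:

```python
import numpy as np

# Illustrative emission matrix: N = 3 hidden states, M = 2 observations
# B[j, k] = P(o_t = v_k | i_t = q_j)
B = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.7, 0.3]])

col_v0 = B[:, 0]  # P(o_t = v_0 | i_t = q_j) for j = 0, 1, 2
```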
Initial hidden-state probabilities
![](https://i-blog.csdnimg.cn/blog_migrate/7764a91afe2342553d143febaafce29b.png)
The initial hidden-state probability $\pi_i = P(i_1=q_i)$ is the probability that the hidden state at time $1$ is $q_i$.
Solving Named Entity Recognition with an HMM
The task is to label every character in the text with a named-entity tag. Suppose there are 7 candidate entity tags: "B-PER" (person), "I-PER" (person), "B-LOC" (location), "I-LOC" (location), "B-ORG" (organization or company), "I-ORG" (organization or company), and "O" (non-entity), where B marks the beginning of an entity and I marks its middle or end.
For named entity recognition, the state sequence is the sequence of entity tags and the observation sequence is the raw text. Hence $Q_{hidden}$ takes its values from the 7 tags above, and $V_{obs}$ takes its values from all the characters in the text (the text's vocabulary).
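The tag set above can be written down directly as the hidden-state alphabet. A minimal sketch of the index mappings (the names `tags`, `tag2idx`, and `idx2tag` are illustrative; the code section below loads the real mappings from JSON files):

```python
# The 7 entity tags that make up Q_hidden
tags = ["B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "O"]
tag2idx = {tag: i for i, tag in enumerate(tags)}  # tag -> index
idx2tag = {i: tag for tag, i in tag2idx.items()}  # index -> tag
```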
For example:
State sequence: B-LOC | I-LOC | I-LOC
Observation sequence: 自 贸 区
The corresponding transition and emission probability matrices are shown below:
For example, $a_{71}$ is the probability that the hidden state is O at time $t$ and transitions to B-PER at time $t+1$.
Here $M$ equals the number of characters in the vocabulary, and $b_{71}$ is the probability of generating the observation "阿" from hidden state O at time $t$.
Parameter learning
- Estimating the initial hidden-state probability $\pi$:
$\hat{\pi}_{q_i}=P(i_1=q_i)=\frac{count(q^{1}_{i})}{count(o_1)}$
That is, the number of times hidden tag $q_i$ appears on the first character of a text (time $1$), as a proportion of the total count of first-character hidden tags.
- Estimating the transition probability matrix $A$:
$\hat{A}_{ij}=P(i_{t+1}= q_j \mid i_{t} = q_i)=\frac{count(q_j\ \text{immediately after}\ q_i)}{count(q_i)}$
That is, the number of times the current character's hidden tag is $q_j$ while the previous character's hidden tag is $q_i$, as a proportion of the total count of $q_i$.
- Estimating the emission probability matrix $B$:
$\hat{B}_{jk}=P(o_{t}= v_k \mid i_{t} = q_j)=\frac{count(q_j\ \text{co-occurring with}\ v_k)}{count(q_j)}$
That is, the number of times the current hidden tag is $q_j$ and the observation is $v_k$, as a proportion of the total count of $q_j$.
In addition, note that every row of the initial, transition, and emission probability matrices sums to 1, i.e.:
$\sum_{i}\pi_{q_i} = 1$
$\sum_{j}A_{ij} = \sum_{j}P(i_{t+1}= q_j \mid i_{t} = q_i) = 1$
$\sum_{k}B_{jk} = \sum_{k}P(o_{t}= v_k \mid i_{t} = q_j) = 1$
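The counting formulas above reduce to a few lines of NumPy: accumulate raw counts, replace zeros with a small $\epsilon$ (so no probability is exactly zero, and taking logs later is safe), and normalize each row. A sketch with made-up transition counts:

```python
import numpy as np

eps = 1e-8
# Made-up transition counts: counts[i, j] = count(q_i immediately followed by q_j)
counts = np.array([[8.0, 2.0, 0.0],
                   [1.0, 6.0, 3.0],
                   [0.0, 0.0, 5.0]])
counts[counts == 0] = eps                           # smoothing: avoid zeros (and log(0) later)
A_hat = counts / counts.sum(axis=1, keepdims=True)  # row-normalize counts into probabilities
```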
Prediction
![](https://i-blog.csdnimg.cn/blog_migrate/7e0821783a42a3c7332c3275275a7fe9.png)
Example:
- Suppose the set of all possible observations is $V_{obs}=\{v_0, v_1\}$;
- the set of all possible hidden states is $Q_{hidden}=\{q_0, q_1, q_2\}$;
- the observed sequence is $O=(o_1=v_0, \ o_2=v_1, \ o_3=v_0)$;
- the parameters $\lambda=(\pi,A,B)$ take the values shown below.
Initialize two scratch tables to hold the intermediate results at each time step, each of size $num\_hidden\_states \times sequence\_length$.
![](https://i-blog.csdnimg.cn/blog_migrate/0f24e0f6fcdbc61fbb0dec73431c9fb1.png)
At time 1, compute $\delta_1(i)=\pi_i b_i(o_1)$ for $i=1,2,3$:
$i=1: \ \delta_1(1)=\pi_1 b_1(o_1)=0.2*0.5=0.10$
$i=2: \ \delta_1(2)=\pi_2 b_2(o_1)=0.4*0.4=0.16$
$i=3: \ \delta_1(3)=\pi_3 b_3(o_1)=0.4*0.7=0.28$
At time 2, compute $\delta_2(i)=\max_j\{\delta_1(j)a_{ji}\}b_i(o_2)$ for $i=1,2,3$, recording the maximizing $j$ in $\psi_2(i)$:
$j=1,2,3; \ i=1: \ \delta_2(1)=\max_j\{\delta_1(j)a_{j1}\}b_1(o_2)=\max\{0.10*0.5,0.16*0.3,0.28*0.2\}*0.5=0.028$
$\psi_2(1)=3$
$j=1,2,3; \ i=2: \ \delta_2(2)=\max_j\{\delta_1(j)a_{j2}\}b_2(o_2)=\max\{0.10*0.2,0.16*0.5,0.28*0.3\}*0.6=0.0504$
$\psi_2(2)=3$
$j=1,2,3; \ i=3: \ \delta_2(3)=\max_j\{\delta_1(j)a_{j3}\}b_3(o_2)=\max\{0.10*0.3,0.16*0.2,0.28*0.5\}*0.3=0.042$
$\psi_2(3)=3$
At time 3, compute $\delta_3(i)=\max_j\{\delta_2(j)a_{ji}\}b_i(o_3)$ for $i=1,2,3$:
$j=1,2,3; \ i=1: \ \delta_3(1)=\max_j\{\delta_2(j)a_{j1}\}b_1(o_3)=\max\{0.028*0.5,0.0504*0.3,0.042*0.2\}*0.5=0.00756$
$\psi_3(1)=2$
$j=1,2,3; \ i=2: \ \delta_3(2)=\max_j\{\delta_2(j)a_{j2}\}b_2(o_3)=\max\{0.028*0.2,0.0504*0.5,0.042*0.3\}*0.4=0.01008$
$\psi_3(2)=2$
$j=1,2,3; \ i=3: \ \delta_3(3)=\max_j\{\delta_2(j)a_{j3}\}b_3(o_3)=\max\{0.028*0.3,0.0504*0.2,0.042*0.5\}*0.7=0.0147$
$\psi_3(3)=3$
![](https://i-blog.csdnimg.cn/blog_migrate/6d415b8f50020cd6ffb931a1c52f1950.png)
Backtracking:
- Find the hidden state that ends the maximum-probability path at the last step, i.e. take the argmax over column $3$ of table $T1$:
$i_3 = \arg\max \ T1[:,time\_step=3]=3$
- Trace backwards through table $T2$, i.e. look up which previous hidden state the current maximum was reached from:
$i_2 = T2[i_3=3,time\_step=3]=3$
$i_1 = T2[i_2=3,time\_step=2]=3$
- Therefore, the most likely hidden-state sequence is:
$I=(q_2, \ q_2, \ q_2)$
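The worked example above can be checked with a short NumPy implementation of the Viterbi recursion. The values of $\pi$, $A$, and $B$ are read off from the computation steps above; states are 0-indexed here, so the decoded path $(q_2, q_2, q_2)$ appears as `[2, 2, 2]`:

```python
import numpy as np

# Parameters lambda = (pi, A, B) from the worked example above
pi = np.array([0.2, 0.4, 0.4])
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.7, 0.3]])
obs = [0, 1, 0]  # O = (v_0, v_1, v_0)

def viterbi(pi, A, B, obs):
    n, T = len(pi), len(obs)
    delta = np.zeros((n, T))           # delta[i, t]: best path probability ending in state i at time t
    psi = np.zeros((n, T), dtype=int)  # psi[i, t]: backpointer to the previous state
    delta[:, 0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[:, t - 1, None] * A         # scores[j, i] = delta_{t-1}(j) * a_{ji}
        psi[:, t] = np.argmax(scores, axis=0)
        delta[:, t] = np.max(scores, axis=0) * B[:, obs[t]]
    # Backtrack from the best final state
    path = [int(np.argmax(delta[:, -1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[path[-1], t]))
    return list(reversed(path)), delta

path, delta = viterbi(pi, A, B, obs)
```

The `delta` table reproduces the hand-computed values (0.10, 0.16, 0.28 at time 1; 0.028, 0.0504, 0.042 at time 2; 0.00756, 0.01008, 0.0147 at time 3), and the backtrace recovers the same path.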
Code
```python
import json
import numpy as np
from tqdm import tqdm

# Load a dictionary from a JSON file
def load_dict(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Read a txt file and load the training data (one dict literal per line)
def load_data(path):
    with open(path, "r", encoding="utf-8") as f:
        return [eval(i) for i in f.readlines()]

class HMM_NER:
    def __init__(self, char2idx_path, tag2idx_path):
        # Load the lookup dictionaries
        # char2idx: character -> token id
        self.char2idx = load_dict(char2idx_path)
        # tag2idx: tag -> token id
        self.tag2idx = load_dict(tag2idx_path)
        # idx2tag: token id -> tag
        self.idx2tag = {v: k for k, v in self.tag2idx.items()}
        # Number of hidden states (entity tags) and observations (vocabulary size)
        self.tag_size = len(self.tag2idx)
        self.vocab_size = len(self.char2idx)
        # Initialize A, B, and pi with zeros
        self.pi = np.zeros([1, self.tag_size])
        self.transition = np.zeros([self.tag_size, self.tag_size])
        self.emission = np.zeros([self.tag_size, self.vocab_size])
        # Small offset to avoid log(0) and zero probabilities
        self.epsilon = 1e-8

    def fit(self, train_dic_path):
        print("Loading data...")
        train_dic = load_data(train_dic_path)
        print("Estimating pi, A and B...")
        self.estimate_parameters(train_dic)
        # Work in log space to avoid numerical underflow
        self.pi = np.log(self.pi)
        self.transition = np.log(self.transition)
        self.emission = np.log(self.emission)
        print("DONE!")

    def estimate_parameters(self, train_dic):
        # Initial matrix:    p(i_1)
        # Transition matrix: p(i_t+1 | i_t)
        # Emission matrix:   p(o_t | i_t)
        for dic in tqdm(train_dic):
            for idx, (char, tag) in enumerate(zip(dic["text"][:-1], dic["label"][:-1])):
                cur_char = self.char2idx[char]  # vocabulary index of the current character
                cur_tag = self.tag2idx[tag]     # tag-set index of the current character's tag
                next_tag = self.tag2idx[dic["label"][idx + 1]]  # tag-set index of the next character's tag
                self.transition[cur_tag, next_tag] += 1  # transition counts
                self.emission[cur_tag, cur_char] += 1    # emission counts
                if idx == 0:
                    self.pi[0, cur_tag] += 1             # initial-state counts
            # The loop above skips the last character; count its emission here
            self.emission[self.tag2idx[dic["label"][-1]], self.char2idx[dic["text"][-1]]] += 1
        # Replace zero counts with a small epsilon
        self.transition[self.transition == 0] = self.epsilon
        self.emission[self.emission == 0] = self.epsilon
        self.pi[self.pi == 0] = self.epsilon
        # Transition probabilities
        self.transition /= np.sum(self.transition, axis=1, keepdims=True)
        # Emission probabilities
        self.emission /= np.sum(self.emission, axis=1, keepdims=True)
        # Initial-state probabilities
        self.pi /= np.sum(self.pi, axis=1, keepdims=True)

    def emission_prob(self, char):
        # Emission probabilities p(observation | state) for one character
        char_token = self.char2idx.get(char, 0)
        # For unknown characters, fall back to a uniform distribution
        if char_token == 0:
            return np.log(np.ones(self.tag_size) / self.tag_size)
        # Otherwise, take the column of the emission matrix for this character
        else:
            return np.ravel(self.emission[:, char_token])

    def viterbi(self, text):
        # Sequence length
        seq_len = len(text)
        # Initialize the T1 and T2 tables
        T1_table = np.zeros([self.tag_size, seq_len])
        T2_table = np.zeros([self.tag_size, seq_len])
        # Emission probabilities at time 1
        start_p_Obs_State = self.emission_prob(text[0])
        # First column: initial probability plus emission probability (log space)
        T1_table[:, 0] = self.pi + start_p_Obs_State
        T2_table[:, 0] = np.nan
        for i in range(1, seq_len):
            # Emission probabilities at the current time step
            p_Obs_State = self.emission_prob(text[i])
            p_Obs_State = np.expand_dims(p_Obs_State, axis=-1)  # tag_size * 1
            # Scores computed at the previous time step
            prev_score = np.expand_dims(T1_table[:, i - 1], axis=0)  # 1 * tag_size
            # Broadcast: combine previous scores, transitions, and emissions
            curr_score = prev_score + self.transition.T + p_Obs_State
            # Store into T1 and T2
            T1_table[:, i] = np.max(curr_score, axis=-1)
            T2_table[:, i] = np.argmax(curr_score, axis=-1)
        # Backtracking
        best_tag_id = int(np.argmax(T1_table[:, -1]))
        best_tags = [best_tag_id]
        for i in range(seq_len - 1, 0, -1):
            best_tag_id = int(T2_table[best_tag_id, i])
            best_tags.append(best_tag_id)
        return list(reversed(best_tags))

    def predict(self, text):
        # Predict and print the decoded tags
        if len(text) == 0:
            raise ValueError("The input text is empty!")
        # Decode with the Viterbi algorithm
        best_tag_id = self.viterbi(text)
        # Print the prediction
        for char, tag_id in zip(text, best_tag_id):
            print(char + "_" + self.idx2tag[tag_id] + "|", end="")
```
References
*Statistical Learning Methods* (《统计学习方法》)
https://www.bilibili.com/video/BV1uJ411u7Ut