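2.6 Predicting the Labels
In the previous sections we covered the BiLSTM-CRF model and the CRF loss function in detail. You can build your own BiLSTM-CRF model with an open-source framework (Keras, Chainer, TensorFlow, etc.). The most important part of building the model is implementing backpropagation, but there is no need to worry: these frameworks perform backpropagation automatically during training (that is, they compute the gradients and update the model parameters). Moreover, some frameworks already provide a CRF layer, in which case adding one takes only a single line of code.
In this section we look at how to predict the labels of a sentence once the model has been trained.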
Step 1: Emission and transition scores of the BiLSTM-CRF
As before, suppose we have a sentence made up of only 3 words: $\mathbf{x} = [w_0, w_1, w_2]$.
Suppose also that we have already obtained the emission score matrix from the BiLSTM layer and the transition score matrix from the CRF layer, as shown in the tables below:

| | $\mathbf{l_1}$ | $\mathbf{l_2}$ |
|---|---|---|
| $\mathbf{w_0}$ | $x_{01}$ | $x_{02}$ |
| $\mathbf{w_1}$ | $x_{11}$ | $x_{12}$ |
| $\mathbf{w_2}$ | $x_{21}$ | $x_{22}$ |

$x_{ij}$ is the score of word $w_i$ being labeled $l_j$.

| | $\mathbf{l_1}$ | $\mathbf{l_2}$ |
|---|---|---|
| $\mathbf{l_1}$ | $t_{11}$ | $t_{12}$ |
| $\mathbf{l_2}$ | $t_{21}$ | $t_{22}$ |

$t_{ij}$ is the score of a transition from label $l_i$ to label $l_j$.
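To make the following steps easier to try out, the two tables can be held in plain nested lists. This is only a minimal Python sketch: the names `emission` and `transition` and all numeric values are made up for illustration and do not come from a trained model.

```python
# Hypothetical scores for a 3-word sentence and 2 labels (l1 -> index 0, l2 -> index 1).
# emission[i][j] plays the role of x_{i,j+1}: the score of word w_i taking the label with index j.
emission = [
    [0.2, 0.8],  # w0
    [0.1, 0.5],  # w1
    [0.6, 0.3],  # w2
]
# transition[i][j] plays the role of t_{i+1,j+1}: the score of moving
# from the label with index i to the label with index j.
transition = [
    [0.1, 0.4],  # from l1
    [0.3, 0.2],  # from l2
]

num_words = len(emission)
num_labels = len(transition)
```

Note that the text numbers labels from 1 ($l_1$, $l_2$) while list indices start at 0, so $l_1$ corresponds to index 0 and $l_2$ to index 1.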
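Step 2: Start predicting
If you are familiar with the Viterbi algorithm, this part will be easy. If not, don't worry: we will walk through the algorithm step by step. We process the sentence from left to right, as follows:
- $w_0$
- $w_0$ → $w_1$
- $w_0$ → $w_1$ → $w_2$
Two variables are used along the way, obs and previous: previous holds the result of all preceding steps, and obs holds the information about the current word.
$\mathbf{alpha_0}$ records the best historical scores and $\mathbf{alpha_1}$ records the corresponding indices; the details of these two variables are explained step by step later on. For now, look at the figure below: when a dog walks into a forest, it leaves "marks" along the way that help it find its way back. The two variables play the role of those marks.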
$w_0$:
$$obs = [x_{01}, x_{02}]$$
$$previous = None$$
We start by looking at the first word $w_0$. At this point the best label for $w_0$ is straightforward.
For example, if $obs = [x_{01}=0.2, x_{02}=0.8]$, the best label for $w_0$ is $l_2$.
Since there is only one word and therefore no transition between labels, no transition score is involved.
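The first step amounts to taking the index of the largest emission score. A minimal Python sketch, using the values 0.2 and 0.8 from the text; the name `label_names` is an illustrative assumption:

```python
# Emission scores of w0 from the text: x_01 = 0.2, x_02 = 0.8.
obs = [0.2, 0.8]
label_names = ["l1", "l2"]  # index 0 -> l1, index 1 -> l2

# With a single word there is no transition, so the best label is just the argmax of obs.
best_index = max(range(len(obs)), key=lambda j: obs[j])
best_label = label_names[best_index]

# After the first word, previous simply holds the emission scores of w0.
previous = list(obs)
print(best_label)  # l2
```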
$w_0$ → $w_1$:
$$obs = [x_{11}, x_{12}]$$
$$previous = [x_{01}, x_{02}]$$
1) Expand previous into:
$$previous=\left( \begin{matrix} previous[0] &previous[0] \\previous[1]&previous[1]\end{matrix}\right)=\left( \begin{matrix} x_{01} &x_{01} \\x_{02}&x_{02}\end{matrix}\right)$$
2) Expand obs into:
$$obs=\left( \begin{matrix} obs[0] &obs[1] \\obs[0]&obs[1]\end{matrix}\right)=\left( \begin{matrix} x_{11} &x_{12} \\x_{11}&x_{12}\end{matrix}\right)$$
3) Add previous, obs and the transition scores together:
$$scores=\left( \begin{matrix} x_{01} &x_{01} \\x_{02}&x_{02}\end{matrix}\right)+\left( \begin{matrix} x_{11} &x_{12} \\x_{11}&x_{12}\end{matrix}\right)+\left( \begin{matrix} t_{11} &t_{12} \\t_{21}&t_{22}\end{matrix}\right)$$
The result is:
$$scores=\left( \begin{matrix} x_{01}+x_{11}+t_{11} &x_{01} +x_{12}+t_{12}\\x_{02}+x_{11}+t_{21}&x_{02}+x_{12}+t_{22}\end{matrix}\right)$$
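The three matrices are added element by element, so entry (i, j) of scores combines the best score so far for previous label i, the emission score of current label j, and the transition from i to j. A minimal Python sketch of that sum; the function name `pairwise_scores` and the integer values are assumptions chosen only to keep the arithmetic easy to check:

```python
def pairwise_scores(previous, obs, transition):
    """Return scores[i][j] = previous[i] + obs[j] + transition[i][j]."""
    n = len(previous)
    return [[previous[i] + obs[j] + transition[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical integer scores (not from the text).
previous = [2, 8]            # [x_01, x_02]
obs = [1, 5]                 # [x_11, x_12]
transition = [[1, 4],        # [t_11, t_12]
              [3, 2]]        # [t_21, t_22]

scores = pairwise_scores(previous, obs, transition)
print(scores)  # [[4, 11], [12, 15]]
```

Rows correspond to the label of the previous word and columns to the label of the current word, which matches the layout of the expanded matrices.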
You may be wondering how this differs from the earlier section, where we computed the total score of all paths. The difference shows up in the very next step.
Update previous:
$$previous=[max(scores[00],scores[10]),max(scores[01],scores[11])]$$
Suppose, for example, that our scores are:
$$scores=\left( \begin{matrix} x_{01}+x_{11}+t_{11} &x_{01} +x_{12}+t_{12}\\x_{02}+x_{11}+t_{21}&x_{02}+x_{12}+t_{22}\end{matrix}\right)=\left( \begin{matrix} 0.2&0.3\\0.5&0.4\end{matrix}\right)$$
Then the updated value of previous is:
$$previous=[max(scores[00],scores[10]),max(scores[01],scores[11])]=[0.5,0.4]$$
What previous means: for each label, it stores the maximum score of the current word being assigned that label.
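The update is a maximum over each column of scores. A short Python sketch with the example values from the text:

```python
scores = [[0.2, 0.3],
          [0.5, 0.4]]

# For each current label j (a column), keep the best score over all previous labels i (rows).
previous = [max(row[j] for row in scores) for j in range(len(scores[0]))]
print(previous)  # [0.5, 0.4]
```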
[Example: START]
For example:
Suppose the corpus has two labels, $label1(l_1)$ and $label2(l_2)$, whose indices are 0 and 1 respectively.
$previous[0]$ is the maximum score over the paths that end in the 0th label $label1(l_1)$, and $previous[1]$ is the maximum score over the paths that end in the 1st label $label2(l_2)$. In each iteration, the variable $previous$ stores the maximum score of the paths ending in each label. In other words, in each iteration we keep only the best path into each label, $previous=[max(scores[00],scores[10]),max(scores[01],scores[11])]$, and simply discard the information about lower-scoring paths.
[Example: END]
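The same rule can be written as a recurrence. The symbols $p_k$ and $b_k$ are notation introduced here for convenience and do not appear elsewhere in the text: $p_k(l_j)$ stands for the best score of any labeling of $w_0, \dots, w_k$ that ends in label $l_j$, and $b_k(l_j)$ for the label of $w_{k-1}$ on that best path.

$$
\begin{aligned}
p_0(l_j) &= x_{0j} \\
p_k(l_j) &= x_{kj} + \max_i \left[ p_{k-1}(l_i) + t_{ij} \right], \quad k \ge 1 \\
b_k(l_j) &= \arg\max_i \left[ p_{k-1}(l_i) + t_{ij} \right], \quad k \ge 1
\end{aligned}
$$

In these terms, previous holds the values $p_k(l_1), p_k(l_2)$ for the current word. Because $x_{kj}$ does not depend on $i$, taking the maximum of $p_{k-1}(l_i) + x_{kj} + t_{ij}$ over $i$ gives the same result as the second line.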
Back to the main thread:
At the same time, we keep two variables for storing the history (scores and indices): $alpha_0$ and $alpha_1$.
In this iteration, the best scores are added to $alpha_0$. For readability, the best score for each label is underlined:
$$scores=\left( \begin{matrix} x_{01}+x_{11}+t_{11} &x_{01} +x_{12}+t_{12}\\\underline{x_{02}+x_{11}+t_{21}}&\underline{x_{02}+x_{12}+t_{22}}\end{matrix}\right)=\left( \begin{matrix} 0.2&0.3\\\underline{0.5}&\underline{0.4}\end{matrix}\right)$$
$$alpha_0=[(scores[10],scores[11])]=[(0.5,0.4)]$$
The corresponding row indices, that is, the indices of the previous labels that produced these maxima, are saved to $alpha_1$:
$$alpha_1=[(RowIndex(scores[10]),RowIndex(scores[11]))]=[(1,1)]$$
As noted above, the index of $l_1$ is 0 and the index of $l_2$ is 1, so $(1,1)=(l_2,l_2)$. For the current word $w_i$ and its label $l^{(i)}$, this means:
- the maximum score 0.5 is reached when the path is $\underline{l^{(i-1)}=l_2}$ → $\underline{l^{(i)}=l_1}$;
- the maximum score 0.4 is reached when the path is $\underline{l^{(i-1)}=l_2}$ → $\underline{l^{(i)}=l_2}$.
Here $l^{(i-1)}$ is the label of the previous word $w_{i-1}$.
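The bookkeeping for this iteration can be sketched in a few lines of Python. The variable names follow the text; the helper names `best_scores`, `best_prev` and `i_best` are illustrative assumptions. For each current label (a column of scores), the sketch records the maximum and the row in which it occurs:

```python
scores = [[0.2, 0.3],
          [0.5, 0.4]]
num_labels = len(scores)

alpha_0 = []  # best score per current label, one tuple per step
alpha_1 = []  # index of the previous label that gave that best score

best_scores, best_prev = [], []
for j in range(num_labels):                      # j: current label (column)
    column = [scores[i][j] for i in range(num_labels)]
    i_best = max(range(num_labels), key=lambda i: column[i])  # row of the column maximum
    best_scores.append(column[i_best])
    best_prev.append(i_best)

alpha_0.append(tuple(best_scores))
alpha_1.append(tuple(best_prev))
print(alpha_0, alpha_1)  # [(0.5, 0.4)] [(1, 1)]
```

Both maxima sit in row 1, so both entries of the stored tuple are 1, meaning the previous label is $l_2$ in either case.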
$w_0$ → $w_1$ → $w_2$:
$$obs = [x_{21}, x_{22}]$$
$$previous = [0.5, 0.4]$$
1) Expand previous into:
$$previous=\left( \begin{matrix} previous[0]&previous[0]\\previous[1]&previous[1]\end{matrix}\right)=\left( \begin{matrix} 0.5&0.5\\0.4&0.4\end{matrix}\right)$$
2) Expand obs into:
$$obs=\left( \begin{matrix} obs[0]&obs[1]\\obs[0]&obs[1]\end{matrix}\right)=\left( \begin{matrix} x_{21}&x_{22}\\x_{21}&x_{22}\end{matrix}\right)$$
3) Add previous, obs and the transition scores together:
$$scores=\left( \begin{matrix} 0.5&0.5\\0.4&0.4\end{matrix}\right)+\left( \begin{matrix} x_{21}&x_{22}\\x_{21}&x_{22}\end{matrix}\right)+\left( \begin{matrix} t_{11}&t_{12}\\t_{21}&t_{22}\end{matrix}\right)$$
The result is:
$$scores=\left( \begin{matrix} 0.5+x_{21}+t_{11} &0.5+x_{22}+t_{12}\\0.4+x_{21}+t_{21}&0.4+x_{22}+t_{22}\end{matrix}\right)$$
Update previous:
$$previous=[max(scores[00],scores[10]),max(scores[01],scores[11])]$$
Suppose the scores in this round are:
$$scores=\left( \begin{matrix} 0.6&\underline{0.9}\\\underline{0.8}&0.7\end{matrix}\right)$$
Then the updated previous is:
$$previous=[0.8,0.9]$$
In fact, the larger of previous[0] and previous[1] is the score of the best predicted path.
At the same time, the maximum score for each label and its index are appended to $alpha_0$ and $alpha_1$ respectively:
$$alpha_0=[(0.5,0.4),\underline{(scores[10],scores[01])}]=[(0.5,0.4),\underline{(0.8,0.9)}]$$
$$alpha_1=[(1,1),\underline{(1,0)}]$$
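Both iterations follow the same pattern, so the bookkeeping can be wrapped in one helper. A minimal Python sketch, fed with the two example score matrices from the text; the function name `record_step` is an assumption made for illustration:

```python
def record_step(scores, alpha_0, alpha_1):
    """Append the column maxima of scores to alpha_0 and their row indices to alpha_1.

    Returns the updated previous vector (the column maxima as a list).
    """
    n = len(scores)
    # For each current label j, find the previous label i with the highest score.
    best_prev = [max(range(n), key=lambda i, j=j: scores[i][j]) for j in range(n)]
    best_scores = [scores[best_prev[j]][j] for j in range(n)]
    alpha_0.append(tuple(best_scores))
    alpha_1.append(tuple(best_prev))
    return best_scores

alpha_0, alpha_1 = [], []
record_step([[0.2, 0.3], [0.5, 0.4]], alpha_0, alpha_1)             # w0 -> w1
previous = record_step([[0.6, 0.9], [0.8, 0.7]], alpha_0, alpha_1)  # w1 -> w2

print(previous)       # [0.8, 0.9]
print(alpha_0)        # [(0.5, 0.4), (0.8, 0.9)]
print(alpha_1)        # [(1, 1), (1, 0)]
print(max(previous))  # 0.9, the score of the best path
```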
Step 3: Find the path with the highest score
This is the last step. Here $alpha_0$ and $alpha_1$ are used to find the path with the highest score, working backward from the last word to the first.
$w_1$ → $w_2$:
First, look at the last elements of $alpha_0$ and $alpha_1$: $(0.8,0.9)$ and $(1,0)$. The highest path score, 0.9, is obtained when the label is $l_2$. The index of $l_2$ is 1, so we check $(1,0)[1]=0$. The index 0 means that the previous label is $l_1$ (whose index is 0). The best path for $w_1$ → $w_2$ is therefore $l_1$ → $l_2$.
$w_0$ → $w_1$:
We continue moving backward and take the next element of $alpha_1$: $(1,1)$. From the step above we know that the label of $w_1$ is $l_1$ (index 0), so we check $(1,1)[0]=1$. The best path for this part ($w_0$ → $w_1$) is therefore $l_2$ → $l_1$.
We have now obtained the best path: $l_2$ → $l_1$ → $l_2$.
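The backward pass can be sketched as a short loop over $alpha_1$. In this minimal Python sketch the function name `backtrack` and the list `label_names` are illustrative assumptions; the numbers are the ones from the worked example:

```python
def backtrack(final_scores, alpha_1):
    """Recover the best label-index sequence from the stored indices in alpha_1."""
    # Start from the label with the highest final score.
    current = max(range(len(final_scores)), key=lambda j: final_scores[j])
    path = [current]
    # Walk alpha_1 from the last step to the first, following the stored indices.
    for pointers in reversed(alpha_1):
        current = pointers[current]
        path.append(current)
    path.reverse()
    return path

label_names = ["l1", "l2"]   # index 0 -> l1, index 1 -> l2
alpha_1 = [(1, 1), (1, 0)]   # values from the worked example
final_scores = (0.8, 0.9)    # last element of alpha_0

best_path = [label_names[k] for k in backtrack(final_scores, alpha_1)]
print(best_path)  # ['l2', 'l1', 'l2']
```

Only the last element of $alpha_0$ is needed to pick the starting label; every earlier decision is read from $alpha_1$.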
Code
https://github.com/createmomo/CRF-Layer-on-the-Top-of-BiLSTM
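Separately from the linked repository, whose code is not reproduced here, the three steps of this section can be combined into one small function. The following is a minimal pure-Python sketch under stated assumptions: the name `viterbi_decode`, the nested-list inputs and the integer scores are all hypothetical and chosen only so the result is easy to verify by hand.

```python
def viterbi_decode(emission, transition):
    """Return (best_score, best_path) for emission[word][label] and transition[prev][cur]."""
    n = len(transition)
    previous = list(emission[0])          # first word: no transition involved
    backpointers = []                     # plays the same role as alpha_1 in the text
    for obs in emission[1:]:
        step_scores, step_pointers = [], []
        for j in range(n):                # j: label of the current word
            candidates = [previous[i] + obs[j] + transition[i][j] for i in range(n)]
            i_best = max(range(n), key=lambda i: candidates[i])
            step_scores.append(candidates[i_best])
            step_pointers.append(i_best)
        previous = step_scores
        backpointers.append(step_pointers)
    # Backtrack from the best final label.
    current = max(range(n), key=lambda j: previous[j])
    best_score = previous[current]
    path = [current]
    for pointers in reversed(backpointers):
        current = pointers[current]
        path.append(current)
    path.reverse()
    return best_score, path

# Hypothetical integer scores (not from the text): 3 words, 2 labels.
emission = [[2, 8], [1, 5], [6, 3]]
transition = [[1, 4], [3, 2]]
print(viterbi_decode(emission, transition))  # (24, [1, 1, 0])
```

With these made-up scores the best path is index sequence [1, 1, 0], that is $l_2$ → $l_2$ → $l_1$, with a total score of 8 + 2 + 5 + 3 + 6 = 24.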