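Named entity tagging of text sequences with a CRF (conditional random field) probabilistic graphical model in MindSpore | (1) The relationship between sequence labeling and conditional random fields
Named entity tagging of text sequences with a CRF (conditional random field) probabilistic graphical model in MindSpore | (2) Building the CRF model
Named entity tagging of text sequences with a CRF (conditional random field) probabilistic graphical model in MindSpore | (3) Building a bidirectional LSTM + CRF model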
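I. Implementing the parameterized form of CRF in MindSpore
We first implement the forward (training) part of the CRF layer. The CRF and the loss function are merged, and we choose the negative log likelihood (NLL), which is commonly used for classification problems. This gives: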
$$\text{Loss} = -\log(P(y|x)) \qquad (4)$$
From equation $(1)$, we obtain:
$$\begin{aligned}\text{Loss} &= -\log\left(\frac{\exp(\text{Score}(x, y))}{\sum_{y' \in Y} \exp(\text{Score}(x, y'))}\right) \qquad (5) \\ &= \log\left(\sum_{y' \in Y} \exp(\text{Score}(x, y'))\right) - \text{Score}(x, y)\end{aligned}$$
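Following equation $(5)$, we call the minuend the Normalizer and the subtrahend the Score. Each is implemented separately, and subtracting the Score from the Normalizer gives the final loss.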
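The equivalence of the two forms in equation (5) can be checked numerically. The following is a minimal plain-Python sketch, not MindSpore code; the candidate scores and the helper `log_sum_exp` are hypothetical and introduced only for illustration.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Hypothetical scores Score(x, y') for every candidate label sequence y'.
all_scores = [2.0, 0.5, -1.0]
# Assumption: the first candidate is the correct label sequence y.
gold_score = all_scores[0]

# Direct form of equation (5): -log(exp(Score(x, y)) / sum(exp(Score(x, y'))))
direct = -math.log(math.exp(gold_score) / sum(math.exp(s) for s in all_scores))

# Rewritten form: Normalizer minus Score
normalizer = log_sum_exp(all_scores)
loss = normalizer - gold_score

print(direct, loss)
```

Both values agree up to floating-point error. The rewritten form is preferable in practice because the log-sum-exp can be evaluated stably without exponentiating large scores.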
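1.1. Computing the Score
First, compute the score of the correct label sequence according to equation $(3)$. Note that besides the transition probability matrix $\textbf{P}$, two vectors of size $|T|$ must also be maintained, holding the transition probabilities at the start and at the end of the sequence. We also introduce a mask matrix $mask$ so that the padding values added when several sequences are packed into one batch are ignored, and the $\text{Score}$ computation covers only valid tokens.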
import mindspore.numpy as mnp

def compute_score(emissions, tags, seq_ends, mask, trans, start_trans, end_trans):
    # emissions: (seq_length, batch_size, num_tags)
    # tags: (seq_length, batch_size)
    # seq_ends: (batch_size,), index of the last valid token of each sequence
    # mask: (seq_length, batch_size)
    seq_length, batch_size = tags.shape
    mask = mask.astype(emissions.dtype)

    # Initialize score with the start transition probability
    # shape: (batch_size,)
    score = start_trans[tags[0]]
    # score += emission probability of the first token
    # shape: (batch_size,)
    score += emissions[0, mnp.arange(batch_size), tags[0]]

    for i in range(1, seq_length):
        # Transition probability from tag i-1 to tag i (counted only where mask == 1)
        # shape: (batch_size,)
        score += trans[tags[i - 1], tags[i]] * mask[i]

        # Emission probability of tags[i] (counted only where mask == 1)
        # shape: (batch_size,)
        score += emissions[i, mnp.arange(batch_size), tags[i]] * mask[i]

    # Tag of the last valid token of each sequence
    # shape: (batch_size,)
    last_tags = tags[seq_ends, mnp.arange(batch_size)]
    # score += end transition probability
    # shape: (batch_size,)
    score += end_trans[last_tags]

    return score
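To make the role of the mask concrete, here is a minimal plain-Python sketch of the same computation for a single sequence. It is not the MindSpore implementation: `path_score` and all toy values are hypothetical, and lists stand in for tensors.

```python
def path_score(emissions, tags, mask, trans, start_trans, end_trans):
    """Score of one label sequence; positions with mask == 0 are padding."""
    score = start_trans[tags[0]] + emissions[0][tags[0]]
    for i in range(1, len(tags)):
        step = trans[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
        score += step * mask[i]  # padded steps contribute nothing
    last = sum(mask) - 1  # index of the last valid token
    return score + end_trans[tags[last]]

# Toy values (hypothetical): 2 tags, 3 time steps.
emissions = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
trans = [[0.1, 0.2], [0.3, 0.4]]
start_trans = [0.0, 1.0]
end_trans = [0.5, 0.0]

# All three positions are valid tokens.
full = path_score(emissions, [0, 1, 1], [1, 1, 1], trans, start_trans, end_trans)
# The third position is padding, so its tag and emission are ignored.
padded = path_score(emissions, [0, 1, 0], [1, 1, 0], trans, start_trans, end_trans)
# The unpadded two-token sequence, for comparison.
short = path_score(emissions[:2], [0, 1], [1, 1], trans, start_trans, end_trans)

print(full, padded, short)
```

The padded sequence and the unpadded two-token sequence receive the same score, which is the property the mask is meant to guarantee.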
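1.2. Computing the Normalizer
According to equation $(5)$, the Normalizer is the log-sum-exp of the scores of all possible output sequences for $x$. Computing it by brute force would require scoring every possible output sequence, $|T|^{n}$ results in total for a sequence of $n$ tokens. Instead we use dynamic programming and reuse intermediate results to improve efficiency.
Suppose we need the scores $\text{Score}_{i}$ of all possible output sequences from token $0$ to token $i$. We can first compute the scores $\text{Score}_{i-1}$ of all possible output sequences from token $0$ to token $i-1$. The Normalizer can therefore be rewritten in the following form: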
$$\log\left(\sum_{y'_{0,i} \in Y} \exp(\text{Score}_i)\right) = \log\left(\sum_{y'_{0,i-1} \in Y} \exp(\text{Score}_{i-1} + h_{i} + \textbf{P})\right) \qquad (6)$$
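Here $h_i$ is the emission probability of token $i$ and $\textbf{P}$ is the transition matrix. Because the emission probability matrix $h$ and the transition probability matrix $\textbf{P}$ do not depend on the path taken by $y$, they can be pulled out, which gives: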
$$\log\left(\sum_{y'_{0,i} \in Y} \exp(\text{Score}_i)\right) = \log\left(\sum_{y'_{0,i-1} \in Y} \exp(\text{Score}_{i-1})\right) + h_{i} + \textbf{P} \qquad (7)$$
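Equation (7) is written in compact matrix form. As a reading aid, the same recursion can be spelled out per tag. The symbols below are assumptions introduced here and do not appear in the article: $\alpha_i(k)$ is the log-sum-exp of the scores of all partial paths that end in tag $k$ at token $i$, and $s$ and $e$ are the start and end transition vectors.

```latex
\begin{align}
\alpha_0(k) &= s_k + h_0(k) \\
\alpha_i(k) &= \log\sum_{j \in T} \exp\big(\alpha_{i-1}(j) + \textbf{P}_{j,k} + h_i(k)\big)
             = h_i(k) + \log\sum_{j \in T} \exp\big(\alpha_{i-1}(j) + \textbf{P}_{j,k}\big) \\
\text{Normalizer} &= \log\sum_{k \in T} \exp\big(\alpha_{n-1}(k) + e_k\big)
\end{align}
```

Read this way, only $h_i(k)$ leaves the inner sum, since it does not depend on the previous tag $j$. The term $\textbf{P}_{j,k}$ stays inside the sum over $j$, which corresponds to a log-sum-exp taken over the previous-tag axis of a `(num_tags, num_tags)` score table.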
Following equation (7), the Normalizer is implemented as follows:
import mindspore.numpy as mnp
import mindspore.ops as ops

def compute_normalizer(emissions, mask, trans, start_trans, end_trans):
    # emissions: (seq_length, batch_size, num_tags)
    # mask: (seq_length, batch_size)

    seq_length = emissions.shape[0]

    # Initialize score with the start transition probability plus the first emission probability
    # shape: (batch_size, num_tags)
    score = start_trans + emissions[0]

    for i in range(1, seq_length):
        # Add a dimension to score so it broadcasts over the current tag
        # shape: (batch_size, num_tags, 1)
        broadcast_score = score.expand_dims(2)

        # Add a dimension to the emissions so they broadcast over the previous tag
        # shape: (batch_size, 1, num_tags)
        broadcast_emissions = emissions[i].expand_dims(1)

        # Compute score_i according to equation (7).
        # At this point broadcast_score holds the log_sum_exp of the scores
        # of all possible paths from token 0 to the previous token
        # shape: (batch_size, num_tags, num_tags)
        next_score = broadcast_score + trans + broadcast_emissions

        # Apply log_sum_exp over the previous tag, for use with the next token
        # shape: (batch_size, num_tags)
        next_score = ops.logsumexp(next_score, axis=1)

        # score changes only where mask == 1
        # shape: (batch_size, num_tags)
        score = mnp.where(mask[i].expand_dims(1), next_score, score)

    # Finally add the end transition probability
    # shape: (batch_size, num_tags)
    score += end_trans
    # log_sum_exp over the scores of all possible paths
    # shape: (batch_size,)
    return ops.logsumexp(score, axis=1)
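The dynamic-programming result can be compared against brute-force enumeration on a tiny example. The sketch below uses plain Python lists instead of MindSpore tensors and handles a single unpadded sequence; the function names and toy values are hypothetical.

```python
import itertools
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(emissions, tags, trans, start_trans, end_trans):
    """Score of one complete label sequence (no padding)."""
    score = start_trans[tags[0]] + emissions[0][tags[0]]
    for i in range(1, len(tags)):
        score += trans[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
    return score + end_trans[tags[-1]]

def normalizer_brute_force(emissions, trans, start_trans, end_trans):
    """Enumerate all |T|^n label sequences and log-sum-exp their scores."""
    num_tags = len(start_trans)
    all_paths = itertools.product(range(num_tags), repeat=len(emissions))
    return log_sum_exp([path_score(emissions, p, trans, start_trans, end_trans)
                        for p in all_paths])

def normalizer_forward(emissions, trans, start_trans, end_trans):
    """Dynamic programming: score[k] summarizes all partial paths ending in tag k."""
    num_tags = len(start_trans)
    score = [start_trans[k] + emissions[0][k] for k in range(num_tags)]
    for i in range(1, len(emissions)):
        # The comprehension reads the previous score list before rebinding it.
        score = [log_sum_exp([score[j] + trans[j][k] for j in range(num_tags)])
                 + emissions[i][k]
                 for k in range(num_tags)]
    return log_sum_exp([score[k] + end_trans[k] for k in range(num_tags)])

# Toy values (hypothetical): 2 tags, 3 tokens.
emissions = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
trans = [[0.1, 0.2], [0.3, 0.4]]
start_trans = [0.0, 1.0]
end_trans = [0.5, 0.0]

brute = normalizer_brute_force(emissions, trans, start_trans, end_trans)
forward = normalizer_forward(emissions, trans, start_trans, end_trans)
print(brute, forward)
```

The brute-force version scores $|T|^{n}$ paths, while the forward recursion performs on the order of $n \cdot |T|^{2}$ additions, which is why the dynamic-programming form is used for training.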
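1.3. The Viterbi algorithm
With the forward (training) part complete, the decoding part must be implemented. We choose the Viterbi algorithm, which is suited to finding the optimal path through a sequence. As with the Normalizer, dynamic programming is used to compute the scores of all possible predicted sequences. The difference is that during decoding we must also store, for token $i$, the label that maximizes the score, so that Viterbi can later recover the optimal predicted sequence.
Once the maximum score $\text{Score}$ and the per-token label history $\text{History}$ have been obtained, the Viterbi algorithm gives the formula: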
$$P_{0,i} = \max(P_{0, i-1}) + P_{i-1, i}$$
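To find the most probable sequence from token $0$ to token $i$, we only need to consider the most probable sequence from token $0$ to token $i-1$ together with the most probable label transition from token $i-1$ to token $i$. We therefore solve for each most probable label in reverse order, and together they form the best predicted sequence.
Because of static graph syntax restrictions, the part of the Viterbi algorithm that recovers the best predicted sequence is implemented as a post-processing function and is not included in the CRF layer implemented below.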
import mindspore.numpy as mnp

def viterbi_decode(emissions, mask, trans, start_trans, end_trans):
    # emissions: (seq_length, batch_size, num_tags)
    # mask: (seq_length, batch_size)

    seq_length = mask.shape[0]

    # shape: (batch_size, num_tags)
    score = start_trans + emissions[0]
    history = ()

    for i in range(1, seq_length):
        # shape: (batch_size, num_tags, 1)
        broadcast_score = score.expand_dims(2)
        # shape: (batch_size, 1, num_tags)
        broadcast_emission = emissions[i].expand_dims(1)
        # shape: (batch_size, num_tags, num_tags)
        next_score = broadcast_score + trans + broadcast_emission

        # For the current token, find the previous tag that maximizes the score and store it
        indices = next_score.argmax(axis=1)
        history += (indices,)

        next_score = next_score.max(axis=1)
        # score changes only where mask == 1
        score = mnp.where(mask[i].expand_dims(1), next_score, score)

    score += end_trans

    return score, history
def post_decode(score, history, seq_length):
    # Compute the best predicted sequence from Score and History
    batch_size = seq_length.shape[0]
    # Index of the last valid token of each sequence
    # shape: (batch_size,)
    seq_ends = seq_length - 1

    best_tags_list = []

    # Decode each example in the batch in turn
    for idx in range(batch_size):
        # Find the tag that maximizes the score at the last token
        # and add it to the list holding the best predicted sequence
        best_last_tag = score[idx].argmax(axis=0)
        best_tags = [int(best_last_tag.asnumpy())]

        # Walk the history backwards, looking up the best previous tag
        # for each token and appending it to the list
        for hist in reversed(history[:seq_ends[idx]]):
            best_last_tag = hist[idx][best_tags[-1]]
            best_tags.append(int(best_last_tag.asnumpy()))

        # The tags were collected in reverse order, so restore the forward order
        best_tags.reverse()
        best_tags_list.append(best_tags)

    return best_tags_list
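The two decoding stages (forward pass with stored backpointers, then backtracking) can be sketched for a single unpadded sequence in plain Python and checked against exhaustive search. This is an illustrative sketch, not the MindSpore code; `viterbi`, `path_score` and the toy values are hypothetical.

```python
import itertools

def path_score(emissions, tags, trans, start_trans, end_trans):
    """Score of one complete label sequence (no padding)."""
    score = start_trans[tags[0]] + emissions[0][tags[0]]
    for i in range(1, len(tags)):
        score += trans[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
    return score + end_trans[tags[-1]]

def viterbi(emissions, trans, start_trans, end_trans):
    """Return (best score, best tag sequence) for one sequence."""
    num_tags = len(start_trans)
    score = [start_trans[k] + emissions[0][k] for k in range(num_tags)]
    history = []  # history[i-1][k]: best previous tag if token i has tag k
    for i in range(1, len(emissions)):
        next_score, backptr = [], []
        for k in range(num_tags):
            cands = [score[j] + trans[j][k] for j in range(num_tags)]
            best_prev = max(range(num_tags), key=cands.__getitem__)
            backptr.append(best_prev)
            next_score.append(cands[best_prev] + emissions[i][k])
        history.append(backptr)
        score = next_score
    final = [score[k] + end_trans[k] for k in range(num_tags)]

    # Backtracking: start from the best last tag and follow the backpointers.
    best_last = max(range(num_tags), key=final.__getitem__)
    best_tags = [best_last]
    for backptr in reversed(history):
        best_tags.append(backptr[best_tags[-1]])
    best_tags.reverse()
    return final[best_last], best_tags

# Toy values (hypothetical): 2 tags, 3 tokens.
emissions = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
trans = [[0.1, 0.2], [0.3, 0.4]]
start_trans = [0.0, 1.0]
end_trans = [0.5, 0.0]

best_score, best_tags = viterbi(emissions, trans, start_trans, end_trans)

# Exhaustive search over all label sequences, for comparison.
brute_tags = max(itertools.product(range(2), repeat=3),
                 key=lambda p: path_score(emissions, p, trans, start_trans, end_trans))
print(best_score, best_tags, list(brute_tags))
```

The only change from the Normalizer recursion is that log-sum-exp is replaced by max, with the argmax stored so that the best path can be recovered afterwards.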
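1.4. The CRF layer
With the forward training and decoding code above in place, we assemble them into a complete CRF layer. Since input sequences may be padded, the CRF input must account for each sequence's true length. In addition to the emission matrix and the labels, we therefore add a `seq_length` parameter that passes in the sequence lengths before padding, and implement a `sequence_mask` method that generates the mask matrix.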
import mindspore as ms
import mindspore.nn as nn
import mindspore.ops as ops
import mindspore.numpy as mnp
from mindspore.common.initializer import initializer, Uniform

def sequence_mask(seq_length, max_length, batch_first=False):
    """Generate the mask matrix from the actual sequence lengths and the maximum length."""
    range_vector = mnp.arange(0, max_length, 1, seq_length.dtype)
    result = range_vector < seq_length.view(seq_length.shape + (1,))
    if batch_first:
        # shape: (batch_size, max_length)
        return result.astype(ms.int64)
    # shape: (max_length, batch_size)
    return result.astype(ms.int64).swapaxes(0, 1)

class CRF(nn.Cell):
    def __init__(self, num_tags: int, batch_first: bool = False, reduction: str = 'sum') -> None:
        if num_tags <= 0:
            raise ValueError(f'invalid number of tags: {num_tags}')
        super().__init__()
        if reduction not in ('none', 'sum', 'mean', 'token_mean'):
            raise ValueError(f'invalid reduction: {reduction}')
        self.num_tags = num_tags
        self.batch_first = batch_first
        self.reduction = reduction
        self.start_transitions = ms.Parameter(initializer(Uniform(0.1), (num_tags,)), name='start_transitions')
        self.end_transitions = ms.Parameter(initializer(Uniform(0.1), (num_tags,)), name='end_transitions')
        self.transitions = ms.Parameter(initializer(Uniform(0.1), (num_tags, num_tags)), name='transitions')

    def construct(self, emissions, tags=None, seq_length=None):
        # Without tags the layer decodes; with tags it returns the training loss
        if tags is None:
            return self._decode(emissions, seq_length)
        return self._forward(emissions, tags, seq_length)

    def _forward(self, emissions, tags=None, seq_length=None):
        if self.batch_first:
            batch_size, max_length = tags.shape
            emissions = emissions.swapaxes(0, 1)
            tags = tags.swapaxes(0, 1)
        else:
            max_length, batch_size = tags.shape

        if seq_length is None:
            seq_length = mnp.full((batch_size,), max_length, ms.int64)

        mask = sequence_mask(seq_length, max_length)

        # shape: (batch_size,)
        numerator = compute_score(emissions, tags, seq_length-1, mask, self.transitions, self.start_transitions, self.end_transitions)
        # shape: (batch_size,)
        denominator = compute_normalizer(emissions, mask, self.transitions, self.start_transitions, self.end_transitions)
        # Negative log likelihood per sequence: Normalizer minus Score
        # shape: (batch_size,)
        llh = denominator - numerator

        if self.reduction == 'none':
            return llh
        if self.reduction == 'sum':
            return llh.sum()
        if self.reduction == 'mean':
            return llh.mean()
        # 'token_mean': average over the number of valid tokens
        return llh.sum() / mask.astype(emissions.dtype).sum()

    def _decode(self, emissions, seq_length=None):
        if self.batch_first:
            batch_size, max_length = emissions.shape[:2]
            emissions = emissions.swapaxes(0, 1)
        else:
            # emissions: (seq_length, batch_size, num_tags)
            max_length, batch_size = emissions.shape[:2]

        if seq_length is None:
            seq_length = mnp.full((batch_size,), max_length, ms.int64)

        mask = sequence_mask(seq_length, max_length)

        return viterbi_decode(emissions, mask, self.transitions, self.start_transitions, self.end_transitions)
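To make the mask layout and the `reduction` options concrete, the following plain-Python sketch mirrors `sequence_mask` with nested lists and applies the three aggregating reductions to a pair of per-sequence losses. It is not MindSpore code; the loss values are hypothetical.

```python
def sequence_mask(seq_length, max_length, batch_first=False):
    """List-based counterpart of the mask: 1 for real tokens, 0 for padding."""
    mask = [[1 if pos < length else 0 for pos in range(max_length)]
            for length in seq_length]
    if batch_first:
        return mask  # (batch_size, max_length)
    # Transpose to (max_length, batch_size)
    return [list(column) for column in zip(*mask)]

# Two sequences of true length 3 and 1, padded to length 4.
mask_batch_first = sequence_mask([3, 1], 4, batch_first=True)
mask_time_first = sequence_mask([3, 1], 4)

# Hypothetical per-sequence losses, as returned with reduction='none'.
losses = [2.0, 1.0]
num_valid_tokens = sum(sum(row) for row in mask_batch_first)

reduced = {
    'sum': sum(losses),
    'mean': sum(losses) / len(losses),
    'token_mean': sum(losses) / num_valid_tokens,
}
print(mask_batch_first)
print(mask_time_first)
print(reduced)
```

With `'mean'` the loss is averaged over sequences, whereas `'token_mean'` divides by the number of unpadded tokens, so batches with many short sequences are weighted differently under the two options.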