DL | Sequence Labeling Models: An Overview of How BiLSTM+CRF Works

Author: 南瓜派三蔬 (pinned post)

Categories: 《统计学习方法》笔记 (notes on Statistical Learning Methods), 《DeepLearning》笔记 (notes on Deep Learning). Tags: deep learning, nlp

Article link: https://blog.csdn.net/qq_36810398/article/details/107830748

Contents

1. Introduction to sequence labeling models
2. The BiLSTM+CRF model pipeline
3. Some key questions

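1. Introduction to sequence labeling models

Sequence labeling covers a range of natural language processing tasks, including word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, and semantic role labeling.

Take named entity recognition (NER) as an example: given an input sequence of length N, the task is to assign a label to every element, producing a label sequence that also has length N. Typical labels mark person names, locations, and so on.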
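To make the "length N in, length N out" property concrete, here is a tiny illustration. The sentence and the BIO-style tag names are made up for this example and are not taken from the original article.

```python
# Hypothetical NER example: one label per input character.
# The BIO tag set (B-PER, I-PER, B-LOC, I-LOC, O) is an assumption for illustration.
tokens = list("张三在北京")
labels = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]

# Pair each character with its label.
pairs = list(zip(tokens, labels))
```

Each character gets exactly one label, so both sequences have the same length.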

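2. The BiLSTM+CRF model pipeline

2.1 Why use BiLSTM+CRF

The CRF is a classic sequence labeling model. With the rise of deep learning, models that combine a neural network with a CRF have become widely used, and BiLSTM+CRF is the representative example. A bidirectional LSTM captures context on both sides of each position in the sequence, which improves labeling accuracy.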

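2.2 A typical structure

2.2.1 Data preprocessing

Strings are converted to numbers before they reach the model. Usually each character is replaced by its index in a vocabulary, so a string becomes an array of integers.

For example, for "中华人民共和国", looking up each of the 7 characters in the vocabulary yields 7 integers, say [1,2,3,4,5,6,7].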
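A minimal sketch of this lookup follows. The helper names `build_vocab` and `encode` are hypothetical, and reserving index 0 for padding is an assumption; real vocabularies are built from the whole training corpus.

```python
PAD = "<PAD>"  # assumed padding symbol, reserved at index 0

def build_vocab(texts):
    """Assign each distinct character an integer id in order of first appearance."""
    vocab = {PAD: 0}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(text, vocab):
    """Map a string to the list of its characters' ids."""
    return [vocab[ch] for ch in text]

vocab = build_vocab(["中华人民共和国"])
ids = encode("中华人民共和国", vocab)  # [1, 2, 3, 4, 5, 6, 7]
```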

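2.2.2 Typical structure

Layers such as dropout can also be added to the structure shown in the figure; they are omitted here.
[Figure: typical BiLSTM+CRF structure (image not available)]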
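Since the figure is not available, the following PyTorch sketch shows one plausible layout: embedding, BiLSTM, a linear projection to per-tag emission scores, and a K x K transition matrix for the CRF. The class name, layer names, and sizes are assumptions made for illustration; the original figure may differ in detail.

```python
import torch
import torch.nn as nn

class BiLSTMCRFSketch(nn.Module):
    """Hypothetical layout: embedding -> BiLSTM -> linear -> emission scores."""

    def __init__(self, vocab_size, num_tags, embed_dim=32, hidden_dim=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.hidden2emit = nn.Linear(2 * hidden_dim, num_tags)
        # transition[i, j]: score of moving from tag i to tag j (the CRF parameter)
        self.transition = nn.Parameter(torch.randn(num_tags, num_tags))

    def emit(self, token_ids):
        """token_ids: (b, len) integer tensor -> emission scores of shape (b, len, K)."""
        x = self.dropout(self.embedding(token_ids))
        h, _ = self.lstm(x)              # (b, len, 2 * hidden_dim)
        return self.hidden2emit(h)       # (b, len, K)

model = BiLSTMCRFSketch(vocab_size=50, num_tags=12)
emit_score = model.emit(torch.randint(1, 50, (4, 6)))  # 4 sequences of 6 tokens
```

With 4 sequences of 6 tokens and 12 tags, `emit_score` has shape (4, 6, 12).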

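3. Some key questions

3.1 How the BiLSTM works

Consider an ordinary LSTM with a hidden size of 100. Feeding it the embeddings of "中国人" in order takes 3 time steps and produces 3 vectors, written [L0, L1, L2], which correspond to "中", "国", and "人" respectively.

Feeding the embeddings of "中国人" in reverse order likewise produces 3 vectors over 3 time steps, written [R0, R1, R2]. Here R0 corresponds to "人" and R2 to "中".

Concatenating the forward and backward vectors that belong to the same character gives the output of the BiLSTM layer: "中" maps to [L0, R2], "国" to [L1, R1], and "人" to [L2, R0]. With a hidden size of 100, each character therefore ends up with a 200-dimensional vector.

For a more detailed explanation, see 详解BiLSTM及代码实现 (a detailed explanation of BiLSTM with code).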
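The alignment can be checked with PyTorch's `nn.LSTM`. This is a small sketch with random stand-in embeddings; the embedding size of 8 is an arbitrary assumption, and 100 is the hidden size used in the example above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, hidden = 8, 100
x = torch.randn(1, 3, embed_dim)       # stand-in embeddings for the 3 characters

bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
out, (h_n, _) = bilstm(x)              # out: (1, 3, 2 * hidden)

fwd = out[0, :, :hidden]               # fwd[t] is L_t
bwd = out[0, :, hidden:]               # bwd[t] is the backward vector aligned with position t

# The forward pass ends at the last character, so its final state is L2.
forward_final_is_last = torch.allclose(fwd[-1], h_n[0, 0])
# The backward pass ends at the first character, so its final state (R2) sits at position 0.
backward_final_is_first = torch.allclose(bwd[0], h_n[1, 0])
```

Both checks hold, which is the "[L0, R2] ... [L2, R0]" pairing described above.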

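3.2 How the CRF loss is computed

3.2.1 Score of a label sequence

For an input sequence and one given label sequence, the score is defined as:

S = EmissionScore + TransitionScore

[Image: EmissionScore formula (not available)]
In short, EmissionScore is the sum of the scores that the BiLSTM assigns to the labels in the sequence: for each character, take the entry of the BiLSTM output at the index of that character's label. For example, if the BiLSTM output has shape [4,6,12] (batch size 4, sequence length 6, 12 tags), every character has one score for each of the 12 tags, and emit_score for the real path picks the score at the true tag's index.

[Image: TransitionScore formula (not available)]
In short, TransitionScore is the sum of the transition matrix entries along the sequence: if position i has tag A and position i+1 has tag B, that step contributes the transition score from A to B. It plays a role similar to a transition probability, although it is an unnormalized score.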
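As a sketch of the two sums, the function below scores one label sequence from plain Python lists. It assumes no special START or END tags, which matches the excerpted code later in this article; the numbers are made up.

```python
def path_score(emit, transition, tags):
    """Score of one label sequence.

    emit[i][k]       : emission score of tag k at position i
    transition[a][b] : score of moving from tag a to tag b
    tags             : one tag index per position
    """
    emission = sum(emit[i][t] for i, t in enumerate(tags))
    trans = sum(transition[a][b] for a, b in zip(tags, tags[1:]))
    return emission + trans

# Toy numbers: 3 positions, 2 tags.
emit = [[1.0, 0.5], [0.2, 2.0], [0.3, 0.1]]
transition = [[0.1, 0.4], [0.6, 0.2]]
score = path_score(emit, transition, [0, 1, 0])
```

For the path [0, 1, 0], the emission part is 1.0 + 2.0 + 0.3 and the transition part is 0.4 + 0.6, so the score is 4.3 up to floating-point rounding.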

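3.2.2 CRF loss formula

A given input sequence has many possible label sequences. The goal of training is to make the real sequence's share of the total as large as possible. Writing P = exp(score) for a path, the quantity to maximize is P_RealPath / (P_1 + P_2 + … + P_N), where P_1 … P_N cover all possible paths.
[Image: CRF loss formula (not available)]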
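Taking the negative logarithm of that ratio gives the quantity that is minimized in practice. In the sketch below, S_i denotes the score of the i-th candidate path; this notation is introduced here for illustration. The second form matches `loss = -(total_score - d)` in the code excerpt in 3.2.3, where `d` holds the log of the sum.

```latex
P_i = e^{S_i}, \qquad
\mathrm{Loss} = -\log \frac{P_{\mathrm{RealPath}}}{P_1 + P_2 + \dots + P_N}
             = \log \sum_{i=1}^{N} e^{S_i} \;-\; S_{\mathrm{RealPath}}
```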

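(1) Computing P_RealPath directly
Given the BiLSTM output emit_score, the CRF transition matrix, and a label sequence, the score of that sequence follows from the method in 3.2.1, and P_RealPath = exp(score).

(2) Computing P_1+P_2+…+P_N with dynamic programming
The difficulty here is the number of candidate paths: with K tags and a sequence of length n there are K^n possible label sequences.
Dynamic programming solves this. For the detailed procedure, see The total score of all the paths.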
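The recursion can be sketched in plain Python as follows. This is an illustrative version for a single unpadded sentence, not the batched code from the article; the function names are hypothetical, and the brute-force variant exists only to check the result on a tiny input.

```python
import math
from itertools import product

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(emit, transition):
    """log(P_1 + ... + P_N) over all K**len tag sequences, in O(len * K^2) time."""
    K = len(emit[0])
    d = list(emit[0])  # d[k]: log-sum of exp(score) over partial paths ending in tag k
    for i in range(1, len(emit)):
        d = [logsumexp([d[a] + transition[a][b] for a in range(K)]) + emit[i][b]
             for b in range(K)]
    return logsumexp(d)

def log_partition_brute_force(emit, transition):
    """Same quantity by enumerating every path; only feasible for tiny inputs."""
    K, n = len(emit[0]), len(emit)
    scores = []
    for tags in product(range(K), repeat=n):
        s = sum(emit[i][t] for i, t in enumerate(tags))
        s += sum(transition[a][b] for a, b in zip(tags, tags[1:]))
        scores.append(s)
    return logsumexp(scores)

emit = [[1.0, 0.5], [0.2, 2.0], [0.3, 0.1]]
transition = [[0.1, 0.4], [0.6, 0.2]]
fast = log_partition(emit, transition)
slow = log_partition_brute_force(emit, transition)
```

The two results agree up to floating-point error, while the dynamic programming version never enumerates the K^n paths.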

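3.2.3 Code

The following code is excerpted from bilstm_crf.py. Note that at each step it slices the first n_unfinished rows of the batch, so it relies on the sentences being sorted by decreasing length.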

    def cal_loss(self, tags, mask, emit_score):
        """ Calculate CRF loss
        Args:
            tags (tensor): a batch of tags, shape (b, len)
            mask (tensor): mask for the tags, shape (b, len), values in PAD position is 0
            emit_score (tensor): emit matrix, shape (b, len, K)
        Returns:
            loss (tensor): loss of the batch, shape (b,)
        """
        batch_size, sent_len = tags.shape
        # calculate score for the tags
        score = torch.gather(emit_score, dim=2, index=tags.unsqueeze(dim=2)).squeeze(dim=2)  # shape: (b, len)
        score[:, 1:] += self.transition[tags[:, :-1], tags[:, 1:]]
        # total_score is the score of the real path, i.e. log(P_RealPath)
        total_score = (score * mask.type(torch.float)).sum(dim=1)  # shape: (b,)
        # calculate the scaling factor
        d = torch.unsqueeze(emit_score[:, 0], dim=1)  # shape: (b, 1, K)
        for i in range(1, sent_len):
            n_unfinished = mask[:, i].sum()
            d_uf = d[: n_unfinished]  # shape: (uf, 1, K)
            emit_and_transition = emit_score[: n_unfinished, i].unsqueeze(dim=1) + self.transition  # shape: (uf, K, K)
            log_sum = d_uf.transpose(1, 2) + emit_and_transition  # shape: (uf, K, K)
            max_v = log_sum.max(dim=1)[0].unsqueeze(dim=1)  # shape: (uf, 1, K)
            log_sum = log_sum - max_v  # shape: (uf, K, K)
            d_uf = max_v + torch.logsumexp(log_sum, dim=1).unsqueeze(dim=1)  # shape: (uf, 1, K)
            d = torch.cat((d_uf, d[n_unfinished:]), dim=0)
        d = d.squeeze(dim=1)  # shape: (b, K)
        max_d = d.max(dim=-1)[0]  # shape: (b,)
        # dynamic programming result: d becomes log(P_1 + P_2 + ... + P_N)
        d = max_d + torch.logsumexp(d - max_d.unsqueeze(dim=1), dim=1)  # shape: (b,)
        llk = total_score - d  # shape: (b,)
        loss = -llk  # shape: (b,)
        return loss

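One detail worth noting: the code above twice subtracts the maximum before computing logsumexp. This guards against overflow in exp and can be ignored when following the logic. Since torch.logsumexp is itself numerically stabilized, the explicit subtraction is a redundant safeguard.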
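The effect of subtracting the maximum can be seen with a few lines of plain Python. This is a standalone illustration of the log-sum-exp trick and does not use the article's code.

```python
import math

def naive_logsumexp(xs):
    return math.log(sum(math.exp(x) for x in xs))

def stable_logsumexp(xs):
    m = max(xs)
    # exp(x - m) is at most 1, so it cannot overflow
    return m + math.log(sum(math.exp(x - m) for x in xs))

scores = [1000.0, 999.0, 998.0]
try:
    naive_logsumexp(scores)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True   # math.exp(1000.0) exceeds the float range

stable = stable_logsumexp(scores)
```

The naive version fails on large scores, while the shifted version returns a value slightly above 1000. On small inputs the two agree.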

3.2.4 Once the loss is computed, the model parameters can be updated
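A minimal sketch of one update step follows. It makes several assumptions: random emission scores stand in for the BiLSTM output, only the transition matrix is trained, the sequence is unpadded so no mask is needed, and `crf_loss` is a simplified rewrite for illustration rather than the article's `cal_loss`.

```python
import torch

torch.manual_seed(0)
K, length = 3, 4
emit_score = torch.randn(1, length, K)               # stand-in for the BiLSTM output
transition = torch.nn.Parameter(torch.randn(K, K))   # the only trained parameter here
tags = torch.tensor([[0, 2, 1, 1]])                  # made-up gold labels

def crf_loss(emit_score, transition, tags):
    """Negative log-likelihood for unpadded sequences, shape (b,)."""
    real = emit_score.gather(2, tags.unsqueeze(2)).squeeze(2).sum(dim=1)
    real = real + transition[tags[:, :-1], tags[:, 1:]].sum(dim=1)
    d = emit_score[:, 0]                              # (b, K)
    for i in range(1, emit_score.shape[1]):
        # log-sum over the previous tag of d[prev] + transition[prev, cur], plus emission
        d = torch.logsumexp(d.unsqueeze(2) + transition, dim=1) + emit_score[:, i]
    return torch.logsumexp(d, dim=1) - real

optimizer = torch.optim.SGD([transition], lr=0.1)
loss_before = crf_loss(emit_score, transition, tags).mean()
optimizer.zero_grad()
loss_before.backward()
optimizer.step()
loss_after = crf_loss(emit_score, transition, tags).mean()
```

In a full model the optimizer would receive all parameters (embedding, LSTM, projection, and transition matrix), and the loss would be averaged over each mini-batch.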

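3.3 Inference on a new sample

3.3.1 Approach

Once the model is trained, labeling a new input uses the Viterbi algorithm, which is based on dynamic programming.
For the detailed procedure, see Infer the labels for a new sentence.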
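A compact Viterbi sketch for a single sentence, written in plain Python with back-pointers, is shown below. It is illustrative only: the excerpted `predict` method works on batches and carries the full partial paths instead of back-pointers. The toy numbers are made up.

```python
def viterbi(emit, transition):
    """Return (best_score, best_tags) for one sentence.

    emit[i][k]       : emission score of tag k at position i
    transition[a][b] : score of moving from tag a to tag b
    """
    K = len(emit[0])
    score = list(emit[0])   # score[k]: best score of any partial path ending in tag k
    backptr = []            # backptr[i - 1][b]: best previous tag when position i has tag b
    for i in range(1, len(emit)):
        prev = [max(range(K), key=lambda a: score[a] + transition[a][b]) for b in range(K)]
        score = [score[prev[b]] + transition[prev[b]][b] + emit[i][b] for b in range(K)]
        backptr.append(prev)
    last = max(range(K), key=lambda k: score[k])
    tags = [last]
    for prev in reversed(backptr):   # follow the back-pointers from the end
        tags.append(prev[tags[-1]])
    return score[last], tags[::-1]

emit = [[1.0, 0.5], [0.2, 2.0], [0.3, 0.1]]
transition = [[0.1, 0.4], [0.6, 0.2]]
best_score, best_tags = viterbi(emit, transition)
```

The only change from the sum over all paths is that the log-sum-exp over the previous tag is replaced by a max, with the arg max recorded so the best path can be recovered.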

3.3.2 Code

The following code is excerpted from bilstm_crf.py.

    def predict(self, sentences, sen_lengths):
        """
        Args:
            sentences (tensor): sentences, shape (b, len). Lengths are in decreasing order, len is the length
                                of the longest sentence
            sen_lengths (list): sentence lengths
        Returns:
            tags (list[list[int]]): predicted tag indices for the batch
        """
        batch_size = sentences.shape[0]
        mask = (sentences != self.sent_vocab[self.sent_vocab.PAD])  # shape: (b, len)
        sentences = sentences.transpose(0, 1)  # shape: (len, b)
        sentences = self.embedding(sentences)  # shape: (len, b, e)
        emit_score = self.encode(sentences, sen_lengths)  # shape: (b, len, K)
        tags = [[[i] for i in range(len(self.tag_vocab))]] * batch_size  # list, shape: (b, K, 1)
        d = torch.unsqueeze(emit_score[:, 0], dim=1)  # shape: (b, 1, K)
        for i in range(1, sen_lengths[0]):
            n_unfinished = mask[:, i].sum()
            d_uf = d[: n_unfinished]  # shape: (uf, 1, K)
            emit_and_transition = self.transition + emit_score[: n_unfinished, i].unsqueeze(dim=1)  # shape: (uf, K, K)
            new_d_uf = d_uf.transpose(1, 2) + emit_and_transition  # shape: (uf, K, K)
            d_uf, max_idx = torch.max(new_d_uf, dim=1)
            max_idx = max_idx.tolist()  # list, shape: (uf, K)
            tags[: n_unfinished] = [[tags[b][k] + [j] for j, k in enumerate(max_idx[b])] for b in range(n_unfinished)]
            d = torch.cat((torch.unsqueeze(d_uf, dim=1), d[n_unfinished:]), dim=0)  # shape: (b, 1, K)
        d = d.squeeze(dim=1)  # shape: (b, K)
        _, max_idx = torch.max(d, dim=1)  # shape: (b,)
        max_idx = max_idx.tolist()
        tags = [tags[b][k] for b, k in enumerate(max_idx)]
        return tags
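As excerpted, `predict` assembles its result from integer tag indices, so a caller would typically map them back to tag strings. A minimal sketch, assuming a hypothetical `id2tag` list (the real mapping comes from the tag vocabulary used in training):

```python
# Hypothetical id-to-tag table for illustration only.
id2tag = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

def ids_to_tags(batch_ids, id2tag):
    """Convert a batch of predicted tag-id sequences to tag strings."""
    return [[id2tag[i] for i in seq] for seq in batch_ids]

predicted = [[1, 2, 0], [3, 0]]   # made-up output for two sentences
readable = ids_to_tags(predicted, id2tag)
```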