[Transformers in Practice 3] LEBert-CRF for Chinese Flat Named Entity Recognition (Flat NER)

应有光

Last modified: 2022-07-09 14:40:22

First published: 2022-07-08 17:34:14

Article link: https://blog.csdn.net/zeiyousao/article/details/125644839

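LEBert stands for Lexicon Enhanced Bert, that is, Bert augmented with a lexicon. Entity boundaries in NER are also word-segmentation boundaries, so bringing in word-boundary information is a natural way to improve NER performance.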

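The article at https://mp.weixin.qq.com/s/1MxTx10_lA5iFvBqkX_Q3A explains the model well, but for my own purposes I want to work through it in more detail so that I do not forget the model a few days from now.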

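1. Task objective

The goal is to use a pretrained language model, together with an additional lexicon, to support the downstream task of flat named entity recognition on Chinese text.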

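2. Model structure

See the paper for the full explanation. The key step is how to fuse the word-level information from the lexicon into the usual character-level representation; here an adapter structure performs the feature fusion. Where the adapter is placed is also worth considering; see the original paper for details.
[Figure: model structure (image not included)]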

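3. Datasets

The model uses four Chinese datasets, all annotated with flat entities. The annotation scheme is BMES-like, but the labels actually use the four letters "BIOS":

B: first character of a multi-character entity
I: non-initial character of a multi-character entity
S: single-character entity
O: non-entity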

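For example, label.txt for the ontonote4 dataset looks like this:
[Figure: contents of label.txt (image not included)]
The corpus and its annotations are stored as JSON files in the format below. Each sentence has already been split into characters, and each character has a corresponding label.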

{"text": ["上", "海", "浦", "东", "开", "发", "与", "法", "制", "建", "设", "同", "步"], "label": ["B-GPE", "I-GPE", "B-GPE", "I-GPE", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
{"text": ["新", "华", "社", "上", "海", "二", "月", "十", "日", "电", "（", "记", "者", "谢", "金", "虎", "、", "张", "持", "坚", "）"], "label": ["B-ORG", "I-ORG", "I-ORG", "B-GPE", "I-GPE", "O", "O", "O", "O", "O", "O", "O", "O", "B-PER", "I-PER", "I-PER", "O", "B-PER", "I-PER", "I-PER", "O"]}
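A minimal sketch of reading this format, assuming one JSON object per line as in the two samples above. The helper name `parse_example` is hypothetical and not part of the project, and the sample line is a shortened copy of the first record.

```python
import json


def parse_example(line):
    """Parse one JSON line into a list of (character, label) pairs."""
    record = json.loads(line)
    text, labels = record["text"], record["label"]
    # Every character must carry exactly one label.
    if len(text) != len(labels):
        raise ValueError("text and label lengths differ")
    return list(zip(text, labels))


# Shortened copy of the first sample record.
sample = '{"text": ["上", "海", "浦", "东", "开", "发"], "label": ["B-GPE", "I-GPE", "B-GPE", "I-GPE", "O", "O"]}'
pairs = parse_example(sample)
for char, label in pairs:
    print(char, label)
```

Checking the lengths up front catches misaligned records before they reach the tokenizer.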

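In addition, we use a lexicon of pretrained word vectors, which is introduced in the WeChat article linked at the top of this post. Only the file format is shown here. The first line holds the file's metadata, and every line after it is a word followed by its vector: the first element is the word (punctuation included), followed by a 100- or 200-dimensional pretrained vector. Note that punctuation marks and single characters also count as words.
The metadata line is: 2000000 200, presumably the number of entries followed by the vector dimension.
[Figure: sample lines of the lexicon file (image not included)]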
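The file layout described above can be read with a few lines of Python. This is a sketch under the assumption that fields are separated by single spaces and that the header is "count dimension"; `load_word_vectors` is a hypothetical helper, and the toy data is made up.

```python
import io


def load_word_vectors(handle, max_rows=None):
    """Read a header line, then one word and its vector per line."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for row, line in enumerate(handle):
        if max_rows is not None and row >= max_rows:
            break
        parts = line.rstrip("\n").split(" ")
        word = parts[0]
        values = [v for v in parts[1:] if v]  # drop empty fields from trailing spaces
        if len(values) != dim:
            continue  # skip malformed rows
        vectors[word] = [float(v) for v in values]
    return vocab_size, dim, vectors


# Toy file: 3 entries of dimension 2 (the real header is "2000000 200").
toy = io.StringIO("3 2\n， 0.1 0.2\n上 0.3 0.4\n上海 0.5 0.6\n")
vocab_size, dim, vectors = load_word_vectors(toy)
print(vocab_size, dim, sorted(vectors))
```

The `max_rows` argument is there because loading all two million rows is slow; reading only the most frequent entries is a common shortcut when experimenting.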

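4. Program structure

The project structure is fairly involved, as the figure shows:
[Figure: project directory tree (image not included)]
Within it, datasets stores the four NER datasets. losses contains two loss functions: focal_loss and a smoothed loss. Metrics holds the evaluation metrics. Processor contains the data preprocessing tools. utils is a set of small wrappers for reading and writing files. script holds bash command-line wrappers, although in practice the argparse setup in train.py replaces them. train.py is the entry point, and it is the only file we need to follow closely. Its structure is conventional:
main() manages the pipeline of initialization, training, and validation, while the training and validation functions handle the individual steps.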

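1. Set the arguments and load the tokenizer: set_train_args(), seed_everything(args.seed).
2. Load the data processor Processors.processor, then initialize the model, the optimizer, and the weights. Steps 1 and 2 are both done in main().
3. Training: the train() function.
4. Validation and testing: evaluate() measures performance on the validation and test datasets.

The argument setup is skipped here. The focus is on three parts: data preprocessing, model structure, and loss functions with evaluation metrics.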

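4.1 Data preprocessing

The main object for data preprocessing is the Processor in the Processors folder. Different models need different Processors; here we only care about the LEBert-CRF model.

First, besides loading the word vectors, we build a Trie from the pretrained lexicon, read every sentence of the current dataset, collect all words that match the pretrained lexicon, and write them to a file, as shown below.
[Figure: matched words written to file (image not included)]
The Processor's work therefore comes down to two parts: initializing the vocabulary and embeddings, and retrieving the matched words.
Its functions are described below.

[Figure: Processor functions (image not included)]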
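To make the Trie matching concrete, here is a small self-contained sketch. It is not the project's Processor: the class `Trie`, the helper `words_per_char`, the `max_len` limit, and the toy lexicon are all assumptions made for illustration.

```python
class Trie:
    """Character-level prefix tree over the lexicon."""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#end"] = True  # marks that a lexicon word ends at this node

    def match_from(self, chars, start, max_len=8):
        """Return every lexicon word that begins at position `start`."""
        node, found = self.root, []
        for end in range(start, min(len(chars), start + max_len)):
            node = node.get(chars[end])
            if node is None:
                break
            if "#end" in node:
                found.append("".join(chars[start:end + 1]))
        return found


def words_per_char(chars, trie):
    """For each character, collect the matched words that cover it."""
    covering = [[] for _ in chars]
    for start in range(len(chars)):
        for word in trie.match_from(chars, start):
            for pos in range(start, start + len(word)):
                covering[pos].append(word)
    return covering


trie = Trie()
for word in ["上海", "浦东", "开发", "上海浦东"]:  # toy lexicon
    trie.insert(word)

chars = list("上海浦东开发")
covering = words_per_char(chars, trie)
for ch, words in zip(chars, covering):
    print(ch, words)
```

Each character ends up with the list of lexicon words that contain it, which is the kind of per-character word set the Adapter later attends over.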

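4.2 Model construction

The model is LEBert-CRF = LEBert + classifier + CRF, so the wrapper is fairly simple. We have used the BERT-CRF model before, so the overall structure is familiar. The difference is that we use LEBert here, so the main piece to understand is its Adapter fusion.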

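# Assumed imports; the project may import these from its own modules instead.
import torch.nn as nn
from transformers import BertPreTrainedModel

# CRF is defined elsewhere in the project (not shown here); LEBertModel is shown in 4.2.1.
class LEBertCrfForNer(BertPreTrainedModel):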
    def __init__(self, config):
        super(LEBertCrfForNer, self).__init__(config)
        self.word_embeddings = nn.Embedding(config.word_vocab_size, config.word_embed_dim)
        self.bert = LEBertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.crf = CRF(num_tags=config.num_labels, batch_first=True)
        self.init_weights()

    def forward(self, input_ids, attention_mask, token_type_ids, word_ids, word_mask, labels=None):
        word_embeddings = self.word_embeddings(word_ids)
        outputs = self.bert(
            input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
            word_embeddings=word_embeddings, word_mask=word_mask
        )
        sequence_output = outputs[0]
        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)
        outputs = (logits,)
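        if labels is not None:
            # The CRF output is negated below, so it is presumably the sequence
            # log-likelihood (as in pytorch-crf); negating it gives the NLL loss.
            loss = self.crf(emissions=logits, tags=labels, mask=attention_mask)
            outputs = (-1 * loss,) + outputs
        return outputs  # (loss), scores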

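Below we focus on how LEBert is built internally and how it is used.

4.2.1 LEBert

Compared with the standard Bert, LEBert simply adds a word-level Adapter fusion layer, so the overall structure looks the same as the original Bert. In the implementation it consists of three Bert components: BertEmbeddings, BertEncoder, and BertPooler. The Adapter is embedded inside BertEncoder.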


class LEBertModel(BertPreTrainedModel):
    """

    The model can behave as an encoder (with only self-attention) as well
    as a decoder, in which case a layer of cross-attention is added between
    the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani,
    Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

    To behave as an decoder the model needs to be initialized with the
    :obj:`is_decoder` argument of the configuration set to :obj:`True`.
    To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder`
    argument and :obj:`add_cross_attention` set to :obj:`True`; an
    :obj:`encoder_hidden_states` is then expected as an input to the forward pass.

    .. _`Attention is all you need`:
        https://arxiv.org/abs/1706.03762

    """

    def __init__(self, config):
        super().__init__(config)
        self.config = config

        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)

        self.init_weights()

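A quick review of what the three components do:

BertEmbeddings converts tokens into embeddings by combining token embeddings, position embeddings, and segment (sentence) embeddings; the segment embeddings exist for the Next Sentence Prediction task.
BertEncoder is a stack of Transformer encoder layers and is not complicated.
BertPooler is a simple head for the final task: a dense layer with tanh applied to the first token's hidden state.

Here is the BertPooler implementation, which is indeed simple.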

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

4.2.2 Adapter fusion

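I am not very familiar with Adapters as a way of "fine-tuning" or "fusing" features, but the structure and intent here are fairly clear:
[Figure: Adapter structure (image not included)]
The Adapter performs four steps:

1. Dimension alignment: align the word vectors with the character vectors in dimension.
2. Weight computation: for each character, compute a weight for every word vector matched to it.
3. Weighted sum: for each character, sum its word vectors with those weights to obtain the character's weighted word vector.
4. Feature fusion: add the character vector and the weighted word vector to obtain the output of the Lexicon Adapter.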

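In this model the Adapter is added directly inside BertEncoder, and the Adapter operation is applied after the last Encoder layer.
[Figure: Adapter inside BertEncoder (image not included)]
Note the number of layers in BertEncoder: 12.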

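The Adapter code is fairly simple and corresponds to the following steps:

1. Project: project the word vectors so that they match the character vectors in dimension.
2. Bilinear pooling: compute logits between each character and its matched words.
3. Attention score computing.
4. Fusion: Add + Dropout + LN.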
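The four steps can also be written as equations. This is a reading of the WordEmbeddingAdapter code in this section, not a quotation from the paper. The notation is introduced here: $h_i$ is the vector of character $i$, $x_{ij}$ is the $j$-th word vector matched to it, and $W_1$, $W_2$, $W_{attn}$ stand for linear1, linear2, and attn_W.

```latex
\begin{aligned}
v_{ij} &= W_2 \tanh(W_1 x_{ij} + b_1) + b_2 && \text{(1. project)} \\
s_{ij} &= h_i^{\top} W_{attn}\, v_{ij} && \text{(2. bilinear pooling)} \\
a_{ij} &= \frac{\exp(s_{ij})}{\sum_{k} \exp(s_{ik})} && \text{(3. attention over matched words)} \\
\tilde{h}_i &= \mathrm{LayerNorm}\Big(\mathrm{Dropout}\big(h_i + \textstyle\sum_j a_{ij} v_{ij}\big)\Big) && \text{(4. fusion)}
\end{aligned}
```

The dropout applied to $v_{ij}$ and the padding mask used in step 3 are left out of the equations for brevity.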

import torch
import torch.nn as nn
import torch.nn.functional as F


class WordEmbeddingAdapter(nn.Module):
    
    def __init__(self, config):
        super(WordEmbeddingAdapter, self).__init__()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.tanh = nn.Tanh()

        self.linear1 = nn.Linear(config.word_embed_dim, config.hidden_size)
        self.linear2 = nn.Linear(config.hidden_size, config.hidden_size)

        attn_W = torch.zeros(config.hidden_size, config.hidden_size)
        self.attn_W = nn.Parameter(attn_W)
        self.attn_W.data.normal_(mean=0.0, std=config.initializer_range)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

    def forward(self, layer_output, word_embeddings, word_mask):
        """
        :param layer_output: output of a bert layer, [b_size, len_input, d_model]
        :param word_embeddings: word vectors matched to each character, [b_size, len_input, num_word, d_word]
        :param word_mask: attention mask over each character's matched words, [b_size, len_input, num_word]
        """

        # 1. Project: align the word vectors with the character vectors in dimension
        word_outputs = self.linear1(word_embeddings)
        word_outputs = self.tanh(word_outputs)
        word_outputs = self.linear2(word_outputs)
        word_outputs = self.dropout(word_outputs)   # [b_size, len_input, num_word, d_model]

        # 2. Bilinear pooling between each character and its matched words
        scores = torch.matmul(layer_output.unsqueeze(2), self.attn_W)  # [b_size, len_input, 1, d_model]
        scores = torch.matmul(scores, torch.transpose(word_outputs, 2, 3))  # [b_size, len_input, 1, num_word]
        scores = scores.squeeze(2)  # [b_size, len_input, num_word]

        # 3. Attention score computing
        # The original in-place call `socres.masked_fill_(word_mask, -1e9)` raised an error.
        # Recent PyTorch versions require a boolean mask. Assuming word_mask uses 1 for
        # real words and 0 for padding, the padded positions are the ones to fill.
        pad_positions = word_mask == 0
        scores = scores.masked_fill(pad_positions, -1e9)  # give padding a very small score
        scores = F.softmax(scores, dim=-1)  # [b_size, len_input, num_word]
        attn = scores.unsqueeze(-1)  # [b_size, len_input, num_word, 1]

        # 4. Fusion: Add + Dropout + LN
        # Weighted sum over each character's matched words: [b_size, len_input, d_model]
        weighted_word_embedding = torch.sum(word_outputs * attn, dim=2)
        layer_output = layer_output + weighted_word_embedding

        layer_output = self.dropout(layer_output)
        layer_output = self.layer_norm(layer_output)

        return layer_output
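Step 3 of the Adapter relies on filling padded positions with a very negative score before the softmax. The pure-Python sketch below shows the effect on a single character with three word slots, the last of which is padding. `masked_softmax` is a hypothetical helper, and the mask convention (1 = real word, 0 = padding) is an assumption.

```python
import math


def masked_softmax(scores, mask, fill=-1e9):
    """Softmax over `scores`, treating positions where mask is 0 as padding."""
    filled = [s if m else fill for s, m in zip(scores, mask)]
    top = max(filled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]


# The third slot has the largest raw score but is padding, so it gets no weight.
weights = masked_softmax([2.0, 1.0, 5.0], [1, 1, 0])
print(weights)
```

Without the mask, the padded slot would dominate the attention because its raw score is the largest.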

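4.3 Loss functions and evaluation metrics

The project includes several loss functions: focal loss, a soft-label loss, and cross entropy. They are not used in the models that contain a CRF, only in models whose final label tagger is a Softmax layer, so I will not explain them further. To be honest, I did not immediately understand how the LEBert-CRF loss works; I will fill that in later.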
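Judging only from the forward() code in section 4.2, where the CRF output is multiplied by -1, the training loss appears to be the negative log-likelihood of the gold tag sequence under a linear-chain CRF. The sketch below computes that quantity in pure Python. It is not the project's CRF class: it omits start and end transition scores and masking, and all function names and numbers are made up for illustration.

```python
import math
from itertools import product


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def path_score(emissions, transitions, tags):
    """Unnormalized score of one tag sequence: emissions plus transitions."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions):
    """log Z over all tag sequences, computed with the forward algorithm."""
    num_tags = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(num_tags)]) + emissions[t][j]
            for j in range(num_tags)
        ]
    return logsumexp(alpha)


def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood: log Z minus the score of the gold path."""
    return log_partition(emissions, transitions) - path_score(emissions, transitions, tags)


emissions = [[1.0, 0.5], [0.2, 1.5], [0.3, 0.1]]  # 3 characters, 2 tags
transitions = [[0.5, -0.2], [-1.0, 0.8]]          # transitions[i][j]: score of tag i -> tag j
gold = [0, 1, 1]
nll = crf_nll(emissions, transitions, gold)

# Sanity check: enumerate all 2**3 tag sequences to obtain log Z directly.
all_scores = [path_score(emissions, transitions, list(p)) for p in product(range(2), repeat=3)]
brute_nll = logsumexp(all_scores) - path_score(emissions, transitions, gold)
print(round(nll, 6), round(brute_nll, 6))
```

In the model, the emissions are the classifier logits and the transition scores are learned parameters of the CRF layer.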

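As a classification problem, the metrics are the standard precision (P), recall (R), and F1. Take care with the denominator of recall: it is the number of gold entities, not the number of predicted ones.
$P={right \over predicted}, R={right \over original}, F1={{2PR} \over {P+R}}$

Once we know the number of predicted entities that are exactly correct, the number of gold entities, and the total number of predicted entities, all three values can be computed. The class below therefore just records the results over the dataset (this actually wastes some memory; keeping a count per tag type would be enough).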

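from collections import Counter

# get_entities is a helper defined elsewhere in the project (not shown here).
class SeqEntityScore(object):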
    def __init__(self, id2label,markup='bios'):
        self.id2label = id2label
        self.markup = markup
        self.reset()

    def reset(self):
        self.origins = []
        self.founds = []
        self.rights = []

    def compute(self, origin, found, right):
        recall = 0 if origin == 0 else (right / origin)
        precision = 0 if found == 0 else (right / found)
        f1 = 0. if recall + precision == 0 else (2 * precision * recall) / (precision + recall)
        return recall, precision, f1

    def result(self):
        class_info = {}
        origin_counter = Counter([x[0] for x in self.origins]) # {type:count}
        found_counter = Counter([x[0] for x in self.founds])
        right_counter = Counter([x[0] for x in self.rights])
        for type_, count in origin_counter.items():
            origin = count
            found = found_counter.get(type_, 0)
            right = right_counter.get(type_, 0)
            recall, precision, f1 = self.compute(origin, found, right)
            # Note: the "acc" key holds precision, not accuracy.
            class_info[type_] = {"acc": round(precision, 4), 'recall': round(recall, 4), 'f1': round(f1, 4)}
        origin = len(self.origins)
        found = len(self.founds)
        right = len(self.rights)
        recall, precision, f1 = self.compute(origin, found, right)
        return {'acc': precision, 'recall': recall, 'f1': f1}, class_info

    def update(self, label_paths, pred_paths):
        '''
        labels_paths: [[],[],[],....]
        pred_paths: [[],[],[],.....]

        :param label_paths:
        :param pred_paths:
        :return:
        Example:
            >>> labels_paths = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
            >>> pred_paths = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
        '''
        for label_path, pre_path in zip(label_paths, pred_paths):
            label_entities = get_entities(label_path, self.id2label,self.markup)
            pre_entities = get_entities(pre_path, self.id2label,self.markup)  # [tag, begin, end] triples
            self.origins.extend(label_entities)
            self.founds.extend(pre_entities)
            # keep the predicted entities that exactly match a gold entity
            self.rights.extend([pre_entity for pre_entity in pre_entities if pre_entity in label_entities])
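A compact, self-contained version of the same bookkeeping, run on the example from the docstring above. `get_entities_bios` is a simplified stand-in written for this sketch, not the project's get_entities: it takes string tags directly and assumes the BIOS scheme. Only counts are kept, which avoids storing every entity.

```python
def get_entities_bios(tags):
    """Extract [type, start, end] triples (end inclusive) from BIOS tags."""
    entities, current = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or tag.startswith("S-"):
            if current:
                entities.append(current)
            current = [tag[2:], i, i]
            if tag.startswith("S-"):  # a single-character entity closes immediately
                entities.append(current)
                current = None
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2] = i  # extend the open entity
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


def precision_recall_f1(label_paths, pred_paths):
    origin = found = right = 0
    for gold_tags, pred_tags in zip(label_paths, pred_paths):
        gold = get_entities_bios(gold_tags)
        pred = get_entities_bios(pred_tags)
        origin += len(gold)   # denominator of recall
        found += len(pred)    # denominator of precision
        right += sum(1 for entity in pred if entity in gold)
    precision = right / found if found else 0.0
    recall = right / origin if origin else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


label_paths = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
pred_paths = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
precision, recall, f1 = precision_recall_f1(label_paths, pred_paths)
print(precision, recall, f1)
```

The predicted MISC entity starts one position too early, so it does not count as correct even though it overlaps the gold entity; only the PER entity matches exactly.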