[NLP] (Task 8) Extractive QA and Multiple-Choice QA with Transformers (work in progress)

Learning Summary

(1) How to solve extractive QA with a BERT model, in four steps: loading the data, preprocessing, fine-tuning the pre-trained model, and evaluation.

  • Loading the data: we use the SQuAD dataset;
  • Preprocessing: build the model's tokenizer, handle long texts, and preprocess every sample in the dataset;
  • Fine-tuning: set the training arguments, then train and save the model;
  • Evaluation: post-process the model's predictions, handle the no-answer case, and finally compute the metrics from the predictions and the gold annotations with the squad metric.

(2) Extractive QA:
Input layer: the question Q and the passage P (both after WordPiece tokenization) are concatenated into BERT's raw input sequence X:

$$X = [\mathrm{CLS}]\, q_1 q_2 \cdots q_n\, [\mathrm{SEP}]\, p_1 p_2 \cdots p_m\, [\mathrm{SEP}], \qquad \boldsymbol{v} = \operatorname{InputRepresentation}(X)$$

BERT encoding layer: the input representation $\boldsymbol{v}$ is encoded by multiple Transformer layers, whose self-attention fully captures the semantic interactions between passage and question, yielding the contextual representation $\boldsymbol{h} = \operatorname{BERT}(\boldsymbol{v}) \in \mathbb{R}^{N \times d}$, where $d$ is BERT's hidden size and $N$ is the sequence length.

Answer output layer: given $\boldsymbol{h}$, a fully connected layer compresses each component (one per input position) into a scalar, and a Softmax over positions predicts, for each position, the probability of being the answer's start, and likewise its end.
The model parameters are learned with cross-entropy loss; the start-position and end-position losses are averaged to give the model's total loss (see the sketch below).
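Below is a minimal sketch of this answer output layer and the averaged loss (the encoder argument and the name qa_outputs are illustrative assumptions; the actual fine-tuning later in this article uses AutoModelForQuestionAnswering, which implements the same idea):

import torch
import torch.nn as nn

class ExtractiveQAHead(nn.Module):
    """Sketch of the answer output layer: one start score and one end score per position."""
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder                       # pre-trained BERT encoder (assumed)
        self.qa_outputs = nn.Linear(hidden_size, 2)  # compress each position to 2 scalars

    def forward(self, input_ids, attention_mask, start_positions=None, end_positions=None):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state  # (B, N, d)
        start_logits, end_logits = self.qa_outputs(h).split(1, dim=-1)                # 2 x (B, N, 1)
        start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)   # (B, N)
        if start_positions is not None:
            ce = nn.CrossEntropyLoss()
            # average of the start-position and end-position cross-entropy losses
            loss = (ce(start_logits, start_positions) + ce(end_logits, end_positions)) / 2
            return loss, start_logits, end_logits
        return start_logits, end_logits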

Decoding: answers are obtained with a Top-k answer extraction method (the n-best decoding implemented in the evaluation section below).

(3) How does extractive QA handle an overlong context? (following 天国大佬's notes)

  • Use truncation and padding to slice the overlong text, allowing adjacent slices to overlap;
  • Use overflow_to_sample_mapping and offset_mapping to map back to pre-slicing positions, so the answer's start and end can be located (see the sketch below);
  • Iterate over all slices:
    1) for a context with no answer, label the answer position at the CLS token;
    2) for a context with an answer, find the pre-slicing character start/end positions, then the post-slicing token start/end positions;
    3) check whether the answer falls outside the slice: if it does, label it at the CLS position; if not, record the answer's start and end token positions;
  • Return the tokenizer-preprocessed data in the input format the pre-trained model expects.
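A minimal sketch of the slice-to-sample mapping (assuming tokenizer is the fast tokenizer loaded later in this article):

# Two (question, context) pairs; the first context is long enough to be sliced.
tokenized = tokenizer(
    ["first question", "second question"],
    ["a very long context " * 200, "a short context"],
    max_length=384,
    truncation="only_second",        # slice only the context
    stride=128,                      # overlap between adjacent slices
    return_overflowing_tokens=True,  # keep the overflowing slices
    return_offsets_mapping=True,     # map each token back to its character positions
)
# e.g. [0, 0, 0, 1]: the first example produced several slices, the second just one
print(tokenized["overflow_to_sample_mapping"])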

(4) How to choose the tokenizer and pre-trained model?
See the Hugging Face docs on tokenizers: https://huggingface.co/transformers/tokenizer_summary.html


The Jupyter notebook for this article is in the Chapter 4 code repository.

It is recommended to open this tutorial directly in a Google Colab notebook, which can download the datasets and models quickly.
If you are opening this notebook in Google Colab, you may need to install the Transformers and 🤗 Datasets libraries; uncomment the command below to install them.

!pip install datasets transformers

Task 1: Fine-tuning a transformer model on extractive question answering

In this notebook, we will learn how to fine-tune a transformer model from 🤗 Transformers for machine question answering. We focus on extractive QA: given a question and a passage of text, find the span in the passage that answers the question. With the Trainer API and the datasets package, we can easily load the dataset and then fine-tune the transformer.

Note: the QA task in this article extracts the answer from the text; it does not generate the answer directly!

The examples in this notebook can be used for any extractive QA task in the style of SQuAD 1 and SQuAD 2, with any model checkpoint from the Model Hub, as long as the model has a token classification head and a fast tokenizer. See this table for the model/fast-tokenizer correspondence.

If your dataset differs from this notebook's, only minor adjustments are needed to reuse it. Also, adjust the batch size to your hardware (RAM, GPU memory) to avoid out-of-memory errors.
Set those three parameters, then the rest of the notebook should run smoothly:

# squad_v2 = True or False selects SQuAD v2 or SQuAD v1, respectively.
# On another dataset, True means the model may answer "unanswerable": some questions get no answer; False means every question must be answered.
squad_v2 = False
# Pre-trained model checkpoint
model_checkpoint = "distilbert-base-uncased"
# Adjust batch_size to your GPU to avoid running out of memory
batch_size = 16

Review of Hung-yi Lee's course: Extraction-based Question Answering (QA)

Homework 7 is a question answering system: after the machine reads an article, you ask it a question and it gives you an answer. The questions and answers here are slightly restricted, though: this is extraction-based QA. That is, we assume the answer must appear in the article, as a span of the article.

In this task, one input sequence contains an article and a question, each itself a sequence. For Chinese, every d is one character of the article and every q is one character of the question. You feed d and q into the QA model and want it to output two positive integers s and e. With these two integers we can cut a span directly out of the article, and that span is the answer.

This sounds wild, but it is a fairly standard approach in use today.
More concretely: given a question and an article whose correct answer is "gravity", how does the machine output that answer?
Your model should output s = 17 and e = 17 to indicate "gravity": it is the 17th word of the article, so s = e = 17 means the 17th word is output as the answer. As another example, if the answer is "within a cloud", words 77 to 79 of the article, the model should output the two integers 77 and 79, and the span from the 77th to the 79th word is the final answer. That is what Homework 7 asks you to do.

Of course, we do not train the QA model from scratch; to train this QA model, we start from a pre-trained BERT model.

The solution looks like this. You show BERT a question and an article (this is actually similar to the Natural Language Inference setup, which also takes two sentences, a premise and a hypothesis/conclusion; here it is an article and a question), with a special token between the question and the article, and a CLS token at the very front.

In this task, the only things you train from scratch ("from scratch" here means randomly initialized) are two vectors, drawn in orange and blue, each with the same length as BERT's output. If BERT's output is a 768-dimensional vector, these two vectors are 768-dimensional as well. So how are these two vectors used?

  • First, compute the inner product of the orange vector with each output vector corresponding to the document. With 3 document tokens, there are three such vectors; computing their inner products with the orange vector gives three values, and passing them through a softmax gives three normalized scores.

    This inner product is much like attention: think of the orange vector as the query and the yellow vectors as the keys. We then look for the position with the highest score; if the inner product with d2 is the largest, then s = 2, i.e. the predicted start position is 2.

  • The blue vector does exactly the same thing.


The blue vector marks the end of the answer: compute its inner products with the document's yellow output vectors, apply a softmax, and take the maximum. If the third value is the largest, then e = 3, and the correct answer is d2 d3. In other words, we find the start and end positions of the answer within the article.

Because the answer must lie inside the article, this trick cannot be used when the answer is not in the article. That is all a QA model needs to do. Note that these two vectors are randomly initialized, while BERT is initialized from its pre-trained weights.
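A tiny numeric sketch of this mechanism (the two vectors and the BERT outputs below are random stand-ins, purely for illustration):

import torch

hidden_size = 768
start_vec = torch.randn(hidden_size)  # the "orange" vector, trained from scratch
end_vec = torch.randn(hidden_size)    # the "blue" vector, trained from scratch

# Stand-in for BERT's output vectors at the 3 document token positions.
doc_outputs = torch.randn(3, hidden_size)

start_probs = torch.softmax(doc_outputs @ start_vec, dim=0)  # inner products, then softmax
end_probs = torch.softmax(doc_outputs @ end_vec, dim=0)

s = start_probs.argmax().item()  # predicted start position
e = end_probs.argmax().item()    # predicted end position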

1. Loading the dataset

We will use the 🤗 Datasets library to download the data and to get the evaluation metric we need (for comparison against the benchmark).

Both tasks can be done easily with the load_dataset and load_metric functions.

from datasets import load_dataset, load_metric

As an example, this notebook uses the SQuAD dataset; it likewise works with any of the question answering datasets provided by the datasets library.

The SQuAD dataset

The SQuAD 1.1 dataset can be sketched as follows: each element contains an answer, a question, and a question id. The answer gives the start position of the answer and its text string.
The data we need includes context, which is a paragraph; each paragraph has several questions with corresponding answers, so we also need question, text, and answer_start, where text is the answer to the question. In this dataset each question has exactly one answer. Apart from id, the remaining fields are exactly what we train on.
The SQuAD 2.0 dataset is covered in "NLP 高引论文解读两篇 | BERT模型、SQuAD数据集".

If you are using your own dataset (json or csv format), see the Datasets documentation for how to load a custom dataset; you may need to adjust the column names.

# Download the SQuAD dataset
datasets = load_dataset("squad_v2" if squad_v2 else "squad")

The datasets object is a DatasetDict; each split (here train and validation) corresponds to one key of the dict.

# Take a look at datasets and its attributes
datasets
    DatasetDict({
        train: Dataset({
            features: ['id', 'title', 'context', 'question', 'answers'],
            num_rows: 87599
        })
        validation: Dataset({
            features: ['id', 'title', 'context', 'question', 'answers'],
            num_rows: 10570
        })
    })

Whether in the training, validation, or test set, every QA sample has the three keys "context", "question", and "answers".

We can select a single sample with an index.

# Look at the first training example
datasets["train"][0]
# answers: the answer annotation
# context: the text passage
# question: the question
    {'answers': {'answer_start': [515], 
    'text': ['Saint Bernadette Soubirous']},
     'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
     'id': '5733be284776f41900661182',
     'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
     'title': 'University_of_Notre_Dame'}

Note the answers annotation: besides the answer text taken from the passage, it gives the character position where the answer starts (position 515 in the example above).
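We can verify this character offset by slicing the context directly:

sample = datasets["train"][0]
answer_text = sample["answers"]["text"][0]
start = sample["answers"]["answer_start"][0]
# the span starting at character 515 is exactly the annotated answer text
assert sample["context"][start : start + len(answer_text)] == answer_text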

To get a better feel for what the data looks like, the following function randomly picks a few examples from the dataset and displays them.

from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    """Randomly pick a few examples from the dataset."""
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))
show_random_elements(datasets["train"], num_examples=2)
Example 0 · answers: {'answer_start': [185], 'text': ['diesel fuel']} · question: Besides crude oil, what does the Suncor Energy plant produce? · title: Asphalt · id: 571b074c9499d21900609be3 · context: In Alberta, five bitumen upgraders produce synthetic crude oil and a variety of other products: The Suncor Energy upgrader near Fort McMurray, Alberta produces synthetic crude oil plus diesel fuel; the Syncrude Canada, Canadian Natural Resources, and Nexen upgraders near Fort McMurray produce synthetic crude oil; and the Shell Scotford Upgrader near Edmonton produces synthetic crude oil plus an intermediate feedstock for the nearby Shell Oil Refinery. A sixth upgrader, under construction in 2015 near Redwater, Alberta, will upgrade half of its crude bitumen directly to diesel fuel, with the remainder of the output being sold as feedstock to nearby oil refineries and petrochemical plants.

Example 1 · answers: {'answer_start': [191], 'text': ['the GIOVE satellites for the Galileo system']} · question: The purpose of the Compass-M1 satellite is similar to the purpose of what other satellite? · title: BeiDou_Navigation_Satellite_System · id: 56e1161ccd28a01900c6757b · context: Compass-M1 is an experimental satellite launched for signal testing and validation and for the frequency filing on 14 April 2007. The role of Compass-M1 for Compass is similar to the role of the GIOVE satellites for the Galileo system. The orbit of Compass-M1 is nearly circular, has an altitude of 21,150 km and an inclination of 55.5 degrees.

2. Preprocessing the training data

import numpy as np
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
from transformers import default_data_collator
import torch
import collections
from tqdm.auto import tqdm

2.1 Building the tokenizer for the model

Before feeding the data to the model, we need to preprocess it. The preprocessing tool is the Tokenizer: it first tokenizes the input, then maps the tokens to the token IDs the pre-trained model expects, and finally converts them into the input format the model requires.

To achieve this preprocessing, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:

  • we get a tokenizer corresponding one-to-one to the pre-trained model;
  • when using the tokenizer of the specified model checkpoint, we also download the vocabulary the model needs, more precisely its tokens vocabulary.

The downloaded tokens vocabulary is cached, so it is not downloaded again on later runs.

from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The following code requires the tokenizer to be of type transformers.PreTrainedTokenizerFast, because preprocessing relies on some special features of fast tokenizers (such as fast multithreaded processing).

Almost every model has a corresponding fast tokenizer. The model/tokenizer table lists the features of every pre-trained model's tokenizer.

import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)
# To see what the tokenizer's preprocessing looks like, we can call its tokenize method directly;
# add_special_tokens=True adds the special tokens required by the pre-trained model.
print("Tokenize one text: {}".format(tokenizer.tokenize("What is your name?")))
print("Tokenize with special tokens: {}".format(tokenizer.tokenize("My name is Sylvain.", add_special_tokens=True)))
# The pre-trained model's input format requires token IDs plus an attention mask;
# the calls below produce the input format the model requires.
Tokenize one text: ['what', 'is', 'your', 'name', '?']
Tokenize with special tokens: ['[CLS]', 'my', 'name', 'is', 'sy', '##lva', '##in', '.', '[SEP]']

The tokenizer can preprocess a single text or a pair of texts; its output satisfies the pre-trained model's input format.

# Preprocess a single text
tokenizer("What is your name?")
{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
# Preprocess a pair of texts: the tokenizer adds token ID 101 at the start, separates the two texts with token ID 102, and ends with 102. These rules were fixed when the model was pre-trained.
tokenizer("What is your name?", "My name is Sylvain.")
{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The token IDs above (the input_ids) generally differ from one pre-trained model to another, because different models fixed different rules during pre-training. But as long as the tokenizer and the model have the same name, the tokenizer's output format will match what the model needs. See this tutorial for more on preprocessing.
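For example (loading a second checkpoint here purely for contrast; any other checkpoint would do), the same sentence maps to different IDs under different vocabularies:

from transformers import AutoTokenizer

other_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer("What is your name?")["input_ids"])        # distilbert-base-uncased IDs
print(other_tokenizer("What is your name?")["input_ids"])  # bert-base-cased IDs differ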

2.2 Handling long texts

Now we need to think about how pre-trained QA models handle very long texts. Pre-trained models generally impose a maximum input length, so an overlong input is usually truncated.

However, if we truncate the overlong context of a <question, context, answer> triple, we may throw away the answer (since the answer is a small span extracted from the context).

To address this, the code below finds an over-length example and shows how to handle it.

We slice the overlong input into several shorter inputs, each within the model's maximum input length. Since the answer may sit exactly where a slice boundary falls, adjacent slices are allowed to overlap; this is controlled by the doc_stride parameter in the code.

Pre-trained QA models usually take the question concatenated with the context as input, then look for the answer inside the context.

max_length = 384 # maximum length of an input feature (question and context concatenated)
doc_stride = 128 # number of overlapping tokens between two consecutive slices

Loop over the dataset with a for loop to find an over-length sample; this notebook's model requires a maximum input length of 384 (512 is also commonly used).

for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
        break
example = datasets["train"][i]

Without truncation, this input has length 396:

len(tokenizer(example["question"], example["context"])["input_ids"])
396

If we now truncate to the maximum length 384, the information beyond that length is lost:

len(tokenizer(example["question"], example["context"], max_length=max_length, truncation="only_second")["input_ids"])
384

Note that in general we only slice the context, never the question. Since the context is concatenated after the question, it is the second text, so we use truncation="only_second". The tokenizer's doc_stride then controls the overlap between slices.

# Prepare the training data and convert it to features
tokenized_example = tokenizer(
    example["question"], # question text
    example["context"],  # passage text
    max_length=max_length, 
    truncation="only_second", # truncate only the second part, i.e. the passage
    return_overflowing_tokens=True,  # keep tokens beyond max_length, slicing the passage into several pieces
    stride=doc_stride # overlap between consecutive slices
)

Because the overlong input was sliced, we obtain several inputs; their input_ids have the following lengths:

[len(x) for x in tokenized_example["input_ids"]]
[384, 157]

We can decode the preprocessed token IDs (input_ids) back into text:

for i, x in enumerate(tokenized_example["input_ids"][:2]):
    print("Slice: {}".format(i))
    print(tokenizer.decode(x))
Slice: 0
[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fifteenth at notre dame, has achieved a 332 - 165 record. in 2009 they were invited to the nit, where they advanced to the semifinals but were beaten by penn state who went on and beat baylor in the championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were [SEP]
Slice: 1
[CLS] how many wins does the notre dame men's basketball team have? [SEP] championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were the most by the fighting irish team since 1908 - 09. [SEP]

Because we sliced the overlong text, we need to re-locate the answer (relative to the start of each context slice). The QA model is trained on the answer's positions (start and end) as labels, not on the answer's token IDs. So the slices need a correspondence with the original input: for every token, its position in the sliced context versus its position in the original overlong context.

The tokenizer's return_offsets_mapping parameter gives us this correspondence map:

tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
# Print the correspondence between positions before and after slicing
print(tokenized_example["offset_mapping"][0][:100])
[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 38), (38, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0), (0, 3), (4, 7), (7, 8), (8, 9), (10, 20), (21, 25), (26, 29), (30, 34), (35, 36), (36, 37), (37, 40), (41, 45), (45, 46), (47, 50), (51, 53), (54, 58), (59, 61), (62, 69), (70, 73), (74, 78), (79, 86), (87, 91), (92, 96), (96, 97), (98, 101), (102, 106), (107, 115), (116, 118), (119, 121), (122, 126), (127, 138), (138, 139), (140, 146), (147, 153), (154, 160), (161, 165), (166, 171), (172, 175), (176, 182), (183, 186), (187, 191), (192, 198), (199, 205), (206, 208), (209, 210), (211, 217), (218, 222), (223, 225), (226, 229), (230, 240), (241, 245), (246, 248), (248, 249), (250, 258), (259, 262), (263, 267), (268, 271), (272, 277), (278, 281), (282, 285), (286, 290), (291, 301), (301, 302), (303, 307), (308, 312), (313, 318), (319, 321), (322, 325), (326, 330), (330, 331), (332, 340), (341, 351), (352, 354), (355, 363), (364, 373), (374, 379), (379, 380), (381, 384), (385, 389), (390, 393), (394, 406), (407, 408), (409, 415), (416, 418)]

Printed above are the positions, in the original context, of the first 100 tokens of slice 0 of tokenized_example. Note that the first token, [CLS], is set to (0, 0) because it belongs to neither the question nor the context. The second token's start and end positions are 0 and 3. We can convert a post-slicing token id back to its token, then use offset_mapping to map it back to its pre-slicing position and recover the original tokens. Since the question is concatenated in front of the context, here we can index directly into the question.

first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])
how How

So we have the position correspondence before and after slicing. We still need sequence_ids to distinguish the question from the context.

sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)
[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]

None corresponds to the special tokens, and 0 and 1 mark the first and the second text respectively; since we passed the question first and the context second, they correspond to question and context. We can now find the position of the annotated answer in the preprocessed features:

answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Find the start token index of the current text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# Find the end token index of the current text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Check whether the answer lies outside the span; if so, the sample is labeled at the CLS token position.
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move token_start_index and token_end_index to the two ends of the answer.
    # Note the edge case where the answer is at the very end.
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print("start_position: {}, end_position: {}".format(start_position, end_position))
else:
    print("The answer is not in this feature.")
start_position: 23, end_position: 26

We should verify the answer positions: take the token IDs at the answer span, decode them to text, and compare against the original answer.

print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
print(answers["text"][0])
over 1, 600
over 1,600

Sometimes the question is concatenated before the context and sometimes after; different models have different requirements, so we use the tokenizer's padding_side attribute to tell:

pad_on_right = tokenizer.padding_side == "right" # the context is on the right

2.2.1 Putting the steps together

Now let's merge all the steps above. For a context with no answer, we place the labeled answer start and end positions at the CLS index. If the allow_impossible_answers flag is False, these no-answer samples are all dropped. For brevity, let's drop them for now.

def prepare_train_features(examples):
    # We truncate and pad the examples but must still keep all the information, so we use the slicing approach.
    # Each overlong example is sliced into several inputs, and adjacent inputs overlap.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # overflow_to_sample_mapping maps each slice to its original example index.
    # E.g., if 2 examples are sliced into 4 pieces, the mapping is [0, 0, 1, 1]: the first two slices come from the first example.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # offset_mapping also has one entry per slice.
    # It maps back to the original input; since the answers are annotated on the original input, it helps us find the answer's start and end positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Relabel the data
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # Process each slice.
        # No-answer samples are labeled at the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Distinguish question and context
        sequence_ids = tokenized_examples.sequence_ids(i)

        # Get the index of the original example.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If there is no answer, use the CLS position as the answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Character-level start/end positions of the answer.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Find the token-level start index.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # Find the token-level end index.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Check whether the answer is beyond this span; if so, also label it at the CLS index.
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise, find the start and end token positions of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

The preprocessing function above can handle one sample or several samples (examples). When given several samples, it returns the list of preprocessed results for all of them.

features = prepare_train_features(datasets['train'][:5])
# Process 5 samples

2.3 Preprocessing every sample in datasets

Next we preprocess all the samples in datasets, by using the map function to apply the preprocessing function prepare_train_features to every sample.

tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)

Even better, the returned results are automatically cached, avoiding recomputation the next time (but beware: if the inputs change, the cache may get in the way!).

  • The datasets library checks its input arguments for changes: if nothing changed it uses the cached data, otherwise it reprocesses. If the arguments are unchanged but you want the processing redone, it is best to bypass this cache, by passing load_from_cache_file=False (as shown below).
  • The batched=True argument used above is a feature of the tokenizer: it processes the inputs in batches, in parallel with multiple threads.
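For instance, to force the preprocessing to run again instead of reusing the cache, the same map call can be repeated with the extra flag (a sketch of the call above):

tokenized_datasets = datasets.map(
    prepare_train_features,
    batched=True,
    remove_columns=datasets["train"].column_names,
    load_from_cache_file=False,  # ignore any cached result and recompute
)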

3. Fine-tuning the model

3.1 Loading the pre-trained model

Now that the data for training/fine-tuning is preprocessed, we download the pre-trained model. Since this is a machine question answering task, we use the AutoModelForQuestionAnswering class. As with the tokenizer, the model is loaded with the from_pretrained method.

from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
Downloading: 100%|██████████| 268M/268M [00:46<00:00, 5.79MB/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Because we fine-tune on a question answering task but loaded a pre-trained language model, the message above warns that some mismatched weights were discarded when loading (the pre-trained language-model head was dropped, and a question answering head was randomly initialized).

3.2 Setting the training arguments

Precisely because of these randomly initialized weights, we must fine-tune the model on the new dataset.
To build a Trainer, we need three more ingredients, the most important being the training settings/arguments, TrainingArguments. This object holds all the attributes that define the training run. It also needs a folder name, which is used to save the model and its configuration.

# Training arguments
args = TrainingArguments(
    f"test-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5, # learning rate
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3, # number of training epochs
    weight_decay=0.01,
)

The argument evaluation_strategy = "epoch" above tells the training code to run a validation evaluation once per epoch.

batch_size was defined earlier in this notebook.

We use a default_data_collator to feed the preprocessed data to the model.

from transformers import default_data_collator

data_collator = default_data_collator

3.3 Training the model

During training we only compute the loss; evaluating the model against the metric comes in the next section.

Just pass the model, the training arguments, the data, the tokenizer used earlier, and the default_data_collator data feeder into the Trainer.

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Call the train method to start training:

trainer.train()

Training takes a long time (on a local Mac, roughly 2 hours per epoch), so save the model after training:

trainer.save_model("test-squad-trained")
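The saved checkpoint can later be reloaded by path (a sketch, assuming the test-squad-trained directory written by save_model above):

from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model = AutoModelForQuestionAnswering.from_pretrained("test-squad-trained")
tokenizer = AutoTokenizer.from_pretrained("test-squad-trained")  # the Trainer also saved the tokenizer here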

4. Evaluation

4.1 Getting the model's predictions

Model evaluation is a bit more involved: we need to post-process the model output into the text format we want. The model itself predicts the logits of the answer's start/end positions. If we feed the model a batch at evaluation time, the output looks like this:

import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

The model's output is a dict-like structure, containing the loss (labels were provided, hence a loss) and the logits of the answer's start and end. When producing predictions we do not need the loss; looking at the logits is enough.

output.start_logits.shape, output.end_logits.shape
(torch.Size([16, 384]), torch.Size([16, 384]))

Every token in every feature gets a logit.

How to predict the answer:
Method 1 (the simplest):

Take the index of the largest start logit as the answer's start position, and the index of the largest end logit as its end position.

output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)
(tensor([ 46,  57,  78,  43, 118,  15,  72,  35,  15,  34,  73,  41,  80,  91,
         156,  35], device='cuda:0'),
 tensor([ 47,  58,  81,  55, 118, 110,  75,  37, 110,  36,  76,  53,  83,  94,
         158,  35], device='cuda:0'))
Method 2:

The strategy above works well most of the time. But the output may tell us no answer can be found there: for instance, the start index may be larger than the end index, or both may point into the question.

In that case, a simple fix is to fall back to the 2nd-best prediction, and if that fails the 3rd-best, and so on.

Method 3:

Since the method above does not easily find a valid answer, we need something more principled:
(1) Add the start and end logits to get a new score, and look at the best n_best_size start/end pairs.
From these n_best_size start/end pairs, derive the candidate answers, check that each is valid, then rank them by score and pick the highest-scoring one as the answer.

n_best_size = 20
import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
# Gather the positions of the best start and end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # a plausible pair has start <= end
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # we will recover the answer text later from the token indices
                }
            )

(2) Then we sort valid_answers by score and keep the best one.

(3) One final check remains: verify that the start/end positions point into the context rather than into the question.

To do this, we need to add two pieces of information to the validation features:

  • the ID of the example that generated the feature: since one example may generate several features, each feature/slice needs to know which example it came from;
  • the offset mapping: the map from each slice's token positions back to the character positions in the original text.

So we re-process the validation set, slightly differently from prepare_train_features at training time.

4.2 Processing the validation set

def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

As before, we apply the prepare_validation_features function to every sample of the validation set.

validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

Use the Trainer.predict method to get all the predictions:

raw_predictions = trainer.predict(validation_features)

The Trainer hides some attributes the model did not use during training (here example_id and offset_mapping, which we need for post-processing), so we set them back:

validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

When a token position falls inside the question part, the prepare_validation_features function sets its offset mapping to None, so we can easily tell whether a token lies inside the context. We likewise discard answers that are excessively long.

4.3 Getting the validation results

max_answer_length = 30
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to match the example_id to
# an example index
context = datasets["validation"][0]["context"]

# Gather the indices the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char: end_char]
                }
            )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

The result:

    [{'score': 16.706663, 'text': 'Denver Broncos'},
     {'score': 14.635585,
      'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
     {'score': 13.234194, 'text': 'Carolina Panthers'},
     {'score': 12.468662, 'text': 'Broncos'},
     {'score': 11.709289, 'text': 'Denver'},
     {'score': 10.397583,
      'text': 'Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
     {'score': 10.104669,
      'text': 'American Football Conference (AFC) champion Denver Broncos'},
     {'score': 9.721636,
      'text': 'The American Football Conference (AFC) champion Denver Broncos'},
     {'score': 9.007437,
      'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10'},
     {'score': 8.834958,
      'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina'},
     {'score': 8.38701,
      'text': 'Denver Broncos defeated the National Football Conference (NFC)'},
     {'score': 8.143825,
      'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title.'},
     {'score': 8.03359,
      'text': 'American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
     {'score': 7.832466,
      'text': 'Denver Broncos defeated the National Football Conference (NFC'},
     {'score': 7.650557,
      'text': 'The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
     {'score': 7.6060467, 'text': 'Carolina Panthers 24–10'},
     {'score': 7.5795317,
      'text': 'Denver Broncos defeated the National Football Conference'},
     {'score': 7.433568, 'text': 'Carolina'},
     {'score': 6.742434,
      'text': 'Carolina Panthers 24–10 to earn their third Super Bowl title.'},
     {'score': 6.71136,
      'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24'}]

Now simply compare the predicted answer with the gold answer:

datasets["validation"][0]["answers"]
    {'answer_start': [177, 177, 177],
     'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']}

The model got it right!

As the example above shows, the first feature necessarily comes from the first example, so that case is easy. For the other features, we need a map between features and examples; and since one example may be sliced into several features, we must also collect the answers from all of an example's features.

The code below builds this map between example indices and feature indices.

import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

With that, the post-processing is essentially complete.

One last thing remains:

how to handle the no-answer case (when squad_v2=True).

The code above only considers answers inside the context, so we also need to collect the score of the no-answer prediction (the no-answer prediction corresponds to start and end at the CLS token). When one example has several features, the no-answer prediction must be considered in every feature, so the final no-answer score of an example is the minimum of its features' no-answer scores.

The question is unanswerable only when this final no-answer score beats the scores of all the other answers.

Putting it all together:

from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or feature_null_score < min_null_score:  # keep the minimum, as described above
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions

Apply the post-processing function to the raw predictions:

final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)
Post-processing 10570 example predictions split into 10784 features.

4.4 Computing the evaluation metric

Next we load the metric:

metric = load_metric("squad_v2" if squad_v2 else "squad")

Then we compute the metric from the predictions and the references. For a fair comparison, we need to put both the predictions and the labels into the expected format. For squad_v2, the metric also expects a no_answer_probability argument (since we already set unanswerable predictions to the empty string, we simply set this to 0.0).

if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)
{'exact_match': 76.74550614947965, 'f1': 85.13412652023338}

Finally, don't forget to check how to upload a model and push your model to the 🤗 Model Hub. Afterwards you can use it directly by name, just like at the start of this notebook.

Task 2: Fine-tuning a transformer model on multiple-choice question answering

The Jupyter notebook for this part is in the Chapter 4 code repository.

If you open this Jupyter notebook in Colab, you need to install 🤗 Transformers and 🤗 Datasets. The command is below (uncomment and run; if it is slow, switch to a domestic mirror by adding the argument on the second line).
Before running the cells, it is recommended to set up a dedicated python environment for this course, following the hints in the project readme.

#! pip install datasets transformers 
# -i https://pypi.tuna.tsinghua.edu.cn/simple

You can find the corresponding python script of this Jupyter notebook here; it also lets you fine-tune your model on several GPUs or TPUs in a distributed fashion.

In this notebook we show how to fine-tune any 🤗 Transformers model for the multiple-choice task, which asks the model to pick the most plausible of several given answers. We use the SWAG dataset, but the same preprocessing applies to other multiple-choice datasets or to your own data. SWAG is a commonsense inference dataset: each sample describes a situation and offers four possible continuations.

model_checkpoint = "bert-base-uncased"
batch_size = 16

1. Loading the dataset

We will use the 🤗 Datasets library to download the data; this is done easily with the load_dataset function.

from datasets import load_dataset, load_metric

load_dataset caches the dataset, so it will not be downloaded again on the next run.

datasets = load_dataset("swag", "regular")
Reusing dataset swag (/home/sgugger/.cache/huggingface/datasets/swag/regular/0.0.0/f9784740e0964a3c799d68cec0d992cc267d3fe94f3e048175eca69d739b980d)

Alternatively, you can download the data from the link we provide and unzip it, copy the three resulting csv files into the docs/篇章4-使用Transformers解决NLP任务/datasets/swag directory, and load them with the code below.

import os

data_path = './datasets/swag/'
cache_dir = os.path.join(data_path, 'cache')
data_files = {'train': os.path.join(data_path, 'train.csv'), 'val': os.path.join(data_path, 'val.csv'), 'test': os.path.join(data_path, 'test.csv')}
datasets = load_dataset(data_path, 'regular', data_files=data_files, cache_dir=cache_dir)
Using custom data configuration regular-2ab2d66f12115abf


Downloading and preparing dataset swag/regular (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to ./datasets/swag/cache/swag/regular-2ab2d66f12115abf/0.0.0/a16ae67faa24f4cdd6d1fc6bfc09bdb6dc15771716221ff8bacbc6cc75533614...

Dataset swag downloaded and prepared to ./datasets/swag/cache/swag/regular-2ab2d66f12115abf/0.0.0/a16ae67faa24f4cdd6d1fc6bfc09bdb6dc15771716221ff8bacbc6cc75533614. Subsequent calls will reuse this data.

The dataset object itself is a DatasetDict, containing keys for the training, validation, and test splits (mnli is a special case, with extra keys for the mismatched validation and test sets).

datasets
DatasetDict({
    train: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 73546
    })
    validation: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20006
    })
    test: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20005
    })
})

To access an actual element, you need to select a split first, then give an index:

datasets["train"][0]
{'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'fold-ind': '3416',
 'gold-source': 'gold',
 'label': 0,
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'video-id': 'anetv_jkn6uvmqwh4'}

To see what the data looks like, the following function displays some randomly selected examples from the dataset.

from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
show_random_elements(datasets["train"])
(Columns: ending0 · ending1 · ending2 · ending3 · fold-ind · gold-source · label · sent1 · sent2 · startphrase · video-id; startphrase is always sent1 followed by sent2.)

0 · sent1: A man is wiping the skiboard. · sent2: Group of people · endings: (0) are seated on a field. (1) are skiing down the slope. (2) are in a lift. (3) are pouring out in a man. · label: 1 · gold-source: gold · fold-ind: 16668 · video-id: anetv_JmL6BiuXr_g
1 · sent1: The credits of the video are shown. · sent2: A lady · endings: (0) performs stunts inside a gym. (1) shows several shopping in the water. (2) continues his skateboard while talking. (3) is putting a black bike close. · label: 0 · gold-source: gold · fold-ind: 11424 · video-id: anetv_dWyE0o2NetQ
2 · sent1: Through his binoculars, someone watches a handful of surfers being rolled up into the wave. · sent2: Someone · endings: (0) is emerging into the hospital. (1) are strewn under water at some wreckage. (2) tosses the wand together and saunters into the marketplace. (3) swats him upside down. · label: 1 · gold-source: gen · fold-ind: 15023 · video-id: lsmdc3016_CHASING_MAVERICKS-6791
3 · sent1: He tips it upside down, and its little umbrella falls to the floor. · sent2: Back inside, someone · endings: (0) spies someone sitting below. (1) opens the fridge and checks out the photo. (2) puts a little sheepishly. (3) staggers up to him. · label: 3 · gold-source: gold · fold-ind: 5475 · video-id: lsmdc1008_Spider-Man2-75503
4 · sent1: Someone kisses her smiling daughter on the cheek and beams back at the camera. · sent2: Someone · endings: (0) carries her to the grave. (1) laughs as someone styles her hair. (2) sets down his glass. (3) stares after her then trudges back up into the street. · label: 1 · gold-source: gen · fold-ind: 6904 · video-id: lsmdc1028_No_Reservations-83242
5 · sent1: But before he can use his wand, he accidentally rams it up the troll's nostril. · sent2: The angry troll · endings: (0) stops someone and sweeps all the way back from the lower deck to join them. (1) is being dragged towards the monstrous animation. (2) beats out many events at the touch of the sword, crawling it. (3) reaches into a pocket and yanks open the door. · label: 1 · gold-source: gen · fold-ind: 14089 · video-id: lsmdc1053_Harry_Potter_and_the_philosophers_stone-95867
6 · sent1: Someone keeps his tired eyes on the road. · sent2: Glancing over, he · endings: (0) sees someone's name in the photo. (1) gives a surprised look. (2) kneels down and touches his ripped specs. (3) spies on someone's clock. · label: 1 · gold-source: gen · fold-ind: 8407 · video-id: lsmdc1024_Identity_Thief-82693
7 · sent1: Both people are knocked back a few steps from the force of the collision. · sent2: She · endings: (0) stops as someone speaks into the camera. (1) notices how blue his eyes are. (2) is flung out of the door and knocks the boy over. (3) flies through the air, its a fireball. · label: 1 · gold-source: gold · fold-ind: 4523 · video-id: lsmdc0043_Thelma_and_Luise-68271
8 · sent1: A guy waits in the waiting room with his pet. · sent2: A pet store and its van · endings: (0) sits close to the river. (1) have pet's supplies and pets. (2) pops parked outside the dirt facility, sending up a car highway to catch control. (3) displays all kinds of power tools and website. · label: 1 · gold-source: gold · fold-ind: 8112 · video-id: anetv_9VWoQpg9wqE
9 · sent1: Inside a convenience store, she opens a freezer case. · sent2: Dolce · endings: (0) the slender someone, someone turns on the light. (1) , someone gives them to her boss then dumps some alcohol into dough. (2) liquids from a bowl, she slams them drunk. (3) wags his tail as someone returns to the hotel room. · label: 3 · gold-source: gold · fold-ind: 10867 · video-id: lsmdc3090_YOUNG_ADULT-43871

Each example in the dataset has a context, composed of a first sentence (field sent1) and an introduction to the second sentence (field sent2). Then four possible endings are given (fields ending0, ending1, ending2, and ending3), and the model must pick the correct one (indicated by the field label). The following function shows an example more intuitively:

def show_one(example):
    print(f"Context: {example['sent1']}")
    print(f"  A - {example['sent2']} {example['ending0']}")
    print(f"  B - {example['sent2']} {example['ending1']}")
    print(f"  C - {example['sent2']} {example['ending2']}")
    print(f"  D - {example['sent2']} {example['ending3']}")
    print(f"\nGround truth: option {['A', 'B', 'C', 'D'][example['label']]}")
show_one(datasets["train"][0])
Context: Members of the procession walk down the street holding small horn brass instruments.
  A - A drum line passes by walking down the street playing their instruments.
  B - A drum line has heard approaching them.
  C - A drum line arrives and they're outside dancing and asleep.
  D - A drum line turns the lead singer watches the performance.

Ground truth: option A
show_one(datasets["train"][15])
Context: Now it's someone's turn to rain blades on his opponent.
  A - Someone pats his shoulder and spins wildly.
  B - Someone lunges forward through the window.
  C - Someone falls to the ground.
  D - Someone rolls up his fast run from the water and tosses in the sky.

Ground truth: option C

2. Preprocessing the data

Before we can feed these texts to the model, we need to preprocess them. This is done by a 🤗 Transformers Tokenizer which, as the name indicates, tokenizes the inputs and converts the tokens to their corresponding IDs in the pre-trained vocabulary, then puts them in the format the model expects, generating the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the from_pretrained method of AutoTokenizer, which ensures that:

- we get a tokenizer that matches the model architecture we want to use,
- we download the vocabulary used when pre-training this specific model.

That vocabulary is cached, so it is not downloaded again the next time we run this.

from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We pass use_fast=True to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. These fast tokenizers are available for almost all models, but if you got an error in the previous call, remove that argument.

You can call this tokenizer directly on one sentence or on a pair of sentences:

tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we are doing here; just know they are required by the model we will instantiate later. If you are interested, you can learn more about them in this tutorial.

To preprocess the dataset, we need to know the names of the columns containing the sentences (as shown in the example above).

We can now write the function that preprocesses our samples. The tricky part, before calling the tokenizer, is to put all the possible sentence pairs into two big lists, then un-flatten the result afterwards so that each example has four input ids, attention masks, and so on.

When calling the tokenizer, we pass truncation=True. This ensures that an input longer than the selected model can handle is truncated to the maximum length the model accepts.

ending_names = ["ending0", "ending1", "ending2", "ending3"]

def preprocess_function(examples):
    # Repeat each first sentence four times to go with the four possibilities of second sentences.
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    # Grab all second sentences possible for each context.
    question_headers = examples["sent2"]
    second_sentences = [[f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)]
    
    # Flatten everything
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])
    
    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten
    return {k: [v[i:i+4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}

This function works with one or several examples. In the case of several examples, the tokenizer returns, for each key, a list of lists of lists: a list over all examples (here 5), then a list over all choices (4), then the list of input IDs (whose length varies here, since we have not applied any padding).

examples = datasets["train"][:5]
features = preprocess_function(examples)
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])
5 4 [30, 25, 30, 28]

Let's decode the inputs of a given example:

idx = 3
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(4)]
['[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession are playing ping pong and celebrating one left each in quick. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession wait slowly towards the cadets. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession play and go back and forth hitting the drums while the audience claps for them. [SEP]']

We can compare this with the corresponding ground truth:

show_one(datasets["train"][3])
Context: A drum line passes by walking down the street playing their instruments.
  A - Members of the procession are playing ping pong and celebrating one left each in quick.
  B - Members of the procession wait slowly towards the cadets.
  C - Members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions.
  D - Members of the procession play and go back and forth hitting the drums while the audience claps for them.

Ground truth: option D

This looks fine. We can now apply this function to all the examples in our dataset by using the map method of the dataset object we created earlier. It will be applied to all elements of all the splits of the dataset object, so our training, validation, and test data will be preprocessed the same way.

encoded_datasets = datasets.map(preprocess_function, batched=True)
Loading cached processed dataset at /home/sgugger/.cache/huggingface/datasets/swag/regular/0.0.0/f9784740e0964a3c799d68cec0d992cc267d3fe94f3e048175eca69d739b980d/cache-975c81cf12e5b7ac.arrow
Loading cached processed dataset at /home/sgugger/.cache/huggingface/datasets/swag/regular/0.0.0/f9784740e0964a3c799d68cec0d992cc267d3fe94f3e048175eca69d739b980d/cache-d4806d63f1eaf5cd.arrow
Loading cached processed dataset at /home/sgugger/.cache/huggingface/datasets/swag/regular/0.0.0/f9784740e0964a3c799d68cec0d992cc267d3fe94f3e048175eca69d739b980d/cache-258c9cd71b0182db.arrow

Even better, the results are automatically cached by the 🤗 Datasets library, to avoid spending time on this step the next run. The 🤗 Datasets library is normally smart enough to detect when the function passed to map changes (and then not to use the cached data); for instance, it will detect if you change the task in the first cell and rerun the notebook. When 🤗 Datasets uses cached files it shows a warning; you can pass load_from_cache_file=False in the call to map to skip the cached files and force the preprocessing to run again.

Note that we passed batched=True to encode the texts in batches. This takes full advantage of the fast tokenizer loaded earlier, which processes the texts in a batch concurrently with multiple threads.

3. Fine-tuning the model

Now that our data is ready, we can download the pre-trained model and fine-tune it. Since the task is multiple choice, we use the AutoModelForMultipleChoice class. As with the tokenizer, the from_pretrained method downloads and caches the model for us.

from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMultipleChoice: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

This warning tells us that some weights are being discarded (the cls.predictions and cls.seq_relationship layers) and that some others are randomly initialized (the classifier layer). This is entirely normal: we dropped the head used for masked language modeling during pre-training and replaced it with a new multiple-choice head, for which there are no pre-trained weights. The warning says the model needs fine-tuning before being used for inference, which is exactly what we are about to do.

To instantiate a Trainer, we need to define three more things. The most important is TrainingArguments, a class containing all the attributes used for training. It requires a folder name, used to save the model checkpoints, and all the other arguments are optional:

args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

Here we set evaluation at the end of each epoch, tweak the learning rate, use the batch_size defined at the top of the notebook, and customize the number of training epochs and the weight decay.

Then we need to tell our Trainer how to form batches from the preprocessed inputs. We have not done any padding yet, because we will pad each batch to the maximum length within the batch (rather than the maximum length over the whole dataset). This is the job of the data collator: it takes a list of examples and converts them into a batch (in our case, by applying padding). Since the library has no data collator for our specific problem, we adapt one ourselves from DataCollatorWithPadding:

from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch

When called on a list of examples, it flattens all the inputs/attention masks etc. into big lists that are passed to the tokenizer.pad method. This returns a dictionary of big tensors (of size (batch_size * 4) x seq_length) that we then un-flatten.

We can check that this data collator works as expected on a list of features; we just have to make sure to remove all the input features that our model does not accept (something the Trainer does automatically for us):

accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(10)]
batch = DataCollatorForMultipleChoice(tokenizer)(features)

Again, all this flattening and un-flattening is a potential source of errors, so let's run another sanity check on our inputs:

[tokenizer.decode(batch["input_ids"][8][i].tolist()) for i in range(4)]
['[CLS] someone walks over to the radio. [SEP] someone hands her another phone. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] someone walks over to the radio. [SEP] someone takes the drink, then holds it. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] someone walks over to the radio. [SEP] someone looks off then looks at someone. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] someone walks over to the radio. [SEP] someone stares blearily down at the floor. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']
show_one(datasets["train"][8])
Context: Someone walks over to the radio.
  A - Someone hands her another phone.
  B - Someone takes the drink, then holds it.
  C - Someone looks off then looks at someone.
  D - Someone stares blearily down at the floor.

Ground truth: option D

Everything works fine!

The last thing to define for our Trainer is how to compute the metrics from the predictions. We need to define a function for it; the only preprocessing we have to do is take the argmax of our predicted logits (the accuracy is then computed directly):

import numpy as np

def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}

Then we just pass all of this, along with our datasets, to the Trainer:

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)

We can now fine-tune the model by calling the train method:

trainer.train()


Finally, don't forget to upload your model to the 🤗 Model Hub.
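As a final sanity check, here is a hedged inference sketch that scores the four endings of one validation example with the fine-tuned model (reusing the tokenizer, datasets, and model objects from above):

import torch

example = datasets["validation"][0]
first = [example["sent1"]] * 4
second = [example["sent2"] + " " + example["ending" + str(i)] for i in range(4)]
enc = tokenizer(first, second, truncation=True, padding=True, return_tensors="pt")

# The model expects (batch_size, num_choices, seq_len), so add a batch dimension.
inputs = {k: v.unsqueeze(0).to(model.device) for k, v in enc.items()}
model.eval()
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 4): one score per ending
print("Predicted option:", ["A", "B", "C", "D"][logits.argmax(-1).item()])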

Reference

(1) Datawhale course
(2) 进击的BERT: https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html
(3) https://relph1119.github.io/my-team-learning/#/transformers_nlp28/task08
(4) A fellow student's notes (with additional details): https://ifwind.github.io/2021/08/30/BERT%E5%AE%9E%E6%88%98%E2%80%94%E2%80%94%EF%BC%884%EF%BC%89%E9%97%AE%E7%AD%94%E4%BB%BB%E5%8A%A1-%E6%8A%BD%E5%8F%96%E5%BC%8F%E9%97%AE%E7%AD%94/
