BERT in Depth, Hands-On (PyTorch): Fine-Tuning for Question Answering

https://www.bilibili.com/video/BV1K5411t7MD?p=5
https://www.youtube.com/channel/UCoRX98PLOsaN8PtekB9kWrw/videos
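BERT in Depth, Hands-On (PyTorch), by ChrisMcCormickAI
This is the code from ChrisMcCormickAI's YouTube walkthrough "Question Answering with a Fine-Tuned BERT". The Colab link is below the YouTube video. If you cannot access it, leave your email address and I will send you everything once I have finished going through it and writing it up. For the fine-tuning parts, though, it is still best to run the code on Colab.

Fine-Tuned BERT: Question Answering

What does it mean for BERT to achieve "human-level performance on question answering"? Is BERT the greatest search engine ever, able to find the answer to any question we pose?

In Part 1 of this post, I'll explain what it really means to apply BERT to QA, and walk through the details.

Part 2 contains example code: we'll download a model that has already been fine-tuned for question answering, and try it out on our own text!

For something like the text classification covered earlier, you definitely want to fine-tune BERT on your own dataset. For question answering, however, it seems you can get decent results with a model that has already been fine-tuned on the SQuAD benchmark. In this notebook we'll do exactly that, and see that it performs well on text that is not from the SQuAD dataset.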

Links

My video walkthrough on this topic.
The blog post version.
The Colab Notebook.

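Part 1: How BERT Is Applied to Question Answering

The SQuAD v1.1 Benchmark

When someone mentions "question answering" as an application of BERT, what they are really referring to is applying BERT to the Stanford Question Answering Dataset (SQuAD). The task posed by SQuAD is a little different from what you might expect. Given a question and a passage of text containing the answer, BERT needs to highlight the "span" of text corresponding to the correct answer.

The SQuAD homepage has a fantastic tool for exploring the questions and reference text of this dataset, and it even shows the predictions made by top-performing models.

For example, here are some interesting examples on the topic of Super Bowl 50.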

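BERT Input Format

To feed a QA task into BERT, we pack both the question and the reference text into the input.

[Figure: BERT input format, with the question and the reference text packed into one sequence]

The two pieces of text are separated by the special [SEP] token.

BERT also uses "Segment Embeddings" to differentiate the question from the reference text. These are simply two embeddings (for segments "A" and "B") that BERT learned, and which it adds to the token embeddings before feeding them into the input layer.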
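The packing and segment-embedding steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the tokens are split by hand rather than by a real tokenizer, the 3-dimensional embedding values are made up, and the helper names `pack_pair` and `add_segment_embeddings` are hypothetical, not part of BERT or of any library.

```python
def pack_pair(question_tokens, text_tokens):
    """Pack a question and a reference text into one BERT-style sequence.

    Returns the token list and a parallel list of segment IDs
    (0 for segment A, 1 for segment B).
    """
    seg_a = ['[CLS]'] + question_tokens + ['[SEP]']
    seg_b = text_tokens + ['[SEP]']
    tokens = seg_a + seg_b
    segment_ids = [0] * len(seg_a) + [1] * len(seg_b)
    return tokens, segment_ids


def add_segment_embeddings(token_embeddings, segment_ids, segment_table):
    """Element-wise add the segment A or B vector to each token embedding."""
    return [
        [t + s for t, s in zip(tok_vec, segment_table[seg_id])]
        for tok_vec, seg_id in zip(token_embeddings, segment_ids)
    ]


# Hand-split toy tokens (a real tokenizer would produce WordPiece tokens).
tokens, segment_ids = pack_pair(['how', 'big', '?'], ['it', 'is', 'huge'])

# Made-up values: every token gets the same toy embedding here.
# Real BERT also adds position embeddings, which this sketch leaves out.
token_embeddings = [[1.0, 1.0, 1.0] for _ in tokens]
segment_table = {0: [0.5, 0.0, 0.0], 1: [0.0, 0.5, 0.0]}

combined = add_segment_embeddings(token_embeddings, segment_ids, segment_table)
print(tokens)
print(segment_ids)
```

In the sketch, the first [SEP] is counted as part of segment A, which matches how the segment IDs are built in Part 2.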

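Start & End Token Classifiers

BERT needs to highlight a "span" of text containing the answer. This is represented as simply predicting which token marks the start of the answer, and which token marks the end.

[Figure: start token classification]

For every token in the text, we feed its final embedding into the start token classifier. The start token classifier has only a single set of weights (represented by the blue "start" rectangle in the figure above), which it applies to every word.

After taking the dot product between the output embeddings and the "start" weights, we apply the softmax activation to produce a probability distribution over all of the words. Whichever word has the highest probability of being the start token is the one we pick.

We repeat this process for the end token, for which we have a separate weight vector.

[Figure: end token classification]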
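The dot product, softmax, and argmax steps described above can be sketched as follows. This is a minimal illustration, assuming made-up 2-dimensional embeddings and weight vectors. The helper name `classify_token` is hypothetical, and the bias term that a real linear layer would add is omitted.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_token(final_embeddings, weights):
    """Score every token with one shared weight vector, then apply softmax."""
    scores = [sum(e * w for e, w in zip(emb, weights)) for emb in final_embeddings]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Made-up final embeddings for a 4-token input (2 dimensions each).
final_embeddings = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8], [0.3, 0.3]]
start_weights = [2.0, -1.0]  # hypothetical "start" weight vector
end_weights = [-1.0, 2.0]    # separate, hypothetical "end" weight vector

start_index, start_probs = classify_token(final_embeddings, start_weights)
end_index, end_probs = classify_token(final_embeddings, end_weights)
print(start_index, end_index)
```

The same `classify_token` call is used twice with different weights, which mirrors the point that the start and end classifiers share the token embeddings but not the weight vector.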

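Part 2: Example Code

In the example code below, we'll download a model that has already been fine-tuned for question answering, and try it out on our own text. If you do want to fine-tune on your own dataset, it is possible to fine-tune BERT for question answering yourself; see run_squad.py in the transformers library. However, you may find that the "fine-tuned-on-squad" model below already does a good job, even if your text is from a different domain.

Note: The example code in this notebook is a commented and expanded version of the short example provided in the transformers documentation.

1. Install the huggingface transformers Library

This example uses the transformers library from huggingface. We'll start by installing the package.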

!pip install transformers

import torch

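2. Load the Fine-Tuned BERT-large Model

For question answering, we use the BertForQuestionAnswering class from the transformers library.

This class supports fine-tuning, but for this example we will keep things simpler and load a BERT model that has already been fine-tuned for the SQuAD benchmark.

The transformers library has a large collection of pre-trained models which you can reference by name and load easily. The full list is in their documentation.

For question answering, they have a version of BERT-large that has already been fine-tuned for the SQuAD benchmark.

BERT-large is really big: it has 24 layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple of minutes to download to your Colab instance.

(Note that this download is not using your own network bandwidth; it happens between the Google instance and wherever the model is stored on the web.)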
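As a rough sanity check on those numbers, the parameter count can be estimated from the architecture. The 24 layers and hidden size of 1,024 come from the text above. The vocabulary size (30,522), maximum sequence length (512), and feed-forward size (4,096) are the standard bert-large-uncased configuration values, and are assumptions as far as this article is concerned. The estimate counts the encoder plus a pooling layer and assumes 32-bit floats, so the checkpoint on disk may differ slightly.

```python
# Figures from the article.
num_layers = 24
hidden = 1024

# Assumed: standard bert-large-uncased configuration values.
vocab_size = 30522
max_positions = 512
num_segments = 2
feed_forward = 4096

layer_norm = 2 * hidden  # scale and shift vectors

# Token, position and segment embedding tables plus one LayerNorm.
embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm

# One encoder layer: four attention projections (query, key, value, output),
# a two-layer feed-forward block, and two LayerNorms. Each projection has a bias.
attention = 4 * (hidden * hidden + hidden)
ffn = (hidden * feed_forward + feed_forward) + (feed_forward * hidden + hidden)
per_layer = attention + ffn + 2 * layer_norm

pooler = hidden * hidden + hidden

total_params = embeddings + num_layers * per_layer + pooler
size_gb = total_params * 4 / 1e9  # 4 bytes per 32-bit float

print('{:,} parameters, about {:.2f} GB'.format(total_params, size_gb))
```

Under these assumptions the estimate lands at roughly 335M parameters and about 1.34 GB, which is in line with the figures quoted above.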

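Note: I believe this model was trained on SQuAD 1.1, since it does not output whether the question is "impossible" to answer from the text (which is part of the task in SQuAD 2.0).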

from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

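Load the tokenizer as well.

Side note: Apparently the vocabulary of this model is identical to the one in bert-base-uncased. You can load the tokenizer from bert-base-uncased and that works just as well.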

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

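3. Ask a Question

Now we're ready to feed in an example!

A QA example consists of a question and a passage of text containing the answer to that question.

Let's try an example using the text in this tutorial!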

question = "How many parameters does BERT-large have?"
answer_text = "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."

We'll run the BERT tokenizer against both the question and the answer_text. To feed these into BERT, we actually concatenate them together and place the special [SEP] token in between.

# Apply the tokenizer to the input text, treating them as a text-pair.
input_ids = tokenizer.encode(question, answer_text)

print('The input has a total of {:} tokens.'.format(len(input_ids)))

To see exactly what the tokenizer is doing, let's print out the tokens with their IDs.

# BERT only needs the token IDs, but for the purpose of inspecting the
# tokenizer's behavior, let's also get the token strings and display them.
tokens = tokenizer.convert_ids_to_tokens(input_ids)

# For each token and its ID...
for token, token_id in zip(tokens, input_ids):

    # If this is the [SEP] token, add some space around it to make it stand out.
    if token_id == tokenizer.sep_token_id:
        print('')

    # Print the token string and its ID in two columns.
    print('{:<12} {:>6,}'.format(token, token_id))

    if token_id == tokenizer.sep_token_id:
        print('')

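We've concatenated the question and answer_text together, but BERT still needs a way to distinguish them. BERT has two special "Segment" embeddings, one for segment "A" and one for segment "B". Before the word embeddings go into the BERT layers, the segment A embedding needs to be added to the question tokens, and the segment B embedding needs to be added to each of the answer_text tokens.

These additions are handled for us by the transformers library; all we need to do is specify a '0' or '1' for each token.

Note: In the transformers library, huggingface likes to call these token_type_ids, but I'm going with segment_ids since this seems clearer and is consistent with the BERT paper.

# Search the input_ids for the first instance of the `[SEP]` token.
sep_index = input_ids.index(tokenizer.sep_token_id)

# The number of segment A tokens includes the [SEP] token itself.
num_seg_a = sep_index + 1

# The remainder are segment B.
num_seg_b = len(input_ids) - num_seg_a

# Construct the list of 0s and 1s.
segment_ids = [0]*num_seg_a + [1]*num_seg_b

# There should be a segment_id for every input token.
assert len(segment_ids) == len(input_ids)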

Side Note: Where’s the padding?

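The original example code does not perform any padding. I suspect this is because we are only feeding in a single example. If we instead fed in a batch of examples, we would need to pad or truncate all of the examples in the batch to a single length, and supply an attention mask to tell BERT to ignore the padding tokens.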
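For the batched case, the padding and attention-mask bookkeeping might look like the sketch below. It is plain Python and not the transformers API: the helper name `pad_batch` is hypothetical, the token IDs are toy values, and using 0 as the padding ID is an assumption (in practice it should come from `tokenizer.pad_token_id`).

```python
def pad_batch(batch_input_ids, pad_token_id=0, max_len=None):
    """Pad (or truncate) every sequence to one length and build attention masks.

    A mask value of 1 marks a real token; 0 marks padding to be ignored.
    """
    if max_len is None:
        max_len = max(len(ids) for ids in batch_input_ids)

    padded, masks = [], []
    for ids in batch_input_ids:
        # Naive truncation: this would also cut off the final [SEP] token.
        ids = ids[:max_len]
        num_pad = max_len - len(ids)
        padded.append(ids + [pad_token_id] * num_pad)
        masks.append([1] * len(ids) + [0] * num_pad)
    return padded, masks


# Toy token IDs, not produced by a real tokenizer.
batch = [[101, 2129, 102, 2009, 102], [101, 2129, 2116, 102, 2009, 2003, 102]]
padded, attention_mask = pad_batch(batch)

for ids, mask in zip(padded, attention_mask):
    print(ids, mask)
```

The segment IDs would need to be padded to the same length as well, so that all three lists line up token by token.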

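We're ready to feed our example into the model!

I modified his code slightly here. Probably because of a library version difference, the model's output no longer has the format his code expects. If you know why, please leave a comment so I can correct this. For details, see the Colab notebook.

# Run our example through the model.
output = model(torch.tensor([input_ids]),  # The tokens representing our input text.
               token_type_ids=torch.tensor([segment_ids]))  # The segment IDs to differentiate question from answer_text

The output is a dictionary-like object, with the scores available as output.start_logits and output.end_logits: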

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-6.4849, -6.4358, -8.1077, -8.8489, -7.8751, -8.0522, -8.4684, -8.5295,
         -7.7074, -9.2464, -6.4849, -2.7303, -6.3473, -5.7299, -7.7780, -7.0391,
         -6.3331, -7.3153, -7.3048, -7.4121, -2.2534, -5.3971, -0.9424, -7.3584,
         -5.4575, -7.0769, -4.4887, -3.9272, -5.6967, -5.9506, -5.0059, -5.9812,
          0.0530, -5.5968, -4.7093, -4.5750, -6.1786, -2.2294, -0.1904, -0.2327,
         -2.7331,  6.4256, -2.6543, -4.5655, -4.9872, -4.9834, -5.9110, -7.8402,
         -1.8986, -7.2123, -4.1543, -6.2354, -8.0953, -7.2329, -6.4411, -6.8384,
         -8.1032, -7.0570, -7.7332, -6.8711, -7.1045, -8.2966, -6.1939, -8.0817,
         -7.5501, -5.9695, -8.1008, -6.8849, -8.2273, -6.4850]],
       grad_fn=<SqueezeBackward1>), end_logits=tensor([[-2.0629, -6.3878, -6.2450, -6.3605, -7.0722, -7.6281, -7.1160, -6.8674,
         -7.1313, -7.1495, -2.0628, -5.0858, -4.7276, -3.5955, -6.3050, -7.1109,
         -4.4975, -4.7221, -5.4760, -5.5441, -6.1391, -5.8593, -0.4636, -4.3720,
         -1.0411, -5.3359, -6.2969, -6.1156, -5.1736, -4.6145, -4.8274, -6.3638,
         -4.2078, -5.2329, -4.7127,  0.7953, -0.7376, -4.5555, -5.2985, -3.6082,
         -3.7726,  2.7501,  5.4644,  4.1220,  1.2127, -5.5042, -5.8367, -6.0745,
         -3.8426, -5.8273, -1.9782, -1.3083, -2.4872, -5.3204, -6.5550, -6.3885,
         -6.8736, -6.3949, -7.0454, -6.0590, -4.5225, -6.6687, -4.0074, -6.9146,
         -6.9742, -6.5173, -4.8760, -4.4629, -4.7580, -2.0631]],
       grad_fn=<SqueezeBackward1>), hidden_states=None, attentions=None)

Now we can highlight the answer just by looking at the most probable start and end words.

# Find the tokens with the highest `start` and `end` scores.
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)

# Combine the tokens in the answer and print it out.
answer = ' '.join(tokens[answer_start:answer_end+1])

print('Answer: "' + answer + '"')

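It got the answer right! Awesome 😃

Note: Picking the highest start score and the highest end score independently is a bit naive. What if the predicted end word comes before the start word?! A correct implementation would pick the highest total score among the pairs for which end >= start.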
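One way to enforce the end >= start constraint is a brute-force search over all valid pairs, sketched below. This is not the code from the original notebook. The helper name `best_span` and the `max_answer_len` cap are hypothetical, and the scores are made up so that the two independent argmaxes come out in the wrong order.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair with the highest total score, start <= end."""
    best = None
    best_score = float('-inf')
    for start, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length cap.
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best, best_score


# Made-up scores: the best start is index 3, but the best end is index 1,
# so taking the two argmaxes independently would give an invalid span.
start_scores = [0.1, 0.5, 0.2, 4.0, 0.3]
end_scores = [0.2, 5.0, 0.1, 1.0, 3.5]

span, score = best_span(start_scores, end_scores)
print(span, score)
```

With the model output from this notebook, the two score lists could be obtained with `output.start_logits[0].tolist()` and `output.end_logits[0].tolist()`.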

With a little more effort, we can reconstruct any words that were broken down into subwords.

# Start with the first token.
answer = tokens[answer_start]

# Select the remaining answer tokens and join them with whitespace.
for i in range(answer_start + 1, answer_end + 1):
    
    # If it's a subword token, then recombine it with the previous token.
    if tokens[i][0:2] == '##':
        answer += tokens[i][2:]
    
    # Otherwise, add a space then the token.
    else:
        answer += ' ' + tokens[i]

print('Answer: "' + answer + '"')

# Answer: "340m"

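4. Visualizing the Scores

I was curious to see what the scores were for all of the words. The following cells generate bar plots showing the start and end scores for every word in the input.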

import matplotlib.pyplot as plt
import seaborn as sns

# Use plot styling from seaborn.
sns.set(style='darkgrid')    # gray background with a white grid

# Increase the plot size and font size.
#sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (16,8)

Retrieve all of the start and end scores, and use all of the tokens as x-axis labels.

# Pull the scores out of PyTorch Tensors and convert them to 1D numpy arrays.
start_scores = output.start_logits
end_scores = output.end_logits
s_scores = start_scores.detach().numpy().flatten()
e_scores = end_scores.detach().numpy().flatten()

# We'll use the tokens as the x-axis labels. In order to do that, they all need
# to be unique, so we'll add the token index to the end of each one.
token_labels = []
for (i, token) in enumerate(tokens):
    token_labels.append('{:} - {:>2}'.format(token, i))

Create a bar plot showing the score for each input word being the "start" word.

# Create a barplot showing the start word score for all of the tokens.
ax = sns.barplot(x=token_labels, y=s_scores, ci=None)

# Turn the xlabels vertical.
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="center")

# Turn on the vertical grid to help align words to scores.
ax.grid(True)

plt.title('Start Word Scores')

plt.show()

[Figure: bar plot of start word scores]

Create a second bar plot showing the score for each input word being the "end" word.

# Create a barplot showing the end word score for all of the tokens.
ax = sns.barplot(x=token_labels, y=e_scores, ci=None)

# Turn the xlabels vertical.
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="center")

# Turn on the vertical grid to help align words to scores.
ax.grid(True)

plt.title('End Word Scores')

plt.show()

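[Figure: bar plot of end word scores]

Alternate view: the author also shows a combined plot here. It does not look great, so run it yourself if you want to see it.

5. Turning It into a Function

Let's turn the QA process into a function so that we can easily try out other examples.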

def answer_question(question, answer_text):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Prints them out.
    '''
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    input_ids = tokenizer.encode(question, answer_text)

    # Report how long the input sequence is.
    print('Query has {:,} tokens.\n'.format(len(input_ids)))

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # ======== Evaluate ========
    # Run our example question through the model.
    output = model(torch.tensor([input_ids]),  # The tokens representing our input text.
                   token_type_ids=torch.tensor([segment_ids]))  # The segment IDs to differentiate question from answer_text

    start_scores, end_scores = output.start_logits, output.end_logits

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores)

    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]

    print('Answer: "' + answer + '"')

As our reference text, I've taken the abstract of the BERT paper (https://arxiv.org/pdf/1810.04805.pdf).

import textwrap

# Wrap text to 80 characters.
wrapper = textwrap.TextWrapper(width=80) 

bert_abstract = "We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement)."

print(wrapper.fill(bert_abstract))

Question

question = "What does the 'B' in BERT stand for?"

answer_question(question, bert_abstract)

Answer

Query has 258 tokens.

Answer: "bidirectional encoder representations from transformers"

Let's ask BERT about example applications of itself 😃

The answer to this question comes from this passage of the abstract:

“…BERT model can be finetuned with just one additional output
layer to create state-of-the-art models for a wide range of tasks, such as
question answering and language inference, without substantial taskspecific
architecture modifications.”

question = "What are some example applications of BERT?"

answer_question(question, bert_abstract)

Answer

Query has 255 tokens.

Answer: "question answering and language inference"