HuggingFace NLP Notes 6.3: Fast tokenizers in the QA pipeline

We will now dive into the question-answering pipeline and see how to leverage the offsets to grab the exact answer to the question at hand from the context, a bit like we did for the grouped entities in the previous section. Then we will see how to deal with very long contexts that end up being truncated, and still find the answer. You can skip this section if you're not interested in the question answering task.

Using the question-answering pipeline

As we saw in Chapter 1, we can use the question-answering pipeline like this to get the answer to a question:

from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)
{'score': 0.97773,
 'start': 78,
 'end': 105,
 'answer': 'Jax, PyTorch and TensorFlow'}

Unlike the other pipelines, which can't truncate and split texts longer than the maximum length accepted by the model (and may thus miss information at the end of a document), this pipeline can deal with very long contexts and will return the answer to the question even if it's at the end:

long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, context=long_context)
{'score': 0.97149,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

Now let's see how it does all of this!

Using a model for question answering

Like with any other pipeline, we start by tokenizing our input and then send it through the model. The checkpoint used by default for the question-answering pipeline is distilbert-base-cased-distilled-squad (the "squad" in the name comes from the dataset the model was fine-tuned on; we'll talk more about the SQuAD dataset in Chapter 7 [https://huggingface.co/course/chapter7/7]):

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

Note that we tokenize the question and the context as a pair, with the question first.
(Figure: an example of the tokenized question/context pair, with the start and end token indices of the answer highlighted.)

Models for question answering work a little differently from the models we've seen so far. Using the picture above as an example, the model has been trained to predict the index of the token where the answer starts (here 21) and the index of the token where the answer ends (here 24). This is why these models don't return one tensor of logits over all classes but two: one for the logits corresponding to the start token of the answer, and one for the logits corresponding to the end token of the answer. Since in this case our input contains 66 tokens, we get:

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
torch.Size([1, 66]) torch.Size([1, 66])

To convert those logits into probabilities, we will apply a softmax function, but before that we need to make sure we mask the indices we don't want to predict. Our input is [CLS] question [SEP] context [SEP], so we need to mask the tokens of the question as well as the [SEP] token. We'll keep the [CLS] token, however, as some models use it to indicate that the answer is not in the context.

Since we will apply a softmax afterward, we just need to replace the logits we want to mask with a large negative number. Here, we use -10000:

import torch

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

Now that we have properly masked the logits corresponding to positions we don't want to predict, we can apply the softmax:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

At this stage, we could take the argmax of the start and end probabilities, but we might end up with a start index that is greater than the end index, so we need to take a few more precautions. We will compute the probability of each possible pair of start_index and end_index where start_index <= end_index, then take the tuple (start_index, end_index) with the highest probability.

Assuming the events "The answer starts at start_index" and "The answer ends at end_index" to be independent, the probability that the answer starts at start_index and ends at end_index is:
start_probabilities[start_index] * end_probabilities[end_index]

So, to compute all the scores, we just need to compute all the products start_probabilities[start_index] * end_probabilities[end_index] where start_index <= end_index.

First, let's compute all the possible products:

scores = start_probabilities[:, None] * end_probabilities[None, :]

Then we'll mask the values where start_index > end_index by setting them to 0 (the other probabilities are all positive numbers). The torch.triu() function returns the upper triangular part of the 2D tensor passed as an argument, so it will do that masking for us:

scores = torch.triu(scores)

Now we just have to get the index of the maximum. Since PyTorch returns the index into the flattened tensor, we need to use the floor division // and modulus % operations to get the start_index and end_index:

max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])

We're not quite done yet, but at least we already have the correct score for the answer (you can check this by comparing it to the first result in the previous section):

0.97773

📝 Try it out! Compute the start and end indices of the five most likely answers.
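
Here is one possible sketch for this exercise (not a reference solution); it assumes the scores matrix computed above is still in memory and simply takes the five largest entries of its upper triangular part:

# Hypothetical sketch: top 5 (start_index, end_index) pairs by score
top_scores, flat_indices = torch.topk(torch.triu(scores).flatten(), k=5)
for score, flat_index in zip(top_scores, flat_indices):
    start_index = flat_index.item() // scores.shape[1]
    end_index = flat_index.item() % scores.shape[1]
    print(start_index, end_index, score.item())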

We have the start_index and end_index of the answer in terms of tokens, so now we just need to convert them to character indices in the context. This is where the offsets will be super useful. We can grab them and use them like we did in the token classification task:

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

Now we just have to format everything to get our result:

result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
print(result)
{'answer': 'Jax, PyTorch and TensorFlow',
 'start': 78,
 'end': 105,
 'score': 0.97773}

Great! That's the same answer as in our first example!

📝 Try it out! Use the best scores you computed earlier to show the five most likely answers. To check your results, go back to the first pipeline and pass in top_k=5 when calling it.
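
For the pipeline check mentioned in the exercise, the call is a minimal usage of the top_k argument referred to above:

# Ask the pipeline for its five best answers instead of just one
question_answerer(question=question, context=context, top_k=5)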

Handling long contexts

If we try to tokenize the question and the long context we used as an example earlier, we'll get a number of tokens higher than the maximum length used in the question-answering pipeline (which is 384):

inputs = tokenizer(question, long_context)
print(len(inputs["input_ids"]))
461

So we'll need to truncate our inputs at that maximum length. There are several ways we can do this, but we don't want to truncate the question, only the context. Since the context is the second sentence, we'll use the "only_second" truncation strategy. The problem that arises then is that the answer to the question may not be in the truncated context. Here, for instance, we picked a question where the answer is toward the end of the context, and when we truncate it that answer is not present anymore:

inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))
"""
[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP

[UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

[UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internal [SEP]
"""

This means the model will have a hard time picking the correct answer. To fix this, the question-answering pipeline allows us to split the context into smaller chunks, specifying the maximum length. To make sure we don't split the context at exactly the wrong place and lose the answer, it also includes some overlap between the chunks.

We can have the tokenizer (fast or slow) do this for us by adding return_overflowing_tokens=True, and we can specify the overlap we want with the stride argument. Here is an example, using a smaller sentence:

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
'[CLS] This sentence is not [SEP]'
'[CLS] is not too long [SEP]'
'[CLS] too long but we [SEP]'
'[CLS] but we are going [SEP]'
'[CLS] are going to split [SEP]'
'[CLS] to split it anyway [SEP]'
'[CLS] it anyway. [SEP]'

As we can see, the sentence has been split into chunks in such a way that each entry in inputs["input_ids"] has at most 6 tokens (we would need to add padding for the last entry to be the same size as the others) and there is an overlap of 2 tokens between consecutive entries.

Let's take a closer look at the result of the tokenization:

print(inputs.keys())
dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])

As expected, we get the input IDs and an attention mask. The last key, overflow_to_sample_mapping, is a map that tells us which sentence each of the results corresponds to. Here we have 7 results that all come from the (only) sentence we passed to the tokenizer:

print(inputs["overflow_to_sample_mapping"])
[0, 0, 0, 0, 0, 0, 0]

This is more useful when we tokenize several sentences together. For instance, this:

sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])

gives us:

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

which means the first sentence is split into 7 chunks as before, and the next 4 chunks come from the second sentence.
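
As a small illustrative sketch (not part of the course code), we could use this mapping to group the decoded chunks by the sentence they came from; the variable names below are just for illustration:

from collections import defaultdict

# Group decoded chunks by the index of the sentence they come from
chunks_per_sentence = defaultdict(list)
for ids, sample_idx in zip(inputs["input_ids"], inputs["overflow_to_sample_mapping"]):
    chunks_per_sentence[sample_idx].append(tokenizer.decode(ids))

for sample_idx, chunks in chunks_per_sentence.items():
    print(f"Sentence {sample_idx}: {len(chunks)} chunks")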

Now let's go back to our long context. By default, the question-answering pipeline uses a maximum length of 384 (as we mentioned earlier) and a stride of 128, which correspond to the way the model was fine-tuned (you can adjust those parameters by passing max_seq_len and stride arguments when calling the pipeline). We will thus use those parameters when tokenizing. We'll also add padding (to have samples of the same length, so we can build tensors) and ask for the offsets:

inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

Those inputs will contain the input IDs and attention masks the model expects, as well as the offsets and the overflow_to_sample_mapping we just talked about. Since those two are not parameters used by the model, we'll pop them out of the inputs (and we won't store the map, since it's not useful here) before converting the inputs to tensors:

_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)
torch.Size([2, 384])

Our long context was split in two, which means that after it goes through the model, we will have two sets of start and end logits:

outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
torch.Size([2, 384]) torch.Size([2, 384])

Like before, we first mask the tokens that are not part of the context before taking the softmax. We also mask all the padding tokens (as flagged by the attention mask):

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

Then we can use the softmax to convert our logits to probabilities:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

The next step is similar to what we did for the small context, but we repeat it for each of our two chunks. We attribute a score to all possible spans of answer, then take the span with the best score:

candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)
[(0, 18, 0.33867), (173, 184, 0.97149)]

Those two candidates correspond to the best answers the model was able to find in each chunk. The model is way more confident that the right answer is in the second part (which is a good sign!). Now we just have to map those two token spans to spans of characters in the context (we only need to map the second one to have our answer, but it's interesting to see what the model picked in the first chunk).

📝 Try it out! Adapt the code above to return the scores and spans of the five most likely answers (in total, not per chunk).
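
One possible sketch for this exercise (again not a reference solution), assuming start_probabilities and end_probabilities from above are still available: gather the best candidates from every chunk, then rank them globally:

all_candidates = []
for chunk_idx, (start_probs, end_probs) in enumerate(
    zip(start_probabilities, end_probabilities)
):
    # Upper-triangular score matrix for this chunk
    chunk_scores = torch.triu(start_probs[:, None] * end_probs[None, :])
    top_scores, flat_indices = torch.topk(chunk_scores.flatten(), k=5)
    for score, flat_index in zip(top_scores, flat_indices):
        start_idx = flat_index.item() // chunk_scores.shape[1]
        end_idx = flat_index.item() % chunk_scores.shape[1]
        all_candidates.append((chunk_idx, start_idx, end_idx, score.item()))

# Keep the five best (start, end) pairs across all chunks
best_five = sorted(all_candidates, key=lambda c: c[3], reverse=True)[:5]
print(best_five)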

The offsets we grabbed earlier are actually a list of lists of offsets, with one list per chunk of text:

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)
{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33867}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.97149}

If we ignore the first result, we get the same result as our pipeline for this long context. Great!

📝 Try it out! Use the best scores you computed before to show the five most likely answers in the whole context. To check your results, go back to the first pipeline and pass in top_k=5 when calling it.

This concludes our deep dive into the tokenizer's capabilities. We will put all of this into practice again in the next chapter, when we show you how to fine-tune a model on a range of common NLP tasks.
