Stanford Question Answering Dataset (SQuAD)

彬彬侠

Published 2025-02-26 17:18:20

The Stanford Question Answering Dataset (SQuAD) is a high-quality dataset widely used in research on machine reading comprehension (MRC) and question answering (QA). It was created by researchers at Stanford University to advance text-based question answering in NLP.

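1. Overview of SQuAD

In the SQuAD task, a model is given a question and a passage (context) and must locate the answer to the question within that passage. Each example consists of:

- Context: a paragraph taken from Wikipedia.
- Question: a question written about that paragraph.
- Answer: a contiguous span extracted from the paragraph (span extraction), so the start and end positions of the answer are explicitly present in the text.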

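Main SQuAD versions

| Version | Task type | Size (questions) | Includes unanswerable questions |
| --- | --- | --- | --- |
| SQuAD 1.1 | MRC; every answer can be found in the text | 100,000+ | No |
| SQuAD 2.0 | MRC; includes unanswerable questions | 150,000+ | Yes |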

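2. The SQuAD 1.1 dataset

SQuAD 1.1 is the first major version. It consists of passages from Wikipedia articles paired with human-written question-answer pairs. The answer to every question is a substring of the passage (a text span), so there are no unanswerable questions.

SQuAD 1.1 data format

The SQuAD 1.1 JSON file is structured as follows: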

```
{
  "data": [
    {
      "title": "Super_Bowl_50",
      "paragraphs": [
        {
          "context": "Super Bowl 50 was an American football game ...",
          "qas": [
            {
              "id": "56be4db0acb8001400a502ec",
              "question": "What was the name of the football game?",
              "answers": [
                {"text": "Super Bowl 50", "answer_start": 0}
              ]
            }
          ]
        }
      ]
    }
  ]
}
```
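
To make the nesting concrete, here is a minimal sketch that walks the `data` → `paragraphs` → `qas` hierarchy and flattens it into one record per question. The helper name `flatten_squad` and the inline sample are illustrative assumptions, not part of any official tooling.

```python
import json

# A tiny inline sample with the same nesting as the structure shown above.
SAMPLE = """
{
  "data": [
    {
      "title": "Super_Bowl_50",
      "paragraphs": [
        {
          "context": "Super Bowl 50 was an American football game ...",
          "qas": [
            {
              "id": "56be4db0acb8001400a502ec",
              "question": "What was the name of the football game?",
              "answers": [{"text": "Super Bowl 50", "answer_start": 0}]
            }
          ]
        }
      ]
    }
  ]
}
"""


def flatten_squad(squad):
    """Yield one flat dict per question (hypothetical helper)."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "title": article["title"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answers": qa["answers"],
                }


records = list(flatten_squad(json.loads(SAMPLE)))
print(len(records), records[0]["question"])
```

For a real file, the same function could be applied to the result of `json.load` on the downloaded JSON.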

SQuAD 1.1 example

Context:

Super Bowl 50 was an American football game that determined the champion of the National Football League (NFL) for the 2015 season.

Question:

What was the name of the football game?

Answer:
```
Super Bowl 50
```
- answer_start: 0 is the character offset in the context at which the answer begins.
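
Because `answer_start` is a character offset, the answer text can be recovered by slicing the context. The sketch below checks this for the example above; the helper name `answer_span` is an assumption made for illustration.

```python
context = (
    "Super Bowl 50 was an American football game that determined the "
    "champion of the National Football League (NFL) for the 2015 season."
)
answer = {"text": "Super Bowl 50", "answer_start": 0}


def answer_span(context, answer):
    """Return (start, end) character offsets of the answer; end is exclusive."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    # Sanity check: the slice must reproduce the annotated answer text.
    if context[start:end] != answer["text"]:
        raise ValueError("answer_start does not point at the answer text")
    return start, end


start, end = answer_span(context, answer)
print(start, end, repr(context[start:end]))  # prints: 0 13 'Super Bowl 50'
```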

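3. The SQuAD 2.0 dataset

SQuAD 2.0 extends SQuAD 1.1 with unanswerable questions:

- Some questions have no answer in the passage.
- The goal is for the model to learn when to answer and when to abstain.

SQuAD 2.0 data format

The format is similar to SQuAD 1.1, but an unanswerable question has an empty answers list. The fragment below shows a single paragraph entry. The official files additionally mark each question with an is_impossible flag, which is omitted here for brevity: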

```
{
  "context": "Super Bowl 50 was an American football game ...",
  "qas": [
    {
      "id": "56be4db0acb8001400a502ec",
      "question": "What was the name of the football game?",
      "answers": [
        {"text": "Super Bowl 50", "answer_start": 0}
      ]
    },
    {
      "id": "56be4db0acb8001400a502ed",
      "question": "Who won Super Bowl 51?",
      "answers": []
    }
  ]
}
```

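SQuAD 2.0 example

Answerable question:
- Question: What was the name of the football game?
- Answer: Super Bowl 50
Unanswerable question:
- Question: Who won Super Bowl 51?
- Answer: No answer

A model trained on SQuAD 2.0 must not only find the correct answer but also learn when not to answer.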
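
One common recipe for BERT-style extractive models is to compare the score of the best answer span with the score of a "null" span placed at the `[CLS]` token, and to abstain when the null span wins by more than a threshold. The sketch below shows only that decision rule. The function name `choose_answer`, the `threshold` parameter, and the scores in the usage lines are made-up assumptions for illustration, not values from a real model.

```python
def choose_answer(span_text, span_score, null_score, threshold=0.0):
    """Return the span text, or "" to abstain.

    span_score: start logit + end logit of the best non-null span.
    null_score: start logit + end logit at the [CLS] position ("no answer").
    threshold:  hypothetical tuning knob, usually chosen on a development set.
    """
    if null_score - span_score > threshold:
        return ""  # an empty string stands for "no answer"
    return span_text


# Made-up scores, for illustration only.
print(repr(choose_answer("Super Bowl 50", span_score=12.3, null_score=1.1)))
print(repr(choose_answer("Denver Broncos", span_score=2.0, null_score=9.5)))
```

Raising the threshold makes the model answer more often, and lowering it makes the model abstain more often.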

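4. Challenges of the SQuAD task

Answers are text spans:
- Other QA tasks may allow free-form generated answers, but a SQuAD answer must be a span of the passage.
Unanswerable questions (SQuAD 2.0):
- Training a model to abstain on questions the passage does not answer is generally harder than training it to answer questions the passage does answer.
Contextual reasoning:
- Some questions require reasoning across sentences, so the model must understand the context rather than rely on pattern matching alone.
Synonyms and paraphrase:
- The question and the passage may express the same thing in different words, for example:
  - Who is the CEO of Tesla?
  - Elon Musk is the chief executive officer of Tesla.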

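5. Evaluation metrics

SQuAD uses two metrics to evaluate model performance:

Exact Match (EM):
- Whether the predicted answer matches a ground-truth answer exactly after normalization (lowercasing, removing punctuation and the articles a/an/the, and collapsing whitespace).
F1 score:
- The token-level overlap between the predicted answer and the ground-truth answer, F1 = 2 × (Precision × Recall) / (Precision + Recall), where Precision is the fraction of predicted tokens that appear in the ground truth and Recall is the fraction of ground-truth tokens that appear in the prediction. When a question has several reference answers, the maximum score over the references is used.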
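
The two metrics can be sketched in a few lines. The code below follows the normalization steps used by the official evaluation script (lowercase, strip punctuation, drop English articles, collapse whitespace), but it is a simplified illustration for a single prediction and a single reference, not a replacement for that script.

```python
import collections
import re
import string


def normalize_answer(s):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction, truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Token-level F1 between one prediction and one reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # If either side is empty (no answer), score 1 only when both are empty.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Super Bowl 50!", "super bowl 50"))  # prints: 1
print(round(f1_score("Super Bowl", "Super Bowl 50"), 2))   # prints: 0.8
```

In the second call the prediction has precision 1 and recall 2/3, which gives F1 = 0.8 even though the exact match would be 0.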

6. Using the SQuAD dataset

6.1 Downloading SQuAD

Using the Hugging Face datasets library:

```
from datasets import load_dataset

# Load the SQuAD 2.0 dataset (downloaded on first use)
dataset = load_dataset("squad_v2")

# Show the available splits and their sizes
print(dataset)
```

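6.2 Preprocessing the data

```
# `dataset` is the object returned by load_dataset in section 6.1
train_data = dataset["train"]
print(train_data[0])
```

Example output (illustrative and abbreviated; real records also contain id and title fields):

```
{
  "context": "Super Bowl 50 was an American football game ...",
  "question": "What was the name of the football game?",
  "answers": {"text": ["Super Bowl 50"], "answer_start": [0]}
}
```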
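
Before training, the character-level answer position usually has to be converted into token positions. The sketch below illustrates the idea with a toy whitespace tokenizer. Both helper names are assumptions made for illustration; a real pipeline would normally take the character offsets from the model's own tokenizer instead.

```python
import re


def whitespace_offsets(text):
    """Toy tokenizer: (token, start_char, end_char) per whitespace-separated token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(context, answers):
    """Map the first gold answer to (start_token, end_token), both inclusive.

    Returns None for an unanswerable question (empty answer lists).
    """
    if not answers["text"]:
        return None
    start_char = answers["answer_start"][0]
    end_char = start_char + len(answers["text"][0])
    tokens = whitespace_offsets(context)
    # Keep every token that overlaps the character range [start_char, end_char).
    covered = [i for i, (_, s, e) in enumerate(tokens) if s < end_char and e > start_char]
    return covered[0], covered[-1]


example = {
    "context": "Super Bowl 50 was an American football game ...",
    "question": "What was the name of the football game?",
    "answers": {"text": ["Super Bowl 50"], "answer_start": [0]},
}
print(char_span_to_token_span(example["context"], example["answers"]))  # prints: (0, 2)
```

The answer covers the first three whitespace tokens, so the start and end token indices are 0 and 2.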

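6.3 Using BERT on the SQuAD task

Models for the SQuAD task can be built on BERT, RoBERTa, ALBERT, and similar encoders. The snippet below runs inference only. The question-answering head of a plain bert-base-uncased checkpoint is randomly initialized and would return arbitrary spans, so the snippet loads a BERT checkpoint that has already been fine-tuned on SQuAD.

```
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

# Load a BERT checkpoint that was already fine-tuned on SQuAD, with its tokenizer.
# Plain "bert-base-uncased" would leave the QA head randomly initialized.
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Example input
question = "What was the name of the football game?"
context = "Super Bowl 50 was an American football game that determined the champion of the National Football League (NFL) for the 2015 season."

# Encode the question and the context as one sequence pair
inputs = tokenizer(question, context, return_tensors="pt")

# Predict the start and end token positions of the answer
with torch.no_grad():
    outputs = model(**inputs)
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1  # +1 makes the slice end exclusive

# Decode the answer tokens back into text
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))
print("Predicted Answer:", answer)
```

With this checkpoint the output should be similar to the following (lowercase, because the tokenizer is uncased):

```
Predicted Answer: super bowl 50
```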
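
The decoding step above takes the argmax of the start and end logits independently, which can yield an end position that comes before the start. A common refinement is to search for the pair with the highest combined score subject to start <= end and a length cap. The sketch below works on plain Python lists. The function name `best_valid_span`, the `max_answer_len` parameter, and the logits in the usage lines are assumptions made for illustration.

```python
def best_valid_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) with the highest start + end logit.

    Only spans with start <= end and at most max_answer_len tokens are
    considered; end is inclusive.
    """
    best = None
    for start, s_logit in enumerate(start_logits):
        # Candidate ends begin at the start token and respect the length cap.
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = s_logit + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Made-up logits: independent argmax would pick start=2 and end=0, an invalid span.
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [4.0, 0.1, 0.2, 3.0]
print(best_valid_span(start_logits, end_logits))  # prints: (2, 3, 8.0)
```

With real model outputs, the logits for one example could be passed in after converting the tensors to lists, for instance with `.tolist()`.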

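7. SQuAD compared with other QA datasets

| Dataset | Task type | Includes unanswerable questions | Size (approx.) | Source |
| --- | --- | --- | --- | --- |
| SQuAD 1.1 | Machine reading comprehension (MRC) | No | 100K | Wikipedia |
| SQuAD 2.0 | Machine reading comprehension (MRC) | Yes | 150K | Wikipedia |
| Natural Questions (NQ) | Real user questions (Real QA) | Yes | 300K | Google Search queries, answered from Wikipedia |
| HotpotQA | Multi-hop reasoning (Multi-Hop QA) | No | 113K | Wikipedia |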