SuperGLUE: A Harder NLP Benchmark than GLUE for Evaluating Large Pretrained Language Models

Author: 彬彬侠

Published 2025-03-09 16:16:29

Article link: https://blog.csdn.net/u013172930/article/details/146133868

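SuperGLUE (Super General Language Understanding Evaluation)

SuperGLUE (Super General Language Understanding Evaluation) is an NLP benchmark that is harder than GLUE. It is used to evaluate the natural language understanding of large pretrained language models such as T5, GPT-3, and PaLM.

Introduced in 2019 as the successor to the GLUE benchmark (2018), it is designed for stronger NLP models and covers harder tasks such as question answering, commonsense and causal reasoning, coreference resolution, and reading comprehension.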

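1. Why SuperGLUE?

GLUE, proposed in 2018, worked well for models such as BERT and RoBERTa, but:

By 2019, BERT-based models had already surpassed the human baseline on GLUE.
GLUE's tasks are comparatively simple, so it cannot meaningfully separate very large models such as GPT-3, T5, and PaLM.
SuperGLUE offers harder reasoning tasks that demand understanding closer to the human level.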

2. SuperGLUE vs GLUE

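Aspect	GLUE	SuperGLUE
Number of tasks	9	8
Task types	Sentence classification, textual entailment	Question answering, commonsense reasoning, coreference resolution, reading comprehension
Difficulty	Easier (suited to BERT)	Harder (suited to GPT-3, T5, PaLM)
Typical models	BERT, RoBERTa, T5	T5, GPT-3, PaLM
Human baseline	87.1	89.8

SuperGLUE tasks are closer to real NLP applications than GLUE tasks. For example, they:

require complex reasoning
involve relations that span sentences
are closer to the way humans reason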

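3. SuperGLUE Tasks

SuperGLUE contains 8 more challenging NLP tasks, mainly covering question answering, commonsense reasoning, and reading comprehension.

Task	Task type	Training examples	Metric
BoolQ	Yes/no question answering over a passage (binary classification)	9.4K	Accuracy
CB	Textual entailment (three-way: entailment, contradiction, neutral)	250	F1 / accuracy
COPA	Causal reasoning (choose the more plausible cause or effect)	400	Accuracy
MultiRC	Multi-answer reading comprehension (judge each candidate answer)	27K (question-answer pairs)	F1 / exact match
ReCoRD	Cloze-style reading comprehension	101K	F1 / exact match
RTE	Textual entailment (two-way, similar in format to MNLI)	2.5K	Accuracy
WiC	Word sense disambiguation (does a word have the same meaning in two sentences?)	6K	Accuracy
WSC	Winograd coreference resolution (what does the pronoun refer to?)	554	Accuracy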
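
Several tasks above report both F1 and accuracy. As a rough sketch of what those two numbers measure, the snippet below computes accuracy and macro-averaged F1 (the unweighted mean of per-class F1, which is the variant used for a three-class task like CB). The function names and the toy labels are made up for illustration; MultiRC and ReCoRD use other F1 variants that are not covered here.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores = []
    for c in sorted(set(gold)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when there are no true positives
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


# Toy labels: 0 = entailment, 1 = contradiction, 2 = neutral (illustrative only)
gold = [0, 0, 1, 2, 2, 2]
pred = [0, 1, 1, 2, 2, 0]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Macro averaging matters on a small, imbalanced set such as CB, because a rare class counts as much as a frequent one.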

4. SuperGLUE Task Examples

BoolQ (Boolean Questions)

Task: answer a yes/no question using a short passage (binary classification).

Passage	Question	Answer
The Eiffel Tower is located in Paris, France.	Is the Eiffel Tower in Germany?	No
The Amazon Rainforest is the largest tropical rainforest on Earth.	Is the Amazon the smallest forest?	No

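CB (CommitmentBank, textual entailment)

Task: given a premise and a hypothesis, decide whether the relation between them is entailment, contradiction, or neutral (three-way classification).

Premise	Hypothesis	Relation
A man is playing a guitar.	A musician is performing.	Entailment
A child is reading a book.	A person is swimming.	Contradiction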

COPA (causal reasoning)

Task: given a premise, choose which of two alternatives is its more plausible cause or effect (two-way choice).

Premise	Option A	Option B	Answer
The man fell off his bike.	He lost his balance.	He bought a new phone.	A
The child cried.	The child won a prize.	The child dropped their ice cream.	B

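WSC (Winograd coreference resolution)

Task: decide whether a pronoun in the sentence refers to a given noun phrase (binary classification). The rows below give the correct referent for each pronoun.

Sentence	Pronoun	Referent
The city council refused the demonstrators a permit because they feared violence.	they	city council
The trophy doesn’t fit into the brown suitcase because it is too large.	it	trophy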
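
One way to picture WSC as binary classification is to pair the pronoun with each candidate noun phrase and label the pair True or False. The sketch below does this for the trophy example; the function and field names are hypothetical and do not reflect the dataset's actual schema.

```python
def wsc_binary_examples(sentence, pronoun, candidates, referent):
    """Turn one Winograd schema into one True/False example per candidate noun phrase."""
    return [
        {"text": sentence, "pronoun": pronoun, "candidate": c, "label": c == referent}
        for c in candidates
    ]


examples = wsc_binary_examples(
    "The trophy doesn't fit into the brown suitcase because it is too large.",
    pronoun="it",
    candidates=["trophy", "suitcase"],
    referent="trophy",
)
for ex in examples:
    print(ex["candidate"], ex["label"])
```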

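5. SuperGLUE Benchmark Scores

The SuperGLUE score combines performance on all 8 tasks. Reported scores for several pretrained models:

Model	SuperGLUE score
Human baseline	89.8
BERT (2018)	69.0
RoBERTa (2019)	84.6
T5-11B (2020)	88.9
GPT-3 (2020, few-shot)	71.8
PaLM-540B (2022, fine-tuned)	90.4

Fine-tuned large models such as PaLM-540B have exceeded the human baseline on SuperGLUE, while GPT-3 in the few-shot setting remains well below it.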
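
The overall score is a plain average: metrics are first averaged within each task (for tasks that report two), and the 8 task scores are then averaged. A minimal sketch of that arithmetic follows; the function name is hypothetical and the numbers are made up for illustration, not results of any real model.

```python
from statistics import mean


def superglue_score(task_metrics):
    """Average the metrics within each task, then average across tasks."""
    return mean(mean(metrics) for metrics in task_metrics.values())


# Made-up numbers for illustration only
task_metrics = {
    "BoolQ": [80.0],
    "CB": [90.0, 80.0],       # F1, accuracy
    "COPA": [70.0],
    "MultiRC": [60.0, 20.0],  # F1, exact match
    "ReCoRD": [75.0, 73.0],   # F1, exact match
    "RTE": [72.0],
    "WiC": [68.0],
    "WSC": [64.0],
}
print(superglue_score(task_metrics))
```

Averaging within a task first keeps a two-metric task from counting twice in the total.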

6. How to Evaluate a Model on SuperGLUE

Using Hugging Face `datasets`

from datasets import load_dataset

# Load COPA (one of the SuperGLUE tasks)
dataset = load_dataset("super_glue", "copa")

# Inspect the first training example
print(dataset["train"][0])
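
A COPA record carries the fields `premise`, `choice1`, `choice2`, `question` (either "cause" or "effect"), and `label` (0 for `choice1`, 1 for `choice2`). As a small sketch of how such a record can be turned into model inputs, the helper below builds the two text pairs a classifier would score. The helper name and the prompt wording are assumptions for illustration, and the record is hand-written rather than loaded from the dataset.

```python
def copa_pairs(example):
    """Build the two (prompt, alternative) text pairs for one COPA record."""
    # "question" says whether the alternatives are candidate causes or effects
    prompt = f"{example['premise']} What was the {example['question']}?"
    return [(prompt, example["choice1"]), (prompt, example["choice2"])]


# Hand-written record that mimics the fields of the COPA configuration
example = {
    "premise": "The man fell off his bike.",
    "choice1": "He lost his balance.",
    "choice2": "He bought a new phone.",
    "question": "cause",
    "label": 0,
}
pairs = copa_pairs(example)
print(pairs[example["label"]])  # the pair containing the correct alternative
```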

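Fine-tuning BERT with `transformers`

from datasets import load_dataset
from transformers import AutoModelForMultipleChoice, AutoTokenizer, Trainer, TrainingArguments

# COPA asks which of two alternatives fits the premise, so use a multiple-choice head
dataset = load_dataset("super_glue", "copa")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

# Encode each premise with both alternatives (the "question" field is ignored for brevity)
def preprocess(examples):
    premises = [p for p in examples["premise"] for _ in range(2)]
    choices = [c for pair in zip(examples["choice1"], examples["choice2"]) for c in pair]
    enc = tokenizer(premises, choices, truncation=True, padding="max_length", max_length=64)
    # Regroup the flat encodings so each example holds its two candidate sequences
    return {k: [v[i:i + 2] for i in range(0, len(v), 2)] for k, v in enc.items()}

encoded_dataset = dataset.map(preprocess, batched=True)

# Training arguments (eval_strategy was named evaluation_strategy in older transformers releases)
training_args = TrainingArguments(output_dir="./results", eval_strategy="epoch")

trainer = Trainer(model=model, args=training_args, train_dataset=encoded_dataset["train"], eval_dataset=encoded_dataset["validation"])

# Start training
trainer.train()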