[LLM Evaluation] GLUE benchmark: a multi-task benchmark for NLU

Latest recommended article published on 2025-03-09 12:43:17

yubinCloud


Category: LLM Research. Tags: natural language processing, language models, deep learning, artificial intelligence, LLM evaluation

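Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.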

Article link: https://blog.csdn.net/qq_45668004/article/details/140070052


Paper: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

⭐⭐⭐⭐

arXiv:1804.07461, ICLR 2019

Site: https://gluebenchmark.com/


1. Paper at a glance

The GLUE benchmark consists of 9 NLU tasks for evaluating the language understanding ability of NLP models. All of them are sentence or sentence-pair NLU tasks, and all are in English.
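As a rough illustration of the sentence versus sentence-pair split, the nine tasks can be kept in a small registry. The task names follow the GLUE paper; the `GlueTask` class and its field names are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlueTask:
    """Hypothetical record describing one GLUE task."""
    name: str
    num_inputs: int  # 1 = single sentence, 2 = sentence pair


# The nine GLUE tasks, grouped by how many sentences each example contains.
GLUE_TASKS = [
    GlueTask("CoLA", 1),
    GlueTask("SST-2", 1),
    GlueTask("MRPC", 2),
    GlueTask("STS-B", 2),
    GlueTask("QQP", 2),
    GlueTask("MNLI", 2),
    GlueTask("QNLI", 2),
    GlueTask("RTE", 2),
    GlueTask("WNLI", 2),
]

single = [t.name for t in GLUE_TASKS if t.num_inputs == 1]
pair = [t.name for t in GLUE_TASKS if t.num_inputs == 2]
print(f"{len(GLUE_TASKS)} tasks: single-sentence={single}, sentence-pair={pair}")
```

Only CoLA and SST-2 take a single sentence as input; the other seven operate on sentence pairs.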

2. GLUE task list

The figure below gives summary statistics for each task:

[Figure: statistics for each GLUE task]

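2.1 CoLA (Corpus of Linguistic Acceptability)

A single-sentence classification task. Each sentence is labeled according to whether it is a grammatical sequence of words, which makes this a binary classification task.

Number of examples: 8551 in the training set, 1043 in the development set, and 1063 in the test set.

Examples with label = 1 (grammatical):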

She is proud.

she is the mother.

Will John not go to school?

Examples with label = 0 (ungrammatical):

Mary wonders for Bill to come.

Yes, she used.

Mary sent.

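Note that the sentences are fairly short. Some errors are gender mismatches, some are missing words, some concern whether an "s" is added, and there are various other grammatical errors. I also noticed that some of the errors do not look very serious, and a few of the sentences would even be acceptable in certain contexts.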
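To make the binary classification setup concrete, here is a minimal sketch that scores two toy predictors on the six CoLA sentences quoted above. The GLUE paper reports the Matthews correlation coefficient (MCC) for CoLA, a detail not stated in this post. The two predictors (`always_grammatical`, `toy_length_rule`) are hypothetical baselines invented for illustration, not real models.

```python
import math

# The six CoLA examples quoted above, with their labels (1 = grammatical).
COLA_EXAMPLES = [
    ("She is proud.", 1),
    ("she is the mother.", 1),
    ("Will John not go to school?", 1),
    ("Mary wonders for Bill to come.", 0),
    ("Yes, she used.", 0),
    ("Mary sent.", 0),
]


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def always_grammatical(sentence):
    """Hypothetical baseline: label every sentence as grammatical."""
    return 1


def toy_length_rule(sentence):
    """Hypothetical baseline: sentences with at least 3 words are grammatical."""
    return 1 if len(sentence.split()) >= 3 else 0


labels = [label for _, label in COLA_EXAMPLES]
for predict in (always_grammatical, toy_length_rule):
    preds = [predict(sentence) for sentence, _ in COLA_EXAMPLES]
    print(predict.__name__, accuracy(labels, preds), matthews_corrcoef(labels, preds))
```

On these six sentences the constant predictor reaches 0.5 accuracy but an MCC of 0.0, which suggests why a correlation-based metric is more informative than accuracy for a task like this.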

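2.2 SST-2 (The Stanford Sentiment Treebank)

A single-sentence classification task: given a sentence taken from a movie review, predict whether its sentiment is positive or negative. This is a binary classification task.

Number of examples: 67349 in the training set, 872 in the development set, and 1821 in the test set.

Examples with label = 1 (positive):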

two central performances

against shimmering cinematography that lends the setting the ethereal beauty of an asian landscape

a better movie

Examples with label = 0 (negative):
