Natural language processing (NLP) mainly consists of natural language understanding (NLU) and natural language generation (NLG). To drive progress on NLU tasks, researchers from New York University, the University of Washington, and other institutions created a multi-task natural language understanding benchmark and analysis platform: GLUE (General Language Understanding Evaluation).
GLUE contains nine NLU tasks, all in English, covering natural language inference, textual entailment, sentiment analysis, semantic similarity, and more. Well-known models such as BERT, XLNet, RoBERTa, ERNIE, and T5 have all been evaluated on this benchmark. Currently, participants must upload their predictions to the official website, which then returns the test-set results.
The GLUE paper: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding[1]
The official GLUE website: https://gluebenchmark.com/
This article aims to describe each of GLUE's nine tasks in some detail and give sample instances, so that readers gain a concrete overall sense of the benchmark. It also provides a convenient link for downloading the GLUE datasets.
The GLUE Benchmark is a group of nine tasks on sentences or pairs of sentences (eight classification tasks and one regression task, STS-B), which are:
- CoLA (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not.
- MNLI (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.)
- MRPC (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases from one another or not.
- QNLI (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
- QQP (Quora Question Pairs) Determine if two questions are semantically equivalent or not.
- RTE (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- SST-2 (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- STS-B (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 0 to 5.
- WNLI (Winograd Natural Language Inference) Determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.)
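The task list above can be summarized programmatically. The sketch below is a minimal reference table of each task's input columns and label type; the column names follow the Hugging Face `datasets` GLUE configuration conventions, which is one common way to load this data — they are an assumption on my part, not something stated in this article:

```python
# Reference table of the nine GLUE tasks: input columns and label type.
# Column/config names are assumed from the Hugging Face `datasets` GLUE
# configs (e.g. load_dataset("glue", "cola")), not from this article.
GLUE_TASKS = {
    "cola": {"inputs": ("sentence",),               "type": "classification", "num_labels": 2},
    "sst2": {"inputs": ("sentence",),               "type": "classification", "num_labels": 2},
    "mrpc": {"inputs": ("sentence1", "sentence2"),  "type": "classification", "num_labels": 2},
    "qqp":  {"inputs": ("question1", "question2"),  "type": "classification", "num_labels": 2},
    "stsb": {"inputs": ("sentence1", "sentence2"),  "type": "regression",     "num_labels": 1},
    "mnli": {"inputs": ("premise", "hypothesis"),   "type": "classification", "num_labels": 3},
    "qnli": {"inputs": ("question", "sentence"),    "type": "classification", "num_labels": 2},
    "rte":  {"inputs": ("sentence1", "sentence2"),  "type": "classification", "num_labels": 2},
    "wnli": {"inputs": ("sentence1", "sentence2"),  "type": "classification", "num_labels": 2},
}

def task_summary(name: str) -> str:
    """Return a one-line description of a GLUE task's input/output shape."""
    t = GLUE_TASKS[name]
    return f"{name}: {len(t['inputs'])} input(s), {t['type']} with {t['num_labels']} label(s)"

if __name__ == "__main__":
    for task in GLUE_TASKS:
        print(task_summary(task))
```

A table like this is handy when writing a single fine-tuning script that loops over all nine tasks, since the number of inputs and the label type determine the tokenization call and the model head.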
Data download link: