Continuously updated
MS MARCO
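The full name is Microsoft MAchine Reading Comprehension Dataset.
It is a collection of several datasets.
microsoft/MSMARCO-Question-Answering (github.com)
Question-answering dataset: a dataset of questions and answers, stored as JSONL in the format shown below. Note that some of the answers are human-generated, while most are span-based.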
{
"answers":["A corporation is a company or group of people authorized to act as a single entity and recognized as such in law."],
"passages":[
{
"is_selected":0,
"url":"http://www.wisegeek.com/what-is-a-corporation.htm",
"passage_text":"A company is incorporated in a specific nation, often within the bounds of a smaller subset of that nation, such as a state or province. The corporation is then governed by the laws of incorporation in that state. A corporation may issue stock, either private or public, or may be classified as a non-stock corporation. If stock is issued, the corporation will usually be governed by its shareholders, either directly or indirectly."},
...
}],
"query":". what is a corporation?",
"query_id":1102432,
"query_type":"DESCRIPTION",
"wellFormedAnswers":"[]"
}
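To make the record layout above concrete, here is a minimal sketch of reading such JSONL lines and keeping only the passages marked with is_selected equal to 1. The field names come from the sample record; the helper names iter_records and selected_passages and the contents of sample_record are hypothetical and made up for illustration, not taken from the dataset.

```python
import json


def iter_records(lines):
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def selected_passages(record):
    """Return the text of every passage flagged is_selected == 1."""
    return [
        p["passage_text"]
        for p in record.get("passages", [])
        if p.get("is_selected") == 1
    ]


# Hypothetical record that mirrors the field names of the sample above.
sample_record = {
    "answers": ["Hypothetical answer."],
    "passages": [
        {"is_selected": 0, "url": "http://example.com/a",
         "passage_text": "Unselected passage."},
        {"is_selected": 1, "url": "http://example.com/b",
         "passage_text": "Selected passage."},
    ],
    "query": "hypothetical question?",
    "query_id": 1,
    "query_type": "DESCRIPTION",
    "wellFormedAnswers": "[]",
}

# Simulate a JSONL file as a list of lines (the empty line is skipped).
lines = [json.dumps(sample_record), ""]
records = list(iter_records(lines))

for rec in records:
    print(rec["query"], selected_passages(rec))
```

In the sample record wellFormedAnswers appears as the string "[]" rather than a list, so code that reads that field may need to handle both shapes.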
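This one is the ranking dataset. Its main goal is to find the passages relevant to a question, so it can be seen as an upstream task of the QA dataset above.
The data comes in a ranking form and a triples form. In the triples form, each example has one relevant passage and one non-relevant passage, and the task is to choose between the two.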
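The two-way choice in the triples form can be sketched as a pairwise comparison: score both passages against the query and count how often the relevant one wins. This is only an illustration under stated assumptions. The tuple order (query, relevant, non-relevant), the word-overlap scorer, and the toy triples are all hypothetical stand-ins; a real system would use a learned ranking model instead of overlap_score.

```python
from typing import Callable, Iterable, Tuple

# Assumed order: (query, relevant passage, non-relevant passage).
Triple = Tuple[str, str, str]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer: fraction of distinct query words that occur in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & p_words) / len(q_words)


def pairwise_accuracy(
    triples: Iterable[Triple],
    score: Callable[[str, str], float] = overlap_score,
) -> float:
    """Share of triples where the relevant passage scores strictly higher."""
    triples = list(triples)
    if not triples:
        return 0.0
    correct = sum(1 for q, pos, neg in triples if score(q, pos) > score(q, neg))
    return correct / len(triples)


# Hypothetical triples, made up for illustration.
toy_triples = [
    ("what is a corporation",
     "a corporation is a company recognized in law",
     "bananas are rich in potassium"),
    ("capital of france",
     "paris is the capital of france",
     "the river is very long"),
]

print(pairwise_accuracy(toy_triples))
```

Swapping a different scoring function in through the score parameter leaves the evaluation loop unchanged, which is the main reason the scorer is passed as an argument.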
SQuAD 2.0
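It takes the roughly 100,000 questions of the original SQuAD and adds adversarially written unanswerable questions.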
OpenKP
microsoft/OpenKP (github.com)