[paper] DuReader

DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications

Paper: https://arxiv.org/abs/1711.05073

Page: http://ai.baidu.com/broad/subordinate?dataset=dureader

Code: https://github.com/baidu/DuReader/

This paper introduces DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset, designed to address real-world MRC.

DuReader has three advantages over previous MRC datasets:

  1. Data sources: questions and documents are based on Baidu Search and Baidu Zhidao; answers are manually generated.

  2. Question types: it provides rich annotations for more question types, especially yes-no and opinion questions, which leaves more opportunities for the research community.

  3. Scale: it contains 200K questions, 420K answers and 1M documents; at the time of the paper it was the largest Chinese MRC dataset.

Introduction

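| Dataset | Lang | Questions | Documents | Source of questions | Source of documents | Answer type |
|---|---|---|---|---|---|---|
| CNN/DM | EN | 1.4M | 300K | Synthetic cloze | News | Fill in entity |
| HLF-RC | ZH | 100K | 28K | Synthetic cloze | Fairy/News | Fill in word |
| CBT | EN | 688K | 108 | Synthetic cloze | Children's books | Multi. choices |
| RACE | EN | 870K | 50K | English exam | English exam | Multi. choices |
| MCTest | EN | 2K | 500 | Crowd-sourced | Fictional stories | Multi. choices |
| NewsQA | EN | 100K | 10K | Crowd-sourced | CNN | Span of words |
| SQuAD | EN | 100K | 536 | Crowd-sourced | Wiki. | Span of words |
| SearchQA | EN | 140K | 6.9M | QA site | Web doc. | Span of words |
| TriviaQA | EN | 40K | 660K | Trivia websites | Wiki./Web doc. | Span/substring of words |
| NarrativeQA | EN | 46K | 1.5K | Crowd-sourced | Book & movie | Manual summary |
| MS-MARCO | EN | 100K | 200K | User logs | Web doc. | Manual summary |
| DuReader | ZH | 200K | 1M | User logs | Web doc./CQA | Manual summary |

Table 1: Comparison of machine reading comprehension datasets.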

Pilot Study

| Examples | Fact | Opinion |
|---|---|---|
| Entity | iphone哪天发布 (On which day will the iPhone be released) | 2017最好看的十部电影 (Top 10 movies of 2017) |
| Description | 消防车为什么是红的 (Why are fire trucks red) | 丰田卡罗拉怎么样 (How is the Toyota Corolla) |
| YesNo | 39.5度算高烧吗 (Is 39.5 degrees a high fever) | 学围棋能开发智力吗 (Does learning to play Go improve intelligence) |

Table 2: Examples of the six types of questions in Chinese.

Scaling up from the Pilot to DuReader

Data Collection and Annotation

Data Collection

DuReader is a sequence of 4-tuples $\{q, t, D, A\}$, where $q$ is a question, $t$ is a question type, $D$ is a set of relevant documents, and $A$ is an answer set produced by human annotators.
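The 4-tuple can be pictured as a small record type. The following is only a sketch: the class names, field names and the paragraph-level layout of a document are assumptions made for illustration and are not the on-disk format of the released dataset. The question is taken from Table 2; the document and answer texts are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """One relevant document, stored as a list of paragraphs (assumed layout)."""
    title: str
    paragraphs: List[str]


@dataclass
class Sample:
    """One sample {q, t, D, A}; field names are hypothetical."""
    question: str                                             # q
    question_type: str                                        # t
    documents: List[Document] = field(default_factory=list)   # D
    answers: List[str] = field(default_factory=list)          # A


# Placeholder content, only to show the shape of a sample.
sample = Sample(
    question="消防车为什么是红的",
    question_type="Description-Fact",
    documents=[Document(title="placeholder title",
                        paragraphs=["placeholder paragraph"])],
    answers=["placeholder answer"],
)
print(sample.question_type, len(sample.documents), len(sample.answers))
```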

Question Type Annotation
Answer Annotation

Crowd-sourcing

Quality Control
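Training, Development and Test Sets

| Count | Training set | Development set | Test set |
|---|---|---|---|
| Questions | 181K | 10K | 10K |
| Documents | 855K | 45K | 46K |
| Answers | 376K | 20K | 21K |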

The training, development and test sets consist of 181K, 10K and 10K questions, 855K, 45K and 46K documents, 376K, 20K and 21K answers, respectively.

DuReader is (Relatively) Challenging

Challenges:

  1. The number of answers.

    Figure 1: Distribution of the number of answers.

  2. The edit distance.

    The difference between the human-generated answers and the source documents is large.

  3. The document length.

    In DuReader, questions tend to be short (4.8 words on average) compared to answers (69.6 words on average), and answers tend to be short compared to documents (396 words on average).
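To make the edit-distance challenge concrete, here is a minimal sketch of character-level Levenshtein distance. It assumes a plain insert/delete/substitute cost of 1 each; how exactly the paper aligns an answer with a document before measuring the distance is not covered in these notes, so the function below only compares two given strings.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

An answer copied verbatim from a document span has distance 0 to that span, while a summarised or paraphrased answer has a positive distance, which is the situation described for DuReader.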

Experiments

Baseline Systems

The baseline systems have two steps:

  1. Select the single most relevant paragraph from each document.

  2. Apply state-of-the-art MRC models to the selected paragraphs.

Paragraph Selection

In the training stage, the paragraph in each document that has the largest overlap with the human-generated answer is selected as the most relevant one.

In the testing stage, no human-generated answer is available, so the paragraph that has the largest overlap with the corresponding question is selected instead.
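The selection rule can be sketched as follows. The overlap measure used here (the number of characters shared by the two strings, counted as a multiset intersection) is an assumption chosen for simplicity, and the function names and the toy English paragraphs are hypothetical. Because the reference string is fixed while paragraphs are compared, ranking by this raw count is the same as ranking by recall with respect to the reference.

```python
from collections import Counter
from typing import List


def overlap(paragraph: str, reference: str) -> int:
    """Number of characters shared by both strings (multiset intersection)."""
    return sum((Counter(paragraph) & Counter(reference)).values())


def select_paragraph(paragraphs: List[str], reference: str) -> str:
    """Return the paragraph with the largest overlap with the reference.

    Training: reference is a human-generated answer.
    Testing:  reference is the question, since no answer is available.
    """
    return max(paragraphs, key=lambda p: overlap(p, reference))


paragraphs = [
    "fire trucks are painted red so they stand out",
    "the corolla is a compact car",
]
# Training-style selection, using an answer as the reference.
print(select_paragraph(paragraphs, "red paint makes trucks stand out"))
# Testing-style selection, using the question as the reference.
print(select_paragraph(paragraphs, "why are fire trucks red"))
```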

Answer Span Selection

  • Match-LSTM

    To find an answer in a paragraph, it goes through the paragraph sequentially and dynamically aggregates the matching of an attention-weighted question representation to each token of the paragraph.

    Finally, an answer pointer layer is used to find an answer span in the paragraph.

  • BiDAF

    It uses both context-to-question attention and question-to-context attention to highlight the important parts of both the question and the context.

    After that, the attention flow layer fuses all useful information to obtain a vector representation for each position.
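Both models end by choosing an answer span in the paragraph. A common way to decode a span, sketched below, is to take per-token start and end probabilities and pick the pair (i, j) with i <= j that maximises their product. Whether the DuReader baselines decode exactly this way is an assumption; the function name, the optional length cap and the probability values are illustrative only.

```python
from typing import List, Optional, Tuple


def best_span(start_probs: List[float],
              end_probs: List[float],
              max_len: Optional[int] = None) -> Tuple[int, int, float]:
    """Return (start, end, score) maximising start_probs[i] * end_probs[j], i <= j."""
    best = (0, 0, -1.0)
    for i, p_start in enumerate(start_probs):
        # Optionally cap the span length at max_len tokens.
        last = len(end_probs) if max_len is None else min(len(end_probs), i + max_len)
        for j in range(i, last):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Made-up distributions over a 4-token paragraph.
start, end, score = best_span([0.1, 0.6, 0.2, 0.1], [0.1, 0.1, 0.3, 0.5])
print(start, end)  # 1 3
```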

Results and Analysis

We evaluate the reading comprehension task via character-level BLEU-4 and Rouge-L.
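As an illustration of the second metric, here is a minimal sketch of character-level Rouge-L for a single candidate and a single reference, based on the longest common subsequence (LCS). The F-measure weighting `beta` defaults to 1.0 here as an assumption; the official evaluation script may use a different value and handles multiple reference answers, which this sketch does not.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (character level)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """LCS-based F-measure; beta weights recall against precision."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


print(lcs_length("ABCBDAB", "BDCABA"))  # 4
print(rouge_l("abc", "abc"))            # 1.0
```

Iterating over the characters of a Python string gives character-level scoring for Chinese text without a word segmenter.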

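| Systems | Baidu Search BLEU-4% | Baidu Search Rouge-L% | Baidu Zhidao BLEU-4% | Baidu Zhidao Rouge-L% | All BLEU-4% | All Rouge-L% |
|---|---|---|---|---|---|---|
| Selected Paragraph | 15.8 | 22.6 | 16.5 | 38.3 | 16.4 | 30.2 |
| Match-LSTM | 23.1 | 31.2 | 42.5 | 48.0 | 31.9 | 39.2 |
| BiDAF | 23.1 | 31.1 | 42.2 | 47.5 | 31.8 | 39.0 |
| Human | 55.1 | 54.4 | 57.1 | 60.7 | 56.1 | 57.4 |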

Table 6: Performance of typical MRC systems on the DuReader.

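| Question type | Description BLEU-4% | Description Rouge-L% | Entity BLEU-4% | Entity Rouge-L% | YesNo BLEU-4% | YesNo Rouge-L% |
|---|---|---|---|---|---|---|
| Match-LSTM | 32.8 | 40.0 | 29.5 | 38.5 | 5.9 | 7.2 |
| BiDAF | 32.6 | 39.7 | 29.8 | 38.4 | 5.5 | 7.5 |
| Human | 58.1 | 58.0 | 44.6 | 52.0 | 56.2 | 57.4 |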

Table 8: Performance on various question types.

Opinion-aware Evaluation

| Question type | Fact BLEU-4% | Fact Rouge-L% | Opinion BLEU-4% | Opinion Rouge-L% |
|---|---|---|---|---|
| Opinion-unaware | 6.3 | 8.3 | 5.0 | 7.1 |
| Opinion-aware | 12.0 | 13.9 | 8.0 | 8.9 |

Table 9: Performance of opinion-aware model on YesNo questions.

Discussion

Conclusion

The paper introduces the DuReader dataset and provides several baseline systems.
