Analysis on MS MARCO leaderboard

My github project: https://github.com/IndexFziQ/MSMARCO-MRC-Analysis. Welcome to star my work!
Keep updating …

Analysis on MS MARCO leaderboard

Analysis on the MS-MARCO leaderboard, including V1 and V2, regarding the machine reading comprehension task.

Contributed by Yuqiang Xie and Luxi Xing, National Engineering Laboratory for Information Security Technologies, IIE, CAS.

Introduction

MS MARCO (Microsoft Machine Reading Comprehension) is a large scale dataset focused on machine reading comprehension, question answering, and passage ranking. The current dataset has 1,010,916 unique real queries that were generated by sampling and anonymizing Bing usage logs. For more details, see the project: MS MARCO V2.

MS MARCO is a benchmark dataset for multi-passage MRC or so-called “generative MRC”. Besides, different from other MRC datasets, MS MARCO has the following advantages (reference MS MARCO V2):

  1. Real questions: All questions have been sample from real anonymized bing queries.
  2. Real Documents: Most Url’s that we have source the passages from contain the full web documents (abstract). These can be used as extra contextual information to improve systems or be used to compete in our expert task.
  3. Human Generated Answers: All questions have an answer written by a human (not just from a passage span). If there was no answer in the passages the judge read they have written ‘No Answer Present.’
  4. Human Generated Well-Formed: Some questions contain extra human evaluation to create well formed answers that could be used by intelligent agents like Cortana, Siri, Google Assistant, and Alexa.
  5. Dataset Size: At over 1 million queries the dataset is large enough to train the most complex systems and also sample the data for specific applications.

In this work, submissions to MS MARCO will be almost listed. This work will be updated persistently.

Tips: The official evaluation metrics include ROUGE-L and BLEU-1.

V1 Leaderboard - Model List

Task Definition: Given a question and a set of passages (top 10) retrieved
by search engines, the task is to find the best concise answer to the question.

RankModelOrg.Rouge-LBleu-1
1MARSYUANFUDAO research NLP49.7248.02
2Human Performance-47.0046.00
3V-NetBaidu NLP46.1544.46
4S-NetMicrosoft AI and Research45.2343.78
5R-NetMicrosoft AI and Research42.8942.22
6ReasoNetMicrosoft AI and Research38.8139.86
7PredictionSingapore Management University37.6733.93
8FastQA_ExtDFKI German Research Center for AI33.6733.93

V2 Leaderboard - Model List

For this version, we focus on Q&A + Natural Language Generation, where human performance is 63.21 on Rouge-L.

Task Definition: Given a query and 10 passages provide the best answer available in natural language that could be used by a smart device/digital assistant.

RankModelOrg.Rouge-LBleu-1
1Human Performance-63.2153.03
2Masque NLGEN StyleNTT Media Intelligence Laboratories49.6150.13
3V-NetBaidu NLP48.3746.75
4BERT+ Multi-Pointer-GeneratorTongjun Li of the ColorfulClouds Tech and BUPT47.3745.09
5SNET + CES2SSYSU45.0440.62
6Reader-WriterMicrosoft Business Applications Group AI Research43.8942.59
7ConZNetSamsung Research42.1438.62

Description of Models

MARS

Multi-Attention ReaderS Network. Jingming Liu. [ video ]

  • Motivation
    • Transfer learning tasks like CoVe and ELMo store more generalized information in the lower layer (encoder).
    • The answers from each site are quite different and have their own features.
    • Multi-task learning may bring an improvement for the MRC task.
    • The score for multi-passage needs more factors, not just the simple counting.
    • Through training, the train set is large and the batch is small, which leads to the distribution between real samples and each batch inconsistent. The output of model is related to the order of the batch.
  • Contribution
    • Enhance Glove with CoVe, ELMo, POS, NER and word-match features, word dropout is 5%.
      [ nearly 2.0 improvement ]
    • Embed each site to a representation and combine it with the passage representation in the prediction layer.
      [ nearly 0.3-0.5 improvement ]
    • Apply multi-task learning into MRC.
      • Main task:
        • Golden span (the highest rouge(>0.8) span in every passage).
      • Auxiliary tasks: [ weight: 0.1~0.2 ]
        • If a word is in the answer;
        • If a passage contains a golden span;
        • If a sentence contains a golden span.
    • The score of answer contains span score and vote score.
    • Introduce EMA (Exponential Moving Average) into the training for weakening the effect of batch order.
      [ nearly 0.3 improvement ]
  • Overview

在这里插入图片描述

V-Net

Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification. Yizhong Wang, Kai Liu, Jing Liu, Wei He, Yajuan Lyu, Hua Wu, Sujian Li and Haifeng Wang. ACL 2018. [ pdf ]

  • Motivation

    • The correct answers could occur more frequently in those passages and usually share some commonalities, while incorrect answers are usually different from one another.
  • Contribution

    • Leverage the answer candidates from different passages to verify the final correct answer and rule out the noisy incorrect answers.
    • Jointly trained in an end-to-end framework.
  • Overview

    在这里插入图片描述

S-Net

S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension. JChuanqi Tan, Furu Wei, Nan Yang, Bowen Du, Weifeng Lv and Ming Zhou. AAAI 2018. [ pdf ]

  • Motivation

    • The extraction based approach is not suitable for MS MARCO.
    • In some examples, the answers need to be synthesized or generated from the question and passage.
  • Contribution

    • Present an extraction-then-synthesis framework for generated MRC
      • Extracts evidence snippets by matching the question and passage.
      • Generates the answer by synthesizing the question, passage, and evidence snippets.
    • Utilize multi-task learning to help pure answer span prediction.
    • First apply seq2seq model to generate answer with extracted evidence, question and passages.
  • Overview

    • The whole extraction-then-synthesis framework:

      在这里插入图片描述

    • The Evidence Extraction Model:

      在这里插入图片描述

    • The Answer Synthesis Model:

      在这里插入图片描述

R-Net

Gated Self-Matching Networks for Reading Comprehension and Question Answering. Wenhui Wang and Nan Yang and Furu Wei and Baobao Chang and Ming Zhou. ACL 2017. [ pdf , new ]

ReasoNet

ReasoNet: Learning to Stop Reading in Machine Comprehension. Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. KDD 2017. [ pdf ]

Prediction

Machine Comprehension Using Match-LSTM and Answer Pointer. Shuohang Wang, and Jing Jiang. ICLR 2017(under review). [ pdf ]

FastQA_Ext

Making Neural QA as Simple as Possible but not Simpler. Dirk Weissenborn and Georg Wiese and Laura Seiffe. CoNLL 2017. [ pdf ]

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

IndexFziQ

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值