Reading Wikipedia to Answer Open-Domain Questions
D. Q. Chen, A. Fisch, J. Weston, A. Bordes, Reading Wikipedia to Answer Open-Domain Questions, ACL (2017)
Abstract
Open-domain question answering.
Knowledge source: Wikipedia, used as the unique source.
The answer to any factoid question is a text span in a Wikipedia article.
Machine reading at scale combines (1) document retrieval, i.e., finding the relevant articles, and (2) machine comprehension of text, i.e., identifying the answer spans within those articles.
The system combines (1) a search component based on bigram hashing and TF-IDF matching with (2) a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs.
1 Introduction
This paper answers factoid questions in an open-domain setting using Wikipedia as the unique knowledge source.
Knowledge bases (KBs) are easier for computers to process but too sparsely populated for open-domain question answering.
A question answering (QA) system that uses Wikipedia as its knowledge source must solve two problems: (1) large-scale open-domain QA, i.e., retrieving the few relevant articles, and (2) machine comprehension of text, i.e., identifying the answer within those articles. The paper calls this setting machine reading at scale (MRS).
2 Related Work
Open-domain QA was originally defined as finding answers in collections of unstructured documents.
Limitations of knowledge bases: incompleteness and fixed schemas.
Machine comprehension of text: answering questions after reading a short text or story.
3 The DrQA System
DrQA has two components: (1) the Document Retriever, which finds relevant articles, and (2) the Document Reader, a machine comprehension model that extracts answers from a single document or a small collection of documents.
3.1 Document Retriever
The document retrieval system uses no machine learning: it compares articles and questions via bigram counts; each bigram is mapped to one of $2^{24}$ bins with an unsigned murmur3 hash; for each question it returns the top 5 Wikipedia articles.
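The hashed-bigram TF-IDF scoring can be sketched as follows. This is a toy illustration, not the paper's implementation: the corpus, tokenizer, and IDF formula are simplified assumptions, and `hashlib.md5` stands in for the unsigned murmur3 hash (Python's stdlib has no murmur3).

```python
# Sketch of hashed-bigram TF-IDF retrieval (toy corpus; md5 stands in for murmur3).
import hashlib
import math
from collections import Counter

NUM_BINS = 2 ** 24  # the paper hashes bigrams into 2^24 bins


def hash_bigram(bigram: str) -> int:
    """Map a bigram string to one of NUM_BINS buckets."""
    digest = hashlib.md5(bigram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BINS


def bigram_bins(text: str) -> Counter:
    """Count hashed bigrams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    bigrams = (" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return Counter(hash_bigram(b) for b in bigrams)


def tfidf_score(question: str, doc: str, corpus: list[str]) -> float:
    """Score a document against a question over the shared hashed-bigram bins."""
    q_bins, d_bins = bigram_bins(question), bigram_bins(doc)
    all_bins = [bigram_bins(d) for d in corpus]
    score = 0.0
    for b, q_tf in q_bins.items():
        df = sum(1 for bins in all_bins if b in bins)  # document frequency
        if df == 0:
            continue
        idf = math.log((len(corpus) + 1) / (df + 0.5))
        score += q_tf * d_bins.get(b, 0) * idf
    return score


corpus = [
    "the free encyclopedia that anyone can edit",
    "wikipedia is a free online encyclopedia",
    "question answering systems read documents",
]
question = "free online encyclopedia"
ranked = sorted(corpus, key=lambda d: tfidf_score(question, d, corpus), reverse=True)
```

Here `ranked[0]` is the document sharing the question's bigrams ("free online", "online encyclopedia"); in the real system only the top 5 articles are passed to the reader.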
3.2 Document Reader
Given a question of $l$ tokens, $q = \{q_1, \dots, q_l\}$, and a paragraph of $m$ tokens, $p = \{p_1, \dots, p_m\}$, each paragraph is fed to an RNN model in turn and the predicted answers are aggregated.
Paragraph encoding
Each token $p_i$ in paragraph $p$ is represented by a feature vector $\tilde{\mathbf{p}}_i \in \mathbb{R}^d$, and the sequence of feature vectors is fed to a recurrent neural network:
$$\{\mathbf{p}_1, \dots, \mathbf{p}_m\} = \text{RNN}(\{\tilde{\mathbf{p}}_1, \dots, \tilde{\mathbf{p}}_m\})$$
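The equation above can be sketched as a forward pass over the token feature vectors. This is a minimal single-layer Elman RNN with toy random weights, purely to show the shapes; DrQA actually uses a multi-layer bidirectional LSTM, and the dimensions `d`, `h`, `m` here are illustrative assumptions.

```python
# Minimal sketch: encode m token feature vectors p̃_i (shape (m, d))
# into m hidden states p_i (shape (m, h)) with a single-layer RNN.
import numpy as np

rng = np.random.default_rng(0)
d, h, m = 4, 3, 5  # toy feature dim, hidden dim, paragraph length

W_x = rng.standard_normal((h, d)) * 0.1  # input-to-hidden weights
W_h = rng.standard_normal((h, h)) * 0.1  # hidden-to-hidden weights
b = np.zeros(h)                          # hidden bias


def rnn_encode(p_tilde: np.ndarray) -> np.ndarray:
    """Run the recurrence h_t = tanh(W_x x_t + W_h h_{t-1} + b) over all tokens."""
    h_t = np.zeros(h)
    states = []
    for x_t in p_tilde:
        h_t = np.tanh(W_x @ x_t + W_h @ h_t + b)
        states.append(h_t)
    return np.stack(states)


p_tilde = rng.standard_normal((m, d))  # stand-in feature vectors p̃_1..p̃_m
p_enc = rnn_encode(p_tilde)            # one encoding p_i per token
```

Each output row $\mathbf{p}_i$ depends on the current token's features and all preceding hidden states, which is what lets the later answer-span prediction condition on paragraph context.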