Reading Wikipedia to Answer Open-Domain Questions
D. Q. Chen, A. Fisch, J. Weston, A. Bordes, Reading Wikipedia to Answer Open-Domain Questions, ACL (2017)
Abstract
Open-domain question answering.
Knowledge source: Wikipedia, used as the unique source.
The answer to any factoid question is a text span in a Wikipedia article.
Machine reading at scale combines (1) document retrieval, i.e., finding the relevant articles, and (2) machine comprehension of text, i.e., identifying the answer spans within those articles.
The system combines (1) a search component based on bigram hashing and TF-IDF matching with (2) a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs.
1 Introduction
This paper answers factoid questions in an open-domain setting using Wikipedia as the unique knowledge source.
Knowledge bases (KBs) are easier for computers to process but too sparsely populated for open-domain question answering.
A question answering (QA) system that uses Wikipedia as its knowledge source must solve two problems: (1) large-scale open-domain QA, i.e., retrieving the few relevant articles, and (2) machine comprehension of text, i.e., identifying the answer within those articles. The paper calls this setting machine reading at scale (MRS).
2 Related Work
Open-domain QA was originally defined as finding answers in collections of unstructured documents.
Limitations of knowledge bases: incompleteness and fixed schemas.
Machine comprehension of text: answering questions after reading a short text or story.
3 The DrQA System
DrQA has two components: (1) the Document Retriever, which finds relevant articles, and (2) the Document Reader, a machine comprehension model that extracts answers from a single document or a small collection of documents.
3.1 Document Retriever
The document retrieval system uses no machine learning: it compares articles and questions via bigram counts; each bigram is mapped to one of $2^{24}$ bins with an unsigned murmur3 hash; for each question it returns the top 5 Wikipedia articles.
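The hashed-bigram TF-IDF scoring can be sketched as follows. This is a toy illustration, not the paper's implementation: the corpus, tokenizer, and IDF formula are simplified assumptions, and `hashlib.md5` stands in for the unsigned murmur3 hash (Python's stdlib has no murmur3).

```python
# Sketch of hashed-bigram TF-IDF retrieval (toy corpus; md5 stands in for murmur3).
import hashlib
import math
from collections import Counter

NUM_BINS = 2 ** 24  # the paper hashes bigrams into 2^24 bins


def hash_bigram(bigram: str) -> int:
    """Map a bigram string to one of NUM_BINS buckets."""
    digest = hashlib.md5(bigram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BINS


def bigram_bins(text: str) -> Counter:
    """Count hashed bigrams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    bigrams = (" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return Counter(hash_bigram(b) for b in bigrams)


def tfidf_score(question: str, doc: str, corpus: list[str]) -> float:
    """Score a document against a question over the shared hashed-bigram bins."""
    q_bins, d_bins = bigram_bins(question), bigram_bins(doc)
    all_bins = [bigram_bins(d) for d in corpus]
    score = 0.0
    for b, q_tf in q_bins.items():
        df = sum(1 for bins in all_bins if b in bins)  # document frequency
        if df == 0:
            continue
        idf = math.log((len(corpus) + 1) / (df + 0.5))
        score += q_tf * d_bins.get(b, 0) * idf
    return score


corpus = [
    "the free encyclopedia that anyone can edit",
    "wikipedia is a free online encyclopedia",
    "question answering systems read documents",
]
question = "free online encyclopedia"
ranked = sorted(corpus, key=lambda d: tfidf_score(question, d, corpus), reverse=True)
```

Here `ranked[0]` is the document sharing the question's bigrams ("free online", "online encyclopedia"); in the real system only the top 5 articles are passed to the reader.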
3.2 Document Reader
Given a question of $l$ tokens, $q = \{q_1, \dots, q_l\}$, and a paragraph of $m$ tokens, $p = \{p_1, \dots, p_m\}$, each paragraph is fed to an RNN model in turn and the predicted answers are aggregated.
Paragraph encoding
Each token $p_i$ in paragraph $p$ is represented by a feature vector $\tilde{\mathbf{p}}_i \in \mathbb{R}^d$, and the sequence of feature vectors is fed to a recurrent neural network:
$$\{\mathbf{p}_1, \dots, \mathbf{p}_m\} = \text{RNN}(\{\tilde{\mathbf{p}}_1, \dots, \tilde{\mathbf{p}}_m\})$$
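The equation above can be sketched as a forward pass over the token feature vectors. This is a minimal single-layer Elman RNN with toy random weights, purely to show the shapes; DrQA actually uses a multi-layer bidirectional LSTM, and the dimensions `d`, `h`, `m` here are illustrative assumptions.

```python
# Minimal sketch: encode m token feature vectors p̃_i (shape (m, d))
# into m hidden states p_i (shape (m, h)) with a single-layer RNN.
import numpy as np

rng = np.random.default_rng(0)
d, h, m = 4, 3, 5  # toy feature dim, hidden dim, paragraph length

W_x = rng.standard_normal((h, d)) * 0.1  # input-to-hidden weights
W_h = rng.standard_normal((h, h)) * 0.1  # hidden-to-hidden weights
b = np.zeros(h)                          # hidden bias


def rnn_encode(p_tilde: np.ndarray) -> np.ndarray:
    """Run the recurrence h_t = tanh(W_x x_t + W_h h_{t-1} + b) over all tokens."""
    h_t = np.zeros(h)
    states = []
    for x_t in p_tilde:
        h_t = np.tanh(W_x @ x_t + W_h @ h_t + b)
        states.append(h_t)
    return np.stack(states)


p_tilde = rng.standard_normal((m, d))  # stand-in feature vectors p̃_1..p̃_m
p_enc = rnn_encode(p_tilde)            # one encoding p_i per token
```

Each output row $\mathbf{p}_i$ depends on the current token's features and all preceding hidden states, which is what lets the later answer-span prediction condition on paragraph context.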