[Paper Summary] Open-Domain QA: Another Take on Text-Only Retrieval and Reranking
- Preface
- SCIVER: Verifying Scientific Claims with Evidence
- Reading Wikipedia to Answer Open-Domain Questions (2017)
- Dense Passage Retrieval for Open-Domain Question Answering (2020)
- Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (2020)
- DISTILLING KNOWLEDGE FROM READER TO RETRIEVER FOR QUESTION ANSWERING (2021)
- Aside
Preface
This post is one I owed from before the New Year. While choosing a competition project, I noticed that the second task of SDP @NAACL, SCIVER: Verifying Scientific Claims with Evidence, is very similar in form to an open-domain question answering system, so I surveyed some classic papers and SOTA methods in open-domain QA and summarize them here. These four papers give a basic picture of the research directions in open-domain QA: from statistical features to trainable features to zero-shot learning, and from span extraction to direct generation.
Papers covered:
- Reading Wikipedia to Answer Open-Domain Questions
- Dense Passage Retrieval for Open-Domain Question Answering
- Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
- DISTILLING KNOWLEDGE FROM READER TO RETRIEVER FOR QUESTION ANSWERING
SCIVER: Verifying Scientific Claims with Evidence
Task description
Due to the rapid growth in scientific literature, it is difficult for scientists to stay up-to-date on the latest findings. This challenge is especially acute during pandemics due to the risk of making decisions based on outdated or incomplete information. There is a need for AI systems that can help scientists with information overload and support scientific fact checking and evidence synthesis.
In the SCIVER shared task, we will build systems that:
- take a scientific claim as input
- identify all relevant abstracts in a large corpus
- label them as Supporting or Refuting the claim
- select sentences as evidence for the labels
Key steps
- Identifying relevant abstracts in a large corpus: given efficiency and runtime constraints, once we receive a query (here, a scientific claim) we cannot run a separate classifier over each of the tens of thousands of sentences in a large corpus. We therefore first use the query to recall the relevant documents from the corpus, shrinking the candidate set as much as possible.
- Selecting sentences as evidence for the labels: run fine-grained classification over the sentences of the documents in the small candidate set, to decide whether each can be used to support or refute the claim.
Data example
{
    "id": 52,
    "claim": "ALDH1 expression is associated with poorer prognosis for breast cancer primary tumors.",
    "evidence": {
        "11": [                        // 2 evidence sets in document 11 support the claim.
            {"sentences": [0, 1],      // Sentences 0 and 1, taken together, support the claim.
             "label": "SUPPORT"},
            {"sentences": [11],        // Sentence 11, on its own, supports the claim.
             "label": "SUPPORT"}
        ],
        "15": [                        // A single evidence set in document 15 supports the claim.
            {"sentences": [4],
             "label": "SUPPORT"}
        ]
    },
    "cited_doc_ids": [11, 15]
}
Open-domain question answering
The task above is in fact a variant of a classic NLP problem: open-domain question answering. Taking QA over Wikipedia as the example, the approach has two steps: 1. retrieve the relevant articles from a large collection, then 2. locate the answer within those articles.
The keys to solving this class of problem are designing an efficient retrieval system, and precisely pinpointing the key sentences within the small recalled candidate set.
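The retrieve-then-read idea can be sketched in a few lines of Python (a toy illustration with made-up documents and a word-overlap "reader", not any real system):

```python
import math
from collections import Counter

def tokenize(text):
    return [t.strip('.,?!').lower() for t in text.split()]

def tfidf_retrieve(query, docs, top_k=2):
    """Step 1: rank documents by TF-IDF-weighted overlap with the query."""
    n = len(docs)
    doc_tokens = [tokenize(d) for d in docs]
    df = Counter(t for toks in doc_tokens for t in set(toks))  # document frequency
    def score(toks):
        tf = Counter(toks)
        return sum(tf[t] * math.log(n / df[t]) for t in tokenize(query) if t in tf)
    return sorted(range(n), key=lambda i: score(doc_tokens[i]), reverse=True)[:top_k]

def read_answer(query, doc):
    """Step 2 (toy reader): return the sentence with the largest word overlap."""
    q = set(tokenize(query))
    sentences = [s.strip() for s in doc.split('.') if s.strip()]
    return max(sentences, key=lambda s: len(q & set(tokenize(s))))
```

A real system replaces both pieces: the retriever with hashed n-gram TF-IDF (DrQA) or dense encoders (DPR), and the reader with a neural machine-comprehension model.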
Reading Wikipedia to Answer Open-Domain Questions (2017)
As a classic paper in open-domain QA, DrQA, proposed by Danqi Chen, established the basic framework for open-domain QA systems:
(1) the Document Retriever module for finding relevant articles and
(2) a machine comprehension model, Document Reader, for extracting answers from a single document or a small collection of documents.
Document Retriever
- TF-IDF: build TF-IDF feature vectors for the question and each document, and use their similarity as the retrieval criterion. N-gram features carry more word-order information and therefore make sharper retrieval features; to keep retrieval fast and memory-efficient, the authors use MurmurHash3 to map bigrams into 2^24 buckets.
- Concretely, for each question the retriever returns the 5 most relevant Wikipedia articles and hands them to the Document Reader.
- The authors also compared Okapi BM25 and cosine similarity over word-embedding vectors as ways to build question-article retrieval features; both performed worse.
- Retrieval experiment results:
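The bigram-hashing trick can be illustrated as follows (a sketch: DrQA uses MurmurHash3, for which Python's stdlib MD5 stands in here; the bucketing idea is the same):

```python
import hashlib

NUM_BUCKETS = 1 << 24  # 2^24 buckets, as in the paper

def hashed_bigram_buckets(tokens):
    """Map each bigram to a fixed bucket id instead of keeping a huge bigram vocabulary."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return [
        int.from_bytes(hashlib.md5(bg.encode()).digest()[:4], "little") % NUM_BUCKETS
        for bg in bigrams
    ]
```

Hash collisions occasionally merge unrelated bigrams into one bucket, but in exchange the feature space has a fixed 2^24 dimensions regardless of corpus size.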
Document Reader
- Paragraph encoding: $\{p_1, \dots, p_m\} = \mathrm{RNN}(\{\tilde{p}_1, \dots, \tilde{p}_m\})$. A BiLSTM encodes the paragraphs of the recalled documents, with $p_i$ the concatenation of the BiLSTM's forward and backward hidden states. $\tilde{p}_i$ is the feature vector built for each input token, composed of:
- Word embeddings: 300-dimensional GloVe word embeddings trained on 840B tokens of Web-crawl data. The authors keep the embeddings fixed during training and fine-tune only the 1000 most frequent question words, reasoning that frequent question words like what, how, and which may matter most for QA systems.
- Exact match: a binary (0/1) indicator embedding marks which paragraph words exactly match a question word; later experiments confirm this feature's effectiveness.
- Token features: hand-crafted features including each token's part-of-speech (POS) tag, named-entity-recognition (NER) tag, and (normalized) term frequency (TF).
- Aligned question embedding: each input token is augmented with an attention feature over the question, computed as $f_{align}(p_i) = \sum_{j} a_{ij} E(q_j)$, where $a_{ij} = \frac{\exp(\alpha(E(p_i)) \cdot \alpha(E(q_j)))}{\sum_{j'} \exp(\alpha(E(p_i)) \cdot \alpha(E(q_{j'})))}$. Here $E(q_j)$ is the embedding of each question token, $E(p_i)$ the embedding of each paragraph token, and $\alpha(\cdot)$ a single dense layer with a ReLU nonlinearity. Unlike the Exact match feature, the aligned question embedding can capture soft semantic matches between paragraph and question tokens, e.g. synonyms such as car and vehicle.
- Question encoding: the question encoding is simpler: another RNN runs over the question word embeddings, and its hidden states $\{q_1, \dots, q_l\}$ are combined into a single vector $q = \sum_j b_j q_j$, where $b_j = \frac{\exp(w \cdot q_j)}{\sum_{j'} \exp(w \cdot q_{j'})}$ measures the relevance of each question word and $w$ is a learned weight vector.
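The two most distinctive reader input features, Exact match and the aligned question embedding, can be sketched in NumPy (a minimal illustration of the formulas above under assumed shapes, not the paper's actual code; embeddings and the dense-layer weight `W` are placeholders the caller supplies):

```python
import numpy as np

def exact_match(paragraph_tokens, question_tokens):
    """Binary indicator: 1 if the paragraph token appears in the question.
    (The paper actually uses three variants: original, lowercase, lemma.)"""
    q = {t.lower() for t in question_tokens}
    return np.array([t.lower() in q for t in paragraph_tokens], dtype=float)

def aligned_question_embedding(E_p, E_q, W):
    """f_align(p_i) = sum_j a_ij * E(q_j), with a_ij a softmax over
    dot products of ReLU-projected embeddings.

    E_p: (m, d) paragraph token embeddings
    E_q: (n, d) question token embeddings
    W:   (d, d) weights of the shared dense layer alpha(.)
    """
    hp = np.maximum(E_p @ W, 0.0)                 # alpha(E(p_i)), shape (m, d)
    hq = np.maximum(E_q @ W, 0.0)                 # alpha(E(q_j)), shape (n, d)
    scores = hp @ hq.T                            # (m, n) dot products
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    a = np.exp(scores)
    a /= a.sum(axis=1, keepdims=True)             # softmax over question tokens j
    return a @ E_q                                # (m, d)
```

Each output row is a convex combination of question-token embeddings, which is what lets this feature match car against vehicle where the hard Exact match indicator stays 0.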