[Paper Summary] Open-Domain QA: Another Take on Text-Only Retrieval and Reranking
- Preface
- SCIVER: Verifying Scientific Claims with Evidence
- Reading Wikipedia to Answer Open-Domain Questions (2017)
- Dense Passage Retrieval for Open-Domain Question Answering (2020)
- Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (2020)
- DISTILLING KNOWLEDGE FROM READER TO RETRIEVER FOR QUESTION ANSWERING (2021)
- Aside
Preface
This post is one I owed from before the New Year. While choosing a competition project, I noticed that the second task of SDP @NAACL, SCIVER: Verifying Scientific Claims with Evidence, is very similar in form to an open-domain question answering system, so I surveyed some classic papers and SOTA methods in open-domain QA and summarize them here. These four papers give a basic picture of the research directions in open-domain QA: from statistical features to trainable features to zero-shot learning, and from span extraction to direct generation.
Papers covered:
- Reading Wikipedia to Answer Open-Domain Questions
- Dense Passage Retrieval for Open-Domain Question Answering
- Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
- DISTILLING KNOWLEDGE FROM READER TO RETRIEVER FOR QUESTION ANSWERING
SCIVER: Verifying Scientific Claims with Evidence
Task description
Due to the rapid growth in scientific literature, it is difficult for scientists to stay up-to-date on the latest findings. This challenge is especially acute during pandemics due to the risk of making decisions based on outdated or incomplete information. There is a need for AI systems that can help scientists with information overload and support scientific fact checking and evidence synthesis.
In the SCIVER shared task, we will build systems that:
- take a scientific claim as input
- identify all relevant abstracts in a large corpus
- label them as Supporting or Refuting the claim
- select sentences as evidence for the labels
Key steps
- Identifying relevant abstracts in a large corpus: given efficiency and runtime constraints, once we receive a query (here, a scientific claim) we cannot run a separate classifier over each of the tens of thousands of sentences in a large corpus. We therefore first use the query to recall the relevant documents from the corpus, shrinking the candidate set as much as possible.
- Selecting sentences as evidence for the labels: run fine-grained classification over the sentences of the documents in the small candidate set, to decide whether each can be used to support or refute the claim.
Data example
{
    "id": 52,
    "claim": "ALDH1 expression is associated with poorer prognosis for breast cancer primary tumors.",
    "evidence": {
        "11": [                        // 2 evidence sets in document 11 support the claim.
            {"sentences": [0, 1],      // Sentences 0 and 1, taken together, support the claim.
             "label": "SUPPORT"},
            {"sentences": [11],        // Sentence 11, on its own, supports the claim.
             "label": "SUPPORT"}
        ],
        "15": [                        // A single evidence set in document 15 supports the claim.
            {"sentences": [4],
             "label": "SUPPORT"}
        ]
    },
    "cited_doc_ids": [11, 15]
}
Open-domain question answering
The task above is in fact a variant of a classic NLP problem: open-domain question answering. Taking QA over Wikipedia as the example, the approach has two steps: 1. retrieve the relevant articles from a large collection, then 2. locate the answer within those articles.
The keys to solving this class of problem are designing an efficient retrieval system, and precisely pinpointing the key sentences within the small recalled candidate set.
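The retrieve-then-read idea can be sketched in a few lines of Python (a toy illustration with made-up documents and a word-overlap "reader", not any real system):

```python
import math
from collections import Counter

def tokenize(text):
    return [t.strip('.,?!').lower() for t in text.split()]

def tfidf_retrieve(query, docs, top_k=2):
    """Step 1: rank documents by TF-IDF-weighted overlap with the query."""
    n = len(docs)
    doc_tokens = [tokenize(d) for d in docs]
    df = Counter(t for toks in doc_tokens for t in set(toks))  # document frequency
    def score(toks):
        tf = Counter(toks)
        return sum(tf[t] * math.log(n / df[t]) for t in tokenize(query) if t in tf)
    return sorted(range(n), key=lambda i: score(doc_tokens[i]), reverse=True)[:top_k]

def read_answer(query, doc):
    """Step 2 (toy reader): return the sentence with the largest word overlap."""
    q = set(tokenize(query))
    sentences = [s.strip() for s in doc.split('.') if s.strip()]
    return max(sentences, key=lambda s: len(q & set(tokenize(s))))
```

A real system replaces both pieces: the retriever with hashed n-gram TF-IDF (DrQA) or dense encoders (DPR), and the reader with a neural machine-comprehension model.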
Reading Wikipedia to Answer Open-Domain Questions (2017)
As a classic paper in open-domain QA, DrQA, proposed by Danqi Chen, established the basic framework for open-domain QA systems:
(1) the Document Retriever module for finding relevant articles and
(2) a machine comprehension model, Document Reader, for extracting answers from a single document or a small collection of documents.
Document Retriever
- TF-IDF: build TF-IDF feature vectors for the question and each document, and use their similarity as the retrieval criterion. N-gram features carry more word-order information and therefore make sharper retrieval features; to keep retrieval fast and memory-efficient, the authors use MurmurHash3 to map bigrams into 2^24 buckets.
- Concretely, for each question the retriever returns the 5 most relevant Wikipedia articles and hands them to the Document Reader.
- The authors also compared Okapi BM25 and cosine similarity over word-embedding vectors as ways to build question-article retrieval features; both performed worse.
- Retrieval experiment results:
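The bigram-hashing trick can be illustrated as follows (a sketch: DrQA uses MurmurHash3, for which Python's stdlib MD5 stands in here; the bucketing idea is the same):

```python
import hashlib

NUM_BUCKETS = 1 << 24  # 2^24 buckets, as in the paper

def hashed_bigram_buckets(tokens):
    """Map each bigram to a fixed bucket id instead of keeping a huge bigram vocabulary."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return [
        int.from_bytes(hashlib.md5(bg.encode()).digest()[:4], "little") % NUM_BUCKETS
        for bg in bigrams
    ]
```

Hash collisions occasionally merge unrelated bigrams into one bucket, but in exchange the feature space has a fixed 2^24 dimensions regardless of corpus size.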
Document Reader
- Paragraph encoding: $\{p_1, \dots, p_m\} = \mathrm{RNN}(\{\tilde{p}_1, \dots, \tilde{p}_m\})$. A BiLSTM encodes the paragraphs of the recalled documents, with $p_i$ the concatenation of the BiLSTM's forward and backward hidden states. $\tilde{p}_i$ is the feature vector built for each input token, composed of:
- Word embeddings: 300-dimensional GloVe word embeddings trained on 840B tokens of Web-crawl data. The authors keep the embeddings fixed during training and fine-tune only the 1000 most frequent question words, reasoning that frequent question words like what, how, and which may matter most for QA systems.
- Exact match: a binary (0/1) indicator embedding marks which paragraph words exactly match a question word; later experiments confirm this feature's effectiveness.
- Token features: hand-crafted features including each token's part-of-speech (POS) tag, named-entity-recognition (NER) tag, and (normalized) term frequency (TF).
- Aligned question embedding: each input token is augmented with an attention feature over the question, computed as $f_{align}(p_i) = \sum_{j} a_{ij} E(q_j)$, where $a_{ij} = \frac{\exp(\alpha(E(p_i)) \cdot \alpha(E(q_j)))}{\sum_{j'} \exp(\alpha(E(p_i)) \cdot \alpha(E(q_{j'})))}$. Here $E(q_j)$ is the embedding of each question token, $E(p_i)$ the embedding of each paragraph token, and $\alpha(\cdot)$ a single dense layer with a ReLU nonlinearity. Unlike the Exact match feature, the aligned question embedding can capture soft semantic matches between paragraph and question tokens, e.g. synonyms such as car and vehicle.
- Question encoding: the question encoding is simpler: another RNN runs over the question word embeddings, and its hidden states $\{q_1, \dots, q_l\}$ are combined into a single vector $q = \sum_j b_j q_j$, where $b_j = \frac{\exp(w \cdot q_j)}{\sum_{j'} \exp(w \cdot q_{j'})}$ measures the relevance of each question word and $w$ is a learned weight vector.
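The two most distinctive reader input features, Exact match and the aligned question embedding, can be sketched in NumPy (a minimal illustration of the formulas above under assumed shapes, not the paper's actual code; embeddings and the dense-layer weight `W` are placeholders the caller supplies):

```python
import numpy as np

def exact_match(paragraph_tokens, question_tokens):
    """Binary indicator: 1 if the paragraph token appears in the question.
    (The paper actually uses three variants: original, lowercase, lemma.)"""
    q = {t.lower() for t in question_tokens}
    return np.array([t.lower() in q for t in paragraph_tokens], dtype=float)

def aligned_question_embedding(E_p, E_q, W):
    """f_align(p_i) = sum_j a_ij * E(q_j), with a_ij a softmax over
    dot products of ReLU-projected embeddings.

    E_p: (m, d) paragraph token embeddings
    E_q: (n, d) question token embeddings
    W:   (d, d) weights of the shared dense layer alpha(.)
    """
    hp = np.maximum(E_p @ W, 0.0)                 # alpha(E(p_i)), shape (m, d)
    hq = np.maximum(E_q @ W, 0.0)                 # alpha(E(q_j)), shape (n, d)
    scores = hp @ hq.T                            # (m, n) dot products
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    a = np.exp(scores)
    a /= a.sum(axis=1, keepdims=True)             # softmax over question tokens j
    return a @ E_q                                # (m, d)
```

Each output row is a convex combination of question-token embeddings, which is what lets this feature match car against vehicle where the hard Exact match indicator stays 0.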