Query Resolution and Reading Comprehension for Conversational Passage Retrieval
Our passage retrieval pipeline is shown in Figure 1. Given the original current turn query Q and the conversation history H, we first perform query resolution to obtain an expanded, resolved query [8]. Next, we perform initial retrieval with the resolved query to get a list of top-k passages P. Finally, for each passage in P, we combine the scores of a re-ranking module (BERT) and a reading comprehension module to obtain the final ranked list R. The interpolation weight w between the two scores is optimized on the TREC CAsT 2019 dataset [1].
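Concretely, the combined score of a passage p can be written as follows. This is a reconstruction from the description above; the convex (1 - w) complement is an assumption, since the text only states that w is an interpolation weight:

```latex
s(p) = w \cdot s_{\mathrm{BERT}}(p) + (1 - w) \cdot s_{\mathrm{RC}}(p)
```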
Query resolution (QuReTeC). We use QuReTeC, a binary term classification query resolution model, which uses BERT to classify each term in the conversation history as relevant or not, and adds the relevant terms to the original current turn query.
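As an illustration, here is a minimal sketch of QuReTeC-style term classification with Hugging Face Transformers. The checkpoint path, the label convention (1 = relevant), and the simple term filtering are assumptions for illustration, not the authors' released implementation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical fine-tuned QuReTeC-style checkpoint (2 labels per token).
MODEL = "path/to/quretec-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL)

def resolve_query(history: list[str], current_query: str) -> str:
    """Add history terms classified as relevant to the current turn query."""
    enc = tokenizer(" ".join(history), current_query,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        labels = model(**enc).logits[0].argmax(-1)  # (seq_len,), 1 = relevant
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    segments = enc["token_type_ids"][0]  # 0 = history, 1 = current query
    # Collect relevant history terms; wordpiece merging is omitted for brevity.
    relevant = {tok for tok, lab, seg in zip(tokens, labels, segments)
                if lab == 1 and seg == 0
                and tok not in tokenizer.all_special_tokens}
    added = [t for t in sorted(relevant)
             if t not in current_query.lower().split()]
    return current_query + " " + " ".join(added)
```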
Re-ranking (BERT). We apply the BERT re-ranking model of [4] to each passage. The model is initialized with BERT-large and fine-tuned on the MS MARCO passage retrieval dataset [6].
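A minimal sketch of this cross-encoder scoring, assuming a hypothetical fine-tuned checkpoint rather than the authors' exact model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical BERT-large cross-encoder fine-tuned on MS MARCO.
RERANKER = "path/to/bert-large-msmarco"
tokenizer = AutoTokenizer.from_pretrained(RERANKER)
model = AutoModelForSequenceClassification.from_pretrained(RERANKER)

def rerank_score(query: str, passage: str) -> float:
    """Relevance score of a (query, passage) pair from the cross-encoder."""
    enc = tokenizer(query, passage, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]
    # Use the last logit as the relevance score: the positive-class logit
    # for a 2-class head, or the single output for a regression head.
    return logits[-1].item()
```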
Reading Comprehension. The model is a RoBERTa-Large model that predicts either a text span in the passage or "No Answer". It is fine-tuned on the MRQA dataset [3]. We use the sum of the predicted start and end span logits (l_start + l_end) as the score of this module.
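A sketch of this scoring, assuming a hypothetical MRQA-style checkpoint; for brevity it takes the maximum logits over the whole input, whereas a full implementation would mask question tokens and enforce start <= end:

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Hypothetical RoBERTa-Large checkpoint fine-tuned on MRQA-style QA.
RC_MODEL = "path/to/roberta-large-mrqa"
tokenizer = AutoTokenizer.from_pretrained(RC_MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(RC_MODEL)

def rc_score(query: str, passage: str) -> float:
    """s_RC = l_start + l_end, the sum of the best start/end span logits."""
    enc = tokenizer(query, passage, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return out.start_logits[0].max().item() + out.end_logits[0].max().item()
```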
Quantitative analysis
In our pipeline, passage retrieval performance depends on the query resolution module, so we need to evaluate the two modules separately. Specifically, we compare passage retrieval performance using the original queries, the QuReTeC-resolved queries, and the human-rewritten queries.
Error classification. We classify each query as (i) ranking error, (ii) query resolution error, or (iii) no error. To simplify the analysis, we first choose a ranking metric m (e.g., NDCG@3) and a threshold t. A query is a (i) ranking error if even the human-rewritten query obtains poor ranking performance (m <= t). A query is a (ii) query resolution error if the human-rewritten query has performance m > t but the QuReTeC-resolved query has performance m <= t. Otherwise it is a (iii) no error case.
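This decision rule translates directly into code. A sketch, where m_human and m_quretec denote the metric values obtained with the human-rewritten and QuReTeC-resolved versions of a query:

```python
def classify_error(m_human: float, m_quretec: float, t: float) -> str:
    """Classify one query given metric m (e.g., NDCG@3) and threshold t."""
    if m_human <= t:
        return "ranking error"           # even the human rewrite ranks poorly
    if m_quretec <= t:
        return "query resolution error"  # human rewrite works, QuReTeC fails
    return "no error"
```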
Table 2 shows the results of this analysis. Since we assume the human rewrites are always well specified, the 13.5% of queries for which even the human rewrite fails are classified as ranking errors. 61.0% of the queries are resolved correctly by QuReTeC and 25.5% are not, which shows that there is still considerable room for improvement in query resolution for conversational passage retrieval. In addition, we observe that (0 + 1 + 2 + 39)/208 ≈ 20% of the queries in the dataset do not require resolution, i.e., when using the original queries we already retrieve at least one relevant passage in the top 3.
Each column on the right side of Table 3 contains all 208 queries, showing how they are grouped at each threshold. As the performance threshold increases, the number of ranking errors also increases, which shows that the passage ranking module also has considerable room for improvement.
Conclusions:
Green marks the range in which the query resolution module already performs well, orange marks the range in which the query resolution module needs improvement, and blue marks the range in which the retrieval module needs improvement.
References
1. Dalton, J., Xiong, C., Kumar, V., Callan, J.: CAsT-19: A dataset for conversational information seeking. In: SIGIR (2020)
2. Elgohary, A., Peskov, D., Boyd-Graber, J.: Can you unpack that? Learning to rewrite questions-in-context. In: EMNLP (2019)
3. Fisch, A., Talmor, A., Jia, R., Seo, M., Choi, E., Chen, D.: MRQA 2019 shared task: Evaluating generalization in reading comprehension. In: MRQA (2019)
4. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)
5. Vakulenko, S., Longpre, S., Tu, Z., Anantha, R.: A wrong answer or a wrong question? An intricate relationship between question reformulation and answer selection in conversational question answering. In: SCAI (2020)
6. Vakulenko, S., Longpre, S., Tu, Z., Anantha, R.: Question rewriting for conversational question answering. In: WSDM (2021)
7. Vakulenko, S., Voskarides, N., Tu, Z., Longpre, S.: A comparison of question rewriting methods for conversational passage retrieval. In: ECIR (2021)
8. Voskarides, N., Li, D., Ren, P., Kanoulas, E., de Rijke, M.: Query resolution for conversational search with limited supervision. In: SIGIR (2020)