1. Background
Paper title: "VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions"
2. Overview
Abstract:
Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is of the same or even more importance compared with the answer itself, since it makes the question answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where the models are required to generate an explanation with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We also conduct a user study to validate the quality of the synthesized explanations. We quantitatively show that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.
Most existing work in VQA focuses on improving the accuracy of predicted answers while ignoring explanations. The authors argue that the explanation for an answer can be as important as, or even more important than, the answer itself, because it makes the question-answering process easier to understand and to trace. To this end, the authors propose VQA-E, in which the model generates an explanation alongside each predicted answer. They first construct a new VQA-E dataset, and then frame the VQA-E problem as a multi-task learning architecture.
The VQA-E dataset is automatically derived from VQA v2 by exploiting the available image captions to supply explanations, and a user study is conducted to validate the quality of the synthesized explanations. The results show that the additional supervision from explanations not only produces insightful textual sentences that justify the answers, but also improves answer-prediction performance.
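The multi-task framing described above can be sketched as a joint objective that combines an answer-classification loss with a per-token explanation-generation loss. The weighting `lam`, the function names, and the toy dimensions below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(logits, target_idx):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -np.log(softmax(logits)[target_idx])

def multitask_loss(answer_logits, answer_idx, expl_logits, expl_token_ids, lam=1.0):
    """Joint VQA-E-style objective: answer classification loss plus
    the mean per-token explanation loss, weighted by lam."""
    l_ans = cross_entropy(answer_logits, answer_idx)
    l_expl = np.mean([cross_entropy(step_logits, tok)
                      for step_logits, tok in zip(expl_logits, expl_token_ids)])
    return l_ans + lam * l_expl
```

Setting `lam=0` recovers a plain VQA model; increasing `lam` trades answer accuracy against explanation quality during training.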
3. Detailed discussion
Recent progress in VQA has mainly come from attention mechanisms and multimodal fusion for answer prediction. Although performance has improved markedly, without any explanation humans cannot truly understand the decision process behind these models. One common approach is to use attention to identify the attended regions, but attended visual regions alone cannot clearly justify a prediction; for example, a model may attend to the correct regions yet still produce a wrong answer. This paper therefore focuses on the interpretability of VQA models.
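The "attended regions" idea can be illustrated with a minimal bilinear-attention sketch. The scoring form, names, and dimensions here are illustrative assumptions, not the specific model used in the paper:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(region_feats, question_vec, W):
    """Score each image region against the question with a bilinear form,
    then pool the region features by the resulting attention weights.

    region_feats: (K, d) array, one row per image region
    question_vec: (dq,) question embedding
    W: (d, dq) bilinear weight matrix
    """
    scores = region_feats @ W @ question_vec   # one scalar score per region
    weights = softmax(scores)                  # attention distribution over regions
    attended = weights @ region_feats          # weighted sum of region features
    return attended, weights
```

The attention weights show *where* the model looked, but as the paragraph above notes, a plausible weight map does not by itself explain *why* a particular answer was chosen.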
Another advantage of textual explanations is that they can provide more information beyond the answer itself, as shown in the figure below.