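This is one of a series of reading notes on visual question answering papers. The post is a bit long, so please read it patiently; you should come away with something useful. If anything here falls short, discussion is always welcome.
I. Abstract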
In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. It is challenging to provide supervision for attention. An observation we make is that visual explanations as obtained through class activation mappings (specifically Grad-CAM) that are meant to explain the performance of various networks could form a means of supervision. However, as the distributions of attention maps and that of Grad-CAMs differ, it would not be suitable to directly use these as a form of supervision. Rather, we propose the use of a discriminator that aims to distinguish samples of visual explanation and attention maps. The use of adversarial training of the attention regions as a two-player game between attention and explanation serves to bring the distributions of attention maps and visual explanations closer. Significantly, we observe that providing such a means of supervision also results in attention maps that are more closely related to human attention resulting in a substantial improvement over baseline stacked attention network (SAN) models. It also results in a good improvement in rank correlation metric on the VQA task. This method can also be combined with recent MCB based methods and results in consistent improvement. We also provide comparisons with other means for learning distributions such as based on Correlation Alignment (Coral), Maximum Mean Discrepancy (MMD) and Mean Square Error (MSE) losses and observe that the adversarial loss outperforms the other forms of learning the attention maps. Visualization of the results also confirms our hypothesis that attention maps improve using this form of supervision.
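In short: the authors aim to learn better attention for the visual question answering (VQA) task. Attention is hard to supervise directly, and they observe that visual explanations obtained through class activation mappings (specifically Grad-CAM), which are meant to explain a network's behavior, can serve as a supervision signal. Because attention maps and Grad-CAM maps follow different distributions, using Grad-CAM directly as a target is not suitable. Instead, a discriminator is trained to tell visual explanations apart from attention maps, and this adversarial two-player game between attention and explanation pulls the two distributions closer together. With this supervision, the attention maps correlate more closely with human attention, giving a substantial improvement over the stacked attention network (SAN) baseline and a good improvement in the rank correlation metric. The method can also be combined with recent MCB-based methods, with consistent gains. Compared with other ways of matching distributions, namely Correlation Alignment (Coral), Maximum Mean Discrepancy (MMD), and Mean Square Error (MSE) losses, the adversarial loss performs best, and visualizations of the results confirm that the attention maps improve under this form of supervision.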
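To make the two-player game more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the toy linear discriminator, its weights, the example scores, and all function names are my own assumptions, and the attention-side loss uses the common non-saturating GAN form, which the abstract does not specify. The MSE function is included only to contrast direct supervision with the adversarial signal.

```python
import math

def normalize(scores):
    """Softmax over a flat list of region scores, so the map sums to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mse(attention, explanation):
    """Direct-supervision baseline: mean squared error between two maps."""
    return sum((a - g) ** 2 for a, g in zip(attention, explanation)) / len(attention)

def discriminator_prob(region_map, weights, bias):
    """Toy linear discriminator (assumption): probability that a map
    is a Grad-CAM explanation rather than an attention map."""
    z = sum(w * x for w, x in zip(weights, region_map)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def discriminator_loss(p_explanation, p_attention):
    """The discriminator wants explanations scored near 1 and attention near 0."""
    return -(math.log(p_explanation) + math.log(1.0 - p_attention))

def attention_loss(p_attention):
    """The attention network wants its map to be mistaken for an explanation
    (non-saturating form, chosen here as an assumption)."""
    return -math.log(p_attention)

# Hypothetical 4-region maps and discriminator parameters, for illustration only.
attention = normalize([0.2, 1.5, 0.1, 0.3])
explanation = normalize([0.1, 2.0, 0.2, 0.1])
weights, bias = [0.5, 2.0, -1.0, -0.5], 0.0

p_g = discriminator_prob(explanation, weights, bias)
p_a = discriminator_prob(attention, weights, bias)
print("MSE:", mse(attention, explanation))
print("discriminator loss:", discriminator_loss(p_g, p_a))
print("attention loss:", attention_loss(p_a))
```

In actual training the two losses would be minimized in alternation: one step updates the discriminator, the next updates the attention network, so that the attention maps gradually become harder to distinguish from the Grad-CAM explanations.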
II. Network Architecture
The main focus of the authors' approach to visual question answering (VQA) is to make use of visual explanation methods (e.g., Grad-