Notes on "Reciprocal Attention Fusion for Visual Question Answering"

Latest recommended article published on 2025-03-16 20:49:07

Tiám青年

Categories: Computer Vision, VQA

Article link: https://blog.csdn.net/xiasli123/article/details/104152324

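This is one post in a series of reading notes on visual question answering papers. It is fairly long, but reading it through should be worthwhile. Corrections and discussion are always welcome.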

1. Abstract

Existing attention mechanisms either attend to local image-grid or object level features for Visual Question Answering (VQA). Motivated by the observation that questions can relate to both object instances and their parts, we propose a novel attention mechanism that jointly considers reciprocal relationships between the two levels of visual details. The bottom-up attention thus generated is further coalesced with the top-down information to only focus on the scene elements that are most relevant to a given question. Our design hierarchically fuses multi-modal information i.e., language, object- and grid-level features, through an efficient tensor decomposition scheme. The proposed model improves the state-of-the-art single model performances from 67.9% to 68.2% on VQAv1 and from 65.7% to 67.4% on VQAv2, demonstrating a significant boost.

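In summary, the authors observe that existing attention mechanisms for visual question answering (VQA) attend either to local image-grid features or to object-level features. Because a question can refer to both object instances and their parts, they propose a new attention mechanism that jointly considers the reciprocal relationships between these two levels of visual detail. The resulting bottom-up attention is then combined with top-down information so that the model focuses only on the scene elements most relevant to the given question. The design hierarchically fuses multi-modal information (language, object-level features, and grid-level features) through an efficient tensor decomposition scheme. The proposed model raises the state-of-the-art single-model performance from 67.9% to 68.2% on VQAv1 and from 65.7% to 67.4% on VQAv2.

Figure 1 shows attention applied to the reciprocal visual features, which lets the VQA model obtain the information most relevant to answering a given visual question.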
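To make the idea of question-guided attention over two levels of visual features more concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the abstract only says the fusion uses "an efficient tensor decomposition scheme", so a simple low-rank bilinear fusion (project both inputs, then multiply elementwise) is swapped in as a stand-in. All names (`low_rank_fuse`, `attend`, `two_level_attention_sketch`), the dimensions, and the random weights are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_fuse(q, v, Wq, Wv):
    # Stand-in for the paper's tensor-decomposition fusion (an assumption):
    # project question and visual input to k dims, then multiply elementwise.
    return np.tanh(q @ Wq) * np.tanh(v @ Wv)

def attend(q, feats, Wq, Wv, w):
    # Question-guided attention over n region features.
    # q: (dq,), feats: (n, dv) -> attended feature (dv,), weights (n,)
    joint = low_rank_fuse(q, feats, Wq, Wv)  # (n, k) via broadcasting
    weights = softmax(joint @ w)             # one weight per region
    return weights @ feats, weights

def two_level_attention_sketch(q, grid, objs, params):
    # Attend to grid-level and object-level features separately,
    # then fuse the concatenated result with the question once more.
    g_att, g_w = attend(q, grid, *params["grid"])
    o_att, o_w = attend(q, objs, *params["obj"])
    Wq, Wv = params["fuse"]
    visual = np.concatenate([g_att, o_att])
    return low_rank_fuse(q, visual, Wq, Wv), g_w, o_w

# Hypothetical sizes: 8-d question, 49 grid cells (6-d), 10 objects (5-d), rank 4.
rng = np.random.default_rng(0)
dq, dg, do, k = 8, 6, 5, 4
q = rng.normal(size=dq)
grid = rng.normal(size=(49, dg))
objs = rng.normal(size=(10, do))
params = {
    "grid": (rng.normal(size=(dq, k)), rng.normal(size=(dg, k)), rng.normal(size=k)),
    "obj": (rng.normal(size=(dq, k)), rng.normal(size=(do, k)), rng.normal(size=k)),
    "fuse": (rng.normal(size=(dq, k)), rng.normal(size=(dg + do, k))),
}
joint, g_w, o_w = two_level_attention_sketch(q, grid, objs, params)
```

Here `joint` is a rank-`k` joint embedding that a classifier could map to answer scores, while `g_w` and `o_w` are the attention weights over grid cells and object proposals. In a real model the weights would be learned, and the question vector would come from a language encoder rather than random numbers.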