This is one in a series of reading notes on Visual Question Answering papers. The post is a bit long, but reading it through patiently will be rewarding. If anything is lacking, comments and discussion are always welcome.
1. Abstract Overview
In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolute with visual features for capturing the textual and visual relationship in the early stage. The question-guided convolution can tightly couple the textual and visual information but also introduce more parameters when learning kernels. We apply the group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed approach is also complementary to existing bilinear pooling fusion and attention based VQA methods. By integrating with them, our method could further boost the performance. Experiments on VQA datasets validate the effectiveness of QGHC.
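To make the hybrid-convolution idea concrete, here is a minimal numpy sketch of question-guided group convolution. All sizes, weights, and the 1x1 kernel choice are made-up stand-ins for illustration, not the paper's actual configuration: half of the kernel groups are question-independent (learned directly), and the other half are predicted from the question feature by a linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

C_in, C_out, groups = 8, 8, 4            # hypothetical channel/group sizes
H, W = 4, 4                              # spatial size of the feature map
d_q = 6                                  # question-embedding size (assumed)

# Visual feature map from a CNN (random stand-in values).
f_v = rng.standard_normal((C_in, H, W))

# Question feature from an RNN (random stand-in values).
f_q = rng.standard_normal(d_q)

cpg_in, cpg_out = C_in // groups, C_out // groups

# Question-independent kernels: learned directly, shared across questions.
W_indep = rng.standard_normal((groups // 2, cpg_out, cpg_in))

# Question-dependent kernels: predicted from f_q by a linear layer,
# which keeps the parameter count tied to d_q rather than the full kernel bank.
W_pred = rng.standard_normal((groups // 2 * cpg_out * cpg_in, d_q))
W_dep = (W_pred @ f_q).reshape(groups // 2, cpg_out, cpg_in)

# Hybrid kernel bank: half the groups are question-guided.
kernels = np.concatenate([W_indep, W_dep], axis=0)   # (groups, cpg_out, cpg_in)

# Grouped 1x1 convolution: each group mixes only its own channel slice,
# so the output keeps its spatial layout (no spatial information discarded).
out = np.empty((C_out, H, W))
for g in range(groups):
    x_g = f_v[g * cpg_in:(g + 1) * cpg_in]           # (cpg_in, H, W)
    out[g * cpg_out:(g + 1) * cpg_out] = np.einsum(
        'oi,ihw->ohw', kernels[g], x_g)

print(out.shape)  # (8, 4, 4)
```

Because each group only touches `C_in / groups` input channels, predicting question-dependent kernels for a few groups is far cheaper than predicting a full dense kernel bank, which is how the group structure reduces parameters and over-fitting.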
2. Network Architecture
Typically, a convolutional neural network (CNN) is adopted to learn visual features, while a recurrent neural network (RNN), e.g., a long short-term memory (LSTM) or gated recurrent unit (GRU), encodes the input question, i.e.,

$$f_v = \mathrm{CNN}(I;\, \theta_v), \qquad f_q = \mathrm{RNN}(q;\, \theta_q),$$

where $f_v$ and $f_q$ denote the visual feature and the question feature, $I$ is the input image, $q$ is the input question, and $\theta_v$, $\theta_q$ are the network parameters.
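As a concrete illustration of the question-encoding side, below is a minimal GRU cell in numpy that encodes a short word sequence into a question feature $f_q$. The dimensions and weights are arbitrary stand-ins (and biases are omitted for brevity); a real VQA model would use learned word embeddings and a framework's GRU implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 5, 7                      # embedding and hidden sizes (assumed)

# Random GRU parameters (stand-ins for learned weights; biases omitted).
Wz, Uz = rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))
Wr, Ur = rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))
Wh, Uh = rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x):
    z = sigmoid(Wz @ x + Uz @ h)      # update gate
    r = sigmoid(Wr @ x + Ur @ h)      # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde  # interpolate old and candidate state

# Encode a 3-word "question" (random stand-in embeddings);
# the final hidden state serves as the question feature f_q.
question = rng.standard_normal((3, d_in))
h = np.zeros(d_h)
for x in question:
    h = gru_step(h, x)
f_q = h
print(f_q.shape)  # (7,)
```

The visual branch $f_v = \mathrm{CNN}(I;\, \theta_v)$ is analogous: any backbone producing a channel-by-spatial feature map can supply the features that the question-guided kernels later convolve with.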