This is one in a series of reading notes on Visual Question Answering papers. The post is a bit long, but reading it through patiently will be rewarding. If anything is lacking, comments and discussion are always welcome.
1. Abstract Overview
In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolute with visual features for capturing the textual and visual relationship in the early stage. The question-guided convolution can tightly couple the textual and visual information but also introduce more parameters when learning kernels. We apply the group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed approach is also complementary to existing bilinear pooling fusion and attention based VQA methods. By integrating with them, our method could further boost the performance. Experiments on VQA datasets validate the effectiveness of QGHC.
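To make the hybrid-convolution idea concrete, here is a minimal numpy sketch of question-guided group convolution. All sizes, weights, and the 1x1 kernel choice are made-up stand-ins for illustration, not the paper's actual configuration: half of the kernel groups are question-independent (learned directly), and the other half are predicted from the question feature by a linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

C_in, C_out, groups = 8, 8, 4            # hypothetical channel/group sizes
H, W = 4, 4                              # spatial size of the feature map
d_q = 6                                  # question-embedding size (assumed)

# Visual feature map from a CNN (random stand-in values).
f_v = rng.standard_normal((C_in, H, W))

# Question feature from an RNN (random stand-in values).
f_q = rng.standard_normal(d_q)

cpg_in, cpg_out = C_in // groups, C_out // groups

# Question-independent kernels: learned directly, shared across questions.
W_indep = rng.standard_normal((groups // 2, cpg_out, cpg_in))

# Question-dependent kernels: predicted from f_q by a linear layer,
# which keeps the parameter count tied to d_q rather than the full kernel bank.
W_pred = rng.standard_normal((groups // 2 * cpg_out * cpg_in, d_q))
W_dep = (W_pred @ f_q).reshape(groups // 2, cpg_out, cpg_in)

# Hybrid kernel bank: half the groups are question-guided.
kernels = np.concatenate([W_indep, W_dep], axis=0)   # (groups, cpg_out, cpg_in)

# Grouped 1x1 convolution: each group mixes only its own channel slice,
# so the output keeps its spatial layout (no spatial information discarded).
out = np.empty((C_out, H, W))
for g in range(groups):
    x_g = f_v[g * cpg_in:(g + 1) * cpg_in]           # (cpg_in, H, W)
    out[g * cpg_out:(g + 1) * cpg_out] = np.einsum(
        'oi,ihw->ohw', kernels[g], x_g)

print(out.shape)  # (8, 4, 4)
```

Because each group only touches `C_in / groups` input channels, predicting question-dependent kernels for a few groups is far cheaper than predicting a full dense kernel bank, which is how the group structure reduces parameters and over-fitting.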
2. Network Architecture
Typically, a convolutional neural network (CNN) is adopted to learn visual features, while a recurrent neural network (RNN), e.g., a long short-term memory (LSTM) or gated recurrent unit (GRU), encodes the input question, i.e.,

$$f_v = \mathrm{CNN}(I;\, \theta_v), \qquad f_q = \mathrm{RNN}(q;\, \theta_q),$$

where $f_v$ and $f_q$ denote the visual feature and the question feature, $I$ is the input image, $q$ is the input question, and $\theta_v$, $\theta_q$ are the network parameters.
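As a concrete illustration of the question-encoding side, below is a minimal GRU cell in numpy that encodes a short word sequence into a question feature $f_q$. The dimensions and weights are arbitrary stand-ins (and biases are omitted for brevity); a real VQA model would use learned word embeddings and a framework's GRU implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 5, 7                      # embedding and hidden sizes (assumed)

# Random GRU parameters (stand-ins for learned weights; biases omitted).
Wz, Uz = rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))
Wr, Ur = rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))
Wh, Uh = rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x):
    z = sigmoid(Wz @ x + Uz @ h)      # update gate
    r = sigmoid(Wr @ x + Ur @ h)      # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde  # interpolate old and candidate state

# Encode a 3-word "question" (random stand-in embeddings);
# the final hidden state serves as the question feature f_q.
question = rng.standard_normal((3, d_in))
h = np.zeros(d_h)
for x in question:
    h = gru_step(h, x)
f_q = h
print(f_q.shape)  # (7,)
```

The visual branch $f_v = \mathrm{CNN}(I;\, \theta_v)$ is analogous: any backbone producing a channel-by-spatial feature map can supply the features that the question-guided kernels later convolve with.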