Hierarchical Question-Image Co-Attention for Visual Question Answering

A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.
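The core co-attention idea can be sketched in a few lines: compute an affinity matrix between every question word and every image region, then derive an attention distribution over each modality from that shared matrix. The sketch below is a simplified, illustrative reading of the mechanism, not the paper's full model; the weight matrix and feature dimensions are random placeholders, and attention is taken by max-pooling the affinity matrix over the other modality (one of the variants the co-attention formulation admits):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

d, T, N = 64, 10, 196              # feature dim, question length, image regions (placeholders)
Q = rng.standard_normal((d, T))    # question features, one column per word
V = rng.standard_normal((d, N))    # image features, one column per region
W_b = rng.standard_normal((d, d))  # affinity weights (learned in practice, random here)

# Affinity matrix: similarity of every word to every image region
C = np.tanh(Q.T @ W_b @ V)         # shape (T, N)

# Co-attention: each modality attends using the affinity with the other.
# Max-pooling over the other modality's locations gives a relevance score per location.
a_v = softmax(C.max(axis=0))       # (N,) attention over image regions
a_q = softmax(C.max(axis=1))       # (T,) attention over question words

# Attended summary vectors for each modality
v_hat = V @ a_v                    # (d,) attended image feature
q_hat = Q @ a_q                    # (d,) attended question feature
```

Because both attention maps come from the same affinity matrix `C`, the image attention depends on the question and vice versa, which is what distinguishes co-attention from the one-directional visual attention used in earlier VQA models.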
Comments: 11 pages, 7 figures, 3 tables in 2016 Conference on Neural Information Processing Systems (NIPS)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as: arXiv:1606.00061 [cs.CV]
  (or arXiv:1606.00061v5 [cs.CV] for this version)

Submission history

From: Jiasen Lu
[v1] Tue, 31 May 2016 22:02:01 GMT (4284kb,D)
[v2] Thu, 2 Jun 2016 01:51:13 GMT (3549kb,D)
[v3] Wed, 26 Oct 2016 02:15:57 GMT (3669kb,D)
[v4] Fri, 13 Jan 2017 16:18:03 GMT (3669kb,D)
[v5] Thu, 19 Jan 2017 05:03:33 GMT (3669kb,D)