Paper notes: "In Defense of Grid Features for Visual Question Answering"

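This is one post in a series of reading notes on visual question answering papers. It is on the long side, but reading it through should be worth your time. If you spot any shortcomings, comments and discussion are always welcome.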

1. Abstract of the paper

Popularized as ‘bottom-up’ attention [2], bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA). However, it is not clear whether the advantages of regions (e.g. better localization) are the key reasons for the success of bottom-up attention. In this paper, we revisit grid features for VQA and find they can work surprisingly well, running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion). Through extensive experiments, we verify that this observation holds true across different VQA models (reporting a state-of-the-art accuracy on VQA 2.0 test-std, 72.71), datasets, and generalizes well to other tasks like image captioning. As grid features make the model design and training process much simpler, this enables us to train them end-to-end and also use a more flexible network design. We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training. We hope our findings help further improve the scientific understanding and the practical application of VQA. Code and features will be made available.

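In plain terms: visual features based on bounding boxes (regions), popularized under the name "bottom-up" attention [2], have recently overtaken plain grid-based convolutional features as the de facto standard for vision-and-language tasks such as visual question answering (VQA). It is unclear, however, whether the advantages of regions (for example, better localization) are the real reason bottom-up attention succeeds. The paper revisits grid features for VQA and finds that they work surprisingly well: when pre-trained in a similar fashion, they reach the same accuracy while running more than an order of magnitude faster. Extensive experiments show that this observation holds across different VQA models (with a state-of-the-art accuracy of 72.71 reported on VQA 2.0 test-std) and across datasets, and that it generalizes well to other tasks such as image captioning. Because grid features make model design and training much simpler, the models can be trained end-to-end and the network design can be more flexible. The authors learn VQA models end-to-end, directly from pixels to answers, and show that strong performance is achievable without using any region annotations in pre-training. They hope these findings help improve both the scientific understanding and the practical application of VQA.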
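To make the term "grid features" concrete, here is a minimal sketch in plain Python. It is not the paper's model or code: the function names `grid_features` and `attend`, the toy feature map, and the query vector are all hypothetical and chosen only for illustration. The sketch assumes a convolutional feature map of shape C x H x W, treats each of the H*W spatial cells as one C-dimensional feature vector, and then applies a simple dot-product attention with a question vector over those cells.

```python
import math


def grid_features(feature_map):
    """Flatten a C x H x W feature map into H*W vectors of length C."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def attend(features, query):
    """Softmax dot-product attention of one query vector over the features."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return weights, pooled


# Toy feature map with 2 channels on a 2 x 2 grid (made-up numbers).
toy_map = [
    [[1.0, 0.0], [0.0, 2.0]],  # channel 0
    [[0.0, 1.0], [1.0, 0.0]],  # channel 1
]

cells = grid_features(toy_map)  # 4 grid cells, each a 2-dimensional vector
weights, pooled = attend(cells, query=[1.0, 0.0])  # hypothetical question vector
```

A region-based pipeline would instead run an object detector first and pool one feature vector per detected box. With grid features there is no box proposal or per-region pooling step, which is the kind of simplification the abstract credits for the speedup.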
