This is one in a series of reading notes on visual question answering papers. It is a bit long, so please read patiently; there should be something worth taking away. If anything is lacking, discussion and feedback are always welcome.
1. Abstract Overview
Popularized as ‘bottom-up’ attention [2], bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA). However, it is not clear whether the advantages of regions (e.g. better localization) are the key reasons for the success of bottom-up attention. In this paper, we revisit grid features for VQA and find they can work surprisingly well, running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion). Through extensive experiments, we verify that this observation holds true across different VQA models (reporting a state-of-the-art accuracy on VQA 2.0 test-std, 72.71), datasets, and generalizes well to other tasks like image captioning. As grid features make the model design and training process much simpler, this enables us to train them end-to-end and also use a more flexible network design. We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training. We hope our findings help further improve the scientific understanding and the practical application of VQA. Code and features will be made available.
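To make the paper's central notion concrete, here is a minimal PyTorch sketch of what "grid features" are: simply the spatial feature map of a CNN backbone, flattened into a set of feature vectors, in contrast to the pooled bounding-box features of bottom-up attention. This is my own illustration under assumed settings (a ResNet-50 backbone and a 448x448 input), not the authors' released code.

```python
import torch
import torchvision

# A minimal sketch (my own illustration, not the authors' released code):
# "grid features" are just the spatial feature map of a CNN backbone,
# flattened into H*W feature vectors, whereas bottom-up attention pools
# features from detected bounding boxes instead.

backbone = torchvision.models.resnet50()  # assumed backbone; weights untrained here
# Keep everything up to the last convolutional stage; drop avgpool and fc.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 448, 448)        # dummy image batch (assumed input size)
feature_map = feature_extractor(image)     # -> (1, 2048, 14, 14)

# Flatten the 14x14 spatial grid into N = 196 vectors of dimension 2048;
# these play the same role as the ~100 region features in bottom-up attention.
grid_features = feature_map.flatten(2).transpose(1, 2)  # -> (1, 196, 2048)
print(grid_features.shape)  # torch.Size([1, 196, 2048])
```

Because no object detector is involved, such features come out of a single backbone forward pass, which is where the order-of-magnitude speedup claimed in the abstract comes from.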