A Graph Reasoning Network Model for Visual Question Answering ("Graph Reasoning Networks for Visual Question Answering")

Latest recommended article published on 2023-02-15 17:44:11

Tiám青年


Category: Computer Vision, VQA

Article link: https://blog.csdn.net/xiasli123/article/details/104115347


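This is one of a series of reading notes on visual question answering papers. The post is fairly long, so please read it patiently; you should get something out of it. If anything is lacking, discussion is always welcome.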

1. Abstract

The interaction between language and visual information has been emphasized in visual question answering (VQA) with the help of attention mechanism. However, the relationship between words in question has been underestimated, which makes it hard to answer questions that involve the relationship between multiple entities, such as comparison and counting. In this paper, we develop the graph reasoning networks to tackle this problem. Two kinds of graphs are investigated, namely inter-graph and intra-graph. The inter-graph transfers features of the detected objects to their related query words, enabling the output nodes to have both semantic and factual information. The intra-graph exchanges information between these output nodes from inter-graph to amplify implicit yet important relationship between objects. These two kinds of graphs cooperate with each other, and thus our resulting model can reason the relationship and dependence between objects, which leads to realization of multi-step reasoning. Experimental results on the GQA v1.1 dataset demonstrate the reasoning ability of our method to handle compositional questions about real-world images. We achieve state-of-the-art performance, boosting accuracy to 57.04%. On the VQA 2.0 dataset, we also receive a promising improvement on overall accuracy, especially on counting problem.

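In short, the authors note that attention mechanisms have put the interaction between language and visual information at the center of visual question answering (VQA), while the relationships between the words of the question have been underestimated. This makes questions that involve relationships among several entities, such as comparison and counting, hard to answer. To address this, the paper develops graph reasoning networks built on two kinds of graphs, the inter-graph and the intra-graph. The inter-graph passes the features of detected objects to the query words they relate to, so that its output nodes carry both semantic and factual information. The intra-graph then exchanges information among those output nodes to amplify implicit but important relationships between objects. Working together, the two graphs let the model reason about relationships and dependencies between objects, which yields multi-step reasoning. Experiments on GQA v1.1 demonstrate this reasoning ability on compositional questions about real-world images, with state-of-the-art accuracy of 57.04%. On VQA 2.0 the method also gives a promising gain in overall accuracy, especially on counting questions.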

2. Network Architecture

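The goal of the VQA task is to answer a given question $Q$ about an image $I$. Using the Faster-RCNN object detector, the input image $I$ is converted into object features $V \in \mathbb{R}^{n \times D}$, where $n$ is the number of detected objects and $D$ is the feature dimension. The question is a sequence of $m$ words, which an LSTM encodes into the question features $Q$, one feature vector per word. Figure 1 shows the framework of the network model.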

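BAN (bilinear attention networks) is introduced to reduce both input channels simultaneously and obtain a unified representation of the question features $Q$ and the image features $V$. It first computes a bilinear attention map $G$ between $Q$ and $V$, and then generates the joint embedding $z$ conditioned on that map. The weights that define $G$ are variables to be learned.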
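As a rough illustration of the two steps above (not the paper's exact formulation), the sketch below computes a bilinear attention map and a joint embedding in plain Python. The projection matrices `W_q` and `W_v`, the weight vector `p`, the helper `matmul`, and the softmax taken over all word-object pairs are assumptions made for this example. It also reuses one set of projections for both steps and omits nonlinearities, simplifications that may differ from the actual BAN layer.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows (hypothetical helper)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def bilinear_attention(Q, V, W_q, W_v, p):
    """Return the attention map G (m x n) and the joint embedding z (length K).

    Q: m word features (m x d_q); V: n object features (n x D).
    W_q (d_q x K), W_v (D x K) and p (length K) stand in for learned weights;
    their names and shapes are assumptions, not taken from the paper.
    """
    Qp = matmul(Q, W_q)  # projected words, m x K
    Vp = matmul(V, W_v)  # projected objects, n x K

    # Bilinear score for every (word i, object j) pair.
    logits = [[sum(pk * qk * vk for pk, qk, vk in zip(p, q, v)) for v in Vp]
              for q in Qp]

    # Softmax over all m * n pairs; subtracting the max keeps exp() stable.
    top = max(max(row) for row in logits)
    exps = [[math.exp(x - top) for x in row] for row in logits]
    total = sum(sum(row) for row in exps)
    G = [[e / total for e in row] for row in exps]

    # Joint embedding: z_k = sum_i sum_j G[i][j] * Qp[i][k] * Vp[j][k].
    z = [sum(G[i][j] * Qp[i][k] * Vp[j][k]
             for i in range(len(Qp)) for j in range(len(Vp)))
         for k in range(len(p))]
    return G, z


# Toy inputs: m = 2 words, n = 3 objects, d_q = D = 2, K = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_v = [[1.0, 0.0], [0.0, 1.0]]
p = [1.0, 1.0]
G, z = bilinear_attention(Q, V, W_q, W_v, p)
```

On these toy inputs, `G` is a 2 x 3 map whose entries sum to 1, and `z` has one entry per projected dimension. Word-object pairs whose projected features overlap receive more attention than pairs that do not.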
