Relation-Aware Graph Attention Network for Visual Question Answering (ReGAT)

Latest recommended article published on 2024-03-18 20:09:45

Author: Tiám青年

Categories: VQA, Computer Vision

Article link: https://blog.csdn.net/xiasli123/article/details/102937712

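This article is fairly long, so please read it patiently; it should be worth your time. Corrections and discussion are welcome. Also attached: paper download link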

1. Abstract

In order to answer semantically-complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism, to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations that represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations that capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible to existing VQA architectures, and can be used as a generic relation encoder to boost the model performance for VQA.

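The authors argue that a VQA model needs to fully understand the visual scene in an image, especially the interactive dynamics between different objects. To this end they propose the Relation-aware Graph Attention Network (ReGAT), which encodes each image as a graph and models multiple types of inter-object relations through a graph attention mechanism, in order to learn question-adaptive relation representations. Experiments show that ReGAT outperforms prior state-of-the-art approaches on VQA 2.0 and VQA-CP v2, and that it can serve as a generic relation encoder to improve the performance of existing VQA models.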

2. Network Architecture

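The overall pipeline of the proposed Relation-Aware Graph Attention Network (ReGAT) is as follows. Faster R-CNN detects a set of object regions. The region-level features are then fed into different relation encoders to learn relation-aware, question-adaptive visual features, which are combined with the question representation to predict the answer.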

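The model considers both explicit relations (semantic and spatial) and implicit relations. The proposed relation encoders use graph attention to capture question-adaptive interactions between objects. The resulting features are fused using a bilinear multimodal fusion method, and the predicted probabilities from the explicit and implicit relation encoders are finally weighted against each other to predict the answer. The framework is analyzed in detail below.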
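Before going into the details, the final step mentioned above (weighting the predicted probabilities of the different relation encoders) can be illustrated with a minimal sketch. It assumes the combination is a convex weighted sum of per-encoder answer distributions; the function name `combine_predictions`, the weights, and the toy probabilities are all hypothetical values chosen for illustration, not figures from the paper.

```python
def combine_predictions(prob_by_encoder, weights):
    """Return the weighted sum of per-encoder answer distributions.

    prob_by_encoder: dict mapping encoder name -> list of answer probabilities
    weights:         dict mapping encoder name -> non-negative weight (sum to 1)
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    num_answers = len(next(iter(prob_by_encoder.values())))
    combined = [0.0] * num_answers
    for name, probs in prob_by_encoder.items():
        for answer_idx, p in enumerate(probs):
            combined[answer_idx] += weights[name] * p
    return combined


# Toy distributions over three candidate answers (hypothetical numbers).
probs = {
    "semantic": [0.7, 0.2, 0.1],  # explicit relation encoder (semantic)
    "spatial": [0.5, 0.4, 0.1],   # explicit relation encoder (spatial)
    "implicit": [0.2, 0.6, 0.2],  # implicit relation encoder
}
# Hypothetical trade-off weights; in practice these would be tuned.
weights = {"semantic": 0.4, "spatial": 0.3, "implicit": 0.3}

combined = combine_predictions(probs, weights)
best = max(range(len(combined)), key=combined.__getitem__)
print(combined, best)
```

Because each input is a probability distribution and the weights sum to one, the combined vector is again a distribution, so the predicted answer is simply its arg-max.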

2.1 Graph Construction

1) Fully-connected Relation Graph

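Treating each of the $K$ objects $v_i$ in the image as a vertex, we can construct a fully-connected undirected graph $\mathcal{G}_{imp} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{E}$ is the set of $K \times (K-1)$ edges. Each edge represents an implicit relation between two objects, which is reflected by the learned weight that graph attention assigns to that edge. All weights are learned without any prior knowledge. The relation encoder built on this graph is called the implicit relation encoder.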
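The construction above can be sketched in a few lines. The sketch enumerates every ordered pair of distinct objects, which gives the $K \times (K-1)$ edge count, and then assigns each edge leaving one object a normalized weight. Note the assumption: the weights here come from a plain dot-product softmax over toy feature vectors, used only as a stand-in. In the actual model the weights are learned and question-adaptive. The helper names are hypothetical.

```python
import math
from itertools import permutations


def fully_connected_edges(k):
    """All ordered pairs (i, j) with i != j, i.e. K * (K - 1) edges."""
    return list(permutations(range(k), 2))


def attention_weights(features, i, neighbors):
    """Softmax over dot-product scores between object i and its neighbors.

    This is a stand-in scoring function, not the learned attention of ReGAT.
    """
    scores = [
        sum(a * b for a, b in zip(features[i], features[j])) for j in neighbors
    ]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return {j: e / total for j, e in zip(neighbors, exps)}


# Four toy region features (hypothetical 2-d vectors).
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
edges = fully_connected_edges(len(features))
neighbors_of_0 = [j for (i, j) in edges if i == 0]
alpha = attention_weights(features, 0, neighbors_of_0)
print(len(edges), alpha)
```

The weights on the edges leaving object 0 sum to one, so they can be read as how strongly object 0 attends to each other object.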

2) Pruned Graph with Prior Knowledge

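On the other hand, if explicit relations exist between vertices, the fully-connected graph can easily be converted into an explicit relation graph by pruning every edge that has no corresponding explicit relation. For each pair of objects $i$, $j$, if < $i$ - $p$ - $j$ > is a valid relation, an edge is created from $i$ to $j$ with edge label $p$. In addition, each object node $i$ is assigned a self-loop edge, which is given a special label, identical. The graph thus becomes sparse, and each edge encodes prior knowledge about one inter-object relation in the image. The relation encoder built on this graph is called the explicit relation encoder.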
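The pruning rule can be sketched as a small function that keeps only the edges backed by a valid triple and then adds one self-loop per node. This is a minimal illustration under stated assumptions: the function name `build_explicit_graph`, the dictionary representation, the self-loop label string, and the example predicates are all chosen for this sketch and do not come from the paper's code.

```python
SELF_LOOP_LABEL = "identical"  # placeholder name for the self-loop label


def build_explicit_graph(num_objects, relations):
    """Build a sparse labeled directed graph from <i - p - j> triples.

    relations: iterable of (i, p, j) tuples, each a valid relation
    returns:   dict mapping (i, j) -> edge label
    """
    edges = {}
    for i, p, j in relations:
        if i == j:
            raise ValueError("self-loops are added separately")
        edges[(i, j)] = p  # directed edge from i to j labeled p
    for i in range(num_objects):
        edges[(i, i)] = SELF_LOOP_LABEL  # one self-loop per object node
    return edges


# Three objects and two valid relations (hypothetical predicates).
relations = [(0, "holding", 1), (2, "on", 0)]
graph = build_explicit_graph(3, relations)
for (src, dst), label in sorted(graph.items()):
    print(src, label, dst)
```

With three objects, the fully-connected graph would have six directed edges between distinct nodes. Here only two survive the pruning, plus three self-loops, which shows how the prior knowledge makes the graph sparse.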

3) Spatial Graph

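Let $spa_{i,j} = \langle object_i\text{-}predicate\text{-}object_j \rangle$ denote the spatial relation that describes the relative geometric position of $object_i$ with respect to $object_j$. To construct the spatial graph, given two object region proposals…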
