A Multimodal Relational Reasoning Model for Visual Question Answering: the paper "Multimodal Relational Reasoning for Visual Question Answering"


Tiám青年


Article link: https://blog.csdn.net/xiasli123/article/details/102794884


I. Abstract of the paper

    Multimodal attentional networks are currently state-of-the-art models for Visual Question Answering (VQA) tasks involving real images. Although attention allows to focus on the visual content relevant to the question, this simple mechanism is arguably insufficient to model complex reasoning features required for VQA or other high-level tasks.
    In this paper, we propose MuRel, a multimodal relational network which is learned end-to-end to reason over real images. Our first contribution is the introduction of the MuRel cell, an atomic reasoning primitive representing interactions between question and image regions by a rich vectorial representation, and modeling region relations with pairwise combinations. Secondly, we incorporate the cell into a full MuRel network, which progressively refines visual and question interactions, and can be leveraged to define visualization schemes finer than mere attention maps.
    We validate the relevance of our approach with various ablation studies, and show its superiority to attention-based methods on three datasets: VQA 2.0, VQA-CP v2 and TDIUC. Our final MuRel network is competitive to or outperforms state-of-the-art results in this challenging context.

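The authors observe that multimodal attention networks are currently the state-of-the-art models for Visual Question Answering (VQA) on real images. Attention lets the model focus on the visual content relevant to the question, but this simple mechanism is arguably insufficient to model the complex reasoning that VQA and other high-level tasks require. To address this, the authors introduce the MuRel cell, an atomic reasoning primitive that represents the interactions between the question and the image regions with a rich vector representation and models region relations through pairwise combinations. The cell is then integrated into a full MuRel network, which progressively refines the visual and question interactions and can be used to define visualization schemes finer than mere attention maps. Experiments show that the network is competitive with or outperforms state-of-the-art results.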

II. Network architecture

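The paper replaces the conventional attention framework with a vectorized representation: the visual content of each region is fused with the question by bilinear fusion, and pairwise relational modeling is then applied. The representation also incorporates spatial and semantic context, in that pairs of image regions are represented through the interaction of their visual embeddings and their spatial coordinates. The overall architecture is shown in the figure below.
[Figure: overall architecture of the MuRel network]
The framework is analyzed below.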

2.1 MuRel approach

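The model is trained end-to-end to answer a question q about an image v:

$$\hat{a} = \arg\max_{a \in \mathcal{A}} p_\theta(a \mid v, q)$$

where p_θ is the trainable model and A is the set of possible answers. In this system the image is represented by a set of vectors {v_i}, i ∈ [1, N], where each v_i corresponds to an object detected in the image. The spatial coordinates of each region are also used, b_i = [x, y, w, h], where (x, y) is the top-left corner of the bounding box and w and h are its width and height. Here x and w (respectively y and h) are normalized by the image width (respectively height). For the question, a gated recurrent unit (GRU) provides a sentence embedding q.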
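As a small illustration of the box representation b_i = [x, y, w, h], the sketch below converts a pixel-space box into normalized coordinates. The function name `normalize_box` and the corner-based input format are assumptions made for this example; they are not taken from the paper's code.

```python
def normalize_box(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box given by its top-left and bottom-right
    corners into [x, y, w, h], where x and w are divided by the image
    width and y and h by the image height."""
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image size must be positive")
    x = x_min / img_w
    y = y_min / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return [x, y, w, h]


if __name__ == "__main__":
    # A 200x100 pixel box with top-left corner (160, 120) in a 640x480 image.
    print(normalize_box(160, 120, 360, 220, 640, 480))
```

After normalization every coordinate lies in [0, 1], so boxes from images of different sizes become comparable.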

2.2 MuRel cell

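In this design, the MuRel network is built by iterating a reasoning unit, the MuRel cell, shown in the figure below.
[Figure: the MuRel cell]
The MuRel cell takes N visual features as input, each carrying its coordinates b_i. It consists of two modules. The first is bilinear fusion, which fuses each image region feature (produced by an object detection network) with the question feature to obtain a multimodal embedding. The second is pairwise relational modeling, which models the relations between pairs of these embeddings. The cell also has a residual connection, which the authors explain is there to avoid the vanishing gradient problem. The two modules are covered in turn below. The code of the cell is as follows.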

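import torch.nn as nn
import block  # fusion library (block.bootstrap.pytorch); provides factory_fusion

# Pairwise is the class introduced in Section 2.4.


class MuRelCell(nn.Module):

    def __init__(self,
                 residual=False,  # whether to add a residual connection
                 fusion={},       # options for the fusion module
                 pairwise={}):    # options for pairwise modeling (empty disables it)
        super(MuRelCell, self).__init__()
        self.residual = residual
        self.fusion = fusion
        self.pairwise = pairwise
        # Build the fusion module through the factory
        self.fusion_module = block.factory_fusion(self.fusion)
        if self.pairwise:
            self.pairwise_module = Pairwise(**pairwise)  # pairwise relational modeling

    def forward(self, q_expand, mm, coords=None):
        # mm: (bsize, n_regions, dim) region features
        mm_new = self.process_fusion(q_expand, mm)

        if self.pairwise:
            mm_new = self.pairwise_module(mm_new, coords)

        if self.residual:
            mm_new = mm_new + mm
        return mm_new

    def process_fusion(self, q, mm):
        # Fuse every region with the question; q needs one row per (image, region) pair
        bsize = mm.shape[0]
        n_regions = mm.shape[1]
        mm = mm.contiguous().view(bsize * n_regions, -1)  # flatten the regions
        mm = self.fusion_module([q, mm])
        mm = mm.view(bsize, n_regions, -1)  # restore (bsize, n_regions, dim)
        return mm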
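To make the data flow of the cell easier to follow, here is a dependency-free sketch that mirrors `forward` on a single image using plain Python lists instead of tensors. The names `murel_cell`, `product_fuse` and `iterate_cell` are hypothetical, and the element-wise product is only a stand-in for the bilinear fusion, not the fusion the authors use.

```python
def murel_cell(q, regions, fuse, pairwise=None, residual=True):
    """One cell step for a single image.

    q        -- question vector (list of floats)
    regions  -- list of N region vectors
    fuse     -- callable (q, s) -> fused vector
    pairwise -- optional callable (list of vectors) -> list of vectors
    """
    new_regions = [fuse(q, s) for s in regions]
    if pairwise is not None:
        new_regions = pairwise(new_regions)
    if residual:
        # The residual sum requires the output size to equal the input size.
        new_regions = [[n + s for n, s in zip(new, old)]
                       for new, old in zip(new_regions, regions)]
    return new_regions


def product_fuse(q, s):
    """Hypothetical stand-in for bilinear fusion: element-wise product."""
    return [qk * sk for qk, sk in zip(q, s)]


def iterate_cell(q, regions, n_steps):
    """Apply the same cell repeatedly, feeding its output back as input."""
    for _ in range(n_steps):
        regions = murel_cell(q, regions, product_fuse)
    return regions


if __name__ == "__main__":
    q = [1.0, 2.0]
    regions = [[1.0, 1.0], [2.0, 0.0]]
    print(iterate_cell(q, regions, n_steps=2))
```

Passing the fusion and pairwise steps as callables keeps the sketch close to the structure of the class above, where both are separate sub-modules that can be switched on or off.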

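2.3 Multimodal fusion

Bilinear fusion mainly builds on the MUTAN model from another paper. Its basic idea is to use a bilinear model to encode the complex relations between the two modalities, expressed as: [equation image]

The formula above can be rewritten as: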
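For reference, the bilinear model and its decomposed form can be sketched in the notation of the MUTAN paper. The symbols used here (the full interaction tensor T, the core tensor T_c, and the factor matrices W_q, W_v, W_o) follow that paper and are assumptions with respect to this post.

```latex
% Full bilinear model: a 3-way tensor T relates the question q and the visual vector v
y = (\mathcal{T} \times_1 q) \times_2 v

% Tucker decomposition of T into a small core tensor and three factor matrices
\mathcal{T} = ((\mathcal{T}_c \times_1 W_q) \times_2 W_v) \times_3 W_o

% Substituting the decomposition: project q and v, fuse them through the core, project the result
\tilde{q} = q^\top W_q, \qquad \tilde{v} = v^\top W_v, \qquad
y = ((\mathcal{T}_c \times_1 \tilde{q}) \times_2 \tilde{v}) \times_3 W_o
```

The point of the decomposition is that the core tensor is much smaller than the full tensor, so the number of parameters stays manageable.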

A schematic of the decomposed form:

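The same bilinear fusion is applied to fuse every input vector s_i with the question embedding q:

$$m_i = B(s_i, q; \Theta)$$

Each m_i corresponds to one image region. B(·) denotes the bilinear model above, and Θ are the trainable parameters of the fusion module.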
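The fusion of one region vector with the question can be sketched in plain Python as a Tucker-style bilinear map: both inputs are projected, combined through a small core tensor, and projected again. This is a minimal sketch with hand-set toy weights; the names `matvec` and `tucker_fusion` are hypothetical, and it is not the implementation behind `block.factory_fusion`.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xk for w, xk in zip(row, x)) for row in W]


def tucker_fusion(q, s, W_q, W_s, core, W_o):
    """Bilinear fusion m = W_o z, where
    z[k] = sum over i, j of core[i][j][k] * (W_q q)[i] * (W_s s)[j]."""
    q_t = matvec(W_q, q)  # projected question
    s_t = matvec(W_s, s)  # projected region
    t_o = len(core[0][0])
    z = [sum(core[i][j][k] * q_t[i] * s_t[j]
             for i in range(len(q_t))
             for j in range(len(s_t)))
         for k in range(t_o)]
    return matvec(W_o, z)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # Toy core tensor of shape 2x2x1 that computes the dot product of its inputs.
    core = [[[1.0], [0.0]], [[0.0], [1.0]]]
    W_o = [[1.0], [2.0]]
    print(tucker_fusion([1.0, 2.0], [3.0, 4.0], identity, identity, core, W_o))
```

Because the map is bilinear, scaling either input scales the output by the same factor, which is an easy property to check when testing a fusion module.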

2.4 Pairwise Relational Modeling

The main purpose of pairwise relational modeling is to model the relations between pairs of image regions.
First, the pairwise links between image regions are computed:
[equation image]
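As a sketch based on the MuRel paper, the pairwise step can be written as follows. The parameter names Θ_b and Θ_m and the symbols e_i and x_i are assumptions with respect to this post; b_i, m_i and B(·) are the quantities defined earlier.

```latex
% Pairwise link between regions i and j:
% a spatial term on the boxes plus a semantic term on the fused vectors
r_{i,j} = B(b_i, b_j; \Theta_b) + B(m_i, m_j; \Theta_m)

% Context of region i: element-wise max over all of its pairwise links
e_i = \max_j r_{i,j}

% Output of the pairwise step: the fused vector plus its context
x_i = m_i + e_i
```

The two terms line up with the `fusion_coord` and `fusion_feat` options of the `Pairwise` class, and the max with its `agg` option.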

The code is as follows:


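class Pairwise(nn.Module):

    def __init__(self,
                 residual=True,
                 fusion_coord={},  # options for the fusion of box coordinates
                 fusion_feat={},   # options for the fusion of region features
                 agg={}):          # options for aggregating the pairwise links
        super(Pairwise, self).__init__()
        # ... (the rest of the class is not shown)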
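Since only the constructor signature is shown, the following dependency-free sketch illustrates what a pairwise step of this kind could compute: a link for every ordered pair of regions, an element-wise max over those links, and an optional residual sum. It is a sketch under assumptions, not the body of the class above; `pairwise_step`, `product_feat_fuse` and `toy_coord_fuse` are hypothetical names, and both fusion functions are simple stand-ins for bilinear fusion.

```python
def pairwise_step(mm, coords, fuse_feat, fuse_coord, residual=True):
    """For each region i, build a link r_ij = fuse_coord(b_i, b_j) + fuse_feat(m_i, m_j)
    to every region j, take the element-wise max over j, and (optionally)
    add the result to m_i."""
    out = []
    for m_i, b_i in zip(mm, coords):
        links = []
        for m_j, b_j in zip(mm, coords):
            r_feat = fuse_feat(m_i, m_j)
            r_coord = fuse_coord(b_i, b_j)
            links.append([f + c for f, c in zip(r_feat, r_coord)])
        # zip(*links) groups the links by dimension, so max runs over j.
        context = [max(column) for column in zip(*links)]
        out.append([m + e for m, e in zip(m_i, context)] if residual else context)
    return out


def product_feat_fuse(m_i, m_j):
    """Stand-in for the feature fusion: element-wise product."""
    return [a * b for a, b in zip(m_i, m_j)]


def toy_coord_fuse(b_i, b_j, dim=2):
    """Stand-in for the coordinate fusion: negative L1 distance between the
    top-left corners, repeated over `dim` output dimensions."""
    dist = abs(b_i[0] - b_j[0]) + abs(b_i[1] - b_j[1])
    return [-dist] * dim


if __name__ == "__main__":
    mm = [[1.0, 2.0], [3.0, 0.0]]
    coords = [[0.0, 0.0, 0.5, 0.5], [0.5, 0.0, 0.5, 0.5]]
    print(pairwise_step(mm, coords, product_feat_fuse, toy_coord_fuse))
```

The double loop makes the quadratic cost in the number of regions explicit; a tensor implementation would compute all pairs in one batched call instead.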
