[Paper Notes] Fusion of Detected Objects in Text for Visual Question Answering

Introduction

In this paper, we consider visual context in addition to language and show that the right integration of visual and linguistic information can yield improvements in visual question answering.
The more general question we address in the context of this problem is how to encode visual and verbal information in a neural architecture.


The central idea of the paper: properly fusing visual and linguistic information can improve visual question answering.
More broadly, the question is how to encode visual and textual information in a neural architecture.


The problem is defined as follows:
[Figure: an example VCR question with candidate answers A and rationales R; the correct choices are checked.]
A is the answer to the question and R is the rationale for that answer (the checked options are the correct ones).

In this work we gather evidence to answer these questions by designing the Bounding Boxes in Text Transformer, B2T2 for short, a neural architecture for multimodal encoding of natural language and images, and we evaluate B2T2 on the Visual Commonsense Reasoning benchmark (VCR, Zellers et al. 2019).


  • 提出的模型:Bounding Boxes in Text Transformer (B2T2)
  • 数据集:Visual Commonsense Reasoning benchmark(VCR, Zellers et al. 2019)

The VCR dataset is human-annotated (so the quality is relatively high) and consists of fairly complex scenes.


Are text and image best integrated late, allowing for independent analysis (late fusion), or should the processing of one be conditioned on the analysis of the other (early fusion)?
In our experiments, we found that early fusion of co-references between textual tokens and visual features of objects was the most critical factor in obtaining improvements on VCR.


When fusion happens matters a great deal; the paper's conclusion is that fusing visual and textual information early is the key to the improvement.


We finally discovered that our models for VCR could be trained much more reliably when they were initialized from pretraining on Conceptual Captions (Sharma et al., 2018), a public dataset of about 3M images with captions.


Pretraining the model on Conceptual Captions first makes training on VCR much more reliable.

Problem formulation

In this work, we assume data comprised of 4-tuples (I, B, T, l) where:

  • I is an image,
  • B = [b1, . . . , bm] is a list of bounding boxes
    referring to regions of I, where each bi is identified by the lower left corner, height and width,
  • T = [t1, . . . , tn] is a passage of tokenized text, with the peculiarity that some of the tokens are not natural language, but explicit references to elements of B,
  • l is a binary label in {0, 1}.

  • A bounding box is represented by three pieces of information: its lower-left corner coordinates, its height, and its width.
  • Text representation: some tokens are replaced by references that point to bounding boxes.
  • l indicates whether the current (I, B, T) triple is a correct match (see the sketch below).
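A minimal sketch (with hypothetical field names) of what one training example looks like under this formulation:

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class BoundingBox:
    # A box b_i is identified by its lower-left corner plus height and width.
    x: float
    y: float
    height: float
    width: float


@dataclass
class Example:
    image: np.ndarray                # I: the raw image
    boxes: List[BoundingBox]         # B = [b_1, ..., b_m]
    tokens: List[str]                # T = [t_1, ..., t_n]: tokenized text
    box_refs: List[Optional[int]]    # per token: index of the box it refers to, or None for ordinary words
    label: int                       # l in {0, 1}: is (I, B, T) a correct match?
```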

We assume an image representation function Φ that converts an image, perhaps after resizing and padding, to a fixed size vector representation of dimension d.


Φ converts an image (possibly after resizing and padding) into a fixed-size vector representation of dimension d.

  • In this paper, Φ is a ResNet.

We assume a context independent token representation E in the shape of a vector of dimension h for each token and a passage level representation Ψ which operates on E(T) and returns a passage level vector representation of dimension h.


  • E: maps each token to a vector (context-independent token embeddings).
  • Ψ: maps the embedded token sequence to a passage-level vector.


The Q → A task is to choose A∗ given (I, O, Q, A). The QA → R task is to choose R∗ given (I, O, Q, A∗, R). Finally, the Q → AR task is a pipeline of the two, where a model must first correctly choose A∗ from A, then correctly choose R∗ given A∗.


  • Task objective: given a question, choose the correct answer and then the correct rationale.
  • Task pipeline: first choose the answer, then choose the rationale given the question and the chosen answer (a sketch of this pipeline follows).
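A minimal sketch of how the Q → AR pipeline composes the two stages; `score_answer` and `score_rationale` are hypothetical scoring functions standing in for the model, and an example only counts as correct if both choices are right:

```python
def q_to_ar_correct(score_answer, score_rationale,
                    image, boxes, question, answers, rationales,
                    gold_answer: int, gold_rationale: int) -> bool:
    # Q -> A: choose the highest-scoring answer candidate.
    pred_a = max(range(len(answers)),
                 key=lambda i: score_answer(image, boxes, question, answers[i]))
    if pred_a != gold_answer:
        return False
    # QA -> R: choose the rationale conditioned on the predicted answer.
    pred_r = max(range(len(rationales)),
                 key=lambda j: score_rationale(image, boxes, question,
                                               answers[pred_a], rationales[j]))
    return pred_r == gold_rationale
```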

Method

We evaluate two main architectures: “Dual Encoder”, a late fusion architecture where image and text are encoded separately and answer scores are computed as an inner product, and the full B2T2 model, an early fusion architecture where visual features are embedded on the same level as input word tokens.


The paper experiments with two architectures:

  • Dual Encoder: a late-fusion model in which image and text are encoded separately and the matching score is computed as an inner product.
  • B2T2: an early-fusion model in which visual features are embedded at the same level as the input word tokens.

Dual Encoder

[Figure: Dual Encoder architecture]

p(l = 1 | I, T) = 1 / (1 + exp(−Φ(I)ᵀ D Ψ(E(T))))
where D is a learned matrix of size d × h. In this model, co-reference information is completely ignored, and the model must rely on fixed dimensional vectors for the late fusion of textual and visual contexts. However, we found this to be surprisingly competitive on VCR compared to published baselines, perhaps due to our choice of powerful pretrained models.


  • d is the dimension of the image feature vector.
  • h is the dimension of the text feature vector.
  • Under this encoding the co-reference links between image regions and text tokens are completely ignored (a sketch of the scoring follows).
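A minimal PyTorch-style sketch of the late-fusion score, assuming `img_vec` is the d-dimensional Φ(I) and `txt_vec` is the h-dimensional Ψ(E(T)) (the class name and initialization are my own):

```python
import torch
import torch.nn as nn


class DualEncoderHead(nn.Module):
    """Late fusion: a single bilinear score between the image vector and the passage vector."""

    def __init__(self, d: int = 2048, h: int = 1024):
        super().__init__()
        self.D = nn.Parameter(torch.empty(d, h))  # the learned d x h matrix D
        nn.init.xavier_uniform_(self.D)

    def forward(self, img_vec: torch.Tensor, txt_vec: torch.Tensor) -> torch.Tensor:
        # img_vec: (batch, d), txt_vec: (batch, h)
        logits = torch.einsum("bd,dh,bh->b", img_vec, self.D, txt_vec)
        return torch.sigmoid(logits)  # p(l = 1 | I, T)
```

Because the image and the text only meet in this one inner product, no token ever sees which box it refers to; that is exactly the co-reference information the full B2T2 model adds back in.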

B2T2

[Figure: B2T2 architecture]

p(l | I, B, R, T) = exp(a_lᵀ Ψ(E′(I, B, R, T)) + b_l) / Σ_{l′ ∈ {0,1}} exp(a_{l′}ᵀ Ψ(E′(I, B, R, T)) + b_{l′})
where a_l ∈ Rʰ and b_l ∈ R for l ∈ {0, 1} are learned parameters. E′(I, B, R, T) is a non-contextualized representation for each token and of its position in text, but also of the content and position of the bounding boxes.

[Equation defining E′(I, B, R, T): for each token, the features of the bounding boxes it references (visual feature Φ(crop(I, b_i)) plus position embedding π(b_i), projected to dimension h) are added to its token embedding.]

Φ is a function to extract visual feature vectors of size d from an image, and π(b_i) denotes the embedding of b_i's shape and position information in a vector of size d.


Φ converts the image matrix into a feature vector.
π converts a box's position information (coordinates, height, width) into a vector.


To embed the position and size of a bounding box b, we introduce two new learnable embedding matrices X and Y of dimension k × d/4. Let the coordinates of the opposite corners of b be (x1, y1) and (x2, y2), after normalizing so that a bounding box covering the entire image would have x1 = y1 = 0 and x2 = y2 = k. Position embeddings are thus defined to be

  • π(b) = concat(X[x1], Y[y1], X[x2], Y[y2])

How the position embedding is computed:

  • A bounding box is represented by its two opposite corner coordinates (four numbers in total).
  • Two learnable embedding matrices X and Y are introduced; the x- and y-coordinates are embedded with them and the four resulting vectors are concatenated (a sketch follows this list).
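A rough sketch of π(b), assuming coordinates are first normalized to [0, k] and then rounded to integer bins so they can index the two embedding tables (the value of k is not given in these notes; 56 below is only a placeholder):

```python
import torch
import torch.nn as nn


class BoxPositionEmbedding(nn.Module):
    """pi(b) = concat(X[x1], Y[y1], X[x2], Y[y2]) for a box with corners (x1, y1) and (x2, y2)."""

    def __init__(self, k: int = 56, d: int = 2048):
        super().__init__()
        assert d % 4 == 0
        self.k = k
        self.X = nn.Embedding(k + 1, d // 4)  # embeds x-coordinates in {0, ..., k}
        self.Y = nn.Embedding(k + 1, d // 4)  # embeds y-coordinates in {0, ..., k}

    def forward(self, x1, y1, x2, y2, img_w, img_h):
        # Coordinate tensors are normalized so that a box covering the whole image
        # has x1 = y1 = 0 and x2 = y2 = k, then rounded to integer bins.
        def bx(v): return torch.clamp((v / img_w * self.k).round().long(), 0, self.k)
        def by(v): return torch.clamp((v / img_h * self.k).round().long(), 0, self.k)
        return torch.cat([self.X(bx(x1)), self.Y(by(y1)),
                          self.X(bx(x2)), self.Y(by(y2))], dim=-1)  # size d
```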


More formally, for a given example, let matrix R ∈ {0, 1}m×n encode the references between the bounding boxes in B and the tokens in T, so that Rij is 1 if and only if bounding box i is referenced by token j.


R is the co-reference (alignment) matrix between the bounding boxes and the text tokens.
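Putting Φ, π, and R together, here is a rough sketch of how the fused input E′ could be built. The h × d projection (`proj`) and the cropping of box regions are my assumptions based on the description above; the paper's exact formulation may differ in details:

```python
import torch
import torch.nn as nn


class B2T2InputEmbedding(nn.Module):
    """Early fusion: add projected box features to the embeddings of the tokens that reference them."""

    def __init__(self, d: int = 2048, h: int = 1024):
        super().__init__()
        self.proj = nn.Linear(d, h)  # maps Phi(crop(I, b_i)) + pi(b_i) into the token-embedding space

    def forward(self, token_emb: torch.Tensor, box_feats: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
        # token_emb: (n, h) = E(T)
        # box_feats: (m, d) = Phi(crop(I, b_i)) + pi(b_i) for each box
        # R:         (m, n), R[i, j] = 1 iff box i is referenced by token j
        box_in_token_space = self.proj(box_feats)                                 # (m, h)
        fused = token_emb + R.t().to(box_in_token_space.dtype) @ box_in_token_space
        return fused  # E'(I, B, R, T): fed into BERT exactly like ordinary token embeddings
```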

Early and Late Fusion

The key difference from “Dual Encoder” is that text, image and bounding boxes are combined at the level of the non-contextualized token representations rather than right before the classification decision.


  • In the late-fusion model, fusion happens only between the two global representations; image and text never interact before that point.
  • In the early-fusion model there is local fusion as well: before the final passage-level encoding, the co-reference links are used to fuse each box's features into the embeddings of the tokens that mention it.

Loss

All of our models are trained with binary cross entropy loss using label l.
L = −[ l · log p + (1 − l) · log(1 − p) ]


A binary cross-entropy loss is used because there are only two outcomes, correct and incorrect.
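In PyTorch terms this is just the standard binary cross entropy between the predicted match probability and the label l:

```python
import torch
import torch.nn.functional as F


def matching_loss(p: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    # p: predicted probability p(l = 1 | ...), label: the binary label l
    return F.binary_cross_entropy(p, label.float())
```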

Pretraining on Conceptual Captions

[Figure: pretraining setup on Conceptual Captions]

We use two tasks for pretraining:

  • (1) impostor identification
  • (2) masked language model prediction.
  • For the impostor task, we sample a random negative caption for each image and ask the model to predict whether the caption is correctly associated.
  • For mask-LM, we randomly replace tokens in the caption with the [MASK] token, and the model must predict the original token (see Devlin et al. (2018) for more details).

Joint multi-task pretraining (perhaps multimodal models are generally pretrained with several objectives jointly?):
Task 1: decide whether an image and a caption belong together (impostor identification).
Task 2: fill in masked tokens (mask-LM).
Bounding boxes are not used during pretraining; only whole captions and whole images are (a sketch of how such examples could be built follows).
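A rough sketch of how the two kinds of pretraining examples could be built from a batch of (image, caption) pairs; the 15% masking rate follows BERT, and the sampling details are my assumptions rather than the paper's exact procedure:

```python
import random

MASK_TOKEN = "[MASK]"


def make_impostor_example(index, images, captions):
    """Impostor task: pair the image with a randomly sampled wrong caption, labeled 0."""
    wrong = random.choice([i for i in range(len(captions)) if i != index])
    return images[index], captions[wrong], 0  # a negative (image, caption, label) triple


def make_mask_lm_example(caption_tokens, mask_prob=0.15):
    """Mask-LM task: replace random tokens with [MASK]; the model must recover the originals."""
    masked, targets = [], []
    for tok in caption_tokens:
        if random.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(tok)   # predict this token
        else:
            masked.append(tok)
            targets.append(None)  # nothing to predict at this position
    return masked, targets
```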

Implementation details

We use ResNet-152 (He et al., 2016) pretrained on ImageNet for Φ, which yields a vector representation of size d = 2048.
BERT-Large (Devlin et al., 2018) provides both E and Ψ. The latter is a pretrained Transformer with 24 layers, 16 attention heads, and hidden size 1024. For BERT, E corresponds to its token embeddings, Ψ to the [CLS] token representation in the final layer, and so Ψ(E(T)) corresponds to the BERT passage representation of size h = 1024.


Judging from the results, it is best not to fine-tune the ResNet.
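A minimal sketch of the two pretrained encoders as described here, using torchvision and Hugging Face transformers; the model names and the use of the global-pooling output are assumptions consistent with d = 2048 and h = 1024:

```python
import torch
import torchvision.models as models
from transformers import BertModel, BertTokenizer

# Phi: ResNet-152 pretrained on ImageNet with the classification head removed;
# the global-pooling output is a 2048-dimensional feature per image (or per cropped box).
resnet = models.resnet152(pretrained=True)
phi = torch.nn.Sequential(*list(resnet.children())[:-1])
for p in phi.parameters():
    p.requires_grad = False  # per the notes, the ResNet is kept frozen

# E and Psi: BERT-Large. E is its token-embedding layer; Psi is the final-layer [CLS] vector.
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
bert = BertModel.from_pretrained("bert-large-uncased")

def psi(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] representation, shape (1, 1024)
```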

Experiments

Final results

[Table: results on the VCR benchmark]

Ablation study

[Table: ablation results on the Q → A task]
The ablations are carried out only on the Q → A stage, varying:

  • whether the bounding boxes are used at all
  • early fusion vs. late fusion
  • the size of the language model
  • the size of the visual model
  • whether the model is pretrained (on Conceptual Captions)
  • whether the bounding-box position embeddings are used

Error analysis

[Figures: examples of model errors]

Cases where the model tends to fail:

  • when the referenced bounding boxes do not contain the key visual clue
  • judgments about human actions and facial expressions are inaccurate (the ResNet was trained on ImageNet)