Auto-Encoding Scene Graphs for Image Captioning: Paper Reading Notes

Original link: https://blog.csdn.net/luo3300612/article/details/90042843

Year: 2018

Introduction

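End-to-end encoder-decoder models have a problem: when given an image containing an unseen scene, the network returns little more than the salient objects, e.g. "there is a dog on the floor". Such output is hardly different from object detection.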

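Cognitive evidence suggests that visually grounded language is not end-to-end but is tied to high-level abstract symbols. If the scene is abstracted into symbols, the generation process becomes much clearer. For example, given an image whose scene abstraction is "helmet-on-human" and "road dirty", we can generate "a man with a helmet in countryside" by using a piece of common sense: a country road is dirty. This kind of inference is the inductive bias.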

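This paper incorporates the inductive bias into the encoder-decoder framework for image captioning, so that symbolic reasoning and end-to-end multi-modal feature mapping complement each other, with the scene graph ( $\mathcal{G}$ ) bridging them. A scene graph ( $\mathcal{G}$ ) is a unified representation that connects the following: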

objects(or entities)
their attributes
their relationships in an image ( $\mathcal{I}$ ) or a sentence ( $\mathcal{S}$ ), represented by directed edges

Key insight: the vector representations are expected to transfer the inductive bias from the pure language domain to the vision-language domain.

The authors propose the Scene Graph Auto-Encoder (SGAE), a sentence reconstruction network whose pipeline is $\mathcal{S}\rightarrow\mathcal{G}\rightarrow\mathcal{D}\rightarrow\mathcal{S}$ .
Here $\mathcal{D}$ is a trainable dictionary that stores node features, $\mathcal{S}\rightarrow\mathcal{G}$ uses an off-the-shelf scene graph language parser [1], and $\mathcal{D}\rightarrow\mathcal{S}$ is a trainable RNN decoder. Note that $\mathcal{D}$ is the "juice", i.e. the language inductive bias, and it is obtained by training the SGAE. Sharing $\mathcal{D}$ with the encoder-decoder pipeline $\mathcal{I}\rightarrow\mathcal{G}\rightarrow\mathcal{D}\rightarrow\mathcal{S}$ lets the language prior guide the end-to-end model. Specifically, $\mathcal{I}\rightarrow\mathcal{G}$ is a visual scene graph detector [52], and a multi-modal GCN is introduced for the $\mathcal{G}\rightarrow\mathcal{D}$ step to compensate for the shortcomings of detection. Interestingly, $\mathcal{D}$ can be viewed as a working memory that re-keys the encoded nodes on the way from $\mathcal{I}$ to $\mathcal{S}$ , yielding a more general representation with a smaller domain gap.

Contributions

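An advanced SGAE model that learns a feature representation of the language inductive bias
A multi-modal graph convolutional network that modulates scene graphs into visual representations
An SGAE-based encoder-decoder image captioner with a shared dictionary guiding the language decoding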

Encoder-Decoder

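Given an image $\mathcal{I}$ , we need to generate a sentence $\mathcal{S}=\{w_1,w_2,...,w_T\}$ . State-of-the-art image captioners share a three-stage form: an encoder, a map, and a decoder.
Typically the encoder is a convolutional neural network, the map is an attention mechanism that re-encodes the features into a more informative space, and the decoder is an RNN-based language decoder that predicts $\mathcal{S}$ . Given the ground-truth caption $\mathcal{S}^*$ and the image $\mathcal{I}$ , the model is trained either by minimizing the cross-entropy loss or by maximizing a reward through reinforcement learning.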
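The cross-entropy objective can be made concrete with a small sketch. It assumes the decoder emits one probability distribution over the vocabulary per time step and that the loss is the summed negative log-probability of the ground-truth words. The function name `sentence_cross_entropy` and the toy numbers are illustrative and do not come from the paper.

```python
import numpy as np

def sentence_cross_entropy(step_probs, target_ids):
    """Sum of -log p(w_t*) over the ground-truth words of one caption.

    step_probs: array of shape (T, V); row t is the decoder's distribution at step t.
    target_ids: length-T sequence of ground-truth word indices.
    """
    step_probs = np.asarray(step_probs, dtype=float)
    target_ids = np.asarray(target_ids)
    # Pick the probability assigned to the correct word at each step.
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    return float(-np.log(picked).sum())

# Toy example: 3 decoding steps, vocabulary of 4 words (hypothetical numbers).
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
loss = sentence_cross_entropy(probs, [0, 1, 3])
```

The loss shrinks as the decoder puts more probability on the ground-truth word at each step, and it reaches zero only when every ground-truth word has probability 1.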

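This is the basic architecture of almost all current state-of-the-art image captioning models, but it suffers from dataset bias. To address this, we use a language inductive bias, carried by the dictionary $\mathcal{D}$ .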

Next we use SGAE to learn $\mathcal{D}$ through sentence self-reconstruction with the help of scene graphs, and then equip the encoder-decoder with SGAE to form the overall image captioner. In particular, we use $\mathcal{D}$ and a multi-modal graph convolutional network to re-encode the image features.

Auto-Encoding Scene Graphs

This section describes how $\mathcal{D}$ is learned through self-reconstruction. The SGAE process consists of the following steps.

Scene Graphs

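$\mathcal{S}\rightarrow\mathcal{G}$ : from sentence to scene graph. A scene graph is a tuple $\mathcal{G}=(\mathcal{N},\mathcal{E})$ , where $\mathcal{N}$ and $\mathcal{E}$ are the sets of nodes and edges. There are three kinds of nodes in $\mathcal{N}$ : object nodes $o$ , attribute nodes $a$ , and relationship nodes $r$ . Let $o_i$ be the $i$-th object, $a_{i,l}$ the $l$-th attribute of $o_i$ , and $r_{ij}$ the relationship between $o_i$ and $o_j$ . Each node is represented by a $d$-dimensional vector, with $d=1000$ in the authors' experiments. The node features are trainable label embeddings.

The edges in $\mathcal{E}$ are of the following kinds:

If an object $o_i$ has an attribute $a_{i,l}$ , there is a directed edge from $o_i$ to $a_{i,l}$
If there is a relationship triplet $\langle o_i - r_{ij} - o_j\rangle$ , there are two directed edges: one from $o_i$ to $r_{ij}$ and one from $r_{ij}$ to $o_j$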

The paper's figure gives an example containing seven nodes and six edges.
$\mathcal{G}$ is obtained with the scene graph parser of [1].
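The node and edge rules above can be sketched as a small data structure. This is a minimal illustration, not the parser of [1]: the class `SceneGraph`, the helper `build_scene_graph`, and the example objects, attributes, and triplets are all hypothetical. The example was chosen so that it also has seven nodes and six edges, but it is not the figure from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)  # (kind, label), kind in {"o", "a", "r"}
    edges: list = field(default_factory=list)  # (source index, target index), directed

    def add_node(self, kind, label):
        self.nodes.append((kind, label))
        return len(self.nodes) - 1

def build_scene_graph(objects, attributes, triplets):
    """objects: list of labels; attributes: {object index: [labels]};
    triplets: list of (subject index, relation label, object index)."""
    g = SceneGraph()
    obj_ids = [g.add_node("o", label) for label in objects]
    for i, labels in attributes.items():
        for label in labels:
            a = g.add_node("a", label)
            g.edges.append((obj_ids[i], a))   # o_i -> a_{i,l}
    for i, rel, j in triplets:
        r = g.add_node("r", rel)
        g.edges.append((obj_ids[i], r))       # o_i -> r_ij
        g.edges.append((r, obj_ids[j]))       # r_ij -> o_j
    return g

# Hypothetical scene: "a young man wearing a helmet on a dirty road".
graph = build_scene_graph(
    objects=["man", "helmet", "road"],
    attributes={0: ["young"], 2: ["dirty"]},
    triplets=[(0, "wear", 1), (0, "on", 2)],
)
```

Each attribute contributes one node and one edge, while each triplet contributes one relationship node and two edges, so this example has 3 + 2 + 2 = 7 nodes and 2 + 4 = 6 edges.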

Graph Convolution Network

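$\mathcal{G}\rightarrow\mathcal{X}$ : the node embeddings $e_o,e_a,e_r$ are converted into context-aware embeddings $\mathcal{X}$ . $\mathcal{X}$ contains three kinds of $d$-dimensional embeddings: the relationship embedding $x_{r_{ij}}$ for relationship node $r_{ij}$ , the object embedding $x_{o_i}$ for object node $o_i$ , and the attribute embedding $x_{a_i}$ for object node $o_i$ . The authors use $d=1000$ . Four spatial graph convolutions, $g_r,g_a,g_s,g_o$ , generate these embeddings. The four networks have the same structure but independent parameters.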
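Relationship Embedding $x_{r_{ij}}$
For each triplet $\langle o_i - r_{ij} - o_j\rangle$ , $x_{r_{ij}}$ aggregates the context of the whole triplet: $g_r$ is applied to the embeddings of $o_i$ , $r_{ij}$ , and $o_j$ together.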

Attribute Embedding $x_{a_i}$
For an object node $o_i$ , $x_{a_i}$ aggregates $o_i$ with all of its attributes via $g_a$ , averaged over $N_{a_i}$ , the number of attributes of $o_i$ .

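Object Embedding $x_{o_i}$
$x_{o_i}$ must aggregate the subject and object roles that $o_i$ plays across the whole graph.
$o_j\in sbj(o_i)$ means that $o_i$ is the subject and $o_j$ is the object of a relationship, and $obj(o_i)$ covers the opposite case. The normalizer is $N_{r_i}=|sbj(o_i)|+|obj(o_i)|$ .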
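The three embeddings can be sketched in NumPy as follows. The equation images are not preserved in these notes, so several details are assumptions: each graph convolution is modeled as concatenation, one fully connected layer, and ReLU; the argument order of `g_s` and `g_o` is a guess; weights are random rather than learned; and the toy graph and dimension are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy dimension; the notes report d = 1000

def make_gcn(n_inputs):
    """Assumed layer: concatenate the inputs, one fully connected layer, ReLU."""
    W = rng.normal(scale=0.1, size=(d, n_inputs * d))
    b = np.zeros(d)
    return lambda *vecs: np.maximum(W @ np.concatenate(vecs) + b, 0.0)

# Same structure, independent parameters.
g_r, g_s, g_o = make_gcn(3), make_gcn(3), make_gcn(3)
g_a = make_gcn(2)

# Hypothetical graph: three objects, attributes on objects 0 and 2,
# and two triplets keyed by (subject index, object index).
e_o = rng.normal(size=(3, d))
e_a = {0: rng.normal(size=(1, d)), 2: rng.normal(size=(2, d))}
e_r = {(0, 1): rng.normal(size=d), (0, 2): rng.normal(size=d)}

def relationship_embedding(i, j):
    return g_r(e_o[i], e_r[(i, j)], e_o[j])

def attribute_embedding(i):
    attrs = e_a[i]  # only defined for objects that have attributes
    return sum(g_a(e_o[i], a) for a in attrs) / len(attrs)  # mean over N_{a_i}

def object_embedding(i):
    # Terms where o_i is the subject, then terms where o_i is the object.
    as_subject = [g_s(e_o[i], e_o[j], e_r[(s, j)]) for (s, j) in e_r if s == i]
    as_object = [g_o(e_o[k], e_o[i], e_r[(k, t)]) for (k, t) in e_r if t == i]
    terms = as_subject + as_object
    if not terms:  # isolated object: zero vector, an assumption of this sketch
        return np.zeros(d)
    return sum(terms) / len(terms)  # divide by N_{r_i}

x_r = relationship_embedding(0, 1)
x_a = attribute_embedding(2)
x_o = object_embedding(0)
```

In this toy graph, object 0 is the subject of both triplets, so its embedding averages two `g_s` terms, while objects 1 and 2 each receive a single `g_o` term.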

Dictionary

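This step learns $\mathcal{D}$ and uses it to re-encode the embeddings, $R(\mathcal{X};\mathcal{D})\rightarrow\hat{\mathcal{X}}$ . The core idea is to maintain a working memory that serves as a dynamic knowledge base for run-time inference. The goal of $\mathcal{D}$ is to embed the language inductive bias into language composition.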

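Concretely, this step learns a dictionary $\mathcal{D}=\{d_1,d_2,...,d_K\}\in \mathbb{R}^{d\times K}$ , with $K=10{,}000$ in the paper. Each embedding is re-encoded by attending over the dictionary entries, which is the core operation of memory networks.
$\mathcal{S}$ is then reconstructed using the attention structure of [2].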
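The re-encoding step can be sketched as a soft-attention read over the dictionary. The exact formula is not preserved in these notes, so the sketch assumes the usual memory-network read: attention weights $\alpha=\mathrm{softmax}(\mathcal{D}^\top x)$ and output $\hat{x}=\mathcal{D}\alpha$. The function name `re_encode` and the toy sizes are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def re_encode(x, D):
    """x: (d,) embedding; D: (d, K) dictionary. Returns (x_hat, alpha)."""
    alpha = softmax(D.T @ x)  # attention weights over the K dictionary entries
    x_hat = D @ alpha         # weighted sum of the dictionary columns
    return x_hat, alpha

rng = np.random.default_rng(0)
d, K = 8, 16                  # toy sizes; the notes report d = 1000, K = 10,000
D = rng.normal(size=(d, K))   # random here; learned jointly with the decoder in SGAE
x = rng.normal(size=d)
x_hat, alpha = re_encode(x, D)
```

Because the weights are non-negative and sum to one, every re-encoded vector is a convex combination of dictionary entries. That is one way to read the claim that sentence-side and image-side embeddings end up in a shared space.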

Overall Model: SGAE-based Encoder Decoder

Multi-modal Graph Convolution Network

A multi-modal graph convolution network transforms the visual features $\mathcal{V}$ into graph-modulated features $\mathcal{V}'$ . Here the scene graph $\mathcal{G}$ comes from an image scene graph parser, which consists of an object proposal detector (Faster R-CNN), an attribute classifier (a small fc-ReLU-fc-Softmax network), and a relationship classifier (MOTIFS).

The detected label embedding $e_{o_i}$ and the visual feature $v_{o_i}$ are fused into a new node feature $u_{o_i}$ .
The other embeddings, $u_{r_{ij}}$ and $u_{a_i}$ , are produced in a similar way. The image $\mathcal{G}$ differs from the sentence $\mathcal{G}$ in that the former is simpler and noisier.

Once $\mathcal{G}$ has been generated, computing the embeddings and re-encoding them proceeds as for the sentence $\mathcal{G}$ .
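The fusion of a label embedding with a visual feature can be illustrated with a stand-in. The fusion formula is not preserved in these notes, so the sketch assumes a simple additive fusion: project both inputs into a shared space, add them, and apply ReLU. The paper's actual fusion function may differ. The matrices `W1` and `W2`, the function `fuse`, and the dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_label, d_visual, d_out = 8, 12, 8  # toy sizes, chosen for illustration only

# Hypothetical projection matrices; in a real model they would be learned.
W1 = rng.normal(scale=0.1, size=(d_out, d_label))
W2 = rng.normal(scale=0.1, size=(d_out, d_visual))

def fuse(e, v):
    """Assumed fusion: project both modalities to a shared space, add, ReLU."""
    return np.maximum(W1 @ e + W2 @ v, 0.0)

e_o = rng.normal(size=d_label)   # label embedding of a detected object
v_o = rng.normal(size=d_visual)  # visual feature of the same region
u_o = fuse(e_o, v_o)             # fused node feature fed to the graph convolutions
```

The fused features $u$ then take the place that the label embeddings $e$ have in the sentence-side graph convolutions.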

Conclusion

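This paper incorporates a language inductive bias into image captioning to achieve more human-like language generation. The main approach is to use features based on the scene graph $\mathcal{G}$ and to learn and share a dictionary $\mathcal{D}$ that re-encodes these features.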