《Entangled Transformer for Image Captioning》

Latest recommended article published on 2022-05-13 21:04:29

Tiám青年


Article link: https://blog.csdn.net/xiasli123/article/details/103152484


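The paper proposes the ETA-Transformer model. It uses Entangled Attention to bridge the semantic gap between vision and language and a Gated Bilateral Controller to guide multimodal interaction, which improves image captioning performance and reaches state-of-the-art results on the MSCOCO dataset.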


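This post is fairly long, so please read it patiently; it should be worth your time. If you spot any shortcomings, feel free to discuss. Also attached: paper download link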

1. Paper Abstract

In image captioning, the typical attention mechanisms are arduous to identify the equivalent visual signals especially when predicting highly abstract words. This phenomenon is known as the semantic gap between vision and language. This problem can be overcome by providing semantic attributes that are homologous to language. Thanks to the inherent recurrent nature and gated operating mechanism, Recurrent Neural Network (RNN) and its variants are the dominating architectures in image captioning. However, when designing elaborate attention mechanisms to integrate visual inputs and semantic attributes, RNN-like variants become unflexible due to their complexities. In this paper, we investigate a Transformer-based sequence modeling framework, built only with attention layers and feedforward layers. To bridge the semantic gap, we introduce EnTangled Attention (ETA) that enables the Transformer to exploit semantic and visual information simultaneously. Furthermore, Gated Bilateral Controller (GBC) is proposed to guide the interactions between the multimodal information. We name our model as ETA-Transformer. Remarkably, ETA-Transformer achieves state-of-the-art performance on the MSCOCO image captioning dataset. The ablation studies validate the improvements of our proposed modules.

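The authors argue that in image captioning, typical attention mechanisms struggle to identify the corresponding visual signals, especially when predicting highly abstract words. This phenomenon is known as the semantic gap between vision and language, and it can be addressed by providing semantic attributes that are homologous to language. Owing to their inherently recurrent nature and gated operating mechanism, recurrent neural networks (RNNs) and their variants are the dominant architectures in image captioning. However, when elaborate attention mechanisms are designed to integrate visual inputs and semantic attributes, RNN-like variants become inflexible because of their complexity. The paper therefore investigates a Transformer-based sequence modeling framework built only from attention layers and feed-forward layers. To bridge the semantic gap, the authors introduce EnTangled Attention (ETA), which lets the Transformer exploit semantic and visual information simultaneously. They also propose a Gated Bilateral Controller (GBC) to guide the interactions between the multimodal information. Experiments show that the model achieves state-of-the-art performance on the MSCOCO dataset.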

2. Network Architecture

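The proposed model (shown in the figure below) consists of three parts: a visual sub-encoder, a semantic sub-encoder, and a multimodal decoder. Caption generation proceeds in three steps: (1) detect region proposals and semantic attributes; (2) encode the visual and semantic features separately; (3) decode word by word to obtain the final image caption. Note that residual connections, layer normalization, and embedding layers are omitted from the figure.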

The paper adopts the attention mechanism from Attention Is All You Need and builds its own framework on top of it. The framework is analyzed in detail below.
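As a reminder of the building block being reused, here is a minimal sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as defined in Attention Is All You Need. It is written in plain Python for readability and is not the authors' code. The function names are illustrative, and the learned projections and multi-head splitting of the full mechanism are left out on purpose.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors.

    queries: n_q vectors of length d_k
    keys:    n_k vectors of length d_k
    values:  n_k vectors of length d_v
    Returns (outputs, weights); weights[i][j] is how much query i
    attends to key j.
    """
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled similarity between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, values)) for c in range(d_v)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    out, w = scaled_dot_product_attention(q, k, v)
    print("weights:", w)
    print("output:", out)
```

The query is closer to the first key, so the first value vector receives the larger weight. Self-attention is the special case where queries, keys, and values all come from the same sequence.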

2.1 Dual-Way Encoder

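In most cases, a CNN (e.g., VGG or ResNet) is the first choice for encoding visual information, whereas the original Transformer encoder was designed for sequence modeling. The authors nevertheless argue that a Transformer encoder with a sophisticated design can better explore the inter- and intra-relationships between visual entities and semantic attributes. Specifically, they design a dual-way encoder consisting of two sub-encoders. Each sub-encoder is self-attentive and has the same structure: a stack of N identical blocks.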
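The structure described above (two sub-encoders with the same layout, each a stack of N identical self-attention blocks) can be sketched roughly as follows. This is a toy illustration in plain Python, not the authors' implementation. The class names (`EncoderBlock`, `SubEncoder`, `DualWayEncoder`), the dimensions, and the random weight initialization are all assumptions made for the example. Residual connections, layer normalization, multi-head splitting, and learned query/key/value projections are omitted to keep it short.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with Q = K = V = x (no learned projections)."""
    scale = 1.0 / math.sqrt(len(x[0]))
    out = []
    for q in x:
        w = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in x])
        out.append([sum(wj * v[c] for wj, v in zip(w, x)) for c in range(len(q))])
    return out


def linear(x, w):
    """Multiply each row of x (n x d_in) by the matrix w (d_in x d_out)."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(row))
             for j in range(len(w[0]))] for row in x]


class EncoderBlock:
    """One block: self-attention followed by a position-wise feed-forward layer."""

    def __init__(self, d_model, d_ff, rng):
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
        self.w2 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]

    def __call__(self, x):
        attended = self_attention(x)
        hidden = [[max(0.0, h) for h in row] for row in linear(attended, self.w1)]
        return linear(hidden, self.w2)


class SubEncoder:
    """A stack of N structurally identical blocks."""

    def __init__(self, d_model, d_ff, n_blocks, rng):
        self.blocks = [EncoderBlock(d_model, d_ff, rng) for _ in range(n_blocks)]

    def __call__(self, x):
        for block in self.blocks:
            x = block(x)
        return x


class DualWayEncoder:
    """Two sub-encoders with the same structure but separate weights."""

    def __init__(self, d_model=8, d_ff=16, n_blocks=2, seed=0):
        rng = random.Random(seed)
        self.visual = SubEncoder(d_model, d_ff, n_blocks, rng)
        self.semantic = SubEncoder(d_model, d_ff, n_blocks, rng)

    def __call__(self, region_features, attribute_embeddings):
        # Each modality is encoded independently; fusion happens in the decoder.
        return self.visual(region_features), self.semantic(attribute_embeddings)


if __name__ == "__main__":
    rng = random.Random(42)
    regions = [[rng.random() for _ in range(8)] for _ in range(5)]     # 5 image regions
    attributes = [[rng.random() for _ in range(8)] for _ in range(3)]  # 3 attributes
    visual_out, semantic_out = DualWayEncoder()(regions, attributes)
    print(len(visual_out), len(semantic_out))
```

Keeping the two streams separate at this stage means each sub-encoder models relationships within its own modality, and the interaction between modalities is left to the multimodal decoder.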
