Notes on "BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering", a bilinear superdiagonal fusion model for visual question answering


Tiám青年


Categories: Computer Vision, VQA

Article link: https://blog.csdn.net/xiasli123/article/details/103953695


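This is one of a series of reading notes on visual question answering papers. The post is fairly long, so please read it patiently; you should get something out of it. If you spot any shortcomings, discussion is always welcome.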

I. Paper Abstract

Multimodal representation learning is gaining more and more interest within the deep learning community. While bilinear models provide an interesting framework to find subtle combination of modalities, their number of parameters grows quadratically with the input dimensions, making their practical implementation within classical deep learning pipelines challenging. In this paper, we introduce BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It leverages the notion of block-term ranks, which generalizes both concepts of rank and mode ranks for tensors, already used for multimodal fusion. It allows to define new ways for optimizing the tradeoff between the expressiveness and complexity of the fusion model, and is able to represent very fine interactions between modalities while maintaining powerful mono-modal representations.

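In other words, the authors note that multimodal representation learning is attracting growing attention in the deep learning community. Bilinear models offer an appealing framework for capturing subtle combinations of modalities, but their parameter count grows quadratically with the input dimensions, which makes them hard to use in standard deep learning pipelines. The paper introduces BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It relies on the notion of block-term ranks, which generalizes both the rank and the mode ranks of a tensor, two concepts already used for multimodal fusion. This gives new ways to tune the trade-off between the expressiveness and the complexity of the fusion model, and lets the model represent very fine interactions between modalities while keeping powerful mono-modal representations.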
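To make the parameter argument concrete, here is a minimal NumPy sketch of a bilinear fusion whose weight tensor is written as a block-term decomposition, that is, a sum of R small core tensors, each multiplied by one projection matrix per mode. It is an illustration under stated assumptions, not the authors' implementation: the names `A`, `B`, `C`, `D`, the sizes `R`, `L`, `M`, `N`, and the input dimensions used in the demo are all hypothetical choices made for this example.

```python
import numpy as np


def full_bilinear_params(dx, dq, dout):
    # A full bilinear map y_k = sum_ij x_i q_j T_ijk stores the whole tensor T.
    return dx * dq * dout


def block_params(dx, dq, dout, R, L, M, N):
    # R blocks, each with a core of shape (L, M, N) and three projection matrices.
    return R * (L * M * N + dx * L + dq * M + dout * N)


def block_fusion(x, q, A, B, C, D):
    """Fuse x (dx,) and q (dq,) without ever building the dense tensor.

    A: (R, dx, L), B: (R, dq, M), C: (R, dout, N), D: (R, L, M, N).
    """
    xs = np.einsum("i,ril->rl", x, A)            # project x once per block
    qs = np.einsum("j,rjm->rm", q, B)            # project q once per block
    z = np.einsum("rl,rm,rlmn->rn", xs, qs, D)   # small bilinear map per block
    return np.einsum("rn,rkn->k", z, C)          # map back to dout and sum blocks


def dense_tensor(A, B, C, D):
    # Reference only: the dense (dx, dq, dout) tensor the factors represent.
    return np.einsum("ril,rjm,rkn,rlmn->ijk", A, B, C, D)


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the order of magnitude.
    print("full :", full_bilinear_params(2048, 2400, 1000))
    print("block:", block_params(2048, 2400, 1000, R=10, L=20, M=20, N=20))
```

The factored forward pass gives the same output as contracting the inputs with the dense tensor, which is easy to check on small random factors with `np.einsum("i,j,ijk->k", x, q, dense_tensor(A, B, C, D))`. The number of blocks and the core sizes are the knobs that trade expressiveness against parameter count.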

II. Network Architecture

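The authors' VQA model is built on a classical attention architecture (Fukui et al. 2016) and enriched with their proposed fusion scheme; the fusion model is shown in the figure below. It uses the bottom-up image features provided by (Teney et al.), which consist of a set of detected objects and their representations (see Mordan et al. and Durand et al. on detection and localization). To obtain the question embedding, the words are preprocessed and then fed into a pretrained Skip-thought encoder (Kiros et al. 2015). The output of this language model is used to produce a single vector representing the whole question, as in (Yu et al. 2018).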
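The attention step described above can be sketched roughly as follows: the question vector is fused with each detected region, the fused vector is turned into a scalar score, and the softmax of the scores weights the region features. This is only a sketch of a generic question-guided attention, not the paper's exact pipeline. The function names, the scoring vector `w`, and in particular `make_placeholder_fusion` (an elementwise product of two linear projections standing in for the real fusion module) are assumptions made for illustration.

```python
import numpy as np


def softmax(scores):
    # Numerically stable softmax over a 1-D array of scores.
    e = np.exp(scores - scores.max())
    return e / e.sum()


def attend(q, V, fuse, w):
    """Question-guided attention over region features.

    q: (dq,) question vector, V: (n_regions, dv) region features,
    fuse(q, v) -> (dh,) fused vector, w: (dh,) scoring vector.
    Returns the attention weights and the attended image vector.
    """
    scores = np.array([fuse(q, v) @ w for v in V])  # one scalar per region
    alpha = softmax(scores)                         # weights sum to 1
    v_att = alpha @ V                               # weighted sum of regions
    return alpha, v_att


def make_placeholder_fusion(dq, dv, dh, rng):
    # Placeholder fusion (NOT BLOCK): project both inputs, multiply elementwise.
    Wq = rng.standard_normal((dq, dh)) / np.sqrt(dq)
    Wv = rng.standard_normal((dv, dh)) / np.sqrt(dv)
    return lambda q, v: (q @ Wq) * (v @ Wv)
```

Any fusion with the same `fuse(q, v)` signature could be swapped in for the placeholder, which is the sense in which the attention backbone stays fixed while the fusion scheme is the part being replaced.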
