Understanding the Paper 《Deep Attention Neural Tensor Network for Visual Question Answering》

The paper proposes a deep attention neural tensor network (DA-NTN) for visual question answering. DA-NTN models images, questions, and answers with bilinear features, reasons over them with an attention module, and is optimized by regressing label distributions, which improves both accuracy and convergence speed.

1. Introduction

In this paper, we propose a novel deep attention neural tensor network (DA-NTN) for visual question answering, which can discover the joint correlations among images, questions, and answers through tensor-based representations.
First, we model one of the pairwise interactions (e.g., image and question) with bilinear features, and then encode it together with the third dimension (e.g., the answer) into a triplet via a bilinear tensor product.
Second, we decompose the correlations of different triplets by answer and question type, and propose a slice-wise attention module over the tensor to select the most discriminative reasoning process.
Third, we optimize the proposed DA-NTN by learning label regression with a KL-divergence loss. This design enables scalable training and fast convergence over large answer sets. A sketch of these components follows below.
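
To make these three ingredients concrete, here is a minimal PyTorch sketch: a bilinear tensor whose slices score a fused image-question feature against an answer embedding, a slice-wise attention that weights those scores, and a KL-divergence loss against a soft label distribution. The dimensions, module names (`DANTNScorer`, `d_qi`, `d_ans`, `n_slices`), and the exact form of the attention are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DANTNScorer(nn.Module):
    """Minimal sketch of neural-tensor scoring with slice-wise attention.
    Dimensions and the attention form are illustrative assumptions."""

    def __init__(self, d_qi: int, d_ans: int, n_slices: int):
        super().__init__()
        # Bilinear tensor: slice k scores one bilinear interaction
        # ("reasoning process") between v_qi and the answer embedding.
        self.W = nn.Parameter(0.01 * torch.randn(n_slices, d_qi, d_ans))
        # Slice-wise attention conditioned on both inputs (a simplification).
        self.attn = nn.Linear(d_qi + d_ans, n_slices)

    def forward(self, v_qi, v_a):
        # v_qi: (B, d_qi) fused image-question features
        # v_a:  (B, d_ans) answer embeddings
        # Slice-wise bilinear products s_k = v_qi^T W_k v_a -> (B, n_slices)
        s = torch.einsum('bd,kde,be->bk', v_qi, self.W, v_a)
        # Attention selects the most discriminative slices.
        alpha = F.softmax(self.attn(torch.cat([v_qi, v_a], dim=-1)), dim=-1)
        return (alpha * s).sum(dim=-1)  # one relevance score per triplet, (B,)


def kl_label_loss(scores, target_dist):
    """KL divergence between the predicted answer distribution and a soft
    target label distribution (e.g., annotator agreement frequencies)."""
    # scores: (B, |A|) relevance scores over all candidate answers
    log_p = F.log_softmax(scores, dim=-1)
    return F.kl_div(log_p, target_dist, reduction='batchmean')
```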
In this paper, we introduce answer-embedding learning into our method for three purposes. First, we want to model the relations among image-question-answer triplets to aid reasoning. Second, answer embeddings can correct misunderstandings of a question, especially for questions with complex syntactic structure. Third, answer embeddings help determine the question type and decide which reasoning process to use.

2. Model

2.1 Architecture of the open-ended VQA framework

[Figure: overall architecture of the open-ended VQA framework]

The structure in the red box is the base model, which generates the question representation V_q and the fused image-question feature vector V_qi; the structures in the two blue boxes form the proposed deep attention neural tensor network, which uses the neural tensor network to measure the correlation among image-question-answer triplets.
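
The red-box base model is only described at a high level here. As one illustrative possibility, the sketch below encodes the question with a GRU and fuses it with pooled image features by an element-wise (Hadamard) product; the encoder choice, fusion scheme, and all dimensions are assumptions for illustration, not necessarily the exact base model used in the paper.

```python
import torch
import torch.nn as nn

class BaseFusion(nn.Module):
    """Illustrative stand-in for the red-box base model: encode the question
    into V_q and fuse it with image features into V_qi."""

    def __init__(self, vocab_size, d_emb=300, d_hid=1024, d_img=2048, d_fused=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.gru = nn.GRU(d_emb, d_hid, batch_first=True)
        self.proj_q = nn.Linear(d_hid, d_fused)
        self.proj_i = nn.Linear(d_img, d_fused)

    def forward(self, q_tokens, img_feat):
        # q_tokens: (B, T) token ids; img_feat: (B, d_img) pooled CNN features
        _, h = self.gru(self.embed(q_tokens))   # h: (1, B, d_hid)
        v_q = h.squeeze(0)                      # question representation V_q
        # Element-wise product fusion of projected features -> V_qi
        v_qi = torch.tanh(self.proj_q(v_q)) * torch.tanh(self.proj_i(img_feat))
        return v_q, v_qi
```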

The goal of the VQA task is, given an image $I \in \mathcal{I}$ and a corresponding question $q \in \mathcal{Q}$, to predict an answer. Most previous work treats open-ended VQA as a classification task:

$$\hat{a} = \arg\max_{a \in \mathcal{A}} \, p(a \mid I, q; \theta)$$

where $\theta$ denotes the full set of model parameters and $\mathcal{A}$ is the set of candidate answers.
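
Assuming the hypothetical `DANTNScorer` sketch above, inference would score each candidate answer embedding against the fused feature and take the argmax, for example:

```python
import torch

@torch.no_grad()
def predict_answer(scorer, v_qi, answer_emb):
    # v_qi: (B, d_qi); answer_emb: (|A|, d_ans) embeddings of all candidates
    B = v_qi.size(0)
    scores = torch.stack(
        [scorer(v_qi, a.unsqueeze(0).expand(B, -1)) for a in answer_emb],
        dim=1,
    )                               # (B, |A|)
    return scores.argmax(dim=1)     # index of the predicted answer for each example
```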
