Abstract
Transformer strength: the superior ability of global dependency modeling.
Open problem: how to dynamically schedule the global and local dependency modeling in a Transformer has become an emerging issue.
This paper: an example-dependent routing scheme called TRAnsformer Routing (TRAR) to address this issue.
Key design: each visual Transformer layer is equipped with a routing module that selects among different attention spans.
1. Introduction
Observation:
multi-modal inference often requires visual attentions from different receptive fields
grid features:
the semantic information of grid features is more fragmented
Key problem to solve: helping Transformer networks to explore different attention spans
Method: a novel yet lightweight routing scheme called Transformer Routing (TRAR), which selects the attention span automatically
Design: each visual SA layer is equipped with a path controller that predicts the next attention span (or receptive field) based on the output of the previous step
2. Related Work
2.1. Visual Question Answering
2.2. Referring Expression Comprehension
multi-stage modeling (first generate candidate boxes, then select the best region according to the referring expression)
single-stage modeling (text-guided object detection)
2.3. Dynamic Neural Networks
Advantage: such networks can adapt their structures or parameters to the given example during inference.
3. Transformer Routing
3.1. Routing Process
Notation: X denotes the features from the previous inference step; F_1, ..., F_n are the n candidate feature-update functions (feature spaces); X' is the output of the next inference step, obtained as the routing combination X' = Σ_{i=1..n} α_i F_i(X); the routing weights α can be either soft (continuous) or hard (one-hot).
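The routing step above can be sketched in NumPy; the two candidate functions below are toy placeholders for the actual span-restricted attentions:

```python
import numpy as np

def soft_route(X, fns, alpha):
    """Soft routing: the next-step feature X' is the alpha-weighted
    sum of every candidate function F_i applied to X."""
    return sum(a * f(X) for a, f in zip(alpha, fns))

# Toy candidate functions standing in for span-restricted attentions.
fns = [lambda x: x, lambda x: 2.0 * x]
alpha = np.array([0.25, 0.75])          # soft weights, sum to 1
X = np.ones((2, 3))
Xp = soft_route(X, fns, alpha)          # 0.25*X + 0.75*2X = 1.75*X
```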
Standard self-attention (which can be viewed as a feature-update function): SA(X) = softmax(QK^T / √d) V, with Q = XW_q, K = XW_k, V = XW_v.
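A minimal NumPy sketch of standard scaled dot-product self-attention (the identity projection weights below are arbitrary stand-ins):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """SA(X) = softmax(Q K^T / sqrt(d)) V, with Q, K, V as linear
    projections of the input features X (n tokens x d dims)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # n x n attention map
    return A @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = self_attention(X, np.eye(8), np.eye(8), np.eye(8))
```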
Modification in this paper:
D is a binary mask: an entry is 1 if the corresponding position falls within the attention span and 0 otherwise, so the attention matrix (effectively an adjacency matrix) becomes relatively sparse.
The masked self-attention restricts each query to its span with D. To reduce computation, instead of running n masked attentions and α-weighting their outputs, the α-weighted sum that controls the information flow is moved onto the D matrices: the masks are merged into a single soft mask D̄ = Σ_i α_i D_i, and only one attention is computed.
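A sketch of both variants, assuming the common convention that masked-out logits are sent to -inf and that the merged soft mask is folded in additively as log D̄ (the paper's exact formulation may differ):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, D):
    """One span-restricted attention: positions with D == 0 get a
    very negative logit, so they receive (near-)zero weight."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(np.where(D > 0, logits, -1e9)) @ V

def merged_attention(Q, K, V, Ds, alpha):
    """TRAR-style simplification: merge the binary masks into one
    soft mask D_bar = sum_i alpha_i * D_i and run a single attention.
    The soft mask is folded in as an additive log-term (an assumed
    convention; fully masked entries, D_bar == 0, still get -inf)."""
    D_bar = sum(a * D for a, D in zip(alpha, Ds))
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    logits = np.where(D_bar > 0,
                      logits + np.log(np.maximum(D_bar, 1e-12)), -1e9)
    return softmax(logits) @ V
```

With a one-hot α the merged form reduces exactly to the single masked attention of the selected span, which makes a useful sanity check.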
3.2. Path Controller
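The notes do not detail the controller's architecture; a plausible minimal sketch (assumed here: global average pooling, a small two-layer MLP, then a softmax over the candidate spans) would be:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def path_controller(X, W1, b1, W2, b2):
    """Predict routing weights alpha over the candidate attention
    spans from the previous step's features X (n tokens x d dims):
    average-pool, small MLP, softmax. (Assumed architecture.)"""
    h = X.mean(axis=0)                  # global average pooling -> (d,)
    h = np.maximum(W1 @ h + b1, 0.0)    # hidden layer, ReLU
    return softmax(W2 @ h + b2)         # alpha: one weight per span

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # 5 tokens, 8 dims
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(3, 16)), np.zeros(3)   # 3 candidate spans
alpha = path_controller(X, W1, b1, W2, b2)
```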
3.3. Attention Span
The spans borrow the sliding-window design of convolution, which the authors rename the order neighborhood (order adjacency in graph theory): 1-order corresponds to a 3×3 window, 2-order to 5×5.
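The order-neighborhood masks can be generated directly from grid coordinates; a small sketch for an h×w feature grid:

```python
import numpy as np

def order_neighborhood_mask(h, w, order):
    """Binary mask D for an h x w grid: cell i may attend to cell j
    iff j lies in the (2*order+1) x (2*order+1) window around i,
    mirroring a convolution's sliding window (order 1 -> 3x3,
    order 2 -> 5x5)."""
    rows, cols = np.divmod(np.arange(h * w), w)
    dr = np.abs(rows[:, None] - rows[None, :])
    dc = np.abs(cols[:, None] - cols[None, :])
    return ((dr <= order) & (dc <= order)).astype(np.int64)
```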
3.4. Optimization
Soft routing: a continuous, differentiable α.
Hard routing: a discrete path selection via the Gumbel-max trick.
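A NumPy sketch of the Gumbel-max/Gumbel-softmax sampling (forward pass only; in a real framework the straight-through estimator would keep the hard sample differentiable):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, hard=True, rng=None):
    """Sample routing weights: add Gumbel noise to the logits and
    apply a temperature-tau softmax; with hard=True, discretize to
    a one-hot path via argmax (the Gumbel-max trick)."""
    if rng is None:
        rng = np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max())
    y = y / y.sum()
    if hard:
        one_hot = np.zeros_like(y)
        one_hot[np.argmax(y)] = 1.0
        return one_hot
    return y
```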