Paper Notes: Can Vision Transformers Perform Convolution?

I. Problem Setup and Preliminary Conclusions

Can a self-attention layer of ViT (with image patches as input) express any convolution operation?

That is: can a self-attention layer express any convolutional layer of a CNN?

Most closely related prior work studies the pixel-input setting, which differs from the patch-input setting used by ViT (the quoted passage below explains the difference):

A partial answer has been given by Cordonnier et al. (2020). They showed that a self-attention layer with a sufficient number of heads can express convolution, but they only focused on the settings where the input to the attention layer is the representations of pixels, which is impractical due to extremely long input sequence and huge memory cost. 

Preliminary conclusions:

1. We provide a constructive proof to show that a 9-head self-attention layer in Vision Transformers with image patch as the input can perform any convolution operation, where the key insight is to leverage the multi-head attention mechanism and relative positional encoding to aggregate features for computing convolution.

In short: a self-attention layer with 9 heads can express any convolutional layer; the key insight is to leverage the multi-head attention mechanism and relative positional encoding to aggregate the features needed to compute the convolution.

2. With patches as input, 9 heads are enough to express a K × K kernel, whereas with pixels as input K^{2} heads are required (e.g., 49 heads for a 7 × 7 kernel), so the patch-input setting is more head-efficient.

3. We propose a two-phase training pipeline for Vision Transformers. The key component in this pipeline is to initialize ViT from a well-trained CNN using the construction in our theoretical proof. We empirically show that with the proposed training pipeline that explicitly injects the convolutional bias, ViT can achieve much better performance compared with models trained with random initialization in low data regimes.

In short: the proposed two-phase training pipeline (initializing the ViT from a well-trained CNN using the construction from the proof) lets ViT perform much better in low-data regimes.

II. Notation

1. Bold uppercase letters denote matrices or tensors.

2. Bold lowercase letters denote vectors.

3. [m] = {1, 2, ..., m}.

4. K is the convolution kernel size and D_{out} is the number of output channels.

5. The input image is split into N patches, each flattened into a vector of dimension P^{2}C.

6. An MHSA layer takes an input in R^{N*d} and produces an output in R^{N*d_{O}}. For ViT, the input lies in R^{N*(P^{2}C)}, i.e., d = P^{2}C.
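To make these shapes concrete, here is a quick check in PyTorch (the specific sizes 224 and 16 are illustrative, not taken from the notes):

```python
import torch

# Illustrative sizes: a 224x224 RGB image with patch size 16.
H, W, C, P = 224, 224, 3, 16
x = torch.randn(1, C, H, W)

# Split into non-overlapping P x P patches and flatten each one.
patches = (x.unfold(2, P, P).unfold(3, P, P)   # (1, C, H/P, W/P, P, P)
             .permute(0, 2, 3, 1, 4, 5)
             .reshape(1, (H // P) * (W // P), P * P * C))

print(patches.shape)  # torch.Size([1, 196, 768]): N = 196 tokens, d = P^2 * C = 768
```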

III. Details

1. A MHSA layer in Vision Transformers can express a convolutional layer in the patch-input setting (Theorem 1).

Difficulty of the proof: a convolution kernel can cross patch boundaries to capture information, while each token in the patch-input setting is an entire patch.

Solution: combine positional encoding with the multi-head mechanism to aggregate features, then apply a linear projection to the aggregated features.

Proof idea of Theorem 1:

The attention computation can be split into a context-aware part (which depends on all the input tokens) and a positional attention part (which is agnostic to the input). In Equation (4) of the paper, QK^{T} and B correspond to the context-aware part and the positional attention part, respectively.
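The referenced equation is not reproduced in these notes. A plausible form of attention with a relative positional bias, consistent with the decomposition just described (the exact Equation (4) in the paper may differ in scaling and details), is:

$$\mathbf{A} \;=\; \mathrm{softmax}\!\left(\frac{(\mathbf{X}\mathbf{W}_Q)(\mathbf{X}\mathbf{W}_K)^{\top}}{\sqrt{d}} + \mathbf{B}\right), \qquad \text{output} \;=\; \mathbf{A}\,(\mathbf{X}\mathbf{W}_V)\,\mathbf{W}_O,$$

where $\mathbf{B}_{i,j}$ depends only on the relative position between tokens $i$ and $j$ (the relative positional encoding).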

Because convolution is inherently context-agnostic, the proof sets the context-aware part QK^{T} to zero and relies only on positional information: each query attends only to the key whose relative position to it is δ.
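One standard way to realize this purely positional "hard" attention (a sketch of the usual trick, not necessarily the paper's exact parameterization): for head h with target offset δ_h, choose the bias so that the softmax concentrates on the single key at that offset,

$$\mathbf{B}^{(h)}_{i,j} \;=\; \begin{cases} 0 & \text{if } \mathrm{pos}(j) - \mathrm{pos}(i) = \delta_h, \\ -M & \text{otherwise,} \end{cases}$$

so that with the context-aware part set to zero, $\mathrm{softmax}(\mathbf{B}^{(h)})$ approaches a one-hot selection of the patch at relative offset $\delta_h$ as $M \to \infty$.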

My take: the paper draws an analogy between CNNs and Transformers, with W^{O} playing the role of the convolution kernel, and relative positional encoding capturing the information that crosses patch boundaries.

The proof only uses the linearity of convolution, so other convolutional layers can be constructed in the same way.
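Below is a minimal numerical sketch of this idea (my own simplification, not the paper's exact proof): 9 heads with purely positional hard attention each copy one of the 3×3 neighboring patches (identity value projections), and a single output projection W_O, built here by probing the convolution and exploiting its linearity, reproduces a K × K convolution on patch embeddings. It assumes stride 1, zero padding, and K ≤ 2P+1; an all-zero attention row stands in for attending to a zero-padding token at the image border.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
P, C, D_out, K = 2, 3, 4, 3        # patch size, in/out channels, kernel size
H = W = 8                          # image size (multiple of P)
Gh, Gw = H // P, W // P            # patch grid
N, d = Gh * Gw, P * P * C          # number of tokens, patch embedding dim

x = torch.randn(1, C, H, W)
kernel = torch.randn(D_out, C, K, K)

# Ground truth: an ordinary convolution, then rearranged into output patches.
y = F.conv2d(x, kernel, padding=K // 2)                         # (1, D_out, H, W)
y_patches = (y.unfold(2, P, P).unfold(3, P, P)
               .permute(0, 2, 3, 1, 4, 5).reshape(N, D_out * P * P))

# Patch embeddings X in R^{N x d}; each patch is flattened as (C, P, P).
X = (x.unfold(2, P, P).unfold(3, P, P)
       .permute(0, 2, 3, 1, 4, 5).reshape(N, d))

offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # one per head

def positional_attention(dy, dx):
    """Hard, input-independent attention: patch (i, j) attends to (i+dy, j+dx).
    A zero row stands in for attending to a zero-padding token at the border."""
    A = torch.zeros(N, N)
    for i in range(Gh):
        for j in range(Gw):
            ii, jj = i + dy, j + dx
            if 0 <= ii < Gh and 0 <= jj < Gw:
                A[i * Gw + j, ii * Gw + jj] = 1.0
    return A

# Each head copies its neighbor patch (value projection = identity), then the
# 9 head outputs are concatenated, exactly as in multi-head attention.
heads = [positional_attention(dy, dx) @ X for dy, dx in offsets]
concat = torch.cat(heads, dim=1)                                # (N, 9d)

# Build the output projection W_O by probing: feed one pixel of the 3x3 patch
# neighborhood through the convolution and read off the central output patch.
# Linearity of convolution makes this row-by-row construction exact.
W_O = torch.zeros(9 * d, D_out * P * P)
for h, (dy, dx) in enumerate(offsets):
    for c in range(C):
        for py in range(P):
            for px in range(P):
                probe = torch.zeros(1, C, 3 * P, 3 * P)
                probe[0, c, P + dy * P + py, P + dx * P + px] = 1.0
                out = F.conv2d(probe, kernel, padding=K // 2)
                row = h * d + c * P * P + py * P + px
                W_O[row] = out[0, :, P:2 * P, P:2 * P].reshape(-1)

# The "attention" output matches the convolution on every patch.
print(torch.allclose(concat @ W_O, y_patches, atol=1e-4))       # True
```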

3. TWO-PHASE TRAINING OF VISION TRANSFORMERS

Goal: our theoretical insight can be used to inject convolutional bias into Vision Transformers and improve their performance in low data regimes.
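A rough outline of the pipeline in code (my sketch; the model definition, sizes, and data are placeholders, and the CNN-to-ViT weight transfer is only indicated in a comment, since it relies on the Theorem-1 construction illustrated earlier):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in low-data task: 64 labelled 32x32 images, 10 classes.
x = torch.randn(64, 3, 32, 32)
y = torch.randint(0, 10, (64,))

# Phase 1: train a CNN from scratch on the small dataset.
cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
opt = torch.optim.SGD(cnn.parameters(), lr=0.1)
for _ in range(10):
    opt.zero_grad()
    F.cross_entropy(cnn(x), y).backward()
    opt.step()

# Phase 2: build a ViT and initialize its MHSA layers so that, at
# initialization, they compute the same functions as the trained convolutions
# (relative positional attention selecting the 3x3 patch neighborhood and
# W_O carrying the learned kernels, as in the construction sketched above),
# then fine-tune the whole ViT end-to-end on the same data.
```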
