Paper Notes: Can Vision Transformers Perform Convolution?

I. Problem Setup and Preliminary Conclusions

Can a self-attention layer of ViT (with image patches as input) express any convolution operation?

That is: can a self-attention layer express any convolutional layer of a CNN?

Most closely related prior work studies the pixel-input setting, which differs from the patch-input setting used by ViT (the quoted passage below explains the difference):

A partial answer has been given by Cordonnier et al. (2020). They showed that a self-attention layer with a sufficient number of heads can express convolution, but they only focused on the settings where the input to the attention layer is the representations of pixels, which is impractical due to extremely long input sequence and huge memory cost. 

Preliminary conclusions:

1. We provide a constructive proof to show that a 9-head self-attention layer in Vision Transformers with image patch as the input can perform any convolution operation, where the key insight is to leverage the multi-head attention mechanism and relative positional encoding to aggregate features for computing convolution.

In short: a self-attention layer with 9 heads can express any convolutional layer; the key insight is to leverage the multi-head attention mechanism and relative positional encoding to aggregate the features needed to compute the convolution.

2. With patches as input, 9 heads are enough to express a K × K kernel, whereas with pixels as input K^{2} heads are required (e.g., 49 heads for a 7 × 7 kernel), so the patch-input setting is more head-efficient.

3. We propose a two-phase training pipeline for Vision Transformers. The key component in this pipeline is to initialize ViT from a well-trained CNN using the construction in our theoretical proof. We empirically show that with the proposed training pipeline that explicitly injects the convolutional bias, ViT can achieve much better performance compared with models trained with random initialization in low data regimes.

In short: the proposed two-phase training pipeline (initializing the ViT from a well-trained CNN using the construction from the proof) lets ViT perform much better in low-data regimes.

II. Notation

1. Bold uppercase letters denote matrices or tensors.

2. Bold lowercase letters denote vectors.

3. [m] = {1, 2, ..., m}.

4. K is the convolution kernel size and D_{out} is the number of output channels.

5. The input image is split into N patches, each flattened into a vector of dimension P^{2}C.

6. An MHSA layer takes an input in R^{N*d} and produces an output in R^{N*d_{O}}. For ViT, the input lies in R^{N*(P^{2}C)}, i.e., d = P^{2}C.
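To make these shapes concrete, here is a quick check in PyTorch (the specific sizes 224 and 16 are illustrative, not taken from the notes):

```python
import torch

# Illustrative sizes: a 224x224 RGB image with patch size 16.
H, W, C, P = 224, 224, 3, 16
x = torch.randn(1, C, H, W)

# Split into non-overlapping P x P patches and flatten each one.
patches = (x.unfold(2, P, P).unfold(3, P, P)   # (1, C, H/P, W/P, P, P)
             .permute(0, 2, 3, 1, 4, 5)
             .reshape(1, (H // P) * (W // P), P * P * C))

print(patches.shape)  # torch.Size([1, 196, 768]): N = 196 tokens, d = P^2 * C = 768
```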

III. Details

1. A MHSA layer in Vision Transformers can express a convolutional layer in the patch-input setting (Theorem 1).

Difficulty of the proof: a convolution kernel can cross patch boundaries to capture information, while each token in the patch-input setting is an entire patch.

Solution: combine positional encoding with the multi-head mechanism to aggregate features, then apply a linear projection to the aggregated features.

Proof idea of Theorem 1:

The attention computation can be split into a context-aware part (which depends on all the input tokens) and a positional attention part (which is agnostic to the input). In Equation (4) of the paper, QK^{T} and B correspond to the context-aware part and the positional attention part, respectively.
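The referenced equation is not reproduced in these notes. A plausible form of attention with a relative positional bias, consistent with the decomposition just described (the exact Equation (4) in the paper may differ in scaling and details), is:

$$\mathbf{A} \;=\; \mathrm{softmax}\!\left(\frac{(\mathbf{X}\mathbf{W}_Q)(\mathbf{X}\mathbf{W}_K)^{\top}}{\sqrt{d}} + \mathbf{B}\right), \qquad \text{output} \;=\; \mathbf{A}\,(\mathbf{X}\mathbf{W}_V)\,\mathbf{W}_O,$$

where $\mathbf{B}_{i,j}$ depends only on the relative position between tokens $i$ and $j$ (the relative positional encoding).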

Because convolution is inherently context-agnostic, the proof sets the context-aware part QK^{T} to zero and relies only on positional information: each query attends only to the key whose relative position to it is δ.
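One standard way to realize this purely positional "hard" attention (a sketch of the usual trick, not necessarily the paper's exact parameterization): for head h with target offset δ_h, choose the bias so that the softmax concentrates on the single key at that offset,

$$\mathbf{B}^{(h)}_{i,j} \;=\; \begin{cases} 0 & \text{if } \mathrm{pos}(j) - \mathrm{pos}(i) = \delta_h, \\ -M & \text{otherwise,} \end{cases}$$

so that with the context-aware part set to zero, $\mathrm{softmax}(\mathbf{B}^{(h)})$ approaches a one-hot selection of the patch at relative offset $\delta_h$ as $M \to \infty$.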

My take: the paper draws an analogy between CNNs and Transformers, with W^{O} playing the role of the convolution kernel, and relative positional encoding capturing the information that crosses patch boundaries.

The proof only uses the linearity of convolution, so other convolutional layers can be constructed in the same way.
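Below is a minimal numerical sketch of this idea (my own simplification, not the paper's exact proof): 9 heads with purely positional hard attention each copy one of the 3×3 neighboring patches (identity value projections), and a single output projection W_O, built here by probing the convolution and exploiting its linearity, reproduces a K × K convolution on patch embeddings. It assumes stride 1, zero padding, and K ≤ 2P+1; an all-zero attention row stands in for attending to a zero-padding token at the image border.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
P, C, D_out, K = 2, 3, 4, 3        # patch size, in/out channels, kernel size
H = W = 8                          # image size (multiple of P)
Gh, Gw = H // P, W // P            # patch grid
N, d = Gh * Gw, P * P * C          # number of tokens, patch embedding dim

x = torch.randn(1, C, H, W)
kernel = torch.randn(D_out, C, K, K)

# Ground truth: an ordinary convolution, then rearranged into output patches.
y = F.conv2d(x, kernel, padding=K // 2)                         # (1, D_out, H, W)
y_patches = (y.unfold(2, P, P).unfold(3, P, P)
               .permute(0, 2, 3, 1, 4, 5).reshape(N, D_out * P * P))

# Patch embeddings X in R^{N x d}; each patch is flattened as (C, P, P).
X = (x.unfold(2, P, P).unfold(3, P, P)
       .permute(0, 2, 3, 1, 4, 5).reshape(N, d))

offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # one per head

def positional_attention(dy, dx):
    """Hard, input-independent attention: patch (i, j) attends to (i+dy, j+dx).
    A zero row stands in for attending to a zero-padding token at the border."""
    A = torch.zeros(N, N)
    for i in range(Gh):
        for j in range(Gw):
            ii, jj = i + dy, j + dx
            if 0 <= ii < Gh and 0 <= jj < Gw:
                A[i * Gw + j, ii * Gw + jj] = 1.0
    return A

# Each head copies its neighbor patch (value projection = identity), then the
# 9 head outputs are concatenated, exactly as in multi-head attention.
heads = [positional_attention(dy, dx) @ X for dy, dx in offsets]
concat = torch.cat(heads, dim=1)                                # (N, 9d)

# Build the output projection W_O by probing: feed one pixel of the 3x3 patch
# neighborhood through the convolution and read off the central output patch.
# Linearity of convolution makes this row-by-row construction exact.
W_O = torch.zeros(9 * d, D_out * P * P)
for h, (dy, dx) in enumerate(offsets):
    for c in range(C):
        for py in range(P):
            for px in range(P):
                probe = torch.zeros(1, C, 3 * P, 3 * P)
                probe[0, c, P + dy * P + py, P + dx * P + px] = 1.0
                out = F.conv2d(probe, kernel, padding=K // 2)
                row = h * d + c * P * P + py * P + px
                W_O[row] = out[0, :, P:2 * P, P:2 * P].reshape(-1)

# The "attention" output matches the convolution on every patch.
print(torch.allclose(concat @ W_O, y_patches, atol=1e-4))       # True
```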

3. TWO-PHASE TRAINING OF VISION TRANSFORMERS

Goal: our theoretical insight can be used to inject convolutional bias into Vision Transformers and improve their performance in low data regimes.
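A rough outline of the pipeline in code (my sketch; the model definition, sizes, and data are placeholders, and the CNN-to-ViT weight transfer is only indicated in a comment, since it relies on the Theorem-1 construction illustrated earlier):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in low-data task: 64 labelled 32x32 images, 10 classes.
x = torch.randn(64, 3, 32, 32)
y = torch.randint(0, 10, (64,))

# Phase 1: train a CNN from scratch on the small dataset.
cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
opt = torch.optim.SGD(cnn.parameters(), lr=0.1)
for _ in range(10):
    opt.zero_grad()
    F.cross_entropy(cnn(x), y).backward()
    opt.step()

# Phase 2: build a ViT and initialize its MHSA layers so that, at
# initialization, they compute the same functions as the trained convolutions
# (relative positional attention selecting the 3x3 patch neighborhood and
# W_O carrying the learned kernels, as in the construction sketched above),
# then fine-tune the whole ViT end-to-end on the same data.
```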
