How Do Vision Transformers Work? [Paper Notes]

An ICLR 2022 spotlight paper.

Original paper: https://arxiv.org/abs/2202.06709

Do you really understand ViT?

The core of the paper is showing that ViT's optimization landscape is much better behaved than ResNet's.

Abstract

1. The attention mechanism helps generalization because it flattens the loss landscape.

2. ViT's gains come from high data specificity, not from long-range dependency. (Many people currently believe ViT works because it models long-range interactions, unlike a CNN, which only models short, local context.)

3. ViT suffers from non-convex losses. (Many models are affected by non-convex losses, but ViT is affected more severely.)

4. Large datasets and loss-landscape smoothing techniques can alleviate the non-convexity problem. (This is why ViT suits large datasets.)

5. MSAs and Convs behave differently: MSAs act more like low-pass filters, so MSAs and Convs are complementary.

6. MSAs can be placed at the end of each stage of a multi-stage network; the paper proposes AlterNet based on this idea.

Introduction

There is limited understanding of multi-head self-attentions (MSAs), although they are now ubiquitous in computer vision. The most widely accepted explanation for the success of MSAs is their weak inductive bias and capture of long-range dependencies (See, e.g., (Dosovitskiy et al., 2021; Naseer et al., 2021; Tuli et al., 2021; Yu et al., 2021a; Mao et al., 2021; Chu et al., 2021)). Yet because of their over-flexibility, Vision Transformers (ViTs)—neural networks (NNs) consisting of MSAs—have been known to have a tendency to overfit training datasets, consequently leading to poor predictive performance in small data regimes, e.g., image classification on CIFAR. However, we show that the explanation is poorly supported.

Training poorly on small datasets is not because ViT is too flexible (over-flexibility); there is another reason.

Self-attentions (Vaswani et al., 2017; Dosovitskiy et al., 2021) aggregate (spatial) tokens with normalized importances:

$$z_j = \sum_i \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right)_{j,i} V_i \tag{1}$$

where Q, K, and V are query, key, and value, respectively, d is the dimension of query and key, and z_j is the j-th output token. From the perspective of convolutional neural networks (CNNs), MSAs are a transformation of all feature map points with large-sized and data-specific kernels. Therefore, MSAs are at least as expressive as convolutional layers (Convs) (Cordonnier et al., 2020), although this does not guarantee that MSAs will behave like Convs.

Prior work has shown that self-attention is at least as expressive as a convolution, but that does not guarantee MSAs will behave better than CNNs: the loss is hard to optimize, and an overly large space of degrees of freedom is also a major problem.
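
To make Eq. (1) concrete, here is a minimal PyTorch sketch of single-head self-attention over spatial tokens; the tensor shapes and the lack of multi-head splitting are simplifications of mine, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over spatial tokens, following Eq. (1).

    x:   (N, d_model) token embeddings (N = number of feature-map points)
    w_*: (d_model, d) projection matrices for query / key / value
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # (N, d) each
    d = q.shape[-1]
    attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)  # (N, N); rows are positive and sum to 1
    return attn @ v                               # z_j = sum_i attn[j, i] * v_i

# toy usage: 196 tokens (a 14x14 feature map) with 64-dim embeddings
x = torch.randn(196, 64)
w_q, w_k, w_v = (torch.randn(64, 64) / 8 for _ in range(3))
z = self_attention(x, w_q, w_k, w_v)              # (196, 64)
```

The row-normalized, non-negative attention matrix is what makes each output token a weighted average of all tokens, which is exactly the "large, data-specific kernel" view above.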

Is the weak inductive bias of MSA, such as modeling long-range dependencies, beneficial for the predictive performance? To the contrary, appropriate constraints may actually help a model learn strong representations. For example, local MSAs (Yang et al., 2019; Liu et al., 2021; Chu et al., 2021), which calculate self-attention only within small windows, achieve better performance than global MSAs not only on small datasets but also on large datasets, e.g., ImageNet-21K.

The locality constraint from CNNs is still useful for MSAs. For example, local MSAs have been shown to beat global MSAs on small datasets (and even on large ones such as ImageNet-21K).

In addition, prior works observed that MSAs have the following intriguing properties: (1) MSAs improve the predictive performance of CNNs (Wang et al., 2018; Bello et al., 2019; Dai et al., 2021; Guo et al., 2021; Srinivas et al., 2021), and ViTs predict well-calibrated uncertainty (Minderer et al., 2021). (2) ViTs are robust against data corruptions, image occlusions (Naseer et al., 2021), and adversarial attacks (Shao et al., 2021; Bhojanapalli et al., 2021; Paul & Chen, 2022; Mao et al., 2021). They are particularly robust against high-frequency noises (Shao et al., 2021). (3) MSAs closer to the last layer significantly improve predictive performance (Graham et al., 2021; Dai et al., 2021).

Compared with CNNs, MSAs are more robust, better at withstanding occlusion, and better at resisting high-frequency noise.

These empirical observations raise immediate questions: (1) What properties of MSAs do we need to better optimize NNs? Do the long-range dependencies of MSAs help NNs learn? (2) Do MSAs act like Convs? If not, how are they different? (3) How can we harmonize MSAs with Convs? Can we just leverage their advantages?

We provide an explanation of how MSAs work by addressing them as a trainable spatial smoothing of feature maps, because Eq. (1) also suggests that MSAs average feature map values with positive importance-weights. Even non-trainable spatial smoothings, such as a small 2 × 2 box blur, help CNNs see better (Zhang, 2019; Park & Kim, 2021). These simple spatial smoothings not only improve accuracy but also robustness by spatially ensembling feature map points and flattening the loss landscapes (Park & Kim, 2021). Remarkably, spatial smoothings have the properties (1)–(3) of MSAs. See Appendix B for detailed explanations of MSAs as a spatial smoothing.

An MSA acts roughly like an added, trainable spatial smoothing layer.
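
To see the "trainable spatial smoothing" analogy, it helps to compare with a non-trainable smoothing. The sketch below applies a small 2 × 2 box blur to a feature map with `F.avg_pool2d`; the feature-map shape is an arbitrary assumption for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical CNN feature map: (batch, channels, height, width).
feat = torch.randn(1, 256, 14, 14)

# Non-trainable spatial smoothing: a small 2x2 box blur (equal weights, stride 1),
# cropped back to the original spatial size.
blurred = F.avg_pool2d(feat, kernel_size=2, stride=1, padding=1)[..., :14, :14]

# An MSA performs the same kind of averaging, but with learned, input-dependent,
# positive weights over all spatial positions instead of a fixed 2x2 window.
print(feat.shape, blurred.shape)   # both torch.Size([1, 256, 14, 14])
```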

Contribution

What properties of MSAs do we need to improve optimization? We present various pieces of evidence to support that MSA is a generalized spatial smoothing. This means that MSAs improve performance because their formulation, Eq. (1), is an appropriate inductive bias; their weak inductive bias disrupts NN training. In particular, a key feature of MSAs is their data specificity, not long-range dependency. As an extreme example, local MSAs with a 3×3 receptive field outperform global MSAs because they reduce unnecessary degrees of freedom.

In other words, an MSA acts roughly like an added trainable smoothing layer; that is, it provides a good inductive bias. Its effectiveness comes from data specificity, not from long-range dependency, and appropriately reducing the degrees of freedom is beneficial.

How do MSAs improve performance? MSAs have their advantages and disadvantages. On the one hand, they flatten loss landscapes as shown in Fig. 1. The flatter the loss landscape, the better the performance and generalization (Li et al., 2018; Keskar et al., 2017; Santurkar et al., 2018; Foret et al., 2021; Chen et al., 2022). Thus, they improve not only accuracy but also robustness in large data regimes. On the other hand, MSAs allow negative Hessian eigenvalues in small data regimes. This means that the loss landscapes of MSAs are non-convex, and this non-convexity disturbs NN optimization (Dauphin et al., 2014). Large amounts of training data suppress negative eigenvalues and convexify the losses.

Advantage of MSAs: they flatten the loss landscape.

Disadvantage: some of the MSA's Hessian eigenvalues are negative, so the loss landscape is non-convex, which hurts NN optimization. Large-scale datasets suppress the negative eigenvalues and convexify the loss. (Hessian eigenvalues can be thought of as describing the curvature along optimization directions; a negative eigenvalue roughly means the loss bends the opposite way along that direction.)

Figure 1a visualizes the loss landscape. The flatter the loss surface, the better; the figure shows that ResNet's surface is not particularly flat, while ViT's is flatter. Figure 1b plots the training trajectory in polar coordinates: the numerator of r_t is the distance between the weights at the current epoch and the optimal weights, while the denominator involves the initialization and is a constant, so r_t essentially tracks w_t − w_optim and a larger r_t means the model is farther from the optimum. θ_t is the angle of the trajectory; my own reading is that it sweeps over the course of training (roughly epochs 0 to 300). From Fig. 1b, ResNet at one point moves away from the optimum during training, whereas ViT's trajectory is comparatively smooth. Figure 1c shows the Hessian maximum eigenvalues: ViT's eigenvalues stay close to zero, while ResNet's are much larger (small eigenvalues indicate a more stable landscape with gentler gradients). As the epochs progress, ViT's eigenvalues gradually grow.
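
For reference, the largest Hessian eigenvalue in plots like Fig. 1c can be estimated without materializing the full Hessian. The sketch below uses power iteration on Hessian-vector products via `torch.autograd.grad`; this is my assumption about a typical procedure, not the authors' released code, and the tiny model and random batch are placeholders.

```python
import torch
import torch.nn.functional as F

def top_hessian_eigenvalue(loss_fn, params, n_iter=20):
    """Estimate the dominant Hessian eigenvalue of the loss w.r.t. `params`
    by power iteration on Hessian-vector products (no explicit Hessian)."""
    grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    vnorm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
    v = [vi / vnorm for vi in v]
    eig = 0.0
    for _ in range(n_iter):
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))        # (grad . v)
        hv = torch.autograd.grad(gv, params, retain_graph=True)    # H v
        eig = sum((h * vi).sum() for h, vi in zip(hv, v)).item()   # Rayleigh quotient v.Hv
        hnorm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (hnorm + 1e-12) for h in hv]
    return eig   # a negative value indicates local non-convexity along some direction

# toy usage with a placeholder model and random batch (purely illustrative)
model = torch.nn.Linear(10, 2)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
print(top_hessian_eigenvalue(lambda: F.cross_entropy(model(x), y),
                             list(model.parameters())))
```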

We show that MSAs and Convs exhibit opposite behaviors. MSAs aggregate feature maps, but Convs diversify them. Moreover, as shown in Fig. 2a, the Fourier analysis of feature maps shows that MSAs reduce high-frequency signals, while Convs, conversely, amplify high-frequency components. In other words, MSAs are low-pass filters, but Convs are high-pass filters. In addition, Fig. 2b indicates that Convs are vulnerable to high-frequency noise but that MSAs are not. Therefore, MSAs and Convs are complementary.

MSAs act as low-pass filters, while Convs act as high-pass filters.

(I could not fully make sense of Figure 2; it is rather cluttered.)
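
For what it's worth, the Fourier analysis behind Fig. 2a can be approximated as follows: take the 2D FFT of a feature map, shift the zero frequency to the center, and compare the amplitude in the high-frequency band against the total. The metric and shapes below are my own illustrative choices, not the paper's exact measurement (the paper reports Δ log amplitude).

```python
import torch
import torch.nn.functional as F

def high_freq_ratio(feat):
    """Share of Fourier amplitude outside the central (low-frequency) band
    of a (channels, H, W) feature map. Illustrative metric only."""
    amp = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1)).abs()
    _, h, w = amp.shape
    low = amp[:, h // 4: 3 * h // 4, w // 4: 3 * w // 4].sum()   # central quarter
    return ((amp.sum() - low) / amp.sum()).item()

feat = torch.randn(64, 14, 14)                                    # hypothetical feature map
smoothed = F.avg_pool2d(feat.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)  # low-pass stand-in
print(high_freq_ratio(feat), high_freq_ratio(smoothed))           # the smoothed map scores lower
```

A low-pass, MSA-like operation shrinks this ratio, while a high-pass, Conv-like operation raises it, which is the qualitative contrast Fig. 2a draws.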

Methods

The diagram on the left is ResNet, the middle one is ViT, and the right one is the paper's newly proposed design (AlterNet).
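
A rough sketch of the alternating build-up described above, with hypothetical placeholder blocks (`conv_block` and `msa_block` are stand-ins of mine, not the paper's exact modules): each stage stacks Conv blocks and ends with a (local) MSA block.

```python
import torch
import torch.nn as nn

def make_stage(conv_block, msa_block, depth):
    """One stage in the spirit of AlterNet: Conv blocks, with an MSA block
    placed at the end of the stage."""
    layers = [conv_block() for _ in range(depth - 1)]
    layers.append(msa_block())   # MSA at the end of the stage
    return nn.Sequential(*layers)

# hypothetical stand-in blocks, only to make the sketch runnable
conv_block = lambda: nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
msa_block = lambda: nn.Conv2d(64, 64, 1)   # placeholder where a local MSA block would sit
stage = make_stage(conv_block, msa_block, depth=4)
out = stage(torch.randn(1, 64, 14, 14))    # (1, 64, 14, 14)
```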

The stronger the inductive biases, the stronger the representations (not regularizations). Do models with weak inductive biases overfit training datasets? To address this question, we provide two criteria on CIFAR-100: the error of the test dataset and the cross-entropy, or the negative log-likelihood, of the training dataset (NLLtrain, the lower the better). See Fig. 6a for the results.
Contrary to our expectations, experimental results show that the stronger the inductive bias, the lower both the test error and the training NLL. This indicates that ViT does not overfit training datasets. In addition, appropriate inductive biases, such as locality constraints for MSAs, help NNs learn strong representations. We also observe these phenomena on CIFAR-10 and ImageNet as shown in Fig. C.1. Figure C.2 also supports that weak inductive biases disrupt NN training. In this experiment, extremely small patch sizes for the embedding hurt the predictive performance of ViT.

For ViT, overfitting is not the issue. In the figure, the blue dots are the training set and the triangles are the test set.
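
For clarity, the two criteria are just the average training cross-entropy and the top-1 test error. A minimal sketch, assuming standard (model, loader) objects; the toy model and dataset are placeholders of mine.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def nll_and_error(model, loader):
    """Average cross-entropy (NLL) and top-1 error rate over a data loader."""
    model.eval()
    nll_sum, wrong, n = 0.0, 0, 0
    for x, y in loader:
        logits = model(x)
        nll_sum += F.cross_entropy(logits, y, reduction="sum").item()
        wrong += (logits.argmax(dim=1) != y).sum().item()
        n += y.numel()
    return nll_sum / n, wrong / n

# toy usage: NLL_train comes from the training loader, the error from the test loader
model = torch.nn.Linear(10, 2)
data = torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = torch.utils.data.DataLoader(data, batch_size=16)
print(nll_and_error(model, loader))
```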

Loss landscape smoothing methods aid ViT training. Loss landscape smoothing methods can also help ViT learn strong representations. In classification tasks, global average pooling (GAP) smoothens the loss landscape by strongly ensembling feature map points (Park & Kim, 2021). We demonstrate how the loss smoothing method can help ViT improve performance by analyzing ViT with a GAP classifier instead of the CLS token on CIFAR-100. Figure 5 shows the Hessian max eigenvalue spectrum of the ViT with GAP. As expected, the result shows that the GAP classifier suppresses negative Hessian max eigenvalues, suggesting that GAP convexifies the loss. Since negative eigenvalues disturb NN optimization, the GAP classifier improves the accuracy by +2.7 percentage points. Likewise, Sharpness-Aware Minimization (SAM) (Foret et al., 2021), an optimizer that relies on the local smoothness of the loss function, also helps NNs seek out smooth minima. Chen et al. (2022) showed that SAM improves the predictive performance of ViT.

Loss-smoothing methods such as SAM (and the GAP classifier) help improve ViT's results.
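
A minimal sketch of the GAP-head swap, assuming a generic ViT encoder that outputs a (batch, tokens, dim) tensor; the dimensions and `num_classes` are placeholders of mine.

```python
import torch
import torch.nn as nn

class GAPHead(nn.Module):
    """Classify from the mean of all spatial tokens (GAP) instead of a CLS token."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):               # tokens: (B, N, dim); no CLS token needed
        return self.fc(tokens.mean(dim=1))   # average over the N spatial tokens

tokens = torch.randn(8, 196, 384)                     # hypothetical ViT encoder output
logits = GAPHead(dim=384, num_classes=100)(tokens)    # (8, 100), e.g. for CIFAR-100
```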

MSAs flatten the loss landscape. Another property of MSAs is that they reduce the magnitude of Hessian eigenvalues. Figure 1c and Fig. 4 show that the eigenvalues of ViT are significantly smaller than those of CNNs. While large eigenvalues impede NN training (Ghorbani et al., 2019), MSAs can help NNs learn better representations by suppressing large Hessian eigenvalues. Figure 1a and Fig. 1b also support this claim. In Fig. 1a, we visualize the loss landscapes by using filter normalization (Li et al., 2018). The loss landscape of ViT is flatter than that of ResNet, and this trend is noticeable at the boundary. Similarly, Fig. 1b shows that ResNet follows an irregular trajectory, especially in the early phase of training; ViT converges to the optimum along a smooth trajectory. In large data regimes, the negative Hessian eigenvalues (the disadvantage of MSAs) disappear, and only their advantages remain. As a result, ViTs outperform CNNs on large datasets, such as ImageNet and JFT (Sun et al., 2017). PiT and Swin also flatten the loss landscapes. See Fig. C.4.

MSAs flatten the loss landscape, a property that could also be brought into CNNs.
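
Filter normalization (Li et al., 2018), used for Fig. 1a, rescales a random direction so that each filter has the same norm as the corresponding trained filter, then evaluates the loss along that direction. Below is a hedged one-dimensional sketch (the actual figures sample a 2-D grid over two such directions); the toy model and batch are placeholders.

```python
import torch
import torch.nn.functional as F

def filter_normalized_direction(model):
    """Random direction rescaled so each filter (output unit) matches the norm
    of the corresponding trained filter, as in Li et al. (2018)."""
    direction = []
    for p in model.parameters():
        r = torch.randn_like(p)
        if p.dim() > 1:   # per-filter scaling along the first (output) dimension
            scale = p.flatten(1).norm(dim=1) / (r.flatten(1).norm(dim=1) + 1e-10)
            r = r * scale.view(-1, *([1] * (p.dim() - 1)))
        else:             # biases and other vectors: whole-tensor scaling
            r = r * p.norm() / (r.norm() + 1e-10)
        direction.append(r)
    return direction

def loss_along_direction(model, loss_fn, direction, alphas):
    """Evaluate the loss at w + alpha * d for each alpha (a 1-D landscape slice)."""
    base = [p.detach().clone() for p in model.parameters()]
    losses = []
    for a in alphas:
        with torch.no_grad():
            for p, b, d in zip(model.parameters(), base, direction):
                p.copy_(b + a * d)
        losses.append(loss_fn(model).item())
    with torch.no_grad():
        for p, b in zip(model.parameters(), base):   # restore the trained weights
            p.copy_(b)
    return losses

# toy usage with a placeholder model and batch
model = torch.nn.Linear(10, 2)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
d = filter_normalized_direction(model)
print(loss_along_direction(model, lambda m: F.cross_entropy(m(x), y),
                           d, alphas=[-1.0, -0.5, 0.0, 0.5, 1.0]))
```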

A key feature of MSAs is data specificity (not long-range dependency). The two distinguishing features of MSAs are long-range dependency and data specificity, also known as data dependency, as discussed in Section 1.1. Contrary to popular belief, the long-range dependency hinders NN optimization. To demonstrate this, we analyze convolutional ViT, which consists of two-dimensional convolutional MSAs (Yang et al., 2019) instead of global MSAs. Convolutional MSAs calculate self-attention only between feature map points in convolutional receptive fields after unfolding the feature maps in the same way as convolutions.

Long-range dependency is actually of little use to neural networks; if anything, it hinders optimization.
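
A hedged sketch of a convolutional (local) MSA: unfold the feature map into 3 × 3 neighbourhoods with `F.unfold` and compute attention only between each position and its own window. Using the raw features as Q, K, and V (no learned projections, single head) is a simplification of mine, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def local_self_attention(x, kernel_size=3):
    """Single-head local MSA: each position attends only to its k x k neighbourhood.
    x: (B, C, H, W); Q = K = V = x for brevity (no learned projections)."""
    b, c, h, w = x.shape
    pad = kernel_size // 2
    # keys/values inside each sliding window: (B, C, k*k, H*W)
    kv = F.unfold(x, kernel_size, padding=pad).view(b, c, kernel_size ** 2, h * w)
    q = x.view(b, c, 1, h * w)                                         # one query per position
    attn = F.softmax((q * kv).sum(1, keepdim=True) / c ** 0.5, dim=2)  # (B, 1, k*k, H*W)
    out = (attn * kv).sum(2)                                           # weighted sum over window
    return out.view(b, c, h, w)

z = local_self_attention(torch.randn(2, 64, 14, 14))                   # (2, 64, 14, 14)
```

The attention weights here are still data-specific, but the receptive field is restricted to a 3 × 3 window, which is the constraint the paper credits for the gains of local MSAs.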
