HOW DO VISION TRANSFORMERS WORK: Summary

Latest recommended article published on 2024-09-14 19:18:15

小ccccc

Tags: deep learning

Article link: https://blog.csdn.net/w18013886857/article/details/134340005

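MSAs (multi-head self-attentions) improve the performance of CNNs and are robust to perturbations; MSAs placed near the last layers significantly improve predictive performance.
MSAs can model long-range dependencies, but their inductive bias is weaker than that of CNNs.
With the same amount of data, the stronger the inductive bias, the stronger the network's representations.
Compared with ResNet, ViT's non-convex losses lead to poor performance. The upside of non-convexity is greater expressive power (more room for improvement): when enough data is available this becomes an advantage, and the loss can be optimized well. A more convex loss, by contrast, tends to confine the model to one limited optimum and keep it from reaching better solutions.
For the figure below: the closer the curve stays to 0, the flatter (smoother) the loss; the fewer negative values it has, the more convex the loss.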
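
To make "negative values mean non-convex, values near 0 mean flat" concrete, here is a minimal sketch. It assumes the curve in the figure is built from eigenvalues of the loss Hessian, which the note itself does not state. The two-parameter toy losses (`flat_bowl`, `sharp_bowl`, `saddle`) and the finite-difference helper are hypothetical illustrations, not the paper's procedure.

```python
import math

def hessian_2d(f, x, y, h=1e-3):
    """Central finite-difference Hessian of a two-parameter loss f at (x, y).

    Returns the entries (fxx, fxy, fyy) of the symmetric 2x2 Hessian.
    """
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return fxx, fxy, fyy

def eigenvalues_sym_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius

# Hypothetical toy losses, chosen only to show the three cases.
def flat_bowl(x, y):   # convex and flat: small positive eigenvalues
    return 0.1 * x ** 2 + 0.1 * y ** 2

def sharp_bowl(x, y):  # convex but sharp: large positive eigenvalues
    return 5.0 * x ** 2 + 5.0 * y ** 2

def saddle(x, y):      # non-convex: one negative eigenvalue
    return x ** 2 - y ** 2

for name, loss in [("flat_bowl", flat_bowl), ("sharp_bowl", sharp_bowl), ("saddle", saddle)]:
    low, high = eigenvalues_sym_2x2(*hessian_2d(loss, 0.5, -0.3))
    print(f"{name}: min eigenvalue {low:.3f}, max eigenvalue {high:.3f}")
```

A loss whose eigenvalues are all small and positive is both flat and convex; a negative eigenvalue marks a direction in which the loss curves downward.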

Loss landscape smoothing methods aid ViT training: the more heads an MSA has and the higher the embedding dimension per head, the more convex and the flatter the loss landscape. (Multi-heads and high embedding dimensions per head convexify and flatten loss landscapes.)

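A key feature of MSAs is data specificity (not long-range dependency): in many cases global attention is not actually needed. The idea of attention is that the weights applied to the data are determined by the values of the data itself.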
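
A minimal sketch of what "the weights are determined by the data" means, under simplifying assumptions that are not from the note: tokens are plain scalars, queries and keys are the raw tokens (no learned projections, no scaling), and `CONV_KERNEL` is a hypothetical fixed kernel used only for contrast.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(tokens):
    """Row i holds the weights token i places on every token.

    Assumption: Q = K = the raw scalar tokens, so the logit is just q * k.
    """
    return [softmax([q * k for k in tokens]) for q in tokens]

def self_attention(tokens):
    """Each output is a weighted average of the tokens (V = the raw tokens)."""
    return [sum(w * v for w, v in zip(row, tokens))
            for row in attention_weights(tokens)]

def conv1d(signal, kernel):
    """'Valid' 1D cross-correlation with a fixed kernel."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

CONV_KERNEL = [0.25, 0.5, 0.25]  # hypothetical; stays the same for every input

a = [1.0, 2.0, 3.0]
b = [3.0, 0.5, 1.0]
print("attention weights for a:", attention_weights(a)[0])
print("attention weights for b:", attention_weights(b)[0])  # different input, different weights
print("conv kernel for a and b:", CONV_KERNEL)               # same weights regardless of input
```

The attention rows change whenever the input changes, while the convolution applies the same kernel to every input.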

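Convs are data-agnostic and channel-specific. (Convolution weights are fixed and independent of the data.)
MSAs are data-specific and channel-agnostic. (The weights depend only on the data.)
MSAs and Convs are complementary.
MSAs are low-pass filters, but Convs are high-pass filters. Calling MSAs low-pass filters means that they smooth or attenuate high-frequency variation across the sequence and keep the lower-frequency, shared features. Calling Convs high-pass filters means that convolutions emphasize high-frequency variation in the sequence and capture local detail.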
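
One way to see the low-pass claim in a toy setting: a single attention row is a set of softmax weights (all positive, summing to 1), so treated as a 1D filter it passes the constant component unchanged and cannot amplify any frequency. A convolution kernel is free to take mixed-sign weights, and such a kernel can act as a high-pass filter. The sketch below is an illustration under those assumptions; the logits and the Laplacian-like `edge_kernel` are hypothetical, and this is not the paper's Fourier analysis.

```python
import cmath
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gain(weights, omega):
    """Magnitude of the frequency response of a 1D filter at angular frequency omega."""
    return abs(sum(w * cmath.exp(-1j * omega * k) for k, w in enumerate(weights)))

attn_row = softmax([0.3, 1.2, -0.5])  # hypothetical attention logits for one query
edge_kernel = [-1.0, 2.0, -1.0]       # hypothetical mixed-sign conv kernel

for name, weights in [("attention row", attn_row), ("edge kernel", edge_kernel)]:
    low = gain(weights, 0.0)        # response to a constant signal
    high = gain(weights, math.pi)   # response to the fastest alternating signal
    print(f"{name}: gain at low frequency {low:.3f}, at high frequency {high:.3f}")
```

The attention row keeps the low-frequency component and damps the alternating one, while the mixed-sign kernel removes the constant component and amplifies the alternating one.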

How to combine MSAs and Convs. Build-up rule:
- Alternately replace Conv blocks with MSA blocks from the end of a baseline CNN model.
- If the added MSA block does not improve predictive performance, replace a Conv block located at the end of an earlier stage with an MSA block.
- Use more heads and higher hidden dimensions for MSA blocks in late stages.
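
The build-up rule above can be sketched as a greedy search over block layouts. Everything named here is an assumption for illustration: `build_up`, the `improves` callback (standing in for "train the candidate and compare predictive performance"), and the doubling schedule in `heads_for_stage` are hypothetical, not the paper's code.

```python
import copy

def build_up(stage_depths, improves):
    """Greedy sketch of the build-up rule.

    stage_depths: number of blocks per stage of the baseline CNN, e.g. [3, 4, 6, 3].
    improves: hypothetical callback; takes a candidate layout and returns True
              if it performs better than the current one.
    Returns a list of stages, each a list of "conv" / "msa" block labels.
    """
    if not stage_depths:
        return []
    layout = [["conv"] * depth for depth in stage_depths]
    stage = len(layout) - 1
    pos = stage_depths[stage] - 1          # start at the very last block
    while stage >= 0:
        if pos >= 0:
            candidate = copy.deepcopy(layout)
            candidate[stage][pos] = "msa"
            if improves(candidate):
                layout = candidate
                pos -= 2                   # alternate: leave one Conv block in between
                continue
        # Stage exhausted, or the added MSA did not help:
        # move to the end of the earlier stage.
        stage -= 1
        if stage >= 0:
            pos = stage_depths[stage] - 1
    return layout

def heads_for_stage(stage_index, base_heads=3):
    """Hypothetical schedule: more heads in later stages (doubling per stage)."""
    return base_heads * 2 ** stage_index

# Usage with a stand-in callback that accepts at most one MSA block per stage.
at_most_one = lambda layout: all(blocks.count("msa") <= 1 for blocks in layout)
print(build_up([3, 4], at_most_one))
print([heads_for_stage(i) for i in range(4)])
```

In practice the callback is the expensive part, since each candidate layout has to be trained and evaluated before the next replacement is tried.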

Summary:

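With a large amount of data, choose a model with moderate inductive biases so that it can learn stronger representations, e.g. SwinT or PiT.
With little data, the non-convex loss of a model with moderate inductive biases hampers optimization, so the model cannot learn good representations; choose a model with strong inductive biases instead, e.g. ResNet.
The key is to find a balance: as the amount of data grows, we need a more expressive model. Its inductive biases are weaker, but once trained well it represents the whole dataset better.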