How to understand "Vision Transformer has much less image-specific inductive bias than CNNs": inductive bias in CNNs

Link to this article: https://blog.csdn.net/tantangyueyue/article/details/131865256

While reading the paper "AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE", I came across the following passage:

We note that Vision Transformer has much less image-specific inductive bias than
CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are
baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood
structure is used very sparingly: in the beginning of the model by cutting the image into patches and
at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information
about the 2D positions of the patches and all spatial relations between the patches have to be learned
from scratch.

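The passage says that CNNs have inductive biases that self-attention lacks. As a result, in computer vision, training a CNN requires far less data than training a Transformer.
So what is inductive bias?
Inductive bias is essentially prior knowledge that is built into the model. In CNNs it generally includes: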

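locality: things that are close together in an image tend to have more closely related features. For example, a chair is naturally likely to be next to a table, and a pair of chopsticks next to a rice bowl. This idea that the closer two things are, the more strongly they are correlated is locality. A convolution encodes it by making each output value depend only on a small neighborhood of the input.
translation equivariance: because the same convolution kernel is applied at every position, an object produces the same response no matter where it appears in the image, and that response moves with the object. In other words, shifting the input and then convolving gives the same result as convolving and then shifting: f(g(x)) = g(f(x)), where f is the convolution and g is a translation. This is translation equivariance.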
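The translation equivariance property can be checked numerically. The following is a minimal sketch in pure Python, not code from the paper: the helper names `circular_conv2d` and `shift`, the toy image, and the kernel are all made up for illustration, and wrap-around (circular) borders are assumed so that the property holds exactly. With zero padding it only holds away from the image edges.

```python
def circular_conv2d(image, kernel):
    """Cross-correlate `image` with `kernel`, wrapping around at the borders."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Each output value depends only on a kh x kw neighborhood (locality),
            # and the same kernel weights are reused at every position.
            out[i][j] = sum(
                kernel[a][b] * image[(i + a) % h][(j + b) % w]
                for a in range(kh)
                for b in range(kw)
            )
    return out


def shift(image, dy, dx):
    """Cyclically shift `image` down by `dy` rows and right by `dx` columns."""
    h, w = len(image), len(image[0])
    return [[image[(i - dy) % h][(j - dx) % w] for j in range(w)] for i in range(h)]


# Toy 5x5 "image" containing a small 2x2 object (illustrative values).
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 3, 4, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
kernel = [[1, 0], [0, -1]]  # arbitrary 2x2 kernel

conv_then_shift = shift(circular_conv2d(image, kernel), 1, 2)
shift_then_conv = circular_conv2d(shift(image, 1, 2), kernel)

# Equivariance: f(g(x)) == g(f(x))
equivariant = conv_then_shift == shift_then_conv
print(equivariant)
```

Note that the feature map of the shifted image is not identical to the original feature map; it is the original feature map moved by the same offset. That is the difference between equivariance and invariance.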

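The way self-attention is computed, however, means that apart from its MLP layers the Transformer has no comparable inductive bias, so it needs large amounts of training data to make up for what a CNN gets from its inductive bias.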
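One way to see why the spatial relations between patches "have to be learned from scratch" is that self-attention without position embeddings treats the patches as an unordered set: shuffling the input patches only shuffles the outputs in the same way. The sketch below illustrates this under simplifying assumptions that are mine, not the paper's: a single head, identity query/key/value projections, and tiny hand-written 2-dimensional "patch embeddings". The function name `self_attention` is hypothetical.

```python
import math


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Every token attends to every other token: the operation is global.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[c] for w, v in zip(weights, tokens)) / total
            for c in range(d)
        ])
    return out


# Four toy patch embeddings (illustrative values) and an arbitrary reordering.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
perm = [2, 0, 3, 1]

out = self_attention(patches)
out_of_permuted = self_attention([patches[p] for p in perm])
permuted_out = [out[p] for p in perm]

# Permutation equivariance: attention(P x) == P attention(x), up to float rounding.
permutation_equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(out_of_permuted, permuted_out)
    for a, b in zip(row_a, row_b)
)
print(permutation_equivariant)
```

Because nothing in the computation depends on where a patch came from, any notion of "neighboring patches" has to come from the position embeddings, which the quoted passage says start out with no 2D information.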
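CNN-based methods, on the other hand, still suffer from a limited receptive field and cannot model long-range dependencies (global information) well. Transformer-based methods model global information well but lack CNN-like inductive biases, so this prior knowledge has to be learned from large amounts of data. This is why, on small datasets, CNNs generally perform better than Transformer-based methods: training a CNN-based method usually needs only a relatively small dataset, whereas a Transformer-based method generally needs to be pre-trained on a large dataset.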
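To make the "limited receptive field" point concrete, the following small sketch computes how many input pixels along one axis can influence a single output value after a stack of convolution layers. It uses the standard receptive-field recurrence; the function name `receptive_field` and the example layer configurations are illustrative assumptions, not taken from the paper.

```python
def receptive_field(layers):
    """Receptive field (input pixels along one axis) of stacked conv layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance in input pixels between adjacent units of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


three_layers = receptive_field([(3, 1)] * 3)   # three 3x3 convs, stride 1
ten_layers = receptive_field([(3, 1)] * 10)    # ten 3x3 convs, stride 1
print(three_layers, ten_layers)
```

With stride-1 3x3 convolutions the receptive field grows by only 2 pixels per layer, so a CNN needs depth or downsampling before distant parts of the image can interact. A self-attention layer, by contrast, lets every patch attend to every other patch from the first layer onward.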