1. Abstract
Light-weight convolutional networks are widely used in mobile vision applications. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these representations are spatially local. To learn global representations, the self-attention modules of vision transformers (ViTs) have been adopted. How can the strengths of CNN and Transformer architectures be combined to build a low-latency, light-weight network that performs well on vision tasks? To address this, the paper proposes the MobileViT model. In the authors' words: "In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices."
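To make the local-vs-global contrast concrete, here is a minimal NumPy sketch (not the MobileViT architecture itself, just an illustration): a small convolution mixes only a fixed neighborhood around each position, while scaled dot-product self-attention lets every token attend to every other token in one step. The function names and shapes are my own choices for illustration.

```python
import numpy as np

def local_conv1d(x, w):
    # Local operation: each output position mixes only a small neighborhood
    # (here a 3-tap kernel) -- the "spatial inductive bias" of CNNs.
    n, k = len(x), len(w)
    pad = k // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + k], w) for i in range(n)])

def self_attention(X):
    # Global operation: scaled dot-product self-attention -- every token
    # attends to all tokens, so the receptive field spans the whole input.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                      # (n, n) similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return w @ X                                       # global mixture per token

x = np.arange(8, dtype=float)
print(local_conv1d(x, np.array([0.25, 0.5, 0.25])))    # local smoothing only

X = np.random.default_rng(0).normal(size=(4, 8))
print(self_attention(X).shape)                         # (4, 8), globally mixed
```

MobileViT's idea, as the abstract states, is to get both behaviors in one light-weight network rather than choosing between them.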
2. Motivation
Self-attention-based models, especially vision transfor