Paper notes on Mobile-Former: Bridging MobileNet and Transformer

1. Abstract

The paper proposes Mobile-Former, a parallel design that connects MobileNet and a Transformer through a two-way bridge.

The bridge enables bidirectional fusion of local and global features.

Mobile-Former uses only a few randomly initialized tokens (fewer than six), which keeps the computational cost low.

The abstract then summarizes the experimental results:

Combining with the proposed light-weight cross attention to model the bridge, Mobile-Former is not only computationally efficient, but also has more representation power, outperforming MobileNetV3 at low FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, it achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 but saving 17% of computations. When transferring to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP.

2. Introduction

2.1 Motivation

1. ViT provides a global receptive field, but its performance degrades once FLOPs are constrained below 1G (the smallest ViT variants sit around 1.6G FLOPs).

2. MobileNet and its descendants have very low FLOPs (e.g., less than 300M FLOPs for ImageNet classification). (MobileNet is a lightweight CNN.)

The question the paper raises:

How to design efficient networks to effectively encode both local processing and global interaction?

The answer: a straightforward idea is to combine convolution and vision transformer. Prior work connected the two serially; this paper proposes the parallel Mobile-Former.

2.2 Framework

Mobile takes the image as input and stacks mobile (or inverted bottleneck) blocks [24]. It leverages efficient depthwise and pointwise convolutions to extract local features at the pixel level.
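To make the Mobile side concrete, here is a minimal NumPy sketch of an inverted bottleneck block (1×1 pointwise expand, 3×3 depthwise filter, 1×1 pointwise project). The function names and shapes are illustrative, not taken from the paper's code.

```python
import numpy as np

def pointwise_conv(x, w):
    """1x1 convolution: a per-pixel matmul. x: (H, W, Cin), w: (Cin, Cout)."""
    return x @ w

def depthwise_conv3x3(x, w):
    """3x3 depthwise convolution, stride 1, zero padding 1.
    x: (H, W, C), w: (3, 3, C) -- one 3x3 filter per channel."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = (xp[i:i + 3, j:j + 3] * w).sum(axis=(0, 1))
    return out

def inverted_bottleneck(x, w_expand, w_dw, w_project):
    """Expand C -> E*C channels, filter depthwise, project back to C."""
    h = np.maximum(pointwise_conv(x, w_expand), 0.0)   # 1x1 expand + ReLU
    h = np.maximum(depthwise_conv3x3(h, w_dw), 0.0)    # 3x3 depthwise + ReLU
    return pointwise_conv(h, w_project)                # 1x1 linear projection

rng = np.random.default_rng(0)
C, E, H, W = 16, 6, 8, 8
x = rng.standard_normal((H, W, C))
y = inverted_bottleneck(
    x,
    rng.standard_normal((C, E * C)) * 0.1,
    rng.standard_normal((3, 3, E * C)) * 0.1,
    rng.standard_normal((E * C, C)) * 0.1,
)
print(y.shape)  # (8, 8, 16): spatial size and channel count are preserved
```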

Unlike ViT, Former does not take image patches as input; instead it takes a few learnable tokens as input.

Former takes a few learnable tokens as input and stacks multi-head attention and feed-forward networks (FFN). These tokens are used to encode global features of the image.
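A minimal NumPy sketch of one Former block, assuming standard multi-head self-attention over the M tokens followed by an FFN (post layer normalization and FFN expansion ratio 2, as the paper adopts; all parameter names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def former_block(z, p, heads=2):
    """z: (M, d) global tokens; p: dict of parameter matrices."""
    M, d = z.shape
    dh = d // heads
    q = (z @ p["wq"]).reshape(M, heads, dh)
    k = (z @ p["wk"]).reshape(M, heads, dh)
    v = (z @ p["wv"]).reshape(M, heads, dh)
    att = softmax(np.einsum("mhd,nhd->hmn", q, k) / np.sqrt(dh))  # (heads, M, M)
    out = np.einsum("hmn,nhd->mhd", att, v).reshape(M, d) @ p["wo"]
    z = layer_norm(z + out)                       # post-LN after attention
    h = np.maximum(z @ p["w1"], 0.0) @ p["w2"]    # FFN with expansion ratio 2
    return layer_norm(z + h)                      # post-LN after FFN

rng = np.random.default_rng(1)
M, d = 6, 16
p = {name: rng.standard_normal(shape) * 0.1 for name, shape in {
    "wq": (d, d), "wk": (d, d), "wv": (d, d), "wo": (d, d),
    "w1": (d, 2 * d), "w2": (2 * d, d)}.items()}
z = rng.standard_normal((M, d))
z_out = former_block(z, p)
print(z_out.shape)  # (6, 16): token count and dimension are unchanged
```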

The two-way bridge passes local features to Former's tokens, and passes global features back to every pixel of Mobile's feature map.

We propose a light-weight cross attention to model this bidirectional bridge by (a) performing the cross attention at the bottleneck of Mobile where the number of channels is low, and (b) removing the projections on query, key and value ($W^{Q}$, $W^{K}$, $W^{V}$) from the Mobile side.

The two-way bridge and Former account for only a small fraction of the total FLOPs:

The bridge and Former consume less than 20% of the total computational cost, but significantly improve the representation capability.

3. Mobile-Former implementation details and computational cost

1. Mobile takes the image $x \in \mathbb{R}^{H\times W\times 3}$ as input, followed by bottleneck blocks that extract local features.

The paper defines $L = H \times W$ as the number of spatial positions.

Computational cost: $O(2LEC^{2} + 9LEC)$

where $L$ is the number of spatial positions, $E$ is the channel expansion ratio, and $C$ is the number of channels before the expansion.
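The two terms can be checked by counting multiply-adds directly: the two 1×1 convolutions (expand $C \to EC$ and project $EC \to C$) each cost $L \cdot C \cdot EC$, giving $2LEC^{2}$, and the 3×3 depthwise conv over $EC$ channels costs $9 \cdot L \cdot EC$. A quick sanity check in plain Python (the values of L, E, C are illustrative):

```python
# Per-term multiply-add count for one inverted bottleneck, illustrative values.
L = 56 * 56   # spatial positions
E = 6         # channel expansion ratio
C = 24        # channels before expansion

pointwise = 2 * L * E * C * C   # 1x1 expand (C -> E*C) + 1x1 project (E*C -> C)
depthwise = 9 * L * E * C       # 3x3 depthwise over E*C channels
total = pointwise + depthwise
print(pointwise, depthwise, total)
```

The pointwise term dominates, which is why performing the bridge's attention at the low-channel bottleneck matters.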

2. Former takes the learnable tokens $z \in \mathbb{R}^{M\times d}$ as input, where $d$ and $M$ are the token dimension and count, respectively; they are randomly initialized. Note that $d$ and $M$ are the same in all blocks.

Computational cost: $O(M^{2}d + Md^{2})$

The first item relates to computing the dot product between query and key and aggregating values based on attention weights, while the second item covers the linear projections and FFN. Since Former only has a few tokens (M ≤ 6), the first item $M^{2}d$ is ignorable.
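This is easy to verify numerically; with M = 6 tokens and a hypothetical token dimension d = 192 (an assumed value, for illustration only), the attention term $M^{2}d$ is a small fraction of the projection/FFN term $Md^{2}$:

```python
M, d = 6, 192                 # d = 192 is an assumed token dimension
attention_term = M * M * d    # query-key dot products + value aggregation
projection_term = M * d * d   # linear projections and FFN
print(attention_term, projection_term, attention_term / projection_term)
```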

Here, we follow [33] to use post layer normalization. To save computations, we use expansion ratio 2 instead of 4 in FFN.

3. The two-way bridge differs from standard cross attention in two respects:

(1) the cross attention is computed at the bottleneck of Mobile, where the number of channels is low;

(2) the projections on the Mobile side are removed, but kept on the Former side.

Specifically, denote the local feature map as $x$ and the global tokens as $z$. They are split as $x = [x_{h}]$ and $z = [z_{h}]$ ($1 \le h \le H$) for multi-head attention with $H$ heads.

Mobile->Former 

Computational cost: $O(LMC + MdC)$

where the first item relates to computing cross attention between local and global features and aggregating local features for each global token, and the second item is the complexity to project global features to the same dimension of local features C and back to dimension d after aggregation.
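A NumPy sketch of the Mobile→Former direction under these rules: the queries come from the (projected) global tokens, while the keys and values are the raw local features, with no $W^{K}$/$W^{V}$ projection on the Mobile side. A single head is shown for brevity, and all names are illustrative:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mobile_to_former(x, z, w_q, w_out):
    """x: (L, C) local features, z: (M, d) global tokens.
    Only the Former side is projected: tokens go d -> C for the query,
    and the aggregated local features go C -> d on the way back."""
    q = z @ w_q                                   # (M, C): project tokens
    att = softmax(q @ x.T / np.sqrt(x.shape[1]))  # (M, L): keys = raw x
    agg = att @ x                                 # (M, C): values = raw x, no W^V
    return z + agg @ w_out                        # (M, d): fuse back into tokens

rng = np.random.default_rng(2)
L, C, M, d = 49, 24, 6, 16
x = rng.standard_normal((L, C))
z = rng.standard_normal((M, d))
z_new = mobile_to_former(x, z, rng.standard_normal((d, C)) * 0.1,
                         rng.standard_normal((C, d)) * 0.1)
print(z_new.shape)  # (6, 16)
```

The two matmuls against `x` cost LMC, and the two token projections cost MdC, matching the stated $O(LMC + MdC)$.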

Former->Mobile

Computational cost: $O(LMC + MdC)$
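A matching sketch of the Former→Mobile direction: the local features act as unprojected queries, while the keys and values are projections of the global tokens (again single-head and illustrative):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def former_to_mobile(x, z, w_k, w_v):
    """x: (L, C) local features (queries, no W^Q), z: (M, d) global tokens."""
    k = z @ w_k                                   # (M, C): project token keys
    v = z @ w_v                                   # (M, C): project token values
    att = softmax(x @ k.T / np.sqrt(x.shape[1]))  # (L, M)
    return x + att @ v                            # (L, C): global context per pixel

rng = np.random.default_rng(3)
L, C, M, d = 49, 24, 6, 16
x = rng.standard_normal((L, C))
z = rng.standard_normal((M, d))
x_new = former_to_mobile(x, z, rng.standard_normal((d, C)) * 0.1,
                         rng.standard_normal((d, C)) * 0.1)
print(x_new.shape)  # (49, 24)
```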

Computational complexity summary:

The Mobile blocks consume the most computation.

Former and the two-way bridge account for less than 20% of the total cost.
