Paper notes on Mobile-Former: Bridging MobileNet and Transformer

1. Abstract

The paper proposes Mobile-Former, a parallel design that connects MobileNet and a Transformer through a two-way bridge.

The bridge enables bidirectional fusion of local and global features.

Mobile-Former uses only a few randomly initialized tokens (fewer than six), which keeps the computational cost low.

The abstract then summarizes the experimental results:

Combining with the proposed light-weight cross attention to model the bridge, Mobile-Former is not only computationally efficient, but also has more representation power, outperforming MobileNetV3 at low FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, it achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 but saving 17% of computations. When transferring to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP.

2. Introduction

2.1 Motivation

1. ViT provides a global receptive field, but its performance degrades once FLOPs are constrained below 1G (the smallest ViT variants sit around 1.6G FLOPs).

2. MobileNet and its descendants have very low FLOPs (e.g., less than 300M FLOPs for ImageNet classification). (MobileNet is a lightweight CNN.)

The question the paper raises:

How to design efficient networks to effectively encode both local processing and global interaction?

The answer: a straightforward idea is to combine convolution and vision transformer. Prior work connected the two serially; this paper proposes the parallel Mobile-Former.

2.2 Framework

Mobile takes the image as input and stacks mobile (or inverted bottleneck) blocks [24]. It leverages efficient depthwise and pointwise convolutions to extract local features at the pixel level.
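To make the Mobile side concrete, here is a minimal NumPy sketch of an inverted bottleneck block (1×1 pointwise expand, 3×3 depthwise filter, 1×1 pointwise project). The function names and shapes are illustrative, not taken from the paper's code.

```python
import numpy as np

def pointwise_conv(x, w):
    """1x1 convolution: a per-pixel matmul. x: (H, W, Cin), w: (Cin, Cout)."""
    return x @ w

def depthwise_conv3x3(x, w):
    """3x3 depthwise convolution, stride 1, zero padding 1.
    x: (H, W, C), w: (3, 3, C) -- one 3x3 filter per channel."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = (xp[i:i + 3, j:j + 3] * w).sum(axis=(0, 1))
    return out

def inverted_bottleneck(x, w_expand, w_dw, w_project):
    """Expand C -> E*C channels, filter depthwise, project back to C."""
    h = np.maximum(pointwise_conv(x, w_expand), 0.0)   # 1x1 expand + ReLU
    h = np.maximum(depthwise_conv3x3(h, w_dw), 0.0)    # 3x3 depthwise + ReLU
    return pointwise_conv(h, w_project)                # 1x1 linear projection

rng = np.random.default_rng(0)
C, E, H, W = 16, 6, 8, 8
x = rng.standard_normal((H, W, C))
y = inverted_bottleneck(
    x,
    rng.standard_normal((C, E * C)) * 0.1,
    rng.standard_normal((3, 3, E * C)) * 0.1,
    rng.standard_normal((E * C, C)) * 0.1,
)
print(y.shape)  # (8, 8, 16): spatial size and channel count are preserved
```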

Unlike ViT, Former does not take image patches as input; instead it takes a few learnable tokens as input.

Former takes a few learnable tokens as input and stacks multi-head attention and feed-forward networks (FFN). These tokens are used to encode global features of the image.
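A minimal NumPy sketch of one Former block, assuming standard multi-head self-attention over the M tokens followed by an FFN (post layer normalization and FFN expansion ratio 2, as the paper adopts; all parameter names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def former_block(z, p, heads=2):
    """z: (M, d) global tokens; p: dict of parameter matrices."""
    M, d = z.shape
    dh = d // heads
    q = (z @ p["wq"]).reshape(M, heads, dh)
    k = (z @ p["wk"]).reshape(M, heads, dh)
    v = (z @ p["wv"]).reshape(M, heads, dh)
    att = softmax(np.einsum("mhd,nhd->hmn", q, k) / np.sqrt(dh))  # (heads, M, M)
    out = np.einsum("hmn,nhd->mhd", att, v).reshape(M, d) @ p["wo"]
    z = layer_norm(z + out)                       # post-LN after attention
    h = np.maximum(z @ p["w1"], 0.0) @ p["w2"]    # FFN with expansion ratio 2
    return layer_norm(z + h)                      # post-LN after FFN

rng = np.random.default_rng(1)
M, d = 6, 16
p = {name: rng.standard_normal(shape) * 0.1 for name, shape in {
    "wq": (d, d), "wk": (d, d), "wv": (d, d), "wo": (d, d),
    "w1": (d, 2 * d), "w2": (2 * d, d)}.items()}
z = rng.standard_normal((M, d))
z_out = former_block(z, p)
print(z_out.shape)  # (6, 16): token count and dimension are unchanged
```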

The two-way bridge passes local features to Former's tokens, and passes global features back to every pixel of Mobile's feature map.

We propose a light-weight cross attention to model this bidirectional bridge by (a) performing the cross attention at the bottleneck of Mobile where the number of channels is low, and (b) removing the projections on query, key and value ($W^{Q}$, $W^{K}$, $W^{V}$) from the Mobile side.

The two-way bridge and Former account for only a small fraction of the total FLOPs:

The bridge and Former consume less than 20% of the total computational cost, but significantly improve the representation capability.

3. Mobile-Former implementation details and computational cost

1. Mobile takes the image $x \in \mathbb{R}^{H\times W\times 3}$ as input, followed by bottleneck blocks that extract local features.

The paper defines $L = H \times W$ as the number of spatial positions.

Computational cost: $O(2LEC^{2} + 9LEC)$

where $L$ is the number of spatial positions, $E$ is the channel expansion ratio, and $C$ is the number of channels before the expansion.
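The two terms can be checked by counting multiply-adds directly: the two 1×1 convolutions (expand $C \to EC$ and project $EC \to C$) each cost $L \cdot C \cdot EC$, giving $2LEC^{2}$, and the 3×3 depthwise conv over $EC$ channels costs $9 \cdot L \cdot EC$. A quick sanity check in plain Python (the values of L, E, C are illustrative):

```python
# Per-term multiply-add count for one inverted bottleneck, illustrative values.
L = 56 * 56   # spatial positions
E = 6         # channel expansion ratio
C = 24        # channels before expansion

pointwise = 2 * L * E * C * C   # 1x1 expand (C -> E*C) + 1x1 project (E*C -> C)
depthwise = 9 * L * E * C       # 3x3 depthwise over E*C channels
total = pointwise + depthwise
print(pointwise, depthwise, total)
```

The pointwise term dominates, which is why performing the bridge's attention at the low-channel bottleneck matters.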

2. Former takes the learnable tokens $z \in \mathbb{R}^{M\times d}$ as input, where $d$ and $M$ are the token dimension and count, respectively; they are randomly initialized. Note that $d$ and $M$ are the same in all blocks.

Computational cost: $O(M^{2}d + Md^{2})$

The first item relates to computing the dot product between query and key and aggregating values based on attention weights, while the second item covers the linear projections and FFN. Since Former only has a few tokens (M ≤ 6), the first item $M^{2}d$ is ignorable.
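This is easy to verify numerically; with M = 6 tokens and a hypothetical token dimension d = 192 (an assumed value, for illustration only), the attention term $M^{2}d$ is a small fraction of the projection/FFN term $Md^{2}$:

```python
M, d = 6, 192                 # d = 192 is an assumed token dimension
attention_term = M * M * d    # query-key dot products + value aggregation
projection_term = M * d * d   # linear projections and FFN
print(attention_term, projection_term, attention_term / projection_term)
```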

Here, we follow [33] to use post layer normalization. To save computations, we use expansion ratio 2 instead of 4 in FFN.

3. The two-way bridge differs from standard cross attention in two respects:

(1) the cross attention is computed at the bottleneck of Mobile, where the number of channels is low;

(2) the projections on the Mobile side are removed, but kept on the Former side.

Specifically, denote the local feature map as $x$ and the global tokens as $z$. They are split as $x = [x_{h}]$ and $z = [z_{h}]$ ($1 \le h \le H$) for multi-head attention with $H$ heads.

Mobile->Former 

Computational cost: $O(LMC + MdC)$

where the first item relates to computing cross attention between local and global features and aggregating local features for each global token, and the second item is the complexity to project global features to the same dimension of local features C and back to dimension d after aggregation.
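A NumPy sketch of the Mobile→Former direction under these rules: the queries come from the (projected) global tokens, while the keys and values are the raw local features, with no $W^{K}$/$W^{V}$ projection on the Mobile side. A single head is shown for brevity, and all names are illustrative:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mobile_to_former(x, z, w_q, w_out):
    """x: (L, C) local features, z: (M, d) global tokens.
    Only the Former side is projected: tokens go d -> C for the query,
    and the aggregated local features go C -> d on the way back."""
    q = z @ w_q                                   # (M, C): project tokens
    att = softmax(q @ x.T / np.sqrt(x.shape[1]))  # (M, L): keys = raw x
    agg = att @ x                                 # (M, C): values = raw x, no W^V
    return z + agg @ w_out                        # (M, d): fuse back into tokens

rng = np.random.default_rng(2)
L, C, M, d = 49, 24, 6, 16
x = rng.standard_normal((L, C))
z = rng.standard_normal((M, d))
z_new = mobile_to_former(x, z, rng.standard_normal((d, C)) * 0.1,
                         rng.standard_normal((C, d)) * 0.1)
print(z_new.shape)  # (6, 16)
```

The two matmuls against `x` cost LMC, and the two token projections cost MdC, matching the stated $O(LMC + MdC)$.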

Former->Mobile

Computational cost: $O(LMC + MdC)$
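A matching sketch of the Former→Mobile direction: the local features act as unprojected queries, while the keys and values are projections of the global tokens (again single-head and illustrative):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def former_to_mobile(x, z, w_k, w_v):
    """x: (L, C) local features (queries, no W^Q), z: (M, d) global tokens."""
    k = z @ w_k                                   # (M, C): project token keys
    v = z @ w_v                                   # (M, C): project token values
    att = softmax(x @ k.T / np.sqrt(x.shape[1]))  # (L, M)
    return x + att @ v                            # (L, C): global context per pixel

rng = np.random.default_rng(3)
L, C, M, d = 49, 24, 6, 16
x = rng.standard_normal((L, C))
z = rng.standard_normal((M, d))
x_new = former_to_mobile(x, z, rng.standard_normal((d, C)) * 0.1,
                         rng.standard_normal((d, C)) * 0.1)
print(x_new.shape)  # (49, 24)
```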

Computational complexity summary:

The Mobile blocks consume the most computation.

Former and the two-way bridge account for less than 20% of the total cost.
