【ShuffleNet】《ShuffleNet：An Extremely Efficient Convolutional Neural Network for Mobile Devices》

Latest recommended article published on 2023-12-11 10:49:48

bryant_meng

Category: CNN / Transformer. Tags: ShuffleNet, point-wise convolution, channel shuffle, complexity, bottleneck ratio

Article link: https://blog.csdn.net/bryant_meng/article/details/86645643

CVPR-2018

1 Background and Motivation

Building deeper and larger CNNs is a primary trend for solving major visual recognition tasks, but such networks need computation at billions of FLOPs. The authors instead pursue the best accuracy under very limited computational budgets of tens or hundreds of MFLOPs, focusing on common mobile platforms such as drones, robots, and smartphones.

Inspired by MobileNet (depthwise separable convolution) and ResNeXt (group convolution), the paper targets two weaknesses: $1 \times 1$ (point-wise) convolutions are computationally expensive, and group convolutions exchange no information between groups. Its answers are point-wise group convolution and channel shuffle.

Note: depthwise separable convolution = depth-wise convolution + point-wise convolution

2 Advantages

More accurate than MobileNet: absolute 7.8% lower ImageNet top-1 error at the level of 40 MFLOPs
Faster than AlexNet: achieves ~13× actual speedup while maintaining comparable accuracy

3 Innovations

Proposes the ShuffleNet architecture
Proposes the idea of shuffling channels across groups (channel shuffle)

4 Related Work

Efficient Model Designs
GoogLeNet, SqueezeNet, ResNet, SENet, NASNet
Group Convolution
AlexNet, ResNeXt, Xception, MobileNet
Channel Shuffle Operation
Two earlier works: cuda-convnet [20] (shuffle first, then group convolution) and [41]. The authors note that [41] did not specially investigate the effectiveness of channel shuffle itself and its usage in tiny model design (a cheeky way to put it).

5 Method

5.1 Channel Shuffle for Group Convolutions

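[Figure: channel shuffle with stacked group convolutions, panels (a), (b), (c)]
GConv stands for group convolution. (a) is the conventional stacked GConv; (b) and (c) are equivalent to each other and add a shuffle along the channel dimension of the feature map on top of (a). Why?

Group convolution has a side effect: outputs from a certain channel are only derived from a small fraction of input channels (this is the "No cross talk" label in the figure above).

This property blocks information flow between channel groups and weakens representation.

That is why the authors made the change shown in (b) and (c). Neat!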
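
The "no cross talk" effect can be checked with a small dependency-tracking sketch. This is only an illustration, not code from the paper: the helper names `group_conv_deps` and `shuffle_deps` are hypothetical, and 9 channels with 3 groups are assumed values. Each channel carries the set of original input channels it depends on.

```python
def group_conv_deps(deps, groups):
    """Propagate dependencies through one group convolution.

    deps[i] is the set of original input channels that channel i depends on.
    Every output channel of a group depends on all channels of that group.
    """
    per_group = len(deps) // groups
    out = []
    for g in range(groups):
        block = deps[g * per_group:(g + 1) * per_group]
        merged = set().union(*block)
        out.extend(set(merged) for _ in range(per_group))
    return out


def shuffle_deps(deps, groups):
    """Channel shuffle: view as (groups, per_group), transpose, flatten."""
    per_group = len(deps) // groups
    return [deps[g * per_group + k] for k in range(per_group) for g in range(groups)]


channels, groups = 9, 3  # assumed toy sizes
start = [{i} for i in range(channels)]

# (a) two stacked group convolutions without shuffle
no_shuffle = group_conv_deps(group_conv_deps(start, groups), groups)
# (b)/(c) the same stack with a channel shuffle in between
with_shuffle = group_conv_deps(
    shuffle_deps(group_conv_deps(start, groups), groups), groups)

print(sorted(no_shuffle[0]))    # [0, 1, 2]
print(sorted(with_shuffle[0]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Without the shuffle every output still sees only its own 3 input channels; with the shuffle every output sees all 9.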

5.2 ShuffleNet Unit

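[Figure: ShuffleNet units]
1) Improved bottleneck unit
As in ResNet, the unit is still a $1 \times 1$ → $3 \times 3$ → $1 \times 1$ stack, but the $1 \times 1$ layers become GConv and the $3 \times 3$ layer becomes a depth-wise convolution; the trailing $3 \times 3$ → $1 \times 1$ pair then forms a depthwise separable convolution.

2) Complexity comparison
To compare the cost, assume the input size is $c \times h \times w$, the number of bottleneck channels is $m$, and the number of groups is $g$.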

Unit	Parameters	Computational complexity
ResNet	$2cm + 9m^2$	$hw(2cm + 9m^2)$
ResNeXt	$2cm + 9m^2/g$	$hw(2cm + 9m^2/g)$
ShuffleNet	$2cm/g + 9m$	$hw(2cm/g + 9m)$
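
To get a feel for the three formulas, here is a minimal sketch that plugs in numbers. The function name `unit_cost` and the sample sizes (c = 240, h = w = 28, m = 60, g = 3) are assumptions chosen for illustration; only the formulas come from the table above.

```python
def unit_cost(kind, c, h, w, m, g=1):
    """Computational complexity of one bottleneck unit, per the table's formulas.

    c: input/output channels, h x w: spatial size,
    m: bottleneck channels, g: number of groups.
    """
    if kind == "resnet":
        per_position = 2 * c * m + 9 * m * m
    elif kind == "resnext":
        per_position = 2 * c * m + 9 * m * m / g
    elif kind == "shufflenet":
        per_position = 2 * c * m / g + 9 * m
    else:
        raise ValueError("unknown unit kind: %s" % kind)
    return h * w * per_position


# Assumed sample sizes, with m = c / 4 (bottleneck ratio 1:4)
c, h, w, m, g = 240, 28, 28, 60, 3
for kind in ("resnet", "resnext", "shufflenet"):
    print(kind, "%.2fM" % (unit_cost(kind, c, h, w, m, g) / 1e6))
```

With these sizes the costs come out to roughly 47.98M, 31.05M, and 7.95M, so at equal width the ShuffleNet unit is by far the cheapest. That saving is what later gets spent on wider feature maps.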

3) Unit structure when the resolution is reduced

[Figure: ShuffleNet unit with stride 2]
The differences are:

add a 3 × 3 average pooling on the shortcut path;
replace the element-wise addition with channel concatenation, which makes it easy to enlarge channel dimension with little extra computation cost.

The second point is quite clever: replacing add with concatenate is how the unit doubles the channels. After seeing the Inception family, though, this form is no great surprise.
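
The shape bookkeeping of the stride-2 unit can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, padding 1 for the 3 × 3 average pooling is an assumption, and so is the convention that the residual branch produces `out_channels - in_channels` channels so that the concatenation reaches the target width.

```python
def pooled_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after pooling (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def stride2_unit_shape(h, w, in_channels, out_channels):
    """Output shape (h, w, channels) of a stride-2 unit that concatenates
    the average-pooled shortcut with the residual branch."""
    branch_channels = out_channels - in_channels  # assumed convention
    if branch_channels <= 0:
        raise ValueError("out_channels must exceed in_channels")
    out_h, out_w = pooled_size(h), pooled_size(w)
    shortcut = (out_h, out_w, in_channels)    # 3x3 average pooling, stride 2
    branch = (out_h, out_w, branch_channels)  # GConv -> DWConv(stride 2) -> GConv
    return (out_h, out_w, shortcut[2] + branch[2])


print(stride2_unit_shape(56, 56, 240, 480))  # (28, 28, 480)
```

The shortcut contributes its channels at no convolution cost, which is why concatenation enlarges the channel dimension so cheaply.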

4) On the use of depth-wise convolution
The depth-wise convolution is applied to the 3×3 layer. Note that although it greatly reduces parameters and theoretical complexity, the authors report:
we find it difficult to efficiently implement on low power mobile devices, which may result from a worse computation/memory access ratio compared with other dense operations. (Xception mentions this as well.)

So the authors replace the regular 3×3 convolution with a depth-wise one only inside the bottleneck unit.

5.3 Network Architecture

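[Table: ShuffleNet architecture]

The first building block in each stage is applied with stride = 2.
Bottleneck channels are 1/4 of the output channels ( $\frac{1}{4}c$ ), i.e. bottleneck ratio = 1:4.
As the number of groups grows, the channel counts grow with it (more groups save more computation), keeping the total at ~140 MFLOPs. More groups at the same cost allow wider feature maps, which can encode more information, but each group then has fewer channels, so individual filters may degrade.
The table above can be called ShuffleNet 1×. Following MobileNet, ShuffleNet s× scales the number of filters by s, so the parameter count and overall complexity become roughly $s^2$ times those of ShuffleNet 1×.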
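
The width scaling and the 1:4 bottleneck ratio can be sketched as below. The stage widths (240, 480, 960) are taken to be the g = 3 column of the architecture table and should be treated as an assumption if the table is not at hand. The function names are hypothetical, and the $s^2$ rule is only a rough estimate.

```python
BASE_WIDTHS_G3 = (240, 480, 960)  # assumed stage 2-4 output channels, ShuffleNet 1x, g = 3


def scaled_widths(scale, base=BASE_WIDTHS_G3):
    """Stage output channels of ShuffleNet s-times."""
    return tuple(int(c * scale) for c in base)


def bottleneck_channels(out_channels, ratio=0.25):
    """Bottleneck width under the 1:4 bottleneck ratio."""
    return int(out_channels * ratio)


def approx_mflops(scale, base_mflops=140):
    """Rough s^2 estimate of complexity relative to ShuffleNet 1x."""
    return base_mflops * scale ** 2


for s in (1.0, 0.5, 0.25):
    widths = scaled_widths(s)
    print(s, widths, [bottleneck_channels(c) for c in widths], approx_mflops(s))
```

The estimate ignores layers that do not scale quadratically with s, so measured complexity can sit somewhat above it.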

6 Experiments

6.1 Ablation Study

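One thing to keep in mind: at the same complexity, a different g also means a different feature-map width in each bottleneck unit, because a larger g saves more computation.
1) Pointwise Group Convolutions
[Table: classification error vs. number of groups g]
Smaller models tend to benefit more from groups (at the same complexity they get wider feature maps). As the number of groups grows, ShuffleNet 1× improves by 1.2%, ShuffleNet 0.5× by 3.5%, and ShuffleNet 0.25× by 4.4% (measured against the most accurate setting).

Going from ShuffleNet 1× down to ShuffleNet 0.25×, the smaller the model, the more it benefits from enlarged feature maps.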

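2) Channel Shuffle vs. No Shuffle
Shuffle enables cross-group information interchange; the comparison is run at three complexity levels.
[Table: ShuffleNet with and without channel shuffle]
The more groups there are, the more pronounced the benefit of shuffle.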

6.2 Comparison with Other Structure Units

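Performance is compared at the same complexity.
[Table: classification error of different structure units]
At the same complexity, a unit that costs less can afford wider feature maps, so its accuracy is higher.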

6.3 Comparison with MobileNets and Other Frameworks

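1) MobileNet vs. ShuffleNet head to head
Performance is compared at similar complexity.
[Table: ShuffleNet vs. MobileNet on ImageNet classification]
ShuffleNet uses its parameters more efficiently, and its structure is better optimized.

Although the ShuffleNet network is specially designed for small models (<150 MFLOPs), its advantage still shows above 150 MFLOPs.

"Shallow" in the last row means the number of bottleneck units in stages 2-4 is halved.

Note the "with SE" rows: adding Squeeze-and-Excitation (SE) modules makes the network 25-40% slower on mobile phones.

2) Comparison with other common architectures
Complexity is compared at similar accuracy.
[Table: complexity comparison at similar accuracy]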

6.4 Generalization Ability

Evaluated on COCO.
[Table: results on COCO]
We conjecture that this significant gain is partly due to ShuffleNet’s simple design of architecture without bells and whistles.

6.5 Actual Speedup Evaluation

Tested on a mobile device with an ARM platform. The theoretical speedup and the measured speedup differ somewhat.

Empirically g = 3 usually has a proper trade-off between accuracy and actual inference time.

[Table: actual inference time on the mobile device]

ShuffleNet achieves ~13× actual speedup over AlexNet (~18× theoretical).

7 Conclusion

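Points to note:

ShuffleNet's two main changes: point-wise convolution becomes point-wise group convolution, and channel shuffle restores cross-group talk.
The bottleneck unit in ShuffleNet still follows the 1, 3, 1 pattern. I also picked up a new concept here: the bottleneck ratio.
In ShuffleNet, at the same complexity, a different number of groups means a different bottleneck unit width, because group convolution lowers the computation.
It is quite a leap of imagination: people first moved away from 7x7 toward 5x5 and 3x3, and now that 1x1 convolutions take up most of the computation, even 1x1 is under fire.
"Halve the resolution, double the channels": this is the third time I have seen this description (ResNet, ResNeXt, ShuffleNet). The principle behind it is to keep the complexity of the bottleneck roughly constant.
The core code is as follows: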

from keras import backend as K


def channel_shuffle(x, groups):
    """
    Parameters
    ----------
    x:
        Input tensor with `channels_last` data format
    groups: int
        number of channel groups
    Returns
    -------
        channel shuffled output tensor
    Examples
    --------
    Example for a 1D array with 3 groups
    >>> d = np.array([0,1,2,3,4,5,6,7,8])
    >>> x = np.reshape(d, (3,3))
    >>> x = np.transpose(x, [1,0])
    >>> x = np.reshape(x, (9,))
    '[0 1 2 3 4 5 6 7 8] --> [0 3 6 1 4 7 2 5 8]'
    """
    height, width, in_channels = x.shape.as_list()[1:]
    channels_per_group = in_channels // groups

    # split the channel axis into (groups, channels_per_group)
    x = K.reshape(x, [-1, height, width, groups, channels_per_group])
    # swap the two new axes so channels from different groups interleave
    x = K.permute_dimensions(x, (0, 1, 2, 4, 3))  # transpose
    # flatten back to a single channel axis
    x = K.reshape(x, [-1, height, width, in_channels])
    return x

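In Keras, a custom function like this is wrapped as a layer with Lambda (`groups` and `prefix` come from the surrounding model-building code):

from keras.layers import Lambda

x = Lambda(channel_shuffle, arguments={'groups': groups}, name='%s/channel_shuffle' % prefix)(x)

Lambda's `arguments` dict is what passes the extra parameters to the function!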
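
The reshape, transpose, reshape pattern can also be tried without Keras. The sketch below is a pure-Python stand-in that works on a flat list of channel indices (the function name `shuffle_channels` is hypothetical). It also illustrates a property that follows from the pattern but is not stated in the original post: shuffling with `groups` and then with `n // groups` restores the original order.

```python
def shuffle_channels(channels, groups):
    """Shuffle a flat list of channels: view it as (groups, per_group),
    transpose to (per_group, groups), then flatten."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    rows = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    return [rows[g][k] for k in range(per_group) for g in range(groups)]


order = list(range(12))
shuffled = shuffle_channels(order, 3)
restored = shuffle_channels(shuffled, 12 // 3)

print(shuffled)  # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
print(restored == order)  # True
```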

Reference: keras shufflenet
