[ShuffleNet] "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices"


1 Background and Motivation

  Building deeper and larger CNNs is a primary trend for solving major visual recognition tasks, but such large CNNs require computation at billions of FLOPs. The authors instead pursue the best accuracy under a very limited computational budget (tens or hundreds of MFLOPs), designing a lightweight CNN aimed at common mobile platforms such as drones, robots, and smartphones.

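  Inspired by MobileNet (depthwise separable convolution) and ResNeXt (group convolution), the authors target two drawbacks: the $1 \times 1$ (point-wise) convolution is computationally expensive, and group convolution allows no information exchange between groups. Their answer is point-wise group convolution plus channel shuffle.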

  Note: depthwise separable convolution = depth-wise convolution + point-wise convolution

2 Advantages

  • More accurate than MobileNet: absolute 7.8% lower ImageNet top-1 error at the level of 40 MFLOPs
  • Faster than AlexNet: achieves ~13× actual speedup while maintaining comparable accuracy

3 Innovations

  • Proposes the ShuffleNet architecture
  • Proposes the idea of shuffling channels across groups (channel shuffle)

4 Related Work

  • Efficient Model Designs

  • Group Convolution

  • Channel Shuffle Operation
    Two prior works: cuda-convnet [20] (shuffle first, then group convolution) and [41]. The authors point out that [41] "did not specially investigate the effectiveness of channel shuffle itself and its usage in tiny model design" (ha, nicely played).

5 Method

5.1 Channel Shuffle for Group Convolutions

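GConv stands for group convolution. In the figure, (a) is the conventional stacked GConv; (b) and (c) are equivalent to each other and add a shuffle along the channel dimension of the feature map on top of (a). Why?

Group convolution has a side effect: outputs from a certain channel are only derived from a small fraction of input channels (the "No cross talk" label in the figure).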

This property blocks information flow between channel groups and weakens representation.
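To make the "cross talk" concrete, here is a minimal pure-Python sketch of the shuffle as a permutation on a flat list of channels. The helper name `channel_shuffle_list` and the channel labels are illustrative assumptions, not taken from the paper; the point is only that, after the shuffle, every consecutive group contains one channel from each original group.

```python
def channel_shuffle_list(channels, groups):
    """Interleave channels so each output group draws from every input group.

    Hypothetical helper for illustration: `channels` is a flat list whose
    consecutive slices of equal size are the outputs of the previous GConv.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    # View the flat list as a (groups, per_group) grid ...
    grid = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    # ... then read it out column by column, i.e. transpose and flatten.
    return [grid[g][k] for k in range(per_group) for g in range(groups)]


# Each label records the original group (A, B or C) of a channel.
labels = ["A0", "A1", "A2", "B0", "B1", "B2", "C0", "C1", "C2"]
shuffled = channel_shuffle_list(labels, groups=3)
print(shuffled)
```

Splitting `shuffled` into three consecutive groups again, the next group convolution would see inputs originating from A, B and C in every group.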


5.2 ShuffleNet Unit

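1) Improving the bottleneck unit
As in ResNet, the unit is still a $1 \times 1$, $3 \times 3$, $1 \times 1$ stack, except that the $1 \times 1$ convolutions become GConv and the $3 \times 3$ becomes a depth-wise convolution; the $3 \times 3$ together with the final $1 \times 1$ forms a depthwise separable convolution.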

Comparing the computational cost: assume an input of size $c \times h \times w$, bottleneck channels $m$, and number of groups $g$.

| Unit | Parameters | Computational complexity |
|---|---|---|
| ResNet | $2cm + 9m^2$ | $hw(2cm + 9m^2)$ |
| ResNeXt | $2cm + 9m^2/g$ | $hw(2cm + 9m^2/g)$ |
| ShuffleNet | $2cm/g + 9m$ | $hw(2cm/g + 9m)$ |
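The three complexity formulas can be evaluated side by side. The sketch below is only a numerical illustration: the function names and the example values (c = 240, m = 60, g = 3 on a 28×28 feature map) are assumptions chosen for the demonstration, not numbers quoted in these notes.

```python
def resnet_unit_cost(c, h, w, m):
    """1x1 (c->m) + dense 3x3 (m->m) + 1x1 (m->c)."""
    return h * w * (2 * c * m + 9 * m * m)


def resnext_unit_cost(c, h, w, m, g):
    """Same as ResNet, but the 3x3 convolution is split into g groups."""
    return h * w * (2 * c * m + 9 * m * m / g)


def shufflenet_unit_cost(c, h, w, m, g):
    """Grouped 1x1 convolutions plus a depth-wise 3x3 (9 multiplies per channel)."""
    return h * w * (2 * c * m / g + 9 * m)


# Illustrative values only (assumed, not from the notes).
c, h, w, m, g = 240, 28, 28, 60, 3
for name, cost in [
    ("ResNet", resnet_unit_cost(c, h, w, m)),
    ("ResNeXt", resnext_unit_cost(c, h, w, m, g)),
    ("ShuffleNet", shufflenet_unit_cost(c, h, w, m, g)),
]:
    print(f"{name:10s} {cost / 1e6:8.2f} M")
```

With these inputs the ShuffleNet unit comes out several times cheaper than the other two, which is what leaves room for wider feature maps under a fixed budget.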

3) Structure used when the resolution is reduced


  • add a 3 × 3 average pooling on the shortcut path;
  • replace the element-wise addition with channel concatenation, which makes it easy to enlarge channel dimension with little extra computation cost.

The second point is quite clever: replacing add with concatenate is what doubles the channels. That said, after having seen the Inception family, this form is no longer a big surprise!
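The channel bookkeeping behind the concatenation can be sketched as follows. This is a hypothetical helper, and it assumes "same"-style padding (so a stride of 2 gives ceil(size / 2)) and that the residual branch is sized to produce exactly the channels the shortcut does not supply; neither detail is spelled out in these notes.

```python
import math


def stride2_unit_shapes(h, w, in_channels, out_channels):
    """Return (shortcut, residual, merged) shapes as (height, width, channels).

    Hypothetical helper: assumes 'same' padding for both branches.
    """
    if out_channels <= in_channels:
        raise ValueError("concatenation needs out_channels > in_channels")
    out_h, out_w = math.ceil(h / 2), math.ceil(w / 2)
    # 3x3 average pooling with stride 2 keeps the channel count unchanged.
    shortcut = (out_h, out_w, in_channels)
    # The residual branch only has to produce the remaining channels.
    residual = (out_h, out_w, out_channels - in_channels)
    # Concatenating along the channel axis adds the two channel counts.
    merged = (out_h, out_w, shortcut[2] + residual[2])
    return shortcut, residual, merged


print(stride2_unit_shapes(28, 28, 240, 480))
```

Because the shortcut contributes its channels for free (pooling has no weights), the widening costs less than producing all output channels with convolutions.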

4) On the use of depth-wise convolution
As noted, depth-wise convolution is applied to the 3×3 convolution. It is worth stressing that, although it greatly reduces the number of parameters,
we find it difficult to efficiently implement on low power mobile devices, which may result from a worse computation/memory access ratio compared with other dense operations. (Xception mentions this as well.)

Therefore, the authors replace the regular 3×3 convolution with a depth-wise convolution only inside the bottleneck unit.

5.3 Network Architecture


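  • the first building block (bottleneck unit) in each stage is applied with stride = 2
  • the bottleneck channels are 1/4 of the output channels, i.e. $m = \frac{1}{4}c$, a bottleneck ratio of 1:4
  • as the number of groups grows, the channel count is increased as well (more groups save more computation), keeping the total at ~140 MFLOPs. More groups mean a lower cost per channel, so the network can encode more information (wider feature maps); in return each group has fewer channels, which may degrade performance
  • the architecture table can be called ShuffleNet 1×. Following MobileNet, ShuffleNet s× scales the number of filters by a factor of s, so the parameter count or overall complexity becomes roughly $s^2$ times that of ShuffleNet 1×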
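The $s^2$ scaling can be read off the unit-level complexity formula from Section 5.2. The derivation below is a rough sketch under the assumption that both the unit width $c$ and the bottleneck width $m$ are scaled by $s$; it ignores the first convolution and the classifier.

```latex
\begin{aligned}
hw\left(\frac{2(sc)(sm)}{g} + 9(sm)\right)
  &= s^{2}\, hw\,\frac{2cm}{g} \;+\; s\, hw\, 9m \\
  &\approx s^{2}\, hw\left(\frac{2cm}{g} + 9m\right)
  \qquad \text{when } \frac{2cm}{g} \gg 9m .
\end{aligned}
```

The grouped $1 \times 1$ term scales with $s^2$ while the depth-wise term scales only with $s$; since the $1 \times 1$ term usually dominates, the overall cost is roughly $s^2$ times the original.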

6 Experiments

6.1 Ablation Study

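One thing to keep in mind: at the same complexity, a different g also means a different feature-map width (channel count) in each bottleneck unit, because a larger g saves more computation that can be spent on extra channels.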
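The trade-off between g and width can be illustrated with the unit-level formula $hw(2cm/g + 9m)$ and $m = c/4$. The sketch below searches for the widest unit that fits a fixed budget; the helper names, the 8-million budget and the 28×28 resolution are assumptions for illustration, and a single unit is of course only a proxy for the whole network.

```python
def unit_cost(c, h, w, g):
    """ShuffleNet unit cost with bottleneck ratio 1:4 (m = c / 4)."""
    m = c // 4
    return h * w * (2 * c * m / g + 9 * m)


def widest_channels(budget, h, w, g):
    """Largest c (a multiple of 4*g, so m divides evenly into g groups)
    whose unit cost does not exceed the budget. Hypothetical helper."""
    step = 4 * g
    best, c = 0, step
    while unit_cost(c, h, w, g) <= budget:
        best = c
        c += step
    return best


budget = 8_000_000  # assumed per-unit budget, for illustration only
for g in (1, 2, 3, 4, 8):
    print(g, widest_channels(budget, 28, 28, g))
```

Under the same budget the affordable channel count grows with g, which is the "wider feature maps" effect the ablation relies on.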
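1) Pointwise Group Convolutions
Smaller models tend to benefit more from groups (at the same complexity they get wider feature maps). As the number of groups increases, ShuffleNet 1× improves by 1.2%, ShuffleNet 0.5× by 3.5%, and ShuffleNet 0.25× by 4.4% (best-performing group setting versus the baseline).

Going from ShuffleNet 1× down to ShuffleNet 0.25×, the gain grows, i.e. the smaller model benefits more from enlarged feature maps.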

2) Channel Shuffle vs. No Shuffle
Cross-group information interchange, compared at three complexity levels.
The more groups there are, the larger the benefit of the shuffle.

6.2 Comparison with Other Structure Units

Performance is compared at the same complexity.
At the same complexity, a structure that costs less per channel can afford wider feature maps, and therefore reaches higher accuracy.

6.3 Comparison with MobileNets and Other Frameworks

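1) MobileNet vs. ShuffleNet, head to head
Performance is compared at similar complexity.
ShuffleNet uses its parameters more efficiently and its structure is better designed.

Although the ShuffleNet network is specially designed for small models (< 150 MFLOPs), its advantage is still visible above 150 MFLOPs.

The last row, "shallow", means the number of bottleneck units in stages 2-4 is halved.

Note the "with SE" entries: adding Squeeze-and-Excitation (SE) modules makes the network 25-40% slower on mobile devices.

Complexity is compared at similar accuracy.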

6.4 Generalization Ability

The model is also evaluated on COCO.
We conjecture that this significant gain is partly due to ShuffleNet’s simple design of architecture without bells and whistles.

6.5 Actual Speedup Evaluation

Tested on a mobile device (a mobile device with an ARM platform); the theoretical speedup and the measured speedup differ somewhat.

Empirically g = 3 usually has a proper trade-off between accuracy and actual inference time.


Compared with AlexNet, ShuffleNet achieves ~13× actual speedup (~18× theoretical).

7 Conclusion


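  • ShuffleNet's two main improvements: point-wise convolution becomes point-wise group convolution, and channel shuffle enables cross-group talk
  • The bottleneck unit in ShuffleNet still follows the 1, 3, 1 pattern. Picked up a new concept: bottleneck ratio
  • In ShuffleNet, at the same complexity, a different number of groups implies a different bottleneck unit width, because group convolution reduces computation
  • Quite a leap of imagination: the field first moved away from 7x7 to 5x5 and 3x3, then to 1x1, and now that 1x1 accounts for most of the computation, even 1x1 is being frowned upon
  • This is the third time I have seen "halve the resolution, double the channels" (ResNet, ResNeXt, ShuffleNet); the rationale is to keep the complexity of the bottleneck roughly constant
  • The core code is as follows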
from keras import backend as K

def channel_shuffle(x, groups):
    """
    Parameters
    ----------
    x:
        Input tensor with `channels_last` data format
    groups: int
        number of groups the channels are split into
    Returns
    -------
        channel shuffled output tensor
    Examples
    --------
    Example for a 1D array with 3 groups
    >>> d = np.array([0,1,2,3,4,5,6,7,8])
    >>> x = np.reshape(d, (3,3))
    >>> x = np.transpose(x, [1,0])
    >>> x = np.reshape(x, (9,))
    '[0 1 2 3 4 5 6 7 8] --> [0 3 6 1 4 7 2 5 8]'
    """
    # static spatial size and channel count (the batch dimension is skipped)
    height, width, in_channels = x.shape.as_list()[1:]
    channels_per_group = in_channels // groups

    # split channels into (groups, channels_per_group), swap the two axes, flatten back
    x = K.reshape(x, [-1, height, width, groups, channels_per_group])
    x = K.permute_dimensions(x, (0, 1, 2, 4, 3))  # transpose
    x = K.reshape(x, [-1, height, width, in_channels])
    return x

In Keras, a user-defined function like this is invoked as a layer as follows:

x = Lambda(channel_shuffle, arguments={'groups': groups}, name='%s/channel_shuffle' % prefix)(x)

The function is wrapped in Lambda, and its extra parameters are passed through `arguments`!

Reference: keras shufflenet

