【GhostNet】GhostNet: More Features from Cheap Operations

CVPR-2020



1 Background and Motivation

Deploying convolutional neural networks (CNNs) on embedded devices is difficult due to the limited memory and computation resources.

The redundancy in feature maps is an important characteristic of those successful CNNs, but has rarely been investigated in neural architecture design.

[Figure 1: visualization of feature maps from a well-trained network; similar (redundant) feature-map pairs are marked with boxes of the same color.]

The authors propose GhostNet: keep only a subset of intrinsic feature maps, and generate the relatively redundant feature maps (ghost features) from them with cheap linear operations. This reduces computation while preserving feature diversity. In Figure 1, each pair of same-colored boxes can be read as one intrinsic feature and one ghost feature obtained from that intrinsic feature by a linear transformation.

Why keep redundant features at all?

  • Abundant and even redundant information in the feature maps of well-trained deep neural networks often guarantees a comprehensive understanding of the input data.

  • Redundancy in feature maps could be an important characteristic for a successful deep neural network.

The authors embrace redundant features, but in a cost-efficient way.

2 Related Work

  • Model Compression
    • Pruning connections
    • Channel pruning
    • Model quantization
    • Binarization methods
    • Tensor decomposition
    • Knowledge distillation
  • Compact Model Design
    • MobileNets
    • MobileNetV2
    • MobileNetV3
    • ShuffleNet
    • ShuffleNet V2

3 Advantages / Contributions

GhostNet is proposed; on image classification it achieves a better accuracy-latency trade-off than MobileNetV3.

4 Method

4.1 Ghost Module for More Features

The input is $X \in \mathbb{R}^{c \times h \times w}$.

An ordinary convolution produces the feature maps $Y \in \mathbb{R}^{h' \times w' \times n}$ via $Y = X * f + b$,

where $*$ is the convolution operation, $b$ is the bias term, and the convolution filters are $f \in \mathbb{R}^{c \times k \times k \times n}$.
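For reference (counting one multiply-add per weight application, as in the paper's complexity analysis), the FLOPs of this ordinary convolution are about

$$n \cdot h' \cdot w' \cdot c \cdot k \cdot k,$$

which will reappear as the numerator of the speed-up ratio derived below.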

We point out that it is unnecessary to generate these redundant feature maps one by one with large number of FLOPs and parameters.

Suppose that the output feature maps are “ghosts” of a handful of intrinsic feature maps with some cheap transformations

These intrinsic feature maps are often of smaller size and produced by ordinary convolution filters.

The intrinsic feature maps $Y'$ with $m$ channels are produced by an ordinary convolution:

$Y' = X * f'$, with $Y' \in \mathbb{R}^{h' \times w' \times m}$,

where $f' \in \mathbb{R}^{c \times k \times k \times m}$ and $m \leq n$ (the bias term is omitted for simplicity).

Ghost features are then generated from the intrinsic features $Y'$ by applying cheap linear operations:

$$y_{ij} = \Phi_{i,j}\left(y'_i\right), \quad \forall\, i = 1, \ldots, m, \quad j = 1, \ldots, s,$$

where

$y'_i$ is the $i$-th intrinsic feature map;

$\Phi_{i,j}$ is the $j$-th linear operation that generates the $j$-th ghost feature map $y_{ij}$; the last operation $\Phi_{i,s}$ is fixed to be the identity mapping, which keeps the intrinsic feature map itself.

Each intrinsic feature map can therefore have several ghost maps $\{y_{ij}\}_{j=1}^{s}$.

With the ghost mechanism, the final output feature maps are

$Y = [y_{11}, y_{12}, \ldots, y_{1s}, y_{21}, \ldots, y_{ms}]$, i.e. $n = m \cdot s$ feature maps in total,

where $m$ is the number of intrinsic feature maps.
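A quick worked example of the channel arithmetic (the numbers are illustrative, not taken from the paper): with $n = 64$ output channels and $s = 2$,

$$m = \frac{n}{s} = 32, \qquad \underbrace{32}_{\text{intrinsic (identity)}} + \underbrace{32 \cdot (s-1)}_{\text{ghost maps}} = 64 = n,$$

so half of the output channels come from the ordinary convolution and the other half from the cheap operations.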
[Illustration: the Ghost module first applies an ordinary convolution to obtain the intrinsic feature maps, then cheap linear operations $\Phi$ to generate the ghost feature maps, and concatenates the two.]

Looking at the released code, the structure is as follows: a primary convolution produces the intrinsic feature maps, a depth-wise convolution applied to them produces the ghost feature maps, and the two groups are concatenated.

[Diagram of this structure; the original image is from https://zhuanlan.zhihu.com/p/115844245]

In other words, the cheap operation is implemented as a depth-wise convolution.
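A minimal sketch of this structure in PyTorch. It is a stripped-down version (no BatchNorm or activation) of the official GhostModule quoted in full in Section 6; the class name MiniGhostModule and its default arguments are mine:

import math
import torch
import torch.nn as nn

class MiniGhostModule(nn.Module):
    # Primary 1x1 conv -> intrinsic maps; depth-wise conv -> ghost maps; concatenate.
    def __init__(self, inp, oup, ratio=2, dw_size=3):
        super().__init__()
        self.oup = oup
        init_channels = math.ceil(oup / ratio)       # m = n / s intrinsic channels
        new_channels = init_channels * (ratio - 1)   # m * (s - 1) ghost channels
        self.primary = nn.Conv2d(inp, init_channels, kernel_size=1, bias=False)
        self.cheap = nn.Conv2d(init_channels, new_channels, dw_size,
                               padding=dw_size // 2, groups=init_channels, bias=False)

    def forward(self, x):
        intrinsic = self.primary(x)                  # ordinary (point-wise) convolution
        ghost = self.cheap(intrinsic)                # cheap operation: depth-wise convolution
        out = torch.cat([intrinsic, ghost], dim=1)
        return out[:, :self.oup]                     # trim if oup is not divisible by ratio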

1) How does the proposed Ghost module differ from an ordinary convolution?

  • The Ghost module can have a customized kernel size for the intrinsic→ghost step (for the x→intrinsic step the authors still use a 1×1 conv for efficiency), unlike some lightweight modules that rely heavily on 1×1 convs to cut computation.
  • Existing lightweight modules use point-wise convs to process features across channels and depth-wise convs to handle spatial information; the Ghost module first uses an ordinary convolution to produce the intrinsic features, then utilizes cheap linear operations to augment the features and increase the channels.
  • The operations in other modules are limited to depth-wise convolutions or shifts, whereas the Ghost module allows general linear operations (e.g. affine transformations, wavelet transformations, and convolutions such as smoothing, blurring, motion, etc.), so the features can be more diverse.
  • The identity mapping is paralleled with the linear transformations at the module level, not at the bottleneck level (i.e. it is not the residual shortcut).

2) What is the complexity of the Ghost module?

A Ghost module contains one identity mapping per intrinsic feature map and $m \cdot (s-1) = \frac{n}{s} \cdot (s-1)$ linear operations.

Compared with an ordinary convolution, the theoretical speed-up ratio of FLOPs is

$$r_s = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k \;+\; (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d}.$$

The two terms in the denominator are, respectively, the ordinary convolution with $c$ input channels and $m = \frac{n}{s}$ output channels, and the $s-1$ linear operations (e.g. depth-wise convolutions) applied to each channel of the $m$ intrinsic feature maps, where the averaged kernel size of each linear operation is $d \times d$.

Since $k \approx d$ and $s \ll c$, this simplifies to $r_s \approx \frac{s \cdot c \cdot k \cdot k}{c \cdot k \cdot k + (s-1) \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s$.

We suggest taking linear operations of the same size (e.g. 3×3 or 5×5) in one Ghost module for efficient implementation.

The compression ratio of parameters is similar:

$$r_c = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k \;+\; (s-1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s.$$

So both FLOPs and parameters are reduced by roughly a factor of $s$ (the number of transformations per intrinsic feature map, counting the identity).

larger s s s leads to larger compression and speed-up ratio
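A quick numerical sanity check of these ratios; the sizes below are illustrative (they do not come from the paper), and the spatial size $h' \times w'$ cancels out of $r_s$ anyway:

# Speed-up (r_s) and compression (r_c) ratios of a Ghost module vs. an ordinary convolution.
def ghost_ratios(c, k, d, s, n=256, h=28, w=28):
    ordinary_flops = n * h * w * c * k * k
    ghost_flops = (n // s) * h * w * c * k * k + (s - 1) * (n // s) * h * w * d * d
    ordinary_params = n * c * k * k
    ghost_params = (n // s) * c * k * k + (s - 1) * (n // s) * d * d
    return ordinary_flops / ghost_flops, ordinary_params / ghost_params

r_s, r_c = ghost_ratios(c=256, k=3, d=3, s=2)
print(round(r_s, 2), round(r_c, 2))  # both ~1.99, i.e. close to s = 2 since s << c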

4.2 Building Efficient CNNs

1)Ghost Bottlenecks

[Figure: the Ghost bottleneck, shown for stride = 1 and stride = 2.]
A Ghost bottleneck consists of two stacked Ghost modules:

the first acts as an expansion layer, increasing the number of channels;

the second reduces the number of channels to match the shortcut path.

The second Ghost module does not use ReLU, following the MobileNetV2 idea of not applying ReLU when the number of channels is small; for stride 2, a stride-2 depth-wise convolution is inserted between the two Ghost modules (see the code in Section 6).

2)GhostNet

[Table: the overall GhostNet architecture, built by stacking Ghost bottlenecks.]
GhostNet-$\alpha$: multiply the number of channels in every layer by a width factor $\alpha$.

5 Experiments

5.1 Datasets

  • CIFAR-10
  • ImageNet ILSVRC 2012
  • MS COCO object detection

5.2 Efficiency of Ghost Module

1)Toy Experiments
The linear operation used in the toy experiment is a depth-wise convolution.
there are strong correlations between feature maps in deep neural networks and these redundant feature maps could be generated from several intrinsic feature maps.

Irregular modules (mixing various kinds of linear operations) would reduce the efficiency of computing units, so the authors recommend fixing $d$ and using depth-wise convolutions as the cheap operation in practice.

2)CIFAR-10

a)Analysis on Hyper-parameters

Fix $s = 2$ (two branches) and ablate $d$, the kernel size of the depth-wise convolution in the non-identity branch.

$d = 3$ works best: a 1×1 kernel cannot introduce spatial information, while $d = 5$ or $d = 7$ leads to overfitting and more computation.

Then fix $d = 3$ and ablate $s$.

FLOPs and parameters drop significantly as $s$ increases, while the accuracy decreases only gradually.

larger s s s leads to larger compression and speed-up ratio

b)Comparison with State-of-the-arts
c)Visualization of Feature Maps

Although the generated feature maps come from the primary (intrinsic) feature maps, they do show significant differences, which means the generated features are flexible enough to satisfy the needs of the specific task.


3)Large Models on ImageNet

Comparison of applying the Ghost module to large models (as a compression technique) against other compression methods.

5.3 GhostNet on Visual Benchmarks

1)ImageNet Classification

No lengthy commentary needed: SOTA accuracy-latency trade-off.

Actual Inference Speed


2)Object Detection
Roughly on par with MobileNetV3.

6 Conclusion (own) / Future Work

  • Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/ghostnet_pytorch#g-ghostnet (a quick shape check is sketched after this list)
import math

import torch
import torch.nn as nn


class GhostModule(nn.Module):
    def __init__(self, inp, oup, kernel_size=1, ratio=2, dw_size=3, stride=1, relu=True):
        super(GhostModule, self).__init__()
        self.oup = oup
        init_channels = math.ceil(oup / ratio)       # number of intrinsic channels, m = n / s
        new_channels = init_channels*(ratio-1)       # number of ghost channels, m * (s - 1)

        # Primary convolution: produces the intrinsic feature maps
        self.primary_conv = nn.Sequential(
            nn.Conv2d(inp, init_channels, kernel_size, stride, kernel_size//2, bias=False),
            nn.BatchNorm2d(init_channels),
            nn.ReLU(inplace=True) if relu else nn.Sequential(),
        )

        # Cheap operation: depth-wise convolution producing the ghost feature maps
        self.cheap_operation = nn.Sequential(
            nn.Conv2d(init_channels, new_channels, dw_size, 1, dw_size//2, groups=init_channels, bias=False),
            nn.BatchNorm2d(new_channels),
            nn.ReLU(inplace=True) if relu else nn.Sequential(),
        )

    def forward(self, x):
        x1 = self.primary_conv(x)                    # intrinsic feature maps
        x2 = self.cheap_operation(x1)                # ghost feature maps
        out = torch.cat([x1, x2], dim=1)
        return out[:, :self.oup, :, :]               # keep exactly oup channels


class GhostBottleneck(nn.Module):
    """ Ghost bottleneck w/ optional SE"""

    def __init__(self, in_chs, mid_chs, out_chs, dw_kernel_size=3,
                 stride=1, act_layer=nn.ReLU, se_ratio=0.):
        super(GhostBottleneck, self).__init__()
        has_se = se_ratio is not None and se_ratio > 0.
        self.stride = stride

        # Point-wise expansion
        self.ghost1 = GhostModule(in_chs, mid_chs, relu=True)

        # Depth-wise convolution
        if self.stride > 1:
            self.conv_dw = nn.Conv2d(mid_chs, mid_chs, dw_kernel_size, stride=stride,
                             padding=(dw_kernel_size-1)//2,
                             groups=mid_chs, bias=False)
            self.bn_dw = nn.BatchNorm2d(mid_chs)

        # Squeeze-and-excitation (SqueezeExcite is defined elsewhere in the official repo)
        if has_se:
            self.se = SqueezeExcite(mid_chs, se_ratio=se_ratio)
        else:
            self.se = None

        # Point-wise linear projection
        self.ghost2 = GhostModule(mid_chs, out_chs, relu=False)
        
        # shortcut
        if (in_chs == out_chs and self.stride == 1):
            self.shortcut = nn.Sequential()
        else:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_chs, in_chs, dw_kernel_size, stride=stride,
                       padding=(dw_kernel_size-1)//2, groups=in_chs, bias=False),
                nn.BatchNorm2d(in_chs),
                nn.Conv2d(in_chs, out_chs, 1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(out_chs),
            )


    def forward(self, x):
        residual = x

        # 1st Ghost module (expansion)
        x = self.ghost1(x)

        # Depth-wise convolution
        if self.stride > 1:
            x = self.conv_dw(x)
            x = self.bn_dw(x)

        # Squeeze-and-excitation
        if self.se is not None:
            x = self.se(x)

        # 2nd Ghost module (projection, no ReLU)
        x = self.ghost2(x)
        
        x += self.shortcut(residual)
        return x
  • Many kinds of linear transformations could in principle be applied in parallel, but the released implementation uses only one: the depth-wise convolution.
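A minimal shape check of the two blocks quoted above (the sizes are illustrative; se_ratio is left at 0 so the SqueezeExcite class is not needed):

# Requires the imports at the top of the code excerpt (math, torch, torch.nn).
x = torch.randn(1, 16, 32, 32)

ghost = GhostModule(inp=16, oup=32, kernel_size=1, ratio=2, dw_size=3)
print(ghost(x).shape)   # torch.Size([1, 32, 32, 32]): 16 intrinsic + 16 ghost channels

bneck = GhostBottleneck(in_chs=16, mid_chs=48, out_chs=24, dw_kernel_size=3,
                        stride=2, se_ratio=0.)
print(bneck(x).shape)   # torch.Size([1, 24, 16, 16]): spatial size halved by the stride-2 DW conv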

An interesting interpretation excerpted from other write-ups: the Ghost module is basically a depth-wise separable convolution run in reverse order (point-wise convolution first, then depth-wise convolution), with the intermediate output kept and concatenated instead of discarded. Hard to argue with that.
