Paper: https://arxiv.org/abs/1911.11907
Code: https://github.com/huawei-noah/ghostnet
Abstract (translation)
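Deploying convolutional neural networks (CNNs) on embedded devices is difficult because of limited memory and computation resources. Redundancy in feature maps is an important characteristic of successful CNNs, but it has rarely been investigated in neural architecture design. This paper proposes a novel Ghost module that generates more feature maps from cheap operations. Starting from a set of intrinsic feature maps, a series of low-cost linear transformations is applied to generate many ghost feature maps that can fully reveal the information underlying the intrinsic features. The proposed Ghost module can be used as a plug-and-play component to upgrade existing convolutional neural networks. Ghost bottlenecks are designed to stack Ghost modules, and the lightweight GhostNet can then be built easily. Experiments on benchmarks show that the proposed Ghost module is an impressive alternative to the convolution layers in baseline models, and that GhostNet achieves higher recognition performance than MobileNetV3 (e.g. 75.7% top-1 accuracy) at a similar computational cost on the ImageNet ILSVRC-2012 classification dataset.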
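2. Background
When convolutions extract features, some of the resulting feature maps are highly similar to one another. In the figure from the paper, for example, the two feature maps joined by a wrench are similar; this is feature redundancy in a neural network. The paper argues that, instead of avoiding redundant feature maps, we should embrace them, but in a cost-efficient way. In other words, cheap computation is used to generate the redundant (highly similar) feature maps.
From the paper: Redundancy in feature maps could be an important characteristic of a successful deep neural network. Instead of avoiding the redundant feature maps, we tend to embrace them, but in a cost-efficient way.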
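3. Evaluation metric (FLOPs)
From the paper:
Traditional CNNs usually need a large number of parameters and floating-point operations (FLOPs) to reach satisfactory accuracy. ResNet-50, for example, has about 25.6M parameters and requires 4.1B FLOPs to process a 224 × 224 image. Efficient neural architecture design therefore has very high potential for building efficient deep networks with fewer parameters and less computation.
MobileNets are a series of lightweight deep neural networks based on depthwise separable convolutions. MobileNetV2 proposed the inverted residual block, and MobileNetV3 further used AutoML techniques to achieve better performance with fewer FLOPs. ShuffleNet introduced the channel shuffle operation to improve the exchange of information between channel groups. ShuffleNetV2 further considered actual speed on the target hardware when designing compact models. Although these models obtain good performance with very few FLOPs, the correlation and redundancy between feature maps have never been well exploited.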
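Computing FLOPs
In practice, given input data $X \in \mathbb{R}^{c \times h \times w}$, where $c$ is the number of input channels and $h$ and $w$ are the height and width of the input, an arbitrary convolutional layer that produces $n$ feature maps can be written as $Y = X * f + b$,
where $*$ is the convolution operation, $b$ is the bias term, $Y \in \mathbb{R}^{h' \times w' \times n}$ is the output feature map with $n$ channels, and $f \in \mathbb{R}^{c \times k \times k \times n}$ holds the convolution filters (kernels) of this layer. $h'$ and $w'$ are the height and width of the output feature map, and $k \times k$ is the kernel size of the filters $f$.
The number of FLOPs required by this convolution is $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$. There are $n$ kernels, each of size $c \cdot k \cdot k$ (typically $k = 3$), so $n \cdot c \cdot k \cdot k$ is the cost of applying all kernels at one output position, and $h' \cdot w'$ is the size of the output feature map. Because the number of filters $n$ and the number of channels $c$ are usually very large (e.g. 256 or 512), the FLOP count is often as large as hundreds of thousands.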
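The count $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$ can be checked with a few lines of Python. This is a minimal sketch: the function name `conv_flops` and the example layer sizes are assumptions chosen for illustration, and, like the formula, it counts one multiply-accumulate as one FLOP and ignores the bias term.

```python
def conv_flops(n, c, k, h_out, w_out):
    """FLOPs of an ordinary convolution: n * h' * w' * c * k * k.

    n      -- number of output feature maps (filters)
    c      -- number of input channels
    k      -- kernel size (k x k)
    h_out  -- output height h'
    w_out  -- output width w'
    """
    return n * h_out * w_out * c * k * k


# Hypothetical layer: 256 -> 256 channels, 3x3 kernel, 14x14 output map.
flops = conv_flops(n=256, c=256, k=3, h_out=14, w_out=14)
print(flops)  # 115605504
```

Even this single mid-sized layer costs more than 10^8 operations, which is why reducing the per-layer cost matters on embedded devices.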
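As shown in the figures below, Figure (1) gives the ratio of the FLOPs of an ordinary convolution to those of the Ghost module, and Figure (2) shows the Ghost module.
Figure (1)
Figure (2)
Figure (2): instead of relying on ordinary convolution alone, the Ghost module first uses an ordinary convolution to generate a set of intrinsic feature maps, then applies cheap linear operations to augment the features and increase the number of channels.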
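In the notation of the GhostNet paper, these two steps can be written as follows. This is a sketch of the paper's formulation; $m$ (the number of intrinsic feature maps), $f'$ and $\Phi_{i,j}$ are the paper's symbols and are not defined elsewhere in this post.

```latex
% Step 1: an ordinary convolution produces m intrinsic feature maps (m <= n).
Y' = X * f', \qquad f' \in \mathbb{R}^{c \times k \times k \times m},
\qquad Y' \in \mathbb{R}^{h' \times w' \times m}

% Step 2: cheap linear operations generate s maps from each intrinsic map y'_i.
y_{ij} = \Phi_{i,j}(y'_i), \qquad \forall\, i = 1, \dots, m, \quad j = 1, \dots, s

% The last operation \Phi_{i,s} is the identity mapping, which keeps the
% intrinsic maps themselves, so the module outputs n = m \cdot s feature maps.
```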
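In Figure (1) the denominator is the FLOPs of the Ghost module. Substituting s = 2 shows that the primary convolution and the cheap operation (a depthwise convolution) each produce n/2 feature maps; the cheap operation still costs far fewer FLOPs, because each of its output maps reads only one input channel. My understanding is that s compresses the number of channels output by the convolution by a factor of s, and the cheap operation then expands them by a factor of s-1, so that the concatenated feature map has the same number of channels, n, as the output of the ordinary convolution it replaces.
Suppose the convolution outputs n/s channels and the cheap operation outputs (s-1)·n/s channels; together they give n.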
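The ratio in Figure (1) can be sketched numerically. Following the paper, the Ghost module costs $(n/s) \cdot h' \cdot w' \cdot c \cdot k \cdot k$ FLOPs for the primary convolution plus $(s-1) \cdot (n/s) \cdot h' \cdot w' \cdot d \cdot d$ for the cheap operations, where $d \times d$ is the kernel size of each cheap (depthwise) operation. The function names and layer sizes below are assumptions for illustration, and $n$ is assumed to be divisible by $s$.

```python
def conv_flops(n, c, k, h_out, w_out):
    """FLOPs of an ordinary convolution with n output maps."""
    return n * h_out * w_out * c * k * k


def ghost_flops(n, c, k, h_out, w_out, s=2, d=3):
    """FLOPs of a Ghost module that also outputs n maps."""
    if n % s != 0:
        raise ValueError("this sketch assumes n is divisible by s")
    m = n // s                                   # intrinsic maps from the primary conv
    primary = m * h_out * w_out * c * k * k      # ordinary conv with only n/s filters
    cheap = (s - 1) * m * h_out * w_out * d * d  # depthwise d x d ops, one input channel each
    return primary + cheap


def speedup(n, c, k, h_out, w_out, s=2, d=3):
    """Ratio of ordinary-convolution FLOPs to Ghost-module FLOPs."""
    return conv_flops(n, c, k, h_out, w_out) / ghost_flops(n, c, k, h_out, w_out, s, d)


# Hypothetical layer: 256 -> 256 channels, 3x3 kernels, 14x14 output, s = 2.
ratio = speedup(n=256, c=256, k=3, h_out=14, w_out=14, s=2, d=3)
print(round(ratio, 4))  # 1.9922
```

When $d = k$ the ratio simplifies to $s \cdot c / (s + c - 1)$, which is close to $s$ whenever $c$ is much larger than $s$; here that is $512 / 257 \approx 1.99$.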
4. Network architecture
5. Code
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Export the factory function under the name it is actually defined with.
__all__ = ['ghostnet']

def _make_divisible(v, divisor, min_value=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v

# Activation function (hard sigmoid), used as the SE gate
def hard_sigmoid(x, inplace: bool = False):
    if inplace:
        return x.add_(3.).clamp_(0., 6.).div_(6.)
    else:
        return F.relu6(x + 3.) / 6.

# Squeeze-and-Excitation (SE) attention
class SqueezeExcite(nn.Module):
    def __init__(self, in_chs, se_ratio=0.25, reduced_base_chs=None,
                 act_layer=nn.ReLU, gate_fn=hard_sigmoid, divisor=4, **_):
        super(SqueezeExcite, self).__init__()
        self.gate_fn = gate_fn  # gating function applied to the channel weights
        reduced_chs = _make_divisible((reduced_base_chs or in_chs) * se_ratio, divisor)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv_reduce = nn.Conv2d(in_chs, reduced_chs, 1, bias=True)
        self.act1 = act_layer(inplace=True)
        self.conv_expand = nn.Conv2d(reduced_chs, in_chs, 1, bias=True)

    def forward(self, x):
        x_se = self.avg_pool(x)
        x_se = self.conv_reduce(x_se)
        x_se = self.act1(x_se)
        x_se = self.conv_expand(x_se)
        x = x * self.gate_fn(x_se)
        return x

class ConvBnAct(nn.Module):
    def __init__(self, in_chs, out_chs, kernel_size,
                 stride=1, act_layer=nn.ReLU):
        super(ConvBnAct, self).__init__()
        self.conv = nn.Conv2d(in_chs, out_chs, kernel_size, stride, kernel_size//2, bias=False)
        self.bn1 = nn.BatchNorm2d(out_chs)
        self.act1 = act_layer(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn1(x)
        x = self.act1(x)
        return x

class GhostModule(nn.Module):
    def __init__(self, inp, oup, kernel_size=1, ratio=2, dw_size=3, stride=1, relu=True):
        super(GhostModule, self).__init__()
        self.oup = oup
        init_channels = math.ceil(oup / ratio)
        new_channels = init_channels*(ratio-1)

        # Primary convolution (1x1 by default): compresses the channels and
        # extracts features across channels
        self.primary_conv = nn.Sequential(
            nn.Conv2d(inp, init_channels, kernel_size, stride, kernel_size//2, bias=False),
            nn.BatchNorm2d(init_channels),
            nn.ReLU(inplace=True) if relu else nn.Sequential(),
        )

        # Cheap operation (depthwise convolution): extracts features across
        # spatial positions, one channel at a time
        self.cheap_operation = nn.Sequential(
            nn.Conv2d(init_channels, new_channels, dw_size, 1, dw_size//2, groups=init_channels, bias=False),
            nn.BatchNorm2d(new_channels),
            nn.ReLU(inplace=True) if relu else nn.Sequential(),
        )

    def forward(self, x):
        x1 = self.primary_conv(x)
        x2 = self.cheap_operation(x1)
        out = torch.cat([x1,x2], dim=1)  # dim=1 concatenates along the channel axis
        return out[:,:self.oup,:,:]

class GhostBottleneck(nn.Module):
    """ Ghost bottleneck w/ optional SE"""

    def __init__(self, in_chs, mid_chs, out_chs, dw_kernel_size=3,
                 stride=1, act_layer=nn.ReLU, se_ratio=0.):
        super(GhostBottleneck, self).__init__()
        has_se = se_ratio is not None and se_ratio > 0.
        self.stride = stride

        # Point-wise expansion
        self.ghost1 = GhostModule(in_chs, mid_chs, relu=True)

        # Depth-wise convolution
        if self.stride > 1:
            self.conv_dw = nn.Conv2d(mid_chs, mid_chs, dw_kernel_size, stride=stride,
                                     padding=(dw_kernel_size-1)//2,
                                     groups=mid_chs, bias=False)
            self.bn_dw = nn.BatchNorm2d(mid_chs)

        # Squeeze-and-excitation
        if has_se:
            self.se = SqueezeExcite(mid_chs, se_ratio=se_ratio)
        else:
            self.se = None

        # Point-wise linear projection
        self.ghost2 = GhostModule(mid_chs, out_chs, relu=False)

        # shortcut
        if (in_chs == out_chs and self.stride == 1):
            self.shortcut = nn.Sequential()
        else:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_chs, in_chs, dw_kernel_size, stride=stride,
                          padding=(dw_kernel_size-1)//2, groups=in_chs, bias=False),
                nn.BatchNorm2d(in_chs),
                nn.Conv2d(in_chs, out_chs, 1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(out_chs),
            )

    def forward(self, x):
        residual = x

        # 1st ghost bottleneck
        x = self.ghost1(x)

        # Depth-wise convolution
        if self.stride > 1:
            x = self.conv_dw(x)
            x = self.bn_dw(x)

        # Squeeze-and-excitation
        if self.se is not None:
            x = self.se(x)

        # 2nd ghost bottleneck
        x = self.ghost2(x)

        x += self.shortcut(residual)
        return x

class GhostNet(nn.Module):
    def __init__(self, cfgs, num_classes=1000, width=1.0, dropout=0.2):
        super(GhostNet, self).__init__()
        # setting of inverted residual blocks
        self.cfgs = cfgs
        self.dropout = dropout

        # building first layer
        output_channel = _make_divisible(16 * width, 4)
        self.conv_stem = nn.Conv2d(3, output_channel, 3, 2, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(output_channel)
        self.act1 = nn.ReLU(inplace=True)
        input_channel = output_channel

        # building inverted residual blocks
        stages = []
        block = GhostBottleneck
        for cfg in self.cfgs:
            layers = []
            for k, exp_size, c, se_ratio, s in cfg:
                output_channel = _make_divisible(c * width, 4)
                hidden_channel = _make_divisible(exp_size * width, 4)
                layers.append(block(input_channel, hidden_channel, output_channel, k, s,
                                    se_ratio=se_ratio))
                input_channel = output_channel
            stages.append(nn.Sequential(*layers))

        output_channel = _make_divisible(exp_size * width, 4)
        stages.append(nn.Sequential(ConvBnAct(input_channel, output_channel, 1)))
        input_channel = output_channel

        self.blocks = nn.Sequential(*stages)

        # building last several layers
        output_channel = 1280
        self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.conv_head = nn.Conv2d(input_channel, output_channel, 1, 1, 0, bias=True)
        self.act2 = nn.ReLU(inplace=True)
        self.classifier = nn.Linear(output_channel, num_classes)

    def forward(self, x):
        x = self.conv_stem(x)
        x = self.bn1(x)
        x = self.act1(x)
        x = self.blocks(x)
        x = self.global_pool(x)
        x = self.conv_head(x)
        x = self.act2(x)
        x = x.view(x.size(0), -1)
        if self.dropout > 0.:
            x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.classifier(x)
        return x

def ghostnet(**kwargs):
    """
    Constructs a GhostNet model
    """
    cfgs = [
        # k, t, c, SE, s
        # stage1
        [[3,  16,  16, 0, 1]],
        # stage2
        [[3,  48,  24, 0, 2]],
        [[3,  72,  24, 0, 1]],
        # stage3
        [[5,  72,  40, 0.25, 2]],
        [[5, 120,  40, 0.25, 1]],
        # stage4
        [[3, 240,  80, 0, 2]],
        [[3, 200,  80, 0, 1],
         [3, 184,  80, 0, 1],
         [3, 184,  80, 0, 1],
         [3, 480, 112, 0.25, 1],
         [3, 672, 112, 0.25, 1]
        ],
        # stage5
        [[5, 672, 160, 0.25, 2]],
        [[5, 960, 160, 0, 1],
         [5, 960, 160, 0.25, 1],
         [5, 960, 160, 0, 1],
         [5, 960, 160, 0.25, 1]
        ]
    ]
    return GhostNet(cfgs, **kwargs)

if __name__=='__main__':
    model = ghostnet()
    model.eval()
    print(model)
    input = torch.randn(32,3,320,256)
    y = model(input)
    print(y.size())
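The channel arithmetic inside `GhostModule` is easy to miss, so here is a minimal pure-Python sketch of it. The helper name `ghost_channels` is hypothetical; it only mirrors the `init_channels`, `new_channels` and final slicing logic of the code above and does not run any convolution.

```python
import math


def ghost_channels(oup, ratio=2):
    """Return (init_channels, new_channels, dropped) for a GhostModule.

    init_channels -- maps produced by primary_conv
    new_channels  -- maps produced by cheap_operation
    dropped       -- surplus maps removed by the final out[:, :oup] slice
    """
    init_channels = math.ceil(oup / ratio)
    new_channels = init_channels * (ratio - 1)
    concatenated = init_channels + new_channels
    dropped = concatenated - oup
    return init_channels, new_channels, dropped


print(ghost_channels(16))            # (8, 8, 0)
print(ghost_channels(17))            # (9, 9, 1)
print(ghost_channels(20, ratio=3))   # (7, 14, 1)
```

When `oup` is not a multiple of `ratio`, rounding up produces slightly too many maps after concatenation, which is why `forward` slices the result back to `oup` channels.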
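Model compression (performance usually depends on the given pre-trained model)
pruning connection: removes unimportant connections between neurons
channel pruning: removes useless channels
model quantization: represents weights or activations with discrete values to compress the network and speed up computation
tensor decomposition: reduces parameters or computation by exploiting the redundancy and low-rank property of the weights
knowledge distillation: uses a larger model to teach a smaller one, improving the smaller model's performance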
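As a rough illustration of one item in this list, model quantization, the sketch below maps floating-point weights to 8-bit integers with a single scale factor (symmetric uniform quantization). This is a generic textbook scheme, not something taken from the GhostNet paper; the function names and the example weights are assumptions.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer levels."""
    return [v * scale for v in q]


weights = [0.6, -1.0, 0.2, 0.0]   # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)  # [76, -127, 25, 0]
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the price paid for storing one byte per weight instead of four.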
References:
https://blog.csdn.net/hhhhhhhhhhwwwwwwwwww/article/details/127994497
https://blog.csdn.net/qq_50489856/article/details/123845657
https://blog.csdn.net/weixin_44791964/article/details/120884617