Paper Digest: Going Deeper with Convolutions

Going Deeper with Convolutions

Abstract:

Increase the depth and width of the network while keeping the computational budget constant.

Introduction:

One encouraging news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures.
The main drivers of recent progress in deep learning: new ideas, algorithms, and improved network architectures, not just bigger hardware, datasets, and models.

For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up to be a purely academic curiosity, but could be put to real world use, even on large datasets, at a reasonable cost.
The models are designed under a fixed inference budget (about 1.5 billion multiply-adds), so they are not merely an academic curiosity but can be deployed in the real world, even on large datasets, at a reasonable cost.
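For a rough sense of what a 1.5-billion-multiply-add budget means, the multiply-add count of one convolutional layer can be estimated as output_height × output_width × output_channels × (k × k × input_channels). A minimal sketch in plain Python, using an illustrative layer shape rather than one of GoogLeNet's actual layers:

```python
def conv_multiply_adds(h_out, w_out, c_in, c_out, k):
    """Approximate multiply-adds of one k x k convolution (bias ignored)."""
    return h_out * w_out * c_out * (k * k * c_in)

# Example: a 3x3 convolution on a 28x28x192 feature map producing 128 channels.
print(conv_multiply_adds(28, 28, 192, 128, 3))  # 173,408,256, i.e. ~0.17 billion multiply-adds
```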

In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous “we need to go deeper” internet meme.
The name and part of the inspiration come from Network in Network (plus the "we need to go deeper" meme).

Related Work

convolutional neural networks (CNN) have typically had a standard structure – stacked convolutional layers (optionally followed by contrast normalization and max pooling) are followed by one or more fully-connected layers.
Classic CNN structure: a stack of convolutional layers (optionally with contrast normalization and max pooling in between), followed by one or more fully connected layers.

For larger datasets such as Imagenet, the recent trend has been to increase the number of layers and layer size, while using dropout to address the problem of overfitting.
The current recipe for large datasets: increase network depth and width, and use dropout to counter overfitting.

Network-in-Network: When applied to convolutional layers, the method could be viewed as additional 1×1 convolutional layers followed typically by the rectified linear activation
When NIN is applied to convolutional layers, it amounts to adding extra 1×1 convolutional layers, each typically followed by a ReLU activation.

1 × 1 convolutions have dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks, that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without significant performance penalty.
1×1 convolutions act as dimension-reduction modules that remove computational bottlenecks, allowing both the depth and the width of the network to grow without a significant performance penalty.
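A minimal PyTorch sketch of this dimension-reduction role, with illustrative channel numbers (192 in, 64 out):

```python
import torch
import torch.nn as nn

# A 1x1 convolution followed by ReLU: a per-pixel cross-channel projection
# that shrinks 192 input channels down to 64 before a more expensive layer.
reduce = nn.Sequential(
    nn.Conv2d(192, 64, kernel_size=1),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 192, 28, 28)  # N x C x H x W feature map
print(reduce(x).shape)           # torch.Size([1, 64, 28, 28]): same spatial size, fewer channels
```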

Network In Network

We instantiate the micro neural network with a multilayer perceptron, which is a potent function approximator.
The micro network is instantiated as a multilayer perceptron, which serves as a potent function approximator.

The micro network built from a multilayer perceptron (mlpconv)
With enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers.
The classification layer uses global average pooling over the feature maps, which is easier to interpret and less prone to overfitting than traditional fully connected layers.

Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is by increasing their size.
Two major drawbacks:
Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting. Another drawback of uniformly increased network size is the dramatically increased use of computational resources.
Drawbacks of simply enlarging the network: overfitting and a dramatic increase in computational cost.

The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures, even inside the convolutions.
Proposed remedy: sparse connectivity.

On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. The uniformity of the structure and a large number of filters and greater batch size allow for utilizing efficient dense computation.
The catch: numerical computation on non-uniform sparse data structures is very inefficient on today's hardware, whereas uniform, dense computation can be carried out efficiently.

Clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication.
Proposed approach: cluster the sparse structure into relatively dense submatrices.

Architectural Details

The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.
Approximate and cover the optimal local sparse structure with readily available dense components.

This means, we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1×1 convolutions in the next layer.
Clusters concentrated in a single region (features of different channels over the same image patch) can be covered by a layer of 1×1 convolutions in the next layer (the NIN idea).

However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions.
More spatially spread-out clusters can be covered by convolutions over larger patches (3×3, 5×5), and the number of such patches decreases over larger and larger regions.

At higher layers, where the number of filters is large, even a modest number of 5×5 convolutions produces a very large amount of computation. The problem is compounded when the pooling branch (which keeps the same number of channels as its input) is concatenated with the convolution outputs: the number of output channels can only grow from stage to stage, so the computation blows up within a few layers.

Solution: apply 1×1 convolutions before the expensive 3×3 and 5×5 convolutions to reduce the number of channels and hence the computation.

Parameter comparison
Take a 28×28×192 input feature map, with 64 output channels for the 1×1 branch, 128 for the 3×3 branch, and 32 for the 5×5 branch. In the naive module, the convolution weights amount to 1×1×192×64 + 3×3×192×128 + 5×5×192×32 = 387,072. With 1×1 reduction layers of 96 and 16 channels inserted before the 3×3 and 5×5 convolutions, the weights become 1×1×192×64 + (1×1×192×96 + 3×3×96×128) + (1×1×192×16 + 5×5×16×32) = 157,184, roughly 40% of the original.
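This arithmetic can be checked with a few lines of Python (convolution weights only; biases and the pooling branch are ignored):

```python
# Naive Inception module: 1x1, 3x3 and 5x5 convolutions applied directly to 192 channels.
naive = 1 * 1 * 192 * 64 + 3 * 3 * 192 * 128 + 5 * 5 * 192 * 32

# With 1x1 reductions: 96 channels before the 3x3 branch, 16 before the 5x5 branch.
reduced = (1 * 1 * 192 * 64
           + (1 * 1 * 192 * 96 + 3 * 3 * 96 * 128)
           + (1 * 1 * 192 * 16 + 5 * 5 * 16 * 32))

print(naive, reduced, reduced / naive)  # 387072 157184 ~0.41
```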

GoogLeNet

[Figure: the overall GoogLeNet architecture, built by stacking Inception modules]
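GoogLeNet is built by stacking Inception modules of the kind described above. Below is a minimal PyTorch sketch of a single Inception module with dimension reduction, not the full network; the channel numbers follow the 28×28×192 example above, and the 32-channel pooling-branch projection is taken from the paper's inception(3a) stage.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Sketch of one Inception module with 1x1 dimension reductions."""

    def __init__(self, c_in, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution.
        self.b1 = nn.Sequential(nn.Conv2d(c_in, c1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction, then 3x3 convolution.
        self.b2 = nn.Sequential(
            nn.Conv2d(c_in, c3r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 reduction, then 5x5 convolution.
        self.b3 = nn.Sequential(
            nn.Conv2d(c_in, c5r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling, then 1x1 projection.
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(c_in, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate the four branches along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Channel numbers from the 28x28x192 example: output has 64 + 128 + 32 + 32 = 256 channels.
m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```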

Global Average Pooling (GAP), from NIN

In a traditional CNN, the feature maps produced by the convolutional layers are fed into a fully connected network followed by a softmax layer to perform classification. Fully connected layers make the model prone to overfitting; dropout was later introduced and largely solved this problem. In the NIN paper, the authors propose a different strategy: replace the fully connected layers with global average pooling.
The idea of GAP is that, after the last mlpconv layer, each output feature map is averaged over its spatial dimensions, so every feature map yields a single value. These per-map averages are used directly as class confidences and fed into softmax for classification. Since each feature map corresponds one-to-one with a class, the last mlpconv layer must output exactly as many feature maps as there are classes. The figure below illustrates the global average pooling layer:
[Figure: global average pooling layer]
Global average pooling has several advantages. First, it greatly reduces the number of parameters: traditional fully connected layers contain a huge number of weights, whereas GAP has no parameters to learn at all, and it is also less prone to overfitting than fully connected layers. Second, the authors argue that GAP better matches the convolutional structure: the one-to-one correspondence between feature maps and categories strengthens the link between feature maps and class confidences, making the classification more interpretable. Finally, averaging over the spatial dimensions aggregates spatial information, which makes the model more robust.
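A minimal PyTorch sketch of the NIN-style GAP head described above, with illustrative channel and class counts: the last layer emits one feature map per class, GAP collapses each map to a single score, and softmax turns the scores into class probabilities.

```python
import torch
import torch.nn as nn

num_classes = 10

# Classification head: project to one feature map per class, then average each
# map over its spatial dimensions. GAP itself has no learnable parameters.
head = nn.Sequential(
    nn.Conv2d(256, num_classes, kernel_size=1),  # one feature map per class
    nn.AdaptiveAvgPool2d(1),                     # global average pooling -> N x classes x 1 x 1
    nn.Flatten(),                                # -> N x classes
)

x = torch.randn(4, 256, 7, 7)          # batch of feature maps
probs = torch.softmax(head(x), dim=1)  # class probabilities
print(probs.shape)                     # torch.Size([4, 10])
```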

