Paper Notes: Understanding the GoogLeNet Network Architecture Design

2.Related Work

  • Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure – stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers.

Max-pooling layers result in loss of accurate spatial information

  • For larger datasets such as Imagenet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.
  • Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks.

Additional 1×1 convolutional layers, typically followed by rectified linear activation.

1×1 convolutions are used as dimension-reduction modules to remove computational bottlenecks, allowing the depth and also the width of the network to be increased without a significant performance penalty (a minimal 1×1 sketch follows this list).

  • Regions with Convolutional Neural Networks (R-CNN). R-CNN decomposes the overall detection problem into two subproblems:
  • Utilize low-level cues such as color and superpixel consistency to generate potential object proposals in a category-agnostic fashion;
  • Use CNN classifiers to identify object categories at those locations.
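
A minimal PyTorch sketch of the NIN-style 1×1 convolution used as a dimension-reduction module; the channel counts (192 → 64) and the feature-map size are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# NIN-style 1x1 convolution: a cross-channel linear projection followed by ReLU.
# Channel counts (192 -> 64) are illustrative, not taken from the paper.
reduce = nn.Sequential(
    nn.Conv2d(192, 64, kernel_size=1),  # 1x1 convolution = per-pixel fully connected layer across channels
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 192, 28, 28)   # a hypothetical feature map
y = reduce(x)
print(y.shape)                    # torch.Size([1, 64, 28, 28]) -- spatial size unchanged, channels reduced
```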

3.Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is by increasing their size.
Drawbacks

  • Bigger size typically means a larger number of parameters, making the enlarged network more prone to overfitting.
  • Dramatically increased use of computational resources.
    Fundamental way of solving both issues

    Moving from fully connected to sparsely connected architectures, even inside the convolutions. But how?

    Problem: Computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures.

    Convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break symmetry and improve learning, but the trend changed back to full connections with [9] in order to better optimize parallel computation.

    Question: Whether there is any hope for a next, intermediate step: an architecture that makes use of the extra sparsity, even at filter level, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices.

    Hint from the sparse-matrix literature: clustering sparse matrices into relatively dense submatrices tends to give competitive practical performance.

4.Architectural Details

Main idea

How an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.

  • Find the optimal local construction and repeat it spatially.
  • In the lower layers, correlated units concentrate in local regions; these clusters can be covered by a layer of 1×1 convolutions in the next layer.
  • In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5.
    The suggested architecture is a combination of all those layers, with their output filter banks concatenated into a single output vector forming the input of the next stage.
  • Adding an alternative parallel pooling path (a sketch of this naive module follows the list).
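
Below is a sketch of the naive Inception module just described, written in PyTorch. The branch structure (parallel 1×1, 3×3, 5×5 convolutions plus a pooling path, concatenated along the channel dimension) follows the text above; the channel counts and input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    """Naive Inception module: parallel 1x1, 3x3, 5x5 convolutions and 3x3 max pooling,
    with the output filter banks concatenated along the channel dimension."""
    def __init__(self, in_ch, c1, c3, c5):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.pool    = nn.MaxPool2d(3, stride=1, padding=1)  # parallel pooling path, same spatial size

    def forward(self, x):
        outs = [self.branch1(x), self.branch3(x), self.branch5(x), self.pool(x)]
        return torch.cat(outs, dim=1)  # concatenated into a single output vector

# Illustrative channel counts and input size
x = torch.randn(1, 192, 28, 28)
m = NaiveInception(192, 64, 128, 32)
print(m(x).shape)  # torch.Size([1, 416, 28, 28]) = 64 + 128 + 32 + 192 channels
```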

Problem: A modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters.
Solution: Applying dimension reductions and projections wherever the computational requirements would otherwise increase too much.

Based on the success of embeddings: even low-dimensional embeddings might contain a lot of information about a relatively large image patch.

1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions.
Besides being used as reductions, they also include the use of rectified linear activation which makes them dual-purpose.
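
A sketch of the Inception module with dimension reduction: 1×1 convolutions (each followed by ReLU) compute reductions before the 3×3 and 5×5 convolutions, and a 1×1 projection follows the pooling path. The channel counts are illustrative (chosen to resemble Inception (3a)), not a definitive reproduction of the paper's configuration.

```python
import torch
import torch.nn as nn

def conv_relu(in_ch, out_ch, k, pad=0):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=pad), nn.ReLU(inplace=True))

class InceptionReduce(nn.Module):
    """Inception module with dimension reduction: 1x1 convolutions compute reductions
    before the expensive 3x3 and 5x5 convolutions and project the pooling path."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = conv_relu(in_ch, c1, 1)
        self.b3 = nn.Sequential(conv_relu(in_ch, c3r, 1), conv_relu(c3r, c3, 3, pad=1))
        self.b5 = nn.Sequential(conv_relu(in_ch, c5r, 1), conv_relu(c5r, c5, 5, pad=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1), conv_relu(in_ch, cp, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# Illustrative channel counts (in the spirit of Inception (3a): 64, 96->128, 16->32, pool proj 32)
m = InceptionReduce(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```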

Inception network

Consisting of modules of the above type

Occasional max-pooling layers with stride 2 to halve the resolution of the grid.

For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion (see the skeleton sketch after the list below).

Beneficial aspects:

  • Allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity.
  • It aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.
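
A rough skeleton of this layout, assuming illustrative channel counts: plain convolutions in the lower layers, Inception-style modules only at the higher layers, and occasional stride-2 max pooling to halve the grid resolution. The inception() helper below is a deliberate stand-in, not the real module.

```python
import torch
import torch.nn as nn

def inception(in_ch, out_ch):
    """Stand-in for an Inception module (see the sketch in the previous section);
    a single 3x3 convolution is used purely as a placeholder so the skeleton runs."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

# Skeleton: traditional convolutional stem, then Inception stages separated by
# stride-2 max pooling that halves the grid resolution. Channel counts are illustrative.
net = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),  # traditional lower layers
    nn.MaxPool2d(3, stride=2, padding=1),
    nn.Conv2d(64, 192, 3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2, padding=1),
    inception(192, 256), inception(256, 480),                          # Inception stage
    nn.MaxPool2d(3, stride=2, padding=1),                              # halve the resolution
    inception(480, 512),
)
print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 14, 14])
```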

5.GoogLeNet

  • All the convolutions, including those inside the Inception modules, use rectified linear activation.
  • The size of the receptive field in our network is 224×224 taking RGB color channels with mean subtraction.
  • All these reduction/projection layers use rectified linear activation as well.
  • The use of average pooling before the classifier is based on [12], although our implementation differs in that we use an extra linear layer.
    Enables adapting and fine-tuning networks for other label sets easily
  • A move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%.
  • The use of dropout remained essential even after removing the fully connected layers
    Details of each auxiliary classifier (a code sketch follows below):
  • An average pooling layer with 5×5 filter size and stride 3
  • A 1×1 convolution with 128 filters (for dimension reduction) followed by ReLU
  • A fully connected layer with 1024 units and ReLU
  • A dropout layer with a 70% ratio of dropped outputs
  • A linear layer with softmax loss as the classifier
    Note: auxiliary classifiers are of little use at the lowest layers.
  • Adding auxiliary classifiers connected to these intermediate layers is expected to encourage discrimination in the lower stages of the classifier, increase the gradient signal that gets propagated back, and provide additional regularization.
    During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3).
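
A sketch of one auxiliary classifier built from the details above. The 14×14×512 input size and the 1000-class output are assumptions for illustration (one of GoogLeNet's intermediate feature maps and the ImageNet label set).

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Auxiliary classifier attached to an intermediate Inception output.
    Structure follows the notes above; the 14x14x512 input is an assumed example."""
    def __init__(self, in_ch=512, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)   # 14x14 -> 4x4
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)    # 1x1 convolution, 128 filters
        self.relu = nn.ReLU(inplace=True)
        self.fc1  = nn.Linear(128 * 4 * 4, 1024)            # 1024-unit fully connected layer + ReLU
        self.drop = nn.Dropout(p=0.7)                       # 70% dropout
        self.fc2  = nn.Linear(1024, num_classes)            # linear layer; softmax applied via the loss

    def forward(self, x):
        x = self.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.drop(self.relu(self.fc1(x)))
        return self.fc2(x)                                  # logits for nn.CrossEntropyLoss

aux = AuxClassifier()
logits = aux(torch.randn(2, 512, 14, 14))
# During training: total_loss = main_loss + 0.3 * aux_loss (discount weight from the paper)
```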

6.Training Methodology

Parameter setting

  • Asynchronous stochastic gradient descent with 0.9 momentum and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs).
  • Image sampling: patches of various sizes covering 8% to 100% of the image area with aspect ratios between 3/4 and 4/3, plus photometric distortions to combat overfitting.

7.Conclusion

  • Approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision.

More

[Figure: the Inception module]
Notes:

  • 1. Using convolution kernels of different sizes means different receptive field sizes; concatenating the outputs at the end fuses features at different scales.
  • 2. Kernel sizes of 1, 3 and 5 are used mainly to make alignment convenient: with stride = 1, setting pad = 0, 1, 2 respectively yields feature maps with identical spatial dimensions, which can then be concatenated directly (a shape check follows this list).
  • 3. The paper notes that pooling has proven effective in many settings, so a pooling path is also embedded in the Inception module.
  • 4. The deeper the network, the more abstract the features and the larger the receptive field each feature covers, so the proportion of 3×3 and 5×5 convolutions should increase with depth.
    However, 5×5 convolutions still bring a large amount of computation. Borrowing from NIN, 1×1 convolutions are used for dimension reduction.
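
A quick shape check of point 2: with stride 1 and pad = 0, 1, 2 for the 1×1, 3×3 and 5×5 kernels, all branches keep the same spatial size and can be concatenated directly. The channel counts and input size are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 192, 28, 28)                       # arbitrary input feature map
b1 = nn.Conv2d(192, 64, kernel_size=1, stride=1, padding=0)(x)
b3 = nn.Conv2d(192, 64, kernel_size=3, stride=1, padding=1)(x)
b5 = nn.Conv2d(192, 64, kernel_size=5, stride=1, padding=2)(x)
print(b1.shape, b3.shape, b5.shape)                   # all [1, 64, 28, 28]
out = torch.cat([b1, b3, b5], dim=1)                  # concatenation works because H and W match
print(out.shape)                                      # [1, 192, 28, 28]
```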

[Figure: Inception module with 1×1 dimension reduction]

General Design Principles

The following design principles come from extensive experiments and therefore involve some speculation, but in practice they have proven largely effective.

  • 1. Avoid representational bottlenecks, especially early in the network. The information flowing forward obviously should not pass through layers that compress it too heavily, i.e. representational bottlenecks. From input to output, the width and height of the feature maps generally shrink gradually, but they should not shrink abruptly; starting right away with kernel = 7, stride = 5, for example, is clearly inappropriate.

    In addition, the number of output channels (each layer's num_output) generally increases gradually, otherwise the network becomes hard to train. (Feature dimensionality does not equal the amount of information; it only serves as a rough estimate of it.)

  • 2. Higher-dimensional representations are easier to process. Higher-dimensional features are easier to disentangle, which speeds up training.

  • 3. Spatial aggregation can be done over lower-dimensional embeddings without losing much information. For example, before a 3×3 convolution, the input can first be dimension-reduced without serious consequences. Assuming the information can be easily compressed, training also speeds up.
  • 4. Balance the width and depth of the network.
    These principles cannot be applied directly to improve network quality; they only serve as high-level guidance.

What are the yellow 1×1 convolution modules for?

  • In the NIN structure, both the first 3×3 convolution and the added 1×1 convolutions are followed immediately by an activation function (e.g. ReLU). Chaining two convolutions in series in this way composes more nonlinear features.
  • Using 1×1 convolutions for dimension reduction lowers the computational complexity (a rough multiply count follows below).
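
A rough multiply count illustrating the second point. The sizes (a 28×28 feature map, 192 input channels, 32 output 5×5 filters, reduction to 16 channels) are assumptions for illustration only.

```python
# Multiply count for a 5x5 convolution with and without a 1x1 reduction.
# All sizes are illustrative assumptions, not values from the paper.
H, W, C_in, C_out, C_red = 28, 28, 192, 32, 16

direct  = H * W * C_in * 5 * 5 * C_out                      # 5x5 directly on 192 channels
reduced = H * W * C_in * 1 * 1 * C_red \
        + H * W * C_red * 5 * 5 * C_out                     # 1x1 reduction to 16 channels, then 5x5

print(f"direct : {direct:,}")                               # 120,422,400
print(f"reduced: {reduced:,}")                              # 12,443,648
print(f"ratio  : {direct / reduced:.1f}x fewer multiplies") # about 9.7x
```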

Convolving at multiple scales and then aggregating

Explanation 1: Intuitively, convolving at several scales simultaneously extracts features at different scales. Richer features also mean more accurate final classification.

Explanation 2: Exploiting the principle of decomposing a sparse matrix into dense matrix computations to speed up convergence.

For example, the left side of the figure below is a sparse matrix (many elements are 0, scattered unevenly across the matrix). Convolving it with a 2×2 matrix requires computing over every element of the sparse matrix. If, as on the right, the sparse matrix is decomposed into two dense submatrices and each is convolved with the 2×2 matrix, the regions of the sparse matrix that are mostly 0 need not be computed at all, greatly reducing the amount of computation.

[Figure: decomposing a sparse matrix into two dense submatrices]
Applied to Inception, this principle means decomposing along the feature dimension. A traditional convolutional layer convolves its input with kernels of a single scale (e.g. 3×3) and outputs a fixed number of features (e.g. 256), and those 256 output features are spread roughly uniformly over the 3×3 scale, which can be viewed as a sparsely distributed feature set. The Inception module instead extracts features at multiple scales (1×1, 3×3, 5×5), so its 256 output features are no longer uniformly distributed: strongly correlated features are grouped together (e.g. 96 features from the 1×1 branch, 96 from the 3×3 branch, 64 from the 5×5 branch), which can be viewed as several densely distributed sub-feature sets. Because strongly correlated features are grouped together, unrelated, non-essential features are weakened, so for the same 256 outputs the Inception module produces features with less redundant information. Passing such "purer" feature sets layer by layer and finally using them as the input to the backward computation naturally makes convergence faster.


Explanation 3: the Hebbian principle. If two neurons or neural systems always fire together, they form an "association", and the excitation of one promotes the excitation of the other. Applied to the Inception structure, this means grouping strongly correlated features together, which is similar to Explanation 2: separating the 1×1, 3×3 and 5×5 features. Since the ultimate goal of training is to extract independent features, pre-grouping strongly correlated features helps accelerate convergence.

Max pooling

The authors argue that pooling also helps extract features, so a pooling path is included in the module as well. Note that this pooling uses stride = 1, so it does not reduce the spatial size of the data (a quick check follows).
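
A one-line check (under assumed sizes) that a 3×3 max pooling with stride 1 and padding 1 leaves the spatial size unchanged:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)
y = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)(x)  # stride 1: no downsampling
print(x.shape, y.shape)  # both torch.Size([1, 256, 28, 28])
```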

Replacing the fully connected layers with a Global Average Pooling (GAP) layer

Concretely, all values of each feature map are averaged: with n feature maps, this yields n averages, which are fed into the final softmax.

Its benefits:

  • 1. It regularizes over each entire feature map, helping to prevent overfitting;
  • 2. Fully connected layers are no longer needed, which greatly reduces the number of parameters in the network (fully connected layers are usually the most parameter-heavy part), lowering the risk of overfitting;
  • 3. The input image size no longer matters, because the averaging works the same way for any input, whereas traditional fully connected layers must choose their parameter count according to the input size and are therefore not size-agnostic (a minimal sketch follows).
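
A minimal sketch of a global-average-pooling head replacing the fully connected classifier; the 1024 feature maps and 1000 classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Global average pooling head: one average per feature map, then a single
# linear layer feeding the softmax loss. Works for any input spatial size.
num_features, num_classes = 1024, 1000
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),            # n feature maps -> n averages, regardless of H and W
    nn.Flatten(),
    nn.Linear(num_features, num_classes),
)

print(head(torch.randn(2, 1024, 7, 7)).shape)    # torch.Size([2, 1000])
print(head(torch.randn(2, 1024, 14, 14)).shape)  # same head works for a different input size
```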