
2. Related Work

  • Starting with LeNet-5 [10], convolutional neural networks (CNNs) have typically had a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers.

Concern: max-pooling layers result in a loss of accurate spatial information.

  • For larger datasets such as ImageNet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.
  • Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks.

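When applied to convolutional layers, the method can be viewed as additional 1×1 convolutional layers, typically followed by the rectified linear activation.

Here, 1×1 convolutions serve mainly as dimension reduction modules that remove computational bottlenecks, which allows increasing not only the depth but also the width of the networks without a significant performance penalty.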

  • Regions with Convolutional Neural Networks (R-CNN). R-CNN decomposes the overall detection problem into two subproblems:
    1. Utilize low-level cues such as color and superpixel consistency for potential object proposals in a category-agnostic fashion.
    2. Use CNN classifiers to identify object categories at those locations.

3. Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is by increasing their size.

  • Bigger size typically means a larger number of parameters, making the enlarged network more prone to overfitting.
  • Bigger size also dramatically increases the use of computational resources.
    Fundamental way of solving both issues:

    Moving from fully connected to sparsely connected architectures, even inside the convolutions.

    Problem: Computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures.

    Convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning, but the trend changed back to full connections with [9] in order to better optimize parallel computing.

    Question: Is there any hope for a next, intermediate step: an architecture that makes use of the extra sparsity, even at filter level, as suggested by the theory, but exploits current hardware by utilizing computations on dense matrices?

    Clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication.
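The idea of covering a sparse matrix with dense submatrices can be illustrated with a small NumPy sketch. The block layout, the sizes, and the helper name `block_sparse_matmul` are assumptions made for illustration only; they do not come from the paper.

```python
import numpy as np

def block_sparse_matmul(blocks, x, n_rows):
    """Multiply a block-sparse matrix by a vector using only dense products.

    `blocks` maps (row_start, col_start) -> dense 2-D array; everything
    outside the listed blocks is implicitly zero.
    """
    y = np.zeros(n_rows)
    for (r0, c0), block in blocks.items():
        h, w = block.shape
        # Each block is an ordinary dense matrix-vector product.
        y[r0:r0 + h] += block @ x[c0:c0 + w]
    return y

rng = np.random.default_rng(0)
# Hypothetical 6x6 matrix whose non-zeros cluster into two dense 3x3 blocks.
blocks = {(0, 0): rng.normal(size=(3, 3)), (3, 3): rng.normal(size=(3, 3))}
x = rng.normal(size=6)

# Reference: the same matrix stored densely, zeros included.
dense = np.zeros((6, 6))
for (r0, c0), block in blocks.items():
    dense[r0:r0 + 3, c0:c0 + 3] = block

y_blocks = block_sparse_matmul(blocks, x, n_rows=6)
y_dense = dense @ x
```

Both paths give the same result, but the blocked version never touches the zero entries and only issues dense products, which is the kind of computation current hardware handles well.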

4. Architectural Details

Main idea

How an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.

  • Find the optimal local construction and repeat it spatially.
  • In the layers close to the input, correlated units concentrate in local regions, so many clusters end up concentrated in a single region and can be covered by a layer of 1×1 convolutions in the next layer.
  • In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5.
    The suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage.
  • Adding an alternative parallel pooling path
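A minimal sketch of the concatenation step in the naive module, assuming a 28×28 grid and made-up channel counts (none of these numbers are stated in the notes above). The pooling branch keeps the input channel count, so the concatenated output is wider than the input.

```python
import numpy as np

# Assumed sizes, for illustration only.
h, w, c_in = 28, 28, 192

# Placeholder branch outputs: all share the same spatial grid.
branches = {
    "1x1": np.zeros((h, w, 64)),
    "3x3": np.zeros((h, w, 128)),
    "5x5": np.zeros((h, w, 32)),
    "pool": np.zeros((h, w, c_in)),  # pooling does not change channel count
}

# Filter banks are concatenated along the channel axis.
out = np.concatenate(list(branches.values()), axis=-1)
out_channels = out.shape[-1]
```

With these assumed sizes the output has 416 channels against 192 at the input, which hints at why stacking naive modules makes the channel count grow from stage to stage.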

Problem: A modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters.
Solution: Apply dimension reductions and projections wherever the computational requirements would otherwise increase too much.

Based on the success of embeddings: even low-dimensional embeddings might contain a lot of information about a relatively large image patch.

1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions.
Besides being used as reductions, they also include the use of rectified linear activation which makes them dual-purpose.
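The saving from a 1×1 reduction can be estimated by counting multiply-adds. The sketch below assumes stride 1, size-preserving padding, and illustrative layer sizes (a 28×28 grid, 192 input channels, 32 output filters, a reduction to 16 channels); the helper `conv_mult_adds` is hypothetical.

```python
def conv_mult_adds(h, w, c_in, c_out, k):
    """Multiply-adds for a k x k convolution, stride 1, size-preserving padding."""
    return h * w * c_in * c_out * k * k

# Assumed sizes, for illustration only.
H, W, C_IN, C_OUT, C_REDUCE = 28, 28, 192, 32, 16

# 5x5 convolution applied directly to all input channels.
direct = conv_mult_adds(H, W, C_IN, C_OUT, 5)

# 1x1 reduction to C_REDUCE channels, then the 5x5 convolution.
reduced = (conv_mult_adds(H, W, C_IN, C_REDUCE, 1)
           + conv_mult_adds(H, W, C_REDUCE, C_OUT, 5))

ratio = direct / reduced
```

Under these assumptions the direct path needs 120,422,400 multiply-adds and the reduced path 12,443,648, roughly 9.7 times fewer.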

Inception network

A network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid.

For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion.

Beneficial aspects:

  • Allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity.
  • It aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.


  • All the convolutions, including those inside the Inception modules, use rectified linear activation.
  • The size of the receptive field in our network is 224×224 taking RGB color channels with mean subtraction.
  • All these reduction/projection layers use rectified linear activation as well.
  • The use of average pooling before the classifier is based on [12], although our implementation differs in that we use an extra linear layer.
    Enables adapting and fine-tuning networks for other label sets easily
  • A move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%.
  • The use of dropout remained essential even after removing the fully connected layers
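    Some details of the auxiliary classifier:
  • An average pooling layer with 5×5 filter size and stride 3
  • A 1×1 convolution with 128 filters
  • A fully connected layer with 1024 units and ReLU
  • A dropout layer with a 70% ratio of dropped outputs
  • A linear layer with softmax loss as the classifier.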
    Useless in lower layers
  • Adding auxiliary classifiers connected to these intermediate layers is expected to encourage discrimination in the lower stages of the classifier, increase the gradient signal that gets propagated back, and provide additional regularization.
    During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3).
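A small sketch of two pieces of arithmetic in the auxiliary classifier: the grid size after 5×5 average pooling with stride 3, and the discounted total loss. The 14×14 input grid, the loss values, and both helper names are assumptions for illustration.

```python
def pooled_size(n, kernel, stride):
    """Output length of a pooling window without padding (floor mode)."""
    return (n - kernel) // stride + 1

def total_loss(main_loss, aux_losses, aux_weight=0.3):
    """Training loss: main loss plus the discounted auxiliary losses."""
    return main_loss + aux_weight * sum(aux_losses)

# Assumed 14x14 feature map feeding the auxiliary head.
side = pooled_size(14, kernel=5, stride=3)

# Made-up loss values, only to show the weighting.
loss = total_loss(1.0, [2.0, 3.0])
```

Here the pooled grid is 4×4, and the two auxiliary losses contribute 0.3 × (2.0 + 3.0) = 1.5 on top of the main loss.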

6. Training Methodology

Parameter setting

  • Use asynchronous stochastic gradient descent
  • Image sampling


  • Approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision.



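  • 1. Using convolution kernels of different sizes means receptive fields of different sizes; the final concatenation fuses features at different scales.
  • 2. Kernel sizes 1, 3 and 5 are used mainly to make alignment easy. With the convolution stride set to 1, setting pad = 0, 1 and 2 respectively yields feature maps of the same dimensions, which can then be concatenated directly.
  • 3. The paper notes that pooling has proven effective in many places, so it is embedded in Inception as well.
  • 4. The deeper into the network, the more abstract the features and the larger the receptive field each feature covers, so the ratio of 3x3 and 5x5 convolutions should increase with depth.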
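The alignment argument in point 2 follows from the standard output-size formula for a convolution. A minimal check, assuming a 28-pixel input side (the helper `conv_out_size` and the input size are illustrative):

```python
def conv_out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1

n = 28  # assumed input width/height

# Odd kernels 1, 3, 5 with pad 0, 1, 2 all preserve the input size.
sizes = {k: conv_out_size(n, kernel=k, stride=1, pad=(k - 1) // 2)
         for k in (1, 3, 5)}

# An even kernel has no symmetric padding that preserves the size.
even = conv_out_size(n, kernel=4, stride=1, pad=1)
```

All three odd kernels return 28, so their outputs can be concatenated along the channel axis, while the 4×4 kernel with pad 1 returns 27 and would not line up.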


General Design Principles


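  • 1. Avoid representational bottlenecks, especially early in the network. In the forward pass, information should not flow through highly compressed layers, i.e. representational bottlenecks. From input to output, the width and height of the feature maps generally shrink gradually, but they should not become very small all at once. For example, starting with kernel = 7, stride = 5 is clearly inappropriate.


  • 2. Higher-dimensional features are easier to process. Higher-dimensional features are easier to separate, which speeds up training.

  • 3. Spatial aggregation can be done over lower-dimensional embeddings without worrying about losing much information. For example, before a 3x3 convolution, the input can first be reduced in dimension without serious consequences. If the information can be easily compressed, training speeds up.
  • 4. Balance the width and depth of the network.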


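  • In the NIN structure, both the first 3x3 convolution and the added 1x1 convolution are immediately followed by an activation function (e.g. ReLU). Chaining the two convolutions yields more nonlinear feature combinations.
  • Using 1x1 convolutions for dimension reduction lowers the computational complexity.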
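A 1x1 convolution is the same linear map applied independently at every pixel, so it can be written as a matrix product over the channel axis. The sketch below assumes a channels-last layout and illustrative sizes (192 input channels reduced to 16); `conv1x1_relu` is a hypothetical helper.

```python
import numpy as np

def conv1x1_relu(x, weight, bias):
    """1x1 convolution followed by ReLU.

    x: (H, W, C_in), weight: (C_in, C_out), bias: (C_out,).
    """
    # Matrix product over the last axis mixes channels at each pixel.
    return np.maximum(x @ weight + bias, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28, 192))     # assumed feature map
w = rng.normal(size=(192, 16)) * 0.1   # reduce 192 -> 16 channels
b = np.zeros(16)

y = conv1x1_relu(x, w, b)
```

The spatial grid is unchanged while the channel count drops from 192 to 16, and the ReLU adds a nonlinearity at no extra spatial cost.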


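Explanation 1: Intuitively, convolving at several scales at once extracts features at different scales. Richer features also mean a more accurate final classification decision.

Explanation 2: It uses the principle that a sparse matrix can be decomposed into dense matrices for computation, which speeds up convergence.



Explanation 3: The Hebbian principle. If two neurons or systems of neurons are always excited at the same time, they form an 'assembly', in which the excitation of one neuron promotes the excitation of the other. Applied to the Inception structure, this means gathering strongly correlated features together. This is somewhat similar to Explanation 2 above: the 1x1, 3x3 and 5x5 features are kept separate. Because the ultimate goal of training is to extract independent features, gathering strongly correlated features in advance speeds up convergence.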

Max pooling


Replacing the fully connected layers with a Global Average Pooling (GAP) layer



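  • 1. It regularizes over the entire feature map, which prevents overfitting.
  • 2. Fully connected layers are no longer needed, which reduces the number of parameters in the whole structure (fully connected layers usually hold the most parameters) and lowers the chance of overfitting.
  • 3. The input image size no longer matters: whatever the input, the same averaging is applied, whereas a traditional fully connected layer must choose its parameter count according to the input size and so does not generalize.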
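Points 2 and 3 can be checked with a short NumPy sketch. The feature-map sizes (7×7 and 10×10 with 1024 channels), the 1000 output classes, and both helper names are assumptions for illustration.

```python
import numpy as np

def global_average_pool(feature_map):
    """Average each channel over its spatial positions: (H, W, C) -> (C,)."""
    return feature_map.mean(axis=(0, 1))

def fc_param_count(h, w, c, n_out):
    """Weights plus biases of a fully connected layer on a flattened map."""
    return h * w * c * n_out + n_out

rng = np.random.default_rng(0)
# Different spatial sizes give the same output length after GAP.
small = global_average_pool(rng.normal(size=(7, 7, 1024)))
large = global_average_pool(rng.normal(size=(10, 10, 1024)))

fc_on_flat = fc_param_count(7, 7, 1024, 1000)    # FC directly on 7x7x1024
fc_after_gap = fc_param_count(1, 1, 1024, 1000)  # linear layer after GAP
```

Both inputs pool to a 1024-element vector, and under these assumptions the classifier shrinks from 50,177,000 parameters to 1,025,000.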




