Network In Network: Cross-Channel Parametric Pooling Layer

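I. Main work:

Proposed in 2014, "Network In Network" (NIN) is a deep network structure designed to improve the model's abstraction ability within each local receptive field.

A multilayer perceptron (MLP) is an effective function approximator. NIN uses an MLP to build a structurally more complex micro neural network that abstracts the data within a receptive field. As in a CNN, feature maps are obtained by sliding the micro network (the MLP) over the input and feeding the result to the next layer. The overall NIN structure is a stack of multiple mlpconv layers. With local modeling strengthened by the micro network, the classification layer applies global average pooling over the feature maps, which is easier to interpret and less prone to overfitting than traditional fully connected layers.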

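Innovations:

1. The Mlpconv layer, which has a stronger abstraction ability

2. The global average pooling layer

Advantages of NIN:

(1) Better local abstraction

(2) No fully connected layers, hence fewer parameters

(3) Less overfitting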

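1. The Mlpconv layer

The convolution layer in a traditional CNN takes the inner product of a linear filter with each image patch and follows each local output with a nonlinear activation function, producing a feature map. Such a convolution filter is a generalized linear model. Feature extraction with a CNN therefore implicitly assumes that the features are linearly separable, whereas in real problems they often are not.

GLM: generalized linear model

The abstraction ability of a GLM is fairly low, so it is natural to replace it with a model that abstracts better and thereby increase the expressive power of a traditional CNN.

Abstraction: obtaining features that are invariant to variants of the same concept.

Which models have a higher level of abstraction? Nonlinear function approximators, which are more expressive than linear models (for example, an MLP or a radial basis network).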

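Advantages of the MLP:
(1) It is a very effective universal function approximator
(2) It can be trained with backpropagation (BP), so it is compatible with the structure of convolutional neural networks

(3) It is itself a deep model, which allows feature re-use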

MLP:

The mlpconv maps the input local patch to the output feature vector with a multilayer perceptron (MLP) consisting of multiple fully connected layers with nonlinear activation functions. The MLP is shared among all local receptive fields. It is called “Network In Network” (NIN) as we have micro networks (MLP), which are composing elements of the overall deep network, within mlpconv layers.

NIN is proposed from a more general perspective, the micro network is integrated into CNN structure in pursuit of better abstractions for all levels of features.

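Purpose:

Replacing the traditional conv layer with the Mlpconv layer allows more abstract features to be learned. A traditional convolution layer forms a linear combination of the previous layer and passes it through a nonlinear activation, which gives a generalized linear model (GLM); the authors argue that this rests on the assumption that the features are linearly separable. The Mlpconv layer instead uses a multilayer perceptron, a deeper structure that can approximate any nonlinear function. A high-level abstract feature in the network is one that is invariant to different manifestations of the same concept (By abstraction we mean that the feature is invariant to the variants of the same concept). The micro neural network slides over the input map with shared weights, and the parameters of the Mlpconv layer can likewise be learned with backpropagation.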

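2. The global average pooling layer

A traditional convolutional neural network performs convolution in the lower layers of the network. For classification, the feature maps of the last convolution layer are vectorized and fed into fully connected layers, followed by a softmax logistic regression layer [4][8][11]. This structure treats the convolution layers as feature extractors and classifies the resulting features in the traditional way. However, fully connected layers are prone to overfitting, which hampers the generalization ability of the whole network. Dropout was proposed by Hinton et al. [5] as a regularizer that randomly sets half of the activations of the fully connected layers to zero during training. It improves generalization and largely prevents overfitting.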
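The dropout step described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in the paper: the function name `dropout`, its parameters, and the "inverted" rescaling of the surviving activations are assumptions made for this example (the original formulation rescales at test time instead).

```python
import random


def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Inverted dropout on a flat list of activations (illustrative sketch).

    During training each activation is zeroed with probability p_drop and the
    survivors are scaled by 1 / (1 - p_drop) so the expected value is kept.
    At test time the activations are returned unchanged.
    """
    if not training or p_drop == 0.0:
        return list(activations)
    if rng is None:
        rng = random.Random()
    keep_scale = 1.0 / (1.0 - p_drop)
    return [a * keep_scale if rng.random() >= p_drop else 0.0
            for a in activations]


# Hypothetical activations of a fully connected layer.
hidden = [1.0] * 1000
train_out = dropout(hidden, p_drop=0.5, training=True, rng=random.Random(0))
test_out = dropout(hidden, p_drop=0.5, training=False)
```

With `p_drop=0.5`, roughly half of the entries of `train_out` are zero and the rest are doubled, while `test_out` equals the input.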

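With NIN strengthening the expressive power of the local model, the classification layer can apply global average pooling to the feature maps. This is more meaningful and easier to interpret (the spatial average of each feature map output by the last layer can be read as the confidence of the corresponding category, because the micro network abstracts better local features and so makes the feature maps consistent with the categories), and it is less prone to overfitting (global average pooling is itself a structural regularizer).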

Process: we directly output the spatial average of the feature maps from the last mlpconv layer as the confidence of categories via a global average pooling layer, and then the resulting vector is fed into the softmax layer.

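Advantages of global average pooling:
(1) It enforces the correspondence between feature maps and categories, which the stronger local modeling of the micro network makes possible (so the feature maps can easily be interpreted as category confidence maps)
(2) It is a structural regularizer with no parameters to optimize, so overfitting is avoided at this layer
(3) It sums out the spatial information, so it is more robust to spatial translations of the input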
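A minimal pure-Python sketch of global average pooling followed by softmax may make the process concrete. The function names and the tiny 2x2 feature maps are hypothetical, and a circular column shift stands in for a spatial translation of the input; the point is only that the layer has no parameters and that its output does not depend on where in the map the activations sit.

```python
import math


def global_average_pooling(feature_maps):
    """Average each H x W feature map (nested lists) down to a single number."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shift_columns(fmap):
    """Circularly shift every row one step to the right."""
    return [row[-1:] + row[:-1] for row in fmap]


# One hypothetical feature map per category (3 categories, 2 x 2 maps).
maps = [
    [[0.0, 1.0], [1.0, 2.0]],
    [[3.0, 3.0], [3.0, 3.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
confidences = global_average_pooling(maps)   # one value per category
probabilities = softmax(confidences)         # fed to the loss / prediction
shifted_confidences = global_average_pooling([shift_columns(m) for m in maps])
```

Here `confidences` and `shifted_confidences` are identical, which is the robustness to translation named in point (3), shown under the idealized circular-shift assumption.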

II. Significance of NIN:

① To make up for the limited abstraction ability of the generalized linear model, a traditional CNN relies on an over-complete set of filters, that is, it learns different filters to detect different variants of the same feature. Too many filters, however, place an extra burden on the next layer, which has to consider all combinations of feature variants from the previous layer. Why is NIN valuable? High-level features (the paper uses the word "concept") come from combinations of low-level features, so it helps to abstract each local patch better before the low-level features are combined into high-level ones. (This linear convolution is sufficient for abstraction when the instances of the latent concepts are linearly separable. However, representations that achieve good abstraction are generally highly nonlinear functions of the input data. In conventional CNN, this might be compensated by utilizing an over-complete set of filters [6] to cover all variations of the latent concepts. Namely, individual linear filters can be learned to detect different variations of a same concept. However, having too many filters for a single concept imposes extra burden on the next layer, which needs to consider all combinations of variations from the previous layer [7]. As in CNN, filters from higher layers map to larger regions in the original input. It generates a higher level concept by combining the lower level concepts from the layer below. Therefore, we argue that it would be beneficial to do a better abstraction on each local patch, before combining them into higher level concepts.)

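② The paper also compares NIN with the maxout network, which reduces the number of output feature maps by max pooling over affine feature maps. (Affine feature maps: feature maps obtained directly from linear convolution, without a nonlinear mapping by an activation function.) Maximizing over linear functions yields a piecewise linear function approximator that can approximate any convex function. A maxout network is more powerful than a traditional CNN because it can separate concepts that lie within convex sets. However, it assumes that the features lie in a convex set, which does not always hold in practice.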

III. Network In Network structure

3.1 MLP Convolution Layers

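An MLP replaces the GLM to convolve over the input. The computation performed by an mlpconv layer is as follows:

$$
\begin{aligned}
f^{1}_{i,j,k_1} &= \max\left({w^{1}_{k_1}}^{T} x_{i,j} + b_{k_1},\ 0\right) \\
&\ \ \vdots \\
f^{n}_{i,j,k_n} &= \max\left({w^{n}_{k_n}}^{T} f^{n-1}_{i,j} + b_{k_n},\ 0\right)
\end{aligned}
$$

The nonlinear activation function is the rectified linear unit (ReLU): max(wx + b, 0).

Here n is the number of layers in the multilayer perceptron, k is the channel index, and x_{i,j} is the input patch centered at pixel (i, j). As figure (b) above shows, each neuron produces a single output while its input is multi-dimensional (think of it as multi-channel: at each layer of the micro network the input at one position is a 1*k vector), so the whole process can be viewed as a 1*1*k convolution layer acting on k channels. Later papers often use this technique to reduce the dimensionality of the input (along the channel dimension, not the spatial dimensions of the image); this non-abstracting step compresses multi-dimensional information well.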

Equation 2 is equivalent to cascaded cross channel parametric pooling on a normal convolution layer. Each pooling layer performs weighted linear recombination on the input feature maps, which then go through a rectifier linear unit. The cross channel pooled feature maps are cross channel pooled again and again in the next layers. This cascaded cross channel parametric pooling structure allows complex and learnable interactions of cross channel information.

The cross channel parametric pooling layer is also equivalent to a convolution layer with 1x1 convolution kernel. This interpretation makes it straightforward to understand the structure of NIN.
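The equivalence stated above can be checked numerically on a toy input. The sketch below is an assumption-laden illustration rather than the paper's code: all function names, the tiny 2-channel 3x3 input, and the weights are hypothetical. It computes an mlpconv layer in two ways, once by sliding an MLP over every patch and once as an ordinary convolution followed by a 1x1 convolution, and the two results agree.

```python
def relu_affine(weights, biases, vec):
    """One fully connected layer followed by ReLU: max(W v + b, 0)."""
    return [max(sum(w * v for w, v in zip(row, vec)) + b, 0.0)
            for row, b in zip(weights, biases)]


def mlpconv_sliding(x, layers, k):
    """Slide an MLP over every k x k patch of x (C x H x W nested lists).

    layers is a list of (weights, biases). The first layer consumes the
    flattened patch (length C*k*k); later layers consume the previous output.
    Returns K x H' x W' feature maps (valid convolution, stride 1).
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out_h, out_w = H - k + 1, W - k + 1
    n_out = len(layers[-1][0])
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(n_out)]
    for i in range(out_h):
        for j in range(out_w):
            vec = [x[c][i + di][j + dj]
                   for c in range(C) for di in range(k) for dj in range(k)]
            for weights, biases in layers:
                vec = relu_affine(weights, biases, vec)
            for ch, value in enumerate(vec):
                out[ch][i][j] = value
    return out


def conv_relu(x, weights, biases, k):
    """Ordinary k x k convolution (valid, stride 1) followed by ReLU."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out_h, out_w = H - k + 1, W - k + 1
    out = []
    for row, b in zip(weights, biases):
        fmap = [[0.0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                acc = b
                for c in range(C):
                    for di in range(k):
                        for dj in range(k):
                            acc += row[(c * k + di) * k + dj] * x[c][i + di][j + dj]
                fmap[i][j] = max(acc, 0.0)
        out.append(fmap)
    return out


def mlpconv_as_convs(x, layers, k):
    """Same layer written as a k x k convolution followed by 1x1 convolutions."""
    (w0, b0), rest = layers[0], layers[1:]
    maps = conv_relu(x, w0, b0, k)
    for weights, biases in rest:
        maps = conv_relu(maps, weights, biases, 1)  # cross channel pooling
    return maps


# Hypothetical input: 2 channels of size 3 x 3, with a 2 x 2 receptive field.
x = [[[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 1.0, 0.0]],
     [[0.0, 1.0, 1.0], [1.0, 0.0, 2.0], [3.0, 1.0, 1.0]]]
layers = [
    ([[1.0, 0.0, -1.0, 0.0, 0.0, 1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0, 0.0, -1.0, 0.0, 0.0, 1.0],
      [-1.0, -1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]], [0.0, 1.0, -1.0]),
    ([[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]], [0.0, -2.0]),
]
sliding_out = mlpconv_sliding(x, layers, 2)
conv_out = mlpconv_as_convs(x, layers, 2)
```

The second layer maps 3 channels to 2 at every position, which is also the channel-reduction use of 1x1 convolutions mentioned earlier.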

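Comparison to maxout layers: the feature maps of a maxout layer are computed as follows:

$$
f_{i,j,k} = \max_{m} \left( w_{k_m}^{T} x_{i,j} \right)
$$

Maxout over linear functions forms a piecewise linear function that can model any convex function. For a convex function, the samples whose function values fall below a given threshold form a convex set. By approximating convex functions of the local patch, maxout can therefore form separating hyperplanes for concepts whose samples lie within a convex set. The Mlpconv layer differs from the maxout layer in that the convex function approximator is replaced by a universal function approximator, which is better able to model the various distributions of latent concepts.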
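To illustrate why a maximum over linear functions can model a convex function, the sketch below approximates f(x) = x**2 with a maxout unit whose pieces are tangent lines. The scalar input, the bias terms, the choice of x**2, and the anchor points are all assumptions made for this example and do not come from the paper.

```python
def maxout(x, pieces):
    """Maxout unit on a scalar input: max over affine pieces (w, b) of w*x + b."""
    return max(w * x + b for w, b in pieces)


# Tangent line of f(x) = x**2 at anchor t has slope 2t and intercept -t**2.
anchors = [-2.0, -1.0, 0.0, 1.0, 2.0]
pieces = [(2.0 * t, -t * t) for t in anchors]

xs = [i / 10.0 for i in range(-20, 21)]          # grid on [-2, 2]
approx = [maxout(x, pieces) for x in xs]
max_gap = max(x * x - a for x, a in zip(xs, approx))
```

The approximation never exceeds x**2, and on this grid the largest gap occurs halfway between two anchors. Adding more pieces tightens the fit, but the result is always convex, which is the limitation the mlpconv layer removes.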

3.2 Global Average Pooling

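This paper proposes global average pooling to replace the traditional fully connected layers in a CNN. The idea is to generate one feature map for each category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, the average of each feature map is taken and the resulting vector is fed directly into the softmax layer.

As a structural regularizer, global average pooling explicitly enforces the feature maps to be confidence maps of the categories (explicitly enforces feature maps to be confidence maps of concepts (categories)). This is made possible by the mlpconv layers, which approximate the confidence maps better than GLMs do.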
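A quick parameter count shows what removing the fully connected classifier saves. The layer sizes below (192 feature maps of 8x8, 10 categories) are hypothetical values chosen only for illustration, and the helper names are not from the paper.

```python
def fc_classifier_params(channels, height, width, num_classes):
    """Weights plus biases of one fully connected layer on flattened feature maps."""
    return channels * height * width * num_classes + num_classes


def gap_classifier_params():
    """Global average pooling itself has no trainable parameters."""
    return 0


# Hypothetical sizes, for illustration only.
fc_params = fc_classifier_params(channels=192, height=8, width=8, num_classes=10)
gap_params = gap_classifier_params()
print(fc_params, gap_params)  # 122890 0
```

The pooling step adds nothing to learn; the only requirement is that the last mlpconv layer outputs one feature map per category.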

3.3 Overall structure of NIN

V. Experiments

We evaluate NIN on four benchmark datasets: CIFAR-10 [12], CIFAR-100 [12], SVHN [13] and MNIST [1]. The networks used for the datasets all consist of three stacked mlpconv layers, and the mlpconv layers in all the experiments are followed by a spatial max pooling layer which downsamples the input image by a factor of two. As a regularizer, dropout is applied on the outputs of all but the last mlpconv layers. Unless stated specifically, all the networks used in the experiment section use global average pooling instead of fully connected layers at the top of the network. Another regularizer applied is weight decay as used by Krizhevsky et al. [4]. Figure 2 illustrates the overall structure of NIN network used in this section. The detailed settings of the parameters are provided in the supplementary materials. We implement our network on the super fast cuda-convnet code developed by Alex Krizhevsky [4]. Preprocessing of the datasets, splitting of training and validation sets all follow Goodfellow et al. [8].