【翻译】【ZFNet】Visualizing and Understanding Convolutional Networks

Visualizing and Understanding Convolutional Networks

论文:https://link.springer.com/content/pdf/10.1007/978-3-319-10590-1_53.pdf

Abstract(摘要)

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark Krizhevsky et al. [18]. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
  大型卷积网络模型最近在ImageNet基准测试中表现出令人印象深刻的分类性能,如Krizhevsky等人(AlexNet)[18]。然而,对于它们为何表现如此出色,或如何改进它们,还没有明确的认识。在本文中,我们探讨了这两个问题。我们引入了一种新的可视化技术,使人们能够深入了解中间特征层的功能和分类器的运作。在诊断性的使用中,这些可视化使我们能够找到在ImageNet分类基准上表现优于Krizhevsky等人的模型架构。我们还进行了一项消融研究,以发现不同模型层对性能的贡献。我们表明,我们的ImageNet模型可以很好地泛化到其他数据集:当softmax分类器被重新训练时,它在Caltech-101和Caltech-256数据集上令人信服地击败了当前最先进的结果。

1 Introduction(介绍)

Since their introduction by LeCun et al. [20] in the early 1990’s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection. In the last 18 months, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. Ciresan et al. [4] demonstrate state-of-the-art performance on NORB and CIFAR-10 datasets. Most notably, Krizhevsky et al. [18] show record beating performance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Following on from this work, Girshick et al. [10] have shown leading detection performance on the PASCAL VOC dataset. Several factors are responsible for this dramatic improvement in performance: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical and (iii) better model regularization strategies, such as Dropout [14].
  自从LeCun等人[20]在20世纪90年代初提出以来,卷积网络(convnets)在诸如手写数字分类和人脸检测等任务中表现出了卓越的性能。在过去的18个月里,有几篇论文表明它们在更具挑战性的视觉分类任务上也能提供出色的表现。Ciresan等人[4]在NORB和CIFAR-10数据集上展示了最先进的性能。最值得注意的是,Krizhevsky等人(AlexNet)[18]在ImageNet 2012分类基准上展示了创纪录的性能,他们的convnet模型取得了16.4%的错误率,而第二名的结果是26.1%。继这项工作之后,Girshick等人[10]在PASCAL VOC数据集上展示了领先的检测性能。有几个因素促成了性能的大幅提高:(i)规模大得多的训练集,包含数以百万计的标注样本;(ii)强大的GPU实现,使训练非常大的模型成为可能;(iii)更好的模型正则化策略,如Dropout[14]。
  Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by Zeiler et al. [29], to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.
  尽管取得了这一令人鼓舞的进展,但人们对这些复杂模型的内部运作和行为仍然没有什么深入了解,也不知道它们是如何取得如此好的性能的。从科学的角度来看,这是很不令人满意的。如果不清楚它们是如何工作和为什么工作的,开发更好的模型就会沦为试错。在本文中,我们介绍了一种可视化技术,揭示了在模型的任何一层激发单个特征图的输入刺激。它还允许我们在训练过程中观察特征的演变,并诊断出模型的潜在问题。我们提出的可视化技术使用Zeiler等人[29]提出的多层反卷积网络(deconvnet),将特征激活投射回输入像素空间。我们还通过遮挡输入图像的部分内容对分类器的输出进行了敏感性分析,揭示了场景的哪些部分对分类是重要的。
  Using these tools, we start with the architecture of Krizhevsky et al. [18] and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by Hinton et al. [13] and others [1,26].
  利用这些工具,我们从Krizhevsky等人[18]的架构开始,探索不同的架构,发现了在ImageNet上结果优于他们的架构。然后我们探索该模型对其他数据集的泛化能力,只在其上重新训练softmax分类器。因此,这是一种有监督的预训练,与Hinton等人[13]和其他人[1,26]推广的无监督预训练方法形成鲜明对比。

1.1 Related Work(相关工作)

Visualization: Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers alternate methods must be used. [8] find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit’s activation. This requires a careful initialization and does not give any information about the unit’s invariances. Motivated by the latter’s short-coming, [19] (extending an idea by [2]) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. Our approach is similar to contemporary work by Simonyan et al. [23] who demonstrate how saliency maps can be obtained from a convnet by projecting back from the fully connected layers of the network, instead of the convolutional features that we use. Girshick et al. [10] show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.
  可视化:可视化特征以获得对网络的直觉是常见的做法,但大多限于第一层,在那里可以投射到像素空间。在较高的层中,必须使用其他方法。[8]通过在图像空间中进行梯度下降,找到每个单元的最佳刺激,使单元的激活最大化。这需要仔细的初始化,并且不提供任何关于单元不变性的信息。受后者缺点的启发,[19](扩展了[2]的想法)展示了如何围绕最优响应数值计算给定单元的Hessian,给出了一些关于不变性的洞察力。问题是,对于较高的层,不变性是非常复杂的,所以很难被简单的二次近似所捕获。相比之下,我们的方法提供了一个非参数化的不变性观点,显示了训练集的哪些模式激活了特征图。我们的方法类似于Simonyan等人[23]的当代工作,他们展示了如何通过从网络的全连接层向后投射,而不是我们使用的卷积特征,从convnet中获得显著性特征图。Girshick等人(R-CNN)[10]展示了识别数据集中的patches的可视化,这些patches是模型中较高层的强激活的原因。我们的可视化不同的是,它们不仅仅是输入图像的剪切,而是自上而下的投影,揭示了每个patches内刺激特定特征图的结构。
  Feature Generalization: Our demonstration of the generalization ability of convnet features is also explored in concurrent work by Donahue et al. [7] and Girshick et al. [10]. They use the convnet features to obtain state-of-the-art performance on Caltech-101 and the Sun scenes dataset in the former case, and for object detection on the PASCAL VOC dataset, in the latter.
  特征泛化。Donahue等人[7]和Girshick等人(R-CNN)[10]在同期工作中也探讨了我们所展示的convnet特征的泛化能力。他们使用convnet特征,前者在Caltech-101和SUN场景数据集上获得了最先进的性能,后者则用于PASCAL VOC数据集上的物体检测。

2 Approach(方法)

We use standard fully supervised convnet models throughout the paper, as defined by LeCun et al. [20] and Krizhevsky et al. [18]. These models map a color 2D input image $x_i$, via a series of layers, to a probability vector $\hat{y}_i$ over the $C$ different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function ($relu(x) = \max(x, 0)$); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see [18] and [16]. The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.
  我们在本文中使用标准的完全监督的convnet模型,如LeCun等人[20]和Krizhevsky等人[18]所定义。这些模型通过一系列的层将彩色二维输入图像 $x_i$ 映射到 $C$ 个不同类别上的概率向量 $\hat{y}_i$。每一层包括:(i)将前一层的输出(或在第一层的情况下,输入图像)与一组学习到的滤波器进行卷积;(ii)将响应通过一个整流线性函数($relu(x) = \max(x, 0)$);(iii)[可选]对局部邻域进行最大池化;(iv)[可选]进行局部对比度操作,对各特征图的响应进行归一化。关于这些操作的更多细节,见[18]和[16]。网络的最后几层是传统的全连接网络,最后一层是一个softmax分类器。图3显示了我们许多实验中使用的模型。
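To make the layer structure above concrete, here is a minimal sketch in PyTorch (our own assumption, not the authors' code); the filter counts and kernel sizes only loosely follow the Fig. 3 architecture, and the fully connected layers 6-7 and the softmax are omitted.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride, pool=True, lrn=True):
    """One convnet 'layer' as described above: learned filters, relu,
    optional max pooling and optional local contrast (response) norm."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=kernel, stride=stride,
                        padding=kernel // 2),
              nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(kernel_size=3, stride=2))
    if lrn:
        layers.append(nn.LocalResponseNorm(size=5))
    return nn.Sequential(*layers)

# Convolutional part only; the fully connected layers and softmax would sit on top.
features = nn.Sequential(
    conv_block(3, 96, kernel=7, stride=2),                        # layer 1 (7x7, stride 2)
    conv_block(96, 256, kernel=5, stride=2),                      # layer 2
    conv_block(256, 384, kernel=3, stride=1, pool=False, lrn=False),
    conv_block(384, 384, kernel=3, stride=1, pool=False, lrn=False),
    conv_block(384, 256, kernel=3, stride=1, lrn=False),          # layer 5
)
```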
  We train these models using a large set of N labeled images $\{x, y\}$, where label $y_i$ is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare $\hat{y}_i$ and $y_i$. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Details of training are given in Section 3.
  我们使用一大组共 N 张有标签的图像 $\{x, y\}$ 来训练这些模型,其中标签 $y_i$ 是一个表示真实类别的离散变量。一个适用于图像分类的交叉熵损失函数被用来比较 $\hat{y}_i$ 和 $y_i$。网络的参数(卷积层中的滤波器、全连接层中的权重矩阵和偏置)通过在整个网络中反向传播损失对参数的导数来训练,并通过随机梯度下降(SGD)来更新参数。训练的细节将在第3节给出。
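For reference, the cross-entropy objective and the momentum-SGD update described here can be written in their standard textbook form (our notation, not taken from the paper): $y_{i,c}$ is the one-hot label, $\theta$ the parameters, $\epsilon$ the learning rate and $\mu$ the momentum coefficient.

$$ E = -\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}_{i,c}, \qquad v \leftarrow \mu v - \epsilon\,\frac{\partial E}{\partial \theta}, \qquad \theta \leftarrow \theta + v $$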

2.1 Visualization with a Deconvnet(用一个Deconvnet进行可视化)

Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) Zeiler et al. [29]. A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite. In Zeiler et al. [29], deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.
  理解convnet的运作需要解释中间层的特征活动。我们提出了一种新的方法来将这些活动映射回输入像素空间,显示什么样的输入模式最初导致了特征图中的特定激活。我们用一个反卷积网络(deconvnet)来进行这种映射(Zeiler等人[29])。反卷积网络可以被认为是一个使用相同组件(滤波、池化)的convnet模型,但方向相反,即不是将像素映射到特征,而是做相反的事情。在Zeiler等人[29]中,反卷积网络被提出来作为进行无监督学习的一种方式。在这里,它们不用于任何学习,而只是作为对一个已训练好的convnet的探针。
  To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1(top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.
  如图1(顶部)所示,为了检查一个convnet,我们在它的每一层都附加一个deconvnet,提供一条回到图像像素的连续路径。开始时,一个输入图像被提交给convnet,并在各层中计算出特征。为了检查一个给定的convnet激活,我们将该层的所有其他激活设置为零,并将特征图作为输入传递给附加的deconvnet层。然后,我们依次进行(i)unpool,(ii)rectify和(iii)filter,以重建下面一层中产生所选激活的活动。然后重复这一过程,直到到达输入像素空间。
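A minimal sketch of this probe for a single conv → relu → max-pool layer is given below (assuming PyTorch; `conv` stands for the trained convolutional layer being inverted, the 3x3/stride-2 pooling mirrors the model of Fig. 3, and for simplicity a whole feature map is kept rather than one single activation).

```python
import torch
import torch.nn.functional as F

def probe_one_layer(x, conv, target_channel):
    """Project the activity of one feature map of a conv -> relu -> maxpool
    layer back towards pixel space (a sketch of the deconvnet probe)."""
    # ---- forward (convnet) pass, keeping the pooling switches ----
    feat = F.relu(conv(x))
    pooled, switches = F.max_pool2d(feat, kernel_size=3, stride=2,
                                    return_indices=True)

    # ---- keep only the chosen feature map, zero all other activations ----
    picked = torch.zeros_like(pooled)
    picked[:, target_channel] = pooled[:, target_channel]

    # ---- backward (deconvnet) pass: unpool -> rectify -> transposed filter ----
    unpooled = F.max_unpool2d(picked, switches, kernel_size=3, stride=2,
                              output_size=feat.shape[-2:])
    rectified = F.relu(unpooled)
    reconstruction = F.conv_transpose2d(rectified, conv.weight,
                                        stride=conv.stride,
                                        padding=conv.padding)
    return reconstruction  # approximately the spatial size of the layer's input
```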
  Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1(bottom) for an illustration of the procedure.
  Unpooling。在convnet中,最大池化操作是不可逆的,但是我们可以通过在一组开关(switch)变量中记录每个池化区域内最大值的位置来获得一个近似的逆操作。在deconvnet中,unpooling操作使用这些开关,将来自上面一层的重建放置到适当的位置,保留刺激的结构。该过程的说明见图1(底部)。
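A tiny numeric illustration of the switch mechanism, using PyTorch's pooling utilities (which record exactly these maxima locations), is sketched below.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [9., 1., 2., 3.],
                    [4., 5., 6., 7.]]]])
pooled, switches = pool(x)       # pooled maxima + the "switch" locations
restored = unpool(pooled, switches)
# `restored` places each pooled value back at its recorded position and fills
# the rest of each 2x2 pooling region with zeros, preserving where (but not
# what) the non-maximal activations were.
```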
  Rectification: The convnet uses relu non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity(We also tried rectifying using the binary mask imposed by the feed-forward relu operation, but the resulting visualizations were significantly less clear.).
  Rectification。convnet使用relu非线性,对特征图进行校正,从而确保特征图始终为正值。为了在每一层获得有效的特征重建(也应该是正的),我们将重建的信号通过一个relu非线性(我们也尝试过使用前馈relu操作所施加的二进制掩码进行校正,但所产生的可视化效果明显不够清晰)。
  Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.
  Filtering。convnet使用学习过的滤波器对上一层的特征图进行卷积。为了近似地反转这一点,deconvnet使用同样的滤波器的转置版本(就像其他自动编码器模型,如RBM),但应用于rectified特征图,而不是下面一层的输出。在实践中,这意味着将每个滤波器垂直和水平地翻转。
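The equivalence between applying the transposed filter bank and convolving with each filter flipped vertically and horizontally can be checked directly; the sketch below assumes PyTorch and stride-1 filtering.

```python
import torch
import torch.nn.functional as F

W = torch.randn(8, 3, 5, 5)       # 8 learned filters over 3 input channels
y = torch.randn(1, 8, 12, 12)     # rectified feature maps to project down

# Deconvnet filtering: convolution with the transposed filter bank...
out_a = F.conv_transpose2d(y, W)

# ...which, for stride 1, equals an ordinary convolution with each filter
# flipped vertically and horizontally (channel roles swapped), using
# "full" padding so both operations cover the same support.
W_flipped = W.flip(dims=[2, 3]).transpose(0, 1)   # shape (3, 8, 5, 5)
out_b = F.conv2d(y, W_flipped, padding=4)

assert torch.allclose(out_a, out_b, atol=1e-4)
```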
  Note that we do not use any contrast normalization operations when in this reconstruction path. Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved. The whole procedure is similar to backpropping a single strong activation (rather than the usual gradients), i.e. computing $\frac{\partial h}{\partial X_n}$, where $h$ is the element of the feature map with the strong activation and $X_n$ is the input image. However, it differs in that (i) the relu is imposed independently and (ii) contrast normalization operations are not used. A general shortcoming of our approach is that it only visualizes a single activation, not the joint activity present in a layer. Nevertheless, as we show in Fig. 6, these visualizations are accurate representations of the input pattern that stimulates the given feature map in the model: when the parts of the original input image corresponding to the pattern are occluded, we see a distinct drop in activity within the feature map.
  注意,在这个重建路径中,我们不使用任何对比度归一化操作。从高层往下投射时,使用的是convnet在向上传播时由最大池化产生的switch设置。由于这些switch设置是给定输入图像所特有的,因此从单一激活中得到的重建结果类似于原始输入图像的一小部分,其结构根据其对特征激活的贡献而加权。由于模型是判别式训练的,它们隐含地显示了输入图像的哪些部分是有判别力的。请注意,这些投影不是来自模型的样本,因为其中没有涉及生成过程。整个过程类似于反向传播单一的强激活(而不是通常的梯度),即计算 $\frac{\partial h}{\partial X_n}$,其中 $h$ 是具有强激活的特征图元素,$X_n$ 是输入图像。然而,它的不同之处在于:(i)relu是独立施加的;(ii)不使用对比度归一化操作。我们的方法的一个普遍缺点是,它只可视化单一的激活,而不是一个层中存在的联合活动。然而,正如我们在图6中所示,这些可视化是刺激模型中给定特征图的输入模式的准确表示:当原始输入图像中与该模式相对应的部分被遮挡时,我们看到特征图中的活动明显下降。
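For comparison, back-propagating a single strong activation with ordinary autograd looks like the following sketch (PyTorch assumed; `features` is a hypothetical callable computing the feature maps of the layer of interest). The deconvnet procedure differs in that the relu is applied independently on the way down and contrast normalization is skipped.

```python
import torch

def grad_of_single_activation(x, features, channel, row, col):
    """d h / d X_n for one feature-map element h, via standard backprop."""
    x = x.detach().clone().requires_grad_(True)
    h = features(x)                     # feature maps at the layer of interest
    h[0, channel, row, col].backward()  # seed only the chosen activation
    return x.grad                       # same shape as the input image
```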

3 Training Details(训练详情)

We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by Krizhevsky et al. [18] for ImageNet classification. One difference is that the sparse connections used in Krizhevsky’s layers 3,4,5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 5, as described in Section 4.1.
  我们现在描述一下大型convnet模型,该模型将在第4节中进行可视化。图3所示的结构与Krizhevsky等人[18]用于ImageNet分类的结构类似。一个区别是,在Krizhevsky的第3、4、5层中使用的稀疏连接(由于该模型被分割在两个GPU上)在我们的模型中被密集连接取代。其他与第1层和第2层有关的重要区别是在检查了图5中的可视化后作出的,如第4.1节所述。
  The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes) [6]. Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of $10^{-2}$, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout [14] is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to $10^{-2}$ and biases are set to 0.
  该模型是在ImageNet 2012训练集(130万张图像,分布在1000个不同的类别中)[6]上训练的。每张RGB图像都经过预处理:将最小尺寸调整为256,裁剪中心的256x256区域,减去(所有图像上的)每像素平均值,然后使用10个大小为224x224的不同子裁剪(四角+中心,分别有、无水平翻转)。使用小批量大小为128的随机梯度下降来更新参数,初始学习率为 $10^{-2}$,并结合0.9的动量项。在整个训练过程中,当验证误差趋于平稳时,我们手动对学习率进行退火。Dropout[14]被用于全连接层(第6和第7层),比率为0.5。所有的权重被初始化为 $10^{-2}$,偏置被设置为0。
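A sketch of this preprocessing and optimizer setup, assuming a recent torchvision (whose transforms accept tensors) and using placeholder stand-ins for the per-pixel mean image and the model:

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Placeholders so the sketch is self-contained; in practice `mean_image` is
# the per-pixel mean over the training set and `model` is the Fig. 3 convnet.
mean_image = torch.zeros(3, 256, 256)
model = nn.Sequential(nn.Conv2d(3, 96, 7, stride=2), nn.ReLU())

preprocess = transforms.Compose([
    transforms.Resize(256),                          # smallest dimension -> 256
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Lambda(lambda img: img - mean_image), # subtract per-pixel mean
    transforms.TenCrop(224),                         # 4 corners + center, with flips
    transforms.Lambda(lambda crops: torch.stack(crops)),  # 10 x 3 x 224 x 224
])

# Optimizer settings quoted in the text: SGD, mini-batches of 128, lr 1e-2,
# momentum 0.9; the learning rate is annealed by hand when the validation
# error plateaus, and dropout (p=0.5) is used in the fully connected layers.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
```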
  Visualization of the first layer filters during training reveals that a few of them dominate. To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of $10^{-1}$ to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in Krizhevsky et al. [18], we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on [18].
  在训练过程中,第一层滤波器的可视化显示,其中有几个滤波器占主导地位。为了解决这个问题,我们将卷积层中RMS值超过固定半径 $10^{-1}$ 的每个滤波器重新归一化到这个固定半径。这一点很关键,特别是在模型的第一层,因为输入图像大致在[-128,128]范围内。如同Krizhevsky等人[18]一样,我们对每个训练样本进行多次不同的裁剪和翻转,以扩大训练集的规模。我们在70个epochs后停止训练,使用基于[18]的实现,在单个GTX580 GPU上花费了大约12天。
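The filter renormalization step can be written as a small helper, applied after each parameter update (a sketch; the fixed RMS radius of 0.1 is the value quoted above):

```python
import torch

def renormalize_filters(conv, radius=0.1):
    """Clamp the RMS of each filter back to a fixed radius when it exceeds it."""
    with torch.no_grad():
        w = conv.weight                                        # (out_ch, in_ch, kH, kW)
        rms = w.pow(2).mean(dim=(1, 2, 3), keepdim=True).sqrt()
        scale = torch.where(rms > radius, radius / rms, torch.ones_like(rms))
        w.mul_(scale)                                          # rescale only the offending filters
```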

4 Convnet Visualization(Convnet可视化)

Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.
  使用第3节中描述的模型,我们现在使用deconvnet来可视化ImageNet验证集上的特征激活。
  Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. For a given feature map, we show the top 9 activations, each projected separately down to pixel space, revealing the different structures that excite that map and showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than visualizations which solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.
  特征可视化。图2显示了训练完成后我们模型的特征可视化。对于一个给定的特征图,我们显示了前9个最强激活,每个激活都分别投射到像素空间,揭示了激发该特征图的不同结构,并显示其对输入变形的不变性。在这些可视化的旁边,我们显示了相应的图像patches。与只关注每个patch内判别性结构的可视化相比,这些patches的变化更大。例如,在第5层第1行第2列,这些patches似乎没有什么共同点,但可视化显示,这个特定的特征图关注的是背景中的草,而不是前景物体。
  The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2,C4)). Layer 4 shows significant variation, and is more class-specific: dog faces (R1,C1); bird’s legs (R4,C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1,C11) and dogs (R4).
  各层的投影显示了网络中特征的层次性。第2层对角点和其他边缘/颜色组合作出反应。第3层有更复杂的不变性,捕捉类似的纹理(如网状图案(第1行,第1列);文字(R2,C4))。第4层显示了显著的变化,而且更具类别特异性:狗脸(R1,C1);鸟的腿(R4,C2)。第5层显示了具有显著姿态变化的完整物体,例如键盘(R1,C11)和狗(R4)。
  Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.
  训练过程中的特征演变:图4直观地显示了训练过程中,给定特征图中最强的激活(在所有训练样本上)投射回像素空间的演变过程。外观上的突然跳变是由于产生最强激活的那张图像发生了变化。可以看到模型的下层在几个epochs内就收敛了。然而,上层只有在相当多的epochs(40-50)之后才发展起来,这表明需要让模型训练到完全收敛。

4.1 Architecture Selection(框架选择)

  While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al.'s architecture (Fig. 5(a) & (c)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 5(b) & (d). More importantly, it also improves the classification performance as shown in Section 5.1.
  虽然对一个已训练的模型进行可视化可以深入了解其运行情况,但它也可以在一开始就帮助选择好的架构。通过对Krizhevsky等人的架构的第一层和第二层进行可视化(图5(a)和(c)),各种问题都很明显。第一层滤波器是极高频和极低频信息的混合体,对中频的覆盖很少。此外,第二层的可视化显示了由第一层卷积中使用的大步长(stride=4)造成的混叠伪影。为了解决这些问题,我们(i)将第一层滤波器的尺寸从11x11减少到7x7,(ii)使卷积的步长为2,而不是4。如图5(b)和(d)所示,这个新的架构在第一层和第二层特征中保留了更多的信息。更重要的是,它还提高了分类性能,如第5.1节所示。
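In code, the change to the first layer amounts to the following (PyTorch assumed; 96 filters as in the Fig. 3 architecture):

```python
import torch.nn as nn

# Krizhevsky et al.'s first layer: 11x11 filters applied with stride 4.
layer1_alexnet = nn.Conv2d(3, 96, kernel_size=11, stride=4)

# Revised first layer used here: smaller 7x7 filters and stride 2, which keeps
# mid-frequency information and reduces the aliasing seen in the layer-2
# visualizations (layer 2 then uses stride-2 convolutions as well).
layer1_revised = nn.Conv2d(3, 96, kernel_size=7, stride=2)
```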

4.2 Occlusion Sensitivity(遮蔽敏感度)

With image classification approaches, a natural question is if the model is truly identifying the location of the object in the image, or just using the surrounding context. Fig. 6 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 6 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.
  对于图像分类方法,一个自然的问题是,该模型是真正识别了图像中物体的位置,还是只是利用了周围的上下文。图6试图回答这个问题:系统地用一个灰色方块遮挡输入图像的不同部分,并监测分类器的输出。这些例子清楚地表明,该模型正在对场景中的物体进行定位,因为当物体被遮挡时,正确类别的概率明显下降。图6还显示了来自最顶层卷积层的最强特征图的可视化,以及该特征图中的活动(在空间位置上求和)随遮挡物位置变化的情况。当遮挡物覆盖了出现在可视化中的图像区域时,我们看到特征图中的活动大幅下降。这表明,可视化真正对应于刺激该特征图的图像结构,因此验证了图4和图2所示的其他可视化。
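A sketch of this occlusion analysis (PyTorch assumed; the occluder size, stride and grey level are illustrative, not the exact values used for Fig. 6):

```python
import torch

def occlusion_map(model, image, true_class, patch=64, stride=16, grey=0.5):
    """Slide a grey square over a 1x3xHxW image and record the correct-class
    probability at each occluder position."""
    _, _, H, W = image.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    with torch.no_grad():
        for i in range(rows):
            for j in range(cols):
                occluded = image.clone()
                r, c = i * stride, j * stride
                occluded[:, :, r:r + patch, c:c + patch] = grey
                probs = torch.softmax(model(occluded), dim=1)
                heatmap[i, j] = probs[0, true_class]
    return heatmap   # low values mark regions the classifier relies on
```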

5 Experiments (实验)

5.1 ImageNet 2012

This dataset consists of 1.3M/50k/100k training/validation/test examples, spread over 1000 categories. Table 1 shows our results on this dataset.
  这个数据集由1.3M/50K/100K训练/验证/测试实例组成,分布在1000个类别上。表1显示了我们在这个数据集上的结果。
  Using the exact architecture specified in Krizhevsky et al. [18], we attempt to replicate their result on the validation set. We achieve an error rate within 0.1% of their reported value on the ImageNet 2012 validation set.
  我们使用Krizhevsky等人[18]中指定的确切架构,试图在验证集上复制他们的结果。在ImageNet 2012验证集上,我们取得的错误率在他们报告值的0.1%以内。
  Next we analyze the performance of our model with the architectural changes outlined in Section 4.1 (7 × 7 filters in layer 1 and stride 2 convolutions in layers 1 & 2). This model, shown in Fig. 3, significantly outperforms the architecture of Krizhevsky et al. [18], beating their single model result by 1.7% (test top-5). When we combine multiple models, we obtain a test error of 14.8%, an improvement of 1.6%. This result is close to that produced by the data-augmentation approaches of Howard [15], which could easily be combined with our architecture. However, our model is some way short of the winner of the 2013 Imagenet classification competition [28].
  接下来,我们分析了我们的模型在第4.1节所述架构变化(第1层的7×7滤波器以及第1、2层步长为2的卷积)下的性能。如图3所示,这个模型明显优于Krizhevsky等人[18]的架构,比他们的单一模型结果好1.7%(top-5测试误差)。当我们组合多个模型时,我们得到的测试误差为14.8%,提高了1.6%。这个结果接近于Howard[15]的数据增强方法所产生的结果,后者可以很容易地与我们的架构相结合。然而,我们的模型与2013年Imagenet分类竞赛的冠军还有一定差距[28]。
  Varying ImageNet Model Sizes: In Table 2, we first explore the architecture of Krizhevsky et al. [18] by adjusting the size of layers, or removing them entirely. In each case, the model is trained from scratch with the revised architecture. Removing the fully connected layers (6,7) only gives a slight increase in error (in the following, we refer to top-5 validation error). This is surprising, given that they contain the majority of model parameters. Removing two of the middle convolutional layers also makes a relatively small difference to the error rate. However, removing both the middle convolution layers and the fully connected layers yields a model with only 4 layers whose performance is dramatically worse. This would suggest that the overall depth of the model is important for obtaining good performance. We then modify our model, shown in Fig. 3. Changing the size of the fully connected layers makes little difference to performance (same for model of Krizhevsky et al. [18]). However, increasing the size of the middle convolution layers does give a useful gain in performance. But increasing these, while also enlarging the fully connected layers, results in over-fitting.
  不同的ImageNet模型大小。在表2中,我们首先通过调整各层的大小或完全删除它们来探索Krizhevsky等人[18]的架构。在每种情况下,模型都用修改后的架构从头开始训练。去掉全连接层(6,7)只会使误差略有增加(下文中我们指的都是top-5验证误差)。这是令人惊讶的,因为它们包含了大部分的模型参数。去掉两个中间卷积层,对错误率的影响也比较小。然而,同时去除中间卷积层和全连接层后,得到的模型只有4层,其性能大大降低。这表明,模型的整体深度对于获得良好的性能是很重要的。然后我们修改图3所示的我们的模型。改变全连接层的大小对性能没有什么影响(对Krizhevsky等人[18]的模型也是如此)。然而,增加中间卷积层的大小确实会给性能带来有益的提高。但是,在增加这些层的同时扩大全连接层的规模会导致过拟合。

5.2 Feature Generalization(特征泛化)

The experiments above show the importance of the convolutional part of our ImageNet model in obtaining state-of-the-art performance. This is supported by the visualizations of Fig. 2 which show the complex invariances learned in the convolutional layers. We now explore the ability of these feature extraction layers to generalize to other datasets, namely Caltech-101 [9], Caltech-256 [11] and PASCAL VOC 2012. To do this, we keep layers 1-7 of our ImageNet-trained model fixed and train a new softmax classifier on top (for the appropriate number of classes) using the training images of the new dataset. Since the softmax contains relatively few parameters, it can be trained quickly from a relatively small number of examples, as is the case for certain datasets.
  上述实验表明,我们的ImageNet模型的卷积部分对获得最先进的性能非常重要。图2的可视化也支持了这一点,它显示了卷积层中学习到的复杂不变性。我们现在探索这些特征提取层对其他数据集的泛化能力,即Caltech-101[9]、Caltech-256[11]和PASCAL VOC 2012。为了做到这一点,我们保持ImageNet训练模型的第1-7层不变,并使用新数据集的训练图像在其上训练一个新的softmax分类器(针对适当数量的类别)。由于softmax包含的参数相对较少,它可以从相对较少的样本中快速训练出来,某些数据集正是这种情况。
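A sketch of this transfer setup, assuming PyTorch and a pretrained model object that exposes its layer 1-7 feature extractor as `pretrained.features` (a hypothetical attribute name):

```python
import torch
import torch.nn as nn

def retrain_softmax(pretrained, feature_dim, num_classes):
    """Keep the ImageNet-trained layers 1-7 fixed and train only a new
    softmax classifier for the target dataset."""
    for p in pretrained.features.parameters():
        p.requires_grad = False                       # freeze the feature extractor

    classifier = nn.Linear(feature_dim, num_classes)  # new softmax layer (logits)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-2, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()                   # softmax + cross-entropy
    return classifier, optimizer, loss_fn
```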
  The experiments compare our feature representation, obtained from ImageNet, with the hand-crafted features used by other methods. In both our approach and existing ones the Caltech/PASCAL training data is only used to train the classifier. As they are of similar complexity (ours: softmax, others: linear SVM), the feature representation is crucial to performance. It is important to note that both representations were built using images beyond the Caltech and PASCAL training sets. For example, the hyper-parameters in HOG descriptors were determined through systematic experiments on a pedestrian dataset [5].
  实验将我们从ImageNet获得的特征表示与其他方法使用的手工制作的特征进行了比较。在我们的方法和现有的方法中,Caltech/PASCAL的训练数据都只用于训练分类器。由于它们的复杂度相似(我们的:softmax,其他的:线性SVM),特征表示对性能至关重要。值得注意的是,这两种表征都是用Caltech和PASCAL训练集以外的图像建立的。例如,HOG描述符中的超参数是通过对行人数据集的系统实验确定的[5]。
  We also try a second strategy of training a model from scratch, i.e. resetting layers 1-7 to random values and train them, as well as the softmax, on the training images of the PASCAL/Caltech dataset.
  我们还尝试了从头开始训练模型的第二种策略,即把第1-7层重设为随机值,在PASCAL/Caltech数据集的训练图像上训练它们以及softmax。
  One complication is that some of the Caltech datasets have some images that are also in the ImageNet training data. Using normalized correlation, we identified these few “overlap” images and removed them from our Imagenet training set and then retrained our Imagenet models, so avoiding the possibility of train/test contamination.
  一个复杂的问题是,加州理工学院(Caltech)的一些数据集中有一些图像也在ImageNet的训练数据中。利用归一化相关,我们确定了这些少数的"重叠"图像,并将它们从我们的Imagenet训练集中删除,然后重新训练我们的Imagenet模型,从而避免了训练/测试污染的可能性。
  Caltech-101: We follow the procedure of [9] and randomly select 15 or 30 images per class for training and test on up to 50 images per class reporting the average of the per-class accuracies in Table 3, using 5 train/test folds. Training took 17 minutes for 30 images/class. The pre-trained model beats the best reported result for 30 images/class from [3] by 2.2%. Our result agrees with the recently published result of Donahue et al. [7], who obtain 86.1% accuracy (30 imgs/class). The convnet model trained from scratch however does terribly, only achieving 46.5%, showing the impossibility of training a large convnet on such a small dataset.
  Caltech-101。我们遵循[9]的程序,每类随机选择15或30张图片进行训练,并在每类最多50张图片上进行测试,使用5次训练/测试划分(folds),在表3中报告每类准确率的平均值。每类30张图像的训练花费了17分钟。预训练的模型比[3]报告的每类30张图像的最佳结果高出2.2%。我们的结果与Donahue等人[7]最近发表的结果一致,他们获得了86.1%的准确率(每类30张图像)。然而,从头开始训练的convnet模型表现得很糟糕,只达到了46.5%,这表明在这么小的数据集上训练一个大型convnet是不可能的。
  Caltech-256: We follow the procedure of [11], selecting 15, 30, 45, or 60 training images per class, reporting the average of the per-class accuracies in Table 4. Our ImageNet-pretrained model beats the current state-of-the-art results obtained by Bo et al. [3] by a significant margin: 74.2% vs 55.2% for 60 training images/class. However, as with Caltech-101, the model trained from scratch does poorly. In Fig. 7, we explore the “one-shot learning” [9] regime. With our pretrained model, just 6 Caltech-256 training images are needed to beat the leading method using 10 times as many images. This shows the power of the ImageNet feature extractor.
  Caltech-256。我们按照[11]的程序,每类选择15、30、45或60张训练图像,在表4中报告每类准确率的平均值。我们的ImageNet预训练模型以很大的优势击败了Bo等人[3]获得的当前最先进的结果:每类60张训练图像时为74.2% vs 55.2%。然而,与Caltech-101一样,从头开始训练的模型表现很差。在图7中,我们探讨了"one-shot learning"(一次性学习)[9]的设定。使用我们的预训练模型,只需要6张Caltech-256的训练图像就可以击败使用10倍图像数量的领先方法。这显示了ImageNet特征提取器的强大能力。
  PASCAL 2012: We used the standard training and validation images to train a 20-way softmax on top of the ImageNet-pretrained convnet. This is not ideal, as PASCAL images can contain multiple objects and our model just provides a single exclusive prediction for each image. Table 5 shows the results on the test set, comparing to the leading methods: the top 2 entries in the competition and concurrent work from Oquab et al. [21] who use a convnet with a more appropriate classifier. The PASCAL and ImageNet images are quite different in nature, the former being full scenes unlike the latter. This may explain our mean performance being 3.2% lower than the leading competition result [27], however we do beat them on 5 classes, sometimes by large margins.
  PASCAL 2012。我们使用标准的训练图像和验证图像,在ImageNet预训练的convnet之上训练一个20路softmax。这并不理想,因为PASCAL图像可能包含多个物体,而我们的模型只为每张图像提供一个排他性的单一预测。表5显示了在测试集上的结果,并与领先的方法进行了比较:竞赛中的前两名,以及Oquab等人[21]同时进行的工作,他们使用了带有更合适分类器的convnet。PASCAL和ImageNet图像的性质相当不同,前者是完整的场景,而后者不是。这也许可以解释我们的平均成绩比领先的竞赛结果[27]低3.2%,但是我们确实在5个类别上击败了他们,有时差距还很大。

5.3 Feature Analysis(特征分析)

We explore how discriminative the features in each layer of our ImageNet-pretrained model are. We do this by varying the number of layers retained from the ImageNet model and place either a linear SVM or softmax classifier on top. Table 6 shows results on Caltech-101 and Caltech-256. For both datasets, a steady improvement can be seen as we ascend the model, with best results being obtained by using all layers. This supports the premise that as the feature hierarchies become deeper, they learn increasingly powerful features.
  我们探讨了我们的ImageNet预训练模型中每一层的特征有多大的判别力。我们通过改变从ImageNet模型中保留的层数,并在其上放置一个线性SVM或softmax分类器来实现这一目的。表6显示了Caltech-101和Caltech-256的结果。对于这两个数据集,随着我们沿模型层次上升,可以看到稳定的改善,使用所有层可以获得最好的结果。这支持了这样一个前提:随着特征层次的加深,它们会学习到越来越强大的特征。
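A sketch of this layer-wise analysis, assuming scikit-learn for the linear SVM and a hypothetical helper `extract_features(images, layer)` that returns flattened activations from the frozen ImageNet model:

```python
from sklearn.svm import LinearSVC

def layerwise_accuracy(extract_features, X_train, y_train, X_test, y_test, layers):
    """Train a linear SVM on the features produced by each retained layer
    and report the test accuracy for each depth."""
    scores = {}
    for layer in layers:
        clf = LinearSVC()
        clf.fit(extract_features(X_train, layer), y_train)
        scores[layer] = clf.score(extract_features(X_test, layer), y_test)
    return scores
```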

6 Discussion(讨论)

  We explored large convolutional neural network models, trained for image classification, in a number of ways. First, we presented a novel way to visualize the activity within the model. This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers. We also showed how these visualizations can be used to identify problems with the model and so obtain better results, for example improving on Krizhevsky et al.'s [18] impressive ImageNet 2012 result. We then demonstrated through a series of occlusion experiments that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context. An ablation study on the model revealed that having a minimum depth to the network, rather than any individual section, is vital to the model's performance.
  我们以多种方式探索了为图像分类而训练的大型卷积神经网络模型。首先,我们提出了一种新的方法来可视化模型内部的活动。这揭示了这些特征远不是随机的、无法解释的模式。相反,它们显示了许多直观上理想的属性,如组合性,以及随着层数上升而增强的不变性和类别判别能力。我们还展示了这些可视化如何用于识别模型的问题,从而获得更好的结果,例如改进Krizhevsky等人[18]令人印象深刻的ImageNet 2012结果。然后,我们通过一系列的遮挡实验证明,该模型虽然是为分类而训练的,但对图像中的局部结构高度敏感,而不仅仅是利用广泛的场景背景。对该模型的消融研究表明,网络具有一定的最小深度,而不是任何单独的部分,对模型的性能至关重要。
  Finally, we showed how the ImageNet trained model can generalize well to other datasets. For Caltech-101 and Caltech-256, the datasets are similar enough that we can beat the best reported results, in the latter case by a significant margin. Our convnet model generalized less well to the PASCAL data, perhaps suffering from dataset bias [25], although it was still within 3.2% of the best reported result, despite no tuning for the task. For example, our performance might improve if a different loss function was used that permitted multiple objects per image. This would naturally enable the networks to tackle the object detection as well.
  最后,我们展示了ImageNet训练的模型如何能够很好地泛化到其他数据集。对于Caltech-101和Caltech-256,这些数据集足够相似,我们可以击败已报告的最佳结果,在后者上优势明显。我们的convnet模型对PASCAL数据的泛化能力较差,也许是受到了数据集偏差的影响[25],尽管没有针对该任务进行调整,它仍然与已报告的最佳结果相差不到3.2%。例如,如果使用一个允许每幅图像有多个物体的不同损失函数,我们的性能可能会提高。这自然也会使网络能够处理目标检测问题。
