VGG (Paper Translation)

Very Deep Convolutional Networks for Large-Scale Image Recognition

ABSTRACT

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

1 INTRODUCTION

Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition which has become possible due to the large public image repositories, such as ImageNet, and high-performance computing systems, such as GPUs or large-scale distributed clusters. In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), which has served as a testbed for a few generations of large-scale image classification systems, from high-dimensional shallow feature encodings to deep ConvNets.

With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. in a bid to achieve better accuracy. For instance, the best-performing submissions to the ILSVRC-2013 utilised smaller receptive window size and smaller stride of the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales. In this paper, we address another important aspect of ConvNet architecture design - its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3×3) convolution filters in all layers.

As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models to facilitate further research.

The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.

2 CONVNET CONFIGURATIONS

To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.

2.1 ARCHITECTURE

During training, the input to our ConvNets is a fixed-size 224×224 RGB image. The only pre-processing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3×3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window, with stride 2.
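
To make these settings concrete, below is a minimal sketch of one such conv. + max-pooling step (written in PyTorch purely for illustration; the framework and variable names are our own choice, not part of the paper). With a 3×3 kernel, stride 1 and 1-pixel padding, the convolution preserves the spatial resolution, and only the 2×2/stride-2 max-pooling halves it:

```python
import torch
import torch.nn as nn

# One conv. "unit" as described above: 3x3 kernel, stride 1, padding 1,
# so a 224x224 input stays 224x224 after the convolution.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
relu = nn.ReLU(inplace=True)                  # rectification non-linearity
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 window, stride 2

x = torch.randn(1, 3, 224, 224)       # a fixed-size 224x224 RGB input
print(conv(x).shape)                  # torch.Size([1, 64, 224, 224])
print(pool(relu(conv(x))).shape)      # torch.Size([1, 64, 112, 112])
```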

A stack of convolutional layers (which has a different depth in different architectures)is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
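
The fully-connected part, shared by all configurations, can be sketched in the same way (again a PyTorch illustration of our own; in practice the soft-max is typically folded into the training loss rather than kept as a separate layer):

```python
import torch.nn as nn

# After the five max-poolings, the 224x224 input is reduced to 512 channels
# at 7x7 resolution, i.e. 512 * 7 * 7 = 25088 inputs to the first FC layer.
classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096),        nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),        # 1000-way ILSVRC classification
    nn.Softmax(dim=1),            # the final soft-max layer
)
```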

All hidden layers are equipped with the rectification (ReLU) non-linearity (Krizhevsky et al., 2012). We note that none of our networks (except for one) contain Local Response Normalisation (LRN) (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).

2.2 CONFIGURATIONS

The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.
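
As a compact encoding of Table 1 below (the encoding is our own, not the paper's), each column can be written as a list of channel counts with 'M' marking a max-pooling layer, from which the conv. stack follows mechanically:

```python
import torch.nn as nn

# One list per column of Table 1; numbers are output channels of a 3x3 conv.,
# 'M' is a 2x2/stride-2 max-pool. (Configuration C, which swaps some 3x3
# convolutions for 1x1 ones, is omitted here for brevity.)
cfgs = {
    'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
          512, 512, 512, 'M', 512, 512, 512, 'M'],
    'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
          512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}

def make_features(cfg):
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

features_D = make_features(cfgs['D'])  # configuration D is the popular "VGG-16"
```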

In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).
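
These totals are easy to reproduce from the layer shapes; the short check below (our own back-of-the-envelope calculation, ignoring biases) recovers the ~138M figure for configuration D and shows that the three FC layers account for most of it:

```python
# Configuration D as channel counts; 'M' marks a max-pooling layer.
cfg_D = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']

def count_weights(cfg):
    total, in_ch = 0, 3
    for v in cfg:
        if v != 'M':
            total += in_ch * v * 3 * 3   # 3x3 conv. kernel weights
            in_ch = v
    total += 512 * 7 * 7 * 4096          # FC-4096 (conv. output is 512 x 7 x 7)
    total += 4096 * 4096                 # FC-4096
    total += 4096 * 1000                 # FC-1000
    return total

print(count_weights(cfg_D) / 1e6)        # ~138.3, matching Table 2
```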

2.3 DISCUSSION

Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 and ILSVRC-2013 competitions. Rather than using relatively large receptive fields in the first conv. layers (e.g. 11×11 with stride 4 in (Krizhevsky et al., 2012), or 7×7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3×3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7×7 effective receptive field.

Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as “conv⟨receptive field size⟩-⟨number of channels⟩”. The ReLU activation function is not shown for brevity.
ConvNet Configuration
A            A-LRN        B            C            D            E
11 weight    11 weight    13 weight    16 weight    16 weight    19 weight
layers       layers       layers       layers       layers       layers

input (224 × 224 RGB image)
conv3-64     conv3-64     conv3-64     conv3-64     conv3-64     conv3-64
             LRN          conv3-64     conv3-64     conv3-64     conv3-64
maxpool
conv3-128    conv3-128    conv3-128    conv3-128    conv3-128    conv3-128
                          conv3-128    conv3-128    conv3-128    conv3-128
maxpool
conv3-256    conv3-256    conv3-256    conv3-256    conv3-256    conv3-256
conv3-256    conv3-256    conv3-256    conv3-256    conv3-256    conv3-256
                                       conv1-256    conv3-256    conv3-256
                                                                 conv3-256
maxpool
conv3-512    conv3-512    conv3-512    conv3-512    conv3-512    conv3-512
conv3-512    conv3-512    conv3-512    conv3-512    conv3-512    conv3-512
                                       conv1-512    conv3-512    conv3-512
                                                                 conv3-512
maxpool
conv3-512    conv3-512    conv3-512    conv3-512    conv3-512    conv3-512
conv3-512    conv3-512    conv3-512    conv3-512    conv3-512    conv3-512
                                       conv1-512    conv3-512    conv3-512
                                                                 conv3-512
maxpool
FC-4096
FC-4096
FC-1000
soft-max

Table 2: Number of parameters (in millions).
Network                A, A-LRN   B     C     D     E
Number of parameters   133        133   134   138   144

So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3×3 convolution stack has C channels, the stack is parametrised by 3(3²C²) = 27C² weights; at the same time, a single 7×7 conv. layer would require 7²C² = 49C² parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7×7 conv. filters, forcing them to have a decomposition through the 3×3 filters (with non-linearity injected in between).
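
Both claims are easy to verify numerically; the small check below (our own, not from the paper) computes the effective receptive field of stacked stride-1 convolutions and the 49C²/27C² weight ratio:

```python
def stacked_receptive_field(k, kernel=3):
    """Effective receptive field of k stacked stride-1 conv. layers."""
    rf = 1
    for _ in range(k):
        rf += kernel - 1                 # each stride-1 layer adds kernel-1
    return rf

print(stacked_receptive_field(2))        # 5 -> two 3x3 layers see 5x5
print(stacked_receptive_field(3))        # 7 -> three 3x3 layers see 7x7

# Weight comparison for C input/output channels:
# three 3x3 layers: 3 * (3*3*C*C) = 27*C^2; one 7x7 layer: 7*7*C*C = 49*C^2
print((49 - 27) / 27)                    # ~0.81, i.e. 81% more for the 7x7
```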
