Paper Reading: VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION

Authors: Karen Simonyan et al.
Date: 2015
Type: conference paper
Venue: ICLR
Comment: very deep networks
Paper link: https://arxiv.org/abs/1409.1556

1 Purpose

  • Investigating the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting.

Challenges

2 The previous work

  1. Krizhevsky, A. et al. ImageNet classification with deep convolutional neural networks. NIPS, 2012
  2. Zeiler, M. D. et al. Visualizing and understanding convolutional networks. ECCV, 2014
  3. Szegedy, C. et al. Going deeper with convolutions. CVPR, 2015

3 The proposed method

They fix the other parameters of the architecture and steadily increase the depth of the network by adding more convolutional layers. To implement this, they use very small (3x3) convolutional filters in all layers.

3.1 Data sets

ILSVRC2012: images of 1000 classes, split into three sets: training (1.3M), validation (50K), and testing (100K).
Measures: Top-1: multi-class classification error, i.e. the proportion of incorrectly classified images;
Top-5: the proportion of images whose ground-truth category is outside the top-5 predicted categories.
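A minimal sketch of how these two measures can be computed (the helper name `top_k_error` and the toy scores below are my own, not from the paper):

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of samples whose ground-truth class is outside the top-k predictions.

    scores: (N, C) array of class scores; labels: (N,) ground-truth class indices.
    """
    # indices of the k highest-scoring classes per sample
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = np.any(top_k == labels[:, None], axis=1)
    return 1.0 - hits.mean()

# toy example: 3 samples, 5 classes
scores = np.array([[0.1, 0.5, 0.2, 0.1, 0.1],   # predicted class 1
                   [0.3, 0.1, 0.4, 0.1, 0.1],   # predicted class 2
                   [0.2, 0.2, 0.2, 0.3, 0.1]])  # predicted class 3
labels = np.array([1, 0, 4])
print(top_k_error(scores, labels, k=1))  # top-1 error: 2 of 3 samples wrong
print(top_k_error(scores, labels, k=3))  # top-k error with k=3: 1 of 3 wrong
```

Top-1 error with k=1 reduces to the ordinary misclassification rate.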

3.2 The architecture


  • All hidden layers are followed by rectification (ReLU).
  • None of their networks contains Local Response Normalization (LRN, Krizhevsky et al. 2012) except A-LRN. They find that LRN does not improve performance, but increases memory consumption and computation time.
  • In configuration C they use 1x1 conv. layers, which they view as a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers. (From a current perspective, the 1x1 conv. layer used in the bottleneck block of ResNet is also an efficient way to decrease the number of parameters.)
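As a small illustration of the 1x1-conv point above: a 1x1 convolution is just a per-pixel linear map across channels, so adding one (followed by ReLU) injects extra non-linearity without enlarging the receptive field. A NumPy sketch (all shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# feature map: C_in channels over an H x W spatial grid
C_in, C_out, H, W = 8, 8, 6, 6
x = rng.standard_normal((C_in, H, W))
w = rng.standard_normal((C_out, C_in))  # a 1x1 kernel is a channel-mixing matrix

# a 1x1 convolution applies the same linear map independently at every pixel
y = np.einsum('oc,chw->ohw', w, x)
y = np.maximum(y, 0.0)  # ReLU: the extra non-linearity the authors mention

# spatial size (and hence the receptive field) is unchanged
print(y.shape)  # (8, 6, 6)
```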

3.3 Classification Framework

The training procedure generally follows Krizhevsky et al. (except for sampling the input crops from multi-scale training images).

  1. Training
    • Objective: multinomial logistic regression
    • Trick 1: train a shallow network first, then use it to initialize certain layers of a deeper one.
    • Trick 2: pre-initialize with weights sampled from a normal distribution.
    • Single-scale training: fix the smallest side of an image, then take a fixed-size 224x224 crop.
    • Multi-scale training: each training image is individually rescaled by randomly sampling S from a range [S_min, S_max] (S denotes the smallest side of the rescaled image).
  2. Testing
    • Dense evaluation:
      • Introducing FCN at test time: the authors do not crop images during testing; the whole image is passed through the network, and to obtain a fixed-size class vector at the last layer they apply average pooling (similar to Sermanet et al. 2014, OverFeat).
      • Multiple crops are inefficient: regarding the multi-crop evaluation adopted by Krizhevsky (AlexNet) and Szegedy (GoogLeNet), they consider it computationally expensive and not worth the cost.
      • Single-scale evaluation: shows that scale jittering of the training set does work, and that performance improves as the network goes deeper.
      • Multi-scale evaluation: run the model over several rescaled versions of a test image, then average the resulting class posteriors. This proves better than single-scale evaluation.
    • Multi-crop evaluation: using multiple crops (150 crops: three scales, with a 5x5 grid and horizontal flips at each scale) proves slightly better than dense evaluation.
    • Averaging the softmax outputs (posteriors) of dense and multi-crop evaluation performs a little better still. The authors hypothesize this is due to the different treatment of convolution boundary conditions. See Table 5.
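The multi-scale (scale-jittering) training step described above can be sketched as follows; the function name and the S range [256, 512] are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def jittered_crop_size(h, w, s_min=256, s_max=512, crop=224):
    """Multi-scale training sketch: sample S (the smallest rescaled side),
    rescale isotropically so that min side == S, then pick a random
    crop x crop window. Returns the rescaled size and the crop's corner."""
    s = rng.integers(s_min, s_max + 1)       # S ~ U[S_min, S_max]
    scale = s / min(h, w)                    # isotropic rescaling factor
    new_h, new_w = round(h * scale), round(w * scale)
    top = rng.integers(0, new_h - crop + 1)  # random crop position
    left = rng.integers(0, new_w - crop + 1)
    return (new_h, new_w), (top, left)

# e.g. a 375x500 image: the rescaled min side lands in [256, 512],
# and the 224x224 crop always fits inside the rescaled image
size, corner = jittered_crop_size(375, 500)
print(size, corner)
```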

3.4 Results

They secured 2nd place in the competition, slightly behind GoogLeNet.

Are the data sets sufficient?

Yes; the authors also test on many other datasets and demonstrate the generality of their model.

3.6 Advantages

  1. The advantage of using very small (3x3) filters: the authors compare a stack of three 3x3 filters with a single 7x7 filter, since three stacked 3x3 filters have the same receptive field as one 7x7 filter:

    • A stack of three 3x3 conv. layers incorporates three non-linear rectification layers instead of a single one, which makes the decision function more discriminative.
    • It decreases the number of parameters: assuming both the input and the output of a three-layer 3x3 convolutional stack have C channels, the stack is parametrised by 3(3^2 C^2) = 27C^2 weights; a single 7x7 conv. layer would require 7^2 C^2 = 49C^2 weights, i.e. 81% more.
      (Actually, I don't fully agree with the authors here; often the input channel count is far smaller than the output channel count. For example, with 3 input channels and 64 output channels, the parameter ratio is (3*3*3*64 + 3*3*64*64 + 3*3*64*64) / (7*7*3*64) = 75456 / 9408 ≈ 8.02, i.e. the 3x3 stack needs 8 times as many weights! Even when the number of input channels is half the number of output channels, which is the most common case, the small-filter scheme only slightly decreases the parameter count, to 0.918 of the large scheme.)
  2. Regarding multi-model fusion, they achieved 2nd place with 6.8% test error by combining only two models; regarding single-model performance, theirs is the best, with 7.0% test error.

  3. Compared with AlexNet, these nets require fewer epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller filter sizes; (b) pre-initialization of certain layers. (This is the authors' interpretation; in any case, it works.)
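The parameter counts in point 1, as well as the 8x counter-example in the note, can be verified directly; `conv_params` is my own helper:

```python
def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out

# same-channel case from the paper: three 3x3 layers vs one 7x7 layer, C channels
C = 64
stack = 3 * conv_params(3, C, C)   # 27 C^2
single = conv_params(7, C, C)      # 49 C^2
print(stack, single)               # 110592 200704 -> the 7x7 needs ~81% more

# the note's counter-example: the first layer maps 3 -> 64 channels
stack_3_64 = conv_params(3, 3, 64) + 2 * conv_params(3, 64, 64)
single_3_64 = conv_params(7, 3, 64)
print(stack_3_64 / single_3_64)    # ~8.02: here the 3x3 stack is far larger
```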

3.7 Weakness

  1. The paper does not explain why the error rate of their architecture saturates once the depth reaches 19 layers, even though the stated goal of the work is to investigate the effect of a ConvNet's depth.

4 What is the author’s next step?

Not mentioned.

5 What do other researchers say about this work?

  • Kaiming He et al. Deep Residual Learning for Image Recognition. CVPR, 2015. (ResNet) (VGG referred to as [41])
    • On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8×deeper than VGG nets [41] but still having lower complexity.
    • Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets [41]
    • Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sampled in [256; 480] for scale augmentation [41].
    • In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fullyconvolutional form as in [41, 13]
    • We adopt Faster R-CNN [32] as the detection method. Here we are interested in the improvements of replacing VGG-16 [41] with ResNet-101.
  • Ben Limonchik et al. 3D Model-Based Data Augmentation for Hand Gesture Recognition. arXiv, 2017
    • We experimented with various custom CNN architectures as well as pretrained models such as VGG-16 and Inception-ResNet-v2 in order to optimize our classification.
  • Shaoqing Ren et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv, 2016. (VGG referred to as [3])
    • For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.
    • The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals.
    • Using the expensive very deep models of [3], our detection method still has a frame rate of 5fps (including all steps) on a GPU.
    • In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.

Deduction: How to calculate the receptive field

Let's begin with the simplest case:

  1. Assuming all kernels are of size 3 (stride 1), one side of the receptive field of one pixel in the L-th layer is:
    2L + 1
  2. Assuming all kernels are of size k (stride 1), one side of the receptive field of one pixel in the L-th layer is:
    L(k - 1) + 1
  3. Assuming the kernel sizes are k_1, k_2, ..., k_L respectively, one side of the receptive field of one pixel in the L-th layer is:
    k_1 + (k_2 - 1) + ... + (k_L - 1) = \sum_{i=1}^{L}(k_i - 1) + 1
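The formulas above can be checked in a few lines (stride-1 convolutions assumed; `receptive_field` is my own helper name):

```python
def receptive_field(kernel_sizes):
    """One side of the receptive field after a stack of stride-1 convolutions:
    k_1 + (k_2 - 1) + ... + (k_L - 1) = sum(k_i - 1) + 1."""
    return sum(k - 1 for k in kernel_sizes) + 1

print(receptive_field([3, 3, 3]))  # 7 -> matches 2L + 1 with L = 3
print(receptive_field([7]))        # 7 -> three 3x3 layers see what one 7x7 sees
print(receptive_field([5, 3]))     # 7 -> another stack with the same coverage
```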