Deep Residual Learning for Image Recognition


Abstract

A residual learning framework is presented. Comprehensive empirical evidence shows that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.


Introduction

1. Deep networks integrate low/mid/high-level features. The "levels" of features can be enriched by the number of stacked layers (depth).

2. Is learning better networks as easy as stacking more layers? No. Vanishing/exploding gradients hamper convergence from the beginning. But this problem has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with back-propagation.

3. When deeper networks start converging, a degradation problem is exposed: as the network depth increases, accuracy gets saturated and then degrades rapidly. Such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.


4. Consider a shallower architecture and its deeper counterpart that adds more layers onto it. There is a solution by construction to the deeper model: the added layers are identity mappings, and the other layers are copied from the learned shallower model. This construction implies that a deeper model should produce no higher training error than its shallower counterpart, but experiments show that solvers in practice are unable to find solutions comparable to or better than this constructed one.

5. In this paper, the authors address the degradation problem by introducing a deep residual learning framework. Denoting the desired underlying mapping as H(x), they explicitly let the stacked layers fit the residual mapping F(x) := H(x) - x rather than the original, unreferenced mapping H(x); the original mapping is then recast as F(x) + x.

The formulation F(x) + x can be realized by feedforward neural networks with "shortcut connections". Here the shortcut connections simply perform identity mapping, as illustrated in the sketch below.
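
As a concrete illustration (not the authors' code), here is a minimal PyTorch-style sketch of this F(x) + x building block, assuming the two-layer form of F used in the experiments; the class and variable names are invented for this example.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Residual block: output = ReLU(F(x) + x), with F(x) built from two 3x3 conv layers."""

    def __init__(self, channels):
        super().__init__()
        # Residual mapping F(x): two stacked 3x3 convolutions with batch norm
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                  # shortcut: identity mapping, no extra parameters
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))               # F(x)
        return self.relu(out + identity)              # F(x) + x, followed by the nonlinearity

# quick shape check
x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```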


Through evaluating their method, they show: 1) extremely deep residual nets are easy to optimize, while the counterpart "plain" nets exhibit higher training error when the depth increases; 2) deep residual nets can easily enjoy accuracy gains from greatly increased depth.


Related Work

1. Residual Representations.

2. Shortcut Connections: An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output.

Deep Residual Learning

If the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart.

Identity Mapping by Shortcuts:

A building block is defined as y = F(x, {W_i}) + x, where x and y are the input and output of the layers considered and F(x, {W_i}) is the residual mapping to be learned. If the dimensions of x and F differ, a linear projection W_s can be performed on the shortcut: y = F(x, {W_i}) + W_s x.

 The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers, while more layers are possible.

Network Architectures

Three types of networks are compared: a VGG-style plain baseline, a 34-layer plain network, and a 34-layer residual network.

When the dimensions increase, we consider two options: (A) the shortcut still performs identity mapping, with extra zero entries padded for the increasing dimensions; this option introduces no extra parameters. (B) A projection shortcut (a 1×1 convolution on the shortcut path) is used to match dimensions.
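
A hedged sketch of option (B), again in PyTorch with invented names: when a block doubles the channels and halves the spatial size, a 1×1 convolution with the same stride on the shortcut path matches the dimensions so the addition is still possible.

```python
import torch
import torch.nn as nn

class ProjectionResidualBlock(nn.Module):
    """Residual block whose dimensions increase: option (B), a projection shortcut via 1x1 conv."""

    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut: 1x1 conv with the same stride, so its output shape matches F(x)
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))      # F(x) + W_s x

x = torch.randn(1, 64, 56, 56)
print(ProjectionResidualBlock(64, 128)(x).shape)  # torch.Size([1, 128, 28, 28])
```

Option (A) would instead downsample the shortcut and pad the extra channels with zeros, adding no parameters at all.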


Implementation

1. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted.

2. The standard color augmentation is used.

3. We adopt batch normalization (BN) right after each convolution and before activation.

4. We initialize the weights as in [13] and train all plain/residual nets from scratch.

5. We use SGD with a mini-batch size of 256.

6. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60×10^4 iterations.

7. We use a weight decay of 0.0001 and a momentum of 0.9.

8. We do not use dropout. (The settings in this list are collected into a code sketch below.)
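
The eight points above can be gathered into a short PyTorch training-setup sketch (my own code, not the authors'; the model is a placeholder, the torchvision transforms only approximate the described augmentation, and the normalization constants are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Items 1-3: scale augmentation, random 224x224 crop, horizontal flip.
# The per-pixel mean subtraction and standard color augmentation are approximated
# here by a per-channel Normalize; the mean/std values are assumed, not from the paper.
train_transform = T.Compose([
    T.RandomChoice([T.Resize(s) for s in (256, 320, 384, 480)]),  # coarse stand-in for sampling the shorter side in [256, 480]
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Items 5-8: SGD with mini-batch 256, lr 0.1 divided by 10 when the error plateaus,
# weight decay 1e-4, momentum 0.9, and no dropout anywhere in the model.
model = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # placeholder standing in for a full plain/residual net
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
# In the training loop: call scheduler.step(validation_error) after each evaluation,
# and stop after at most 60e4 iterations.
```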

Findings

1. The ResNet eases the optimization by providing faster convergence at the early stage, compared to the plain nets.

2. Deeper Bottleneck Architectures


The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions.

If the identity shortcut is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.

Deep non-bottleneck ResNets also gain accuracy from increased depth, but are not as economical as the bottleneck ResNets. So the usage of bottleneck designs is mainly due to practical considerations; a sketch of such a bottleneck block follows.
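
A sketch of a bottleneck block (PyTorch-style, names invented), using the 256→64→64→256 configuration described in the paper with an identity shortcut; the weight counts in the comments show why a projection on the 256-d ends would roughly double the block's size.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """1x1 -> 3x3 -> 1x1 bottleneck: reduce to 64-d, convolve, restore to 256-d, add the identity shortcut."""

    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False)                # 256*64     ~ 16k weights
        self.conv3x3 = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False)  # 64*64*3*3  ~ 37k weights
        self.expand = nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False)                # 64*256     ~ 16k weights
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        # A projection shortcut between the two 256-d ends would add 256*256 ~ 66k weights,
        # roughly doubling the ~70k weights above, hence the preference for identity shortcuts here.

    def forward(self, x):
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.conv3x3(out)))
        out = self.bn3(self.expand(out))
        return self.relu(out + x)  # identity shortcut connects the high-dimensional ends

x = torch.randn(1, 256, 56, 56)
print(BottleneckBlock()(x).shape)  # torch.Size([1, 256, 56, 56])
```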


3. On CIFAR-10, the deep plain nets suffer from increased depth, and exhibit higher training error when going deeper. This phenomenon is similar to that observed on ImageNet and on MNIST, suggesting that such an optimization difficulty is a fundamental problem.

4. The residual functions might be generally closer to zero than the non-residual functions.

5. The testing result of the 1202-layer network is worse than that of the 110-layer network, although both have similar training error. The authors argue that this is because of overfitting: the 1202-layer network may be unnecessarily large for this small dataset.
