Deep Residual Learning for Image Recognition


Abstract

A residual learning framework is presented. Comprehensive empirical evidence shows that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.


Introduction

1. Deep networks integrate low/mid/high-level features. The "levels" of features can be enriched by the number of stacked layers (depth).

2. Is learning better networks as easy as stacking more layers? No. Vanishing/exploding gradients hamper convergence from the beginning. But this problem has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with back-propagation.

3. When deeper networks start converging, a degradation problem is exposed: as the network depth increases, accuracy gets saturated and then degrades rapidly. Such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.


4. Consider a shallower architecture and its deeper counterpart that adds more layers onto it. There is a solution by construction to the deeper model: the added layers are identity mappings, and the other layers are copied from the learned shallower model. This construction implies that a deeper model should produce no higher training error than its shallower counterpart, but experiments show that solvers in practice are unable to find solutions comparable to or better than this constructed one.

5. In this paper, the authors address the degradation problem by introducing a deep residual learning framework. Denoting the desired underlying mapping as H(x), they explicitly let the stacked layers fit the residual mapping F(x) := H(x) - x rather than the original, unreferenced mapping H(x); the original mapping is then recast as F(x) + x.

The formulation F(x) + x can be realized by feedforward neural networks with "shortcut connections". Here the shortcut connections simply perform identity mapping, as illustrated in the sketch below.
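
As a concrete illustration (not the authors' code), here is a minimal PyTorch-style sketch of this F(x) + x building block, assuming the two-layer form of F used in the experiments; the class and variable names are invented for this example.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Residual block: output = ReLU(F(x) + x), with F(x) built from two 3x3 conv layers."""

    def __init__(self, channels):
        super().__init__()
        # Residual mapping F(x): two stacked 3x3 convolutions with batch norm
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                  # shortcut: identity mapping, no extra parameters
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))               # F(x)
        return self.relu(out + identity)              # F(x) + x, followed by the nonlinearity

# quick shape check
x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```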


Through evaluating their method, they show: 1) extremely deep residual nets are easy to optimize, while the counterpart "plain" nets exhibit higher training error when the depth increases; 2) deep residual nets can easily enjoy accuracy gains from greatly increased depth.


Related Work

1. Residual Representations.

2. Shortcut Connections: An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output.

Deep Residual Learning

If the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart.

Identity Mapping by Shortcuts:

A building block is defined as y = F(x, {W_i}) + x, where x and y are the input and output of the layers considered and F(x, {W_i}) is the residual mapping to be learned. If the dimensions of x and F differ, a linear projection W_s can be performed on the shortcut: y = F(x, {W_i}) + W_s x.

 The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers, while more layers are possible.

Network Architectures

Three types of networks are compared: a VGG-style plain baseline, a 34-layer plain network, and a 34-layer residual network.

When the dimensions increase, we consider two options: (A) the shortcut still performs identity mapping, with extra zero entries padded for the increasing dimensions; this option introduces no extra parameters. (B) A projection shortcut (a 1×1 convolution on the shortcut path) is used to match dimensions.
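
A hedged sketch of option (B), again in PyTorch with invented names: when a block doubles the channels and halves the spatial size, a 1×1 convolution with the same stride on the shortcut path matches the dimensions so the addition is still possible.

```python
import torch
import torch.nn as nn

class ProjectionResidualBlock(nn.Module):
    """Residual block whose dimensions increase: option (B), a projection shortcut via 1x1 conv."""

    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut: 1x1 conv with the same stride, so its output shape matches F(x)
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))      # F(x) + W_s x

x = torch.randn(1, 64, 56, 56)
print(ProjectionResidualBlock(64, 128)(x).shape)  # torch.Size([1, 128, 28, 28])
```

Option (A) would instead downsample the shortcut and pad the extra channels with zeros, adding no parameters at all.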


Implementation

1. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted.

2. The standard color augmentation is used.

3. We adopt batch normalization (BN) right after each convolution and before activation.

4. We initialize the weights as in [13] and train all plain/residual nets from scratch.

5. We use SGD with a mini-batch size of 256.

6. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60×10^4 iterations.

7. We use a weight decay of 0.0001 and a momentum of 0.9.

8. We do not use dropout. (The settings in this list are collected into a code sketch below.)
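
The eight points above can be gathered into a short PyTorch training-setup sketch (my own code, not the authors'; the model is a placeholder, the torchvision transforms only approximate the described augmentation, and the normalization constants are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Items 1-3: scale augmentation, random 224x224 crop, horizontal flip.
# The per-pixel mean subtraction and standard color augmentation are approximated
# here by a per-channel Normalize; the mean/std values are assumed, not from the paper.
train_transform = T.Compose([
    T.RandomChoice([T.Resize(s) for s in (256, 320, 384, 480)]),  # coarse stand-in for sampling the shorter side in [256, 480]
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Items 5-8: SGD with mini-batch 256, lr 0.1 divided by 10 when the error plateaus,
# weight decay 1e-4, momentum 0.9, and no dropout anywhere in the model.
model = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # placeholder standing in for a full plain/residual net
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
# In the training loop: call scheduler.step(validation_error) after each evaluation,
# and stop after at most 60e4 iterations.
```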

Findings

1. The ResNet eases the optimization by providing faster convergence at the early stage, compared to the plain nets.

2. Deeper Bottleneck Architectures


The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions.

If the identity shortcut is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.

Deep non-bottleneck ResNets also gain accuracy from increased depth, but are not as economical as the bottleneck ResNets. So the usage of bottleneck designs is mainly due to practical considerations; a sketch of such a bottleneck block follows.
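
A sketch of a bottleneck block (PyTorch-style, names invented), using the 256→64→64→256 configuration described in the paper with an identity shortcut; the weight counts in the comments show why a projection on the 256-d ends would roughly double the block's size.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """1x1 -> 3x3 -> 1x1 bottleneck: reduce to 64-d, convolve, restore to 256-d, add the identity shortcut."""

    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False)                # 256*64     ~ 16k weights
        self.conv3x3 = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False)  # 64*64*3*3  ~ 37k weights
        self.expand = nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False)                # 64*256     ~ 16k weights
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        # A projection shortcut between the two 256-d ends would add 256*256 ~ 66k weights,
        # roughly doubling the ~70k weights above, hence the preference for identity shortcuts here.

    def forward(self, x):
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.conv3x3(out)))
        out = self.bn3(self.expand(out))
        return self.relu(out + x)  # identity shortcut connects the high-dimensional ends

x = torch.randn(1, 256, 56, 56)
print(BottleneckBlock()(x).shape)  # torch.Size([1, 256, 56, 56])
```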


3. On CIFAR-10, the deep plain nets suffer from increased depth, and exhibit higher training error when going deeper. This phenomenon is similar to that observed on ImageNet and on MNIST, suggesting that such an optimization difficulty is a fundamental problem.

4. The residual functions might be generally closer to zero than the non-residual functions.

5. The testing result of the 1202-layer network is worse than that of the 110-layer network, although both have similar training error. The authors argue that this is because of overfitting: the 1202-layer network may be unnecessarily large for this small dataset.
