Key Points
1 Vanishing/exploding gradients are largely addressed by normalized initialization and intermediate normalization layers (e.g., batch normalization, BN)
2 As depth increases, a plain network's accuracy saturates and then degrades rapidly; this is not caused by overfitting (the degradation problem)
3 ResNet hypothesizes that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping.
Denote the desired underlying mapping as H(x) and the residual mapping as F(x), where F(x) := H(x) − x; the original mapping then becomes F(x) + x.
Reasoning: The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find perturbations with reference to an identity mapping than to learn the function as a new one (see the sketch below).
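Below is a minimal sketch of such a residual block, assuming PyTorch; the class name BasicResidualBlock and its layer layout are illustrative, not the authors' reference implementation. The stacked convolutions learn F(x), and the shortcut adds x back so the block outputs F(x) + x.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Illustrative residual block: output = F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        # F(x): two 3x3 conv layers with batch normalization
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # If the optimal mapping is near identity, the conv weights can be
        # driven toward zero so the output stays close to x.
        return self.relu(residual + x)  # H(x) = F(x) + x
```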
4 The Design of ResNet:
Bottleneck
Reducing the number of channels with 1 x 1 convolutions (and restoring them afterwards) cuts parameters and computation, which makes much deeper networks affordable (see the sketch after this list).
The parameter-free identity shortcuts are particularly important for the bottleneck architectures.
Identity Mapping by Shortcuts (see Figure 2)
When the dimensions increase, two methods:
A) identity mapping with zero padding
B) projection shortcut by 1 x 1 convolutions
Network Framework
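The following is a minimal sketch of a bottleneck block with the shortcut options above, again assuming PyTorch; the name BottleneckBlock and its exact arguments are illustrative rather than the paper's reference code. Option B (projection) is implemented; option A (zero padding) is noted in a comment.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Illustrative bottleneck block: 1x1 reduce -> 3x3 -> 1x1 restore, plus shortcut."""
    def __init__(self, in_channels: int, mid_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        # The 1x1 convs reduce and then restore the channel count, so the 3x3 conv
        # operates on fewer channels: fewer parameters and less computation.
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        if stride != 1 or in_channels != out_channels:
            # Option B: projection shortcut (1x1 conv) when dimensions change.
            # Option A would instead keep the shortcut parameter-free by
            # downsampling and padding the identity with zeros.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            # Parameter-free identity shortcut, the common case.
            self.shortcut = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.shortcut(x))
```

For example, a stage like ResNet-50's conv2 stacks blocks of the form BottleneckBlock(256, 64, 256), where the 3x3 conv runs on 64 channels instead of 256.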
5 Experiments on ImageNet show:
- ResNets are easier to optimize; the plain nets exhibit higher training error as depth increases
- ResNets easily gain accuracy from increased depth, producing results substantially better than previous networks
6 Experiment Results: