Key question
- Vanishing/exploding gradients hamper convergence from the beginning as networks become deeper.
- With increasing network depth, accuracy first saturates (which might be unsurprising) and then degrades rapidly; notably, this degradation is not caused by overfitting, since training error also increases.
Methods
- Skip connections (identity shortcuts).
- The form of the residual function F is flexible.
- The function F(x, {W_i}) can represent multiple convolutional layers; the element-wise addition y = F(x, {W_i}) + x is performed on the two feature maps, channel by channel (see the sketch below).
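A minimal sketch of a basic residual block, assuming a PyTorch-style implementation (my choice of framework, not the authors' code): F(x, {W_i}) is two 3x3 conv + batch-norm layers, and the identity shortcut is a parameter-free element-wise addition.

```python
import torch.nn as nn


class BasicBlock(nn.Module):
    """Basic residual block: y = F(x, {W_i}) + x, followed by ReLU."""

    def __init__(self, channels):
        super().__init__()
        # F(x, {W_i}): two 3x3 convolutions with batch norm (basic block form)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Identity shortcut: element-wise addition of F(x) and x, channel by channel;
        # it introduces no extra parameters
        return self.relu(out + x)
```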
Architecture
Architectures for ImageNet
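A sketch of how such blocks might stack into the 34-layer ImageNet architecture (my reconstruction from the paper's architecture table, not the authors' code): a 7x7 stride-2 stem with max pooling, four stages of [3, 4, 6, 3] basic blocks with 64/128/256/512 channels, and a 1x1 projection shortcut (the paper's option B) wherever a stage changes the feature-map dimensions.

```python
import torch.nn as nn


class ResBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut (1x1 conv) only where the shapes of x and F(x) differ
        self.shortcut = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                          nn.BatchNorm2d(out_ch))
            if stride != 1 or in_ch != out_ch else nn.Identity()
        )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))


def resnet34_backbone():
    # Stem: 7x7 conv with stride 2, then 3x3 max pool with stride 2
    layers = [nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
              nn.BatchNorm2d(64), nn.ReLU(inplace=True),
              nn.MaxPool2d(3, stride=2, padding=1)]
    in_ch = 64
    # Four stages of basic blocks; the first block of each later stage downsamples
    for out_ch, n_blocks in [(64, 3), (128, 4), (256, 6), (512, 3)]:
        for i in range(n_blocks):
            stride = 2 if (i == 0 and out_ch != 64) else 1
            layers.append(ResBlock(in_ch, out_ch, stride))
            in_ch = out_ch
    return nn.Sequential(*layers)
```

For classification, global average pooling and a 1000-way fully-connected layer with softmax would follow this backbone, as in the paper.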
Experiments
(Figure: Training on ImageNet. Thin curves denote training error; bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. The residual networks have no extra parameters compared to their plain counterparts.)