Training a CNN is hard because
• The large number of parameters requires heavy computation.
• The learning objective is non-convex and has many poor local minima.
• Deep networks suffer from the vanishing/exploding gradients problem (see the sketch after this list).
• A large amount of training data is needed.
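To see why gradients vanish or explode, note that backpropagation through a deep stack of layers multiplies the gradient by one weight matrix per layer, so its norm shrinks or grows roughly exponentially with depth. The following NumPy sketch illustrates this for a purely linear stack; the depth, width, and weight scales are assumed values chosen only for illustration.

```python
import numpy as np

# Minimal sketch (hypothetical depth, width, and weight scales): the gradient is
# multiplied by one weight matrix per layer during backpropagation, so its norm
# shrinks or blows up exponentially with depth depending on the weight scale.
rng = np.random.default_rng(0)
depth, width = 50, 100
grad = rng.standard_normal(width)

for scale in [0.05, 0.1, 0.2]:            # std of the weight entries (assumed values)
    g = grad.copy()
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale
        g = W.T @ g                        # one backpropagation step through a linear layer
    print(f"weight std {scale:.2f}: gradient norm after {depth} layers = {np.linalg.norm(g):.3e}")
```

With a standard Gaussian matrix scaled by std σ, each step multiplies the gradient norm by roughly σ·√width, so σ = 0.05 makes the gradient vanish, σ = 0.2 makes it explode, and σ = 0.1 keeps it roughly stable for width 100.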
To handle vanishing/exploding gradients, the main methods include
• Carefully setting the learning rate.
• Designing better CNN architectures, activation functions, etc.
• Carefully initializing the weights.
• Tuning the data distribution.
In this section, we will focus on the last two of these methods for handling the vanishing/exploding gradients problem.
Weight initialization is very important in deep learning. I think one of the reasons early networks did not work as well is that people did not pay much attention to it.
Initializing all the weights to 0 is a bad idea, since all the neurons then compute the same output, receive the same gradient, and learn the same features.
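To make this symmetry problem concrete, here is a minimal NumPy sketch; the two-layer network, toy data, and layer sizes are all hypothetical. With all-zero weights, one gradient step leaves every hidden unit identical, while a small random initialization breaks the symmetry so the units can learn different features.

```python
import numpy as np

# Minimal sketch (hypothetical 2-layer network and toy data): with all-zero weights,
# every hidden neuron computes the same output and receives the same gradient, so the
# columns of W1 stay identical after the update; small random init breaks this symmetry.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))           # 8 toy samples, 4 input features
y = rng.standard_normal((8, 1))           # toy regression targets

def one_gradient_step(W1, W2, lr=0.1):
    h = np.tanh(x @ W1)                   # hidden activations
    pred = h @ W2
    err = pred - y
    dW2 = h.T @ err
    dW1 = x.T @ ((err @ W2.T) * (1 - h**2))   # backprop through tanh
    return W1 - lr * dW1, W2 - lr * dW2

# Zero initialization: all hidden units remain identical (symmetry is never broken).
W1, W2 = one_gradient_step(np.zeros((4, 3)), np.zeros((3, 1)))
print("zero init, distinct hidden weight columns:", np.unique(W1, axis=1).shape[1])

# Small random initialization: hidden units differ and can learn different features.
W1, W2 = one_gradient_step(0.01 * rng.standard_normal((4, 3)),
                           0.01 * rng.standard_normal((3, 1)))
print("random init, distinct hidden weight columns:", np.unique(W1, axis=1).shape[1])
```

After the zero-initialized step, every column of W1 is still identical, whereas the randomly initialized network already has three distinct hidden units.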