Glorot, Xavier, and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks.” AISTATS, vol. 9, 2010. [Citations: 722].
Plateaus are less pronounced with the softmax (cross-entropy) cost function, whereas the quadratic cost produces more severe plateaus.
2 Xavier Initialization
[Motivation] Ensure that all neurons in the network initially have approximately the same output distribution; empirically, this improves the rate of convergence.
[Forward Pass] Consider a linear activation function (or assume we are operating in the linear regime at initialization).
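A minimal NumPy sketch of the idea: drawing weights from the Xavier/Glorot uniform distribution with limit sqrt(6 / (n_in + n_out)), which gives Var(W) = 2 / (n_in + n_out), and checking that (in the linear regime, with equal fan-in and fan-out) the activation variance stays roughly constant across layers. The function name `xavier_uniform` is our own, not from the paper.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Xavier/Glorot uniform init: U(-a, a) with a = sqrt(6 / (n_in + n_out)),
    so that Var(W) = a^2 / 3 = 2 / (n_in + n_out)."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

# Sanity check: with n_in = n_out = 256, Var(W) = 1/256, so a linear layer
# preserves activation variance: Var(h) = n_in * Var(W) * Var(x) = Var(x).
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 256))
h = x
for _ in range(5):
    h = h @ xavier_uniform(256, 256, rng)  # linear regime: no nonlinearity
print(x.var(), h.var())  # the two variances should be of the same order
```

Note that exact variance preservation in both the forward and backward pass is impossible when n_in != n_out; the 2 / (n_in + n_out) variance is the paper's compromise between the two.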