Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015). [Citations: 532].
[Internal Covariate Shift] The distribution of each layer’s inputs changes during training, as the parameters of all preceding layers change.
• Small changes to the network parameters amplify as the network becomes deeper.
• Layers need to continuously adapt to the new distributions.
• The optimizer is more likely to get stuck in the saturated regime of the nonlinearity, which slows convergence.
[Goal] Keep each layer’s input distribution fixed over time to ease training.
[Idea] Normalize the network’s input to have mean 0 (the zero vector) and covariance I (whitening).
• Then we have fixed distributions of inputs and remove the ill effects of the internal covariate shift.
• But whitening is expensive, so we normalize each feature dimension independently instead, i.e., make each x_j have mean 0 and variance 1.
• The μ_j and σ_j are estimated from each mini-batch.
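The per-dimension normalization above can be sketched in NumPy. This is a minimal illustration, not the paper's code: the function name `normalize_per_dimension`, the small constant `eps` (added for numerical stability), and the toy data are assumptions. It also shows that, unlike whitening, per-dimension normalization leaves the features correlated.

```python
import numpy as np

def normalize_per_dimension(x, eps=1e-5):
    """Normalize each feature (column) of a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)   # per-dimension mini-batch mean, shape (d,)
    var = x.var(axis=0)   # per-dimension (biased) mini-batch variance, shape (d,)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Toy mini-batch: two correlated features with different scales and offsets.
base = rng.normal(size=(256, 1))
x = np.hstack([3.0 * base + 5.0, -2.0 * base + rng.normal(size=(256, 1))])

x_hat = normalize_per_dimension(x)
means = x_hat.mean(axis=0)       # approximately 0 in every dimension
variances = x_hat.var(axis=0)    # approximately 1 in every dimension
# The off-diagonal correlation stays far from 0: the features are not decorrelated.
corr = np.corrcoef(x_hat, rowvar=False)[0, 1]
```

The cost is one mean and one variance per dimension, whereas full whitening would need the d × d covariance matrix and its inverse square root.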
[Issue] Normalization may change what the network can represent.
• E.g., normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.
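The sigmoid example can be checked numerically. The sketch below compares the sigmoid with its first-order Taylor expansion around 0; the helper names and the chosen input ranges are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_approx(x):
    # First-order Taylor expansion of the sigmoid around 0.
    return 0.5 + x / 4.0

near_zero = np.linspace(-1.0, 1.0, 201)   # typical range of normalized inputs
saturated = np.linspace(4.0, 6.0, 201)    # inputs far from zero

# Largest gap between the sigmoid and its linear approximation on each range.
err_near_zero = np.max(np.abs(sigmoid(near_zero) - linear_approx(near_zero)))
err_saturated = np.max(np.abs(sigmoid(saturated) - linear_approx(saturated)))
```

On the range around zero the sigmoid is almost indistinguishable from a straight line, so a layer fed only normalized inputs would behave nearly linearly; the gap becomes large only for inputs far from zero.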
[Solution] Scale and shift the normalized value x̂_j = (x_j − μ_j) / σ_j: a_j = γ_j x̂_j + β_j.
• The γ_j and β_j are learned from data.
• If γ_j = σ_j and β_j = μ_j, then a_j = x_j.
[Batch Normalizing Transform Algorithm] See Alg. 1.
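As a rough companion to the algorithm, the training-time transform (mini-batch mean, mini-batch variance, normalize, then scale and shift) might look as follows in NumPy. The function name `batch_norm_forward`, the default `eps`, and the toy data are assumptions for illustration; the backward pass is omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch x of shape (m, d) with learned gamma, beta of shape (d,)."""
    mu = x.mean(axis=0)                      # mini-batch mean per dimension
    var = x.var(axis=0)                      # mini-batch (biased) variance per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    a = gamma * x_hat + beta                 # scale and shift
    return a, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(64, 4))

# With gamma = 1 and beta = 0 the output is simply the normalized input.
a, mu, var = batch_norm_forward(x, np.ones(4), np.zeros(4))

# With gamma = sigma and beta = mu the transform recovers the original input.
# (sigma here includes eps so that the identity is exact in this sketch.)
a_id, _, _ = batch_norm_forward(x, np.sqrt(var + 1e-5), mu)
```

The second call illustrates why the scale and shift restore representational power: the identity mapping is among the functions the layer can learn.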
[Testing] Use moving averages of the μ_j and σ_j^2 collected over training mini-batches instead of per-batch statistics.
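One common way to maintain such moving averages is an exponential moving average, sketched below. The class name `BatchNorm1D`, the `momentum` value of 0.9, and the use of the biased batch variance are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np

class BatchNorm1D:
    """Hypothetical BN layer that tracks moving averages for use at test time."""

    def __init__(self, d, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(d)
        self.beta = np.zeros(d)
        self.running_mean = np.zeros(d)
        self.running_var = np.ones(d)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training):
        if training:
            # Use mini-batch statistics and update the moving averages.
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = self.momentum * self.running_mean + (1.0 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1.0 - self.momentum) * var
        else:
            # At test time, use the fixed moving averages instead.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(d=3)
for _ in range(200):
    bn.forward(rng.normal(loc=4.0, scale=2.0, size=(32, 3)), training=True)

# A single test example: its output depends only on the stored statistics.
single = np.array([[4.0, 4.0, 4.0]])
out = bn.forward(single, training=False)
```

With fixed statistics the test-time output for an example is deterministic and does not depend on which other examples share its batch.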
[BN Convolutional Networks] Add BN between conv and relu layers.
• Conv layers’ output is more likely to be Gaussian.
• The layer input, being the output of another nonlinearity, has a distribution whose shape is likely to change during training.
We want different elements of the same feature map, at different locations, to be normalized in the same way.
• We learn a pair of parameters γ_j and β_j per feature map.
• The effective mini-batch size is mHW, for a mini-batch of m examples and feature maps of size H × W.
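For convolutional layers, the per-feature-map statistics can be sketched by reducing over the batch and both spatial axes. The function name `batch_norm_conv`, the (m, C, H, W) memory layout, and the toy sizes are assumptions of this example.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """BN for conv activations x of shape (m, C, H, W); gamma, beta have shape (C,)."""
    # One mean and one variance per feature map, computed over m * H * W values.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # One (gamma, beta) pair per feature map, broadcast over batch and locations.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
m, C, H, W = 8, 3, 5, 5
x = rng.normal(loc=1.0, scale=2.0, size=(m, C, H, W))

y = batch_norm_conv(x, np.ones(C), np.zeros(C))
effective_batch = m * H * W                  # number of values behind each statistic
per_channel_mean = y.mean(axis=(0, 2, 3))    # approximately 0 per feature map
per_channel_var = y.var(axis=(0, 2, 3))      # approximately 1 per feature map
```

Sharing the statistics across locations preserves the convolutional property: every position in a feature map goes through the same affine transform.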
[BN Enables Higher Learning Rates] BN prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations and gradients.
[BN Reduces the Strong Dependence on Initialization]
[BN Regularizes the Model] A training example is seen in conjunction with other examples in the mini-batch, which acts as a form of regularization. It may also slightly reduce the need for dropout.
[BN Helps the Network Train Faster and Achieve Higher Accuracy]
[Experiment] GoogLeNet + BN + ensemble of 6 networks
• 4.9% top-5 validation error.
• 4.8% test error.
• Exceeding the accuracy of human raters.
• Cf. GoogLeNet ensemble: 6.67%.
[Internal Covariate Shift Revisited] Having the same mean and variance for each input does not mean that the data distributions are the same.
• BN actually prevents gradient vanishing.
• BN enlarges the scale of activations that would otherwise be small.
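The opening point of this section, that matching mean and variance does not imply matching distributions, can be illustrated with two samples that share both moments. The choice of a Gaussian versus a uniform distribution and the helper `excess_kurtosis` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

gaussian = rng.normal(0.0, 1.0, size=n)
# A uniform distribution on [-sqrt(3), sqrt(3)] also has mean 0 and variance 1.
uniform = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)

def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    z = (v - v.mean()) / v.std()
    return (z ** 4).mean() - 3.0

k_gauss = excess_kurtosis(gaussian)   # close to 0
k_unif = excess_kurtosis(uniform)     # close to -1.2, the value for any uniform distribution
```

Both samples would pass through a normalization step unchanged in their first two moments, yet their shapes differ clearly in the fourth moment.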
[When to Use BN] When learning is slow, or when gradients explode.
X.-S. Wei. https://www.zhihu.com/question/38102762.
F.-F. Li, A. Karpathy and J. Johnson. http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf.