Algorithms
The batch normalization operation normalizes the elements $x_i$ of the input by first calculating the mean $\mu_B$ and variance $\sigma_B^2$ over the spatial, time, and observation dimensions for each channel independently. Then, it calculates the normalized activations as

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$$

where $\epsilon$ is a constant that improves numerical stability when the variance is very small.
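The per-channel normalization described above can be sketched in NumPy. This is an illustration, not the layer's actual implementation; the `(observations, height, width, channels)` layout and the default `epsilon` value are assumptions for the example.

```python
import numpy as np

def batchnorm_normalize(x, epsilon=1e-5):
    """Normalize activations per channel (illustrative sketch).

    Assumes x has shape (observations, height, width, channels);
    statistics are computed over every dimension except channels.
    """
    # Mean and variance over the observation and spatial dimensions,
    # computed independently for each channel.
    mu = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    # epsilon guards against division by a near-zero variance.
    return (x - mu) / np.sqrt(var + epsilon)
```

After this step, each channel of the output has approximately zero mean and unit variance over the batch.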
To allow for the possibility that inputs with zero mean and unit variance are not optimal for
the operations that follow batch normalization, the batch normalization operation further
shifts and scales the activations using the transformation

$$y_i = \gamma \hat{x}_i + \beta,$$

where the offset $\beta$ and scale factor $\gamma$ are learnable parameters that are updated during network training.
To make predictions with the network after training, batch normalization requires a fixed mean and variance to normalize the data. This fixed mean and variance can be calculated from the training data after training, or approximated during training using a running estimate of the statistics.
If the 'BatchNormalizationStatistics' training option is 'moving', then the software approximates the batch normalization statistics during training using a running estimate and, after training, sets the TrainedMean and TrainedVariance properties to the latest values of the moving estimates of the mean and variance, respectively.
If the 'BatchNormalizationStatistics' training option is 'population', then after network training finishes, the software passes through the data once more and sets the TrainedMean and TrainedVariance properties to mean and variance computed from the entire training data set, respectively.
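The running estimate in the 'moving' case is typically an exponential moving average of the per-batch statistics, and prediction then normalizes with the fixed trained values instead of batch statistics. The sketch below illustrates both steps; the `momentum` value and the function names are assumptions for the example, not the software's actual interface.

```python
import numpy as np

def update_moving_stats(batch_mean, batch_var,
                        moving_mean, moving_var, momentum=0.9):
    """One running-estimate update step (momentum value is assumed)."""
    new_mean = momentum * moving_mean + (1 - momentum) * batch_mean
    new_var = momentum * moving_var + (1 - momentum) * batch_var
    return new_mean, new_var

def batchnorm_predict(x, trained_mean, trained_var,
                      gamma, beta, epsilon=1e-5):
    """Prediction-time normalization: fixed trained statistics are
    used in place of per-batch statistics."""
    x_hat = (x - trained_mean) / np.sqrt(trained_var + epsilon)
    return gamma * x_hat + beta
```

Because the trained mean and variance are fixed, prediction-time batch normalization reduces to a deterministic per-channel affine transformation of the input.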
The layer uses the TrainedMean and TrainedVariance to normalize the input during prediction.