An article on batch normalization, detailing the specific advantages of using BN
1. The relationship between weight initialization and preprocessing methods in neural networks
Preprocessing the data, such as whitening, z-score normalization, or even simply subtracting the mean, can speed up convergence, as the simple example in the following figure illustrates:
The red dots in the figure represent 2-dimensional data points. Since each dimension of image data is generally a number between 0 and 255, the data points fall only in the first quadrant. Image data is also strongly correlated: if one pixel's gray value is 30, which is relatively dark, the value of a neighboring pixel will generally not exceed 100, or it would look like noise. Because of this strong correlation, the data points fall only in a small region of the first quadrant, forming the long, narrow distribution shown in the figure above.
When a neural network model is initialized, the weights W are randomly sampled. A common neuron is expressed as $ReLU(Wx+b) = \max(Wx+b, 0)$: the hyperplane $Wx+b=0$ divides the input space into two sides, and ReLU shrinks one side to zero while leaving the other unchanged.
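The neuron above can be checked numerically. A minimal NumPy sketch, where the values of `W`, `b`, and `x` are made up purely for illustration:

```python
import numpy as np

# Illustrative neuron: ReLU(Wx + b) = max(Wx + b, 0).
# W, b, and x are made-up values for demonstration only.
W = np.array([[1.0, -2.0]])   # 1 output unit, 2 input dimensions
b = np.array([0.5])
x = np.array([3.0, 2.0])

pre_activation = W @ x + b               # Wx + b = 3.0 - 4.0 + 0.5 = -0.5
out = np.maximum(pre_activation, 0.0)    # ReLU clips the negative side to 0
print(out)  # [0.]
```

Because `Wx + b` is negative here, the point lies on the side of the hyperplane that ReLU collapses to zero.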
Random hyperplanes $Wx+b=0$ appear as the random dotted lines in the figure above. Note that the two green dotted lines are effectively meaningless: with gradient descent, it may take many iterations before such lines segment the data points effectively, the way the purple dotted line does, which inevitably slows down convergence. Worse, this is only a two-dimensional illustration in which the data occupies one of four quadrants; imagine hundreds, thousands, or tens of thousands of dimensions. Moreover, the data occupies only a small region of the first quadrant, so without preprocessing a great deal of computation is wasted, and the many hyperplanes falling outside the data distribution make it easy to get stuck in a local optimum during training, leading to overfitting.
If we subtract the mean from the data, the data points are no longer confined to the first quadrant, and the probability that a random hyperplane falls inside the data distribution increases by roughly a factor of $2^n$. If we further apply a decorrelation algorithm such as PCA or ZCA whitening, the data is no longer long and narrow, and the probability that a random hyperplane is effective increases greatly.
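ZCA whitening can be sketched in a few lines of NumPy. The synthetic correlated data below is purely illustrative; the whitening matrix is the inverse square root of the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "long and narrow" correlated 2-D data (illustrative only).
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.5], [0.0, 0.5]]) + 5.0

Xc = X - X.mean(axis=0)                 # center the data first
cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition of covariance
# ZCA whitening: rotate to the eigenbasis, rescale each axis to unit
# variance, then rotate back.
W_zca = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-8)) @ eigvecs.T
X_white = Xc @ W_zca

print(np.round(np.cov(X_white.T), 3))   # approximately the identity matrix
```

After whitening, the dimensions are uncorrelated with unit variance, so the cloud of points is roughly spherical rather than long and narrow.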
However, computing the eigenvalues of the covariance matrix is too expensive in both time and space, so in practice we generally go no further than z-score normalization: each dimension subtracts its own mean and then divides by its own standard deviation. This gives the data a similar width along every dimension, which expands the effective data distribution to some extent and thereby makes more random hyperplanes meaningful.
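The z-score step described above is a per-dimension operation. A minimal NumPy sketch, using fabricated "image-like" data for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# Fake image-like data: 100 samples, 4 features, values in [0, 255].
X = rng.uniform(0, 255, size=(100, 4))

mean = X.mean(axis=0)              # per-dimension mean
std = X.std(axis=0)                # per-dimension standard deviation
X_z = (X - mean) / (std + 1e-8)    # z-score: zero mean, unit variance per dim

print(np.round(X_z.mean(axis=0), 6))  # ≈ [0. 0. 0. 0.]
print(np.round(X_z.std(axis=0), 6))   # ≈ [1. 1. 1. 1.]
```

Unlike whitening, this costs only O(ND) and does not remove correlation between dimensions, which is exactly the trade-off the paragraph above describes.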
Batch Normalization
Motivation of Batch Normalization (BN)
Generally speaking, if the input features of a model are uncorrelated and follow the standard normal distribution $N(0, 1)$, the model tends to perform better. When training a neural network, we can decorrelate and standardize the features in advance so that they follow such a distribution. The first layer of the model then receives well-behaved inputs, but as the network deepens, its nonlinear transformations make the outputs of each layer correlated again, so they no longer follow $N(0, 1)$. Worse still, the feature distributions of these hidden layers may drift during training.
The authors of the paper argue that these problems make neural networks difficult to train. To solve them, they proposed inserting a Batch Normalization layer between layers. During training, the BN layer uses the mean $\mu_\mathcal{B}$ and variance $\sigma^2_\mathcal{B}$ of each hidden layer's outputs to standardize that layer's feature distribution, while maintaining the mean and variance over all mini-batches; at test time, it uses the unbiased estimates of the sample mean and variance.
Since in some cases a non-standardized distribution of layer features may actually be optimal, and standardizing every layer's outputs could weaken the network's expressive power, the authors add two learnable parameters to the BN layer: a scale parameter $\gamma$ and a shift parameter $\beta$, which allow the model to adjust each layer's feature distribution adaptively.
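The role of $\gamma$ and $\beta$ can be verified numerically: if the network learns $\gamma = \sqrt{\sigma^2_\mathcal{B} + \epsilon}$ and $\beta = \mu_\mathcal{B}$, the BN output exactly recovers the original activations, so standardization costs no expressive power. A sketch with fabricated hidden-layer outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
h = rng.normal(loc=3.0, scale=2.0, size=(64, 5))  # fake hidden-layer outputs

eps = 1e-5
mu = h.mean(axis=0)
var = h.var(axis=0)
h_hat = (h - mu) / np.sqrt(var + eps)   # standardized features

# With gamma = sqrt(var + eps) and beta = mu, the scale-and-shift
# gamma * h_hat + beta undoes the standardization entirely.
gamma = np.sqrt(var + eps)
beta = mu
out = gamma * h_hat + beta

print(np.allclose(out, h))  # True
```

In practice $\gamma$ and $\beta$ are learned by backpropagation, so the layer can land anywhere between full standardization and the identity mapping.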
1. Forward Algorithm
The implementation of the Batch Normalization layer is very simple, and the main process is given by the following algorithm:
First calculate the mini-batch mean $\mu_\mathcal{B}$
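The steps of the forward algorithm (mini-batch mean, mini-batch variance, normalization, then scale-and-shift) can be sketched as follows; the $\epsilon$ value and input shapes are illustrative:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BN forward pass over a mini-batch x of shape (N, D)."""
    mu = x.mean(axis=0)                    # step 1: mini-batch mean
    var = x.var(axis=0)                    # step 2: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize
    return gamma * x_hat + beta            # step 4: scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=5.0, size=(32, 3))
out = batchnorm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(np.round(out.mean(axis=0), 6))  # ≈ [0. 0. 0.]
```

With $\gamma = 1$ and $\beta = 0$, each output dimension has zero mean and unit variance over the mini-batch, as the algorithm intends.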