An article on batch normalization, detailing the specific advantages of using BN
1. The relationship between weight initialization and preprocessing methods in neural networks
Preprocessing the data, such as whitening, z-score normalization, or even simply subtracting the mean, can speed up convergence, as the simple example in the following figure illustrates:
The red dots in the figure represent 2-dimensional data points. Since each dimension of image data is generally a number between 0 and 255, the data points fall only in the first quadrant. Image data is also strongly correlated: if one pixel's gray value is 30, which is relatively dark, the value of a neighboring pixel will generally not exceed 100, or it would look like noise. Because of this strong correlation, the data points fall only in a small region of the first quadrant, forming the long, narrow distribution shown in the figure above.
When a neural network model is initialized, the weights W are randomly sampled. A common neuron is expressed as $ReLU(Wx+b) = \max(Wx+b, 0)$: the hyperplane $Wx+b=0$ divides the input space into two sides, and ReLU shrinks one side to zero while leaving the other unchanged.
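The neuron above can be checked numerically. A minimal NumPy sketch, where the values of `W`, `b`, and `x` are made up purely for illustration:

```python
import numpy as np

# Illustrative neuron: ReLU(Wx + b) = max(Wx + b, 0).
# W, b, and x are made-up values for demonstration only.
W = np.array([[1.0, -2.0]])   # 1 output unit, 2 input dimensions
b = np.array([0.5])
x = np.array([3.0, 2.0])

pre_activation = W @ x + b               # Wx + b = 3.0 - 4.0 + 0.5 = -0.5
out = np.maximum(pre_activation, 0.0)    # ReLU clips the negative side to 0
print(out)  # [0.]
```

Because `Wx + b` is negative here, the point lies on the side of the hyperplane that ReLU collapses to zero.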
Random hyperplanes $Wx+b=0$ appear as the random dotted lines in the figure above. Note that the two green dotted lines are effectively meaningless: with gradient descent, it may take many iterations before such lines segment the data points effectively, the way the purple dotted line does, which inevitably slows down convergence. Worse, this is only a two-dimensional illustration in which the data occupies one of four quadrants; imagine hundreds, thousands, or tens of thousands of dimensions. Moreover, the data occupies only a small region of the first quadrant, so without preprocessing a great deal of computation is wasted, and the many hyperplanes falling outside the data distribution make it easy to get stuck in a local optimum during training, leading to overfitting.
If we subtract the mean from the data, the data points are no longer confined to the first quadrant, and the probability that a random hyperplane falls inside the data distribution increases by roughly a factor of $2^n$. If we further apply a decorrelation algorithm such as PCA or ZCA whitening, the data is no longer long and narrow, and the probability that a random hyperplane is effective increases greatly.
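ZCA whitening can be sketched in a few lines of NumPy. The synthetic correlated data below is purely illustrative; the whitening matrix is the inverse square root of the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "long and narrow" correlated 2-D data (illustrative only).
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.5], [0.0, 0.5]]) + 5.0

Xc = X - X.mean(axis=0)                 # center the data first
cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition of covariance
# ZCA whitening: rotate to the eigenbasis, rescale each axis to unit
# variance, then rotate back.
W_zca = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-8)) @ eigvecs.T
X_white = Xc @ W_zca

print(np.round(np.cov(X_white.T), 3))   # approximately the identity matrix
```

After whitening, the dimensions are uncorrelated with unit variance, so the cloud of points is roughly spherical rather than long and narrow.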
However, computing the eigenvalues of the covariance matrix is too expensive in both time and space, so in practice we generally go no further than z-score normalization: each dimension subtracts its own mean and then divides by its own standard deviation. This gives the data a similar width along every dimension, which expands the effective data distribution to some extent and thereby makes more random hyperplanes meaningful.
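The z-score step described above is a per-dimension operation. A minimal NumPy sketch, using fabricated "image-like" data for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# Fake image-like data: 100 samples, 4 features, values in [0, 255].
X = rng.uniform(0, 255, size=(100, 4))

mean = X.mean(axis=0)              # per-dimension mean
std = X.std(axis=0)                # per-dimension standard deviation
X_z = (X - mean) / (std + 1e-8)    # z-score: zero mean, unit variance per dim

print(np.round(X_z.mean(axis=0), 6))  # ≈ [0. 0. 0. 0.]
print(np.round(X_z.std(axis=0), 6))   # ≈ [1. 1. 1. 1.]
```

Unlike whitening, this costs only O(ND) and does not remove correlation between dimensions, which is exactly the trade-off the paragraph above describes.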
Batch Normalization
Motivation of Batch Normalization (BN)
Generally speaking, if the input features of a model are uncorrelated and follow the standard normal distribution $N(0, 1)$, the model tends to perform better. When training a neural network, we can decorrelate and standardize the features in advance so that they follow such a distribution. The first layer of the model then receives well-behaved inputs, but as the network deepens, its nonlinear transformations make the outputs of each layer correlated again, so they no longer follow $N(0, 1)$. Worse still, the feature distributions of these hidden layers may drift during training.
The authors of the paper argue that these problems make neural networks difficult to train. To solve them, they proposed inserting a Batch Normalization layer between layers. During training, the BN layer uses the mean $\mu_\mathcal{B}$ and variance $\sigma^2_\mathcal{B}$ of each hidden layer's outputs to standardize that layer's feature distribution, while maintaining the mean and variance over all mini-batches; at test time, it uses the unbiased estimates of the sample mean and variance.
Since in some cases a non-standardized distribution of layer features may actually be optimal, and standardizing every layer's outputs could weaken the network's expressive power, the authors add two learnable parameters to the BN layer: a scale parameter $\gamma$ and a shift parameter $\beta$, which allow the model to adjust each layer's feature distribution adaptively.
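The role of $\gamma$ and $\beta$ can be verified numerically: if the network learns $\gamma = \sqrt{\sigma^2_\mathcal{B} + \epsilon}$ and $\beta = \mu_\mathcal{B}$, the BN output exactly recovers the original activations, so standardization costs no expressive power. A sketch with fabricated hidden-layer outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
h = rng.normal(loc=3.0, scale=2.0, size=(64, 5))  # fake hidden-layer outputs

eps = 1e-5
mu = h.mean(axis=0)
var = h.var(axis=0)
h_hat = (h - mu) / np.sqrt(var + eps)   # standardized features

# With gamma = sqrt(var + eps) and beta = mu, the scale-and-shift
# gamma * h_hat + beta undoes the standardization entirely.
gamma = np.sqrt(var + eps)
beta = mu
out = gamma * h_hat + beta

print(np.allclose(out, h))  # True
```

In practice $\gamma$ and $\beta$ are learned by backpropagation, so the layer can land anywhere between full standardization and the identity mapping.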
1. Forward Algorithm
The implementation of the Batch Normalization layer is very simple, and the main process is given by the following algorithm:
First calculate the mini-batch mean $\mu_\mathcal{B}$
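The steps of the forward algorithm (mini-batch mean, mini-batch variance, normalization, then scale-and-shift) can be sketched as follows; the $\epsilon$ value and input shapes are illustrative:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BN forward pass over a mini-batch x of shape (N, D)."""
    mu = x.mean(axis=0)                    # step 1: mini-batch mean
    var = x.var(axis=0)                    # step 2: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize
    return gamma * x_hat + beta            # step 4: scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=5.0, size=(32, 3))
out = batchnorm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(np.round(out.mean(axis=0), 6))  # ≈ [0. 0. 0.]
```

With $\gamma = 1$ and $\beta = 0$, each output dimension has zero mean and unit variance over the mini-batch, as the algorithm intends.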