Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

最新推荐文章于 2020-01-02 00:26:28 发布

fourye007

最新推荐文章于 2020-01-02 00:26:28 发布

阅读量289

点赞数

分类专栏： deep-learning

本文链接：https://blog.csdn.net/yzf0011/article/details/78382861

版权

20 篇文章 0 订阅

订阅专栏

the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change.
This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities.
The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution.

Batch Normalization allows us to use much higher learning rates and be less careful about initialization.
It also acts as a regularizer, in some cases eliminating the need for Dropout.
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Using an ensemble of batchnormalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

minimize the loss
gradient of the loss function
covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer

gradient descent step
normalize the inputs
normalizing the inputs of a sigmoid would constrain them to
the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform.

advantages of mini-batch
- First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases.
- computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by the modern computing platforms.
While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters.
the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated.

关注