Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Key Problems

  • The distribution of each layer’s inputs changes during training, as the parameters of the previous layers change.
  • This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities.
  • The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution.

Contributions

  • Batch Normalization allows us to use much higher learning rates and be less careful about initialization.
  • It also acts as a regularizer, in some cases eliminating the need for Dropout.
  • Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
  • Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

Methods

  • The network parameters \Theta are learned by minimizing the loss over the training set of N examples:
    \Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta)
  • With stochastic gradient descent, each training step uses a mini-batch x_{1..m} of size m to approximate the gradient of the loss function with respect to the parameters:
    \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta}
  • The notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. For a network computing \ell = F_2(F_1(u, \Theta_1), \Theta_2), learning \Theta_2 can be viewed as feeding the inputs x = F_1(u, \Theta_1) into the sub-network \ell = F_2(x, \Theta_2), whose gradient descent step (batch size m, learning rate \alpha)
    \Theta_2 \leftarrow \Theta_2 - \frac{\alpha}{m} \sum_{i=1}^{m} \frac{\partial F_2(x_i, \Theta_2)}{\partial \Theta_2}
    is exactly equivalent to that for a stand-alone network F_2 with input x (a mini-batch SGD sketch follows after this list).

  • Normalize each scalar input feature independently, making it have zero mean and unit variance:
    \hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}
  • Simply normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform: for each activation x^{(k)}, a pair of learned parameters \gamma^{(k)}, \beta^{(k)} scale and shift the normalized value,
    y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)},
    so that setting \gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]} and \beta^{(k)} = \mathrm{E}[x^{(k)}] recovers the original activations (see the batch-norm sketch after this list).
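To make the mini-batch gradient step above concrete, here is a minimal NumPy sketch; the toy linear-regression loss, the data, the learning rate, the batch size, and all variable names are illustrative assumptions, not anything prescribed by the paper.

```python
import numpy as np

# Toy setting (assumed for illustration only): a linear model with squared loss,
# ell(x_i, Theta) = 0.5 * (Theta . x_i - y_i)^2, standing in for the network loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # training set, N = 1000 examples
true_theta = rng.normal(size=5)
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)                            # parameters Theta to be learned
alpha, m = 0.1, 32                             # learning rate and mini-batch size

for step in range(500):
    idx = rng.choice(len(X), size=m, replace=False)   # draw a mini-batch x_{1..m}
    xb, yb = X[idx], y[idx]
    # mini-batch estimate of the gradient: (1/m) * sum_i d ell(x_i, Theta) / d Theta
    grad = xb.T @ (xb @ theta - yb) / m
    # gradient descent step: Theta <- Theta - alpha * grad
    theta -= alpha * grad

print(np.round(theta, 2), np.round(true_theta, 2))    # the two should be close
```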
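And a minimal sketch of the per-feature normalization plus the learned scale-and-shift described above, assuming a NumPy implementation; the function name `batch_norm_forward`, the epsilon term, and the toy data are illustrative, and the population statistics the paper uses at inference time are omitted for brevity.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch x of shape (m, d), independently per feature k.

    x_hat^(k) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)] + eps)
    y^(k)     = gamma^(k) * x_hat^(k) + beta^(k)
    """
    mean = x.mean(axis=0)                    # E[x^(k)] over the mini-batch
    var = x.var(axis=0)                      # Var[x^(k)] over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta              # learned scale and shift

# Toy usage: 32 examples, 4 features; gamma = 1, beta = 0 at initialization.
x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(32, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```

With gamma = 1 and beta = 0 the layer starts as a pure normalization; during training they are updated by backpropagation like any other parameters, which is what allows the layer to recover the identity transform if that turns out to be optimal.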

Others

  • Advantages of mini-batches
    • First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases.
    • Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by modern computing platforms.
  • While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters.
  • The network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated (a small whitening sketch follows below).
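To illustrate what “whitened” means here, a small PCA-whitening sketch in NumPy; the function name, the epsilon term, and the toy data are assumptions made for the example. (The paper notes that fully whitening each layer’s inputs is costly, which is part of the motivation for the simpler per-feature normalization above.)

```python
import numpy as np

def whiten(X, eps=1e-5):
    """PCA-whiten X of shape (n, d): zero mean, unit variance, decorrelated."""
    Xc = X - X.mean(axis=0)                      # zero mean per feature
    cov = Xc.T @ Xc / len(Xc)                    # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigendecomposition (symmetric)
    W = eigvecs / np.sqrt(eigvals + eps)         # rescale each principal direction
    return Xc @ W                                # decorrelated, unit-variance features

# Correlated toy data: the covariance of the whitened data is close to the identity.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3)) @ np.array([[2.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.3],
                                           [0.0, 0.0, 0.5]])
Xw = whiten(X)
print(np.round(Xw.T @ Xw / len(Xw), 2))          # ~ identity matrix
```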
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值