Batch Normalization: Paper Reading Notes

Original paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Summary

BN reduces internal covariate shift, i.e. the change in the distribution of the data flowing out of each layer during training: even small parameter updates in one layer change the distribution of its outputs, and these changes compound with depth, forcing us to be very careful with the learning rate and with initialization, especially for deep networks. By reducing internal covariate shift, BN makes the network easier to initialize, allows much larger learning rates, speeds up convergence, mitigates the vanishing gradients caused by saturating nonlinearities, and acts as a regularizer (possibly removing the need for Dropout).

What problem does the paper solve

Some definitions

covariate shift means the input distribution to a learning system changes

internal covariate shift means the distribution of each layer's inputs changes during training, which slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities.

Problem

small changes to the network parameters amplify as the network becomes deeper

The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution

What method is used

Batch Normalization: take a step towards reducing internal covariate shift, and in doing so accelerate the training of deep neural nets, by normalizing layer inputs.

Batch Normalizing Transform (applied to each activation $x$ over a mini-batch $B=\{x_1,\dots,x_m\}$):

$$\mu_B = \dfrac{1}{m}\sum_{i=1}^{m}x_i,\qquad \sigma_B^2 = \dfrac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2$$

$$\hat{x}_i = \dfrac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}},\qquad y_i = \gamma\hat{x}_i+\beta \equiv BN_{\gamma,\beta}(x_i)$$

The transform operates independently on each feature of $x$.

Any layer that previously received $x$ as input now receives $BN(x)$.
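A minimal NumPy sketch of the training-time transform above; the function name `batch_norm_train` and the default `eps` are my own choices, not from the paper:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform for a mini-batch x of shape (m, d).

    Each of the d features is normalized independently with the mini-batch
    mean and variance, then scaled by gamma and shifted by beta.
    """
    mu = x.mean(axis=0)                      # mini-batch mean, shape (d,)
    var = x.var(axis=0)                      # mini-batch variance, shape (d,)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize each feature
    y = gamma * x_hat + beta                 # scale and shift (learned parameters)
    return y, mu, var
```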

At inference time, the mini-batch statistics are replaced by population statistics, so the output depends only on the input:

$$y = \dfrac{\gamma}{\sqrt{\mathrm{Var}[x]+\epsilon}}\cdot x + \left(\beta-\dfrac{\gamma\,\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}\right)$$
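A sketch of the inference-time version, assuming the population statistics `pop_mean` and `pop_var` have already been estimated (e.g. as moving averages of the mini-batch statistics seen during training); the names are mine:

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """At inference time BN is a fixed affine transform built from the
    population statistics; it no longer depends on the current batch."""
    scale = gamma / np.sqrt(pop_var + eps)
    shift = beta - scale * pop_mean
    return scale * x + shift
```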

How well does it work

  • 【Higher learning rate】allows us to use much higher learning rates

  • 【Less careful initialization】makes us less careful about initialization

  • 【Saturating nonlinearities usable again】makes it possible to use saturating nonlinearities by preventing the network from getting stuck in saturated modes

  • 【Faster training】applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin

  • 【Scale invariance】has a beneficial effect on the gradient flow through the network by reducing the dependence of gradients on the scale of the parameters or of their initial values, which allows us to use much higher learning rates without the risk of divergence. Back-propagation through a layer is unaffected by the scale of its parameters (see the numerical check after this list):
    $$BN(Wu)=BN((aW)u)$$
    and
    $$\dfrac{\partial\,BN((aW)u)}{\partial u}=\dfrac{\partial\,BN(Wu)}{\partial u}$$

    $$\dfrac{\partial\,BN((aW)u)}{\partial (aW)}=\dfrac{1}{a}\cdot\dfrac{\partial\,BN(Wu)}{\partial W}$$
    The scale does not affect the layer Jacobian nor, consequently, the gradient propagation. Moreover, larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth.

  • 【Regularization】acts as a regularizer, in some cases eliminating the need for Dropout (or allowing its strength to be reduced)

  • 【In practice】using an ensemble of batch-normalized networks, the paper improves upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
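A quick numerical check of the scale-invariance claim above (with $\gamma=1$, $\beta=0$ for simplicity; the helper `bn` and the test values are mine):

```python
import numpy as np

def bn(x, eps=1e-5):
    # Normalize each column with its mini-batch mean and variance (gamma=1, beta=0).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 10))   # mini-batch of layer inputs
W = rng.normal(size=(10, 5))    # layer weights
a = 7.3                         # arbitrary scale factor

# BN(Wu) and BN((aW)u) agree up to the epsilon in the denominator.
print(np.allclose(bn(u @ W), bn(u @ (a * W)), atol=1e-5))  # True
```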

What are the shortcomings

The paper does not seem to discuss any of its own shortcomings.

Other notes

1. Benefits of mini-batches

  • Compared with a single example, the loss over a mini-batch is a better estimate of the loss over the whole training set; the larger the batch size, the more accurate the estimate (see the toy check after this list)
  • Training on a mini-batch is faster than training on individual examples, because the computation can be parallelized
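A toy check of the first point, assuming a made-up per-example loss distribution: the spread of the mini-batch estimate of the mean loss shrinks as the batch size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
losses = rng.exponential(scale=1.0, size=100_000)  # made-up per-example losses

for batch_size in (1, 32, 256):
    # Standard deviation of the mini-batch mean loss around the dataset mean loss.
    estimates = [rng.choice(losses, size=batch_size).mean() for _ in range(1000)]
    print(batch_size, round(float(np.std(estimates)), 4))
```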

2. In theory, BN does not make the network perform worse

When
$$\gamma = \sqrt{\mathrm{Var}(x)},\qquad \beta = \mathrm{E}(x)$$
BN becomes an identity transform, so (in theory) the network is at least no worse than before.
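A quick check of this identity claim (including the $\epsilon$ term so the recovery is exact); the test values are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(128, 16))
eps = 1e-5

# With gamma = sqrt(var + eps) and beta = mean, BN reduces to the identity.
mu, var = x.mean(axis=0), x.var(axis=0)
gamma, beta = np.sqrt(var + eps), mu
y = gamma * (x - mu) / np.sqrt(var + eps) + beta

print(np.allclose(y, x))  # True
```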

3. BN in convolutional networks
In convolutional networks, to stay consistent with the convolutional structure, one pair of $\gamma$ and $\beta$ is learned per feature map rather than per activation as before; if the feature map has size $p\times q$, the effective mini-batch size becomes $m'=m\,p\,q$.
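A sketch of the convolutional variant, assuming NCHW tensors of shape $(m, c, p, q)$; the function name is mine:

```python
import numpy as np

def batch_norm_conv_train(x, gamma, beta, eps=1e-5):
    """BN for conv feature maps x of shape (m, c, p, q): one gamma/beta per
    channel, with statistics taken over all m*p*q values of that channel."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, c, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```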

4. Where to place the BN layer
For a layer of the form
$$x=Wu+b,\qquad z=g(x)$$
the paper asks whether BN should be applied to $u$ or to $x$.
Since $u$ is usually the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining only its first and second moments would not eliminate the covariate shift. In contrast, $Wu+b$ is more likely to have a symmetric, non-sparse, "more Gaussian" distribution (the paper cites a reference here), so normalizing it is more likely to produce activations with a stable distribution.
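A minimal sketch of the ordering argued for above: normalize the pre-activation $x = Wu$ (the bias $b$ can be dropped, since $\beta$ plays the same role), then apply the nonlinearity. I use ReLU purely for illustration, and the helper names are mine:

```python
import numpy as np

def bn(x, gamma, beta, eps=1e-5):
    mu, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def layer_forward(u, W, gamma, beta):
    x = u @ W                                   # pre-activation; no bias needed
    return np.maximum(bn(x, gamma, beta), 0.0)  # z = g(BN(x)), here g = ReLU
```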

Future Work

  • Apply BN to recurrent neural networks
  • Investigate whether BN can help domain adaptation, i.e. whether the normalization performed by the network would allow it to more easily generalize to new data distributions, perhaps with just a recomputation of the population means and variances

Questions

1. Why isn't it zero?
The paper gives
$$\dfrac{\partial l}{\partial \mu_B}=\left(\sum_{i=1}^{m}\dfrac{\partial l}{\partial \hat{x}_i}\cdot \dfrac{-1}{\sqrt{\sigma_B^2+\epsilon}}\right)+\dfrac{\partial l}{\partial \sigma_B^2}\cdot\dfrac{\sum_{i=1}^m -2(x_i-\mu_B)}{m}$$
but since
$$\mu_B = \dfrac{1}{m}\sum_{i=1}^m x_i$$
we have
$$\sum^m_{i=1}(x_i-\mu_B)=\sum_{i=1}^{m}x_i - m\mu_B=m\mu_B-m\mu_B=0$$
so the second term is zero, and a direct numerical check confirms this. Why, then, is it written into the formula at all?
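A numerical check of this observation with made-up values: the centered mini-batch values sum to (numerically) zero, so the factor multiplying $\partial l/\partial\sigma_B^2$ vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32,))
mu = x.mean()

print(np.isclose((x - mu).sum(), 0.0))     # True: centered values sum to ~0
print((-2.0 * (x - mu)).sum() / x.size)    # ~0: the factor in the second term
```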
2. Some vocabulary

  • inference: probably refers to the network at test time
  • population: probably refers to statistics computed over the whole training set