Backpropagation through Batch Normalization

Additional material:

The Group Normalization assignment appears to be a new addition to CS231n in 2018.

https://www.jianshu.com/p/aaeb9cd4d70c

https://github.com/cnscott/JupyterNotebook/blob/master/CS231n/assignment2/ConvolutionalNetworks.ipynb

import numpy as np


def spatial_groupnorm_forward(x, gamma, beta, G, gn_param):

    out, cache = None, None
    eps = gn_param.get('eps', 1e-5)
    ###########################################################################
    # TODO: Implement the forward pass for spatial group normalization.       #
    # This will be extremely similar to the layer norm implementation.        #
    # In particular, think about how you could transform the matrix so that   #
    # the bulk of the code is similar to both train-time batch normalization  #
    # and layer normalization!                                                #
    ###########################################################################
    N, C, H, W = x.shape
    x_group = np.reshape(x, (N, G, C // G, H, W))           # split the C channels into G groups
    mean = np.mean(x_group, axis=(2, 3, 4), keepdims=True)  # per-group mean
    var = np.var(x_group, axis=(2, 3, 4), keepdims=True)    # per-group variance
    x_groupnorm = (x_group - mean) / np.sqrt(var + eps)     # normalize within each group
    x_norm = np.reshape(x_groupnorm, (N, C, H, W))          # restore the original (N, C, H, W) shape
    out = x_norm * gamma + beta                             # per-channel scale and shift
    cache = (G, x, x_norm, mean, var, beta, gamma, eps)
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return out, cache
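
A quick sanity check of the forward pass. This is a minimal sketch, assuming the function above and plain NumPy; the (1, C, 1, 1) shapes chosen for gamma and beta are an assumption. With gamma = 1 and beta = 0, every (sample, group) block of the output should have mean roughly 0 and variance roughly 1.

# Sanity-check sketch (assumes spatial_groupnorm_forward as defined above).
import numpy as np

np.random.seed(0)
N, C, H, W, G = 2, 6, 4, 5, 2
x = 4.0 * np.random.randn(N, C, H, W) + 10.0
gamma = np.ones((1, C, 1, 1))   # assumed per-channel scale shape
beta = np.zeros((1, C, 1, 1))   # assumed per-channel shift shape

out, _ = spatial_groupnorm_forward(x, gamma, beta, G, {'eps': 1e-5})

# With gamma=1, beta=0 each (sample, group) block should be ~zero-mean, unit-variance.
out_group = out.reshape(N, G, C // G, H, W)
print(out_group.mean(axis=(2, 3, 4)))  # values close to 0
print(out_group.var(axis=(2, 3, 4)))   # values close to 1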


def spatial_groupnorm_backward(dout, cache):

    dx, dgamma, dbeta = None, None, None

    ###########################################################################
    # TODO: Implement the backward pass for spatial group normalization.      #
    # This will be extremely similar to the layer norm implementation.        #
    ###########################################################################
    N, C, H, W = dout.shape
    G, x, x_norm, mean, var, beta, gamma, eps = cache
    # dbeta, dgamma
    dbeta = np.sum(dout, axis=(0, 2, 3), keepdims=True)
    dgamma = np.sum(dout * x_norm, axis=(0, 2, 3), keepdims=True)

    # compute dx_group with shape (N, G, C // G, H, W)
    # dx_groupnorm
    dx_norm = dout * gamma
    dx_groupnorm = dx_norm.reshape((N, G, C // G, H, W))
    # dvar
    x_group = x.reshape((N, G, C // G, H, W))
    dvar = np.sum(dx_groupnorm * -1.0 / 2 * (x_group - mean) / (var + eps) ** (3.0 / 2), axis=(2, 3, 4), keepdims=True)
    # dmean
    N_GROUP = C // G * H * W  # number of elements per group
    dmean1 = np.sum(dx_groupnorm * -1.0 / np.sqrt(var + eps), axis=(2, 3, 4), keepdims=True)
    dmean2_var = dvar * -2.0 / N_GROUP * np.sum(x_group - mean, axis=(2, 3, 4), keepdims=True)
    dmean = dmean1 + dmean2_var
    # dx_group
    dx_group1 = dx_groupnorm * 1.0 / np.sqrt(var + eps)
    dx_group2_mean = dmean * 1.0 / N_GROUP
    dx_group3_var = dvar * 2.0 / N_GROUP * (x_group - mean)
    dx_group = dx_group1 + dx_group2_mean + dx_group3_var

    # merge the groups back into C channels to get dx
    dx = dx_group.reshape((N, C, H, W))
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dgamma, dbeta
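
To verify the backward pass, a centered-difference gradient check can be run against the forward pass. The sketch below assumes the two functions above; num_grad is a small helper written here for illustration, not part of the assignment code.

# Gradient-check sketch (assumes spatial_groupnorm_forward/backward as defined above).
import numpy as np

def num_grad(f, x, dout, h=1e-5):
    # Centered-difference approximation of d(sum(f(x) * dout)) / dx.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        orig = x[idx]
        x[idx] = orig + h
        pos = f(x)
        x[idx] = orig - h
        neg = f(x)
        x[idx] = orig
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
        it.iternext()
    return grad

np.random.seed(0)
N, C, H, W, G = 2, 4, 3, 3, 2
x = np.random.randn(N, C, H, W)
gamma = np.random.randn(1, C, 1, 1)
beta = np.random.randn(1, C, 1, 1)
dout = np.random.randn(N, C, H, W)
gn_param = {'eps': 1e-5}

out, cache = spatial_groupnorm_forward(x, gamma, beta, G, gn_param)
dx, dgamma, dbeta = spatial_groupnorm_backward(dout, cache)

dx_num = num_grad(lambda x: spatial_groupnorm_forward(x, gamma, beta, G, gn_param)[0], x, dout)
print(np.max(np.abs(dx - dx_num)))  # should be tiny if the backward pass is correct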

https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html

Related CS231n notes:

http://cs231n.github.io/optimization-2/#sigmoid

 

http://cthorey.github.io./backpropagation/

https://blog.csdn.net/kevin_hee/article/details/80783698


Main content:

def batchnorm_backward(dout, cache):

    # unfold the variables stored in cache
    xhat, gamma, xmu, ivar, sqrtvar, var, eps = cache

    # get the dimensions of the input/output
    N, D = dout.shape

    # step9: out = gammax + beta
    dbeta = np.sum(dout, axis=0)
    dgammax = dout  # not necessary, but more understandable

    # step8: gammax = gamma * xhat
    dgamma = np.sum(dgammax * xhat, axis=0)
    dxhat = dgammax * gamma

    # step7: xhat = xmu * ivar
    divar = np.sum(dxhat * xmu, axis=0)
    dxmu1 = dxhat * ivar

    # step6: ivar = 1 / sqrtvar
    dsqrtvar = -1. / (sqrtvar**2) * divar

    # step5: sqrtvar = sqrt(var + eps)
    dvar = 0.5 * 1. / np.sqrt(var + eps) * dsqrtvar

    # step4: var = mean(sq, axis=0)
    dsq = 1. / N * np.ones((N, D)) * dvar

    # step3: sq = xmu**2
    dxmu2 = 2 * xmu * dsq

    # step2: xmu = x - mu (one branch flows to x, the other to mu)
    dx1 = (dxmu1 + dxmu2)
    dmu = -1 * np.sum(dxmu1 + dxmu2, axis=0)

    # step1: mu = mean(x, axis=0)
    dx2 = 1. / N * np.ones((N, D)) * dmu

    # step0: sum the two gradient paths into x
    dx = dx1 + dx2

    return dx, dgamma, dbeta
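
For reference, here is a forward pass staged so that it produces exactly the cache tuple (xhat, gamma, xmu, ivar, sqrtvar, var, eps) that batchnorm_backward unpacks. This is a sketch following the step-by-step computational graph in the kratzert post linked above; the function name is mine, not necessarily the assignment's implementation.

def batchnorm_forward_staged(x, gamma, beta, eps=1e-5):
    # Train-time forward pass, keeping every intermediate that the
    # backward pass above needs. A sketch following the staged graph.
    N, D = x.shape

    mu = 1. / N * np.sum(x, axis=0)      # step1: batch mean
    xmu = x - mu                         # step2: center the data
    sq = xmu ** 2                        # step3: element-wise square
    var = 1. / N * np.sum(sq, axis=0)    # step4: batch variance
    sqrtvar = np.sqrt(var + eps)         # step5: standard deviation
    ivar = 1. / sqrtvar                  # step6: inverse standard deviation
    xhat = xmu * ivar                    # step7: normalize
    gammax = gamma * xhat                # step8: scale
    out = gammax + beta                  # step9: shift

    cache = (xhat, gamma, xmu, ivar, sqrtvar, var, eps)
    return out, cache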

 

 

The backward pass of Batch Normalization can be derived with the chain rule. Here is a short derivation.

Let the input be $x$, the normalized value be $\hat{y}$, the scaled and shifted output of the BN layer be $y$, the BN parameters be $\gamma$ and $\beta$, and the loss be $L$. Then

$$\hat{y}=\frac{x-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}$$

$$y=\gamma\hat{y}+\beta$$

where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance, and $\epsilon$ is a small constant that keeps the denominator away from zero.

The backward pass has two parts: the gradients with respect to $\gamma$ and $\beta$, and the gradient with respect to the input $x$.

First, the gradients with respect to $\gamma$ and $\beta$ can be computed directly:

$$\frac{\partial L}{\partial \gamma}=\sum_{i=1}^n\frac{\partial L}{\partial y_i}\hat{y_i}$$

$$\frac{\partial L}{\partial \beta}=\sum_{i=1}^n\frac{\partial L}{\partial y_i}$$

Next, we need the gradient with respect to the input $x$. We first compute $\frac{\partial L}{\partial \hat{y}}$, then apply the chain rule through the batch statistics to obtain $\frac{\partial L}{\partial x}$:

$$\frac{\partial L}{\partial \hat{y_i}}=\frac{\partial L}{\partial y_i}\gamma$$

$$\frac{\partial L}{\partial \sigma_B^2}=\sum_{i=1}^n\frac{\partial L}{\partial \hat{y_i}}(x_i-\mu_B)\left(-\frac{1}{2}\right)(\sigma_B^2+\epsilon)^{-\frac{3}{2}}$$

$$\frac{\partial L}{\partial \mu_B}=-\frac{1}{\sqrt{\sigma_B^2+\epsilon}}\sum_{i=1}^n\frac{\partial L}{\partial \hat{y_i}}+\frac{\partial L}{\partial \sigma_B^2}\cdot\frac{1}{n}\sum_{i=1}^n-2(x_i-\mu_B)$$

$$\frac{\partial L}{\partial x_i}=\frac{\partial L}{\partial \hat{y_i}}\frac{1}{\sqrt{\sigma_B^2+\epsilon}}+\frac{\partial L}{\partial \sigma_B^2}\frac{2(x_i-\mu_B)}{n}+\frac{\partial L}{\partial \mu_B}\frac{1}{n}$$

Here $\frac{\partial L}{\partial \mu_B}$ and $\frac{\partial L}{\partial \sigma_B^2}$ are intermediate quantities computed during the same backward pass (the sums above). Finally, $\frac{\partial L}{\partial x}$ is passed back to the preceding layer so that the network's weight parameters can be updated.
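
The derivation above can also be written as a single vectorized function. The sketch below assumes the same cache layout as batchnorm_backward; the name batchnorm_backward_alt and the intermediate variable names are my own. The dx it produces should agree with the staged version above up to numerical precision.

def batchnorm_backward_alt(dout, cache):
    # Compact backward pass implementing the dL/dx formula derived above.
    xhat, gamma, xmu, ivar, sqrtvar, var, eps = cache
    N, D = dout.shape

    dbeta = np.sum(dout, axis=0)           # dL/dbeta
    dgamma = np.sum(dout * xhat, axis=0)   # dL/dgamma

    dxhat = dout * gamma                                                     # dL/dyhat
    dvar = np.sum(dxhat * xmu, axis=0) * -0.5 * ivar ** 3                    # dL/dvar
    dmu = -np.sum(dxhat * ivar, axis=0) + dvar * np.mean(-2. * xmu, axis=0)  # dL/dmu
    dx = dxhat * ivar + dvar * 2. * xmu / N + dmu / N                        # dL/dx

    return dx, dgamma, dbeta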