Batch Normalization and Dropout

Contents

Imports and data preprocessing

BatchNorm

forward

backward

Training with BatchNorm and plotting the results

Batch Normalization and initialization

Batch Normalization and batch size

Layer Normalization

Layer Normalization and batch size

Batch norm for convolutional layers: spatial batchnorm

Spatial Group Normalization

Dropout

Regularization Experiment

Solver and networks

Solver

Networks

Helper functions used by the networks (individual layers)

Loss functions


The authors of [1] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, they propose to insert into the network layers that normalize batches. At training time, such a layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.

It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.
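
In equations, the per-feature transform from [1] is the following; at training time $\mu_B$ and $\sigma_B^2$ are the minibatch mean and variance, and at test time the running averages are used in their place:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},
\qquad
y_i = \gamma\,\hat{x}_i + \beta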

Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML 2015.

Imports and data preprocessing

# Setup cell.
import time
import numpy as np
import matplotlib.pyplot as plt
from cs231n.data_utils import get_CIFAR10_data
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array

%matplotlib inline
plt.rcParams["figure.figsize"] = (10.0, 8.0)  # Set default size of plots.
plt.rcParams["image.interpolation"] = "nearest"
plt.rcParams["image.cmap"] = "gray"

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """Returns relative error."""
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def print_mean_std(x, axis=0):
    print(f"  means: {x.mean(axis=axis)}")
    print(f"  stds:  {x.std(axis=axis)}\n")
    
# Load the (preprocessed) CIFAR-10 data.
data = get_CIFAR10_data()
for k, v in list(data.items()):
    print(f"{k}: {v.shape}")

BatchNorm

forward

Batch normalization layers are usually inserted after fully connected or convolutional layers and before nonlinearities; the outputs from the last layer of the network should not be normalized.

def batchnorm_forward(x, gamma, beta, bn_param):
    """
    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param["mode"]
    eps = bn_param.get("eps", 1e-5)
    momentum = bn_param.get("momentum", 0.9)

    N, D = x.shape
    running_mean = bn_param.get("running_mean", np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get("running_var", np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == "train":
        sample_mean = np.mean(x, axis=0)
        sample_var = np.var(x, axis=0)
        x_var = sample_var + eps  # fold eps in once; the backward pass expects var to already include it
        x_norm = (x - sample_mean) / np.sqrt(x_var)
        out = x_norm * gamma + beta  # (N, D)
        cache = (x, x_norm, gamma, sample_mean, x_var)

        # Store the updated running averages back into bn_param (eps is added again at test time).
        bn_param["running_mean"] = momentum * running_mean + (1 - momentum) * sample_mean
        bn_param["running_var"] = momentum * running_var + (1 - momentum) * sample_var

    elif mode == "test":
        x_norm = (x - running_mean) / np.sqrt(running_var + eps) 
        out = gamma * x_norm + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    return out, cache
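
As a quick sanity check (a sketch that assumes the batchnorm_forward above plus the print_mean_std helper from the setup cell), the train-mode output should have per-feature means close to beta and standard deviations close to gamma:

# Check the training-time forward pass by feeding random features through a small
# random network and inspecting the statistics of the normalized output.
np.random.seed(231)
N, D1, D2, D3 = 200, 50, 60, 3
X = np.random.randn(N, D1)
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)
a = np.maximum(0, X.dot(W1)).dot(W2)  # features before the batch-norm layer

gamma = np.asarray([1.0, 2.0, 3.0])
beta = np.asarray([11.0, 12.0, 13.0])
a_norm, _ = batchnorm_forward(a, gamma, beta, {"mode": "train"})

print("After batch normalization (nontrivial gamma, beta):")
print_mean_std(a_norm, axis=0)  # means ~ [11, 12, 13], stds ~ [1, 2, 3]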

backward

References:

【cs231n】Batchnorm及其反向传播 (JoeYF_, CSDN blog)

Batch Normalization的反向传播解说 (macan_dct, CSDN blog)

Understanding the backward pass through Batch Normalization Layer

理解Batch Normalization(批量归一化) (坚硬果壳_, CSDN blog)

Working backward through the computational graph one node at a time (the names match the code below; in the code, var already has eps folded in):

Step 9: out = gammax + beta, so dbeta = sum(dout, axis=0) and dgammax = dout.

Step 8: gammax = gamma * x_norm, so dgamma = sum(dout * x_norm, axis=0) and dx_norm = dout * gamma.

Step 7: x_norm = x_mu * ivar, so dxmu1 = dx_norm * ivar and divar = sum(dx_norm * x_mu, axis=0).

Step 6: ivar = 1 / std, so dsqrtvar = -divar / std**2.

Step 5: std = sqrt(var + eps), so dvar = 0.5 * dsqrtvar / std.

Step 4: var = mean(square, axis=0), so dsquare = (1/N) * ones((N, D)) * dvar.

Step 3: square = x_mu**2, so dxmu2 = 2 * x_mu * dsquare.

Step 2: x_mu = x - mu, so dx1 = dxmu1 + dxmu2 and dmu = -sum(dx1, axis=0).

Step 1: mu = mean(x, axis=0), so dx2 = (1/N) * ones((N, D)) * dmu.

Step 0: dx = dx1 + dx2.
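
To make the numbered steps concrete, here is a minimal sketch of the training-time forward pass broken into the same intermediate nodes (for illustration only; batchnorm_forward_staged is not part of the assignment code, but its cache matches the layout used by the backward pass below):

def batchnorm_forward_staged(x, gamma, beta, eps=1e-5):
    """Train-time forward pass split into the numbered steps above (illustration only)."""
    N, D = x.shape
    mu = 1.0 / N * np.sum(x, axis=0)        # step 1: per-feature mean
    x_mu = x - mu                           # step 2: center the data
    square = x_mu ** 2                      # step 3: squared deviations
    var = 1.0 / N * np.sum(square, axis=0)  # step 4: per-feature variance
    std = np.sqrt(var + eps)                # step 5: standard deviation (sqrtvar)
    ivar = 1.0 / std                        # step 6: inverse standard deviation
    x_norm = x_mu * ivar                    # step 7: normalize
    gammax = gamma * x_norm                 # step 8: scale by gamma
    out = gammax + beta                     # step 9: shift by beta
    cache = (x, x_norm, gamma, mu, var + eps)  # same layout as batchnorm_forward's cache
    return out, cache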

Step-by-step version

def batchnorm_backward(dout, cache):
    """  
    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None

    x, x_norm, gamma, mean, var = cache
    N, D = dout.shape
    
    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(dout * x_norm, axis=0)  # elementwise (N, D) * (N, D), summed over N
    # Note that dgamma is not a matrix product. The FC layer's backward pass needs one,
    # but here the forward pass only broadcast-multiplied (N, D) by (D,), so the
    # backward pass mirrors that with elementwise products and sums.

    dx_norm = dout * gamma  # (N, D) * (D,); x_norm is the normalized x
    x_mu = x - mean
    std = np.sqrt(var)  # var already includes eps
    ivar = 1 / std  # (D,)
    # x_mu appears in both the numerator and the denominator of x_norm, so first take
    # the derivative with respect to x_mu, treating 1/std as a multiplicative factor.
    dxmu1 = dx_norm * ivar  # (N, D) * (D,)
    divar = np.sum(dx_norm * x_mu, axis=0)
    dsqrtvar = -1 / (std ** 2) * divar  # derivative of the reciprocal w.r.t. the sqrt term (which is std)
    dvar = 0.5 / std * dsqrtvar  # derivative of the sqrt term w.r.t. var (the non-eps part)
    dsquare = 1 / N * np.ones((N, D)) * dvar  # var was a column-wise mean over an (N, D) matrix
    dxmu2 = 2 * x_mu * dsquare  # derivative of the square w.r.t. x_mu
    dx1 = dxmu1 + dxmu2

    # Now differentiate x_mu with respect to x. The mean mu is itself a function of x,
    # so the gradient is a sum: first through mu, then mu with respect to x.
    dmu = -1 * np.sum(dx1, axis=0)  # (D,); backward pass of the broadcast subtraction
    dx2 = 1 / N * np.ones((N, D)) * dmu  # mu was a mean over N
    dx = dx1 + dx2
    return dx, dgamma, dbeta
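
The implementation can be verified with a numerical gradient check (a sketch that assumes the batchnorm_forward/batchnorm_backward above and the helpers imported in the setup cell):

# Numerically check the analytic gradients of the batch-norm backward pass.
np.random.seed(231)
N, D = 4, 5
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

bn_param = {"mode": "train"}
fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]
fg = lambda g: batchnorm_forward(x, g, beta, bn_param)[0]
fb = lambda b: batchnorm_forward(x, gamma, b, bn_param)[0]

dx_num = eval_numerical_gradient_array(fx, x, dout)
dgamma_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)
dbeta_num = eval_numerical_gradient_array(fb, beta.copy(), dout)

_, cache = batchnorm_forward(x, gamma, beta, bn_param)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)

# All relative errors should be tiny (around 1e-8 or smaller).
print("dx error:    ", rel_error(dx_num, dx))
print("dgamma error:", rel_error(dgamma_num, dgamma))
print("dbeta error: ", rel_error(dbeta_num, dbeta))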

Fast version
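
For reference, the gradient expressions the fast version follows are those from [1] (here $\hat{x}$ is the normalized input and $y = \gamma\hat{x} + \beta$):

\begin{aligned}
\frac{\partial L}{\partial \hat{x}_i} &= \frac{\partial L}{\partial y_i}\,\gamma \\
\frac{\partial L}{\partial \sigma_B^2} &= \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{x}_i}\,(x_i-\mu_B)\cdot\Bigl(-\tfrac{1}{2}\Bigr)\bigl(\sigma_B^2+\epsilon\bigr)^{-3/2} \\
\frac{\partial L}{\partial \mu_B} &= \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma_B^2+\epsilon}} + \frac{\partial L}{\partial \sigma_B^2}\cdot\frac{1}{N}\sum_{i=1}^{N} -2\,(x_i-\mu_B) \\
\frac{\partial L}{\partial x_i} &= \frac{\partial L}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma_B^2+\epsilon}} + \frac{\partial L}{\partial \sigma_B^2}\cdot\frac{2\,(x_i-\mu_B)}{N} + \frac{\partial L}{\partial \mu_B}\cdot\frac{1}{N} \\
\frac{\partial L}{\partial \gamma} &= \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}\,\hat{x}_i,
\qquad
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}
\end{aligned}

In the code below, dfdu, dfdv and dfdw correspond to the derivatives with respect to x_norm, var and mu, and var already has eps folded in.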

def batchnorm_backward_alt(dout, cache):
    """
    Alternative backward pass for batch normalization, using a simplified
    expression for the gradient derived on paper.

    Inputs and outputs are the same as for batchnorm_backward.
    """
    dx, dgamma, dbeta = None, None, None
    x, x_norm, gamma, mean, var = cache
    std = np.sqrt(var)

    dgamma = np.sum(dout * x_norm, axis=0)  # elementwise (N, D) * (N, D), summed over N
    dbeta = np.sum(dout, axis=0)

    N = 1.0 * x.shape[0]
    dfdu = dout * gamma
    # Split x_norm into its dependence on x, mu and var, differentiate each
    # path with respect to x, and add the three contributions.
    dfdv = np.sum(dfdu * (x - mean) * -0.5 * var ** -1.5, axis=0)  # dL/dvar
    dfdw = np.sum(dfdu * -1 / std, axis=0) + \
        dfdv * np.sum(-2 / N * (x - mean), axis=0)  # dL/dmu, including var's dependence on mu
    dx = dfdu / std + dfdv * 2 / N * (x - mean) + dfdw / N  # sum of the three paths into x

    '''
    # An equivalent, more compact way to compute dx (z denotes x_norm below);
    # I cannot quite follow the derivation:
    N = dout.shape[0]
    dfdz = dout * gamma                          # [N x D]
    dfdz_sum = np.sum(dfdz, axis=0)              # [D]
    dx = dfdz - dfdz_sum / N - x_norm * np.sum(dfdz * x_norm, axis=0) / N
    dx /= std
    '''
    return dx, dgamma, dbeta
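
To confirm that the two implementations agree (and to see the speedup from the simplified expression), a comparison along these lines can be run; the exact timing will vary by machine:

# Compare the step-by-step and fast backward passes: the gradients should match
# to floating-point precision, and the fast version is usually noticeably quicker.
np.random.seed(231)
N, D = 100, 500
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

bn_param = {"mode": "train"}
_, cache = batchnorm_forward(x, gamma, beta, bn_param)

t1 = time.time()
dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)
t2 = time.time()
dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)
t3 = time.time()

print("dx difference:    ", rel_error(dx1, dx2))
print("dgamma difference:", rel_error(dgamma1, dgamma2))
print("dbeta difference: ", rel_error(dbeta1, dbeta2))
print("speedup: %.2fx" % ((t2 - t1) / (t3 - t2)))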