Table of Contents
Batch Normalization and Batch Size
Layer Normalization and Batch Size
Batch Norm for Convolutional Layers: Spatial BatchNorm
The authors of [1] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, they propose to insert into the network layers that normalize batches. At training time, such a layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.
It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.
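As a concrete illustration of the mechanism described above, here is a minimal NumPy sketch (toy data and made-up hyperparameters, not the assignment's implementation) of training-time normalization, the running-average update, and test-time normalization:

import numpy as np

N, D = 4, 3
xb = 5.0 * np.random.randn(N, D) + 2.0       # toy minibatch
gamma, beta = np.ones(D), np.zeros(D)        # learnable scale and shift
running_mean, running_var = np.zeros(D), np.zeros(D)
momentum, eps = 0.9, 1e-5

# Training time: center and scale with minibatch statistics.
mu, var = xb.mean(axis=0), xb.var(axis=0)
out_train = gamma * (xb - mu) / np.sqrt(var + eps) + beta

# Keep exponentially decaying running averages for test time.
running_mean = momentum * running_mean + (1 - momentum) * mu
running_var = momentum * running_var + (1 - momentum) * var

# Test time: use the running averages instead of minibatch statistics.
out_test = gamma * (xb - running_mean) / np.sqrt(running_var + eps) + beta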
Imports and Data Preparation
# Setup cell.
import time
import numpy as np
import matplotlib.pyplot as plt
from cs231n.data_utils import get_CIFAR10_data
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
%matplotlib inline
plt.rcParams["figure.figsize"] = (10.0, 8.0) # Set default size of plots.
plt.rcParams["image.interpolation"] = "nearest"
plt.rcParams["image.cmap"] = "gray"
%load_ext autoreload
%autoreload 2
def rel_error(x, y):
    """Returns relative error."""
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def print_mean_std(x, axis=0):
    print(f"  means: {x.mean(axis=axis)}")
    print(f"  stds:  {x.std(axis=axis)}\n")

# Load the (preprocessed) CIFAR-10 data.
data = get_CIFAR10_data()
for k, v in list(data.items()):
    print(f"{k}: {v.shape}")
BatchNorm
forward
Batch normalization layers are usually inserted after fully connected or convolutional layers and before nonlinearities; the outputs from the last layer of the network should not be normalized.
def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param["mode"]
    eps = bn_param.get("eps", 1e-5)
    momentum = bn_param.get("momentum", 0.9)

    N, D = x.shape
    running_mean = bn_param.get("running_mean", np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get("running_var", np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == "train":
        x_mean = np.mean(x, axis=0)
        x_var = np.var(x, axis=0) + eps  # eps is folded into the variance; the backward pass relies on this
        x_norm = (x - x_mean) / np.sqrt(x_var)
        out = x_norm * gamma + beta  # (N, D)
        cache = (x, x_norm, gamma, x_mean, x_var)

        # Store the updated running means back into bn_param.
        bn_param["running_mean"] = momentum * running_mean + (1 - momentum) * x_mean
        bn_param["running_var"] = momentum * running_var + (1 - momentum) * x_var
    elif mode == "test":
        x_norm = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_norm + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    return out, cache
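A quick sanity check of the forward pass (a sketch with made-up toy data, in the spirit of the assignment notebook) is to feed features through batch normalization and verify that the per-feature means come out close to beta and the standard deviations close to gamma:

# Toy check: after batchnorm, means should be ~beta and stds ~gamma.
np.random.seed(231)
N, D1, D2 = 200, 50, 3
X = np.random.randn(N, D1)
W = np.random.randn(D1, D2)
a = np.maximum(0, X.dot(W))                  # features from a made-up ReLU layer

gamma = np.asarray([1.0, 2.0, 3.0])
beta = np.asarray([11.0, 12.0, 13.0])
a_norm, _ = batchnorm_forward(a, gamma, beta, {"mode": "train"})
print_mean_std(a_norm, axis=0)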
backward
For reference:
【cs231n】Batchnorm及其反向传播_JoeYF_的博客-CSDN博客_batchnorm反向传播
Batch Normalization的反向传播解说_macan_dct的博客-CSDN博客_batchnorm反向传播
Understanding the backward pass through Batch Normalization Layer
理解Batch Normalization(批量归一化)_坚硬果壳_的博客-CSDN博客
The gradient is obtained by walking backward through the computational graph of the forward pass, one node at a time (the approach of the "Understanding the backward pass through Batch Normalization Layer" post linked above). Using the variable names from the code below, the steps are:

Step 9: out = gamma * x_norm + beta, so dbeta = sum(dout, axis=0) and the gradient flowing into the product gamma * x_norm is dout itself.
Step 8: dgamma = sum(dout * x_norm, axis=0) and dx_norm = dout * gamma.
Step 7: x_norm = x_mu * ivar, so dxmu1 = dx_norm * ivar and divar = sum(dx_norm * x_mu, axis=0).
Step 6: ivar = 1 / std, so dsqrtvar = -divar / std**2.
Step 5: std = sqrt(var), so dvar = 0.5 / std * dsqrtvar.
Step 4: var is a mean of squared deviations over the batch, so dsquare = (1/N) * ones((N, D)) * dvar.
Step 3: square = x_mu**2, so dxmu2 = 2 * x_mu * dsquare.
Step 2: the two branches into x_mu combine: dx1 = dxmu1 + dxmu2; since x_mu = x - mean, dmu = -sum(dx1, axis=0).
Step 1: mean = (1/N) * sum(x, axis=0), so dx2 = (1/N) * ones((N, D)) * dmu.
Step 0: dx = dx1 + dx2.
Step-by-step version
def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization, written out as a computation graph.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    x, x_norm, gamma, mean, var = cache
    N, D = dout.shape

    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(dout * x_norm, axis=0)  # (N, D) * (N, D)
    # Note that dgamma is not computed with a matrix product as in the FC-layer
    # backward pass, because the forward pass multiplies elementwise with
    # broadcasting, (N, D) * (D,), rather than doing a true matrix multiply.

    dx_norm = dout * gamma  # (N, D) * (D,); x_norm is the normalized x
    x_mu = x - mean
    std = np.sqrt(var)  # var already includes eps (added in the forward pass)
    ivar = 1 / std  # (D,)

    # x_mu appears in both the numerator and the denominator of x_norm, so
    # first compute the gradient w.r.t. x_mu, treating 1/std as a separate factor.
    dxmu1 = dx_norm * ivar  # (N, D) * (D,)
    divar = np.sum(dx_norm * x_mu, axis=0)  # gradient w.r.t. ivar, shape (D,)
    dsqrtvar = -1 / (std ** 2) * divar  # derivative of the reciprocal w.r.t. the sqrt term (std)
    dvar = 0.5 / std * dsqrtvar  # derivative of the sqrt term w.r.t. var
    dsquare = 1 / N * np.ones((N, D)) * dvar  # var was a column-wise mean over an (N, D) matrix
    dxmu2 = 2 * x_mu * dsquare  # derivative of the squared term w.r.t. x_mu
    dx1 = dxmu1 + dxmu2

    # Now backprop x_mu into x. The mean mu is itself a function of x,
    # so the gradient is a sum: first w.r.t. mu, then mu w.r.t. x.
    dmu = -1 * np.sum(dx1, axis=0)  # (D,); backprop through the broadcast
    dx2 = 1 / N * np.ones((N, D)) * dmu  # mu was a mean over the batch
    dx = dx1 + dx2

    return dx, dgamma, dbeta
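The implementation can be verified against numeric gradients using the eval_numerical_gradient_array helper imported in the setup cell; a sketch with made-up toy inputs:

# Numeric gradient check on toy data.
np.random.seed(231)
N, D = 4, 5
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)
bn_param = {"mode": "train"}

fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]
fg = lambda g: batchnorm_forward(x, g, beta, bn_param)[0]
fb = lambda b: batchnorm_forward(x, gamma, b, bn_param)[0]

dx_num = eval_numerical_gradient_array(fx, x, dout)
dgamma_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)
dbeta_num = eval_numerical_gradient_array(fb, beta.copy(), dout)

_, cache = batchnorm_forward(x, gamma, beta, bn_param)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)
print("dx error:     ", rel_error(dx_num, dx))
print("dgamma error: ", rel_error(dgamma_num, dgamma))
print("dbeta error:  ", rel_error(dbeta_num, dbeta))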
Fast version
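Before looking at the code, note that the per-step gradients above collapse into a compact closed form. A minimal sketch of that form, assuming the same cache layout as above (in particular, that var already contains eps); it is equivalent to the function below, which groups the terms differently:

def batchnorm_backward_compact(dout, cache):
    # Fully collapsed closed form of the batchnorm backward pass.
    x, x_norm, gamma, mean, var = cache
    std = np.sqrt(var)                           # var already includes eps
    dgamma = np.sum(dout * x_norm, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx = gamma / std * (dout - dout.mean(axis=0) - x_norm * (dout * x_norm).mean(axis=0))
    return dx, dgamma, dbeta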
def batchnorm_backward_alt(dout, cache):
    """
    Alternative backward pass for batch normalization that works out a
    simpler closed-form expression instead of stepping through the graph.
    """
    dx, dgamma, dbeta = None, None, None
    x, x_norm, gamma, mean, var = cache
    std = np.sqrt(var)

    dgamma = np.sum(dout * x_norm, axis=0)  # (N, D) * (N, D)
    dbeta = np.sum(dout, axis=0)

    N = 1.0 * x.shape[0]
    dfdu = dout * gamma
    # Split x_norm's dependence on x into three paths (x directly, mu, var),
    # differentiate each path w.r.t. x, then sum the contributions.
    dfdv = np.sum(dfdu * (x - mean) * -0.5 * var ** -1.5, axis=0)  # df/dvar
    dfdw = np.sum(dfdu * -1 / std, axis=0) + \
           dfdv * np.sum(-2 / N * (x - mean), axis=0)  # df/dmu, including var's dependence on mu
    dx = dfdu / std + dfdv * 2 / N * (x - mean) + dfdw / N  # sum the three paths into dx
    '''
    # An alternative computation (from another source; I don't fully follow it):
    N = dout.shape[0]
    dfdz = dout * gamma  # z denotes x_norm below, shape (N, D)
dfdz_sum = np.sum(dfdz,axis=