cs231n assignment 2 q2 batchnormalization

Author: 理智点

Published 2023-08-04 21:29:22


Article link: https://blog.csdn.net/leezed525/article/details/132092685


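If this is too long-winded, go straight to the source code.

Batch normalization explained

I strongly recommend reading the original paper:
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Two important figures from the paper were attached here.
[Image: two figures from the paper]

Read it together with the course videos and the materials below; that makes it easier to understand what batch normalization actually does.

Below is my handwritten derivation.

[Image: handwritten derivation]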
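
The handwritten derivation was an image and is not reproduced here. As a stand-in, the following is a sketch of the standard per-feature equations from the paper for a minibatch x_1, ..., x_N. The symbols (mu, sigma^2, x-hat, y) are notation chosen for this sketch and may differ from the original handwritten page.

```latex
\begin{aligned}
\mu &= \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \\
\hat{x}_i &= \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta, \\[4pt]
\frac{\partial L}{\partial \hat{x}_i} &= \frac{\partial L}{\partial y_i}\,\gamma, \\
\frac{\partial L}{\partial \sigma^2} &= \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{x}_i}\,(x_i - \mu)\left(-\tfrac{1}{2}\right)(\sigma^2+\epsilon)^{-3/2}, \\
\frac{\partial L}{\partial \mu} &= \sum_{i=1}^{N} \frac{\partial L}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma^2+\epsilon}}
 + \frac{\partial L}{\partial \sigma^2}\cdot\frac{1}{N}\sum_{i=1}^{N} -2(x_i-\mu), \\
\frac{\partial L}{\partial x_i} &= \frac{\partial L}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma^2+\epsilon}}
 + \frac{\partial L}{\partial \sigma^2}\cdot\frac{2(x_i-\mu)}{N}
 + \frac{\partial L}{\partial \mu}\cdot\frac{1}{N}, \\
\frac{\partial L}{\partial \gamma} &= \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}\,\hat{x}_i, \qquad
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}.
\end{aligned}
```

These are the same three-term expressions that the chain-rule code later in this post computes, with N taken from x.shape[0].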

Q2 batchNormalization

batchnorm_forward

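Problem

[Image: problem statement]

We need to implement two forward passes, one for training time and one for test time. For the details, see the derivation above.

Analysis

See the derivation at the start of this post.
[Image: derivation]

Code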

import numpy as np


def batchnorm_forward(x, gamma, beta, bn_param):
    """Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param["mode"]
    eps = bn_param.get("eps", 1e-5)
    momentum = bn_param.get("momentum", 0.9)

    N, D = x.shape
    running_mean = bn_param.get("running_mean", np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get("running_var", np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == "train":
        #######################################################################
        # TODO: Implement the training-time forward pass for batch norm.      #
        # Use minibatch statistics to compute the mean and variance, use      #
        # these statistics to normalize the incoming data, and scale and      #
        # shift the normalized data using gamma and beta.                     #
        #                                                                     #
        # You should store the output in the variable out. Any intermediates  #
        # that you need for the backward pass should be stored in the cache   #
        # variable.                                                           #
        #                                                                     #
        # You should also use your computed sample mean and variance together #
        # with the momentum variable to update the running mean and running   #
        # variance, storing your result in the running_mean and running_var   #
        # variables.                                                          #
        #                                                                     #
        # Note that though you should be keeping track of the running         #
        # variance, you should normalize the data based on the standard       #
        # deviation (square root of variance) instead!                        #
        # Referencing the original paper (https://arxiv.org/abs/1502.03167)   #
        # might prove to be helpful.                                          #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        x_mean = np.mean(x, axis=0)  # per-feature minibatch mean
        x_var = np.var(x, axis=0)  # per-feature (uncorrected) minibatch variance
        x_std = np.sqrt(x_var + eps)  # standard deviation, with eps for stability
        x_norm = (x - x_mean) / x_std  # normalize
        out = gamma * x_norm + beta  # scale and shift

        # Save intermediates; eps is included because the backward passes unpack it
        cache = (x, x_mean, x_var, x_std, x_norm, out, gamma, beta, eps)

        # Update running_mean and running_var
        running_mean = momentum * running_mean + (1 - momentum) * x_mean
        running_var = momentum * running_var + (1 - momentum) * x_var

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == "test":
        #######################################################################
        # TODO: Implement the test-time forward pass for batch normalization. #
        # Use the running mean and variance to normalize the incoming data,   #
        # then scale and shift the normalized data using gamma and beta.      #
        # Store the result in the out variable.                               #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        x_norm = (x - running_mean) / np.sqrt(running_var + eps)  # normalize with running statistics
        out = gamma * x_norm + beta  # scale and shift

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param["running_mean"] = running_mean
    bn_param["running_var"] = running_var

    return out, cache

Output

[Image: notebook output]
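
A quick way to sanity-check the two modes is sketched below. It uses a stripped-down stand-in called bn_forward_sketch (a hypothetical name, not part of the assignment) instead of the full function above. In training mode every output column should have mean close to beta and standard deviation close to gamma; once the running averages have warmed up, test mode should give roughly the same.

```python
import numpy as np

def bn_forward_sketch(x, gamma, beta, running_mean, running_var,
                      mode, eps=1e-5, momentum=0.9):
    """Hypothetical minimal stand-in for batchnorm_forward (no cache)."""
    if mode == "train":
        mean, var = x.mean(axis=0), x.var(axis=0)
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # Test time: use the stored running statistics, not the batch.
        mean, var = running_mean, running_var
    out = gamma * (x - mean) / np.sqrt(var + eps) + beta
    return out, running_mean, running_var

rng = np.random.default_rng(0)
N, D = 200, 3
gamma = np.array([1.0, 2.0, 3.0])
beta = np.array([11.0, 12.0, 13.0])
run_mean, run_var = np.zeros(D), np.zeros(D)

# Training mode: column means match beta, column stds match gamma.
x = 5.0 * rng.standard_normal((N, D)) + 12.0
out, run_mean, run_var = bn_forward_sketch(x, gamma, beta, run_mean, run_var, "train")
train_means, train_stds = out.mean(axis=0), out.std(axis=0)

# Warm up the running averages on more batches, then switch to test mode.
for _ in range(100):
    xb = 5.0 * rng.standard_normal((N, D)) + 12.0
    _, run_mean, run_var = bn_forward_sketch(xb, gamma, beta, run_mean, run_var, "train")
x_test = 5.0 * rng.standard_normal((N, D)) + 12.0
out_test, _, _ = bn_forward_sketch(x_test, gamma, beta, run_mean, run_var, "test")
test_means, test_stds = out_test.mean(axis=0), out_test.std(axis=0)

print("train:", train_means, train_stds)
print("test: ", test_means, test_stds)
```

The test-mode numbers are only approximately beta and gamma, because the running averages are noisy estimates of the true statistics.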

batchnorm_backward

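Problem

[Image: problem statement]
We need to compute its gradients.
How to do that is all covered above.

Analysis

See the beginning of this post.
[Image: derivation]

Annoyingly, when I got to the next function I realized that this question wants the backward pass done with a computation graph. I had not read the problem carefully and wrote the whole thing with the chain rule, which is what I know best. The computation-graph approach is too abstract for me and I still prefer the chain rule, so for this question I use someone else's computation-graph code and repost their answer.

So I simply use their code here. I don't like the computation-graph approach myself, but opinions differ.

Code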

def batchnorm_backward(dout, cache):
    """Backward pass for batch normalization.

    For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for batch normalization. Store the    #
    # results in the dx, dgamma, and dbeta variables.                         #
    # Referencing the original paper (https://arxiv.org/abs/1502.03167)       #
    # might prove to be helpful.                                              #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x, x_mean, x_var, x_std, x_norm, out, gamma, beta, eps = cache
    m = x.shape[0]

    # Walk the computation graph backward, one node at a time.
    # "_a" is the direct branch through (x - mean); "_b" goes through the statistics.
    dx_1 = dout * gamma  # d(x_norm)
    dx_2_b = np.sum((x - x_mean) * dx_1, axis=0)  # d(1 / std)
    dx_2_a = ((x_var + eps) ** -0.5) * dx_1  # d(x - mean), direct branch
    dx_3_b = -0.5 * ((x_var + eps) ** -1.5) * dx_2_b  # d(var)
    dx_4_b = dx_3_b * 1
    dx_5_b = np.ones_like(x) / m * dx_4_b  # d((x - mean) ** 2)
    dx_6_b = 2 * (x - x_mean) * dx_5_b  # d(x - mean), variance branch
    dx_7_a = dx_6_b * 1 + dx_2_a * 1  # total d(x - mean), flows on to x
    dx_7_b = dx_6_b * 1 + dx_2_a * 1  # total d(x - mean), flows on to mean
    dx_8_b = -1 * np.sum(dx_7_b, axis=0)  # d(mean)
    dx_9_b = np.ones_like(x) / m * dx_8_b  # d(x), mean branch
    dx_10 = dx_9_b + dx_7_a

    dx = dx_10
    dgamma = np.sum(dout * x_norm, axis=0)
    dbeta = np.sum(dout, axis=0)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return dx, dgamma, dbeta

Output

[Image: notebook output]
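
Since the notebook output is not reproduced here, the sketch below shows one way to check a staged backward pass against a numerical gradient. The functions forward, backward_graph and numeric_grad are hypothetical helpers written for this illustration; they are not the assignment's batchnorm_backward or its gradient-check utilities, though the idea is the same.

```python
import numpy as np

def forward(x, gamma, beta, eps=1e-5):
    # Hypothetical staged forward pass; each line is one graph node.
    mu = x.mean(axis=0)
    xc = x - mu                      # centered input
    var = (xc ** 2).mean(axis=0)
    ivar = (var + eps) ** -0.5       # 1 / std
    xhat = xc * ivar
    return gamma * xhat + beta, (xc, ivar, xhat, gamma)

def backward_graph(dout, cache):
    # Backpropagate through the nodes of forward() in reverse order.
    xc, ivar, xhat, gamma = cache
    N = dout.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = (dout * xhat).sum(axis=0)
    dxhat = dout * gamma
    divar = (dxhat * xc).sum(axis=0)      # xhat = xc * ivar
    dxc = dxhat * ivar
    dvar = -0.5 * ivar ** 3 * divar       # ivar = (var + eps) ** -0.5
    dxc += 2.0 * xc * dvar / N            # var = mean(xc ** 2)
    dmu = -dxc.sum(axis=0)                # xc = x - mu
    dx = dxc + dmu / N                    # mu = mean(x)
    return dx, dgamma, dbeta

def numeric_grad(f, x, dout, h=1e-5):
    """Centered finite differences of sum(f(x) * dout) with respect to x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        pos = f(x)
        x[idx] = old - h
        neg = f(x)
        x[idx] = old
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.default_rng(0)
N, D = 4, 5
x = 5 * rng.standard_normal((N, D)) + 12
gamma, beta = rng.standard_normal(D), rng.standard_normal(D)
dout = rng.standard_normal((N, D))

out, cache = forward(x, gamma, beta)
dx, dgamma, dbeta = backward_graph(dout, cache)
dx_num = numeric_grad(lambda v: forward(v, gamma, beta)[0], x, dout)
dgamma_num = numeric_grad(lambda g: forward(x, g, beta)[0], gamma, dout)

print("max dx error:", np.max(np.abs(dx - dx_num)))
print("max dgamma error:", np.max(np.abs(dgamma - dgamma_num)))
```

Both errors should be tiny in float64; a large value usually means one branch of the graph was dropped.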

batchnorm_backward_alt

Problem

[Image: problem statement]

This question asks us to compute the gradients with the chain rule.

Analysis

See above.

Code

def batchnorm_backward_alt(dout, cache):
    """Alternative backward pass for batch normalization.

    For this implementation you should work out the derivatives for the batch
    normalizaton backward pass on paper and simplify as much as possible. You
    should be able to derive a simple expression for the backward pass.
    See the jupyter notebook for more hints.

    Note: This implementation should expect to receive the same cache variable
    as batchnorm_backward, but might not use all of the values in the cache.

    Inputs / outputs: Same as batchnorm_backward
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for batch normalization. Store the    #
    # results in the dx, dgamma, and dbeta variables.                         #
    #                                                                         #
    # After computing the gradient with respect to the centered inputs, you   #
    # should be able to compute gradients with respect to the inputs in a     #
    # single statement; our implementation fits on a single 80-character line.#
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x, x_mean, x_var, x_std, x_norm, out, gamma, beta, eps = cache
    dgamma = np.sum(dout * x_norm, axis=0)  # gradient w.r.t. gamma
    dbeta = np.sum(dout, axis=0)  # gradient w.r.t. beta

    dx_norm = dout * gamma  # gradient w.r.t. x_norm
    dx_var = np.sum(dx_norm * (x - x_mean) * (-0.5) * np.power(x_var + eps, -1.5), axis=0)  # gradient w.r.t. the variance
    dx_mean = np.sum(dx_norm * (-1) / x_std, axis=0) + dx_var * np.sum(-2 * (x - x_mean), axis=0) / x.shape[0]  # gradient w.r.t. the mean
    dx = dx_norm / x_std + dx_var * 2 * (x - x_mean) / x.shape[0] + dx_mean / x.shape[0]  # gradient w.r.t. x

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return dx, dgamma, dbeta

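Output

[Image: notebook output]
The reported speedup is not very reliable, since it depends mostly on what the machine is doing at the time, but most of the time the chain-rule version runs faster than the computation-graph version.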
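
The assignment hint says the gradient with respect to x can fit in a single statement. The sketch below compares the three-term chain-rule form used above with one commonly used simplified form. Both function names are hypothetical, and the compact expression is my own simplification of the three terms, not code taken from the assignment.

```python
import numpy as np

def backward_chain(dout, x, gamma, eps=1e-5):
    """Three-term chain-rule form, mirroring the code in this section."""
    N = x.shape[0]
    mu, var = x.mean(axis=0), x.var(axis=0)
    std = np.sqrt(var + eps)
    dxhat = dout * gamma
    dvar = np.sum(dxhat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)
    dmu = np.sum(-dxhat / std, axis=0) + dvar * np.sum(-2 * (x - mu), axis=0) / N
    return dxhat / std + dvar * 2 * (x - mu) / N + dmu / N

def backward_compact(dout, x, gamma, eps=1e-5):
    """Same gradient after substituting dvar and dmu and collecting terms."""
    N = x.shape[0]
    std = np.sqrt(x.var(axis=0) + eps)
    xhat = (x - x.mean(axis=0)) / std
    dxhat = dout * gamma
    # The sum(-2 * (x - mu)) term is zero, so only three pieces remain.
    return (N * dxhat - dxhat.sum(axis=0) - xhat * (dxhat * xhat).sum(axis=0)) / (N * std)

rng = np.random.default_rng(1)
x = 3 * rng.standard_normal((6, 4)) + 2
gamma = rng.standard_normal(4)
dout = rng.standard_normal((6, 4))

dx_chain = backward_chain(dout, x, gamma)
dx_compact = backward_compact(dout, x, gamma)
print("max difference:", np.max(np.abs(dx_chain - dx_compact)))
```

The compact form needs fewer intermediate arrays, which is one plausible reason a simplified version tends to run faster.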

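Adding batch normalization layers to the network

Problem

[Image: problem statement]

There is no specific problem statement. We build on the previous assignment and modify the __init__ and loss functions defined in fc_net,
as well as layer_utils.py.

layer_utils currently contains the following:
[Image: contents of layer_utils.py]
The intent is presumably for us to follow the existing affine_relu_forward and affine_relu_backward and implement affine_bn_relu_forward and affine_bn_relu_backward.

Analysis

If you have followed along step by step there should be nothing confusing here: just treat batchnorm as one more layer, like affine or relu.

Code

I already added the batchnorm parameters to __init__ in the previous part, so only the loss function needs to change.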

fc_net.init

    def __init__(
            self,
            hidden_dims,
            input_dim=3 * 32 * 32,
            num_classes=10,
            dropout_keep_ratio=1,
            normalization=None,
            reg=0.0,
            weight_scale=1e-2,
            dtype=np.float32,
            seed=None,
    ):
        """Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout_keep_ratio: Scalar between 0 and 1 giving dropout strength.
            If dropout_keep_ratio=1 then the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
            are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
            initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
            this datatype. float32 is faster but less accurate, so you should use
            float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers.
            This will make the dropout layers deterministic so we can gradient check the model.
        """
        self.normalization = normalization
        self.use_dropout = dropout_keep_ratio != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # Collect the dimensions of every layer
        layer_dims = [input_dim] + hidden_dims + [num_classes]
        # Initialize every layer's parameters (num_layers is len(layer_dims) - 1, so i + 1 never goes out of range)
        for i in range(self.num_layers):
            self.params['W' + str(i + 1)] = np.random.normal(0, weight_scale, size=(layer_dims[i], layer_dims[i + 1]))
            self.params['b' + str(i + 1)] = np.zeros(layer_dims[i + 1])
            # Add the batch normalization parameters; the last layer does not get them
            if self.normalization == 'batchnorm' and i < self.num_layers - 1:
                self.params['gamma' + str(i + 1)] = np.ones(layer_dims[i + 1])
                self.params['beta' + str(i + 1)] = np.zeros(layer_dims[i + 1])

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {"mode": "train", "p": dropout_keep_ratio}
            if seed is not None:
                self.dropout_param["seed"] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.normalization == "batchnorm":
            self.bn_params = [{"mode": "train"} for i in range(self.num_layers - 1)]
        if self.normalization == "layernorm":
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype.
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)
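
The comment about bn_params matters because batchnorm_forward writes running_mean and running_var back into the dictionary it receives, so every batch normalization layer needs its own dictionary. The sketch below illustrates this with update_running_stats, a hypothetical helper that keeps only the running-average bookkeeping. It also shows what would go wrong if the list were built with list multiplication instead of the list comprehension used in __init__.

```python
import numpy as np

def update_running_stats(x, bn_param, momentum=0.9):
    """Hypothetical helper: only the running-average part of batchnorm_forward."""
    D = x.shape[1]
    rm = bn_param.get("running_mean", np.zeros(D))
    rv = bn_param.get("running_var", np.zeros(D))
    bn_param["running_mean"] = momentum * rm + (1 - momentum) * x.mean(axis=0)
    bn_param["running_var"] = momentum * rv + (1 - momentum) * x.var(axis=0)

num_bn_layers = 2
separate = [{"mode": "train"} for _ in range(num_bn_layers)]  # as in __init__
shared = [{"mode": "train"}] * num_bn_layers                   # one dict, repeated

rng = np.random.default_rng(0)
h1 = rng.standard_normal((8, 4))          # activations entering BN layer 1
h2 = rng.standard_normal((8, 4)) + 10.0   # activations entering BN layer 2

for params in (separate, shared):
    update_running_stats(h1, params[0])
    update_running_stats(h2, params[1])

# Separate dicts keep per-layer statistics apart.
separate_ok = not np.allclose(separate[0]["running_mean"],
                              separate[1]["running_mean"])
# The multiplied list holds the same object twice, so the layers overwrite each other.
shared_mixed = shared[0] is shared[1]
print(separate_ok, shared_mixed)
```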

layer_utils

# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

def affine_bn_relu_forward(x, w, b, gamma, beta, bn_param):
    affine_out, affine_cache = affine_forward(x, w, b)
    bn_out, bn_cache = batchnorm_forward(affine_out, gamma, beta, bn_param)
    relu_out, relu_cache = relu_forward(bn_out)
    cache = (affine_cache, bn_cache, relu_cache)
    return relu_out, cache

def affine_bn_relu_backward(dout, cache):
    affine_cache, bn_cache, relu_cache = cache
    drelu_out = relu_backward(dout, relu_cache)
    dbn_out, dgamma, dbeta = batchnorm_backward(drelu_out, bn_cache)
    dx, dw, db = affine_backward(dbn_out, affine_cache)
    return dx, dw, db, dgamma, dbeta

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

fc_net.loss

    def loss(self, X, y=None):
        """Compute loss and gradient for the fully connected net.
        
        Inputs:
        - X: Array of input data of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
            scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
            names to gradients of the loss with respect to those parameters.
        """
        X = X.astype(self.dtype)
        mode = "test" if y is None else "train"

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param["mode"] = mode
        if self.normalization == "batchnorm":
            for bn_param in self.bn_params:
                bn_param["mode"] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # The network architecture is {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

        # Keep the previous layer's output in a variable
        layer_input = X
        caches = {}
        # Handle the first L - 1 layers; the last layer is treated differently
        for i in range(1, self.num_layers):
            W = self.params['W' + str(i)]
            b = self.params['b' + str(i)]
            if self.normalization == 'batchnorm':
                gamma = self.params['gamma' + str(i)]
                beta = self.params['beta' + str(i)]
                layer_input, caches['layer' + str(i)] = affine_bn_relu_forward(layer_input, W, b, gamma, beta, self.bn_params[i - 1])
            else:
                layer_input, caches['layer' + str(i)] = affine_relu_forward(layer_input, W, b)

        # The last layer
        W = self.params['W' + str(self.num_layers)]
        b = self.params['b' + str(self.num_layers)]

        scores, affine_cache = affine_forward(layer_input, W, b)
        caches['layer' + str(self.num_layers)] = affine_cache

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early.
        if mode == "test":
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the   #
        # scale and shift parameters.                                              #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # Compute the data loss
        loss, dscores = softmax_loss(scores, y)

        # Backpropagate through the last layer first
        dx, dw, db = affine_backward(dscores, caches['layer' + str(self.num_layers)])
        grads['W' + str(self.num_layers)] = dw + self.reg * self.params['W' + str(self.num_layers)]
        grads['b' + str(self.num_layers)] = db

        for i in range(self.num_layers - 1, 0, -1):
            if self.normalization == 'batchnorm':
                dx, dw, db, dgamma, dbeta = affine_bn_relu_backward(dx, caches['layer' + str(i)])
                grads['gamma' + str(i)] = dgamma
                grads['beta' + str(i)] = dbeta
            else:
                dx, dw, db = affine_relu_backward(dx, caches['layer' + str(i)])
            grads['W' + str(i)] = dw + self.reg * self.params['W' + str(i)]
            grads['b' + str(i)] = db

        # Add the L2 regularization term to the loss
        for i in range(1, self.num_layers + 1):
            W = self.params['W' + str(i)]
            loss += 0.5 * self.reg * np.sum(W * W)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads

Output

[Image: notebook output]
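
One detail in the loss code worth checking is the factor of 0.5 in the regularization term: with loss += 0.5 * reg * sum(W * W), the matching gradient is exactly reg * W, which is what gets added to each dw. The sketch below confirms this numerically; the values of reg and W are made up for the illustration.

```python
import numpy as np

reg = 0.1                                  # illustrative value only
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))

def reg_loss(W):
    # Same form as the regularization term added to the loss.
    return 0.5 * reg * np.sum(W * W)

analytic = reg * W                         # the term added to dw

# Centered finite differences, one entry at a time.
numeric = np.zeros_like(W)
h = 1e-6
for idx in np.ndindex(*W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += h
    Wm[idx] -= h
    numeric[idx] = (reg_loss(Wp) - reg_loss(Wm)) / (2 * h)

print("max difference:", np.max(np.abs(analytic - numeric)))
```

Without the 0.5, the gradient would be 2 * reg * W and the automated tests mentioned in the TODO comment would no longer match.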

layer normalization

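Problem

[Image: problem statement]
Here we are asked to use layer normalization. Why layer normalization? Batch normalization works well, but it has some unavoidable drawbacks, for example:

The batch size affects the results, mainly because the mean and variance computed on a small batch may not match the mean and variance of the test data.
It is hard to use in RNNs. Take a Seq2seq task: the inputs within one batch have different lengths and different time steps need different statistics, so a BN layer cannot be applied properly and Layer Normalization is used instead.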
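
The first drawback can be made concrete with a small simulation. This is only a sketch on synthetic data (a single feature drawn from a normal distribution, an assumption made for the illustration): it measures how much the batch mean fluctuates from batch to batch for a tiny batch and for a large one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "dataset": one feature with true mean 3 and std 2 (made-up values).
population = rng.standard_normal(100_000) * 2.0 + 3.0

def spread_of_batch_means(batch_size, num_batches=1000):
    """Standard deviation of the per-batch mean across many random batches."""
    batches = rng.choice(population, size=(num_batches, batch_size))
    return batches.mean(axis=1).std()

small = spread_of_batch_means(2)
large = spread_of_batch_means(256)
print("batch size 2:  ", small)
print("batch size 256:", large)
```

The statistics from small batches scatter much more widely around the true value, so the normalization applied during training differs more from what the running averages give at test time.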

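Hence layer normalization. Once batch normalization is understood, layer normalization is fairly easy to follow, so I will not write out the theory by hand.

First, a few learning resources.

My understanding of the two is this: batch normalization computes the mean and variance over each column of the input matrix, while layer normalization computes them over each row. Batch normalization is therefore sensitive to the number of input samples, whereas layer normalization is not, because each row has a fixed number of elements.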
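
The column-versus-row distinction is just a change of axis in NumPy. The tiny example below uses made-up numbers: adding one more sample changes the per-column statistics but leaves the per-row statistics of the existing samples untouched.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0]])       # N=2 samples (rows), D=3 features (columns)

bn_mean = x.mean(axis=0)   # batch norm: one value per feature, shape (D,)
ln_mean = x.mean(axis=1)   # layer norm: one value per sample, shape (N,)

# Append a third sample to the batch.
x_more = np.vstack([x, [[10.0, 10.0, 10.0]]])
bn_mean_more = x_more.mean(axis=0)        # changes with the batch contents
ln_mean_more = x_more.mean(axis=1)[:2]    # first two rows are unaffected

print(bn_mean, "->", bn_mean_more)
print(ln_mean, "->", ln_mean_more)
```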

So ln_forward and ln_backward are essentially the same computation as bn: transpose the input and it becomes bn. We can exploit this when writing the code.

layernorm_forward

Problem

[Image: problem statement]

Analysis

See the discussion above.

Code

def layernorm_forward(x, gamma, beta, ln_param):
    """Forward pass for layer normalization.

    During both training and test-time, the incoming data is normalized per data-point,
    before being scaled by gamma and beta parameters identical to that of batch normalization.

    Note that in contrast to batch normalization, the behavior during train and test-time for
    layer normalization are identical, and we do not need to keep track of running averages
    of any sort.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - ln_param: Dictionary with the following keys:
        - eps: Constant for numeric stability

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    out, cache = None, None
    eps = ln_param.get("eps", 1e-5)
    ###########################################################################
    # TODO: Implement the training-time forward pass for layer norm.          #
    # Normalize the incoming data, and scale and  shift the normalized data   #
    #  using gamma and beta.                                                  #
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of  batch normalization, and inserting a line or two of  #
    # well-placed code. In particular, can you think of any matrix            #
    # transformations you could perform, that would enable you to copy over   #
    # the batch norm code and leave it almost unchanged?                      #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # Transpose the input so that each sample becomes a column
    x = x.T
    gamma, beta = np.atleast_2d(gamma).T, np.atleast_2d(beta).T

    # Reuse the bn code unchanged
    x_mean = np.mean(x, axis=0)
    x_var = np.var(x, axis=0)
    x_std = np.sqrt(x_var + eps)
    x_norm = (x - x_mean) / x_std
    out = gamma * x_norm + beta

    # Transpose back
    out = out.T

    # Note: x and x_norm are cached as (D, N); gamma and beta as (D, 1)
    cache = (x, x_mean, x_var, x_std, x_norm, out, gamma, beta, eps)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return out, cache

Output

[Image: notebook output]
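
To see that the transpose trick really is layer normalization, the sketch below compares it with a direct per-row computation using axis=1 and keepdims=True. Both function names are hypothetical and neither returns a cache; they are written only for this comparison.

```python
import numpy as np

def layernorm_direct(x, gamma, beta, eps=1e-5):
    # Per-row statistics, kept as (N, 1) so they broadcast over features.
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def layernorm_via_transpose(x, gamma, beta, eps=1e-5):
    xt = x.T                                              # (D, N)
    x_norm = (xt - xt.mean(axis=0)) / np.sqrt(xt.var(axis=0) + eps)
    # gamma and beta become (D, 1) columns, as in the code above.
    return (gamma[:, None] * x_norm + beta[:, None]).T    # back to (N, D)

rng = np.random.default_rng(0)
N, D = 4, 6
x = 3 * rng.standard_normal((N, D)) + 5
gamma, beta = rng.standard_normal(D), rng.standard_normal(D)

a = layernorm_direct(x, gamma, beta)
b = layernorm_via_transpose(x, gamma, beta)
row_means = layernorm_direct(x, np.ones(D), np.zeros(D)).mean(axis=1)
print("max difference:", np.max(np.abs(a - b)))
print("row means with gamma=1, beta=0:", row_means)
```

The direct version avoids the transposes at the cost of no longer being a copy of the batch norm code, so which one to use is mostly a matter of taste.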

layernorm_backward

Problem

[Image: problem statement]

Analysis

Just think it through yourself.

Code

def layernorm_backward(dout, cache):
    """Backward pass for layer normalization.

    For this implementation, you can heavily rely on the work you've done already
    for batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from layernorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for layer norm.                       #
    #                                                                         #
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of batch normalization. The hints to the forward pass    #
    # still apply!                                                            #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # Work in the transposed layout and reuse the bn code
    x, x_mean, x_var, x_std, x_norm, out, gamma, beta, eps = cache

    dout = dout.T
    dgamma = np.sum(dout * x_norm, axis=1)  # gradient w.r.t. gamma: sum over samples (axis 1 after the transpose)
    dbeta = np.sum(dout, axis=1)  # gradient w.r.t. beta

    dx_norm = dout * gamma  # gradient w.r.t. x_norm
    dx_var = np.sum(dx_norm * (x - x_mean) * (-0.5) * np.power(x_var + eps, -1.5), axis=0)  # gradient w.r.t. the variance
    dx_mean = np.sum(dx_norm * (-1) / x_std, axis=0) + dx_var * np.sum(-2 * (x - x_mean), axis=0) / x.shape[0]  # gradient w.r.t. the mean
    dx = dx_norm / x_std + dx_var * 2 * (x - x_mean) / x.shape[0] + dx_mean / x.shape[0]  # gradient w.r.t. x (still transposed)

    dx = dx.T

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dgamma, dbeta
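
As a cross-check on the transposed code, the sketch below writes the layer norm backward pass directly in the (N, D) layout and compares it with a numerical gradient. The names ln_forward, ln_backward and numeric_grad are hypothetical helpers for this illustration. Note that dgamma and dbeta are still summed over the batch, because gamma and beta are per-feature parameters; only the statistics for dx run along each row.

```python
import numpy as np

def ln_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=1, keepdims=True)
    std = np.sqrt(x.var(axis=1, keepdims=True) + eps)
    xhat = (x - mean) / std
    return gamma * xhat + beta, (xhat, std, gamma)

def ln_backward(dout, cache):
    xhat, std, gamma = cache
    D = dout.shape[1]
    dgamma = (dout * xhat).sum(axis=0)     # per feature: sum over samples
    dbeta = dout.sum(axis=0)
    dxhat = dout * gamma
    # Same compact form as batch norm, but reduced along each row (axis=1).
    dx = (D * dxhat
          - dxhat.sum(axis=1, keepdims=True)
          - xhat * (dxhat * xhat).sum(axis=1, keepdims=True)) / (D * std)
    return dx, dgamma, dbeta

def numeric_grad(f, x, dout, h=1e-5):
    """Centered finite differences of sum(f(x) * dout) with respect to x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        pos = f(x)
        x[idx] = old - h
        neg = f(x)
        x[idx] = old
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.default_rng(0)
N, D = 4, 5
x = 5 * rng.standard_normal((N, D)) + 12
gamma, beta = rng.standard_normal(D), rng.standard_normal(D)
dout = rng.standard_normal((N, D))

out, cache = ln_forward(x, gamma, beta)
dx, dgamma, dbeta = ln_backward(dout, cache)
dx_num = numeric_grad(lambda v: ln_forward(v, gamma, beta)[0], x, dout)
dgamma_num = numeric_grad(lambda g: ln_forward(x, g, beta)[0], gamma, dout)
dbeta_num = numeric_grad(lambda b: ln_forward(x, gamma, b)[0], beta, dout)

print("max dx error:", np.max(np.abs(dx - dx_num)))
```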