[cs231n] Regularization

Article link: https://blog.csdn.net/qyf394613530/article/details/89514672


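Overfitting occurs when there is too little training data, the network is too complex, or training runs for too long: accuracy on the training set keeps improving while accuracy on the test set stays low. To address this, a regularization term is added to the loss function to reduce the effective complexity of the network and improve its ability to generalize.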

L2 regularization

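The L2 norm of a vector is the square root of the sum of the squares of its elements. L2 regularization adds a term based on the squared L2 norm to the loss function: $Loss+\frac{1}{2} \lambda \sum _ { j = 1 } ^ { n } w_ { j } ^ { 2 }$ , where $\lambda$ is the regularization coefficient that balances the regularization term against the loss, and the bias term $w_0$ is not included in the sum. Differentiating the penalty gives $\lambda w_ { j }$ , so with learning rate $\eta$ the gradient descent update gains one extra term: $-\eta \lambda w_ { j }$ . L2 regularization therefore shrinks the weights at every step, which is how it achieves its regularizing effect.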
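
The update rule above can be sketched in NumPy as follows. This is a minimal illustration, not code from the original post: the function name `l2_regularized_step` and the argument `data_grad` (the gradient of the unregularized loss) are assumptions introduced here, and `w` is assumed to hold only the non-bias weights.

```python
import numpy as np


def l2_regularized_step(w, data_grad, eta, lam):
    """One gradient-descent step on Loss + 0.5 * lam * sum(w_j ** 2).

    w         -- non-bias weights (the bias is assumed to be stored elsewhere)
    data_grad -- gradient of the unregularized loss w.r.t. w (hypothetical input)
    eta, lam  -- learning rate and regularization coefficient
    """
    reg_grad = lam * w                      # derivative of 0.5 * lam * w_j ** 2
    return w - eta * data_grad - eta * reg_grad


w = np.array([1.0, -2.0, 0.5])
# With a zero data gradient only the extra term -eta * lam * w_j acts,
# so every weight is multiplied by (1 - eta * lam).
w_next = l2_regularized_step(w, np.zeros_like(w), eta=0.1, lam=0.5)
```

Because the extra term is proportional to the weight itself, large weights are pulled back more strongly than small ones, which is why this behaviour is often described as weight decay.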

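When a model overfits, the coefficients of the fitted function tend to be very large, especially the high-order ones. Why? As the figure shows, an overfitted function has to accommodate every single point, so the resulting curve fluctuates heavily. Within some very small intervals the function value changes sharply, which means the derivative (in absolute value) is very large there. Since the inputs themselves can be large or small, only sufficiently large coefficients can produce such large derivatives.
Regularization constrains the norm of the parameters so that they do not grow too large, and can therefore reduce overfitting to some extent.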
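
The argument above can be checked on a toy problem. The sketch below is an illustration only and is not from the original post: the sine-plus-noise data, the polynomial degree, and the value of `lam` are all assumptions chosen for the demo. It fits a degree-9 polynomial to 10 noisy points with and without an L2 penalty (bias excluded) and compares the size of the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.shape)  # toy noisy data

degree = 9
X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, ..., x^9

# Unregularized least squares: the curve passes through every noisy point.
w_plain, *_ = np.linalg.lstsq(X, y, rcond=None)

# L2-regularized least squares; the bias (column 0) is left unpenalized.
lam = 1e-2
penalty = np.eye(degree + 1)
penalty[0, 0] = 0.0
w_ridge = np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

norm_plain = np.linalg.norm(w_plain[1:])
norm_ridge = np.linalg.norm(w_ridge[1:])
print(f"coefficient norm without L2: {norm_plain:.3f}")
print(f"coefficient norm with L2:    {norm_ridge:.3f}")
```

The penalized fit no longer interpolates the noise exactly, and in exchange its non-bias coefficients have a smaller norm than those of the unregularized fit.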

L1 regularization

Analogous to L2, the L1 regularization term is $\lambda \sum _ { j = 1 } ^ { n } \left| w _ { j } \right|$ .

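Differentiating it shows that the gradient descent update gains the extra term $-\eta \lambda \, sgn(w _ j )$ , where $sgn(w _ j )=1$ when $w _ j >0$ and $sgn(w _ j )=-1$ when $w _ j <0$ . In other words, the update decreases $w _ j$ when $w _ j >0$ and increases it when $w _ j <0$ . L1 regularization therefore works by pushing each $w _ j$ toward 0.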
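
A small sketch of the difference between the two penalty terms in the update, under the assumption that both use the same learning rate `eta` and coefficient `lam`. The helper names `l1_penalty_step` and `l2_penalty_step` are hypothetical, and the data-loss gradient is left out so that only the regularization terms are visible.

```python
import numpy as np


def l1_penalty_step(w, eta, lam):
    """Apply only the L1 term of the update: w_j - eta * lam * sgn(w_j)."""
    return w - eta * lam * np.sign(w)       # np.sign(0) is 0, so zeros stay put


def l2_penalty_step(w, eta, lam):
    """Apply only the L2 term of the update: w_j - eta * lam * w_j."""
    return w - eta * lam * w


w = np.array([2.0, -0.5, 0.0])
eta, lam = 0.1, 0.1
w_l1 = l1_penalty_step(w, eta, lam)
w_l2 = l2_penalty_step(w, eta, lam)
```

The L1 term moves every nonzero weight toward 0 by the same fixed amount `eta * lam`, regardless of its size, whereas the L2 term shrinks each weight in proportion to its current value.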
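[Figure: constraint regions and loss contours in two dimensions; left: L1 regularization, right: L2 regularization]
As the figure above shows for the two-dimensional case, the left panel represents L1 regularization and the right panel L2 regularization. The blue circles are contour lines of the loss function, and their center is the unregularized optimum. The red shape is the constraint imposed by the regularizer: the L2 constraint region is a circle with no corners, while the L1 region has corners. At the optimum, L1 regularization gives $w_1 = 0$ , whereas L2 regularization gives a small but nonzero $w_1$ . This shows that L1 regularization performs feature selection by discarding some features outright. L1 is therefore the better choice when only a few of the features matter, while L2 is more suitable when most features contribute and their contributions are fairly even.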
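
The feature-selection effect can be reproduced on synthetic data. The sketch below is not from the original post and rests on several assumptions: a linear regression problem in which only 2 of 10 features carry signal, a hand-picked `lam`, and, for the L1 case, proximal gradient descent with soft-thresholding (ISTA) rather than the plain sign-based update, because the proximal step is what produces exact zeros in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]                    # only two features matter
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 20.0

# L2: closed-form minimizer of 0.5 * ||Xw - y||^2 + 0.5 * lam * ||w||^2
w_l2 = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# L1: proximal gradient (ISTA) on 0.5 * ||Xw - y||^2 + lam * ||w||_1
step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / largest eigenvalue of X^T X
w_l1 = np.zeros(d)
for _ in range(500):
    z = w_l1 - step * (X.T @ (X @ w_l1 - y))                       # gradient step
    w_l1 = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)    # soft-threshold

print("nonzero weights with L1:", np.count_nonzero(w_l1))
print("nonzero weights with L2:", np.count_nonzero(w_l2))
```

With L2 every weight stays nonzero, only smaller, while L1 sets weights of uninformative features exactly to zero and keeps the informative ones.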

Dropout

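Dropout does not modify the network's loss function; it is a technique applied only while the network is being trained. As the figure below shows, during training neurons are randomly dropped with a certain probability, and the resulting thinned network is trained and optimized. At test time nothing is dropped and the complete network is used.
[Figure: a network with randomly dropped neurons during training]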

Its forward and backward passes are as follows:

import numpy as np


def dropout_forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.
    Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We keep each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.
    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None

    if mode == 'train':
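        # Inverted dropout: keep each unit with probability p and scale the
        # kept units by 1/p so the expected output matches test mode.
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
    elif mode == 'test':
        out = x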
    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache


def dropout_backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.
    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
        dx = dout * mask
    elif mode == 'test':
        dx = dout
    return dx
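
Inverted dropout, as named in the docstrings above, rescales the kept activations by `1 / p` during training so that their expected value matches what the full network produces at test time. The following self-contained check is a sketch of that property only; the array size, the seed, and the use of `np.random.default_rng` are assumptions made for the demo and are not part of the functions above.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                     # probability of keeping a unit
x = np.ones((1000, 1000))

mask = (rng.random(x.shape) < p) / p        # kept units are scaled by 1 / p
out_train = x * mask                        # training mode: drop and rescale
out_test = x                                # test mode: input passes through

print("mean activation in train mode:", out_train.mean())
print("mean activation in test mode: ", out_test.mean())
```

The two means agree up to sampling noise, which is why no extra scaling is needed at test time when the `1 / p` factor is applied during training.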