FullyConnectedNets
Introduction:
A neural network can generally be viewed as many layers stacked together. If the forward and backward pass of each layer is implemented as a standalone module, it becomes easy to chain an arbitrary sequence of layers into a network.
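For example, once the affine and ReLU layers implemented below are available, a convenience "sandwich" layer is just a matter of chaining their forward and backward functions (this mirrors the affine_relu_forward / affine_relu_backward helpers provided in the assignment's layer_utils.py):
def affine_relu_forward(x, w, b):
    """Convenience layer: an affine transform followed by a ReLU."""
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    return out, (fc_cache, relu_cache)

def affine_relu_backward(dout, cache):
    """Backward pass for the affine-relu convenience layer."""
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    return affine_backward(da, fc_cache)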
affine_forward
Implement the forward pass of the affine (fully-connected) layer:
def affine_forward(x, w, b):
"""
Computes the forward pass for an affine (fully-connected) layer.
The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
examples, where each example x[i] has shape (d_1, ..., d_k). We will
reshape each input into a vector of dimension D = d_1 * ... * d_k, and
then transform it to an output vector of dimension M.
Inputs:
- x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
- w: A numpy array of weights, of shape (D, M)
- b: A numpy array of biases, of shape (M,)
Returns a tuple of:
- out: output, of shape (N, M)
- cache: (x, w, b)
"""
out = None
###########################################################################
# TODO: Implement the affine forward pass. Store the result in out. You #
# will need to reshape the input into rows. #
###########################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
x_temp = np.reshape(x,[x.shape[0], -1])
out = x_temp.dot(w) + b
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
###########################################################################
# END OF YOUR CODE #
###########################################################################
cache = (x, w, b)
return out, cache
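A quick shape check (a small self-contained example, not part of the assignment):
import numpy as np

x = np.random.randn(2, 4, 5, 6)    # N = 2 examples, each of shape (4, 5, 6)
w = np.random.randn(4 * 5 * 6, 3)  # D = 120, M = 3
b = np.random.randn(3)
out, _ = affine_forward(x, w, b)
print(out.shape)                   # (2, 3)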
affine_backward
Implement the backward pass of the affine layer:
def affine_backward(dout, cache):
"""
Computes the backward pass for an affine layer.
Inputs:
- dout: Upstream derivative, of shape (N, M)
- cache: Tuple of:
- x: Input data, of shape (N, d_1, ... d_k)
- w: Weights, of shape (D, M)
- b: Biases, of shape (M,)
Returns a tuple of:
- dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
- dw: Gradient with respect to w, of shape (D, M)
- db: Gradient with respect to b, of shape (M,)
"""
x, w, b = cache
dx, dw, db = None, None, None
###########################################################################
# TODO: Implement the affine backward pass. #
###########################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
x_temp = np.reshape(x,[x.shape[0], -1])
db = np.sum(dout, axis = 0)
dw = np.dot(x_temp.T, dout)
dx = np.dot(dout, w.T)
dx = np.reshape(dx, x.shape)
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
###########################################################################
# END OF YOUR CODE #
###########################################################################
return dx, dw, db
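The analytic gradients should be verified numerically; the notebook uses eval_numerical_gradient_array from the starter code for this. A minimal centered-difference check written directly in numpy (a sketch for illustration; the affine map is linear, so the difference should be essentially zero):
import numpy as np

def numeric_gradient(f, x, df, h=1e-5):
    """Centered-difference gradient of sum(f(x) * df) with respect to x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        pos = f(x)
        x[idx] = old - h
        neg = f(x)
        x[idx] = old
        grad[idx] = np.sum((pos - neg) * df) / (2 * h)
        it.iternext()
    return grad

x = np.random.randn(3, 4)
w = np.random.randn(4, 5)
b = np.random.randn(5)
dout = np.random.randn(3, 5)
_, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)
dw_num = numeric_gradient(lambda w_: affine_forward(x, w_, b)[0], w, dout)
print(np.max(np.abs(dw - dw_num)))  # should be tiny (around 1e-10 or less)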
relu_forward
Forward pass of the ReLU activation:
def relu_forward(x):
"""
Computes the forward pass for a layer of rectified linear units (ReLUs).
Input:
- x: Inputs, of any shape
Returns a tuple of:
- out: Output, of the same shape as x
- cache: x
"""
out = None
###########################################################################
# TODO: Implement the ReLU forward pass. #
###########################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
out = np.maximum(0, x)
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
###########################################################################
# END OF YOUR CODE #
###########################################################################
cache = x
return out, cache
relu_backward
Backward pass of the ReLU activation:
def relu_backward(dout, cache):
"""
Computes the backward pass for a layer of rectified linear units (ReLUs).
Input:
- dout: Upstream derivatives, of any shape
- cache: Input x, of same shape as dout
Returns:
- dx: Gradient with respect to x
"""
dx, x = None, cache
###########################################################################
# TODO: Implement the ReLU backward pass. #
###########################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
x = cache
dx = (x > 0).astype(int) * dout
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
###########################################################################
# END OF YOUR CODE #
###########################################################################
return dx
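Since ReLU zeroes out negative inputs, the backward pass lets the upstream gradient through only where the input was positive. A tiny example:
x = np.array([[-1.0, 2.0],
              [ 3.0, -4.0]])
dout = np.full_like(x, 10.0)
out, cache = relu_forward(x)
dx = relu_backward(dout, cache)
print(out)  # [[0. 2.], [3. 0.]]
print(dx)   # [[ 0. 10.], [10. 0.]]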
Next, we use the layers implemented above to assemble a two-layer neural network.
TwoLayerNet
The two-layer network has the architecture: affine - relu - affine - softmax.
Remember to add the L2 regularization term to the loss.
class TwoLayerNet(object):
"""
A two-layer fully-connected neural network with ReLU nonlinearity and
softmax loss that uses a modular layer design. We assume an input dimension
of D, a hidden dimension of H, and perform classification over C classes.
The architecure should be affine - relu - affine - softmax.
Note that this class does not implement gradient descent; instead, it
will interact with a separate Solver object that is responsible for running
optimization.
The learnable parameters of the model are stored in the dictionary
self.params that maps parameter names to numpy arrays.
"""
def __init__(
self,
input_dim=3 * 32 * 32,
hidden_dim=100,
num_classes=10,
weight_scale=1e-3,
reg=0.0,
):
"""
Initialize a new network.
Inputs:
- input_dim: An integer giving the size of the input
- hidden_dim: An integer giving the size of the hidden layer
- num_classes: An integer giving the number of classes to classify
- weight_scale: Scalar giving the standard deviation for random
initialization of the weights.
- reg: Scalar giving L2 regularization strength.
"""
self.params = {}
self.reg = reg
############################################################################
# TODO: Initialize the weights and biases of the two-layer net. Weights #
# should be initialized from a Gaussian centered at 0.0 with #
# standard deviation equal to weight_scale, and biases should be #
# initialized to zero. All weights and biases should be stored in the #
# dictionary self.params, with first layer weights #
# and biases using the keys 'W1' and 'b1' and second layer #
# weights and biases using the keys 'W2' and 'b2'. #
############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
self.params['W1'] = weight_scale * np.random.randn(input_dim, hidden_dim)
self.params['b1'] = np.zeros(hidden_dim)
self.params['W2'] = weight_scale * np.random.randn(hidden_dim, num_classes)
self.params['b2'] = np.zeros(num_classes)
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
############################################################################
# END OF YOUR CODE #
############################################################################
def loss(self, X, y=None):
"""
Compute loss and gradient for a minibatch of data.
Inputs:
- X: Array of input data of shape (N, d_1, ..., d_k)
- y: Array of labels, of shape (N,). y[i] gives the label for X[i].
Returns:
If y is None, then run a test-time forward pass of the model and return:
- scores: Array of shape (N, C) giving classification scores, where
scores[i, c] is the classification score for X[i] and class c.
If y is not None, then run a training-time forward and backward pass and
return a tuple of:
- loss: Scalar value giving the loss
- grads: Dictionary with the same keys as self.params, mapping parameter
names to gradients of the loss with respect to those parameters.
"""
scores = None
############################################################################
# TODO: Implement the forward pass for the two-layer net, computing the #
# class scores for X and storing them in the scores variable. #
############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
#hidden N*H
hidden, cache = affine_forward(X, self.params['W1'], self.params['b1'])
hidden_relu, cache = relu_forward(hidden)
#scores N*C
scores, cache = affine_forward(hidden_relu, self.params['W2'], self.params['b2'])
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
############################################################################
# END OF YOUR CODE #
############################################################################
# If y is None then we are in test mode so just return scores
if y is None:
return scores
loss, grads = 0, {}
############################################################################
# TODO: Implement the backward pass for the two-layer net. Store the loss #
# in the loss variable and gradients in the grads dictionary. Compute data #
# loss using softmax, and make sure that grads[k] holds the gradients for #
# self.params[k]. Don't forget to add L2 regularization! #
# #
# NOTE: To ensure that your implementation matches ours and you pass the #
# automated tests, make sure that your L2 regularization includes a factor #
# of 0.5 to simplify the expression for the gradient. #
############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
#hidden N*H
W1 = self.params['W1']
b1 = self.params['b1']
W2 = self.params['W2']
b2 = self.params['b2']
reg = self.reg
hidden, cache1 = affine_forward(X, W1, b1)
hidden_relu, cache2 = relu_forward(hidden)
#scores N*C
scores, cache3 = affine_forward(hidden_relu, W2, b2)
loss, dscores = softmax_loss(scores, y)
loss += 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
dhidden_relu, dW2, db2 = affine_backward(dscores, cache3)
grads['W2'] = dW2 + reg * W2
grads['b2'] = db2
dhidden = relu_backward(dhidden_relu, cache2)
dX, dW1, db1 = affine_backward(dhidden, cache1)
grads['W1'] = dW1 + reg * W1
grads['b1'] = db1
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
############################################################################
# END OF YOUR CODE #
############################################################################
return loss, grads
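The loss method above also calls softmax_loss, which is provided elsewhere in the assignment's layers.py. For reference, a numerically stable sketch consistent with how it is used here (scores of shape (N, C), labels of shape (N,)):
def softmax_loss(x, y):
    """Softmax loss and gradient with respect to the scores x."""
    shifted = x - np.max(x, axis=1, keepdims=True)  # subtract row max for stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    probs = np.exp(log_probs)
    N = x.shape[0]
    loss = -np.sum(log_probs[np.arange(N), y]) / N  # average negative log-likelihood
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N
    return loss, dx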
Solver
Next, we use a Solver object to train the two-layer network implemented above.
model = TwoLayerNet()
solver = None
##############################################################################
# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least #
# 50% accuracy on the validation set. #
##############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
solver = Solver(model, data,
update_rule='sgd',
optim_config={
'learning_rate': 1e-3,
},
lr_decay=0.95,
num_epochs=10, batch_size=100,
print_every=100)
solver.train()
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
##############################################################################
# END OF YOUR CODE #
##############################################################################
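A note on the inputs: data is the dictionary returned by the assignment's CIFAR-10 loading code, holding the training and validation splits under the keys 'X_train', 'y_train', 'X_val' and 'y_val' (the 490 iterations per epoch in the log below correspond to 49,000 training images with batch_size=100). After training, the Solver keeps its statistics (loss_history, train_acc_history, val_acc_history, best_val_acc), so the learning curves can be plotted, for example:
import matplotlib.pyplot as plt

plt.subplot(2, 1, 1)
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')
plt.ylabel('Training loss')

plt.subplot(2, 1, 2)
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()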
Training output:
(Iteration 1 / 4900) loss: 2.304060
(Epoch 0 / 10) train acc: 0.116000; val_acc: 0.094000
(Iteration 101 / 4900) loss: 1.829613
(Iteration 201 / 4900) loss: 1.857390
(Iteration 301 / 4900) loss: 1.744448
(Iteration 401 / 4900) loss: 1.420187
(Epoch 1 / 10) train acc: 0.407000; val_acc: 0.422000
(Iteration 501 / 4900) loss: 1.565913
(Iteration 601 / 4900) loss: 1.700510
(Iteration 701 / 4900) loss: 1.732213
(Iteration 801 / 4900) loss: 1.688361
(Iteration 901 / 4900) loss: 1.439529
(Epoch 2 / 10) train acc: 0.497000; val_acc: 0.468000
(Iteration 1001 / 4900) loss: 1.385772
(Iteration 1101 / 4900) loss: 1.278401
(Iteration 1201 / 4900) loss: 1.641580
(Iteration 1301 / 4900) loss: 1.438847
(Iteration 1401 / 4900) loss: 1.172536
(Epoch 3 / 10) train acc: 0.490000; val_acc: 0.466000
(Iteration 1501 / 4900) loss: 1.346286
(Iteration 1601 / 4900) loss: 1.268492
(Iteration 1701 / 4900) loss: 1.318215
(Iteration 1801 / 4900) loss: 1.395750
(Iteration 1901 / 4900) loss: 1.338233
(Epoch 4 / 10) train acc: 0.532000; val_acc: 0.497000
(Iteration 2001 / 4900) loss: 1.343165
(Iteration 2101 / 4900) loss: 1.393173
(Iteration 2201 / 4900) loss: 1.276734
(Iteration 2301 / 4900) loss: 1.287951
(Iteration 2401 / 4900) loss: 1.352778
(Epoch 5 / 10) train acc: 0.525000; val_acc: 0.475000
(Iteration 2501 / 4900) loss: 1.390234
(Iteration 2601 / 4900) loss: 1.276361
(Iteration 2701 / 4900) loss: 1.111768
(Iteration 2801 / 4900) loss: 1.271688
(Iteration 2901 / 4900) loss: 1.272039
(Epoch 6 / 10) train acc: 0.546000; val_acc: 0.509000
(Iteration 3001 / 4900) loss: 1.304489
(Iteration 3101 / 4900) loss: 1.346667
(Iteration 3201 / 4900) loss: 1.325510
(Iteration 3301 / 4900) loss: 1.392728
(Iteration 3401 / 4900) loss: 1.402001
(Epoch 7 / 10) train acc: 0.567000; val_acc: 0.505000
(Iteration 3501 / 4900) loss: 1.319024
(Iteration 3601 / 4900) loss: 1.153287
(Iteration 3701 / 4900) loss: 1.180922
(Iteration 3801 / 4900) loss: 1.093164
(Iteration 3901 / 4900) loss: 1.135902
(Epoch 8 / 10) train acc: 0.568000; val_acc: 0.490000
(Iteration 4001 / 4900) loss: 1.191735
(Iteration 4101 / 4900) loss: 1.359396
(Iteration 4201 / 4900) loss: 1.227283
(Iteration 4301 / 4900) loss: 1.024113
(Iteration 4401 / 4900) loss: 1.327583
(Epoch 9 / 10) train acc: 0.592000; val_acc: 0.504000
(Iteration 4501 / 4900) loss: 0.963330
(Iteration 4601 / 4900) loss: 1.445619
(Iteration 4701 / 4900) loss: 1.007542
(Iteration 4801 / 4900) loss: 1.005175
(Epoch 10 / 10) train acc: 0.611000; val_acc: 0.512000
FullyConnectedNet
Next, we use the layers we already have to assemble a fully connected network with an arbitrary number of hidden layers.
class FullyConnectedNet(object):
"""
A fully-connected neural network with an arbitrary number of hidden layers,
ReLU nonlinearities, and a softmax loss function. This will also implement
dropout and batch/layer normalization as options. For a network with L layers,
the architecture will be
{affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax
where batch/layer normalization and dropout are optional, and the {...} block is
repeated L - 1 times.
Similar to the TwoLayerNet above, learnable parameters are stored in the
self.params dictionary and will be learned using the Solver class.
"""
def __init__(
self,
hidden_dims,
input_dim=3 * 32 * 32,
num_classes=10,
dropout=1,
normalization=None,
reg=0.0,
weight_scale=1e-2,
dtype=np.float32,
seed=None,
):
"""
Initialize a new FullyConnectedNet.
Inputs:
- hidden_dims: A list of integers giving the size of each hidden layer.
- input_dim: An integer giving the size of the input.
- num_classes: An integer giving the number of classes to classify.
- dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1 then
the network should not use dropout at all.
- normalization: What type of normalization the network should use. Valid values
are "batchnorm", "layernorm", or None for no normalization (the default).
- reg: Scalar giving L2 regularization strength.
- weight_scale: Scalar giving the standard deviation for random
initialization of the weights.
- dtype: A numpy datatype object; all computations will be performed using
this datatype. float32 is faster but less accurate, so you should use
float64 for numeric gradient checking.
- seed: If not None, then pass this random seed to the dropout layers. This
will make the dropout layers deteriminstic so we can gradient check the
model.
"""
self.normalization = normalization
self.use_dropout = dropout != 1
self.reg = reg
self.num_layers = 1 + len(hidden_dims)
self.dtype = dtype
self.params = {}
############################################################################
# TODO: Initialize the parameters of the network, storing all values in #
# the self.params dictionary. Store weights and biases for the first layer #
# in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
# initialized from a normal distribution centered at 0 with standard #
# deviation equal to weight_scale. Biases should be initialized to zero. #
# #
# When using batch normalization, store scale and shift parameters for the #
# first layer in gamma1 and beta1; for the second layer use gamma2 and #
# beta2, etc. Scale parameters should be initialized to ones and shift #
# parameters should be initialized to zeros. #
############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
for i in range(self.num_layers):
if i == 0:
self.params['W1'] = np.random.randn(input_dim, hidden_dims[i]) * weight_scale
self.params['b1'] = np.zeros(hidden_dims[i])
elif i == self.num_layers -1:
self.params['W'+str(i+1)] = np.random.randn(hidden_dims[i-1], num_classes) * weight_scale
self.params['b'+str(i+1)] = np.zeros(num_classes)
else:
self.params['W'+str(i+1)] = np.random.randn(hidden_dims[i-1], hidden_dims[i]) * weight_scale
self.params['b'+str(i+1)] = np.zeros(hidden_dims[i])
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
############################################################################
# END OF YOUR CODE #
############################################################################
# When using dropout we need to pass a dropout_param dictionary to each
# dropout layer so that the layer knows the dropout probability and the mode
# (train / test). You can pass the same dropout_param to each dropout layer.
self.dropout_param = {}
if self.use_dropout:
self.dropout_param = {"mode": "train", "p": dropout}
if seed is not None:
self.dropout_param["seed"] = seed
# With batch normalization we need to keep track of running means and
# variances, so we need to pass a special bn_param object to each batch
# normalization layer. You should pass self.bn_params[0] to the forward pass
# of the first batch normalization layer, self.bn_params[1] to the forward
# pass of the second batch normalization layer, etc.
self.bn_params = []
if self.normalization == "batchnorm":
self.bn_params = [{"mode": "train"} for i in range(self.num_layers - 1)]
if self.normalization == "layernorm":
self.bn_params = [{} for i in range(self.num_layers - 1)]
# Cast all parameters to the correct datatype
for k, v in self.params.items():
self.params[k] = v.astype(dtype)
def loss(self, X, y=None):
"""
Compute loss and gradient for the fully-connected net.
Input / output: Same as TwoLayerNet above.
"""
X = X.astype(self.dtype)
mode = "test" if y is None else "train"
# Set train/test mode for batchnorm params and dropout param since they
# behave differently during training and testing.
if self.use_dropout:
self.dropout_param["mode"] = mode
if self.normalization == "batchnorm":
for bn_param in self.bn_params:
bn_param["mode"] = mode
scores = None
############################################################################
# TODO: Implement the forward pass for the fully-connected net, computing #
# the class scores for X and storing them in the scores variable. #
# #
# When using dropout, you'll need to pass self.dropout_param to each #
# dropout forward pass. #
# #
# When using batch normalization, you'll need to pass self.bn_params[0] to #
# the forward pass for the first batch normalization layer, pass #
# self.bn_params[1] to the forward pass for the second batch normalization #
# layer, etc. #
############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
x = X.copy()
#cache = []
#cache_relu = []
#{affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax
for i in range(self.num_layers - 1):
w = self.params['W'+str(i+1)]
b = self.params['b'+str(i+1)]
x, cache_temp = affine_forward(x, w, b)
#cache.append(cache_temp)
x, cache_temp = relu_forward(x)
#cache_relu.append(cache_temp)
w = self.params['W'+str(self.num_layers)]
b = self.params['b'+str(self.num_layers)]
scores, cache_temp = affine_forward(x, w, b)
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
############################################################################
# END OF YOUR CODE #
############################################################################
# If test mode return early
if mode == "test":
return scores
loss, grads = 0.0, {}
############################################################################
# TODO: Implement the backward pass for the fully-connected net. Store the #
# loss in the loss variable and gradients in the grads dictionary. Compute #
# data loss using softmax, and make sure that grads[k] holds the gradients #
# for self.params[k]. Don't forget to add L2 regularization! #
# #
# When using batch/layer normalization, you don't need to regularize the scale #
# and shift parameters. #
# #
# NOTE: To ensure that your implementation matches ours and you pass the #
# automated tests, make sure that your L2 regularization includes a factor #
# of 0.5 to simplify the expression for the gradient. #
############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
#do forward propagation
x = X.copy()
cache = []
cache_relu = []
#{affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax
for i in range(self.num_layers - 1):
w = self.params['W'+str(i+1)]
b = self.params['b'+str(i+1)]
x, cache_temp = affine_forward(x, w, b)
cache.append(cache_temp)
x, cache_temp = relu_forward(x)
cache_relu.append(cache_temp)
w = self.params['W'+str(self.num_layers)]
b = self.params['b'+str(self.num_layers)]
scores, cache_temp = affine_forward(x, w, b)
cache.append(cache_temp)
loss, dscores = softmax_loss(scores, y)
for i in range(self.num_layers):
w = self.params['W'+str(i+1)]
loss += 0.5 * self.reg * np.sum(w * w)
#do backward propagation
dx, dw, db = affine_backward(dscores, cache.pop())
grads['W'+str(self.num_layers)] = dw + self.reg * self.params['W'+str(self.num_layers)]
grads['b'+str(self.num_layers)] = db
for i in range(self.num_layers - 1)[::-1]:
dx = relu_backward(dx, cache_relu.pop())
dx, dw, db = affine_backward(dx, cache.pop())
grads['W'+str(i+1)] = dw + self.reg * self.params['W'+str(i+1)]
grads['b'+str(i+1)] = db
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
############################################################################
# END OF YOUR CODE #
############################################################################
return loss, grads
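As a quick sanity check, the initial softmax loss of a freshly initialized network with reg=0 should be close to log(C) for C classes. A small random problem (dimensions chosen arbitrarily for illustration):
np.random.seed(231)
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                          reg=0.0, weight_scale=5e-2, dtype=np.float64)
loss, grads = model.loss(X, y)
print('initial loss:', loss)  # expect a value near log(10) ≈ 2.3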
The remaining part implements the parameter update rules; for reference see https://cs231n.github.io/neural-networks-3/#sgd
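In equation form, the three update rules implemented below are (w are the parameters, dw the gradient, and η the learning rate):

SGD with momentum:
$$v \leftarrow \mu v - \eta\,dw, \qquad w \leftarrow w + v$$

RMSProp:
$$c \leftarrow \rho\,c + (1-\rho)\,dw^{2}, \qquad w \leftarrow w - \frac{\eta\,dw}{\sqrt{c} + \epsilon}$$

Adam (t is incremented before the bias correction, as the code notes):
$$m \leftarrow \beta_1 m + (1-\beta_1)\,dw,\quad v \leftarrow \beta_2 v + (1-\beta_2)\,dw^{2},\quad \hat m = \frac{m}{1-\beta_1^{t}},\quad \hat v = \frac{v}{1-\beta_2^{t}},\quad w \leftarrow w - \frac{\eta\,\hat m}{\sqrt{\hat v} + \epsilon}$$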
SGD+Momentum
def sgd_momentum(w, dw, config=None):
"""
Performs stochastic gradient descent with momentum.
config format:
- learning_rate: Scalar learning rate.
- momentum: Scalar between 0 and 1 giving the momentum value.
Setting momentum = 0 reduces to sgd.
- velocity: A numpy array of the same shape as w and dw used to store a
moving average of the gradients.
"""
if config is None:
config = {}
config.setdefault("learning_rate", 1e-2)
config.setdefault("momentum", 0.9)
v = config.get("velocity", np.zeros_like(w))
next_w = None
###########################################################################
# TODO: Implement the momentum update formula. Store the updated value in #
# the next_w variable. You should also use and update the velocity v. #
###########################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
v = config["momentum"] * v - dw * config["learning_rate"]
next_w = w + v
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
###########################################################################
# END OF YOUR CODE #
###########################################################################
config["velocity"] = v
return next_w, config
RMSProp
def rmsprop(w, dw, config=None):
"""
Uses the RMSProp update rule, which uses a moving average of squared
gradient values to set adaptive per-parameter learning rates.
config format:
- learning_rate: Scalar learning rate.
- decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
gradient cache.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- cache: Moving average of second moments of gradients.
"""
if config is None:
config = {}
config.setdefault("learning_rate", 1e-2)
config.setdefault("decay_rate", 0.99)
config.setdefault("epsilon", 1e-8)
config.setdefault("cache", np.zeros_like(w))
next_w = None
###########################################################################
# TODO: Implement the RMSprop update formula, storing the next value of w #
# in the next_w variable. Don't forget to update cache value stored in #
# config['cache']. #
###########################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dw**2)
next_w = w - config['learning_rate'] * dw / (np.sqrt(config['cache']) + config['epsilon'])
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
###########################################################################
# END OF YOUR CODE #
###########################################################################
return next_w, config
Adam
def adam(w, dw, config=None):
"""
Uses the Adam update rule, which incorporates moving averages of both the
gradient and its square and a bias correction term.
config format:
- learning_rate: Scalar learning rate.
- beta1: Decay rate for moving average of first moment of gradient.
- beta2: Decay rate for moving average of second moment of gradient.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- m: Moving average of gradient.
- v: Moving average of squared gradient.
- t: Iteration number.
"""
if config is None:
config = {}
config.setdefault("learning_rate", 1e-3)
config.setdefault("beta1", 0.9)
config.setdefault("beta2", 0.999)
config.setdefault("epsilon", 1e-8)
config.setdefault("m", np.zeros_like(w))
config.setdefault("v", np.zeros_like(w))
config.setdefault("t", 0)
next_w = None
###########################################################################
# TODO: Implement the Adam update formula, storing the next value of w in #
# the next_w variable. Don't forget to update the m, v, and t variables #
# stored in config. #
# #
# NOTE: In order to match the reference output, please modify t _before_ #
# using it in any calculations. #
###########################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
learning_rate = config['learning_rate']
beta1 = config['beta1']
beta2 = config['beta2']
epsilon = config['epsilon']
m = config['m']
v = config['v']
t = config['t'] + 1
m = beta1*m + (1-beta1)*dw
mt = m / (1-beta1**t)
v = beta2*v + (1-beta2)*(dw**2)
vt = v / (1-beta2**t)
next_w = w - learning_rate * mt / (np.sqrt(vt) + epsilon)  # avoid updating w in place
config['m'] = m
config['v'] = v
config['t'] = t
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
###########################################################################
# END OF YOUR CODE #
###########################################################################
return next_w, config
Tune your hyperparameters
Tune the hyperparameters to reach a higher validation accuracy (a simple random-search sketch is shown at the end of this section).
best_model = None
################################################################################
# TODO: Train the best FullyConnectedNet that you can on CIFAR-10. You might #
# find batch/layer normalization and dropout useful. Store your best model in #
# the best_model variable. #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
weight_scale = 6e-2 # Experiment with this!
learning_rate = 1e-3 # Experiment with this!
model = FullyConnectedNet([100, 100, 100, 100, 100],
weight_scale=weight_scale, dtype=np.float64)
solver = Solver(model, data,
print_every=100, num_epochs=10, batch_size=200,
update_rule='adam',
optim_config={
'learning_rate': learning_rate,
}
)
solver.train()
best_model = model
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
################################################################################
# END OF YOUR CODE #
################################################################################
Training output:
(Iteration 1 / 2450) loss: 5.424461
(Epoch 0 / 10) train acc: 0.125000; val_acc: 0.109000
(Iteration 101 / 2450) loss: 1.733870
(Iteration 201 / 2450) loss: 1.694563
(Epoch 1 / 10) train acc: 0.416000; val_acc: 0.382000
(Iteration 301 / 2450) loss: 1.575924
(Iteration 401 / 2450) loss: 1.468797
(Epoch 2 / 10) train acc: 0.494000; val_acc: 0.450000
(Iteration 501 / 2450) loss: 1.414597
(Iteration 601 / 2450) loss: 1.595532
(Iteration 701 / 2450) loss: 1.466879
(Epoch 3 / 10) train acc: 0.514000; val_acc: 0.487000
(Iteration 801 / 2450) loss: 1.413571
(Iteration 901 / 2450) loss: 1.368283
(Epoch 4 / 10) train acc: 0.510000; val_acc: 0.476000
(Iteration 1001 / 2450) loss: 1.484215
(Iteration 1101 / 2450) loss: 1.310287
(Iteration 1201 / 2450) loss: 1.405249
(Epoch 5 / 10) train acc: 0.522000; val_acc: 0.478000
(Iteration 1301 / 2450) loss: 1.271081
(Iteration 1401 / 2450) loss: 1.190293
(Epoch 6 / 10) train acc: 0.574000; val_acc: 0.509000
(Iteration 1501 / 2450) loss: 1.358415
(Iteration 1601 / 2450) loss: 1.257771
(Iteration 1701 / 2450) loss: 1.116963
(Epoch 7 / 10) train acc: 0.553000; val_acc: 0.483000
(Iteration 1801 / 2450) loss: 1.230878
(Iteration 1901 / 2450) loss: 1.226994
(Epoch 8 / 10) train acc: 0.604000; val_acc: 0.483000
(Iteration 2001 / 2450) loss: 1.127437
(Iteration 2101 / 2450) loss: 1.123277
(Iteration 2201 / 2450) loss: 1.277662
(Epoch 9 / 10) train acc: 0.605000; val_acc: 0.506000
(Iteration 2301 / 2450) loss: 1.025210
(Iteration 2401 / 2450) loss: 0.981276
(Epoch 10 / 10) train acc: 0.610000; val_acc: 0.497000
Final accuracy on the validation and test sets:
Validation set accuracy: 0.509
Test set accuracy: 0.502
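To push the validation accuracy further, a natural next step is a small random search over the learning rate and weight scale, keeping the model with the best validation accuracy. A rough sketch (the search ranges are just a starting guess, not tuned values):
best_val = -1.0
for _ in range(10):
    lr = 10 ** np.random.uniform(-4.0, -2.5)  # log-uniform learning rate
    ws = 10 ** np.random.uniform(-2.0, -1.0)  # log-uniform weight scale
    model = FullyConnectedNet([100, 100, 100], weight_scale=ws)
    solver = Solver(model, data,
                    update_rule='adam',
                    optim_config={'learning_rate': lr},
                    num_epochs=5, batch_size=200, verbose=False)
    solver.train()
    print('lr %.2e, ws %.2e -> val acc %.3f' % (lr, ws, solver.best_val_acc))
    if solver.best_val_acc > best_val:
        best_val = solver.best_val_acc
        best_model = model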