CS231n Assignment A2 Q1: FullyConnectedNets

This notebook covers the following tasks:
1. Gradient checking: compare the analytic gradients against numerically computed gradients to confirm that the forward and backward passes are implemented correctly.
2. Overfitting a small dataset (50 examples here) to confirm that the model is at least capable of fitting the problem.
3. Changing the update rules and comparing SGD, SGD+Momentum, RMSProp, and Adam (with AdaGrad discussed in an inline question).
Along the way I reviewed the differences, strengths, and weaknesses of each optimizer; learned how to plot the accuracy and loss of several update rules in the same Matplotlib figure; and realized that with autoreload you can simply import the module in a notebook cell and run it there, instead of editing the .py file and re-running it from the command line every time.
Below is the FullyConnectedNets notebook (.ipynb):

# Multi-Layer Fully Connected Network
In this exercise, you will implement a fully connected network with an arbitrary number of hidden layers.

Read through the `FullyConnectedNet` class in the file `cs231n/classifiers/fc_net.py`.

Implement the network initialization, forward pass, and backward pass. Throughout this assignment, you will be implementing layers in `cs231n/layers.py`. You can re-use your implementations for `affine_forward`, `affine_backward`, `relu_forward`, `relu_backward`, and `softmax_loss` from Assignment 1. For right now, don't worry about implementing dropout or batch/layer normalization yet, as you will add those features later.



```python
# Setup cell.
import time
import numpy as np
import matplotlib.pyplot as plt
from cs231n.classifiers.fc_net import *
from cs231n.data_utils import get_CIFAR10_data
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs231n.solver import Solver

%matplotlib inline
plt.rcParams["figure.figsize"] = (10.0, 8.0)  # Set default size of plots.
plt.rcParams["image.interpolation"] = "nearest"
plt.rcParams["image.cmap"] = "gray"

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """Returns relative error."""
    # Relative error between the analytic and numerical gradients, used for gradient checking.
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))
```
=========== You can safely ignore the message below if you are NOT working on ConvolutionalNetworks.ipynb ===========
You will need to compile a Cython extension for a portion of this assignment.
The instructions to do this will be given in a section of the notebook below.
# Load the (preprocessed) CIFAR-10 data.
data = get_CIFAR10_data()
for k, v in list(data.items()):
    print(f"{k}: {v.shape}")
X_train: (49000, 3, 32, 32)
y_train: (49000,)
X_val: (1000, 3, 32, 32)
y_val: (1000,)
X_test: (1000, 3, 32, 32)
y_test: (1000,)

# Initial Loss and Gradient Check

As a sanity check, run the following to check the initial loss and to gradient check the network both with and without regularization. This is a good way to see if the initial losses seem reasonable.

For gradient checking, you should expect to see errors around 1e-7 or less.

np.random.seed(231)
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))
# Randomly generate X and y for the gradient check; the CIFAR-10 data is not used here.
for reg in [0, 3.14, 0.1, 0.001]:
    print("Running check with reg = ", reg)
    model = FullyConnectedNet(
        [H1, H2],
        input_dim=D,
        num_classes=C,
        reg=reg,
        weight_scale=5e-2,
        dtype=np.float64
    )

    loss, grads = model.loss(X, y)
    print("Initial loss: ", loss)

    # Most of the errors should be on the order of e-7 or smaller.   
    # NOTE: It is fine however to see an error for W2 on the order of e-5
    # for the check when reg = 0.0: the numerical gradient is only an approximation, and without
    # regularization some gradient entries can be very small, which inflates the relative error.
    for name in sorted(grads):
        # grads is a dict mapping parameter names ('W1', 'b1', 'W2', ...) to their gradients.
        f = lambda _: model.loss(X, y)[0]
        grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)
        # eval_numerical_gradient(f, x, verbose=True, h=0.00001) expects f to be a function of x; here f is
        # a lambda that ignores its argument and just returns the loss, because model.loss reads X, y and
        # the current parameters from the model object. eval_numerical_gradient perturbs model.params[name]
        # in place, so f(x) and f(x + h) evaluate the loss at W and at W + h respectively.
        print(f"{name} relative error: {rel_error(grad_num, grads[name])}")
#     Relative error > 1e-2: the gradient is probably wrong.
#     1e-4 < relative error < 1e-2: you should feel uncomfortable about this value.
#     Relative error < 1e-4: usually fine for objectives with kinks (e.g. ReLU); if the objective has no
#     kinks (e.g. tanh or softmax only), this is still too high.
#     1e-7 or smaller: good result, you can be happy.
#     The deeper the network, the larger the relative error tends to be, since errors accumulate layer by
#     layer. For a 10-layer network, 1e-2 on the input gradients may be acceptable, whereas 1e-2 for a
#     single differentiable function usually means the gradient implementation is wrong.
Running check with reg =  0
Initial loss:  2.3004790897684924
W1 relative error: 7.696805129597462e-08
W2 relative error: 1.7087519156186903e-05
W3 relative error: 2.9508423550094495e-07
b1 relative error: 4.660094650186831e-09
b2 relative error: 2.085654124402131e-09
b3 relative error: 6.598642296022133e-11
Running check with reg =  3.14
Initial loss:  7.052114776533016
W1 relative error: 3.904542008453064e-09
W2 relative error: 6.86942277940646e-08
W3 relative error: 2.1311298702113723e-08
b1 relative error: 1.168319680050491e-08
b2 relative error: 1.7223752732008252e-09
b3 relative error: 1.3200479211447775e-10
Running check with reg =  0.1
Initial loss:  2.443068868512182
W1 relative error: 1.8834515350563675e-07
W2 relative error: 1.2829883899459636e-06
W3 relative error: 1.0870226373313394e-07
b1 relative error: 8.964994918407738e-09
b2 relative error: 1.0152013964125051e-08
b3 relative error: 1.2508015147866705e-10
Running check with reg =  0.001
Initial loss:  2.306880779919697
W1 relative error: 5.77846154296789e-06
W2 relative error: 1.1435009011947694e-05
W3 relative error: 1.4427610506984497e-05
b1 relative error: 1.875471113529925e-07
b2 relative error: 2.249384905396536e-09
b3 relative error: 9.869485336053175e-11
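
To make the mechanics of the check explicit, here is a minimal standalone sketch of a central-difference numerical gradient (the same idea as `eval_numerical_gradient`) applied to a toy function with a known analytic gradient. The helper and the toy function are my own illustrations, not part of the assignment code:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central-difference estimate of df/dx, perturbing one entry of x at a time."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)        # f evaluated with only entry idx perturbed by +h
        x[idx] = old - h
        f_minus = f(x)       # f evaluated with entry idx perturbed by -h
        x[idx] = old         # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
        it.iternext()
    return grad

# Toy objective with a known analytic gradient: f(w) = sum(w**3), so df/dw = 3 * w**2.
w = np.random.randn(4, 5)
grad_analytic = 3 * w ** 2
grad_numeric = numerical_gradient(lambda x: np.sum(x ** 3), w)
rel_err = np.max(np.abs(grad_analytic - grad_numeric) /
                 np.maximum(1e-8, np.abs(grad_analytic) + np.abs(grad_numeric)))
print(rel_err)  # should be tiny, typically around 1e-8 or smaller
```

If the analytic gradient were implemented incorrectly, this relative error would typically jump to 1e-2 or worse, which is exactly the threshold behaviour described in the comments above.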

As another sanity check, make sure your network can overfit on a small dataset of 50 images. First, we will try a three-layer network with 100 units in each hidden layer. In the following cell, tweak the learning rate and weight initialization scale to overfit and achieve 100% training accuracy within 20 epochs.

# TODO: Use a three-layer Net to overfit 50 training examples by 
# tweaking just the learning rate and initialization scale.

num_train = 50
small_data = {
  "X_train": data["X_train"][:num_train],
  "y_train": data["y_train"][:num_train],
  "X_val": data["X_val"],
  "y_val": data["y_val"],
}

weight_scale = 1e-2   # Experiment with this!
learning_rate = 1e-2  # Experiment with this!
model = FullyConnectedNet(
    [100, 100],
    weight_scale=weight_scale,
    dtype=np.float64
)
solver = Solver(
    model,
    small_data,
    print_every=10,
    num_epochs=20,
    batch_size=25,
    update_rule="sgd",
    optim_config={"learning_rate": learning_rate},
)
solver.train()

plt.plot(solver.loss_history)
plt.title("Training loss history")
plt.xlabel("Iteration")
plt.ylabel("Training loss")
plt.grid(linestyle='--', linewidth=0.5)
plt.show()
(Iteration 1 / 40) loss: 2.321299
(Epoch 0 / 20) train acc: 0.240000; val_acc: 0.113000
(Epoch 1 / 20) train acc: 0.440000; val_acc: 0.157000
(Epoch 2 / 20) train acc: 0.420000; val_acc: 0.162000
(Epoch 3 / 20) train acc: 0.540000; val_acc: 0.166000
(Epoch 4 / 20) train acc: 0.580000; val_acc: 0.170000
(Epoch 5 / 20) train acc: 0.660000; val_acc: 0.195000
(Iteration 11 / 40) loss: 1.328666
(Epoch 6 / 20) train acc: 0.860000; val_acc: 0.178000
(Epoch 7 / 20) train acc: 0.820000; val_acc: 0.188000
(Epoch 8 / 20) train acc: 0.940000; val_acc: 0.194000
(Epoch 9 / 20) train acc: 0.840000; val_acc: 0.178000
(Epoch 10 / 20) train acc: 0.940000; val_acc: 0.206000
(Iteration 21 / 40) loss: 0.330410
(Epoch 11 / 20) train acc: 0.900000; val_acc: 0.187000
(Epoch 12 / 20) train acc: 0.880000; val_acc: 0.180000
(Epoch 13 / 20) train acc: 0.960000; val_acc: 0.191000
(Epoch 14 / 20) train acc: 0.980000; val_acc: 0.219000
(Epoch 15 / 20) train acc: 1.000000; val_acc: 0.215000
(Iteration 31 / 40) loss: 0.116799
(Epoch 16 / 20) train acc: 0.980000; val_acc: 0.207000
(Epoch 17 / 20) train acc: 0.960000; val_acc: 0.196000
(Epoch 18 / 20) train acc: 1.000000; val_acc: 0.194000
(Epoch 19 / 20) train acc: 1.000000; val_acc: 0.200000
(Epoch 20 / 20) train acc: 1.000000; val_acc: 0.195000

![Training loss history (three-layer net overfitting 50 examples)](output_9_1.png)

Now, try to use a five-layer network with 100 units on each layer to overfit on 50 training examples. Again, you will have to adjust the learning rate and weight initialization scale, but you should be able to achieve 100% training accuracy within 20 epochs.

# TODO: Use a five-layer Net to overfit 50 training examples by 
# tweaking just the learning rate and initialization scale.

num_train = 50
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

learning_rate = 1e-3  # Experiment with this!
weight_scale = 1e-1   # Experiment with this!
# For the same data, a deeper network needs a smaller learning rate and a larger weight-initialization std
# to overfit; if the learning rate is not reduced, the loss easily blows up to NaN.
model = FullyConnectedNet(
    [100, 100, 100, 100],
    weight_scale=weight_scale,
    dtype=np.float64
)
solver = Solver(
    model,
    small_data,
    print_every=10,
    num_epochs=20,
    batch_size=25,
    update_rule='sgd',
    optim_config={'learning_rate': learning_rate},
)
solver.train()

plt.plot(solver.loss_history)
plt.title('Training loss history')
plt.xlabel('Iteration')
plt.ylabel('Training loss')
plt.grid(linestyle='--', linewidth=0.5)
plt.show()
(Iteration 1 / 40) loss: 111.214847
(Epoch 0 / 20) train acc: 0.160000; val_acc: 0.116000
(Epoch 1 / 20) train acc: 0.180000; val_acc: 0.127000
(Epoch 2 / 20) train acc: 0.280000; val_acc: 0.101000
(Epoch 3 / 20) train acc: 0.420000; val_acc: 0.107000
(Epoch 4 / 20) train acc: 0.600000; val_acc: 0.105000
(Epoch 5 / 20) train acc: 0.820000; val_acc: 0.120000
(Iteration 11 / 40) loss: 12.273820
(Epoch 6 / 20) train acc: 0.780000; val_acc: 0.111000
(Epoch 7 / 20) train acc: 0.940000; val_acc: 0.138000
(Epoch 8 / 20) train acc: 0.960000; val_acc: 0.137000
(Epoch 9 / 20) train acc: 0.980000; val_acc: 0.137000
(Epoch 10 / 20) train acc: 1.000000; val_acc: 0.134000
(Iteration 21 / 40) loss: 0.000077
(Epoch 11 / 20) train acc: 1.000000; val_acc: 0.134000
(Epoch 12 / 20) train acc: 1.000000; val_acc: 0.134000
(Epoch 13 / 20) train acc: 1.000000; val_acc: 0.134000
(Epoch 14 / 20) train acc: 1.000000; val_acc: 0.134000
(Epoch 15 / 20) train acc: 1.000000; val_acc: 0.134000
(Iteration 31 / 40) loss: 0.000311
(Epoch 16 / 20) train acc: 1.000000; val_acc: 0.134000
(Epoch 17 / 20) train acc: 1.000000; val_acc: 0.134000
(Epoch 18 / 20) train acc: 1.000000; val_acc: 0.134000
(Epoch 19 / 20) train acc: 1.000000; val_acc: 0.134000
(Epoch 20 / 20) train acc: 1.000000; val_acc: 0.134000

![Training loss history (five-layer net overfitting 50 examples)](output_11_1.png)
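
Rather than tuning `learning_rate` and `weight_scale` entirely by hand, one option is a small grid search. The sketch below reuses `small_data`, `FullyConnectedNet`, and `Solver` from the cells above; the particular grid values are just my own guesses:

```python
import itertools

results = {}
for lr, ws in itertools.product([1e-1, 1e-2, 1e-3], [1e-1, 1e-2, 1e-3]):
    model = FullyConnectedNet([100, 100, 100, 100], weight_scale=ws, dtype=np.float64)
    solver = Solver(
        model,
        small_data,
        num_epochs=20,
        batch_size=25,
        update_rule='sgd',
        optim_config={'learning_rate': lr},
        verbose=False,
    )
    solver.train()
    results[(lr, ws)] = max(solver.train_acc_history)

# Report which (learning rate, weight scale) pairs managed to fit the 50 examples.
for (lr, ws), acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"lr={lr:.0e}  weight_scale={ws:.0e}  best train acc={acc:.2f}")
```

Combinations that diverge (loss overflowing to inf/NaN) generally just show up as low training accuracy here rather than crashing the loop.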

Inline Question 1:

Did you notice anything about the comparative difficulty of training the three-layer network vs. training the five-layer network? In particular, based on your experience, which network seemed more sensitive to the initialization scale? Why do you think that is the case?

Answer:

[FILL THIS IN]

#The five-layer network is more sensitive to the weight-initialization scale. With more layers, the
#initialization scale is applied at every layer, so its effect compounds: activations (and the gradients
#flowing back through them) shrink or blow up multiplicatively with depth, which makes a badly chosen scale
#much more damaging for the deeper network.
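
A quick way to see why depth amplifies the sensitivity to `weight_scale` (a standalone NumPy sketch of my own, not part of the assignment) is to push random data through a stack of affine + ReLU layers and watch how the activation scale changes:

```python
np.random.seed(0)
x0 = np.random.randn(50, 100)  # 50 examples, 100 features

for weight_scale in [1e-2, 1e-1]:
    x = x0
    stds = []
    for layer in range(5):
        W = weight_scale * np.random.randn(100, 100)
        x = np.maximum(0, x.dot(W))  # affine + ReLU, biases omitted
        stds.append(x.std())
    print(f"weight_scale={weight_scale:.0e}: activation std per layer =",
          ["%.1e" % s for s in stds])
```

With a scale of 1e-2 the activations shrink by more than an order of magnitude per layer, so after five layers the signal (and the gradients flowing back through the same weights) is vanishingly small, whereas 1e-1 keeps the activations at a usable scale. A three-layer network has fewer multiplications for this mismatch to compound through, which is why it tolerated weight_scale = 1e-2.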

# Update rules

So far we have used vanilla stochastic gradient descent (SGD) as our update rule. More sophisticated update rules can make it easier to train deep networks. We will implement a few of the most commonly used update rules and compare them to vanilla SGD.

# SGD+Momentum

Stochastic gradient descent with momentum is a widely used update rule that tends to make deep networks converge faster than vanilla stochastic gradient descent. See the Momentum Update section at http://cs231n.github.io/neural-networks-3/#sgd for more information.

Open the file cs231n/optim.py and read the documentation at the top of the file to make sure you understand the API. Implement the SGD+momentum update rule in the function sgd_momentum and run the following to check your implementation. You should see errors less than e-8.
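
Concretely, the rule being tested here keeps a per-parameter velocity and steps along it. A one-step worked example with scalar values (my own numbers, matching the rule implemented in `sgd_momentum` further down) looks like this:

```python
# One scalar step of SGD+momentum.
mu, lr = 0.9, 1e-3          # momentum coefficient and learning rate
w, v, dw = 0.5, 0.2, -1.0   # current weight, velocity, and gradient
v = mu * v - lr * dw        # 0.9 * 0.2 - 1e-3 * (-1.0) = 0.181
w = w + v                   # 0.5 + 0.181 = 0.681
print(w, v)                 # prints roughly 0.681 0.181
```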

from cs231n.optim import sgd_momentum

N, D = 4, 5
w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
v = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

config = {"learning_rate": 1e-3, "velocity": v}
next_w, _ = sgd_momentum(w, dw, config=config)

expected_next_w = np.asarray([
  [ 0.1406,      0.20738947,  0.27417895,  0.34096842,  0.40775789],
  [ 0.47454737,  0.54133684,  0.60812632,  0.67491579,  0.74170526],
  [ 0.80849474,  0.87528421,  0.94207368,  1.00886316,  1.07565263],
  [ 1.14244211,  1.20923158,  1.27602105,  1.34281053,  1.4096    ]])
expected_velocity = np.asarray([
  [ 0.5406,      0.55475789,  0.56891579, 0.58307368,  0.59723158],
  [ 0.61138947,  0.62554737,  0.63970526,  0.65386316,  0.66802105],
  [ 0.68217895,  0.69633684,  0.71049474,  0.72465263,  0.73881053],
  [ 0.75296842,  0.76712632,  0.78128421,  0.79544211,  0.8096    ]])

# Should see relative errors around e-8 or less
print("next_w error: ", rel_error(next_w, expected_next_w))
print("velocity error: ", rel_error(expected_velocity, config["velocity"]))
next_w error:  8.882347033505819e-09
velocity error:  4.269287743278663e-09

Once you have done so, run the following to train a six-layer network with both SGD and SGD+momentum. You should see the SGD+momentum update rule converge faster.

num_train = 4000
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

solvers = {}

for update_rule in ['sgd', 'sgd_momentum']:
    print('Running with ', update_rule)
    model = FullyConnectedNet(
        [100, 100, 100, 100, 100],
        weight_scale=5e-2
    )

    solver = Solver(
        model,
        small_data,
        num_epochs=5,
        batch_size=100,
        update_rule=update_rule,
        optim_config={'learning_rate': 5e-3},
        verbose=True,
    )
    solvers[update_rule] = solver
    solver.train()

fig, axes = plt.subplots(3, 1, figsize=(15, 15))

axes[0].set_title('Training loss')
axes[0].set_xlabel('Iteration')
axes[1].set_title('Training accuracy')
axes[1].set_xlabel('Epoch')
axes[2].set_title('Validation accuracy')
axes[2].set_xlabel('Epoch')

for update_rule, solver in solvers.items():
    axes[0].plot(solver.loss_history, label=f"loss_{update_rule}")
    axes[1].plot(solver.train_acc_history, label=f"train_acc_{update_rule}")
    axes[2].plot(solver.val_acc_history, label=f"val_acc_{update_rule}")
    
for ax in axes:
    ax.legend(loc="best", ncol=4)
    ax.grid(linestyle='--', linewidth=0.5)
# axes is an array of Axes objects. ax.legend(loc="best", ncol=4) adds a legend to each subplot, where
# loc="best" lets Matplotlib pick the position automatically and ncol=4 lays the entries out in four columns.
# ax.grid(linestyle='--', linewidth=0.5) draws dashed grid lines of width 0.5 on each subplot.
plt.show()
Running with  sgd
(Iteration 1 / 200) loss: 2.723869
(Epoch 0 / 5) train acc: 0.119000; val_acc: 0.101000
(Iteration 11 / 200) loss: 2.243481
(Iteration 21 / 200) loss: 2.201209
(Iteration 31 / 200) loss: 2.066481
(Epoch 1 / 5) train acc: 0.233000; val_acc: 0.222000
(Iteration 41 / 200) loss: 2.025198
(Iteration 51 / 200) loss: 2.016374
(Iteration 61 / 200) loss: 1.954067
(Iteration 71 / 200) loss: 1.912283
(Epoch 2 / 5) train acc: 0.303000; val_acc: 0.280000
(Iteration 81 / 200) loss: 2.053501
(Iteration 91 / 200) loss: 1.915960
(Iteration 101 / 200) loss: 2.017691
(Iteration 111 / 200) loss: 1.966786
(Epoch 3 / 5) train acc: 0.320000; val_acc: 0.289000
(Iteration 121 / 200) loss: 1.812466
(Iteration 131 / 200) loss: 1.884726
(Iteration 141 / 200) loss: 1.760283
(Iteration 151 / 200) loss: 1.888199
(Epoch 4 / 5) train acc: 0.335000; val_acc: 0.318000
(Iteration 161 / 200) loss: 1.870064
(Iteration 171 / 200) loss: 1.913483
(Iteration 181 / 200) loss: 1.645721
(Iteration 191 / 200) loss: 1.843821
(Epoch 5 / 5) train acc: 0.373000; val_acc: 0.312000
Running with  sgd_momentum
(Iteration 1 / 200) loss: 3.310699
(Epoch 0 / 5) train acc: 0.124000; val_acc: 0.122000
(Iteration 11 / 200) loss: 2.126889
(Iteration 21 / 200) loss: 2.171193
(Iteration 31 / 200) loss: 2.072664
(Epoch 1 / 5) train acc: 0.316000; val_acc: 0.281000
(Iteration 41 / 200) loss: 1.856631
(Iteration 51 / 200) loss: 1.845602
(Iteration 61 / 200) loss: 1.982693
(Iteration 71 / 200) loss: 1.782464
(Epoch 2 / 5) train acc: 0.364000; val_acc: 0.338000
(Iteration 81 / 200) loss: 1.557191
(Iteration 91 / 200) loss: 1.749715
(Iteration 101 / 200) loss: 1.632350
(Iteration 111 / 200) loss: 1.763286
(Epoch 3 / 5) train acc: 0.411000; val_acc: 0.327000
(Iteration 121 / 200) loss: 1.550437
(Iteration 131 / 200) loss: 1.490807
(Iteration 141 / 200) loss: 1.546287
(Iteration 151 / 200) loss: 1.378534
(Epoch 4 / 5) train acc: 0.439000; val_acc: 0.293000
(Iteration 161 / 200) loss: 1.615990
(Iteration 171 / 200) loss: 1.226020
(Iteration 181 / 200) loss: 1.516447
(Iteration 191 / 200) loss: 1.576345
(Epoch 5 / 5) train acc: 0.534000; val_acc: 0.331000

![Training loss, training accuracy, and validation accuracy: sgd vs. sgd_momentum](output_18_1.png)

# RMSProp and Adam

RMSProp [1] and Adam [2] are update rules that set per-parameter learning rates by using a running average of the second moments of gradients.

In the file cs231n/optim.py, implement the RMSProp update rule in the rmsprop function and implement the Adam update rule in the adam function, and check your implementations using the tests below.

NOTE: Please implement the complete Adam update rule (with the bias correction mechanism), not the first simplified version mentioned in the course notes.

[1] Tijmen Tieleman and Geoffrey Hinton. “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.” COURSERA: Neural Networks for Machine Learning 4 (2012).

[2] Diederik Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization”, ICLR 2015.
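
For reference, the full bias-corrected Adam step that the `adam` function below implements (with the step counter $t$ incremented before it is used) can be written as

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t, & \hat m_t &= m_t / (1-\beta_1^{\,t}),\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2, & \hat v_t &= v_t / (1-\beta_2^{\,t}),\\
w_t &= w_{t-1} - \eta\,\hat m_t / (\sqrt{\hat v_t} + \epsilon),
\end{aligned}
$$

where $g_t$ is the gradient `dw`, $\eta$ is the learning rate, and $\beta_1$, $\beta_2$, $\epsilon$ are the hyperparameters stored in `config`.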

# Test RMSProp implementation
from cs231n.optim import rmsprop

N, D = 4, 5
w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
cache = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

config = {'learning_rate': 1e-2, 'cache': cache}
next_w, _ = rmsprop(w, dw, config=config)

expected_next_w = np.asarray([
  [-0.39223849, -0.34037513, -0.28849239, -0.23659121, -0.18467247],
  [-0.132737,   -0.08078555, -0.02881884,  0.02316247,  0.07515774],
  [ 0.12716641,  0.17918792,  0.23122175,  0.28326742,  0.33532447],
  [ 0.38739248,  0.43947102,  0.49155973,  0.54365823,  0.59576619]])
expected_cache = np.asarray([
  [ 0.5976,      0.6126277,   0.6277108,   0.64284931,  0.65804321],
  [ 0.67329252,  0.68859723,  0.70395734,  0.71937285,  0.73484377],
  [ 0.75037008,  0.7659518,   0.78158892,  0.79728144,  0.81302936],
  [ 0.82883269,  0.84469141,  0.86060554,  0.87657507,  0.8926    ]])

# You should see relative errors around e-7 or less
print('next_w error: ', rel_error(expected_next_w, next_w))
print('cache error: ', rel_error(expected_cache, config['cache']))
next_w error:  9.524687511038133e-08
cache error:  2.6477955807156126e-09
# Test Adam implementation
from cs231n.optim import adam

N, D = 4, 5
w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
m = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)
v = np.linspace(0.7, 0.5, num=N*D).reshape(N, D)

config = {'learning_rate': 1e-2, 'm': m, 'v': v, 't': 5}
next_w, _ = adam(w, dw, config=config)

expected_next_w = np.asarray([
  [-0.40094747, -0.34836187, -0.29577703, -0.24319299, -0.19060977],
  [-0.1380274,  -0.08544591, -0.03286534,  0.01971428,  0.0722929],
  [ 0.1248705,   0.17744702,  0.23002243,  0.28259667,  0.33516969],
  [ 0.38774145,  0.44031188,  0.49288093,  0.54544852,  0.59801459]])
expected_v = np.asarray([
  [ 0.69966,     0.68908382,  0.67851319,  0.66794809,  0.65738853,],
  [ 0.64683452,  0.63628604,  0.6257431,   0.61520571,  0.60467385,],
  [ 0.59414753,  0.58362676,  0.57311152,  0.56260183,  0.55209767,],
  [ 0.54159906,  0.53110598,  0.52061845,  0.51013645,  0.49966,   ]])
expected_m = np.asarray([
  [ 0.48,        0.49947368,  0.51894737,  0.53842105,  0.55789474],
  [ 0.57736842,  0.59684211,  0.61631579,  0.63578947,  0.65526316],
  [ 0.67473684,  0.69421053,  0.71368421,  0.73315789,  0.75263158],
  [ 0.77210526,  0.79157895,  0.81105263,  0.83052632,  0.85      ]])

# You should see relative errors around e-7 or less
print('next_w error: ', rel_error(expected_next_w, next_w))
print('v error: ', rel_error(expected_v, config['v']))
print('m error: ', rel_error(expected_m, config['m']))
next_w error:  1.1395691798535431e-07
v error:  4.208314038113071e-09
m error:  4.214963193114416e-09

Once you have debugged your RMSProp and Adam implementations, run the following to train a pair of deep networks using these new update rules:

learning_rates = {'rmsprop': 1e-4, 'adam': 1e-4}
for update_rule in ['adam', 'rmsprop']:
    print('Running with ', update_rule)
    model = FullyConnectedNet(
        [100, 100, 100, 100, 100],
        weight_scale=5e-2
    )
    solver = Solver(
        model,
        small_data,
        num_epochs=5,
        batch_size=100,
        update_rule=update_rule,
        optim_config={'learning_rate': learning_rates[update_rule]},
        verbose=True
    )
    solvers[update_rule] = solver
    solver.train()
    print()
    
fig, axes = plt.subplots(3, 1, figsize=(15, 15))

axes[0].set_title('Training loss')
axes[0].set_xlabel('Iteration')
axes[1].set_title('Training accuracy')
axes[1].set_xlabel('Epoch')
axes[2].set_title('Validation accuracy')
axes[2].set_xlabel('Epoch')

for update_rule, solver in solvers.items():
    axes[0].plot(solver.loss_history, label=f"{update_rule}")
    axes[1].plot(solver.train_acc_history, label=f"{update_rule}")
    axes[2].plot(solver.val_acc_history, label=f"{update_rule}")
    
for ax in axes:
    ax.legend(loc='best', ncol=4)
    ax.grid(linestyle='--', linewidth=0.5)

plt.show()
Running with  adam
(Iteration 1 / 200) loss: 2.504476
(Epoch 0 / 5) train acc: 0.114000; val_acc: 0.101000
(Iteration 11 / 200) loss: 2.320703
(Iteration 21 / 200) loss: 2.297157
(Iteration 31 / 200) loss: 2.149343
(Epoch 1 / 5) train acc: 0.233000; val_acc: 0.229000
(Iteration 41 / 200) loss: 2.167822
(Iteration 51 / 200) loss: 2.033628
(Iteration 61 / 200) loss: 1.968268
(Iteration 71 / 200) loss: 1.979712
(Epoch 2 / 5) train acc: 0.322000; val_acc: 0.276000
(Iteration 81 / 200) loss: 1.978122
(Iteration 91 / 200) loss: 1.903067
(Iteration 101 / 200) loss: 1.692358
(Iteration 111 / 200) loss: 1.863013
(Epoch 3 / 5) train acc: 0.407000; val_acc: 0.305000
(Iteration 121 / 200) loss: 1.746651
(Iteration 131 / 200) loss: 1.705345
(Iteration 141 / 200) loss: 1.621473
(Iteration 151 / 200) loss: 1.550340
(Epoch 4 / 5) train acc: 0.414000; val_acc: 0.347000
(Iteration 161 / 200) loss: 1.723255
(Iteration 171 / 200) loss: 1.642489
(Iteration 181 / 200) loss: 1.585817
(Iteration 191 / 200) loss: 1.524533
(Epoch 5 / 5) train acc: 0.464000; val_acc: 0.353000

Running with  rmsprop
(Iteration 1 / 200) loss: 2.541891
(Epoch 0 / 5) train acc: 0.121000; val_acc: 0.129000
(Iteration 11 / 200) loss: 2.235184
(Iteration 21 / 200) loss: 1.946602
(Iteration 31 / 200) loss: 1.741395
(Epoch 1 / 5) train acc: 0.380000; val_acc: 0.303000
(Iteration 41 / 200) loss: 1.744624
(Iteration 51 / 200) loss: 1.843242
(Iteration 61 / 200) loss: 1.789016
(Iteration 71 / 200) loss: 1.681040
(Epoch 2 / 5) train acc: 0.442000; val_acc: 0.337000
(Iteration 81 / 200) loss: 1.733956
(Iteration 91 / 200) loss: 1.640961
(Iteration 101 / 200) loss: 1.637095
(Iteration 111 / 200) loss: 1.441220
(Epoch 3 / 5) train acc: 0.471000; val_acc: 0.340000
(Iteration 121 / 200) loss: 1.671299
(Iteration 131 / 200) loss: 1.551298
(Iteration 141 / 200) loss: 1.583967
(Iteration 151 / 200) loss: 1.420140
(Epoch 4 / 5) train acc: 0.518000; val_acc: 0.366000
(Iteration 161 / 200) loss: 1.430030
(Iteration 171 / 200) loss: 1.365771
(Iteration 181 / 200) loss: 1.482616
(Iteration 191 / 200) loss: 1.525441
(Epoch 5 / 5) train acc: 0.549000; val_acc: 0.379000

![Training loss, training accuracy, and validation accuracy for all update rules](output_23_1.png)

Inline Question 2:

AdaGrad, like Adam, is a per-parameter optimization method that uses the following update rule:

cache += dw**2
w += - learning_rate * dw / (np.sqrt(cache) + eps)

John notices that when he was training a network with AdaGrad, the updates became very small and his network was learning slowly. Using your knowledge of the AdaGrad update rule, why do you think the updates would become very small? Would Adam have the same issue?

Answer:

[FILL THIS IN]

# AdaGrad accumulates the squared gradients over time and uses them to scale each parameter's learning rate.
# As a parameter receives more updates, the denominator np.sqrt(cache) grows monotonically, so the effective
# learning rate keeps shrinking; the updates eventually become very small and learning slows down.
# Adam does not have this problem: it uses exponential moving averages of the first and second moments of the
# gradient (with bias correction), so the denominator tracks the recent gradient magnitude instead of growing
# without bound, and the momentum-like first-moment term keeps the updates moving, so learning does not stall
# in the same way.
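
A small numeric illustration (a sketch with a made-up constant scalar gradient, not part of the assignment) of how the AdaGrad denominator grows while Adam's effective step stays roughly constant:

```python
# Compare effective step sizes of AdaGrad vs. Adam for a constant gradient of 1.0.
import numpy as np

lr, eps = 1e-2, 1e-8
beta1, beta2 = 0.9, 0.999

cache = 0.0              # AdaGrad accumulator
m = v = 0.0              # Adam first and second moments
for t in range(1, 1001):
    g = 1.0
    # AdaGrad: the denominator grows without bound, so the step shrinks like 1/sqrt(t).
    cache += g ** 2
    adagrad_step = lr * g / (np.sqrt(cache) + eps)
    # Adam: moving averages + bias correction keep the step near the learning rate.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    mt, vt = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    adam_step = lr * mt / (np.sqrt(vt) + eps)
    if t in (1, 10, 100, 1000):
        print(f"t={t:4d}  adagrad step={adagrad_step:.5f}  adam step={adam_step:.5f}")
```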

# Train a Good Model!

Train the best fully connected model that you can on CIFAR-10, storing your best model in the best_model variable. We require you to get at least 50% accuracy on the validation set using a fully connected network.

If you are careful it should be possible to get accuracies above 55%, but we don’t require it for this part and won’t assign extra credit for doing so. Later in the assignment we will ask you to train the best convolutional network that you can on CIFAR-10, and we would prefer that you spend your effort working on convolutional networks rather than fully connected networks.

Note: You might find it useful to complete the BatchNormalization.ipynb and Dropout.ipynb notebooks before completing this part, since those techniques can help you train powerful models.

The note above suggests completing BatchNormalization.ipynb and Dropout.ipynb first, so I have left this part undone for now and will come back to it after implementing batch normalization and dropout.

best_model = None

################################################################################
# TODO: Train the best FullyConnectedNet that you can on CIFAR-10. You might   #
# find batch/layer normalization and dropout useful. Store your best model in  #
# the best_model variable.                                                     #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

pass

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
################################################################################
#                              END OF YOUR CODE                                #
################################################################################
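
For when I come back to this after the BatchNormalization and Dropout notebooks, one possible starting point is a small learning-rate search with Adam (a sketch with hypothetical hyperparameter values; whether it clears 50% validation accuracy depends on the settings):

```python
# Hypothetical search: pick the Adam learning rate with the best validation accuracy.
best_val_acc = 0.0
for lr in [1e-4, 5e-4, 1e-3]:
    model = FullyConnectedNet([100, 100, 100], weight_scale=5e-2)
    solver = Solver(
        model,
        data,
        num_epochs=10,
        batch_size=200,
        update_rule="adam",
        optim_config={"learning_rate": lr},
        verbose=False,
    )
    solver.train()
    val_acc = max(solver.val_acc_history)
    print(f"lr={lr:.0e}: best val acc = {val_acc:.3f}")
    if val_acc > best_val_acc:
        best_val_acc, best_model = val_acc, model
```

If the Solver swaps the best-so-far parameters back into the model at the end of train() (the cs231n Solver does this), keeping the model object with the highest validation accuracy is enough here.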

# Test Your Model!

Run your best model on the validation and test sets. You should achieve at least 50% accuracy on the validation set.

y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)
y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)
print('Validation set accuracy: ', (y_val_pred == data['y_val']).mean())
print('Test set accuracy: ', (y_test_pred == data['y_test']).mean())

Below is the code for fc_net.py:

from builtins import range
from builtins import object
import numpy as np

from ..layers import *
from ..layer_utils import *


class FullyConnectedNet(object):
    """Class for a multi-layer fully connected neural network.

    Network contains an arbitrary number of hidden layers, ReLU nonlinearities,
    and a softmax loss function. This will also implement dropout and batch/layer
    normalization as options. For a network with L layers, the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional and the {...} block is
    repeated L - 1 times.

    Learnable parameters are stored in the self.params dictionary and will be learned
    using the Solver class.
    """

    def __init__(
        self,
        hidden_dims,
        input_dim=3 * 32 * 32,
        num_classes=10,
        dropout_keep_ratio=1,
        normalization=None,
        reg=0.0,
        weight_scale=1e-2,
        dtype=np.float32,
        seed=None,
    ):
        """Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer (i.e. the number of
            neurons in each hidden layer).
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout_keep_ratio: Scalar between 0 and 1 giving dropout strength.
            If dropout_keep_ratio=1 then the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
            are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
            initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
            this datatype. float32 is faster but less accurate, so you should use
            float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers.
            This will make the dropout layers deterministic so we can gradient check the model.
        """
        self.normalization = normalization
        self.use_dropout = dropout_keep_ratio != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        layer_dims = np.hstack((input_dim, hidden_dims, num_classes))
        # np.hstack concatenates its arguments into a single 1-D array, so layer_dims lists the size of
        # every layer: the input dimension, then each hidden dimension, then num_classes. For example,
        # input_dim=2, hidden_dims=[3, 3], num_classes=4 gives layer_dims = (2, 3, 3, 4).
        for i in range(self.num_layers):
            W = np.random.normal(0, weight_scale, (layer_dims[i], layer_dims[i + 1]))
            b = np.zeros(layer_dims[i + 1])
            self.params['W' + str(i + 1)] = W
            self.params['b' + str(i + 1)] = b
        if self.normalization is not None:
            for i in range(self.num_layers - 1):
                gamma = np.ones(layer_dims[i + 1])
                beta = np.zeros(layer_dims[i + 1])
                self.params['gamma' + str(i + 1)] = gamma
                self.params['beta' + str(i + 1)] = beta
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {"mode": "train", "p": dropout_keep_ratio}
            if seed is not None:
                self.dropout_param["seed"] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.normalization == "batchnorm":
            self.bn_params = [{"mode": "train"} for i in range(self.num_layers - 1)]
        if self.normalization == "layernorm":
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype.
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """Compute loss and gradient for the fully connected net.
        
        Inputs:
        - X: Array of input data of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
            scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
            names to gradients of the loss with respect to those parameters.
        """
        X = X.astype(self.dtype)
        mode = "test" if y is None else "train"

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param["mode"] = mode
        if self.normalization == "batchnorm":
            for bn_param in self.bn_params:
                bn_param["mode"] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        x = X
        caches = []
        for i in range(self.num_layers - 1):
            W = self.params['W' + str(i + 1)]
            b = self.params['b' + str(i + 1)]
            if self.normalization is None:
                # Plain affine -> ReLU block (batch/layer norm and dropout are added in later notebooks).
                out, cache = affine_relu_forward(x, W, b)
            caches.append(cache)
            x = out
        W = self.params['W' + str(self.num_layers)]
        b = self.params['b' + str(self.num_layers)]
        scores, cache = affine_forward(x, W, b)
        caches.append(cache)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early.
        if mode == "test":
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the   #
        # scale and shift parameters.                                              #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        loss, dout = softmax_loss(scores, y)
        for i in range(self.num_layers):
            W = self.params['W' + str(i + 1)]
            # The total loss is the softmax data loss plus the L2 regularization term of every layer.
            loss += 0.5 * self.reg * np.sum(W * W)
        # caches[self.num_layers - 1] is the cache of the final affine layer.
        dout, dw, db = affine_backward(dout, caches[self.num_layers - 1])
        # dw must also include the gradient of this layer's regularization term.
        dw += self.reg * self.params['W' + str(self.num_layers)]
        grads['W' + str(self.num_layers)] = dw
        grads['b' + str(self.num_layers)] = db

        # Walk backward from the last hidden layer (index self.num_layers - 2) down to the first (index 0).
        for i in range(self.num_layers - 2, -1, -1):
            if self.normalization is None:
                # caches[i] is the cache saved by the forward pass of hidden layer i + 1.
                dout, dw, db = affine_relu_backward(dout, caches[i])
                dw += self.reg * self.params['W' + str(i + 1)]
                grads['W' + str(i + 1)] = dw
                grads['b' + str(i + 1)] = db

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
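
As a quick sanity check on the initialization loop above, one can instantiate the class and print the parameter shapes (a sketch assuming the cs231n package layout used by the notebook):

```python
# Hypothetical shape check for the parameter-initialization loop in FullyConnectedNet.__init__.
from cs231n.classifiers.fc_net import FullyConnectedNet

net = FullyConnectedNet([20, 30], input_dim=15, num_classes=10, weight_scale=5e-2)
for name in sorted(net.params):
    print(name, net.params[name].shape)
# Expected: W1 (15, 20), W2 (20, 30), W3 (30, 10); b1 (20,), b2 (30,), b3 (10,)
```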

Below is the code for optim.py:

import numpy as np

"""
This file implements various first-order update rules that are commonly used
for training neural networks. Each update rule accepts current weights and the
gradient of the loss with respect to those weights and produces the next set of
weights. Each update rule has the same interface:

def update(w, dw, config=None):

Inputs:
  - w: A numpy array giving the current weights.
  - dw: A numpy array of the same shape as w giving the gradient of the
    loss with respect to w.
  - config: A dictionary containing hyperparameter values such as learning
    rate, momentum, etc. If the update rule requires caching values over many
    iterations, then config will also hold these cached values.

Returns:
  - next_w: The next point after the update.
  - config: The config dictionary to be passed to the next iteration of the
    update rule.

NOTE: For most update rules, the default learning rate will probably not
perform well; however the default values of the other hyperparameters should
work well for a variety of different problems.

For efficiency, update rules may perform in-place updates, mutating w and
setting next_w equal to w.
"""


def sgd(w, dw, config=None):
    """
    Performs vanilla stochastic gradient descent.

    config format:
    - learning_rate: Scalar learning rate.
    """
    if config is None:
        config = {}
    config.setdefault("learning_rate", 1e-2)

    w -= config["learning_rate"] * dw
    return w, config


def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.
    """
    if config is None:
        config = {}
    config.setdefault("learning_rate", 1e-2)
    config.setdefault("momentum", 0.9)
    v = config.get("velocity", np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the momentum update formula. Store the updated value in #
    # the next_w variable. You should also use and update the velocity v.     #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    mu = config['momentum']
    lr = config['learning_rate']
    v = mu * v - lr * dw
    next_w = w + v

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    config["velocity"] = v

    return next_w, config


def rmsprop(w, dw, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None:
        config = {}
    config.setdefault("learning_rate", 1e-2)
    config.setdefault("decay_rate", 0.99)
    config.setdefault("epsilon", 1e-8)
    config.setdefault("cache", np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the RMSprop update formula, storing the next value of w #
    # in the next_w variable. Don't forget to update cache value stored in    #
    # config['cache'].                                                        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw * dw
    next_w = w - config['learning_rate'] * dw / (np.sqrt(config['cache']) + config['epsilon'])

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_w, config


def adam(w, dw, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None:
        config = {}
    config.setdefault("learning_rate", 1e-3)
    config.setdefault("beta1", 0.9)
    config.setdefault("beta2", 0.999)
    config.setdefault("epsilon", 1e-8)
    config.setdefault("m", np.zeros_like(w))
    config.setdefault("v", np.zeros_like(w))
    config.setdefault("t", 0)

    next_w = None
    ###########################################################################
    # TODO: Implement the Adam update formula, storing the next value of w in #
    # the next_w variable. Don't forget to update the m, v, and t variables   #
    # stored in config.                                                       #
    #                                                                         #
    # NOTE: In order to match the reference output, please modify t _before_  #
    # using it in any calculations.                                           #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # Remember to increment t before using it in the bias-correction terms!
    config['t'] += 1
    config['m'] = config['m'] * config['beta1'] + (1 - config['beta1']) * dw
    mt = config['m'] / (1 - config['beta1'] ** config['t'])
    config['v'] = config['v'] * config['beta2'] + (1 - config['beta2']) * dw * dw
    vt = config['v'] / (1 - config['beta2'] ** config['t'])
    next_w = w - config['learning_rate'] * mt / (np.sqrt(vt) + config['epsilon'])

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_w, config
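
To round out the API described in the module docstring, here is a minimal driver loop (a toy quadratic objective of my own, not part of the assignment) showing how an update rule and its config dictionary are threaded through successive iterations, which is essentially what the Solver does for every parameter:

```python
# Hypothetical driver loop: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
import numpy as np
from cs231n.optim import adam

w = np.random.randn(3, 4)
config = {"learning_rate": 1e-1}   # remaining hyperparameters take adam's defaults
for step in range(500):
    dw = w                          # gradient of 0.5 * ||w||^2 is w itself
    w, config = adam(w, dw, config=config)

print(np.abs(w).max())              # should be small, roughly on the order of the step size
```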