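Introduction
We have finally reached the last part of Assignment 1, and with it the core of deep learning: building and training a neural network. This assignment only covers a two-layer network, but it shows the general architecture and workflow of any neural network; more complex networks mostly add components such as batch normalization, dropout, and identity mappings on top of it. So it is worth reading the code carefully and completing this assignment on your own.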
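Building the neural network
Last time we trained our first model, a linear classifier. Training a neural network works in much the same way: at each iteration we compute the loss, take a gradient descent step to update the parameters, and so gradually get more accurate predictions. What differs is the structure of the classifier.
As the figure shows, this time we train a two-layer neural network:
Its basic structure is: input - fully connected layer - ReLU activation - fully connected layer - softmax - output
What we need to implement is mainly the middle of the network.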
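Defining the loss function
First we unpack the parameters W1, b1, W2, b2 (they are initialized in __init__) and compute the network's intermediate values, starting with the first fully connected layer: Z1 = X W1 + b1.
Z1 then passes through the ReLU activation: A1 = max(0, Z1).
A1 is then fed to the second layer as its input: Z2 = A1 W2 + b2.
Z2 holds the class scores; the final output of the network is Z2 passed through the softmax layer.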
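The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the assignment's reference solution: the helper name `forward` and the toy sizes are assumptions made for the example, while the shapes follow the conventions in the class docstring (X is (N, D), W1 is (D, H), W2 is (H, C)).

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Affine -> ReLU -> affine. Returns Z1, A1 and the class scores Z2."""
    Z1 = X.dot(W1) + b1        # first fully connected layer, shape (N, H)
    A1 = np.maximum(0, Z1)     # ReLU keeps positive entries, zeroes the rest
    Z2 = A1.dot(W2) + b2       # second fully connected layer, shape (N, C)
    return Z1, A1, Z2

# Toy sizes chosen only for illustration (hypothetical values).
rng = np.random.default_rng(0)
N, D, H, C = 5, 4, 10, 3
X = rng.standard_normal((N, D))
W1 = 0.1 * rng.standard_normal((D, H))
b1 = np.zeros(H)
W2 = 0.1 * rng.standard_normal((H, C))
b2 = np.zeros(C)

Z1, A1, Z2 = forward(X, W1, b1, W2, b2)
print(Z1.shape, A1.shape, Z2.shape)  # (5, 10) (5, 10) (5, 3)
```

Note that the biases are broadcast across the N rows, which is why b1 and b2 can stay one-dimensional.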
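None of this is hard. The hard part is the gradient descent side, i.e. backpropagation. We start backpropagating from dZ2. How do we get dZ2? Take the softmax probability matrix computed from the final scores, subtract 1 at the position of the correct class in each row, and divide by the number of samples.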
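To make the dZ2 rule concrete, here is a small sketch that computes the softmax loss and its gradient with respect to the scores, then compares that gradient with centered finite differences. The function name `softmax_loss` and the toy data are assumptions for this example. The max-subtraction step is a common numerical-stability trick that does not change the probabilities.

```python
import numpy as np

def softmax_loss(scores, y):
    """Mean cross-entropy loss and its gradient with respect to the scores."""
    N = scores.shape[0]
    shifted = scores - scores.max(axis=1, keepdims=True)  # avoid overflow in exp
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()
    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1   # subtract 1 at the correct class
    dscores /= N                    # average over the batch
    return loss, dscores

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 3))   # toy scores: 4 samples, 3 classes
y = np.array([0, 2, 1, 2])
loss, dscores = softmax_loss(scores, y)

# Centered finite differences as an independent check of dscores.
h = 1e-5
numeric = np.zeros_like(scores)
for i in range(scores.shape[0]):
    for j in range(scores.shape[1]):
        old = scores[i, j]
        scores[i, j] = old + h
        loss_plus, _ = softmax_loss(scores, y)
        scores[i, j] = old - h
        loss_minus, _ = softmax_loss(scores, y)
        scores[i, j] = old
        numeric[i, j] = (loss_plus - loss_minus) / (2 * h)

max_diff = np.abs(numeric - dscores).max()
print(max_diff)  # should be tiny if the analytic gradient is right
```

Each row of the gradient sums to zero, because the probabilities in a row sum to 1 and exactly 1 is subtracted from that row.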
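The gradients of the other variables all follow by backpropagating from dZ2 with the chain rule, so I won't go through them here; if this part is unclear, go back over the CS231n backpropagation material before doing the assignment. Finally, store dW1, db1, dW2, db2 in grads and return loss and grads.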
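For readers who want to see the chain-rule steps spelled out, the sketch below backpropagates a given dZ2 to all four parameters and verifies the result numerically. To keep it short, it uses a fixed random matrix G in place of the softmax gradient, so the scalar being differentiated is sum(Z2 * G), whose gradient with respect to Z2 is exactly G. The helper names, the toy sizes, and the omission of regularization are all assumptions of this example.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    Z1 = X.dot(W1) + b1
    A1 = np.maximum(0, Z1)
    Z2 = A1.dot(W2) + b2
    return Z1, A1, Z2

def backward(X, Z1, A1, W2, dZ2):
    """Backpropagate an upstream gradient dZ2 of shape (N, C) to the parameters."""
    dW2 = A1.T.dot(dZ2)      # (H, C)
    db2 = dZ2.sum(axis=0)    # (C,)
    dA1 = dZ2.dot(W2.T)      # (N, H)
    dZ1 = dA1 * (Z1 > 0)     # ReLU passes gradient only where Z1 > 0
    dW1 = X.T.dot(dZ1)       # (D, H)
    db1 = dZ1.sum(axis=0)    # (H,)
    return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}

rng = np.random.default_rng(1)
N, D, H, C = 5, 4, 6, 3      # toy sizes, hypothetical
X = rng.standard_normal((N, D))
params = {
    'W1': 0.5 * rng.standard_normal((D, H)),
    'b1': 0.1 * rng.standard_normal(H),
    'W2': 0.5 * rng.standard_normal((H, C)),
    'b2': 0.1 * rng.standard_normal(C),
}
G = rng.standard_normal((N, C))  # stand-in for dZ2

def objective():
    # Scalar whose gradient with respect to Z2 equals G.
    return np.sum(forward(X, **params)[2] * G)

Z1, A1, Z2 = forward(X, **params)
grads = backward(X, Z1, A1, params['W2'], G)

# Compare every analytic gradient with centered finite differences.
h = 1e-5
max_err = 0.0
for name, P in params.items():
    numeric = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        old = P[idx]
        P[idx] = old + h
        f_plus = objective()
        P[idx] = old - h
        f_minus = objective()
        P[idx] = old
        numeric[idx] = (f_plus - f_minus) / (2 * h)
    max_err = max(max_err, np.abs(numeric - grads[name]).max())
print(max_err)
```

In the actual loss function the same steps apply with dZ2 taken from the softmax, plus the 2 * reg * W terms that come from the L2 penalty.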
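Training
Before training, we still need to split the samples into minibatches. If there are fewer samples than the batch size, we sample with replacement so that some data points are repeated; otherwise we sample without replacement (this is controlled by the replace argument of np.random.choice).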
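The sampling rule can be sketched as follows. The helper `sample_batch` is hypothetical, and it uses NumPy's Generator API (`np.random.default_rng`) rather than the legacy `np.random.choice` call in the listing; the replace argument behaves the same way in both.

```python
import numpy as np

def sample_batch(X, y, batch_size, rng):
    """Draw a random minibatch; repeat samples only if the data set is too small."""
    num_train = X.shape[0]
    replace = num_train < batch_size   # with replacement only when necessary
    idx = rng.choice(num_train, size=batch_size, replace=replace)
    return X[idx], y[idx], idx

rng = np.random.default_rng(0)

# Case 1: more samples than the batch size, so indices are all distinct.
X_big, y_big = np.arange(20).reshape(10, 2), np.arange(10)
Xb, yb, idx_big = sample_batch(X_big, y_big, batch_size=4, rng=rng)

# Case 2: fewer samples than the batch size, so some indices must repeat.
X_small, y_small = np.arange(6).reshape(3, 2), np.arange(3)
Xs, ys, idx_small = sample_batch(X_small, y_small, batch_size=5, rng=rng)

print(idx_big, idx_small)
```

Sampling without replacement when the batch is larger than the data set would raise a ValueError, which is why the branch is needed at all.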
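After batching, we train for the specified number of iterations: each iteration computes the loss and gradients on a minibatch and updates the parameters to reduce the loss. Once per epoch, the code predicts classes, compares them with the true labels, and records the accuracy on the current training minibatch and on the validation set.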
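The epoch bookkeeping is easy to get subtly wrong, so here is a stripped-down sketch of just that part of the loop. The numbers are made up for illustration, and the SGD step itself is left out. Integer division keeps iterations_per_epoch an int, so the modulo test behaves as intended.

```python
# Hypothetical settings, chosen so the arithmetic is easy to follow.
num_train, batch_size, num_iters = 1000, 200, 20
learning_rate, learning_rate_decay = 1e-3, 0.95

# 1000 // 200 = 5 minibatches make up one pass over the data.
iterations_per_epoch = max(num_train // batch_size, 1)

epoch_steps = []
for it in range(num_iters):
    # ... one SGD step on a minibatch would go here ...
    if it % iterations_per_epoch == 0:
        epoch_steps.append(it)                 # accuracy would be recorded here
        learning_rate *= learning_rate_decay   # decay once per epoch

print(epoch_steps)  # [0, 5, 10, 15]
```

One consequence of this scheme is that the check also fires at iteration 0, so the learning rate is already decayed once before the second step.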
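Speaking of prediction, that brings us to the last part, the prediction function.
Prediction
With the model and the loss in place, the prediction function is the easiest part. Using the trained W1, b1, W2, b2, we compute the final output, which here is a score matrix rather than a single class label. np.argmax with axis=1 then gives the index of the largest value in each row; that index is the predicted class, so we just return it.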
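As a quick illustration of the argmax step, the sketch below turns a small hand-written score matrix into class predictions and an accuracy. The scores and labels are invented for the example.

```python
import numpy as np

# Each row holds the scores of one sample for 3 classes (made-up values).
scores = np.array([
    [1.0, 3.0, 2.0],
    [0.5, 0.1, 0.2],
    [-1.0, -2.0, -0.5],
])

# axis=1 searches across the columns of each row, giving one index per sample.
y_pred = np.argmax(scores, axis=1)
print(y_pred)  # [1 0 2]

# Accuracy is the fraction of predictions that match the true labels.
y_true = np.array([1, 0, 1])
accuracy = (y_pred == y_true).mean()
print(accuracy)  # 2 of 3 correct
```

Applying softmax first would not change the result, since softmax preserves the ordering of the scores within a row.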
See the code for the remaining details and comments.
from __future__ import print_function

import numpy as np
import matplotlib.pyplot as plt


class TwoLayerNet(object):
    """
    A two-layer fully-connected neural network. The net has an input dimension of
    D, a hidden layer dimension of H, and performs classification over C classes.
    We train the network with a softmax loss function and L2 regularization on the
    weight matrices. The network uses a ReLU nonlinearity after the first fully
    connected layer.

    In other words, the network has the following architecture:

    input - fully connected layer - ReLU - fully connected layer - softmax

    The outputs of the second fully-connected layer are the scores for each class.
    """

    def __init__(self, input_size, hidden_size, output_size, std=1e-4):
        """
        Initialize the model. Weights are initialized to small random values and
        biases are initialized to zero. Weights and biases are stored in the
        variable self.params, which is a dictionary with the following keys:

        W1: First layer weights; has shape (D, H)
        b1: First layer biases; has shape (H,)
        W2: Second layer weights; has shape (H, C)
        b2: Second layer biases; has shape (C,)

        Inputs:
        - input_size: The dimension D of the input data.
        - hidden_size: The number of neurons H in the hidden layer.
        - output_size: The number of classes C.
        """
        self.params = {}
        self.params['W1'] = std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def loss(self, X, y=None, reg=0.0):
        """
        Compute the loss and gradients for a two layer fully connected neural
        network.

        Inputs:
        - X: Input data of shape (N, D). Each X[i] is a training sample.
        - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
          an integer in the range 0 <= y[i] < C. This parameter is optional; if it
          is not passed then we only return scores, and if it is passed then we
          instead return the loss and gradients.
        - reg: Regularization strength.

        Returns:
        If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
        the score for class c on input X[i].

        If y is not None, instead return a tuple of:
        - loss: Loss (data loss and regularization loss) for this batch of training
          samples.
        - grads: Dictionary mapping parameter names to gradients of those parameters
          with respect to the loss function; has the same keys as self.params.
        """
        # Unpack variables from the params dictionary
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        N, D = X.shape

        # Compute the forward pass
        scores = None
        #############################################################################
        # TODO: Perform the forward pass, computing the class scores for the input. #
        # Store the result in the scores variable, which should be an array of      #
        # shape (N, C).                                                             #
        #############################################################################
        Z1 = np.dot(X, W1) + b1
        A1 = np.maximum(0, Z1)  # ReLU
        scores = np.dot(A1, W2) + b2
        #############################################################################
        #                              END OF YOUR CODE                             #
        #############################################################################

        # If the targets are not given then jump out, we're done
        if y is None:
            return scores

        # Compute the loss
        loss = None
        #############################################################################
        # TODO: Finish the forward pass, and compute the loss. This should include  #
        # both the data loss and L2 regularization for W1 and W2. Store the result  #
        # in the variable loss, which should be a scalar. Use the Softmax           #
        # classifier loss.                                                          #
        #############################################################################
        num_train = X.shape[0]
        # Subtract the row-wise max before exponentiating for numerical stability;
        # this does not change the softmax probabilities.
        shifted = scores - np.max(scores, axis=1, keepdims=True)
        probs = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True)
        score_y = probs[np.arange(num_train), y]
        loss = np.sum(-np.log(score_y)) / num_train
        loss += reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
        #############################################################################
        #                              END OF YOUR CODE                             #
        #############################################################################

        # Backward pass: compute gradients
        grads = {}
        #############################################################################
        # TODO: Compute the backward pass, computing the derivatives of the weights #
        # and biases. Store the results in the grads dictionary. For example,       #
        # grads['W1'] should store the gradient on W1, and be a matrix of same size #
        #############################################################################
        # Softmax gradient: probabilities, minus 1 at the correct class, averaged
        # over the batch.
        dZ2 = probs.copy()
        dZ2[np.arange(num_train), y] -= 1
        dZ2 /= num_train
        dW2 = np.dot(A1.T, dZ2) + 2 * reg * W2
        db2 = np.sum(dZ2, axis=0)  # shape (C,), same as b2
        dA1 = np.dot(dZ2, W2.T)
        dZ1 = dA1.copy()
        # dZ1 comes from dA1. ReLU zeroed the negative entries of Z1 in the forward
        # pass, so the gradient must be zeroed wherever Z1 < 0.
        dZ1[Z1 < 0] = 0
        dW1 = np.dot(X.T, dZ1) + 2 * reg * W1
        db1 = np.sum(dZ1, axis=0)  # shape (H,), same as b1
        grads['W2'] = dW2
        grads['b2'] = db2
        grads['W1'] = dW1
        grads['b1'] = db1
        #############################################################################
        #                              END OF YOUR CODE                             #
        #############################################################################

        return loss, grads

    def train(self, X, y, X_val, y_val,
              learning_rate=1e-3, learning_rate_decay=0.95,
              reg=5e-6, num_iters=100,
              batch_size=200, verbose=False):
        """
        Train this neural network using stochastic gradient descent.

        Inputs:
        - X: A numpy array of shape (N, D) giving training data.
        - y: A numpy array of shape (N,) giving training labels; y[i] = c means that
          X[i] has label c, where 0 <= c < C.
        - X_val: A numpy array of shape (N_val, D) giving validation data.
        - y_val: A numpy array of shape (N_val,) giving validation labels.
        - learning_rate: Scalar giving learning rate for optimization.
        - learning_rate_decay: Scalar giving factor used to decay the learning rate
          after each epoch.
        - reg: Scalar giving regularization strength.
        - num_iters: Number of steps to take when optimizing.
        - batch_size: Number of training examples to use per step.
        - verbose: boolean; if true print progress during optimization.
        """
        # A minibatch is sampled at every iteration below.
        # X_val, y_val are the validation set.
        num_train = X.shape[0]
        # Integer division keeps this an int so the modulo test below is exact.
        iterations_per_epoch = max(num_train // batch_size, 1)

        # Use SGD to optimize the parameters in self.model
        loss_history = []
        train_acc_history = []
        val_acc_history = []

        for it in range(num_iters):
            X_batch = None
            y_batch = None
            #########################################################################
            # TODO: Create a random minibatch of training data and labels, storing  #
            # them in X_batch and y_batch respectively.                             #
            #########################################################################
            # Sample with replacement only when there are too few samples.
            if num_train < batch_size:
                temp = np.random.choice(a=num_train, size=batch_size, replace=True)
            else:
                temp = np.random.choice(a=num_train, size=batch_size, replace=False)
            X_batch = X[temp]
            y_batch = y[temp]
            #########################################################################
            #                             END OF YOUR CODE                          #
            #########################################################################

            # Compute loss and gradients using the current minibatch
            loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
            loss_history.append(loss)

            #########################################################################
            # TODO: Use the gradients in the grads dictionary to update the         #
            # parameters of the network (stored in the dictionary self.params)      #
            # using stochastic gradient descent. You'll need to use the gradients   #
            # stored in the grads dictionary defined above.                         #
            #########################################################################
            # Update the weights and biases
            self.params['W1'] -= learning_rate * grads['W1']
            self.params['W2'] -= learning_rate * grads['W2']
            self.params['b1'] -= learning_rate * grads['b1'].reshape(-1)
            self.params['b2'] -= learning_rate * grads['b2'].reshape(-1)
            #########################################################################
            #                             END OF YOUR CODE                          #
            #########################################################################

            if verbose and it % 100 == 0:
                print('iteration %d / %d: loss %f' % (it, num_iters, loss))

            # Every epoch, check train and val accuracy and decay learning rate.
            if it % iterations_per_epoch == 0:
                # Check accuracy
                train_acc = (self.predict(X_batch) == y_batch).mean()
                val_acc = (self.predict(X_val) == y_val).mean()
                train_acc_history.append(train_acc)
                val_acc_history.append(val_acc)

                # Decay learning rate
                learning_rate *= learning_rate_decay

        return {
            'loss_history': loss_history,
            'train_acc_history': train_acc_history,
            'val_acc_history': val_acc_history,
        }

    def predict(self, X):
        """
        Use the trained weights of this two-layer network to predict labels for
        data points. For each data point we predict scores for each of the C
        classes, and assign each data point to the class with the highest score.

        Inputs:
        - X: A numpy array of shape (N, D) giving N D-dimensional data points to
          classify.

        Returns:
        - y_pred: A numpy array of shape (N,) giving predicted labels for each of
          the elements of X. For all i, y_pred[i] = c means that X[i] is predicted
          to have class c, where 0 <= c < C.
        """
        y_pred = None
        ###########################################################################
        # TODO: Implement this function; it should be VERY simple!                #
        ###########################################################################
        Z1 = np.dot(X, self.params['W1']) + self.params['b1']
        A1 = np.maximum(0, Z1)
        Z2 = np.dot(A1, self.params['W2']) + self.params['b2']
        # The index of the largest score in each row (over the classes) is the
        # predicted class.
        y_pred = np.argmax(Z2, axis=1)
        ###########################################################################
        #                              END OF YOUR CODE                           #
        ###########################################################################

        return y_pred
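Summary
That wraps up Assignment 1. It was mainly about training some basic models and defining loss functions that we will keep using later; having implemented them ourselves, we now have a reasonable grasp of them. Assignment 2 is where it gets harder. Good luck, everyone!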