Formulas
SVM loss function
$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$$
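To make the per-sample loss concrete, here is a small worked example. It is only a sketch: the helper name `hinge_loss_single` and the score values are made up for illustration and are not part of the assignment code.

```python
import numpy as np

def hinge_loss_single(scores, correct_class, delta=1.0):
    """Multiclass SVM loss L_i for one sample (illustrative helper)."""
    margins = np.maximum(0.0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0.0  # the sum skips j == y_i
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])  # made-up class scores s_j
loss = hinge_loss_single(scores, correct_class=0)
# Roughly 2.9: max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
print(loss)
```

Only the second class violates the margin here, so it alone contributes to the loss.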
$$L(W) = \underbrace{\frac{1}{N}\sum_{i=1}^{N} L_i(f(x_i, W), y_i)} + \underbrace{\lambda R(W)}$$
Regularization
Regularization improves the model's ability to generalize.
L2 regularization (weight decay)
$$R(W) = \sum_k \sum_l W_{k,l}^2$$
Gradient derivation
TODO:
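One possible way to fill in this derivation is sketched here, assuming (notation not in the original) that $w_j$ is the $j$-th column of $W$, $x_i$ is one sample, $s_j = w_j^T x_i$, and $\mathbb{1}(\cdot)$ is the indicator function. Differentiating $L_i$ term by term gives:

```latex
\frac{\partial L_i}{\partial w_j}
  = \mathbb{1}\left(s_j - s_{y_i} + 1 > 0\right) x_i
  \qquad (j \neq y_i)

\frac{\partial L_i}{\partial w_{y_i}}
  = -\Big(\sum_{j \neq y_i} \mathbb{1}\left(s_j - s_{y_i} + 1 > 0\right)\Big) x_i

\frac{\partial R}{\partial W} = 2W
```

Each class that violates the margin adds $x_i$ to its own column and subtracts $x_i$ from the column of the correct class. The regularization term contributes $2\lambda W$, or $\lambda W$ when the code scales the penalty by 0.5.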
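Code
svm.ipynb
In this exercise you will:
- Implement a fully vectorized loss function for the SVM
- Implement the fully vectorized expression for its analytic gradient
- Check your implementation using a numerical gradient
- Use a validation set to tune the learning rate and regularization strength
- Optimize the loss function with stochastic gradient descent (SGD)
- Visualize the final learned weights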
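Data preprocessing
# Split the data into training, validation, and test sets.
# In addition, we create a small development set as a subset of the training data;
# using it during development makes our code run faster.
num_training = 49000
num_validation = 1000
num_test = 1000
num_dev = 500
# The validation set is num_validation points taken from the original training set.
mask = range(num_training, num_training + num_validation)
X_val = X_train[mask]
y_val = y_train[mask]
# The training set is the first num_training points of the original training set.
mask = range(num_training)
X_train = X_train[mask]
y_train = y_train[mask]
# We also create a development set, which is a small random subset of the training set.
mask = np.random.choice(num_training, num_dev, replace=False)
X_dev = X_train[mask]
y_dev = y_train[mask]
# The test set is the first num_test points of the original test set.
mask = range(num_test)
X_test = X_test[mask]
y_test = y_test[mask]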
print('Train data shape: ', X_train.shape)
print('Train labels shape: ', y_train.shape)
print('Validation data shape: ', X_val.shape)
print('Validation labels shape: ', y_val.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
Train data shape: (49000, 32, 32, 3)
Train labels shape: (49000,)
Validation data shape: (1000, 32, 32, 3)
Validation labels shape: (1000,)
Test data shape: (1000, 32, 32, 3)
Test labels shape: (1000,)
# Preprocessing: reshape the image data into row vectors
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_val = np.reshape(X_val, (X_val.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))
print('Training data shape: ', X_train.shape)
print('Validation data shape: ', X_val.shape)
print('Test data shape: ', X_test.shape)
print('dev data shape: ', X_dev.shape)
Training data shape: (49000, 3072)
Validation data shape: (1000, 3072)
Test data shape: (1000, 3072)
dev data shape: (500, 3072)
# Preprocessing: subtract the mean image
# First: compute the mean image from the training data
mean_image = np.mean(X_train, axis=0)
print(mean_image[:10]) # print a few of the elements
plt.figure(figsize=(4,4))
plt.imshow(mean_image.reshape((32,32,3)).astype('uint8')) # visualize the mean image
plt.show()
# Second: subtract the mean image from the training, validation, test, and dev data
X_train -= mean_image
X_val -= mean_image
X_test -= mean_image
X_dev -= mean_image
# Third: append a column of ones as the bias dimension (the bias trick), so that
# the SVM only has to optimize a single weight matrix W
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])
print(X_train.shape, X_val.shape, X_test.shape, X_dev.shape)
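The bias trick in the third step can be checked on toy data. The sketch below uses made-up sizes and an explicit bias vector `b` (a name not used in the original code) to show that appending a column of ones to X and a row `b` to W gives the same scores as `X.dot(W) + b`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))   # 4 samples, 3 features (toy sizes)
W = rng.standard_normal((3, 2))   # weights for 2 classes
b = rng.standard_normal(2)        # explicit bias vector (illustrative)

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append a column of ones
W_aug = np.vstack([W, b])                          # bias becomes the last row of W

scores_explicit = X.dot(W) + b
scores_folded = X_aug.dot(W_aug)
print(np.allclose(scores_explicit, scores_folded))  # True
```

This is why W has 3073 rows later on: 3072 pixel values plus one bias row.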
linear_svm.py
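Naive implementation of the SVM loss function
import numpy as np

def svm_loss_naive(W, X, y, reg):
    """
    Structured SVM loss function, naive implementation (with loops).
    Inputs have dimension D, there are C classes, and we operate on
    minibatches of N samples.
    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c
      means that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength
    Returns a tuple of:
    - loss as a single float
    - gradient with respect to the weights W; an array of the same shape as W
    """
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    # Compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1  # note delta = 1
            if margin > 0:
                loss += margin
                # A violating class j pushes its own column toward X[i]
                # and the correct-class column away from it.
                dW[:, y[i]] -= X[i]
                dW[:, j] += X[i]
    # Right now the loss is a sum over all training examples, but we want it
    # to be an average instead, so we divide by num_train.
    loss /= num_train
    dW /= num_train
    # Add regularization to the loss. The 0.5 factor makes the gradient of this
    # term exactly reg * W (added below) and matches svm_loss_vectorized; it only
    # rescales lambda in L(W) = data loss + lambda * R(W).
    loss += 0.5 * reg * np.sum(W * W)
    # The gradient is stored in dW. Rather than first computing the loss and
    # then the derivative, it is simpler to accumulate the derivative while the
    # loss is being computed, which is why the loop above also updates dW.
    dW += reg * W
    return loss, dW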
svm.ipynb
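Check the computed loss and gradient
# Evaluate the naive implementation of the loss we provided for you:
from cs231n.classifiers.linear_svm import svm_loss_naive
import time
# Generate a random SVM weight matrix of small numbers
W = np.random.randn(3073, 10) * 0.0001
loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.000005)
print('loss: %f' % (loss, ))
loss: 9.190614
As provided, the function above returns an all-zero gradient. Derive the gradient of the SVM loss function and implement it inside svm_loss_naive. You will find it helpful to interleave the new code with the existing function.
To check that the gradient is implemented correctly, you can estimate the gradient of the loss function numerically and compare that estimate with the gradient you computed. Code for this is provided:
# Once you have implemented the gradient, recompute it with the code below
# and check it with the function we provided for you.
# Compute the loss and its gradient at W.
loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.0)
# Numerically compute the gradient along several randomly chosen dimensions and
# compare it with your analytic gradient. The values should match almost exactly
# along all dimensions.
from cs231n.gradient_check import grad_check_sparse
f = lambda w: svm_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_numerical = grad_check_sparse(f, W, grad)
# Do the gradient check once again, this time with regularization turned on.
loss, grad = svm_loss_naive(W, X_dev, y_dev, 5e1)
f = lambda w: svm_loss_naive(w, X_dev, y_dev, 5e1)[0]
grad_numerical = grad_check_sparse(f, W, grad)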
numerical: 19.849072 analytic: 19.849072, relative error: 5.520298e-12
numerical: -1.504892 analytic: -1.504892, relative error: 6.562606e-11
numerical: 11.812403 analytic: 11.812403, relative error: 8.660196e-12
numerical: -63.704597 analytic: -63.704597, relative error: 1.820453e-12
numerical: -5.407408 analytic: -5.407408, relative error: 4.599594e-11
numerical: -36.508291 analytic: -36.508291, relative error: 2.907698e-12
numerical: 23.779592 analytic: 23.779592, relative error: 1.577011e-11
numerical: 29.353032 analytic: 29.353032, relative error: 1.772875e-11
numerical: 12.224395 analytic: 12.180846, relative error: 1.784436e-03
numerical: 2.890912 analytic: 2.890912, relative error: 4.758505e-11
numerical: -44.356949 analytic: -44.356920, relative error: 3.171135e-07
numerical: 0.149256 analytic: 0.141826, relative error: 2.552577e-02
numerical: 11.976893 analytic: 11.972968, relative error: 1.638863e-04
numerical: -11.664448 analytic: -11.667394, relative error: 1.262617e-04
numerical: 10.137097 analytic: 10.133080, relative error: 1.981414e-04
numerical: -19.628820 analytic: -19.624164, relative error: 1.186037e-04
numerical: 20.765163 analytic: 20.723335, relative error: 1.008178e-03
numerical: 5.783737 analytic: 5.785277, relative error: 1.330772e-04
numerical: -29.571995 analytic: -29.667418, relative error: 1.610805e-03
numerical: -2.112462 analytic: -2.107824, relative error: 1.099038e-03
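The comparison printed above can be reproduced in miniature. The following is a minimal sketch of a centered-difference check on a toy objective with a known gradient. The helpers `numeric_grad_at` and `relative_error` are hypothetical, and the relative-error formula is an assumption about how such a check is commonly written, not a copy of `grad_check_sparse`.

```python
import numpy as np

def numeric_grad_at(f, w, index, h=1e-5):
    """Centered-difference estimate of df/dw[index] (hypothetical helper)."""
    old = w[index]
    w[index] = old + h
    f_plus = f(w)
    w[index] = old - h
    f_minus = f(w)
    w[index] = old  # restore the original entry
    return (f_plus - f_minus) / (2 * h)

def relative_error(a, b):
    """Assumed definition: |a - b| / (|a| + |b|)."""
    return abs(a - b) / max(abs(a) + abs(b), 1e-12)

# Toy objective with a known gradient: f(w) = sum(w^2), so df/dw = 2w.
f = lambda w: np.sum(w * w)
w = np.random.default_rng(0).standard_normal((5, 3))
analytic = 2 * w

idx = (2, 1)
numeric = numeric_grad_at(f, w, idx)
print('numerical: %f analytic: %f, relative error: %e'
      % (numeric, analytic[idx], relative_error(numeric, analytic[idx])))
```

For a hinge loss the numerical estimate can occasionally disagree near a kink, where a margin crosses zero inside the interval of width 2h, so an isolated larger error does not always indicate a bug.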
linear_svm.py
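Fully vectorized implementation of the SVM loss function
def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, fully vectorized implementation.
    Inputs and outputs are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    # Compute the loss in a vectorized way
    scores = X.dot(W)
    num_classes = W.shape[1]
    num_train = X.shape[0]
    scores_correct = scores[np.arange(num_train), y]
    scores_correct = np.reshape(scores_correct, (num_train, -1))  # shape (N, 1)
    margins = scores - scores_correct + 1
    margins = np.maximum(0, margins)
    margins[np.arange(num_train), y] = 0  # the correct class does not contribute
    loss += np.sum(margins) / num_train
    loss += 0.5 * reg * np.sum(W * W)
    # Compute the gradient in a vectorized way: turn margins into a matrix of
    # coefficients, +1 for each violating class and -(number of violations)
    # for the correct class.
    margins[margins > 0] = 1
    row_sum = np.sum(margins, axis=1)  # shape (N,)
    margins[np.arange(num_train), y] = -row_sum
    dW += np.dot(X.T, margins) / num_train + reg * W  # shape (D, C)
    return loss, dW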
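The gradient step in the vectorized version relies on a coefficient matrix. As a sanity check, the sketch below (toy sizes and random data, all chosen for illustration) builds that matrix and compares `X.T.dot(coeff)` against the same sum accumulated with explicit loops. Regularization is left out.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, C = 6, 4, 3                      # toy sizes
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, C))
y = rng.integers(0, C, size=N)

scores = X.dot(W)
margins = np.maximum(0, scores - scores[np.arange(N), y][:, None] + 1)
margins[np.arange(N), y] = 0

# Coefficient matrix: +1 for each violating class, -(violations) for the true class.
coeff = (margins > 0).astype(float)
coeff[np.arange(N), y] = -coeff.sum(axis=1)
dW_vectorized = X.T.dot(coeff) / N

# Same quantity accumulated with explicit loops, for comparison.
dW_loop = np.zeros_like(W)
for i in range(N):
    for j in range(C):
        if j != y[i] and margins[i, j] > 0:
            dW_loop[:, j] += X[i]
            dW_loop[:, y[i]] -= X[i]
dW_loop /= N

print(np.allclose(dW_vectorized, dW_loop))  # True
```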
svm.ipynb
Compare whether the two implementations give the same result, and how long each takes
# Next, implement the function svm_loss_vectorized.
tic = time.time()
loss_naive, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('Naive loss: %e computed in %fs' % (loss_naive, toc - tic))
from cs231n.classifiers.linear_svm import svm_loss_vectorized
tic = time.time()
loss_vectorized, _ = svm_loss_vectorized(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('Vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic))
# The losses should match but your vectorized implementation should be much faster.
print('difference: %f' % (loss_naive - loss_vectorized))
Naive loss: 9.023543e+00 computed in 0.016366s
Vectorized loss: 9.023543e+00 computed in 0.004959s
difference: 0.000000
# Complete the implementation of svm_loss_vectorized, and compute the gradient
# of the loss function in a vectorized way.
# The naive implementation and the vectorized implementation should match, but
# the vectorized version should still be much faster.
tic = time.time()
_, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('Naive loss and gradient: computed in %fs' % (toc - tic))
tic = time.time()
_, grad_vectorized = svm_loss_vectorized(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('Vectorized loss and gradient: computed in %fs' % (toc - tic))
# The loss is a single number, so it is easy to compare the values computed
# by the two implementations. The gradient on the other hand is a matrix, so
# we use the Frobenius norm to compare them.
difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print('difference: %f' % difference)
Naive loss and gradient: computed in 0.127969s
Vectorized loss and gradient: computed in 0.003472s
difference: 0.000000
linear_classifier.py
Train the linear classifier, using stochastic gradient descent (SGD) to find the W that minimizes the loss
def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=100,
          batch_size=200, verbose=False):
    """
    Train this linear classifier using stochastic gradient descent.
    Inputs:
    - X: A numpy array of shape (N, D) containing training data; there are N
      training samples, each of dimension D.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c
      means that X[i] has label 0 <= c < C for C classes.
    - learning_rate: (float) learning rate for optimization.
    - reg: (float) regularization strength.
    - num_iters: (integer) number of steps to take when optimizing.
    - batch_size: (integer) number of training examples to use at each step.
    - verbose: (boolean) if true, print progress during optimization.
    Outputs:
    A list containing the value of the loss function at each training iteration.
    """
    num_train, dim = X.shape
    num_classes = np.max(y) + 1  # assume y takes values 0 to K-1, where K is the number of classes
    if self.W is None:
        # lazily initialize W
        self.W = 0.001 * np.random.randn(dim, num_classes)
    # Run stochastic gradient descent to optimize W
    loss_history = []
    for it in range(num_iters):
        X_batch = None
        y_batch = None
        #########################################################################
        # TODO:                                                                 #
        # Sample batch_size elements from the training data and their           #
        # corresponding labels to use in this round of gradient descent.        #
        # Store the data in X_batch and their corresponding labels in           #
        # y_batch; after sampling X_batch should have shape (batch_size, dim)   #
        # and y_batch should have shape (batch_size,)                           #
        #                                                                       #
        # Hint: Use np.random.choice to generate indices. Sampling with         #
        # replacement is faster than sampling without replacement.              #
        #########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        batch_inx = np.random.choice(num_train, batch_size)
        X_batch = X[batch_inx,:]
        y_batch = y[batch_inx]
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        # evaluate loss and gradient
        loss, grad = self.loss(X_batch, y_batch, reg)
        loss_history.append(loss)
        # perform parameter update
        #########################################################################
        # TODO:                                                                 #
        # Update the weights using the gradient and the learning rate.          #
        #########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        self.W = self.W - learning_rate * grad
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        if verbose and it % 100 == 0:
            print('iteration %d / %d: loss %f' % (it, num_iters, loss))
    return loss_history
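The same sample-then-update loop can be seen in isolation on a simpler problem. The sketch below deliberately swaps the SVM loss for a least-squares loss on synthetic, noiseless data, so it is not the classifier above. The data, `true_w`, and the hyperparameters are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X.dot(true_w)                        # noiseless toy targets

w = np.zeros(5)
learning_rate, batch_size = 0.1, 32
loss_history = []
for it in range(200):
    idx = rng.choice(X.shape[0], batch_size)    # sample a minibatch with replacement
    X_batch, y_batch = X[idx], y[idx]
    residual = X_batch.dot(w) - y_batch
    loss = np.mean(residual ** 2)               # least-squares loss on the batch
    grad = 2 * X_batch.T.dot(residual) / batch_size
    w -= learning_rate * grad                   # the SGD update step
    loss_history.append(loss)

print(loss_history[0], loss_history[-1])
```

As in `train`, each step uses only a small random batch, yet the loss history trends downward as the weights approach `true_w`.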
svm.ipynb
Inspect the optimization results
# In the file linear_classifier.py, implement SGD in the function
# LinearClassifier.train() and then run it with the code below.
from cs231n.classifiers import LinearSVM
svm = LinearSVM()
tic = time.time()
loss_hist = svm.train(X_train, y_train, learning_rate=1e-7, reg=2.5e4,
num_iters=1500, verbose=True)
toc = time.time()
print('That took %fs' % (toc - tic))
iteration 0 / 1500: loss 416.840860
iteration 100 / 1500: loss 240.456190
iteration 200 / 1500: loss 145.978587
iteration 300 / 1500: loss 89.814848
iteration 400 / 1500: loss 56.253875
iteration 500 / 1500: loss 35.668818
iteration 600 / 1500: loss 23.341002
iteration 700 / 1500: loss 16.464000
iteration 800 / 1500: loss 11.249742
iteration 900 / 1500: loss 8.815990
iteration 1000 / 1500: loss 7.325125
iteration 1100 / 1500: loss 6.763167
iteration 1200 / 1500: loss 6.112349
iteration 1300 / 1500: loss 5.843079
iteration 1400 / 1500: loss 5.215074
That took 6.863122s
# A useful debugging strategy is to plot the loss as a function of
# iteration number:
plt.plot(loss_hist)
plt.xlabel('Iteration number')
plt.ylabel('Loss value')
plt.show()
linear_classifier.py
Predict image labels
def predict(self, X):
    """
    Use the trained weights of this linear classifier to predict labels for
    data points.
    Inputs:
    - X: A numpy array of shape (N, D) containing training data; there are N
      training samples each of dimension D.
    Returns:
    - y_pred: Predicted labels for the data in X. y_pred is a 1-dimensional
      array of length N, and each element is an integer giving the predicted
      class.
    """
    y_pred = np.zeros(X.shape[0])
    ###########################################################################
    # TODO:                                                                   #
    # Implement this method. Store the predicted labels in y_pred.            #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    score = X.dot(self.W)
    y_pred = np.argmax(score,axis=1)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return y_pred
svm.ipynb
# Write the LinearSVM.predict function and evaluate the performance on both the
# training and validation set
y_train_pred = svm.predict(X_train)
print('training accuracy: %f' % (np.mean(y_train == y_train_pred), ))
y_val_pred = svm.predict(X_val)
print('validation accuracy: %f' % (np.mean(y_val == y_val_pred), ))
training accuracy: 0.382776
validation accuracy: 0.384000
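The accuracy figures above come from an argmax over class scores followed by a mean of matches. A tiny sketch with made-up scores and labels shows both steps:

```python
import numpy as np

scores = np.array([[0.1, 2.0, -1.0],
                   [1.5, 0.3, 0.2],
                   [0.0, 0.1, 0.9]])   # made-up scores, one row per sample
y_true = np.array([1, 0, 1])           # made-up labels

y_pred = np.argmax(scores, axis=1)     # highest-scoring class per row
accuracy = np.mean(y_pred == y_true)   # fraction of correct predictions
print(y_pred, accuracy)
```

Two of the three predictions match their labels here, so the accuracy is 2/3.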
# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of about 0.39 on the validation set.
# Note: you may see runtime/overflow warnings during hyper-parameter search.
# This may be caused by extreme values, and is not a bug.
learning_rates = [1e-7, 5e-5]
regularization_strengths = [2.5e4, 5e4]
# results is dictionary mapping tuples of the form
# (learning_rate, regularization_strength) to tuples of the form
# (training_accuracy, validation_accuracy). The accuracy is simply the fraction
# of data points that are correctly classified.
results = {}
best_val = -1 # The highest validation accuracy that we have seen so far.
best_svm = None # The LinearSVM object that achieved the highest validation rate.
for rate in learning_rates:
    for regular in regularization_strengths:
        svm = LinearSVM()
        svm.train(X_train, y_train, learning_rate=rate, reg=regular,
                  num_iters=1000)
        y_train_pred = svm.predict(X_train)
        accuracy_train = np.mean(y_train == y_train_pred)
        y_val_pred = svm.predict(X_val)
        accuracy_val = np.mean(y_val == y_val_pred)
        results[(rate, regular)] = (accuracy_train, accuracy_val)
        if best_val < accuracy_val:
            best_val = accuracy_val
            best_svm = svm
# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
        lr, reg, train_accuracy, val_accuracy))
print('best validation accuracy achieved during cross-validation: %f' % best_val)
lr 1.000000e-07 reg 2.500000e+04 train accuracy: 0.372857 val accuracy: 0.395000
lr 1.000000e-07 reg 5.000000e+04 train accuracy: 0.370347 val accuracy: 0.387000
lr 5.000000e-05 reg 2.500000e+04 train accuracy: 0.156204 val accuracy: 0.168000
lr 5.000000e-05 reg 5.000000e+04 train accuracy: 0.052796 val accuracy: 0.054000
best validation accuracy achieved during cross-validation: 0.395000
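The search above tries only two values per hyperparameter. One common alternative, sketched here as an assumption rather than what this post did, is to sample candidates uniformly in log space. The ranges are illustrative and are not tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)
num_trials = 5
# Illustrative ranges only: exponents are drawn uniformly, then used as powers of 10.
learning_rates = 10 ** rng.uniform(-8, -6, size=num_trials)
regularization_strengths = 10 ** rng.uniform(3, 5, size=num_trials)

for lr, reg in zip(learning_rates, regularization_strengths):
    print('lr %e reg %e' % (lr, reg))
```

Each sampled pair could then be passed to `svm.train` in place of the fixed grid entries.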
# Visualize the cross-validation results
import math
x_scatter = [math.log10(x[0]) for x in results]
y_scatter = [math.log10(x[1]) for x in results]
# plot training accuracy
marker_size = 100
colors = [results[x][0] for x in results]
plt.subplot(2, 1, 1)
plt.scatter(x_scatter, y_scatter, marker_size, c=colors, cmap=plt.cm.coolwarm)
plt.colorbar()
plt.xlabel('log learning rate')
plt.ylabel('log regularization strength')
plt.title('CIFAR-10 training accuracy')
# plot validation accuracy
colors = [results[x][1] for x in results] # default size of markers is 20
plt.subplot(2, 1, 2)
plt.scatter(x_scatter, y_scatter, marker_size, c=colors, cmap=plt.cm.coolwarm)
plt.colorbar()
plt.xlabel('log learning rate')
plt.ylabel('log regularization strength')
plt.title('CIFAR-10 validation accuracy')
plt.show()
# Evaluate the best svm on test set
y_test_pred = best_svm.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print('linear SVM on raw pixels final test set accuracy: %f' % test_accuracy)
linear SVM on raw pixels final test set accuracy: 0.357000
# Visualize the learned weights for each class.
# Depending on your choice of learning rate and regularization strength, these may
# or may not be nice to look at.
w = best_svm.W[:-1,:] # strip out the bias
w = w.reshape(32, 32, 3, 10)
w_min, w_max = np.min(w), np.max(w)
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in range(10):
    plt.subplot(2, 5, i + 1)
    # Rescale the weights to be between 0 and 255
    wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title(classes[i])