Cs231n作业:Q1-3 Softmax exercise
Softmax exercise
梯度计算——摘自https://blog.csdn.net/yc461515457/article/details/51924604
首先是给出Loss公式:
L
=
1
N
∑
i
L
i
+
λ
R
(
W
)
−
−
−
(
1
)
L = { \frac{1}{N} \sum_i L_i }+ { \lambda R(W) } ---(1)
L=N1i∑Li+λR(W)−−−(1)
共有N个样本,每个样本带来的Loss是L_i:
L
i
=
−
log
p
y
i
=
−
log
(
e
f
y
i
∑
j
e
f
j
)
=
−
f
y
i
+
log
∑
j
e
f
j
−
−
−
(
2
)
L_i =-\log{p_{y_i}}= -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) = -f_{y_i} + \log\sum_j e^{f_j} ---(2)
Li=−logpyi=−log(∑jefjefyi)=−fyi+logj∑efj−−−(2)
对
于
每
一
个
样
本
X
i
,
由
于
s
o
f
t
m
a
x
的
分
母
对
所
有
的
f
j
进
行
了
累
积
求
和
,
对于每一个样本X_i,由于softmax的分母对所有的f_j进行了累积求和,
对于每一个样本Xi,由于softmax的分母对所有的fj进行了累积求和,
所
以
L
i
对
W
的
导
数
对
W
的
每
一
列
都
有
贡
献
,
即
∂
L
i
W
j
对
所
有
的
j
都
不
为
0
:
所以L_i对W的导数对W的每一列都有贡献,即\frac{\partial{L_i}}{W_j}对所有的j都不为0:
所以Li对W的导数对W的每一列都有贡献,即Wj∂Li对所有的j都不为0:
当
j
!
=
y
i
时
:
∂
L
i
∂
W
j
=
e
f
j
∑
j
e
f
j
∂
f
j
∂
W
j
=
e
f
j
∑
j
e
f
j
X
i
T
−
−
−
(
3
)
当j != y_i时:\frac {\partial{L_i}} {\partial{W_j}} = \frac{e^{f_{j}}}{ \sum_j e^{f_j} } \frac{\partial f_j}{\partial W_j} = \frac{e^{f_{j}}}{ \sum_j e^{f_j} } X_i^T ---(3)
当j!=yi时:∂Wj∂Li=∑jefjefj∂Wj∂fj=∑jefjefjXiT−−−(3)
当
j
=
=
y
i
时
:
∂
L
i
∂
W
j
=
e
f
j
∑
j
e
f
j
∂
f
j
∂
W
j
=
e
f
j
∑
j
e
f
j
X
i
T
−
X
i
T
−
−
−
(
4
)
当j == y_i时:\frac {\partial{L_i}} {\partial{W_j}} = \frac{e^{f_{j}}}{ \sum_j e^{f_j} } \frac{\partial f_j}{\partial W_j} = \frac{e^{f_{j}}}{ \sum_j e^{f_j} } X_i^T - X_i^T---(4)
当j==yi时:∂Wj∂Li=∑jefjefj∂Wj∂fj=∑jefjefjXiT−XiT−−−(4)
这个练习类似于SVM练习。你会:
•为Softmax分类器实现一个全矢量化的损失函数
•实现其解析梯度的全矢量表达式
•检查您的实现与数值梯度
•使用验证集来调整学习速度和正则化强度
•使用SGD优化损失函数
•想象最终学习到的重量
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# for auto-reloading extenrnal modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
获取CIFAR10数据:
def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, num_dev=500):
"""
Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
it for the linear classifier. These are the same steps as we used for the
SVM, but condensed to a single function.
"""
# Load the raw CIFAR-10 data
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
# Cleaning up variables to prevent loading data multiple times (which may cause memory issue)
try:
del X_train, y_train
del X_test, y_test
print('Clear previously loaded data.')
except:
pass
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
# subsample the data(子样品的数据)
mask = list(range(num_training, num_training + num_validation)) # 49000-50000的1000个
X_val = X_train[mask] # 训练集后1000个,用于验证
y_val = y_train[mask]
mask = list(range(num_training)) # 49000
X_train = X_train[mask] # 训练集前49000个,用于训练
y_train = y_train[mask]
mask = list(range(num_test)) # 1000
X_test = X_test[mask] # 测试集前1000个,用于测试
y_test = y_test[mask]
mask = np.random.choice(num_training, num_dev, replace=False) # 从训练集0-49000中随机选500个数据作为开发集,并且不能重用元素
X_dev = X_train[mask]
y_dev = y_train[mask]
# Preprocessing: reshape the image data into rows(预处理,将数据图像重塑为行)
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_val = np.reshape(X_val, (X_val.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))
# Normalize the data: subtract the mean image(标准化数据:减去图像的平均值)
mean_image = np.mean(X_train, axis = 0) # 训练集每个特征求平均值
X_train -= mean_image
X_val -= mean_image
X_test -= mean_image
X_dev -= mean_image
# add bias dimension and transform into columns(添加偏差维度并转换为列)
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])
return X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev
# Invoke the above function to get our data.
X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev = get_CIFAR10_data()
print('Train data shape: ', X_train.shape)
print('Train labels shape: ', y_train.shape)
print('Validation data shape: ', X_val.shape)
print('Validation labels shape: ', y_val.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
print('dev data shape: ', X_dev.shape)
print('dev labels shape: ', y_dev.shape)
输出:
Train data shape: (49000, 3073)
Train labels shape: (49000,)
Validation data shape: (1000, 3073)
Validation labels shape: (1000,)
Test data shape: (1000, 3073)
Test labels shape: (1000,)
dev data shape: (500, 3073)
dev labels shape: (500,)
Softmax Classifier
Your code for this section will all be written inside cs231n/classifiers/softmax.py
.
打开文件cs231n/classifier /softmax.py
并实现softmax_loss_naive
函数。
from builtins import range
import numpy as np
from random import shuffle
from past.builtins import xrange
def softmax_loss_naive(W, X, y, reg):
"""
Softmax损失函数,简单的实现(带有循环)输入有维数D,有C类,我们操作的是小批量N的例子。
输入:
- W:一个包含权重的形状(D, C)的数字数组。
- X:形状(N, D)的数字数组,包含少量数据。
- y:包含训练标签的形状(N,)的numpy数组;y[i] = c表示
X[i]有标签c,其中0 <= c < c。
- reg:(float)正则化强度
返回一个元组:
-单浮子损失
-权值W的梯度;与W形状相同的数组
"""
# Initialize the loss and gradient to zero.
loss = 0.0
dW = np.zeros_like(W) # 3073x10
#############################################################################
# TODO: 使用显式循环计算软最大值损失及其梯度。 #
# 将损耗存储在损耗中,梯度存储在dW中。如果在这里不小心,很容易遇到数值不稳定。 #
# 不要忘记正则化! #
#############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
num_train = X.shape[0] # 获取样本数量500
num_class= W.shape[1] # 获取特征3073
for i in range(num_train):
scores_i = X[i].dot(W) # 1x3073 * 3073x10 = 1x10
exp_scores = np.exp(scores_i) # e^scores的数组 1x10
sum_scores = np.sum(exp_scores) # 求所有分数的总和 1x1
exp_scores /= sum_scores
correct_score = exp_scores[y[i]] # 获取真正标签所对应的分数 1x1
loss += -np.log(correct_score) # 计算该样本的损失
for j in range(num_class):
if j != y[i]:
dW[:,j] += exp_scores[j]*X[i]
else:
dW[:,j] += (exp_scores[y[i]]-1)*X[i]
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
loss /= num_train # 获取该样本的平均损失
dW /= num_train
loss += reg * np.sum(W * W) # 加入正则化,得到完整的损失函数
dW += reg * W
return loss, dW
# First implement the naive softmax loss function with nested loops.首先用嵌套循环实现朴素的softmax损失函数。
# Open the file cs231n/classifiers/softmax.py and implement the
# softmax_loss_naive function.
# 打开文件cs231n/classifier /softmax.py并实现
# softmax_loss_naive函数。
from cs231n.classifiers.softmax import softmax_loss_naive
import time
# Generate a random softmax weight matrix and use it to compute the loss.
# 生成一个随机的软最大权矩阵,并使用它来计算损失。
W = np.random.randn(3073, 10) * 0.0001
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)
# As a rough sanity check, our loss should be something close to -log(0.1).
# 作为一个粗略的完整性检查,我们的损失应该接近-log(0.1)。
print('loss: %f' % loss)
print('sanity check: %f' % (-np.log(0.1)))
输出:
loss: 2.354422
sanity check: 2.302585
lnline Question 1
为什么我们期望损失接近-log(0.1)呢?简要解释。
Your Answer:
由
公
式
:
L
i
=
−
log
(
e
s
y
i
∑
j
e
s
j
)
知
:
由公式:L_i = -\log\left(\frac{e^{s_{y_i}}}{ \sum_j e^{s_j} }\right) 知:
由公式:Li=−log(∑jesjesyi)知:因为是随机初始化并且类的总和是10,所以正确预测类数的可能是1/10,那么损失将为-log(0.1)。
# Complete the implementation of softmax_loss_naive and implement a (naive)
# version of the gradient that uses nested loops.
# 完成softmax_loss_naive的实现,并实现一个使用嵌套循环的渐变(naive)版本。
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)
# As we did for the SVM, use numeric gradient checking as a debugging tool.
# The numeric gradient should be close to the analytic gradient.
# 与我们对SVM所做的一样,使用数值梯度检查作为调试工具。数值梯度应接近解析梯度。
from cs231n.gradient_check import grad_check_sparse
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)
# similar to SVM case, do another gradient check with regularization
# 与SVM的情况类似,用正则化方法再做一次梯度检验
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 5e1)
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 5e1)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)
与我们对SVM所做的一样,使用数值梯度检查作为调试工具。数值梯度应接近解析梯度。
与SVM的情况类似,用正则化方法再做一次梯度检验:
numerical: 1.895536 analytic: 1.895536, relative error: 4.075329e-08
numerical: 1.145960 analytic: 1.145960, relative error: 7.754185e-09
numerical: 0.520816 analytic: 0.520816, relative error: 1.751680e-09
numerical: 0.019831 analytic: 0.019830, relative error: 1.837633e-06
numerical: -1.927560 analytic: -1.927560, relative error: 3.272616e-09
numerical: -2.331816 analytic: -2.331816, relative error: 1.606302e-08
numerical: 1.795485 analytic: 1.795485, relative error: 4.357714e-08
numerical: 2.946119 analytic: 2.946119, relative error: 2.266958e-08
numerical: 0.776307 analytic: 0.776307, relative error: 3.997289e-09
numerical: 2.914070 analytic: 2.914070, relative error: 6.318577e-09
numerical: -0.553924 analytic: -0.549033, relative error: 4.434225e-03
numerical: 0.000276 analytic: -0.001898, relative error: 1.000000e+00
numerical: 0.949624 analytic: 0.939115, relative error: 5.563964e-03
numerical: 1.651383 analytic: 1.655799, relative error: 1.335457e-03
numerical: -0.736941 analytic: -0.741462, relative error: 3.057579e-03
numerical: 2.944667 analytic: 2.947244, relative error: 4.372780e-04
numerical: 3.311702 analytic: 3.317417, relative error: 8.620659e-04
numerical: -1.625303 analytic: -1.628303, relative error: 9.221845e-04
numerical: 1.819605 analytic: 1.810454, relative error: 2.520883e-03
numerical: 0.428718 analytic: 0.432026, relative error: 3.843106e-03
既然我们已经有了softmax loss函数及其梯度的一个简单实现,那么就用softmax_loss_vectorized
实现一个矢量化的版本。这两个版本应该计算相同的结果,但是矢量化版本应该更快。
def softmax_loss_vectorized(W, X, y, reg):
"""
Softmax loss function, vectorized version.
Inputs and outputs are the same as softmax_loss_naive.
"""
# Initialize the loss and gradient to zero.
loss = 0.0
dW = np.zeros_like(W)
num_train = X.shape[0]
#############################################################################
# TODO: Compute the softmax loss and its gradient using no explicit loops. #
# Store the loss in loss and the gradient in dW. If you are not careful #
# here, it is easy to run into numeric instability. Don't forget the #
# regularization! #
#############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
scores = X.dot(W) # 计算分数矩阵。 500x3073 * 3073x10 = 500x10
exp_scores = np.exp(scores) # e^scores. 500x10
sum_scores = np.sum(exp_scores, axis = 1) # 各个类所对应的分数总和。 (500,)
exp_scores /= sum_scores[:,np.newaxis] # 标准化后的概率 500x10
loss_matrix = -np.log(exp_scores[range(num_train),y]) # 500,
loss += np.sum(loss_matrix)
exp_scores[range(num_train),y] -= 1 # 取正确标签处,减一。注:必须使用range(num_train)而非:
dW += np.dot(X.T, exp_scores) # 3073x500 * 500x10 = 3073x10
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
loss /= num_train
loss += reg * np.sum(W*W)
dW /= num_train
dW +=reg *W
return loss, dW
# Now that we have a naive implementation of the softmax loss function and its gradient,
# implement a vectorized version in softmax_loss_vectorized.
# The two versions should compute the same results, but the vectorized version should be
# much faster.
tic = time.time()
loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('naive loss: %e computed in %fs' % (loss_naive, toc - tic))
from cs231n.classifiers.softmax import softmax_loss_vectorized
tic = time.time()
loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic))
# As we did for the SVM, we use the Frobenius norm to compare the two versions
# of the gradient.
# 和SVM一样,我们使用Frobenius范数来比较两个版本的梯度。
grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print('Loss difference: %f' % np.abs(loss_naive - loss_vectorized))
print('Gradient difference: %f' % grad_difference)
输出:
和SVM一样,我们使用Frobenius范数来比较两个版本的梯度。
naive loss: 2.354422e+00 computed in 0.128090s
vectorized loss: 2.354422e+00 computed in 0.005005s
Loss difference: 0.000000
Gradient difference: 0.000000
使用验证集来调优超参数(正则化强度和学习率)。您应该尝试不同的学习速率和正则化强度范围;如果您足够小心,您应该能够在验证集上获得超过0.35的分类精度。
# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of over 0.35 on the validation set.
from cs231n.classifiers import Softmax
results = {}
best_val = -1
best_softmax = None
learning_rates = [1e-7, 5e-7]
regularization_strengths = [2.5e4, 5e4]
################################################################################
# TODO: #
# Use the validation set to set the learning rate and regularization strength. #
# This should be identical to the validation that you did for the SVM; save #
# the best trained softmax classifer in best_softmax. #
# 使用验证集设置学习速率和正则化强度。这应该与支持向量机的验证相同; #
# 保存最好的训练有素的softmax分类器在best_softmax。 #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
for learning_rate in learning_rates:
for regularization_strength in regularization_strengths:
softmax = Softmax()
loss_history = softmax.train(X_train, y_train,
learning_rate=learning_rate,
reg=regularization_strength,
num_iters=1500,verbose=True)
y_train_pred = softmax.predict(X_train)
train_acc = np.mean(y_train == y_train_pred)
y_val_pred = softmax.predict(X_val)
val_acc = np.mean(y_val == y_val_pred)
if val_acc > best_val:
best_val = val_acc
best_softmax = softmax
results[(learning_rate, regularization_strength)] = [train_acc, val_acc]
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# Print out results.
for lr, reg in sorted(results):
train_accuracy, val_accuracy = results[(lr, reg)]
print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
lr, reg, train_accuracy, val_accuracy))
print('best validation accuracy achieved during cross-validation: %f' % best_val)
输出:
lr 1.000000e-07 reg 2.500000e+04 train accuracy: 0.350510 val accuracy: 0.362000
lr 1.000000e-07 reg 5.000000e+04 train accuracy: 0.329510 val accuracy: 0.337000
lr 5.000000e-07 reg 2.500000e+04 train accuracy: 0.343816 val accuracy: 0.365000
lr 5.000000e-07 reg 5.000000e+04 train accuracy: 0.317388 val accuracy: 0.324000
best validation accuracy achieved during cross-validation: 0.365000
在测试集上评估最好的softmax:
# evaluate on test set
# Evaluate the best softmax on test set
y_test_pred = best_softmax.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print('softmax on raw pixels final test set accuracy: %f' % (test_accuracy, ))
输出:
softmax on raw pixels final test set accuracy: 0.351000
lnline Question 2-Ture or False
假设总体训练损失定义为所有训练示例中每个数据点损失的和。可以在训练集中添加一个新的数据点,使SVM的损失保持不变,但Softmax分类器的损失不是这样。
Your Answer: 略… Orz
将每节课所学的权重形象化:
# Visualize the learned weights for each class
w = best_softmax.W[:-1,:] # strip out the bias
w = w.reshape(32, 32, 3, 10)
w_min, w_max = np.min(w), np.max(w)
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in range(10):
plt.subplot(2, 5, i + 1)
# Rescale the weights to be between 0 and 255
wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
plt.imshow(wimg.astype('uint8'))
plt.axis('off')
plt.title(classes[i])
输出: