Stanford CS231n Assignment 1 (Training a kNN Classifier)

The kNN classifier consists of two stages:
•During training, the classifier takes the training data and simply remembers it
•During testing, kNN classifies every test image by comparing it to all training images and transferring the labels of the k most similar training examples (this two-stage interface is sketched in code below)
•The value of k is cross-validated
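As a quick orientation before the notebook code, here is a bare-bones sketch of that train/predict interface. This is a hypothetical MinimalKNN class for illustration only, not the assignment's KNearestNeighbor, which is implemented piece by piece later in this post:

import numpy as np

class MinimalKNN:
    """Bare-bones kNN sketch; assumes integer class labels."""
    def train(self, X, y):
        # "Training" just memorizes the data.
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1):
        y_pred = np.zeros(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            # L2 distance from test example i to every training example.
            dists = np.sqrt(np.sum((self.X_train - X[i]) ** 2, axis=1))
            # The k nearest labels vote; np.argmax breaks ties toward smaller labels.
            closest_y = self.y_train[np.argsort(dists)[:k]]
            y_pred[i] = np.argmax(np.bincount(closest_y))
        return y_pred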

# Import the necessary libraries
from __future__ import print_function

import random
import numpy as np
import matplotlib.pyplot as plt
from cs231n.data_utils import load_CIFAR10

# Notebook magic: show plots inline
%matplotlib inline
# Set the default figure size and image display options
plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# Notebook magic: auto-reload edited modules, so a changed function can be
# used without rerunning the notebook from scratch
%load_ext autoreload
%autoreload 2

# Load the CIFAR-10 dataset
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
# Clean up variables to avoid loading the data more than once (which can cause memory issues)
try:
    del X_train, y_train
    del X_test, y_test
    print('Clear previously loaded data.')
except NameError:
    pass
# Load the data and print the sizes of the training and test sets
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

Output: 32×32 images with 3 color channels.
[Figure: printed shapes of the training and test data and labels]

# Visualize a few examples from the dataset: some training images from each class
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()

Some visualized examples from the ten classes (partial screenshot):
[Figure: grid of sample training images per class]

# Subsample the data so the code runs more efficiently
num_training = 5000  # training set size
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500  # test set size
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]
# Reshape the image data into rows: each 32x32x3 image is flattened into a single row of 3072 values
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)
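As a quick sanity check (given the subsample sizes chosen above), the printed shapes should now be two-dimensional:

# Each image is now a single row of 32*32*3 = 3072 values.
assert X_train.shape == (5000, 3072)
assert X_test.shape == (500, 3072)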

from cs231n.classifiers import KNearestNeighbor
# Create a kNN classifier instance
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)

# Remember that training a kNN classifier is a no-op: the classifier simply remembers the data and does no further processing

Now that we have a classifier instance, we want to classify the test data with the kNN classifier. We can break this process into two steps:
1. First, compute the distances between all test examples and all training examples.
2. Given these distances, for each test example find the k nearest training examples and have them vote for the label.

First, open cs231n/classifiers/k_nearest_neighbor.py and implement the function compute_distances_two_loops, which uses a (very inefficient) double loop over all (test, train) pairs and computes the distance matrix one element at a time.
Compute the distance matrix between all training and test examples: if there are Ntr training examples and Nte test examples, this stage should produce an Nte x Ntr matrix where each element (i, j) is the distance between the i-th test example and the j-th training example.

# Open cs231n/classifiers/k_nearest_neighbor.py and implement compute_distances_two_loops.

for i in range(num_test):
    for j in range(num_train):
        diff = X[i] - self.X_train[j]  # (x1-y1, x2-y2, x3-y3, ..., xd-yd)
        diff_2 = diff ** 2             # ((x1-y1)^2, (x2-y2)^2, ..., (xd-yd)^2)
        d = np.sqrt(np.sum(diff_2))    # sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xd-yd)^2)
        dists[i, j] = d

# -- The 6 lines above are the code you are asked to implement yourself
# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)

Test output: the printed shape is (500, 5000).
[Figure: printed shape of the distance matrix (screenshot partially cut off)]

# Visualize the distance matrix: each row is a single test example and
# its distances to all training examples
plt.imshow(dists, interpolation='none')
plt.show()

[Figure: visualization of the distance matrix]
Inline Question #1: Notice the structured patterns in the distance matrix, where some rows or columns are visibly brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)
•What in the data is the cause behind the distinctly bright rows?
•What causes the columns?

Answer: Brighter means farther. If an entire row is bright, that test example is dissimilar to all of the training data; if an entire column is bright, that training example is dissimilar to all of the test data.

# Now implement the function predict_labels and run the code below, using k = 1 (i.e., the Nearest Neighbor classifier).
y_test_pred = classifier.predict_labels(dists, k=1)
# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

Predicted accuracy: about 27%, which is reasonable for such a simple kNN algorithm.
[Figure: accuracy output for k=1]
With k = 5:

y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

The accuracy does not improve by much:
[Figure: accuracy output for k=5]
Inline Question 2: We can also use other distance metrics such as the L1 distance. The performance of a Nearest Neighbor classifier that uses L1 distance will not change if (select all that apply):
1.The data is preprocessed by subtracting the mean.
2.The data is preprocessed by subtracting the mean and dividing by the standard deviation.
3.The coordinate axes for the data are rotated.
4.None of the above.
Answer: 1 and 2. Subtracting the mean leaves every pairwise difference unchanged, and dividing by the (shared) standard deviation only scales all distances by the same constant, so the nearest-neighbor ordering is preserved; the form of the distance formula does not change.
Rotation does change the result, because L1 distance is not rotation-invariant.
[Figure: illustration of L1 distances changing under a 45° rotation]
For example, let the black x, y, z be three points on orthogonal coordinate axes with |zx| = |zy| = a. After rotating 45° about the origin we get the red x, y, z, and clearly the lengths of zx and zy have changed under the L1 metric.
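A quick numerical check of the rotation claim (a minimal numpy sketch, not part of the assignment):

import numpy as np

# Two points and a 45-degree rotation matrix.
p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.sum(np.abs(p - q)))          # L1 before rotation: 2.0
print(np.sum(np.abs(R @ p - R @ q)))  # L1 after rotation: ~1.414 (changed)
print(np.linalg.norm(p - q))          # L2 before rotation: ~1.414
print(np.linalg.norm(R @ p - R @ q))  # L2 after rotation: ~1.414 (unchanged)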

# Now use partial vectorization to speed up the distance computation,
# with only one loop. Implement compute_distances_one_loop and run the code below:
# One-loop version: no loop over the training set, only over the test examples
for i in range(num_test):
    # Subtract test example i from the entire training matrix; the shapes differ,
    # so this relies on broadcasting to expand the vector across the matrix
    diff = self.X_train - X[i]
    dist = np.sum(diff ** 2, axis=1)  # sum each row: (x1-y1)^2 + (x2-y2)^2 + ... + (xd-yd)^2
    dists[i, :] = np.sqrt(dist)       # square root of the sum of squares

# -- The loop above is the code you are asked to implement yourself
dists_one = classifier.compute_distances_one_loop(X_test)
# To make sure the vectorized implementation is correct, check that it agrees
# with the naive implementation. There are many ways to decide whether two
# matrices are similar; one of the simplest is the Frobenius norm.
# The Frobenius norm of the difference of two matrices is the square root of
# the sum of squared differences of all elements; in other words, reshape the
# matrices into vectors and compute the Euclidean distance between them.
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

If the vectorization is correct, it should agree with the two-loop result, so the difference should be 0:
[Figure: Frobenius-norm difference output]
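As a small aside, the "reshape into vectors" description of the Frobenius norm above can be checked directly (a throwaway sketch, not assignment code):

# The Frobenius norm of a matrix equals the Euclidean norm of its flattened form.
A = np.random.randn(4, 5)
B = np.random.randn(4, 5)
print(np.linalg.norm(A - B, ord='fro'))  # Frobenius norm of the difference
print(np.linalg.norm((A - B).ravel()))   # Euclidean norm of the flattened difference: same value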

# Now implement the fully vectorized version in compute_distances_no_loops and run the code

# Fully vectorized, using the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x.y
train_sq = np.sum(self.X_train ** 2, axis=1, keepdims=True)          # (m, 1)
train_sq = np.broadcast_to(train_sq, shape=(num_train, num_test)).T  # (n, m)
test_sq = np.sum(X ** 2, axis=1, keepdims=True)                      # (n, 1)
test_sq = np.broadcast_to(test_sq, shape=(num_test, num_train))      # (n, m)
cross = np.dot(X, self.X_train.T)                                    # (n, m)
dists = np.sqrt(train_sq + test_sq - 2 * cross)  # (x1^2+x2^2) + (y1^2+y2^2) - 2*(x1*y1+x2*y2)

# -- The 6 lines above are the code you implement yourself

dists_two = classifier.compute_distances_no_loops(X_test)
# Check that this distance matrix agrees with the one we computed before:
difference = np.linalg.norm(dists - dists_two, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')
   

# Compare how fast the two-loop, one-loop, and fully vectorized implementations are
def time_function(f, *args):
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)


one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)


no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

[Figure: timing results for the three implementations]
Note that the one-loop version can actually be slower than the two-loop version: its computation leans heavily on broadcasting, which allocates a large temporary array on every iteration.
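To see the scale of those temporaries, a back-of-the-envelope sketch (assuming float64 data):

# The broadcasted difference self.X_train - X[i] materializes a full
# (num_train, dim) array on every iteration of the one-loop version.
num_train, dim = 5000, 32 * 32 * 3
print(num_train * dim * 8 / 1e6, 'MB per iteration')  # ~122.88 MB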

Cross-validation: use cross-validation to determine the best value of the hyperparameter k for kNN.

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = []
y_train_folds = []

# Core of predict_labels, run for each test example i:
test_dist = dists[i]                 # distances from test example i to the whole training set
sort_dist = np.argsort(test_dist)    # indices sorted by increasing distance; the first is the nearest
valid_idx = sort_dist[:k]            # take the first k
closest_y = self.y_train[valid_idx]  # labels of the k nearest neighbors
y_pred[i] = np.argmax(np.bincount(closest_y))  # majority vote; ties break toward the smaller label
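The notebook cell that actually splits the folds and collects accuracies is not reproduced above; a minimal sketch of it (assuming np.array_split and the assignment's KNearestNeighbor class) could look like this, producing the k_to_accuracies dictionary used below:

# Split the training data into num_folds folds.
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)

# k_to_accuracies[k] will hold one validation accuracy per fold.
k_to_accuracies = {}
for k in k_choices:
    k_to_accuracies[k] = []
    for fold in range(num_folds):
        # Hold out one fold for validation, train on the rest.
        X_val, y_val = X_train_folds[fold], y_train_folds[fold]
        X_tr = np.concatenate(X_train_folds[:fold] + X_train_folds[fold + 1:])
        y_tr = np.concatenate(y_train_folds[:fold] + y_train_folds[fold + 1:])

        classifier = KNearestNeighbor()
        classifier.train(X_tr, y_tr)
        y_val_pred = classifier.predict(X_val, k=k)
        k_to_accuracies[k].append(np.mean(y_val_pred == y_val))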

for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))


# Plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# Plot the trend line with error bars corresponding to the standard deviation
accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()



# Based on the cross-validation results above, choose the best value of k,
# retrain the classifier on all of the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.
best_k = 1
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))