k-Nearest Neighbor classifier
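The kNN classifier consists of two stages:
During training, the classifier takes the training data and simply memorizes it.
During testing, kNN classifies each test image by comparing it with all training images and transferring the labels of the k most similar training examples.
Note: the value of k is chosen by cross-validation.
Goal: implement both the training and testing stages, understand the basic image classification pipeline and cross-validation, and gain proficiency in writing efficient, vectorized code.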
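The note that k is chosen by cross-validation can be illustrated with a small sketch. This is not the notebook's own code: `knn_predict`, `cross_validate_k`, and the toy two-blob data are hypothetical names and values introduced here, and the compact predictor is only a stand-in for a full classifier.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Compact kNN stand-in: squared Euclidean distance plus majority vote."""
    # Pairwise squared distances via broadcasting, shape (num_query, num_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest points
    votes = y_train[nearest]                  # labels of those points
    # bincount + argmax returns the most frequent label (smallest label on ties).
    return np.array([np.bincount(row).argmax() for row in votes])

def cross_validate_k(X, y, k_choices, num_folds=5):
    """Return {k: mean validation accuracy} over num_folds folds."""
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    results = {}
    for k in k_choices:
        accs = []
        for i in range(num_folds):
            # Fold i is held out for validation; the rest is used for training.
            X_val, y_val = X_folds[i], y_folds[i]
            X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
            y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
            accs.append(np.mean(knn_predict(X_tr, y_tr, X_val, k) == y_val))
        results[k] = float(np.mean(accs))
    return results

# Toy data (made up for illustration): two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
perm = rng.permutation(len(y))   # shuffle so every fold contains both classes
X, y = X[perm], y[perm]

scores = cross_validate_k(X, y, k_choices=[1, 3, 5])
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Each candidate k is scored on data it was not fitted to, and the k with the highest mean validation accuracy would then be used on the test set.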
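I. How kNN works
1) Compute the distance between every test image and every training image over their pixel values; this article uses the Euclidean (L2) distance;
2) Sort the distances and, for each test image, take the class labels of the k training images closest to it;
3) Vote among these k labels and choose the most frequent class as the final prediction. When k=1, closest_y=y_pred, i.e. the prediction is simply the label of the nearest training image.
kNN is a voting scheme: it follows majority rule and classifies a sample by the labels of its nearest neighbors, which makes it a local approximation method.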
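The three steps above can be sketched for a single test sample as follows. This is a minimal illustration, not the assignment's reference solution: `knn_predict_one`, `X_toy`, and `y_toy` are hypothetical names, and breaking ties toward the smaller label is an assumption made here.

```python
import numpy as np

def knn_predict_one(X_train, y_train, x, k=1):
    """Predict the label of one flattened sample x with kNN."""
    # Step 1: Euclidean distance from x to every training sample.
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Step 2: labels of the k training samples with the smallest distance.
    closest_y = y_train[np.argsort(dists)[:k]]
    # Step 3: majority vote; argmax breaks ties toward the smaller label.
    return int(np.bincount(closest_y).argmax())

# Tiny made-up dataset: two points per class in 2-D.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 1, 1])

print(knn_predict_one(X_toy, y_toy, np.array([0.2, 0.1]), k=1))  # 0
print(knn_predict_one(X_toy, y_toy, np.array([5.5, 5.2]), k=3))  # 1
```

With k=1 the vote has a single participant, so the prediction equals the label of the nearest training sample.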
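Advantages:
1. Simple (there is almost no training; all computation happens at test time);
2. Suitable when the samples cannot all be obtained at once;
3. kNN classifies by the labels of nearby samples, so it suits cases where the classes intersect or overlap heavily;
Disadvantages:
1. Testing is slow, since the distance from the test sample to every training sample must be computed; samples with little influence on the classification result therefore need to be removed beforehand;
2. There is no probability score; the decision is based only on sample labels;
3. When class sizes differ greatly, the larger class has more influence on the kNN decision, which can cause misclassification;
4. It performs poorly on high-dimensional problems, where distances become less informative.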
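The cost in disadvantage 1 comes from computing every test-to-train distance. One way to reduce the overhead, in line with the goal of writing vectorized code, is to replace the Python loops with matrix operations using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b. The sketch below compares both versions on random data; the function names are hypothetical and not taken from the assignment.

```python
import numpy as np

def dists_two_loops(X_test, X_train):
    """Euclidean distances with explicit loops, shape (num_test, num_train)."""
    num_test, num_train = X_test.shape[0], X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            dists[i, j] = np.sqrt(np.sum((X_test[i] - X_train[j]) ** 2))
    return dists

def dists_no_loops(X_test, X_train):
    """Same distances computed without Python loops."""
    test_sq = np.sum(X_test ** 2, axis=1, keepdims=True)  # (num_test, 1)
    train_sq = np.sum(X_train ** 2, axis=1)                # (num_train,)
    cross = X_test @ X_train.T                             # (num_test, num_train)
    sq = test_sq + train_sq - 2.0 * cross                  # broadcasts to full matrix
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives caused by rounding

# Random data, made up for the comparison.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(20, 8))
X_te = rng.normal(size=(5, 8))

same = np.allclose(dists_two_loops(X_te, X_tr), dists_no_loops(X_te, X_tr))
print(same)  # True
```

The vectorized version still computes all pairwise distances, so it lowers the constant factor rather than the number of comparisons.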
II. kNN code implementation
2.1
# Run some setup code for this notebook.
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt
# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
# 2.1.1
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
# 2.1.2
%load_ext autoreload
%autoreload 2
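2.1.1 plt.rcParams
Purpose: sets matplotlib's configuration parameters.
2.1.2 autoreload
Purpose: during debugging, if the code is updated, modules imported in IPython are reloaded automatically as well.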
2.2
# Load the raw CIFAR-10 data.
# 2.2.1
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
# Cleaning up variables to prevent loading data multiple times (which may cause memory issue)
# 2.2.2
try:
    del X_train, y_train
    del X_test, y_test
    print('Clear previously loaded data.')
except NameError:
    # The variables do not exist yet on the first run; nothing to clear.
    pass
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
# As a sanity check, we print out the size of the training and test data.
# 2.2.3
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
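2.2.1 Load the dataset
2.2.2 Clear the variables to avoid loading the data multiple times
2.2.3 Print the sizes of the training and test data and labels as a check
Output: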
Training data shape: (50000, 32, 32, 3)
Training labels shape: (50000,)
Test data shape: (10000, 32, 32, 3)
Test labels shape: (10000,)
2.3
# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes) # number of classes (10)
samples_per_class = 7 # show seven example images per class
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y) #2.3.1
    idxs = np.random.choice(idxs, samples_per_class, replace=False) #2.3.2
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()
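Visualize example images
2.3.1 np.flatnonzero
Returns the positions of the non-zero elements of the flattened array: here, the positions in the labels that belong to class y
2.3.2 np.random.choice
Randomly picks values from the array idxs, 7 per class
replace=False: sample without replacement, so no index is picked twice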
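The two calls described above can be tried on a small label array. The labels, the seed, and the sample size of 3 below are made up for illustration.

```python
import numpy as np

# Made-up label vector; in the notebook this role is played by y_train.
y_demo = np.array([0, 2, 1, 2, 0, 2, 1, 2])

# y_demo == 2 is a boolean array; flatnonzero returns the indices where it is True.
idxs = np.flatnonzero(y_demo == 2)
print(idxs)  # [1 3 5 7]

# Draw 3 of those indices without replacement, so no index repeats.
np.random.seed(0)
picked = np.random.choice(idxs, 3, replace=False)
print(picked)
```

With replace=False the requested sample size must not exceed the number of available indices, otherwise np.random.choice raises a ValueError.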
2.4
# Subsample the data for more efficient code execution in this exercise
#2.4.1
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]
num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]
# Reshape the image data into rows
#2.4.2
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)
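The reshape step flattens each 32x32x3 image into one row of 32 * 32 * 3 = 3072 values, so after subsampling the shapes printed above are (5000, 3072) and (500, 3072). A small sketch on a made-up array (`X_demo` is a hypothetical name) shows what the -1 does:

```python
import numpy as np

# Five fake "images" with the CIFAR-10 layout (height, width, channels).
X_demo = np.arange(5 * 32 * 32 * 3).reshape(5, 32, 32, 3)

# Keep the first axis and let NumPy infer the remaining length from -1.
X_rows = np.reshape(X_demo, (X_demo.shape[0], -1))
print(X_rows.shape)  # (5, 3072)

# Each row is the corresponding image flattened in row-major order.
print(np.array_equal(X_rows[0], X_demo[0].ravel()))  # True
```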
2.4.1 Subsampling
We use