Understanding the KNN Algorithm

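KNN consists of a training phase and a classification phase. Training simply stores the training set. At classification time, each test image is compared against every image in the training set, and the label is taken from the closest match (for k = 1) or from the k closest matches.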
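
To make the comparison step concrete, here is a minimal sketch of a single nearest-neighbor lookup using the L1 (Manhattan) distance. The toy arrays and the names X_train, y_train and x_query are assumptions chosen for illustration, not CIFAR-10 data.

```python
import numpy as np

# Toy "training set": three 2-D points with labels 0, 1, 2 (illustrative values)
X_train = np.array([[0, 0], [10, 10], [0, 9]])
y_train = np.array([0, 1, 2])

# One query point to classify
x_query = np.array([1, 8])

# L1 distance from the query to every stored training point
distances = np.sum(np.abs(X_train - x_query), axis=1)

# The nearest training point supplies the predicted label
nearest = np.argmin(distances)
print(distances, y_train[nearest])  # [ 9 11  2] 2
```

With k greater than 1, the single argmin would be replaced by taking the k smallest distances and voting over their labels.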

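If the dataset is large, split the training data into two parts: a small part serves as a validation set (a stand-in test set), and the rest remains the training set (typically 70%-90% of the data; the exact share depends on how many hyperparameters need tuning, and more hyperparameters call for a larger validation set). The validation set is used to tune hyperparameters. If the dataset is small, tune them with cross-validation instead. Cross-validation is more expensive: in K-fold cross-validation, a larger K generally gives a more reliable estimate, but the cost grows with K.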
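
A hold-out split of this kind might be sketched as follows. The helper name split_train_val, the 20% validation fraction, and the fixed seed are assumptions for illustration. The right fraction depends on the data and on how many hyperparameters are being tuned.

```python
import numpy as np

def split_train_val(X, y, val_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction of them for validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])           # random order of sample indices
    num_val = int(round(X.shape[0] * val_fraction))
    val_idx, train_idx = perm[:num_val], perm[num_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Toy data: 10 samples with 2 features each (illustrative values)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr, y_tr, X_val, y_val = split_train_val(X, y, val_fraction=0.2)
print(X_tr.shape, X_val.shape)  # (8, 2) (2, 2)
```

Shuffling before splitting avoids a validation set that only contains whatever samples happen to come first in the file.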
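Classification decision
Count how many of the K neighbors belong to each class and assign the test sample to the class with the highest count. In other words, the class of an input instance is decided by the majority class among its K nearest training instances.
Common decision rules:
Majority voting: works like everyday voting, where the minority yields to the majority. This is the most commonly used rule.

Weighted voting: used when votes should not count equally, much as a judge's vote may carry more weight than an ordinary voter's. When the data points carry weights, weighted voting is generally the better choice.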
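
The two rules can be compared in a short sketch. The text does not say how the weights are chosen, so inverse-distance weighting is used here as one common option. The function names majority_vote and weighted_vote and the eps parameter are assumptions for illustration.

```python
import numpy as np

def majority_vote(labels):
    """Return the label that occurs most often among the neighbors."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

def weighted_vote(labels, distances, eps=1e-8):
    """Return the label with the largest total weight, using weight = 1 / distance."""
    weights = 1.0 / (np.asarray(distances, dtype=float) + eps)  # eps avoids division by zero
    values, inverse = np.unique(labels, return_inverse=True)
    totals = np.bincount(inverse, weights=weights)              # summed weight per label
    return values[np.argmax(totals)]

# Three neighbors: two far ones of class 0, one very close one of class 1
labels = np.array([0, 0, 1])
distances = np.array([4.0, 5.0, 0.5])
print(majority_vote(labels))             # 0
print(weighted_vote(labels, distances))  # 1
```

The example is built so that the two rules disagree: by count class 0 wins, but the single class-1 neighbor is much closer and dominates the weighted total.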

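Advantages:
The selected neighbors are all objects that have already been correctly classified.
KNN itself is simple: the classifier needs no fitting step, so training costs essentially nothing beyond storing the data. The cost of classifying a sample grows in proportion to the number of training samples.
For samples whose class regions intersect or overlap heavily, KNN tends to be better suited than other methods.
Disadvantages:
When the class distribution is imbalanced, correct classification becomes difficult.
The computation is heavy, because every prediction requires the distance from the test sample to every stored sample.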
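
The total amount of distance work cannot be avoided in brute-force KNN, but the Python-level loop over test samples can. The sketch below computes all test-to-train distances in one vectorized step. Note that it uses the L2 (Euclidean) distance rather than the L1 distance used elsewhere in this post, because L2 expands into matrix products. The function name pairwise_l2 is an assumption for illustration.

```python
import numpy as np

def pairwise_l2(X_test, X_train):
    """L2 distance between every test row and every training row, without loops.

    Uses ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    """
    test_sq = np.sum(X_test ** 2, axis=1)[:, np.newaxis]     # shape (num_test, 1)
    train_sq = np.sum(X_train ** 2, axis=1)[np.newaxis, :]   # shape (1, num_train)
    cross = X_test @ X_train.T                               # shape (num_test, num_train)
    sq = np.maximum(test_sq - 2.0 * cross + train_sq, 0.0)   # clip tiny negatives from rounding
    return np.sqrt(sq)

X_train = np.array([[0.0, 0.0], [3.0, 4.0]])
X_test = np.array([[0.0, 0.0], [3.0, 0.0]])
dists = pairwise_l2(X_test, X_train)
print(dists)
```

The result has one row per test sample and one column per training sample, so the memory needed grows with the product of the two set sizes.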

Python implementation:

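import numpy as np

class kNearestNeighbor:
    def __init__(self):
        pass

    def train(self, X, y):
        # KNN "training" only memorizes the data
        self.Xtr = X
        self.ytr = y

    def predict(self, X, k=1):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # L1 (Manhattan) distance from test sample i to every training sample
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            # labels of the k nearest samples, read from the stored labels
            # (not from a global y_train, which may not match self.Xtr)
            closest_y = self.ytr[np.argsort(distances)[:k]]
            # majority vote among those labels
            u, indices = np.unique(closest_y, return_inverse=True)
            Ypred[i] = u[np.argmax(np.bincount(indices))]
        return Ypred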

load_CIFAR_batch() and load_CIFAR10() load the CIFAR-10 dataset.

import pickle

def load_CIFAR_batch(filename):
    """ load single batch of cifar """
    with open(filename, 'rb') as f:
        datadict = pickle.load(f, encoding='latin1')
        X = datadict['data']
        Y = datadict['labels']
        # each row holds 3 x 32 x 32 values; reorder to (N, 32, 32, 3)
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y

import os

def load_CIFAR10(ROOT):
    """ load all of cifar """
    xs = []
    ys = []
    for b in range(1, 6):
        f = os.path.join(ROOT, 'data_batch_%d' % (b))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)  # stack the five batches along the first axis
    Ytr = np.concatenate(ys)
    del X, Y
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte

Xtr, Ytr, Xte, Yte = load_CIFAR10('cifar10')
# flatten each 32x32x3 image into one row of 3072 values
Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3)
Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3)
# The dataset is fairly large and runs slowly on a PC, so use 5000 training samples and 500 test samples
num_training = 5000
num_test = 500
x_train = Xtr_rows[:num_training, :]
y_train = Ytr[:num_training]

x_test = Xte_rows[:num_test, :]
y_test = Yte[:num_test]
knn = kNearestNeighbor()
knn.train(x_train, y_train)
y_predict = knn.predict(x_test, k=7)
acc = np.mean(y_predict == y_test)
print('accuracy : %f' % (acc))
accuracy : 0.302000
# Which value of k gives the best result? Cross-validation can answer this; 5-fold cross-validation is used here
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

x_train_folds = np.array_split(x_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)

k_to_accuracies = {}

for k_val in k_choices:
    print('k = ' + str(k_val))
    k_to_accuracies[k_val] = []
    for i in range(num_folds):
        # train on every fold except fold i, validate on fold i
        x_train_cycle = np.concatenate([f for j, f in enumerate(x_train_folds) if j != i])
        y_train_cycle = np.concatenate([f for j, f in enumerate(y_train_folds) if j != i])
        x_val_cycle = x_train_folds[i]
        y_val_cycle = y_train_folds[i]
        knn = kNearestNeighbor()
        knn.train(x_train_cycle, y_train_cycle)
        y_val_pred = knn.predict(x_val_cycle, k_val)
        num_correct = np.sum(y_val_cycle == y_val_pred)
        k_to_accuracies[k_val].append(float(num_correct) / float(len(y_val_cycle)))

k = 1
k = 3
k = 5
k = 8
k = 10
k = 12
k = 15
k = 20
k = 50
k = 100
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (int(k), accuracy))

Note: the per-fold accuracies listed below come from a run in which predict() read labels from the global y_train instead of the labels passed to train(). Inside the cross-validation loop the two line up only for the last fold, which is why the first fold sits near chance level (about 0.1) and accuracy climbs from fold to fold. When predict() reads the stored labels instead, expect different and more uniform numbers.
k = 1, accuracy = 0.098000
k = 1, accuracy = 0.148000
k = 1, accuracy = 0.205000
k = 1, accuracy = 0.233000
k = 1, accuracy = 0.308000
k = 3, accuracy = 0.089000
k = 3, accuracy = 0.142000
k = 3, accuracy = 0.215000
k = 3, accuracy = 0.251000
k = 3, accuracy = 0.296000
k = 5, accuracy = 0.096000
k = 5, accuracy = 0.176000
k = 5, accuracy = 0.240000
k = 5, accuracy = 0.284000
k = 5, accuracy = 0.309000
k = 8, accuracy = 0.100000
k = 8, accuracy = 0.175000
k = 8, accuracy = 0.263000
k = 8, accuracy = 0.289000
k = 8, accuracy = 0.310000
k = 10, accuracy = 0.099000
k = 10, accuracy = 0.174000
k = 10, accuracy = 0.264000
k = 10, accuracy = 0.318000
k = 10, accuracy = 0.313000
k = 12, accuracy = 0.100000
k = 12, accuracy = 0.192000
k = 12, accuracy = 0.261000
k = 12, accuracy = 0.316000
k = 12, accuracy = 0.318000
k = 15, accuracy = 0.087000
k = 15, accuracy = 0.197000
k = 15, accuracy = 0.255000
k = 15, accuracy = 0.322000
k = 15, accuracy = 0.321000
k = 20, accuracy = 0.089000
k = 20, accuracy = 0.225000
k = 20, accuracy = 0.270000
k = 20, accuracy = 0.319000
k = 20, accuracy = 0.306000
k = 50, accuracy = 0.079000
k = 50, accuracy = 0.248000
k = 50, accuracy = 0.278000
k = 50, accuracy = 0.287000
k = 50, accuracy = 0.293000
k = 100, accuracy = 0.075000
k = 100, accuracy = 0.246000
k = 100, accuracy = 0.275000
k = 100, accuracy = 0.284000
k = 100, accuracy = 0.277000
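
One way to turn these per-fold numbers into a choice of k is to average the folds for each k and take the k with the highest mean. The sketch below does that for four of the k values. The accuracies are copied from the printed output above, so they inherit whatever issues that run had. The names mean_acc and best_k are assumptions for illustration.

```python
import numpy as np

# Fold accuracies copied from the printed output above (subset of the k values)
k_to_accuracies = {
    1: [0.098, 0.148, 0.205, 0.233, 0.308],
    10: [0.099, 0.174, 0.264, 0.318, 0.313],
    20: [0.089, 0.225, 0.270, 0.319, 0.306],
    100: [0.075, 0.246, 0.275, 0.284, 0.277],
}

# Mean validation accuracy per k
mean_acc = {k: float(np.mean(v)) for k, v in k_to_accuracies.items()}

# k with the highest mean accuracy
best_k = max(mean_acc, key=mean_acc.get)

for k in sorted(mean_acc):
    print('k = %d, mean accuracy = %f' % (k, mean_acc[k]))
print('best k among these:', best_k)  # best k among these: 20
```

The spread across folds is large compared with the differences between the means, so the ranking of nearby k values should be read with caution.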
Visualizing the cross-validation results

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# one dot per fold for each k
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# mean accuracy per k, with the standard deviation across folds as error bars
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()