KNN Study Notes with Worked Examples
Overview of KNN
KNN (K-Nearest-Neighbors) is one of the simplest classification algorithms, and it has no explicit training phase. This sets it apart from typical supervised models such as Bayesian classifiers or neural networks, which first fit parameters on a training set and then use those parameters to predict the class of new data. For every prediction, KNN instead computes the similarity (usually expressed as a distance) between the query sample and every sample in the dataset, selects the K most similar samples according to the hyperparameter K, and then applies a decision rule (such as majority vote) over those K labels to decide the query's class. For exactly this reason KNN is rarely practical in deployed applications: every single prediction requires a large number of distance computations.
Distance Metrics
Euclidean distance
$d=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$
`np.tile(x_test, (x_train.shape[0], 1))` tiles the test sample so that its shape matches the training set, which makes the subsequent element-wise matrix arithmetic straightforward.
`distance_square = np.sum((x_train - x_test) ** 2, axis=1)` sums along `axis=1` because each row of the array is one sample and each column is one feature dimension, so the squared differences must be summed per row before taking the square root.
def Euclidean_Distance(x_train, x_test):
    x_test = np.tile(x_test, (x_train.shape[0], 1))
    distance_square = np.sum((x_train - x_test) ** 2, axis=1)
    distance = np.sqrt(distance_square)
    return distance
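Incidentally, the `np.tile` copy is not strictly required: NumPy broadcasting subtracts a 1-D test row from every row of the 2-D training array automatically. A minimal sketch (the function name `euclidean_distance_broadcast` is illustrative, not from the code above):

```python
import numpy as np

def euclidean_distance_broadcast(x_train, x_test):
    # Broadcasting expands the 1-D x_test row across every row of x_train,
    # so no explicit np.tile copy is needed.
    return np.sqrt(np.sum((x_train - x_test) ** 2, axis=1))

x_train = np.array([[1.0, 2.0], [4.0, 6.0]])
x_test = np.array([1.0, 2.0])
print(euclidean_distance_broadcast(x_train, x_test))  # [0. 5.]
```

Both versions produce the same distances; broadcasting simply avoids materializing the tiled copy.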
Manhattan distance
$d=\sum_{i=1}^{n}|x_i-y_i|$
def Manhattan_Distance(x_train, x_test):
    x_test = np.tile(x_test, (x_train.shape[0], 1))
    distance = np.sum(np.abs(x_train - x_test), axis=1)
    return distance
Minkowski distance
$L_p(x_i,x_j)=\left(\sum_{l=1}^{n}\left|x_i^{(l)}-x_j^{(l)}\right|^p\right)^{\frac{1}{p}}$
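Choosing p in the Minkowski distance recovers both metrics above: p=1 gives the Manhattan distance and p=2 the Euclidean distance. A small sketch (the function name is illustrative):

```python
import numpy as np

def Minkowski_Distance(x_train, x_test, p=2):
    # p = 1 reduces to the Manhattan distance, p = 2 to the Euclidean distance
    return np.sum(np.abs(x_train - x_test) ** p, axis=1) ** (1.0 / p)

x_train = np.array([[0.0, 0.0], [3.0, 4.0]])
x_test = np.array([0.0, 0.0])
print(Minkowski_Distance(x_train, x_test, p=2))  # [0. 5.]
print(Minkowski_Distance(x_train, x_test, p=1))  # [0. 7.]
```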
The KNN Algorithm
Computation steps:
- For a given test sample, compute its distance to every sample in the training set
- Take the k training samples with the smallest distances as the test sample's neighbours
- Among the classes of those k neighbours, take a majority vote; the most frequent class becomes the predicted class of the test sample
Implementation notes:
- `classCount.get(y_train[i], 0)`: returns 0 when the label has not been counted yet, so every label's count is implicitly initialized to 0; once a count exists, the default 0 has no effect
- `nearest_k = np.argsort(distances)`: the indices that would sort the distances in ascending order
- `sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)`: `items()` iterates the (label, count) pairs of the dict, and `operator.itemgetter(1)` sorts the pairs by their count
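The three helpers can be watched in isolation on toy values:

```python
import operator
import numpy as np

distances = np.array([0.9, 0.1, 0.5, 0.3])
labels = np.array(['A', 'B', 'A', 'B'])

nearest_k = np.argsort(distances)  # indices sorted by distance: [1 3 2 0]
top_K = nearest_k[:3]              # the 3 nearest neighbours

classCount = {}
for i in top_K:
    # get() returns 0 for a label not yet in the dict, so counting starts at zero
    classCount[labels[i]] = classCount.get(labels[i], 0) + 1

# itemgetter(1) sorts the (label, count) pairs by count, highest first
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print(sortedClassCount[0][0])  # B (two of the three nearest neighbours are labelled B)
```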
def KNN_classify(k, dis, x_train, y_train, x_test):
    assert dis == 'E' or dis == 'M', "dis must be 'E' (Euclidean) or 'M' (Manhattan)"
    num_test = x_test.shape[0]
    labellist = []
    for i in range(num_test):
        if dis == 'E':
            distances = Euclidean_Distance(x_train, x_test[i])
        else:
            distances = Manhattan_Distance(x_train, x_test[i])
        # indices that would sort the distances in ascending order
        nearest_k = np.argsort(distances)
        # keep the k nearest neighbours
        top_K = nearest_k[:k]
        classCount = {}
        for j in top_K:  # j, so the outer loop index i is not shadowed
            classCount[y_train[j]] = classCount.get(y_train[j], 0) + 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        labellist.append(sortedClassCount[0][0])
    return np.array(labellist)
Worked Examples
Classifying two-dimensional training data
def createDataSet():
    group = np.array([[1.0, 2.0], [1.2, 0.1], [0.1, 1.4], [0.3, 3.5], [1.1, 1.0], [0.5, 1.5]])
    labels = np.array(['A', 'A', 'B', 'B', 'A', 'B'])
    return group, labels
group, labels = createDataSet()
plt.scatter(group[labels == 'A', 0], group[labels == 'A', 1], color='r', marker='*')
plt.scatter(group[labels == 'B', 0], group[labels == 'B', 1], color='g', marker='+')
plt.show()
y_pred = KNN_classify(1, 'E', group, labels, np.array([[1.0, 2.1], [0.4, 2.0]]))
print(y_pred)
#['A' 'B']
MNIST Dataset Implementation
1. Imports
import torch
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
import matplotlib.pyplot as plt
from KNN import *
2. Load the dataset
- `datasets.MNIST(root='././data/MNIST', train=True, transform=None, download=True)`
  - `train=True` selects the training split
  - `transform=None` means the images are not transformed (no flips, occlusion, or similar augmentation/noise)
  - `download=True` downloads the dataset from the official source if it is not already on disk
- `torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)`
  - `dataset`: the dataset to load from
  - `batch_size`: size of each mini-batch
  - `shuffle`: shuffle the data before batching
batch_size = 100
# MNIST dataset
train_dataset = datasets.MNIST(root='././data/MNIST', train=True, transform=None, download=True)
test_dataset = datasets.MNIST(root='././data/MNIST', train=False, transform=None, download=True)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)
3. Inspect the dataset sizes
print("train_data:",train_dataset.data.size())
print("train_labels:",train_dataset.targets.size())
print("test_data:",test_dataset.data.size())
print("test_labels:",test_dataset.targets.size())
4. Display a sample
digit=train_loader.dataset.data[0]
plt.imshow(digit,cmap=plt.cm.binary)
plt.show()
print(train_loader.dataset.targets[0])
5. Train and evaluate
`x_train.reshape(x_train.shape[0], 28 * 28)` flattens the batch: each 28x28 image becomes a single 784-dimensional row, i.e. one feature vector per image.
x_train = np.array(train_loader.dataset.data[:200])
x_train = x_train.reshape(x_train.shape[0], 28 * 28)
y_train = np.array(train_loader.dataset.targets[:200])
x_test = np.array(test_loader.dataset.data[:100])
x_test = x_test.reshape(x_test.shape[0], 28 * 28)
y_test = np.array(test_loader.dataset.targets[:100])
num_test = y_test.shape[0]
y_test_pred = KNN_classify(1, 'M', x_train, y_train, x_test)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d/%d correct => accuracy: %f' % (num_correct, num_test, accuracy))
Using the raw pixels, this KNN setup reaches only about 40% accuracy. The training set is truncated to `np.array(train_loader.dataset.data[:200])` because computing distances over the full dataset takes far too long.
To improve the classifier we can preprocess the images at load time, for example by zero-centering them: subtracting the per-pixel mean removes the shared offset between feature dimensions and puts them on a comparable footing.
def getXmean(data):
    # flatten each sample: reshape data to data.shape[0] rows
    data = np.reshape(data, (data.shape[0], -1))
    # per-pixel mean over all samples
    mean_image = np.mean(data, axis=0)
    return mean_image

def centralized(data, mean_image):
    data = data.reshape((data.shape[0], -1))
    data = data.astype(np.float64)
    data -= mean_image  # subtract the mean image to zero-center the data
    return data
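A quick sanity check of the two helpers (re-stated compactly here so the snippet is self-contained): after `centralized`, the per-pixel mean of the data is numerically zero.

```python
import numpy as np

def getXmean(data):
    data = np.reshape(data, (data.shape[0], -1))
    return np.mean(data, axis=0)

def centralized(data, mean_image):
    data = data.reshape((data.shape[0], -1)).astype(np.float64)
    return data - mean_image

# Toy stand-in for image data: 4 "images" of 2x2 pixels
data = np.arange(16, dtype=np.float64).reshape(4, 2, 2)
centered = centralized(data, getXmean(data))
print(np.allclose(centered.mean(axis=0), 0.0))  # True
```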
(Figure: the mean image obtained by averaging all training images)
(Figure: the feature matrix of a digit 7 after zero-centering)
x_train = np.array(train_loader.dataset.data[:100])
mean_image=getXmean(x_train)
x_train=centralized(x_train,mean_image)
x_train = x_train.reshape(x_train.shape[0], 28 * 28)
y_train = np.array(train_loader.dataset.targets[:100])
x_test = np.array(test_loader.dataset.data[:100])
x_test=centralized(x_test,mean_image)
x_test = x_test.reshape(x_test.shape[0], 28 * 28)
y_test = np.array(test_loader.dataset.targets[:100])
num_test = y_test.shape[0]
y_test_pred = KNN_classify(1, 'M', x_train, y_train, x_test)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d/%d correct => accuracy after centralized: %f' % (num_correct, num_test, accuracy))
After zero-centering the images, the KNN classification accuracy improves by more than 30 percentage points.
CIFAR-10 Dataset Implementation
batch_size = 100
# Cifar10 dataset
train_dataset = datasets.CIFAR10(root='../data/Cifar10', train=True, download=False)
test_dataset = datasets.CIFAR10(root='../data/Cifar10', train=False, download=False)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)
x_train = np.array(train_loader.dataset.data)
mean_image = getXmean(x_train)
x_train = centralized(x_train, mean_image)
y_train = np.array(train_loader.dataset.targets)
x_test = np.array(test_loader.dataset.data[:100])
x_test = centralized(x_test, mean_image)
y_test = np.array(test_loader.dataset.targets[:100])
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20]  # k is usually chosen between 1 and 20
num_training = x_train.shape[0]
X_train_folds = []
y_train_folds = []
indices = np.array_split(np.arange(num_training), indices_or_sections=num_folds)  # split the indices into 5 folds
for i in indices:
    X_train_folds.append(x_train[i])
    y_train_folds.append(y_train[i])
k_to_accuracies = {}
for k in k_choices:
    # cross-validation for this k
    acc = []
    for i in range(num_folds):
        x = X_train_folds[0:i] + X_train_folds[i + 1:]  # the training folds exclude the validation fold
        x = np.concatenate(x, axis=0)  # concatenate the remaining 4 folds into one training set
        y = y_train_folds[0:i] + y_train_folds[i + 1:]
        y = np.concatenate(y)  # same operation for the labels
        test_x = X_train_folds[i]  # the held-out validation fold
        test_y = y_train_folds[i]
        classifier = KNN()  # instantiate the model
        classifier.fit(x, y)  # store the training data
        # dist = classifier.compute_distances_no_loops(test_x)  # compute the distance matrix
        y_pred = classifier.predict(k, 'M', test_x)  # predict on the validation fold
        accuracy = np.mean(y_pred == test_y)  # accuracy on this fold
        acc.append(accuracy)
    k_to_accuracies[k] = acc  # per-fold accuracies for this k
# print the accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
To choose the hyperparameter k, training uses cross-validation: the training set is split into several folds, each fold in turn serves as the validation set, and each candidate k is scored by its mean accuracy across the folds. The k selected this way should finally be evaluated once on the held-out test set.
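Given the `k_to_accuracies` dict produced above, the k worth keeping is the one with the highest mean accuracy over the folds (the numbers below are made up for illustration):

```python
import numpy as np

# Illustrative per-fold accuracies in the same shape as k_to_accuracies above
k_to_accuracies = {1: [0.27, 0.25, 0.26], 3: [0.28, 0.30, 0.29], 5: [0.26, 0.27, 0.28]}

# pick the k whose folds have the highest mean accuracy
best_k = max(k_to_accuracies, key=lambda k: np.mean(k_to_accuracies[k]))
print(best_k)  # 3
```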
Implementation of the KNN Class
import numpy as np
import operator

class KNN:
    def __init__(self):
        pass

    def fit(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train

    def predict(self, k, dis, x_test):
        assert dis == 'E' or dis == 'M', "dis must be 'E' (Euclidean) or 'M' (Manhattan)"
        num_test = x_test.shape[0]
        labellist = []
        for i in range(num_test):
            if dis == 'E':
                distances = Euclidean_Distance(self.x_train, x_test[i])
            else:
                distances = Manhattan_Distance(self.x_train, x_test[i])
            # indices that would sort the distances in ascending order
            nearest_k = np.argsort(distances)
            # keep the k nearest neighbours
            top_K = nearest_k[:k]
            classCount = {}
            for j in top_K:
                # get() returns 0 for a label not counted yet, so every count starts from zero
                classCount[self.y_train[j]] = classCount.get(self.y_train[j], 0) + 1
            # items() iterates the (label, count) pairs; itemgetter(1) sorts them by count
            sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
            labellist.append(sortedClassCount[0][0])
        return np.array(labellist)

def Euclidean_Distance(x_train, x_test):
    # np.tile() expands the test sample to the same number of rows as the training set,
    # so the subtraction below is a plain element-wise matrix operation
    x_test = np.tile(x_test, (x_train.shape[0], 1))
    # squared differences, summed per row (axis=1): rows are samples, columns are features
    distance_square = np.sum((x_train - x_test) ** 2, axis=1)
    # square root gives the Euclidean distance to each training sample
    distance = np.sqrt(distance_square)
    return distance

def Manhattan_Distance(x_train, x_test):
    # expand the test sample to the same number of rows as the training set
    x_test = np.tile(x_test, (x_train.shape[0], 1))
    # absolute differences, summed per row, give the Manhattan distance to each sample
    distance = np.sum(np.abs(x_train - x_test), axis=1)
    return distance

def getXmean(data):
    # flatten each sample: reshape data to data.shape[0] rows
    data = np.reshape(data, (data.shape[0], -1))
    # per-pixel mean over all samples
    mean_image = np.mean(data, axis=0)
    return mean_image

def centralized(data, mean_image):
    data = data.reshape((data.shape[0], -1))
    data = data.astype(np.float64)
    data -= mean_image  # subtract the mean image to zero-center the data
    return data
kd-tree
KNN is usually implemented as a linear scan: the query sample's distance to every single training sample is computed. To make this more efficient, a number of faster data structures have been designed, such as the kd-tree described below.
Construction algorithm:
Input: a dataset of k-dimensional points $T=\{x_1,x_2,\dots,x_N\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(k)})$
Output: a kd-tree
- Start: build the root node by splitting on the 1st coordinate, using the median of all samples' 1st coordinates as the split point; samples whose 1st coordinate is smaller than the split point go to the left child region, the rest to the right
- Repeat: for a node at depth j, split on coordinate $x^{(l)}$ with $l = (j \bmod k) + 1$
- Stop when a subregion contains no more samples
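The construction rule above can be sketched as follows (the name `build_kdtree` and the dict representation are illustrative; the split axis cycles with depth and the split point is the median along that axis):

```python
def build_kdtree(points, depth=0):
    # points: a list of k-dimensional tuples; returns a nested dict, or None for an empty region
    if not points:
        return None
    k = len(points[0])
    axis = depth % k  # l = (j mod k) + 1 in 1-based notation: cycle through the axes
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2  # the median sample along the split axis becomes this node
    return {
        'point': points[median],
        'left': build_kdtree(points[:median], depth + 1),
        'right': build_kdtree(points[median + 1:], depth + 1),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree['point'])  # (7, 2)
```

Searching the tree then prunes whole subregions whose bounding box is farther away than the current best neighbour, which is what makes kd-tree KNN faster than a linear scan in low dimensions.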
- Deep Learning and Image Recognition: Principles and Practice (《深度学习与图像识别:原理与实践》)
- Statistical Learning (《统计学习》)