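Preface
The core idea of the k-nearest neighbors (KNN) algorithm comes down to one saying:
Birds of a feather flock together.
KNN is an entry-level machine learning algorithm with very low complexity, which makes it good practice for beginners. (Simple does not mean useless: as the proverb goes, the sparrow may be small, but it has all its vital organs.)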
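The idea behind the algorithm
Image source
In this figure, which class does the green circle belong to?
How do we find the answer, and how do we verify it?
In machine learning, whenever a problem asks which class something belongs to, most people think of a classifier.
But how is a classifier actually written? The KNN algorithm shows the underlying principle. It also illustrates a general point: once a real-world problem can be restated as a mathematical one (here, measuring distances between points), it becomes a candidate for a machine learning solution.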
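Algorithm design:
- Compute the distance between the test point and every training point.
- Sort the training points by increasing distance.
- Select the k points with the smallest distances.
- Count how often each class occurs among those k points.
- Return the most frequent class among the k points as the predicted class of the test point.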
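To make the five steps concrete, here is a minimal step-by-step trace. It is only a sketch: the four training points, the query point, and every variable name below are hypothetical and chosen for illustration, and it assumes Euclidean distance as the distance measure.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data (not the article's data): four labelled points and one query.
train_points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
train_labels = np.array([0, 0, 1, 1])
query = np.array([0.0, 1.0])
k = 3

# Step 1: Euclidean distance from the query to every training point.
distances = np.sqrt(np.sum((train_points - query) ** 2, axis=1))

# Step 2: indices of the training points, ordered by increasing distance.
order = np.argsort(distances)

# Step 3: keep the k closest points.
nearest_idx = order[:k]

# Step 4: count how often each label occurs among those k points.
votes = Counter(train_labels[nearest_idx].tolist())

# Step 5: the most frequent label is the prediction.
predicted_label = votes.most_common(1)[0][0]
print(predicted_label)
```

In this toy case the three nearest points carry the labels 0, 0 and 1, so the vote is two to one and the query is assigned to class 0.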
Training set:
- X_train stores the x, y coordinates of the training data
- Y_train stores the training-set labels
'''
Encapsulated implementation
'''
import numpy as np
from math import sqrt
from collections import Counter

raw_data_x = [[3.3935, 2.3313],
              [3.1101, 1.7815],
              [1.3438, 3.3684],
              [3.5823, 4.6792],
              [2.2804, 2.8670],
              [7.4234, 4.6965],
              [5.7451, 3.5340],
              [9.1722, 2.5111],
              [7.7928, 3.4241],
              [7.9398, 0.7916]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Build the training set
X_train = np.array(raw_data_x)
Y_train = np.array(raw_data_y)

# Point to predict; any value can be used here
x = np.array([8.2, 4.3])

# Choose k; input() returns a string, so convert it to an integer
k = int(input('Enter the value of K: '))


def KNN(x, X_train, Y_train, k):
    '''Predict the class of x by majority vote among its k nearest training points.'''
    # Euclidean distance from x to every training point (square first, then sum)
    distance = [sqrt(np.sum((_x - x) ** 2)) for _x in X_train]
    # Indices of the training points, nearest first
    nearest = np.argsort(distance)
    # Labels of the k nearest points
    top_key = [Y_train[i] for i in nearest[:k]]
    # Count the votes for each class
    votes = Counter(top_key)
    print(votes)
    # The most frequent label is the prediction
    return votes.most_common(1)[0][0]


print(KNN(x, X_train, Y_train, k))
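One detail of the distance step is easy to get wrong: each coordinate difference has to be squared before the differences are summed. Summing first and squaring afterwards lets positive and negative differences cancel. The short check below illustrates this; the two points are hypothetical and chosen so that the cancellation is complete.

```python
import numpy as np
from math import sqrt

# Hypothetical points, chosen so the coordinate differences are +1 and -1.
a = np.array([0.0, 0.0])
b = np.array([1.0, -1.0])

diff = b - a

# Sum first, then square: the differences cancel and the "distance" collapses to zero.
sum_then_square = sqrt(np.sum(diff) ** 2)

# Square first, then sum: the Euclidean distance, sqrt(1 + 1).
square_then_sum = sqrt(np.sum(diff ** 2))

print(sum_then_square, square_then_sum)
```

The first expression reports the two distinct points as coincident, which would distort the neighbor ranking; the second gives the expected Euclidean distance.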
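Data visualization
The black point is the point to be predicted. From the plot we can see that it falls among the points with label 1 in Y_train (drawn in cyan).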
# Data visualization with matplotlib
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
from collections import Counter

raw_data_x = [[3.3935, 2.3313],
              [3.1101, 1.7815],
              [1.3438, 3.3684],
              [3.5823, 4.6792],
              [2.2804, 2.8670],
              [7.4234, 4.6965],
              [5.7451, 3.5340],
              [9.1722, 2.5111],
              [7.7928, 3.4241],
              [7.9398, 0.7916]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Build the training set
X_train = np.array(raw_data_x)
Y_train = np.array(raw_data_y)

# Point to predict
x = np.array([8.2, 4.3])

# Scatter plot: query point in black, class 0 in red, class 1 in cyan
plt.figure()
plt.scatter(x[0], x[1], color='black')
plt.scatter(X_train[Y_train == 0, 0], X_train[Y_train == 0, 1], color='red')
plt.scatter(X_train[Y_train == 1, 0], X_train[Y_train == 1, 1], color='cyan')
plt.show()

# Euclidean distance from x to every training point (square first, then sum)
distance = []
for x_ in X_train:
    dis = sqrt(np.sum((x_ - x) ** 2))
    distance.append(dis)
print(distance)

# Sort the distances and keep the indices, nearest first
nearest = np.argsort(distance)
print(nearest)

# Choose k
k = 6
# Labels of the k nearest points
top_key = [Y_train[i] for i in nearest[:k]]
# Count the votes for each class
votes = Counter(top_key)
print(votes)
# Most frequent class and its vote count
print(votes.most_common(1))
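The choice of k affects the vote, so it can help to see how the counts change as k grows. The sketch below reuses the article's ten points and query point but relies only on the standard library; the helper name `vote_counts` is hypothetical, and `math.dist` (Python 3.8+) is assumed as the Euclidean distance.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8

points = [(3.3935, 2.3313), (3.1101, 1.7815), (1.3438, 3.3684),
          (3.5823, 4.6792), (2.2804, 2.8670), (7.4234, 4.6965),
          (5.7451, 3.5340), (9.1722, 2.5111), (7.7928, 3.4241),
          (7.9398, 0.7916)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
query = (8.2, 4.3)


def vote_counts(k):
    """Return the label counts among the k training points closest to `query`."""
    # Rank training-point indices by their distance to the query, nearest first.
    ranked = sorted(range(len(points)), key=lambda i: dist(points[i], query))
    return Counter(labels[i] for i in ranked[:k])


for k in (1, 3, 6, 10):
    print(k, dict(vote_counts(k)))
```

For small k every vote goes to class 1. With k = 6 one class-0 point joins the neighbors, and with k = 10 (the whole training set) the vote is tied five to five, so the prediction would depend only on how ties are broken. For a two-class problem, an odd k well below the training-set size is a common way to avoid such ties.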