The essence of the k-nearest-neighbor (k-NN) algorithm is the Euclidean distance formula, d(x, y) = sqrt(Σᵢ (xᵢ − yᵢ)²); we implement the algorithm by mimicking this process. The code is as follows:
import numpy as np

def knn_classify(dataset, labels, new_input, k, weight='uniform'):
    '''
    dataset: the training data, i.e. x_train
    labels: the class labels of the training data, i.e. y_train
    new_input: the new sample to classify, i.e. x_test; a limitation of
               this function is that it accepts only one sample at a time
    k: the number of neighbors
    weight: the decision rule -- "uniform" for majority voting,
            "distance" for distance-weighted voting
    '''
    num_samples = dataset.shape[0]
    # Step 1: compute the distance between the new sample and every
    # training point (Euclidean distance formula)
    diff = np.tile(new_input, (num_samples, 1)) - dataset
    # square the differences, sum along each row, take the square root
    distance = ((diff**2).sum(axis=1))**0.5
    # Step 2: sort the distances in ascending order; argsort returns the
    # indices of the training points from nearest to farthest
    sort_distance = distance.argsort()
    # an empty dict to accumulate the votes of the k nearest neighbors
    class_count = {}
    for i in range(k):
        # class label of the i-th nearest neighbor
        votelabel = labels[sort_distance[i]]
        if weight == 'uniform':
            # majority voting: each neighbor contributes one vote to its label
            class_count[votelabel] = class_count.get(votelabel, 0) + 1
        elif weight == 'distance':
            # distance-weighted voting: closer neighbors carry more weight
            class_count[votelabel] = class_count.get(votelabel, 0) + 1/distance[sort_distance[i]]
        else:
            raise ValueError('unknown decision rule: use "uniform" for majority '
                             'voting or "distance" for distance-weighted voting')
    # Step 3: sort the labels by their accumulated votes in descending order
    # (sort by the vote value, not the label, so the top entry wins)
    sorted_class_count = sorted(class_count.items(), key=lambda item: item[1], reverse=True)
    if weight == 'uniform':
        print("Vote counts of the %d nearest neighbors:" % k, "\n", class_count)
    else:
        print("Distance-weighted votes of the %d nearest neighbors:" % k, "\n", class_count)
    print("Predicted class of the new input:", sorted_class_count[0][0])
    return sorted_class_count[0][0]
Next, create the dataset:
sample1 = np.array([[96,13984],[120,20380],[100,10000],[116,9800],[85,14000],[90,11061],[85,9400],[69,11950],[98,12000],[99,11500],
[92,8500],[128,11061],[100,14000],[84,9500],[99,12500],[85,8000],[130,12500],[85,12647],[85,11061],[108,12500]])
target1 = np.array(['A','A','A','A','A','B','B','B','B','B',
'C','C','C','C','C','D','D','D','D','D'])
x_test = np.array([98,12540])
Then call the function to predict the result:
p = knn_classify(sample1, target1, x_test, 5)
print(p)
The output is:
Vote counts of the 5 nearest neighbors:
 {'C': 1, 'D': 3, 'B': 1}
Predicted class of the new input: D
D
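The function also supports the distance-weighted rule. As a quick sketch (an extra call for illustration, not part of the run above), we can classify the same point with weight='distance'; since the three 'D' neighbors sit almost as close to the query as the single nearest 'C' neighbor, their accumulated inverse-distance weight should again make 'D' win:
# distance-weighted voting instead of simple majority voting
p_weighted = knn_classify(sample1, target1, x_test, 5, weight='distance')
print(p_weighted)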
Validation: we verify our result with scikit-learn's k-NN implementation.
from sklearn.neighbors import KNeighborsClassifier
knnclf = KNeighborsClassifier(n_neighbors=5)
#the same data as before
sample1 = [[96,13984],[120,20380],[100,10000],[116,9800],[85,14000],[90,11061],[85,9400],[69,11950],[98,12000],[99,11500],
[92,8500],[128,11061],[100,14000],[84,9500],[99,12500],[85,8000],[130,12500],[85,12647],[85,11061],[108,12500]]
target1 = np.array(['A','A','A','A','A','B','B','B','B','B',
'C','C','C','C','C','D','D','D','D','D'])
x_test = np.array([[98,12540]])
Train and classify:
knnclf.fit(sample1, target1)
y_test = knnclf.predict(x_test)
y_test
The output is:
array(['D'], dtype='<U1')
Our custom implementation and scikit-learn's k-NN give the same result!
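scikit-learn exposes the same choice of decision rule through its weights parameter. A minimal sketch (illustration only, not part of the original run) that mirrors our weight='distance' mode:
# distance-weighted voting in scikit-learn, the analogue of weight='distance'
knnclf_w = KNeighborsClassifier(n_neighbors=5, weights='distance')
knnclf_w.fit(sample1, target1)
print(knnclf_w.predict(x_test))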
Let's also test with the iris dataset:
from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
x_train = data[:,2:][:120]
y_train = iris.target[:120]
x_test = np.array([5.8,1.8])
p = knn_classify(x_train, y_train, x_test, 5)
print(p)
The output is:
Vote counts of the 5 nearest neighbors:
 {2: 5}
Predicted class of the new input: 2
2
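As the docstring notes, knn_classify handles only one sample per call. A simple workaround (a sketch for illustration; the names x_hold and y_hold are ours) is to loop over a batch. Here we classify the 30 iris samples held out of training, all of which belong to class 2, and compute the accuracy from the returned labels (the per-call printouts make this verbose):
# classify the held-out iris samples (indices 120-149, all class 2) one by one
x_hold = data[:, 2:][120:]
y_hold = iris.target[120:]
preds = [knn_classify(x_train, y_train, x, 5) for x in x_hold]
print("accuracy on held-out samples:", np.mean(np.array(preds) == y_hold))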
Again, verify with scikit-learn's k-NN:
from sklearn.datasets import load_iris
iris = load_iris()
x_train = iris.data[:,2:][:120]
y_train = iris.target[:120]
x_test = np.array([[5.8,1.8]])
knnclf = KNeighborsClassifier(n_neighbors=5)
knnclf.fit(x_train, y_train)
y_test = knnclf.predict(x_test)
y_test
The output is:
array([2])
The results match again! ヾ(๑╹◡╹)ノ
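As a final sanity check (a sketch, not part of the original run), scikit-learn's kneighbors method lets us inspect exactly which training points drove the prediction:
# distances to, and indices of, the 5 nearest training points
dist, idx = knnclf.kneighbors(x_test)
print(dist)          # distances of the 5 nearest neighbors
print(y_train[idx])  # their class labels -- consistent with {2: 5}, all 2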