1 K近邻介绍
1.1 K近邻的原理
找出距离测试样本点的前k个距离最近的点,判断这些点中,哪个类别的样本的数量最多,则将测试样本点归于这个类别
1.2 K近邻的优缺点
优点:
- 无需训练、简单,易于理解,易于实现
缺点:
- 必须指定K值,K值选择不当则分类精度不能保证
- 内存开销大, 懒惰算法,对测试样本分类时的计算量大
2 K近邻算法实现
2.1 引入依赖
import numpy as np
import pandas as pd
# Load the iris dataset that ships with sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split # split the dataset into training and test sets
from sklearn.metrics import accuracy_score # compute classification accuracy
2.2 数据加载和预处理
# Fetch the iris dataset and wrap the features in a labelled DataFrame.
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['class'] = iris.target
# Replace the integer class codes (0, 1, 2) with their species names.
code_to_name = {code: name for code, name in enumerate(iris.target_names)}
df['class'] = df['class'].map(code_to_name)
df.head(10)
df.describe()
sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
# Feature matrix and labels; labels become an (n, 1) column vector.
x = iris.data
y = np.reshape(iris.target, (-1, 1))
print(x.shape, y.shape)
(150, 4) (150, 1)
# Split into training (70%) and test (30%) sets; `stratify=y` keeps the
# class proportions identical in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=35, stratify=y)
for features, labels in ((x_train, y_train), (x_test, y_test)):
    print(features.shape, labels.shape)
# Toy example: distances from each row of `a` to the single point `b`,
# computed via numpy broadcasting.
a = np.array([
    [3, 2, 4, 2],
    [2, 1, 4, 23],
    [12, 3, 2, 3],
    [2, 3, 15, 23],
    [1, 3, 2, 3],
    [13, 3, 2, 2],
    [213, 16, 3, 63],
    [23, 62, 23, 23],
    [23, 16, 23, 43],
])
b = np.array([[1, 1, 1, 1]])
diff = a - b
print(diff)                               # element-wise differences (broadcast over rows)
np.sum(np.abs(diff), axis=1)              # Manhattan (L1) distance per row
dist = np.sqrt(np.sum(diff ** 2, axis=1))  # Euclidean (L2) distance per row
print(dist)
(105, 4) (105, 1)
(45, 4) (45, 1)
[[ 2 1 3 1]
[ 1 0 3 22]
[ 11 2 1 2]
[ 1 2 14 22]
[ 0 2 1 2]
[ 12 2 1 1]
[212 15 2 62]
[ 22 61 22 22]
[ 22 15 22 42]]
[ 3.87298335 22.22611077 11.40175425 26.17250466 3.
12.24744871 221.39783197 71.92357055 54.3783045 ]
2.3 核心算法实现
# Distance function definitions
def l1_distance(a, b):
    """Manhattan (L1) distance from every row of `a` to the point `b`."""
    return np.abs(a - b).sum(axis=1)
def l2_distance(a, b):
    """Euclidean (L2) distance from every row of `a` to the point `b`."""
    squared_diff = (a - b) ** 2
    return np.sqrt(squared_diff.sum(axis=1))
# Classifier implementation
class kNN(object):
    """A minimal k-nearest-neighbours classifier.

    Being a lazy learner, it simply memorises the training data in `fit`
    and does all the work at prediction time: each query point is assigned
    the majority label among its k closest training samples.
    """

    def __init__(self, n_neighbors = 1, dist_func = l1_distance):
        # The only hyper-parameters: k and the distance metric.
        self.n_neighbors = n_neighbors
        self.dist_func = dist_func

    def fit(self, x, y):
        """Store the training features and labels verbatim."""
        self.x_train = x
        self.y_train = y

    def predict(self, x):
        """Return an (n_samples, 1) array of predicted labels for `x`."""
        predictions = np.zeros((x.shape[0], 1), dtype=self.y_train.dtype)
        for row, sample in enumerate(x):
            # Distance from this query point to every training sample.
            d = self.dist_func(self.x_train, sample)
            # Indices of the k nearest training samples (closest first).
            nearest = np.argsort(d)[:self.n_neighbors]
            votes = self.y_train[nearest].ravel()
            # Most frequent label wins; on a tie, argmax over bincount
            # picks the smallest label value.
            predictions[row] = np.argmax(np.bincount(votes))
        return predictions
2.4 测试
# Build a 3-NN classifier and fit it on the training split.
knn = kNN(n_neighbors=3)
knn.fit(x_train, y_train)
# Predict labels for the held-out test samples.
y_pred = knn.predict(x_test)
# Show true labels, then predicted labels, for a quick visual comparison.
for labels in (y_test, y_pred):
    print(labels.ravel())
# Fraction of test samples classified correctly.
accuracy = accuracy_score(y_test, y_pred)
print("预测准确率: ", accuracy)
[2 1 2 2 0 0 2 0 1 1 2 0 1 1 1 2 2 0 1 2 1 0 0 0 1 2 0 2 0 0 2 1 0 2 1 0 2
1 2 2 1 1 1 0 0]
[2 1 2 2 0 0 2 0 1 1 1 0 1 1 1 2 2 0 1 2 1 0 0 0 1 2 0 2 0 0 2 1 0 2 1 0 2
1 2 1 1 2 1 0 0]
预测准确率: 0.9333333333333333
# Sweep both distance metrics and odd k values 1..9, recording the
# test-set accuracy of every combination in a results table.
knn = kNN()
knn.fit(x_train, y_train)
result_list = []
for metric_name, metric in (('l1_distance', l1_distance), ('l2_distance', l2_distance)):
    knn.dist_func = metric
    # Odd k values avoid many two-way voting ties.
    for k in range(1, 10, 2):
        knn.n_neighbors = k
        # Re-predict the test set under the current (metric, k) setting.
        y_pred = knn.predict(x_test)
        accuracy = accuracy_score(y_test, y_pred)
        result_list.append([k, metric_name, accuracy])
df = pd.DataFrame(result_list, columns=['k', '距离函数', '预测准确率'])
df
k | 距离函数 | 预测准确率 | |
---|---|---|---|
0 | 1 | l1_distance | 0.933333 |
1 | 3 | l1_distance | 0.933333 |
2 | 5 | l1_distance | 0.977778 |
3 | 7 | l1_distance | 0.955556 |
4 | 9 | l1_distance | 0.955556 |
5 | 1 | l2_distance | 0.933333 |
6 | 3 | l2_distance | 0.933333 |
7 | 5 | l2_distance | 0.977778 |
8 | 7 | l2_distance | 0.977778 |
9 | 9 | l2_distance | 0.977778 |