机器学习(二)——K近邻算法

最新推荐文章于 2024-10-27 11:29:42 发布

hhhcbw

最新推荐文章于 2024-10-27 11:29:42 发布

阅读量666

点赞数 1

分类专栏：机器学习文章标签： python 机器学习

本文链接：https://blog.csdn.net/weixin_44491423/article/details/121613956

版权

机器学习专栏收录该内容

16 篇文章 43 订阅

订阅专栏

引言

在具体讨论 KNN 算法之前，我们先通过一个具体的例子引入。我们创建一个数据集，并将其可视化出来。

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

raw_data_x = [[3.3935, 2.3313],
              [3.1101, 1.7815],
              [1.3438, 3.3684],
              [3.5823, 4.6792],
              [2.2804, 2.8670],
              [7.4234, 4.6965],
              [5.7451, 3.5340],
              [9.1722, 2.5111],
              [7.7928, 3.4241],
              [7.9398, 0.7916]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x_train = np.array(raw_data_x)
y_train = np.array(raw_data_y)
x_new = np.array([8.0936, 3.3657])
plt.scatter(x_train[y_train==0,0], x_train[y_train==0,1], color='g')
plt.scatter(x_train[y_train==1,0], x_train[y_train==1,1], color='r')
plt.scatter(x_new[0], x_new[1], color='b')
plt.show()

从上面的散点图中，蓝色的点明显更接近红色的点。我们将蓝色的点与其它点的距离求出来，并将最近的 5 个点选出来。

from math import sqrt
distance = []
for x in x_train:
    d = sqrt(np.sum((x_new - x) ** 2))
    distance.append(d)
nearsest = np.argsort(distance)
k = 5
topk_y = [y_train[i] for i in nearsest[:k]]

可以看到结果是 [1,1,1,1,1]，如果我们将距离最近的 5 个点中，种类数量最多作为蓝色的点的种类，那么蓝色也就是归于红色一类，与我们之前看到的散点图一样。没错这就是 KNN 算法。

具体思想

KNN 即 K-Nearest Neighbors，其思想特别简单，就是设置一个超参数 k，对于一个新样本，我们计算其与所有训练集中数据的距离（这个计算距离的方式也可以选择，欧式距离、马氏距离…），然后选出最近的 k 个，进行投票，即 k 个数据中，类别最多的那一类即为新样本的类别。
是不是很简单，下面附一个手写的 KNN 解决手写数字识别的代码。

import numpy as np
from math import sqrt
from collections import Counter
from sklearn import datasets
from sklearn.model_selection import train_test_split


class KNNClassifier():

    def __init__(self, k):
        assert 1 <= k, "k must be valid"
        self.k = k
        self._x_train = None
        self._y_train = None

    def fit(self, x_train, y_train):
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        assert self.k <= x_train.shape[0], \
            "the size of x_train must be at least k"

        self._x_train = x_train
        self._y_train = y_train

    def predict(self, x_new):
        assert self._x_train is not None and self._y_train is not None, \
            "must fit before predict"
        assert x_new.shape[1] == self._x_train.shape[1], \
            "the feature number of x_new must be equal to x_train"

        y_new = [self._predict(x) for x in x_new]
        return np.array(y_new)

    def _predict(self, x):
        assert x.shape[0] == self._x_train.shape[1], \
            "the feature number of x must be equal to x_train"

        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._x_train]
        nearest = np.argsort(distances)

        topk_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topk_y)

        return votes.most_common(1)[0][0]

    def score(self, x_test, y_test):
        y_predict = self.predict(x_test)
        assert y_test.shape[0] == y_predict.shape[0], \
            "the size of y_true must be equal to the size of y_predict"
        return sum(y_test == y_predict) / len(y_test)

    def __repr__(self):
        return "KNN(k=%d)" % self.k


if __name__ == '__main__':
    digits = datasets.load_digits()
    x = digits.data
    y = digits.target

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    knn_classifier = KNNClassifier(k=3)
    knn_classifier.fit(x_train, y_train)
    print(knn_classifier.score(x_test, y_test))

超参数

我们已经知道，这个模型自己选择的部分一是 k 即选几个邻居，二是计算距离方法。但是实际操作中，可能会遇到平票的情况，而且只是通过类别的个数来判断有时也不是很准确，我们可以设定一个权重，一般为距离的倒数，这样离新样本越近的那个种类所具有的权重也越大，当然我们也可以自己根据具体情况自己设定计算权重的函数。
下面附一个，利用sklearn包的 KNN 分类器寻找最优超参数的代码。

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets

digits = datasets.load_digits()
x = digits.data
y = digits.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

best_p = -1
best_score = 0.0
best_k = -1
for p in range(1, 6):
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(x_train, y_train)
        score = knn_clf.score(x_test, y_test)
        if score > best_score:
            best_p = p
            best_k = k
            best_score = score
print("best_p=", best_p)
print("best k=", best_k)
print("best score=", best_score)