机器学习(二)——K近邻算法

引言

在具体讨论 KNN 算法之前,我们先通过一个具体的例子引入。我们创建一个数据集,并将其可视化出来。

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

raw_data_x = [[3.3935, 2.3313],
              [3.1101, 1.7815],
              [1.3438, 3.3684],
              [3.5823, 4.6792],
              [2.2804, 2.8670],
              [7.4234, 4.6965],
              [5.7451, 3.5340],
              [9.1722, 2.5111],
              [7.7928, 3.4241],
              [7.9398, 0.7916]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x_train = np.array(raw_data_x)
y_train = np.array(raw_data_y)
x_new = np.array([8.0936, 3.3657])
plt.scatter(x_train[y_train==0,0], x_train[y_train==0,1], color='g')
plt.scatter(x_train[y_train==1,0], x_train[y_train==1,1], color='r')
plt.scatter(x_new[0], x_new[1], color='b')
plt.show()

散点图
从上面的散点图中,蓝色的点明显更接近红色的点。我们将蓝色的点与其它点的距离求出来,并将最近的 5 个点选出来。

from math import sqrt
distance = []
for x in x_train:
    d = sqrt(np.sum((x_new - x) ** 2))
    distance.append(d)
nearsest = np.argsort(distance)
k = 5
topk_y = [y_train[i] for i in nearsest[:k]]

可以看到结果是 [1,1,1,1,1],如果我们将距离最近的 5 个点中,种类数量最多作为蓝色的点的种类,那么蓝色也就是归于红色一类,与我们之前看到的散点图一样。没错这就是 KNN 算法。


具体思想

KNNK-Nearest Neighbors,其思想特别简单,就是设置一个超参数 k,对于一个新样本,我们计算其与所有训练集中数据的距离(这个计算距离的方式也可以选择,欧式距离、马氏距离…),然后选出最近的 k 个,进行投票,即 k 个数据中,类别最多的那一类即为新样本的类别。
是不是很简单,下面附一个手写的 KNN 解决手写数字识别的代码。

import numpy as np
from math import sqrt
from collections import Counter
from sklearn import datasets
from sklearn.model_selection import train_test_split


class KNNClassifier():

    def __init__(self, k):
        assert 1 <= k, "k must be valid"
        self.k = k
        self._x_train = None
        self._y_train = None

    def fit(self, x_train, y_train):
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        assert self.k <= x_train.shape[0], \
            "the size of x_train must be at least k"

        self._x_train = x_train
        self._y_train = y_train

    def predict(self, x_new):
        assert self._x_train is not None and self._y_train is not None, \
            "must fit before predict"
        assert x_new.shape[1] == self._x_train.shape[1], \
            "the feature number of x_new must be equal to x_train"

        y_new = [self._predict(x) for x in x_new]
        return np.array(y_new)

    def _predict(self, x):
        assert x.shape[0] == self._x_train.shape[1], \
            "the feature number of x must be equal to x_train"

        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._x_train]
        nearest = np.argsort(distances)

        topk_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topk_y)

        return votes.most_common(1)[0][0]

    def score(self, x_test, y_test):
        y_predict = self.predict(x_test)
        assert y_test.shape[0] == y_predict.shape[0], \
            "the size of y_true must be equal to the size of y_predict"
        return sum(y_test == y_predict) / len(y_test)

    def __repr__(self):
        return "KNN(k=%d)" % self.k


if __name__ == '__main__':
    digits = datasets.load_digits()
    x = digits.data
    y = digits.target

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    knn_classifier = KNNClassifier(k=3)
    knn_classifier.fit(x_train, y_train)
    print(knn_classifier.score(x_test, y_test))


超参数

我们已经知道,这个模型自己选择的部分一是 k 即选几个邻居,二是计算距离方法。但是实际操作中,可能会遇到平票的情况,而且只是通过类别的个数来判断有时也不是很准确,我们可以设定一个权重,一般为距离的倒数,这样离新样本越近的那个种类所具有的权重也越大,当然我们也可以自己根据具体情况自己设定计算权重的函数。
下面附一个,利用sklearn包的 KNN 分类器寻找最优超参数的代码。

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets

digits = datasets.load_digits()
x = digits.data
y = digits.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

best_p = -1
best_score = 0.0
best_k = -1
for p in range(1, 6):
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(x_train, y_train)
        score = knn_clf.score(x_test, y_test)
        if score > best_score:
            best_p = p
            best_k = k
            best_score = score
print("best_p=", best_p)
print("best k=", best_k)
print("best score=", best_score)
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值