ML: k-Nearest Neighbors (KNN)

k-Nearest Neighbor (kNN) is a supervised learning method used for both classification and regression. It is instance-based: it predicts by computing distances between a test sample and the training samples. The choice of k has a significant impact on the result and is usually determined by cross-validation. Euclidean distance is the most common distance metric, and both the choice of k and the distance measure affect accuracy. kNN is simple and accurate, but its high computational cost makes it a poor fit for large datasets.

Principle

k-Nearest Neighbor (kNN) is a widely used supervised learning method and one of the simplest, most common classification algorithms; it should not be confused with the K-means clustering algorithm.

The basic idea: given a distance metric, find the k training samples closest to the test sample, then predict from the information carried by these k "neighbors". For classification tasks, use majority voting; for regression tasks, use averaging.

A distinguishing feature of kNN is "lazy learning": there is no training phase. The training samples are simply stored, and processing is deferred until a test sample arrives, at which point the k nearest samples are found.

Factors that affect the result:

  • The value of k: different values of k can yield different predictions
  • The distance metric: different metrics can pick out markedly different "neighbors"

Choosing k

The choice of k matters, so how do we determine it? Through cross-validation.

Split the data into a training set and a validation set at a fixed ratio, then start from a small k and increase it step by step, computing the error rate on the validation set for each value; the k with the lowest error rate is the final choice.
As k starts to increase, the error rate generally drops first, since taking more surrounding samples into account improves the prediction; but when k becomes very large, the error rate rises again. For example, with only 35 samples, taking k = 30 makes kNN meaningless.
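The search described above can be sketched with scikit-learn's `cross_val_score` (a minimal sketch; the iris dataset here stands in for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

# try increasing values of k and keep the one with the highest
# mean cross-validated accuracy (i.e. the lowest error rate)
scores = {}
for k in range(1, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, x, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best k:", best_k)
```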

Distance metric

Common distance metrics include Euclidean distance, Manhattan distance, and cosine distance. kNN most commonly uses Euclidean distance, defined as $d_{xy}=\sqrt{\sum_{k=1}^{n}(x_k-y_k)^2}$,
where n is the dimensionality; e.g. in two-dimensional space, n = 2.

Compute the distance from the test sample to each training sample one by one, and select the k samples with the smallest distances.
For a classification task, the prediction is the class that appears most often among the k samples (voting); for a regression task, the prediction is the mean of the k samples' targets: $\hat{y}=\frac{1}{K}\sum_{i=1}^{K}y_i$
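Both steps can be sketched in a few lines of NumPy (the toy data below is made up for illustration):

```python
import numpy as np

train_x = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])
test_x = np.array([0.05, 0.1])

# Euclidean distance from the test sample to every training sample
dist = np.sqrt(np.sum((train_x - test_x) ** 2, axis=1))
k = 3
top_k = np.argsort(dist)[:k]  # indices of the k nearest neighbors

# classification: majority vote among the k neighbor labels
pred_class = np.bincount(train_y[top_k]).argmax()
# regression: mean of the k neighbor targets
pred_mean = train_y[top_k].mean()
print(pred_class, pred_mean)
```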

Limitations of KNN

Advantages:

  • Simple and easy to implement
  • High accuracy; insensitive to outliers

Disadvantages:

  • High computational cost: the distance to every training sample must be computed, which becomes very expensive when the dataset is large
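Libraries typically mitigate this cost with tree-based neighbor indexes (KD-trees, ball trees) instead of brute-force search; for example, scikit-learn's `KNeighborsClassifier` exposes this through its `algorithm` parameter. A quick sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

# brute-force search compares against every training sample;
# a KD-tree prunes most of those comparisons on low-dimensional data
brute = KNeighborsClassifier(n_neighbors=3, algorithm='brute').fit(x, y)
kdtree = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree').fit(x, y)
print(brute.score(x, y), kdtree.score(x, y))
```

Both perform exact nearest-neighbor search; the tree only changes how the neighbors are found, not which distances are considered shortest.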

Implementation

Although library implementations abound (sklearn, torch, etc.), writing kNN yourself is the best way to internalize how it works; you can then validate your predictions against a library's output.
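For that comparison, a scikit-learn baseline looks like this (a sketch on the iris data; swap in your own arrays):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1, stratify=y)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(x_train, y_train)
print('sklearn test accuracy:', clf.score(x_test, y_test))
```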

Version 1

import numpy as np

def KNN(test_data1, train_data_pca, train_label, k, p):
    # Lp (Minkowski) distance from the test sample to every training sample
    subMat = train_data_pca - np.tile(test_data1, (train_data_pca.shape[0], 1))
    subMat = np.abs(subMat)
    distance = subMat**p
    distance = np.sum(distance, axis=1)
    distance = distance**(1.0/p)
    distanceIndex = np.argsort(distance)
    classCount = np.zeros(41)  # 41 = number of classes; adjust for your dataset
    for i in range(k):  # vote among the k nearest neighbors
        label = train_label[distanceIndex[i]]
        classCount[label] = classCount[label] + 1
    return np.argmax(classCount)
  
  
def test(k, p,train_data_pca, test_data_pca, train_labels, test_labels):  
    train_labels, test_labels = train_labels.squeeze(), test_labels.squeeze()  
    print("testing with K= %d and lp norm p=%d" % (k, p))  
    m, n = np.shape(test_data_pca)  
    correctCount = 0  
    M = np.zeros((41, 41), int)  # confusion matrix
    for i in range(m):  
        test_data1 = test_data_pca[i, :]  
        predict_label = KNN(test_data1, train_data_pca, train_labels, k, p)  
        true_label = test_labels[i]  
        M[true_label][predict_label] += 1  
        # print("predict:%d,true:%d" % (predict_label,true_label)) 
        if true_label == predict_label:  
            correctCount += 1  
  
    print("The accuracy is: %f" % (float(correctCount) / m))  
    print("Confusion matrix:", M)
  • Usage:
    test(3, 2, train_data_pca, test_data_pca, train_labels, test_labels)
    Data shape: (n_samples, dim)

Version 2

import numpy as np  
import operator  
import matplotlib.pyplot as plt  
  
class KNN(object):  
  
    def __init__(self, k=3):  
        self.k = k  
  
    def fit(self, x, y):  
        self.x = x  
        self.y = y.squeeze()  
  
    def _square_distance(self, v1, v2):  
        return np.sum(np.square(v1-v2))  
  
    def _vote(self, ys):  
  
        vote_dict = {}  
        for y in ys:  
            if y not in vote_dict.keys():  
                vote_dict[y] = 1  
            else:  
                vote_dict[y] += 1  
        sorted_vote_dict = sorted(vote_dict.items(), key=operator.itemgetter(1), reverse=True)  
        return sorted_vote_dict[0][0]  
  
    def predict(self, x):  
        y_pred = []  
        for i in range(len(x)):  
            dist_arr = [self._square_distance(x[i], self.x[j]) for j in range(len(self.x))]  
            sorted_index = np.argsort(dist_arr)  
            top_k_index = sorted_index[0:self.k]  
            y_pred.append(self._vote(ys=self.y[top_k_index]))  
        return np.array(y_pred)  
  
    def score(self, y_true=None, y_pred=None):  
        if y_true is None and y_pred is None:  
            y_pred = self.predict(self.x)  
            y_true = self.y  
        score = 0.0  
        y_true = y_true.squeeze()  
        for i in range(len(y_true)):  
            if y_true[i] == y_pred[i]:  
                score += 1  
        score /= len(y_true)  
        return score  
  
  
if __name__ == '__main__':  
    # data generation  
    np.random.seed(314)  
    data_size_1 = 300  
    x1_1 = np.random.normal(loc=5.0, scale=1.0, size=data_size_1)  
    x2_1 = np.random.normal(loc=4.0, scale=1.0, size=data_size_1)  
    y_1 = [0 for _ in range(data_size_1)]  
  
    data_size_2 = 400  
    x1_2 = np.random.normal(loc=10.0, scale=2.0, size=data_size_2)  
    x2_2 = np.random.normal(loc=8.0, scale=2.0, size=data_size_2)  
    y_2 = [1 for _ in range(data_size_2)]  
  
    x1 = np.concatenate((x1_1, x1_2), axis=0)  
    x2 = np.concatenate((x2_1, x2_2), axis=0)  
    x = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))  
    y = np.concatenate((y_1, y_2), axis=0)  
  
    data_size_all = data_size_1 + data_size_2  
    shuffled_index = np.random.permutation(data_size_all)  
    x = x[shuffled_index]  
    y = y[shuffled_index]  
  
    split_index = int(data_size_all * 0.7)  
    x_train = x[:split_index]  
    y_train = y[:split_index]  
    x_test = x[split_index:]  
    y_test = y[split_index:]  
  
    # visualize data  
    plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, marker='.')  
    plt.show()  
    plt.scatter(x_test[:, 0], x_test[:, 1], c=y_test, marker='.')  
    plt.show()  
  
    # data preprocessing: min-max normalization
    # note: the test set must be scaled with the training set's statistics
    x_min, x_max = np.min(x_train, axis=0), np.max(x_train, axis=0)
    x_train = (x_train - x_min) / (x_max - x_min)
    x_test = (x_test - x_min) / (x_max - x_min)
  
    # knn classifier  
    clf = KNN(k=3)  
    clf.fit(x_train, y_train)  
  
    print('train accuracy: {:.3}'.format(clf.score()))  
  
    y_test_pred = clf.predict(x_test)  
    print('test accuracy: {:.3}'.format(clf.score(y_test, y_test_pred)))

Version 3 (recommended)

import numpy as np  
import operator  
import matplotlib.pyplot as plt  
from sklearn import datasets  
from sklearn.model_selection import train_test_split  
  
class KNN(object):  
  
    def __init__(self, k=3):  
        self.k = k  
        self.result = None  
  
    def fit(self, x, y):  
        self.x = x  
        self.y = y.squeeze()  
  
    def _square_distance(self, v1, v2):  
        return np.sum(np.square(v1-v2))  
  
    def _vote(self, ys):  
  
        vote_dict = {}  
        for y in ys:  
            if y not in vote_dict.keys():  
                vote_dict[y] = 1  
            else:  
                vote_dict[y] += 1  
        sorted_vote_dict = sorted(vote_dict.items(), key=operator.itemgetter(1), reverse=True)  
        return sorted_vote_dict[0][0]  
  
    def predict(self, x):  
        y_pred = []  
        for i in range(len(x)):  
            dist_arr = [self._square_distance(x[i], self.x[j]) for j in range(len(self.x))]  
            sorted_index = np.argsort(dist_arr)  
            top_k_index = sorted_index[0:self.k]  
            y_pred.append(self._vote(ys=self.y[top_k_index]))  
        self.result = np.array(y_pred)  
        return self.result  
  
    def score(self, y_true=None, y_pred=None):  
        if y_true is None and y_pred is None:  
            y_pred = self.predict(self.x)  
            y_true = self.y  
        score = 0.0  
        y_true = y_true.squeeze()  
        for i in range(len(y_true)):  
            if y_true[i] == y_pred[i]:  
                score += 1  
        score /= len(y_true)  
        return score  
  
    # compute accuracy as a percentage
    def get_accuracy(self, y_test):  
        assert self.result is not None, "must predict before calculating accuracy!"  
        assert y_test.shape[0] == self.result.shape[0], "number of test labels must match number of predictions!"  
        correct = 0  
        for i in range(len(self.result)):  
            if y_test[i] == self.result[i]:  
                correct += 1  
        return (correct / float(len(self.result))) * 100.0  
  
if __name__ == '__main__':  
    # test kNN on the iris dataset
    iris_data = datasets.load_iris()  
    all_data = iris_data['data']  
    all_label = iris_data['target']  
    flower_name = iris_data['target_names']  
    train_data, val_data, train_label, val_label = train_test_split(all_data, all_label, test_size=0.2,  
                                                                    random_state=1, stratify=all_label)  
    knn_clf = KNN(k=1)  
    knn_clf.fit(train_data, train_label)  
    result = knn_clf.predict(val_data)  
    acc = knn_clf.get_accuracy(val_label)  
    print(result)  
    print(val_label)  
    print(acc)