Machine Learning (1): KNN — Principles and Implementation


title: Machine Learning (1): KNN
date: 2021-08-12 18:31:35
categories: Machine Learning
tags:
- Machine Learning
- Artificial Intelligence
- Algorithms
- KNN

The KNN Algorithm

The Basic Idea of KNN

KNN (K-Nearest Neighbor) is one of the simplest classification algorithms in data mining. Its guiding idea is "one takes on the color of one's company": a sample's class is inferred from the classes of its neighbors.

If, among the K samples most similar to a given sample (i.e., its nearest neighbors in feature space, measured with one of the distance metrics described below), the majority belong to a certain class, then the sample is assigned to that class. The method decides the class of an unlabeled sample based only on the classes of its nearest one or few neighbors.

[Figure: a blue query point whose three nearest neighbors are all red]

As the figure shows, the three points nearest the blue point are all red, so the blue point is classified as red.

[Figure: the blue query point at (5, 2.5), with two green points and one red point among its three nearest neighbors]

When the blue point lies at (5, 2.5), two of its three nearest neighbors are green and one is red, so the blue point is assigned to the green class.
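The majority vote behind this rule is a one-liner with Python's `Counter`; a minimal sketch with illustrative labels standing in for the neighbors in the second figure:

```python
from collections import Counter

# Labels of the K = 3 nearest neighbors (illustrative)
topk_labels = ['green', 'green', 'red']
votes = Counter(topk_labels)

# most_common(1) returns the single most frequent label with its count
print(votes.most_common(1)[0][0])  # green
```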

Steps to Implement KNN

  1. Compute the distance from the point to be classified to every other point.
  2. Sort the distances in ascending order and keep the first K (the K in KNN) points, i.e., the K points nearest the sample.
  3. Take a majority vote (optionally distance-weighted) among those K labels to obtain the answer.

Three Distance Metrics

1. Euclidean Distance
The Euclidean distance is the most common distance metric; it measures the absolute distance between points in a multidimensional space.

The formula:

$$d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$

2. Minkowski Distance
The Minkowski distance generalizes the Euclidean distance; it subsumes several distance metrics in a single formula.

The formula:

$$d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^p\right)^{1/p}$$

Here p is a parameter: p = 2 gives the Euclidean distance above, p = 1 gives the Manhattan distance, and p → ∞ gives the Chebyshev distance.

3. Manhattan Distance
The Manhattan distance comes from city-block distance: it is the sum of the distances along each dimension, i.e., the Minkowski distance with p = 1.

The formula:

$$d(x, y) = \sum_{i=1}^{n}|x_i - y_i|$$
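The three metrics are easy to compare with NumPy; a small sketch on two hypothetical 2-D points (the `minkowski` helper is defined here for illustration):

```python
import numpy as np

# Two example points chosen so the Euclidean distance is a 3-4-5 triangle
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # p = 2
manhattan = np.sum(np.abs(x - y))          # p = 1

def minkowski(x, y, p):
    # General Minkowski distance; reduces to the two above for p = 2 and p = 1
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(euclidean)           # 5.0
print(manhattan)           # 7.0
print(minkowski(x, y, 2))  # 5.0, same as Euclidean
```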

Implementing KNN in Code

A basic KNN classification function

from math import sqrt
import numpy as np
from collections import Counter

def KNN_classify(k, X_train, y_train, x):
    # Euclidean distance from x to every training sample
    distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
    # indices of the training samples sorted by distance (ascending)
    nearest = np.argsort(distances)
    # labels of the k nearest samples
    topk_y = [y_train[i] for i in nearest[:k]]
    # majority vote among the k labels
    votes = Counter(topk_y)
    return votes.most_common(1)[0][0]

# A toy demo
raw_data_X = [[3.39, 2.33],
              [3.11, 1.78],
              [1.34, 3.36],
              [3.58, 4.67],
              [2.28, 2.86],
              [7.42, 4.69],
              [5.74, 3.53],
              [9.17, 2.51],
              [7.79, 3.42],
              [7.93, 0.79]
             ]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)
x = np.array([8.09, 3.37])  # an example point to classify
predict_y = KNN_classify(6, X_train, y_train, x)  # returns one of the classes in raw_data_y

This is the most basic KNN classification function: it implements plain distance-sort-vote classification.

Wrapping KNN in a class

from math import sqrt
import numpy as np
from collections import Counter
from sklearn.metrics import accuracy_score

class KNNClassifier:

    def __init__(self, k):
        assert k>=1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least K"
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature number of X_predict must be equal to X_train"

        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]

    def score(self, X_test, y_test):
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)
# Handwritten-digit recognition with the KNNClassifier above
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

knn_clf = KNNClassifier(k=3)
knn_clf.fit(X_train, y_train)        # fit must be called before predict
y_predict = knn_clf.predict(X_test)  # predicted labels
knn_clf.score(X_test, y_test)        # prediction accuracy

The fit function stores the training data and fits the model.

The predict function makes predictions for new data.

The score function evaluates the predictions.

KNN in scikit-learn

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)  # fit on the training set
knn_clf.score(X_test, y_test)  # prediction accuracy

Hyperparameters

Searching for the best k and distance weighting

best_method = ""
best_score = 0.0
best_k = -1
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
print(f"best_k = {best_k}")
print(f"best_score = {best_score}")
print(f"best_method = {best_method}")

Grid search

param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
knn_clf = KNeighborsClassifier()
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
grid_search.fit(X_train,y_train)

"""
Output:
GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=3, p=3,
                                            weights='distance'),
             iid='warn', n_jobs=-1,
             param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'weights': ['uniform']},
                         {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)
"""
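Once the search has finished, the tuned model and its parameters can be read off the fitted searcher via `best_params_`, `best_score_`, and `best_estimator_`. A self-contained sketch (a smaller grid than above, with an assumed `random_state`, to keep it quick):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

# A reduced grid for illustration
param_grid = [{'weights': ['uniform'], 'n_neighbors': list(range(1, 6))}]
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs=-1)
grid_search.fit(X_train, y_train)

best_knn = grid_search.best_estimator_   # already refit on the whole training set
print(grid_search.best_params_)          # winning parameter combination
print(grid_search.best_score_)           # mean cross-validated accuracy
print(best_knn.score(X_test, y_test))    # accuracy on the held-out test set
```

Note that `best_score_` is a cross-validation score on the training folds, so it can differ slightly from the held-out test accuracy.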

Some of the underlying implementations

train_test_split
import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ratio must be valid"

    if seed is not None:  # seed the RNG for a reproducible split
        np.random.seed(seed)
    shuffle_indexes = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]
    X_train = X[train_indexes]
    y_train = y[train_indexes]

    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test

StandardScaler
import numpy as np

class StandardScaler:

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        assert X.ndim == 2, "the dimension of X must be 2"

        # per-column mean and standard deviation
        self.mean_ = np.array([np.mean(X[:, i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:, i]) for i in range(X.shape[1])])
        return self

    def transform(self, X):
        assert X.ndim == 2, "the dimension of x must be 2"

        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:, col] = (X[:,col]-self.mean_[col]) / self.scale_[col]

        return resX
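The standardization logic itself can be checked with plain NumPy, independent of the class: subtracting each column's mean and dividing by its standard deviation should leave every column with mean 0 and standard deviation 1. A quick sketch on made-up data:

```python
import numpy as np

# Toy data: two columns on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

mean_ = X.mean(axis=0)    # per-column mean, as computed in fit
scale_ = X.std(axis=0)    # per-column standard deviation
X_std = (X - mean_) / scale_  # what transform computes

print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # [1, 1]
```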
accuracy_score
import numpy as np

def accuracy_score(y_true, y_predict):
    """Compute the accuracy of y_predict against y_true"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(y_true == y_predict) / len(y_true)
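The trick in the return line is that `y_true == y_predict` produces a boolean array, and summing it counts the matches. A tiny illustration with made-up labels:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Element-wise comparison gives [True, True, False, True, True];
# its sum (4) divided by the length (5) is the accuracy
acc = np.sum(y_true == y_pred) / len(y_true)
print(acc)  # 0.8
```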

More machine-learning notes
