Basic KNN Algorithm

An overview of the basics of the KNN algorithm:

  • 1. Definition: the KNN algorithm (a supervised algorithm) is one of the simplest classification algorithms, and also one of the most commonly used.
  • 2. Principle: to predict the class of a new point x, look at the K points nearest to x and let the majority class among them decide the class of x.
  • 3. The main quantities to compute:
    • The nearest K points: choosing the value of K and computing the distance between two points
    • Classes: represented here as 0 and 1
    • Proportion: the share of 0s and 1s among the K neighbors
    • Euclidean distance: the formula in n-dimensional space
      c = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}
    • Manhattan distance: measures distance block by block; in theory it is faster to compute than the Euclidean distance
      c = \sum_{i=1}^n |x_i - y_i|
    • Minkowski distance:
      c = \left( \sum_{i=1}^n |x_i - y_i|^p \right)^{1/p}
    • Code implementation of the KNN algorithm (Euclidean distance):
# -*- coding: utf-8 -*-
# @Time    : 2021/7/21 21:28
# @Email   : 2635681517@qq.com
from collections import Counter

import numpy as np
from math import sqrt


class KNNClassifier:
    def __init__(self, k):
        """Initialize the kNN classifier."""
        # returning a value from __init__ is invalid Python; validate k instead
        assert k >= 1, "k must be a positive integer"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, _X_train, _y_train):
        """根据训练数据集X_train和y_train训练kNN分类器"""
        self._X_train = _X_train
        self._y_train = _y_train
        return self

    def predict(self, X_predict):
        """给定待预测数据集X_predict,返回表示X_predict的结果向量"""
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature number of X_predict must be equal to X_train"
        y_predict = []
        for x in X_predict:
            # collect each point's prediction and return them all as an array
            y_predict.append(self._predict(x))
        return np.array(y_predict)

    def _predict(self, x):
        """给定单个待预测数据x,返回x的预测结果值"""
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"

        distances = []
        for x_train in self._X_train:
            distances.append(sqrt(np.sum((x_train - x) ** 2)))
        print("预测点到原来点的距离为:")
        print(distances)
        """
        nearest:找到距离最近的K个点位置(索引)
        """
        nearest = np.argsort(distances)
        """
        在距离预测值 最近的 k个值中 进行 分类 统计 得出最多的比例类别
        """
        """
        _y_train[i]:分类
        """
        topK_y = []  # 得到最近的三个点的距离分别是多大
        for i in nearest[:self.k]:  # nearest[:k]计算距离最近的k个点的距离
            topK_y.append(self._y_train[i])  # 分类
            """Counter(topK_y)统计分类"""
        votes = Counter(topK_y)  # 分类统计
        """votes.most_common(1)[0][0]取出分类过后最多的那个类别是0还是1"""
        return votes.most_common(1)[0][0]

    def score(self, X_test, y_test):
        """Accuracy of the current model on the test sets X_test and y_test."""
        # accuracy_score is the helper function defined later in this post
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "KNN(k=%d)" % self.k

Test code (the data below is made up):

# -*- coding: utf-8 -*-
# @Time    : 2021/7/21 21:53
# @Email   : 2635681517@qq.com

import numpy as np
from sf.KNN1 import KNNClassifier
if __name__ == '__main__':
    raw_data_X = [[3.393533211, 2.331273381],
                  [3.110073483, 1.781539638],
                  [1.343808831, 3.368360954],
                  [3.582294042, 4.679179110],
                  [2.280362439, 2.866990263],
                  [7.423436942, 4.696522875],
                  [5.745051997, 3.533989803],
                  [9.172168622, 2.511101045],
                  [7.792783481, 3.424088941],
                  [7.939820817, 0.791637231]
                  ]
    x = np.array([8.093607318, 3.365731514])
    X_predict = x.reshape(1, -1)
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # the class labels

    X_train = np.array(raw_data_X)
    y_train = np.array(raw_data_y)

    knn_clf = KNNClassifier(3)
    knn_clf.fit(X_train, y_train)
    y_predict = knn_clf.predict(X_predict)
    print("所属类别")
    print(y_predict)
    print(y_predict[0])

  • To use the Manhattan distance instead, replace the line above,
distances.append(sqrt(np.sum((x_train - x) ** 2)))

with

distances.append((np.sum(abs(x_train - x))))
  • And of course there is also the Minkowski distance (here with p = 2; note the parentheses around 1/2, since ** binds tighter than /):
distances.append((np.sum(abs(x_train - x) ** 2)) ** (1 / 2))
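
All three formulas above are special cases of the Minkowski distance, so one parameterized helper covers them. A minimal sketch (the function name is mine, not from the class above):

import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two points: p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski_distance(a, b, p=1))  # 7.0 (Manhattan)
print(minkowski_distance(a, b, p=2))  # 5.0 (Euclidean)
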
  • 4. Split the data set into a training set and a test set: 80% for training and 20% for testing (the ratio is adjustable).
    Define a function train_test_split for this. You can also call sklearn's train_test_split directly; sklearn's implementation is far more elaborate than mine, but the principle is similar. My own version, followed by a short usage sketch, is below:
# -*- coding: utf-8 -*-
# @Time    : 2021/7/26 10:14
# @Email   : 2635681517@qq.com
import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    """
    :param X:
    :param y:
    :param test_ratio: 测试集比例0.2
    :param seed: 随机数
    :return: 分类好的数据:X_train, y_train, X_test, y_test
    """
    if X.shape[0] != y.shape[0]:
        print("X的行数需要和y的列数相同")
    if 0.0 >= test_ratio >= 1.0:
        print("测试比例需要在0.0-1.0之间")
    if seed:
        np.random.seed(seed)
    shuffled_indexes = np.random.permutation(len(X))

    test_size = int(len(X) * test_ratio)  # number of test samples
    train_indexes = shuffled_indexes[test_size:]  # indices of the remaining 80%: the training set
    test_indexes = shuffled_indexes[:test_size]  # indices of the first 20%: the test set
    # pick out the training data by index
    X_train = X[train_indexes]
    y_train = y[train_indexes]
    # pick out the test data by index
    X_test = X[test_indexes]
    y_test = y[test_indexes]
    return X_train, y_train, X_test, y_test
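
A quick usage sketch (the iris data set is just an illustration; note that, unlike sklearn's version, this function returns the sets in the order X_train, y_train, X_test, y_test):

from sklearn import datasets

iris = datasets.load_iris()
X_train, y_train, X_test, y_test = train_test_split(iris.data, iris.target, test_ratio=0.2, seed=666)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
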
  • Computing the accuracy (you can just as well call sklearn's accuracy_score). Following the tutorial I was watching, I implemented my own accuracy_score. The principle is simple: count how many predictions agree with the true values (the test labels y_test), then divide the number of matches by the total. The function, followed by an end-to-end example, is below:
def accuracy_score(y_true, y_predict):
    """
    :param y_true: 测试值,分类出来的y_test
    :param y_predict: 预测值,利用X_test跑predict算法得出来的预测值
    :return: 精确度
    """
    """计算y_true和y_predict之间的准确率"""
    if y_true.shape[0] != y_predict.shape[0]:
        print("预测值和训练数据的行数要相同")
    return sum(y_true==y_predict)/len(y_true)
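
Putting points 1-4 together, a minimal end-to-end run might look like this (a sketch assuming KNNClassifier, train_test_split, and accuracy_score from above are available in one module; the exact accuracy will vary):

from sklearn import datasets

iris = datasets.load_iris()
X_train, y_train, X_test, y_test = train_test_split(iris.data, iris.target, test_ratio=0.2, seed=666)

knn_clf = KNNClassifier(k=3)  # the classifier from the first section
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
print(accuracy_score(y_test, y_predict))  # typically around 0.9-1.0 on iris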

5. Hyperparameters. The KNN algorithm has a parameter k, and how well k is chosen directly determines how accurate the predictions are. So how do we pick it? One approach is to compute the accuracy for every k in a given range and keep the k with the highest accuracy. Looking at the source code of KNeighborsClassifier, we can see that its constructor takes the parameters shown below; a suitable value can be found for each of them in the same way, and the best values may differ across applications.

def __init__(self, n_neighbors=5, *,
                 weights='uniform', algorithm='auto', leaf_size=30,
                 p=2, metric='minkowski', metric_params=None, n_jobs=None,
                 **kwargs):

The code below searches for such a k. (Note: sklearn's grid search, GridSearchCV, can also produce it directly.)

import time

from sklearn.neighbors import KNeighborsClassifier

# assumes the digits data has already been split into
# X_train, X_test, y_train, y_test (see the full script in section 6)
start1 = time.perf_counter()
# search for the best k and weighting scheme
best_score = 0.0
best_k = -1
best_method = ""
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        # build a classifier with this k and weighting
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        # fit it on X_train and y_train
        knn_clf.fit(X_train, y_train)
        # compute the accuracy on the test set
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
end1 = time.perf_counter()
print("best_method =", best_method)
print("best_k =", best_k)
print("best_score =", best_score)
print("final is in : %s Seconds " % (end1 - start1))

Result: the best k is 3:

best_method = uniform
best_k = 3
best_score = 0.9916666666666667
final is in : 0.4128426999999999 Seconds 
  • Searching for p, the exponent in the Minkowski distance. Note: in this search the weights parameter is fixed to "distance", pairing p with distance weighting (the same pairing used in the GridSearchCV grid below).
start1 = time.perf_counter()
best_score = 0.0
best_k = -1
best_p = -1
for k in range(1, 11):
    for p in range(1, 6):
        # build a classifier with this k and p
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        # fit it on X_train and y_train
        knn_clf.fit(X_train, y_train)
        # compute the accuracy on the test set
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_p = p
end1 = time.perf_counter()
print("best_p =", best_p)
print("best_k =", best_k)
print("best_score =", best_score)
print("final is in : %s Seconds " % (end1 - start1))

Result:

best_p = 2
best_k = 3
best_score = 0.9916666666666667
final is in : 24.7262921 Seconds 
  • Searching for leaf_size works the same way (note that leaf_size mainly affects the speed of building and querying the internal tree, not the predictions themselves, so tuning it rarely changes the score); the code is below:
start1 = time.perf_counter()
best_score = 0.0
best_k = -1
best_leaf_size = -1
for leaf_size in range(20, 40):
    for k in range(1, 11):
        # build a classifier with this k and leaf_size
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", leaf_size=leaf_size)
        # fit it on X_train and y_train
        knn_clf.fit(X_train, y_train)
        # compute the accuracy on the test set
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_leaf_size = leaf_size
end1 = time.perf_counter()
print("best_k =", best_k)
print("best_score =", best_score)
print("best_leaf_size =", best_leaf_size)
print("final is in : %s Seconds " % (end1 - start1))

The result:

best_k = 3
best_score = 0.9916666666666667
best_leaf_size = 20
final is in : 27.014338 Seconds 

6. As mentioned earlier, grid search (GridSearchCV) can also find the best values, though I still prefer writing the search myself. Using GridSearchCV looks like this:

# -*- coding: utf-8 -*-
# @Time    : 2021/7/27 10:20
# @Email   : 2635681517@qq.com
import time

from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X = digits.data
y = digits.target

start1 = time.perf_counter()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
sk_knn_clf = KNeighborsClassifier(n_neighbors=3, weights="uniform")
sk_knn_clf.fit(X_train, y_train)
res = sk_knn_clf.score(X_test, y_test)
end1 = time.perf_counter()
print(res)
print("final is in : %s Seconds " % (end1 - start1))

# grid search
parm_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)],
        # 'leaf_size': [i for i in range(10, 40)]
    }
]
"""再加上一个参数leaf_size的话搜索的时间复杂度就是O(n^3)这样的算法是不采用的,很花时间,
          而实际上,我们写的算法大多数都是O(n),都是在这基础上改进的"""
knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, parm_grid)

start1 = time.perf_counter()
# fit the grid search
grid_search.fit(X_train, y_train)
end1 = time.perf_counter()
print("final is in : %s Seconds " % (end1 - start1))

start1 = time.perf_counter()
print(grid_search.best_estimator_)
end1 = time.perf_counter()
print("final is in : %s Seconds " % (end1 - start1))

print("grid_search.best_score_:")
print(grid_search.best_score_)
print("========================")

# n_jobs=-1 uses all CPU cores, which makes the search noticeably faster
grid_search = GridSearchCV(knn_clf, parm_grid, n_jobs=-1, verbose=2)
start1 = time.perf_counter()
grid_search.fit(X_train, y_train)
end1 = time.perf_counter()
print("final is in : %s Seconds " % (end1 - start1))
print("==========================")
print(grid_search.best_estimator_)
print(str(grid_search.best_score_*100)+'%')

Results:

0.9916666666666667
final is in : 0.01844010000000007 Seconds 
final is in : 79.1339263 Seconds 
KNeighborsClassifier(n_neighbors=1)
final is in : 0.0012061999999986028 Seconds 
grid_search.best_score_:
0.9860820751064653
==========================
With all CPU cores in use, the search time is clearly lower than the 79.13 seconds above. Here the best k is 1; in a different environment both k and the accuracy may come out differently.
final is in : 22.741624200000004 Seconds 
==========================
KNeighborsClassifier(n_neighbors=1)
98.60820751064652%
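
As a side note, the tuned model can be taken straight from the fitted search object through GridSearchCV's standard attributes. A short sketch (the printed values come from my run above and will vary):

# after grid_search.fit(X_train, y_train):
print(grid_search.best_params_)         # e.g. {'n_neighbors': 1, 'weights': 'uniform'}
best_knn = grid_search.best_estimator_  # refit on the full training set (refit=True is the default)
print(best_knn.score(X_test, y_test))   # accuracy on the held-out test set
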
  • This raises another issue: data sets often contain outlier samples, which increase training time and let one dimension (or a few) dominate the distance computation. Normalizing the features prevents this and also lets the program run faster. There are two common schemes:
  • Min-max normalization:
    X_{scale} = \frac{X - X_{min}}{X_{max} - X_{min}}
  • Mean normalization (standardization):
    X_{std} = \frac{X - X_{mean}}{S}, where S is the standard deviation
# -*- coding: utf-8 -*-
# @Time    : 2021/7/27 11:11
# @Email   : 2635681517@qq.com
import numpy as np
import matplotlib.pyplot as plt

# min-max normalization (Normalization)
x = np.random.randint(0, 100, 100)
print(np.min(x))
print(np.max(x))
array = (x - np.min(x)) / (np.max(x) - np.min(x))

X = np.random.randint(0, 100, (50, 2))
# print(X)
print("=======")
# print(X[1:3, 1:])  # rows 1-2, columns from index 1 on
X = np.array(X, dtype=float)
# print(X[:, 0])  # all rows of column 0, as a 1-D array
# print(X[:10, :])
X[:, 0] = (X[:, 0] - np.min(X[:, 0])) / (np.max(X[:, 0]) - np.min(X[:, 0]))
X[:, 1] = (X[:, 1] - np.min(X[:, 1])) / (np.max(X[:, 1]) - np.min(X[:, 1]))
print(X[:10, :])
plt.scatter(X[:, 0], X[:, 1])
plt.show()
print(np.mean(X[:, 0]))
print(np.std(X[:, 0]))
print(np.mean(X[:, 1]))
print(np.std(X[:, 1]))

# mean-std standardization (Standardization)
X2 = np.random.randint(0, 100, (50, 2))
X2 = np.array(X2, dtype=float)
X2[:, 0] = (X2[:, 0] - np.mean(X2[:, 0])) / np.std(X2[:, 0])
X2[:, 1] = (X2[:, 1] - np.mean(X2[:, 1])) / np.std(X2[:, 1])

plt.scatter(X2[:, 0], X2[:, 1])
plt.show()
print("============")
print(np.mean(X2[:, 0]))
print(np.std(X2[:, 0]))
print(np.mean(X2[:, 1]))
print(np.std(X2[:, 1]))
  • Using normalization (here with sklearn's StandardScaler):
# -*- coding: utf-8 -*-
# @Time    : 2021/7/27 16:45
# @Email   : 2635681517@qq.com
import numpy as np
from sklearn import datasets

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
standardScalar = StandardScaler()
# print(X_train[:10, :])
print(y_test)
standardScalar.fit(X_train)
# print(standardScalar.mean_)  # the per-feature means
# print(standardScalar.scale_)  # the per-feature standard deviations
# print(standardScalar.transform(X_train))  # the normalized training data
# print(standardScalar.transform(X_test))
X_train = standardScalar.transform(X_train)
X_test_standard = standardScalar.transform(X_test)
print(X_test_standard)
print('===============================')
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)  # 归一化后的数据
print(y_test)
"""因为X_train是经过归一化处理后的数据,所以在测试(X_test_standard)时也是需要使用归一化的数据进行测试得出预测值"""
print(knn_clf.score(X_test_standard, y_test))
# scoring with the raw X_test is wrong: normalize both sets or neither
print(knn_clf.score(X_test, y_test))
  • A custom normalization class that computes the mean and standard deviation, with a usage sketch after it:
# -*- coding: utf-8 -*-
# @Time    : 2021/7/27 18:02
# @Email   : 2635681517@qq.com
import numpy as np

"""归一化,求均值和方差"""


class StandardScaler:
    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """根据训练数据集X获得数据的均值和方差"""
        if X.ndim != 2:
            raise Exception("维度需要是2维")
        self.mean_ = np.array([np.mean(X[:, i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:, i]) for i in range(X.shape[1])])
        return self

    def transform(self, X):
        """将X根据这个StandardScaler进行均值方差归一化处理"""
        if X.ndim != 2:
            raise Exception("维度需要是2维")
        if self.mean_ is None and self.scale_ is None:
            raise Exception("mean_和scale_的特征值不能为空")

        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            # the core step: subtract the mean, divide by the standard deviation
            resX[:, col] = (X[:, col] - self.mean_[col]) / self.scale_[col]
        return resX
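
A short sketch of how this DIY scaler slots in, mirroring the sklearn usage above (random data, purely for illustration):

import numpy as np

X_train = np.random.randint(0, 100, (50, 2)).astype(float)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
# each column should now have mean ~0 and standard deviation ~1
print(np.mean(X_train_std[:, 0]), np.std(X_train_std[:, 0]))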
  • KNN also has some drawbacks:
    it is inefficient (every prediction scans the whole training set), its predictions have no real interpretability, and it suffers from the curse of dimensionality.
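
To make the curse-of-dimensionality point concrete, here is a small illustration of my own (not from the material above): as the dimension grows, the nearest and the farthest neighbor of a query point become almost equally far away, so "the K nearest points" carry less and less information:

import numpy as np

rng = np.random.default_rng(0)
for dim in [2, 10, 100, 1000]:
    X = rng.random((500, dim))  # 500 random points in the unit hypercube
    q = rng.random(dim)         # one random query point
    d = np.sqrt(np.sum((X - q) ** 2, axis=1))
    # relative contrast between the farthest and the nearest neighbor
    print(dim, (d.max() - d.min()) / d.min())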