2. k-Nearest Neighbors Algorithm - CSDN Blog

Article link: https://blog.csdn.net/weixin_41207499/article/details/85249198

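I. k-Nearest Neighbors Basics (kNN)

In short, kNN compares a new sample against labeled samples of several known classes by 'distance' and finds the k samples closest to it.
Whichever class accounts for the largest share of those k samples is the class assigned to the new sample.

Creating the test data

Implementing the kNN algorithm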
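
The two steps above can be sketched roughly as follows. The data is a hypothetical toy set made up for illustration (it is not the data from the original notebook), and names such as x_new are assumptions.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data: two features per sample, labels 0 and 1.
X_train = np.array([
    [1.0, 1.2], [1.5, 1.8], [2.0, 1.0], [1.2, 2.1],   # class 0
    [6.0, 6.5], [7.0, 6.0], [6.5, 7.2], [7.5, 7.0],   # class 1
])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x_new = np.array([6.2, 5.9])   # the sample to classify
k = 3

# 1. Euclidean distance from x_new to every training sample.
distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))

# 2. Indices of the k closest training samples.
nearest = np.argsort(distances)[:k]

# 3. Majority vote among the labels of those k neighbors.
votes = Counter(y_train[nearest].tolist())
predicted = votes.most_common(1)[0][0]
print(predicted)
```

Here the three nearest samples all belong to class 1, so the new sample is assigned to class 1.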

II. How scikit-learn wraps machine learning algorithms

Wrap the kNN algorithm from the previous section in a function

import numpy as np
from math import sqrt
from collections import Counter


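def kNN_classify(k, X_train, y_train, x):
    """Predict the label of a single sample x by majority vote of its k nearest neighbors."""
    assert 1 <= k <= X_train.shape[0], "k must be valid"
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must be equal to the size of y_train"
    assert X_train.shape[1] == x.shape[0], \
        "the feature number of x must be equal to X_train"

    # Euclidean distance from x to every training sample
    distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]
    # Indices of the training samples, sorted from nearest to farthest
    nearest = np.argsort(distances)

    # Labels of the k nearest samples, then the vote count per label
    topK_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topK_y)

    # Return the label with the most votes
    return votes.most_common(1)[0][0]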

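Calling it from a notebook

Question: the kNN code in the previous section never seemed to build a model, did it?

kNN is a rather special algorithm: it can be regarded as having no model at all.
To keep it consistent with other algorithms, the training dataset itself can be regarded as the model.

kNN in sklearn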
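
A minimal sketch of what using kNN through scikit-learn might look like, assuming scikit-learn is installed. KNeighborsClassifier and its n_neighbors parameter are part of sklearn.neighbors; the toy data is hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two features per sample, labels 0 and 1.
X_train = np.array([
    [1.0, 1.2], [1.5, 1.8], [2.0, 1.0], [1.2, 2.1],
    [6.0, 6.5], [7.0, 6.0], [6.5, 7.2], [7.5, 7.0],
])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# "Fitting" kNN only stores the training data; no parameters are learned.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# predict expects a 2-D array: one row per sample to classify.
X_new = np.array([[6.2, 5.9]])
y_pred = knn.predict(X_new)
print(y_pred)
```

The fit/predict pair is the same interface the other scikit-learn estimators expose, which is why kNN is given a fit step even though it has nothing to train.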

Wrapping our own kNN class in the sklearn style

import numpy as np
from math import sqrt
from collections import Counter


class KNNClassifier:

    def __init__(self, k):
        """Initialize the kNN classifier."""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """Fit the kNN classifier on the training data X_train and y_train."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k."

        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """Given the samples X_predict, return the vector of predicted labels."""
        assert self._X_train is not None and self._y_train is not None, \
                "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1], \
                "the feature number of X_predict must be equal to X_train"

        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Given a single sample x, return its predicted label."""
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"

        distances = [sqrt(np.sum((x_train - x) ** 2))
                     for x_train in self._X_train]
        nearest = np.argsort(distances)

        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)

        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNN(k=%d)" % self.k

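Calling our own kNN class from a notebook

III. Training and test datasets

Judging the performance of a machine learning algorithm

Rather than using all the data for training, hold out part of it as test data and use it to evaluate the model obtained from the training data.

Writing our own train/test split code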

import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Split X and y into X_train, X_test, y_train, y_test according to test_ratio."""
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ratio must be valid"

    # Compare against None so that seed=0 is honored as well
    if seed is not None:
        np.random.seed(seed)

    shuffled_indexes = np.random.permutation(len(X))

    test_size = int(len(X) * test_ratio)
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]

    X_train = X[train_indexes]
    y_train = y[train_indexes]

    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test

playML
├── __init__.py
├── kNN.py                # kNN class
└── model_selection.py    # train/test split

Calling our own code from a notebook

train_test_split in sklearn
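
For comparison, a short sketch of the scikit-learn version, assuming scikit-learn is installed. Its keyword arguments are test_size and random_state rather than the test_ratio and seed used in the hand-written function; the arrays here are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, row i is [2i, 2i + 1].
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the samples for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666)

print(X_train.shape, X_test.shape)
```

Samples and labels are shuffled together, so each row of X_train still lines up with the matching entry of y_train.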

IV. Classification accuracy

The accuracy check at the end of the previous section can be wrapped in a function, in metrics.py

import numpy as np


def accuracy_score(y_true, y_predict):
    """Compute the accuracy of y_predict against y_true."""
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"

    # Fraction of positions where the prediction matches the true label
    return np.sum(y_true == y_predict) / len(y_true)

Calling it from the kNN class

...
from .metrics import accuracy_score

class KNNClassifier:

    ...

    def score(self, X_test, y_test):
        """Compute the accuracy of the current model on the test data X_test and y_test."""

        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "KNN(k=%d)" % self.k

Calling it from a notebook
accuracy_score in sklearn
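
A small sketch of the scikit-learn counterpart, assuming scikit-learn is installed. The label vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions: 4 of the 5 predictions are correct.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

acc = accuracy_score(y_true, y_pred)

# The same number computed by hand: the fraction of matching positions.
acc_by_hand = np.mean(y_true == y_pred)
print(acc, acc_by_hand)
```

scikit-learn classifiers also expose a score(X_test, y_test) method that returns this accuracy without the intermediate predict call.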

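V. Hyperparameters

Hyperparameter: a parameter that must be chosen before the learning algorithm runs, such as k in kNN. In machine learning jargon, "parameter tuning" usually means tuning hyperparameters.

Model parameter: a parameter learned while the algorithm runs. kNN has no model parameters.

Finding the best hyperparameters

If the best k found here is close to 11, that is, near the boundary of the range searched, the range of candidate k values must be widened and the search repeated, because a better value may lie beyond that boundary.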
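
The search itself can be sketched as a plain loop over candidate values of k. This is an illustration under assumptions: the digits dataset, the 80/20 split and the range 1 to 10 are choices made here, not necessarily those of the original notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data: the handwritten digits set bundled with scikit-learn.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666)

best_score, best_k = 0.0, -1
for k in range(1, 11):          # candidate values k = 1 .. 10
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)
    if score > best_score:      # keep the best k seen so far
        best_score, best_k = score, k

print("best_k =", best_k, "best_score =", best_score)
```

If best_k comes out at or next to the largest candidate, rerunning with a wider range (for example range(8, 21)) checks whether the score keeps improving.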

kNN with distance weighting

Weighting the votes (for example by the inverse of each neighbor's distance) also settles most tied votes.
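
A minimal sketch of distance-weighted voting, assuming each neighbor votes with weight 1/distance. The helper name weighted_vote and the example distances are hypothetical.

```python
from collections import Counter, defaultdict


def weighted_vote(distances, labels):
    """Return the label with the largest total 1/distance weight, plus all totals."""
    totals = defaultdict(float)
    for d, label in zip(distances, labels):
        totals[label] += 1.0 / d      # assumes every distance is > 0
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


# Three nearest neighbors: one close "red", two farther "blue".
distances = [1.0, 3.0, 4.0]
labels = ["red", "blue", "blue"]

plain_winner = Counter(labels).most_common(1)[0][0]          # unweighted majority
weighted_winner, totals = weighted_vote(distances, labels)   # red: 1, blue: 1/3 + 1/4

# A three-way tie under plain voting is settled by the weights.
tie_winner, _ = weighted_vote(distances, ["red", "blue", "green"])
print(plain_winner, weighted_winner, tie_winner)
```

In scikit-learn the corresponding switch is the weights="distance" option of KNeighborsClassifier.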

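Minkowski distance

Manhattan distance: $\left( \sum_{i=1}^{n} \left| x_i^{(a)} - x_i^{(b)} \right|^{1} \right)^{\frac{1}{1}}$
Euclidean distance: $\left( \sum_{i=1}^{n} \left| x_i^{(a)} - x_i^{(b)} \right|^{2} \right)^{\frac{1}{2}}$
Minkowski distance: $\left( \sum_{i=1}^{n} \left| x_i^{(a)} - x_i^{(b)} \right|^{p} \right)^{\frac{1}{p}}$
When p = 1, the Minkowski distance is the Manhattan distance.
When p = 2, the Minkowski distance is the Euclidean distance.
So p can also serve as a hyperparameter of kNN.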
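
The relationship between the three distances can be checked with a few lines of NumPy. The function name minkowski_distance is an illustrative choice, not part of the original code.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


a = [0.0, 0.0]
b = [3.0, 4.0]

d1 = minkowski_distance(a, b, 1)   # Manhattan: |3| + |4|
d2 = minkowski_distance(a, b, 2)   # Euclidean: sqrt(3^2 + 4^2)
print(d1, d2)
```

KNeighborsClassifier exposes the same exponent as its p parameter.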

The nested loops over k and p in the figure can be seen as a grid search.

VI. Grid search and more kNN hyperparameters

Using grid search in sklearn
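
A hedged sketch of how GridSearchCV might be applied to kNN, assuming scikit-learn is installed. The digits dataset, the small parameter ranges and cv=5 are assumptions chosen to keep the example quick, not the settings of the original notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data: the handwritten digits set bundled with scikit-learn.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666)

# Each dict is one sub-grid; p is only searched together with distance weighting.
param_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 6))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 6)), "p": [1, 2]},
]

# Every combination is scored with 5-fold cross-validation on the training data.
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_, grid_search.best_score_)

# best_estimator_ is refit on the whole training set with the best parameters.
best_knn = grid_search.best_estimator_
test_score = best_knn.score(X_test, y_test)
print(test_score)
```

Unlike a hand-written double loop scored on the test set, GridSearchCV selects the parameters by cross-validation, so best_score_ and the test-set score are generally not identical.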

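VII. Data normalization

Take tumor data as an example

Tumor size and time since discovery are measured in different units, so their numeric ranges differ greatly.
When distances are computed, the feature with the larger values dominates, which badly distorts the result.

Solution

Map all the data onto the same scale

Min-max normalization

Min-max normalization: maps all the data into the range 0 to 1, using the formula $x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}}$

Where it applies: distributions with clear boundaries.
It is strongly affected by outliers (extreme values).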
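
Min-max normalization is usually applied column by column. A small NumPy sketch with hypothetical numbers (the two columns stand in for tumor size and days since discovery):

```python
import numpy as np

# Hypothetical samples: column 0 = tumor size, column 1 = days since discovery.
X = np.array([
    [1.0, 200.0],
    [2.0, 100.0],
    [3.0, 300.0],
    [5.0, 100.0],
])

# Per-column minimum and maximum (axis=0 runs down the rows).
X_min = X.min(axis=0)
X_max = X.max(axis=0)

# Broadcasting applies (x - min) / (max - min) to every column at once.
X_scaled = (X - X_min) / (X_max - X_min)
print(X_scaled)
```

After scaling, both columns span 0 to 1, so neither dominates the distance purely because of its unit.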

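Mean-variance normalization (standardization)

Mean-variance normalization: rescales all the data to a distribution with mean 0 and variance 1.
Where it applies: distributions without clear boundaries, or data that may contain extreme values.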
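
In formula form this is $x_{scale} = \frac{x - x_{mean}}{s}$, where $s$ is the standard deviation of the feature. A NumPy sketch on hypothetical data, applied per column:

```python
import numpy as np

# Hypothetical samples: column 0 = tumor size, column 1 = days since discovery.
X = np.array([
    [1.0, 200.0],
    [2.0, 100.0],
    [3.0, 300.0],
    [5.0, 100.0],
])

# Per-column mean and standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)

# Subtract the mean and divide by the standard deviation, column by column.
X_std = (X - mean) / std

print(X_std.mean(axis=0))   # approximately 0 for each column
print(X_std.std(axis=0))    # approximately 1 for each column
```

The result is not confined to a fixed interval, which is part of why a single extreme value distorts it less than it distorts min-max scaling.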

Demonstration

sklearn also provides min-max normalization, sklearn.preprocessing.MinMaxScaler,
but mean-variance normalization is the more commonly used of the two.

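VIII. Scaler in scikit-learn

How should the test dataset be normalized?

Suppose the mean and standard deviation obtained from the training dataset are mean_train and std_train.
The test dataset must be normalized with the training dataset's mean and standard deviation.
Normalizing the test dataset: (x_test - mean_train) / std_train
So the mean and standard deviation obtained from the training dataset need to be saved; scikit-learn does this with its Scaler classes.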
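
A short sketch of that workflow with scikit-learn's StandardScaler, assuming scikit-learn is installed. The arrays are hypothetical; the point is that fit sees only the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test samples with two features each.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test = np.array([[2.5, 250.0], [5.0, 500.0]])

scaler = StandardScaler()
scaler.fit(X_train)              # learns mean_ and scale_ from the training data only

# Both sets are transformed with the training statistics.
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)
print(X_test_std)
```

The first test sample equals the training mean in both features, so it maps to zeros, while the second lies outside the training range and maps to values beyond the largest training sample.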

Implementing StandardScaler ourselves

preprocessing.py

import numpy as np


class StandardScaler:

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """Compute the mean and standard deviation of each feature from the training data X."""
        assert X.ndim == 2, "The dimension of X must be 2"

        self.mean_ = np.array([np.mean(X[:,i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:,i]) for i in range(X.shape[1])])

        return self

    def transform(self, X):
        """Standardize X (mean-variance normalization) using this StandardScaler."""
        assert X.ndim == 2, "The dimension of X must be 2"
        assert self.mean_ is not None and self.scale_ is not None, \
               "must fit before transform!"
        assert X.shape[1] == len(self.mean_), \
               "the feature number of X must be equal to mean_ and scale_"

        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):  # standardize each column in turn
            resX[:,col] = (X[:,col] - self.mean_[col]) / self.scale_[col]
        return resX