KNN Classification on USPS, UCI sonar, and UCI iris

Introduction to the KNN Algorithm

The k-nearest-neighbor (kNN) algorithm is one of the simplest classification techniques in data mining. "$k$ nearest neighbors" means that each sample can be represented by its $k$ closest neighbors. KNN is simple and effective. It is a lazy-learning algorithm: the classifier needs no training phase on the training set, so the training time complexity is essentially zero, while the cost of classifying a sample is proportional to the number of training samples.

Because KNN assigns a class based mainly on a limited number of nearby samples rather than on discriminating class regions, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily.

When the hyperparameter $k=1$, the class of a test sample is decided directly by the single training sample closest to it, in which case the method is also called the nearest-neighbor algorithm. The choice of $k$ is an important factor in the accuracy of KNN.

The KNN Algorithm

The basic idea of KNN is to compute the distance from the test sample to every training sample, pick the $k$ nearest training samples, and decide the class of the test sample by majority vote among them.

Distance Metrics in KNN

The distance between the feature vectors of two samples reflects how similar they are. Common choices include the Euclidean distance, the Manhattan distance, and more generally the $L_p$ distance.

Let the sample space be $X \subseteq \mathbb{R}^n$, where $n$ is the dimension of the feature vectors. The $L_p$ distance is defined as

$$L_p(\boldsymbol{x}_i, \boldsymbol{x}_j) = \left(\sum_{l=1}^{n}\left|x_i^{(l)} - x_j^{(l)}\right|^p\right)^{1/p}$$

where $p \geqslant 1$. When $p=2$, this is the Euclidean distance

$$L_2(\boldsymbol{x}_i, \boldsymbol{x}_j) = \left(\sum_{l=1}^{n}\left|x_i^{(l)} - x_j^{(l)}\right|^2\right)^{1/2}$$

p = 1 p=1 p=1时,称为曼哈顿距离
L 1 ( x i , x j ) = ∑ l = 1 n ∣ x i ( l ) − x j ( l ) ∣ L_1( \boldsymbol{x}_i, \boldsymbol{x}_j) = \sum_{l=1}^{n}\left|{x}_i^{(l)}- {x}_j^{(l)}\right| L1(xi,xj)=l=1nxi(l)xj(l)

p = ∞ p=\infty p=时,它代表各个坐标距离的最大值,即
L ∞ ( x i , x j ) = max ⁡ l ∣ x i ( l ) − x j ( l ) ∣ L_{\infty}( \boldsymbol{x}_i, \boldsymbol{x}_j) = \max \limits_{l}\left|{x}_i^{(l)}- {x}_j^{(l)}\right| L(xi,xj)=lmaxxi(l)xj(l)

Among these, the Euclidean distance is the most commonly used.
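As a quick, self-contained illustration (not part of the experiment code; the function name lp_distance is only for this example), the three metrics can be computed for a pair of feature vectors with NumPy:

import numpy as np

def lp_distance(x_i, x_j, p=2):
    # L_p distance between two feature vectors (p >= 1); p = np.inf gives L_inf
    if np.isinf(p):
        return np.max(np.abs(x_i - x_j))
    return np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)

x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([4.0, 0.0, 3.0])
print(lp_distance(x_i, x_j, p=1))       # Manhattan distance: 5.0
print(lp_distance(x_i, x_j, p=2))       # Euclidean distance: sqrt(13) ≈ 3.61
print(lp_distance(x_i, x_j, p=np.inf))  # L_inf distance: 3.0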

Choosing the Value of $k$ in KNN

The most important hyperparameter of KNN is $k$, and its choice has a significant effect on accuracy. If $k$ is small, classification relies on fewer samples, which makes overfitting more likely. If $k$ is large, training samples far from the test point also take part in the vote, which increases the probability of a wrong prediction.

In practice, one usually starts from small values of $k$ and selects the best one by cross-validation. In addition, $k$ is usually not chosen to be even, since an even $k$ can lead to ties in the vote.
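A minimal sketch of this selection procedure, here using sklearn's built-in KNeighborsClassifier and 5-fold cross-validation on the iris data purely as an illustration (the actual experiments below use my own KNN class and the splits described in the experimental setup):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
ks = np.arange(1, 30, 2)  # try odd values of k only
scores = [np.mean(cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5))
          for k in ks]
print('best k =', ks[int(np.argmax(scores))], 'accuracy =', max(scores))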

Classification Decision Rule in KNN

KNN uses majority voting as its decision rule: given the $k$ nearest training samples of a test sample, the class that occurs most often among them is taken as the class of the test sample.

Formally, for an $S$-class problem, if the $k$ nearest neighbors of a test sample are the training samples $\boldsymbol{x}_i,\ i = 1, 2, \ldots, k$ with labels $y_i$, then the predicted label $y_p$ of the test sample is

$$y_p = \mathop{\arg\max}\limits_{c_j} \sum_{i=1}^{k} I(y_i = c_j)$$

where $c_j \in \{c_1, c_2, \ldots, c_S\}$ ranges over the class labels and $I(\cdot)$ is the indicator function.
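The whole procedure can be written in a few lines. The following toy function (the name knn_predict_one is mine, not from the experiment code) classifies a single test sample using Euclidean distances and the majority-vote rule above:

import numpy as np
from collections import Counter

def knn_predict_one(x_test, X_train, y_train, k=3):
    # distances from the test sample to every training sample
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # indices of the k nearest training samples
    nearest = np.argsort(dists)[:k]
    # majority vote among their labels (ties broken arbitrarily)
    return Counter(y_train[nearest]).most_common(1)[0][0]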

Datasets

USPS Handwritten Digits Dataset

The USPS dataset is a handwritten-digit dataset with 10 classes representing the digits 0 to 9. Each sample is a $16 \times 16$ grayscale image, so the sample space is 256-dimensional. The dataset has 9298 samples in total, already split into 7291 training samples and 2007 test samples. The figure below shows one sample image.

[Figure: an example image from the USPS dataset]

UCI sonar Dataset

The UCI sonar dataset is used to distinguish rocks from mines based on sonar returns. It has only two classes, "M" (mine) and "R" (rock). The sample space is 60-dimensional, corresponding to 60 sonar readings. The dataset has 207 samples: 111 of class "M" and 96 of class "R".

UCI iris Dataset

The UCI iris dataset is for classifying iris flowers. It has three classes; the sample space is four-dimensional, corresponding to four measured features of the flowers. The dataset has 150 samples in total.

Experimental Setup

For all three datasets, classification accuracy is estimated with cross-validation. Because the two UCI datasets are small, leave-one-out cross-validation is used for them; for the USPS dataset, 5-fold cross-validation is used.

Experiment environment: Intel® Core™ i7-9750H CPU @ 2.60GHz.

Python environment: Python 3.6, numpy 1.19.4, sklearn 0.21.2.

Results and Analysis

For each of the three datasets, all three distance metrics are tested with $k = 1, 3, 5, \ldots, 49$, and the resulting curves below are used to find the best hyperparameters.

USPS Handwritten Digits Dataset

[Figure: classification accuracy vs. $k$ on the USPS dataset for the three distance metrics]
On the USPS dataset, $k=1$ is clearly already optimal, with a peak accuracy of 0.9639. As $k$ grows, accuracy drops markedly. Comparing the three distance metrics, $L_\infty$ is clearly less accurate than the other two, while between $L_1$ and $L_2$ the latter is slightly better. A likely explanation is that the USPS features are simply a large number of pixel values: each individual feature carries little information and the features are strongly correlated, so the $L_\infty$ distance throws away a lot of information.

UCI sonar Dataset

[Figure: classification accuracy vs. $k$ on the UCI sonar dataset for the three distance metrics]
On the UCI sonar dataset, $k=1$ is optimal with a peak accuracy of 0.8550, and there is a small rebound in accuracy at $k=5$. As $k$ increases further, accuracy drops markedly and settles below 0.7 from about $k=15$. Among the three metrics, $L_\infty$ is again worse than the other two, while this time $L_1$ is slightly better than $L_2$. The reason is probably similar to USPS: the sonar features are also raw readings from many sensors, individually uninformative and highly correlated.

UCI iris Dataset

[Figure: classification accuracy vs. $k$ on the UCI iris dataset for the three distance metrics]

On the UCI iris dataset, the best accuracy occurs around $k=20$, reaching 0.98. For other values of $k$ the accuracy fluctuates but mostly stays above 0.93, and the three distance metrics behave similarly. A likely reason is that the iris features are hand-measured attributes of the flowers, all four dimensions have real physical meaning, and the three classes overlap little in feature space, so they are clearly separable.

Overall, classification on the sonar dataset is clearly worse than on the other two datasets, probably because it is the smallest and its two classes are heavily mixed. The iris accuracy is relatively insensitive to the hyperparameter, probably, as argued above, because its data are inherently easy to separate.

Reflections and Improvements

The kd-tree Algorithm

Because every test sample must be compared against all training samples, KNN has a very high computational cost. The kd-tree was proposed to improve this: by building a tree structure similar to a binary search tree, it can answer a nearest-neighbor query with lower time complexity.

Analysis shows that, for a dataset of size $n$ with feature dimension $d$, the kd-tree is clearly faster than brute-force KNN when $n \gg 2^d$. In this experiment the USPS dataset is fairly large, but its dimension is also large ($d=256$), so it is not suited to a kd-tree; the other two datasets are themselves small, so a kd-tree is not worthwhile for them either. The search algorithm can be chosen through the constructor parameters of sklearn.neighbors.KNeighborsClassifier. Running this comparison on the USPS dataset, the kd-tree took 11.20 s while brute-force KNN took only 0.86 s, which confirms the judgment above that the kd-tree is inefficient on USPS.
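For reference, the only change needed to switch between the two search strategies is the algorithm argument of the constructor (the full timing script is given in the appendix):

from sklearn.neighbors import KNeighborsClassifier

knn_brute = KNeighborsClassifier(n_neighbors=5, algorithm='brute')     # exhaustive search
knn_kdtree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')  # kd-tree search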

Speeding Up KNN Itself

Let the training set have $n$ samples and the test set be of the same order $O(n)$, with feature dimension $d$. Computing the matrix of pairwise distances between training and test samples then clearly costs $O(n^2 d)$, and this step is the efficiency bottleneck of KNN.

While running the kd-tree experiment above, I found that my KNN implementation, which loops over the test set and predicts each sample separately, took tens of times longer than the library call to classify the whole USPS test set (2007 samples) once. Reading the library source shows that sklearn optimizes the computation of this distance matrix in two ways:

1. In the spirit of block matrix computation, both the training and test samples are split into slices of equal size and processed chunk by chunk.

2. When computing the $L_2$ distance, the square is expanded:

$$\begin{aligned} L_2(\boldsymbol{x}_i, \boldsymbol{x}_j) &= \left(\sum_{l=1}^{d}\left|x_i^{(l)} - x_j^{(l)}\right|^2\right)^{1/2} \\ &= \left(\boldsymbol{x}_i \cdot \boldsymbol{x}_i - 2\,\boldsymbol{x}_i \cdot \boldsymbol{x}_j + \boldsymbol{x}_j \cdot \boldsymbol{x}_j\right)^{1/2} \end{aligned}$$

where $(\,\cdot\,)$ denotes the vector dot product.

The key part of the library source is shown below. The function lives in sklearn.metrics.pairwise and computes the distance matrix.

# Pairwise distances
def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
                        X_norm_squared=None):
    """
    Considering the rows of X (and Y=X) as vectors, compute the
    distance matrix between each pair of vectors.

    For efficiency reasons, the euclidean distance between a pair of row
    vector x and y is computed as::

        dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))

    This formulation has two advantages over other ways of computing distances.
    First, it is computationally efficient when dealing with sparse data.
    Second, if one argument varies but the other remains unchanged, then
    `dot(x, x)` and/or `dot(y, y)` can be pre-computed.

    However, this is not the most precise way of doing this computation, and
    the distance matrix returned by this function may not be exactly
    symmetric as required by, e.g., ``scipy.spatial.distance`` functions.

    Read more in the :ref:`User Guide <metrics>`.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape (n_samples_1, n_features)

    Y : {array-like, sparse matrix}, shape (n_samples_2, n_features)

    Y_norm_squared : array-like, shape (n_samples_2, ), optional
        Pre-computed dot-products of vectors in Y (e.g.,
        ``(Y**2).sum(axis=1)``)
        May be ignored in some cases, see the note below.

    squared : boolean, optional
        Return squared Euclidean distances.

    X_norm_squared : array-like, shape = [n_samples_1], optional
        Pre-computed dot-products of vectors in X (e.g.,
        ``(X**2).sum(axis=1)``)
        May be ignored in some cases, see the note below.

    Notes
    -----
    To achieve better accuracy, `X_norm_squared` and `Y_norm_squared` may be
    unused if they are passed as ``float32``.

    Returns
    -------
    distances : array, shape (n_samples_1, n_samples_2)

    Examples
    --------
    >>> from sklearn.metrics.pairwise import euclidean_distances
    >>> X = [[0, 1], [1, 1]]
    >>> # distance between rows of X
    >>> euclidean_distances(X, X)
    array([[0., 1.],
           [1., 0.]])
    >>> # get distance to origin
    >>> euclidean_distances(X, [[0, 0]])
    array([[1.        ],
           [1.41421356]])

    See also
    --------
    paired_distances : distances betweens pairs of elements of X and Y.
    """
    X, Y = check_pairwise_arrays(X, Y)

    # If norms are passed as float32, they are unused. If arrays are passed as
    # float32, norms needs to be recomputed on upcast chunks.
    # TODO: use a float64 accumulator in row_norms to avoid the latter.
    if X_norm_squared is not None:
        XX = check_array(X_norm_squared)
        if XX.shape == (1, X.shape[0]):
            XX = XX.T
        elif XX.shape != (X.shape[0], 1):
            raise ValueError(
                "Incompatible dimensions for X and X_norm_squared")
        if XX.dtype == np.float32:
            XX = None
    elif X.dtype == np.float32:
        XX = None
    else:
        XX = row_norms(X, squared=True)[:, np.newaxis]

    if X is Y and XX is not None:
        # shortcut in the common case euclidean_distances(X, X)
        YY = XX.T
    elif Y_norm_squared is not None:
        YY = np.atleast_2d(Y_norm_squared)

        if YY.shape != (1, Y.shape[0]):
            raise ValueError(
                "Incompatible dimensions for Y and Y_norm_squared")
        if YY.dtype == np.float32:
            YY = None
    elif Y.dtype == np.float32:
        YY = None
    else:
        YY = row_norms(Y, squared=True)[np.newaxis, :]

    if X.dtype == np.float32:
        # To minimize precision issues with float32, we compute the distance
        # matrix on chunks of X and Y upcast to float64
        distances = _euclidean_distances_upcast(X, XX, Y, YY)
    else:
        # if dtype is already float64, no need to chunk and upcast
        distances = - 2 * safe_sparse_dot(X, Y.T, dense_output=True)
        distances += XX
        distances += YY
    np.maximum(distances, 0, out=distances)

    # Ensure that distances between vectors and themselves are set to 0.0.
    # This may not be the case due to floating point rounding errors.
    if X is Y:
        np.fill_diagonal(distances, 0)

    return distances if squared else np.sqrt(distances, out=distances)

The function above computes the pairwise Euclidean distance matrix, and its docstring describes exactly this optimization and its advantages. After a series of checks, it calls _euclidean_distances_upcast (for float32 inputs) to carry out the actual chunked distance computation:

def _euclidean_distances_upcast(X, XX=None, Y=None, YY=None, batch_size=None):
    """Euclidean distances between X and Y

    Assumes X and Y have float32 dtype.
    Assumes XX and YY have float64 dtype or are None.

    X and Y are upcast to float64 by chunks, which size is chosen to limit
    memory increase by approximately 10% (at least 10MiB).
    """
    n_samples_X = X.shape[0]
    n_samples_Y = Y.shape[0]
    n_features = X.shape[1]

    distances = np.empty((n_samples_X, n_samples_Y), dtype=np.float32)

    if batch_size is None:
        x_density = X.nnz / np.prod(X.shape) if issparse(X) else 1
        y_density = Y.nnz / np.prod(Y.shape) if issparse(Y) else 1

        # Allow 10% more memory than X, Y and the distance matrix take (at
        # least 10MiB)
        maxmem = max(
            ((x_density * n_samples_X + y_density * n_samples_Y) * n_features
             + (x_density * n_samples_X * y_density * n_samples_Y)) / 10,
            10 * 2 ** 17)

        # The increase amount of memory in 8-byte blocks is:
        # - x_density * batch_size * n_features (copy of chunk of X)
        # - y_density * batch_size * n_features (copy of chunk of Y)
        # - batch_size * batch_size (chunk of distance matrix)
        # Hence x² + (xd+yd)kx = M, where x=batch_size, k=n_features, M=maxmem
        #                                 xd=x_density and yd=y_density
        tmp = (x_density + y_density) * n_features
        batch_size = (-tmp + np.sqrt(tmp ** 2 + 4 * maxmem)) / 2
        batch_size = max(int(batch_size), 1)

    x_batches = gen_batches(n_samples_X, batch_size)

    for i, x_slice in enumerate(x_batches):
        X_chunk = X[x_slice].astype(np.float64)
        if XX is None:
            XX_chunk = row_norms(X_chunk, squared=True)[:, np.newaxis]
        else:
            XX_chunk = XX[x_slice]

        y_batches = gen_batches(n_samples_Y, batch_size)

        for j, y_slice in enumerate(y_batches):
            if X is Y and j < i:
                # when X is Y the distance matrix is symmetric so we only need
                # to compute half of it.
                d = distances[y_slice, x_slice].T

            else:
                Y_chunk = Y[y_slice].astype(np.float64)
                if YY is None:
                    YY_chunk = row_norms(Y_chunk, squared=True)[np.newaxis, :]
                else:
                    YY_chunk = YY[:, y_slice]

                d = -2 * safe_sparse_dot(X_chunk, Y_chunk.T, dense_output=True)
                d += XX_chunk
                d += YY_chunk

            distances[x_slice, y_slice] = d.astype(np.float32, copy=False)

    return distances

The row_norms function computes the $\boldsymbol{x}_i \cdot \boldsymbol{x}_i$ terms, relying mainly on np.einsum.

def row_norms(X, squared=False):
    """Row-wise (squared) Euclidean norm of X.

    Equivalent to np.sqrt((X * X).sum(axis=1)), but also supports sparse
    matrices and does not create an X.shape-sized temporary.

    Performs no input validation.

    Parameters
    ----------
    X : array_like
        The input array
    squared : bool, optional (default = False)
        If True, return squared norms.

    Returns
    -------
    array_like
        The row-wise (squared) Euclidean norm of X.
    """
    if sparse.issparse(X):
        if not isinstance(X, sparse.csr_matrix):
            X = sparse.csr_matrix(X)
        norms = csr_row_norms(X)
    else:
        norms = np.einsum('ij,ij->i', X, X)

    if not squared:
        np.sqrt(norms, norms)
    return norms

The safe_sparse_dot function computes $\boldsymbol{x}_i \cdot \boldsymbol{x}_j$; in the dense case it is simply a matrix product via np.dot.

def safe_sparse_dot(a, b, dense_output=False):
    """Dot product that handle the sparse matrix case correctly

    Uses BLAS GEMM as replacement for numpy.dot where possible
    to avoid unnecessary copies.

    Parameters
    ----------
    a : array or sparse matrix
    b : array or sparse matrix
    dense_output : boolean, default False
        When False, either ``a`` or ``b`` being sparse will yield sparse
        output. When True, output will always be an array.

    Returns
    -------
    dot_product : array or sparse matrix
        sparse if ``a`` or ``b`` is sparse and ``dense_output=False``.
    """
    if sparse.issparse(a) or sparse.issparse(b):
        ret = a * b
        if dense_output and hasattr(ret, "toarray"):
            ret = ret.toarray()
        return ret
    else:
        return np.dot(a, b)

It is easy to see that the $\boldsymbol{x}_i \cdot \boldsymbol{x}_i$ terms can be computed in $O(nd)$ time, while the $\boldsymbol{x}_i \cdot \boldsymbol{x}_j$ terms are exactly the product of the training-sample matrix with the transpose of the test-sample matrix. Although this product also costs $O(n^2 d)$, NumPy's matrix multiplication is highly optimized, so this formulation runs tens of times faster than looping over the test set.
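A minimal sketch of this vectorized computation (the same trick is used in the quick_L_2 branch of my KNN class in the appendix; squared_l2_matrix is just an illustrative name):

import numpy as np

def squared_l2_matrix(X_train, X_test):
    # all pairwise squared L2 distances via ||x||^2 - 2 x·y + ||y||^2
    xx = np.einsum('ij,ij->i', X_train, X_train)[:, np.newaxis]  # O(nd)
    yy = np.einsum('ij,ij->i', X_test, X_test)[np.newaxis, :]    # O(nd)
    return xx - 2.0 * np.dot(X_train, X_test.T) + yy             # one BLAS matrix product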

Following the sklearn library, I implemented the same expanded-$L_2$ trick to speed up the $k$-nearest-neighbor computation. Tested on the same USPS dataset, this version predicts the entire test set in only 0.34 s, whereas the version that loops over the 2007 test samples and predicts them one by one takes 13.05 s.

In summary, tree structures such as the kd-tree accelerate the $k$-nearest-neighbor search, but they are not suitable for any of the three datasets in this experiment; speeding up KNN itself requires expanding the $L_2$ distance and pushing the work into NumPy's matrix multiplication, which greatly improves efficiency. Both ideas accelerate an exact $k$-nearest-neighbor search, so they do not change the predictions and have no effect whatsoever on accuracy.

Conclusion

KNN is a supervised classification method based on distances between samples; in essence it is still pattern matching. In this experiment, KNN achieved high accuracy on the USPS and iris datasets but performed relatively poorly on the sonar dataset. KNN is also computationally inefficient, but its efficiency can be improved by introducing tree structures or by changing how the distances are computed. In either case, whether introducing a tree structure or trying to improve accuracy, the key is whether the extracted features are informative and easy to separate.

This experiment also tried common classifiers such as SVM and random forests, but for lack of time they are not presented here.

Appendix

The appendix consists of the code: first my KNN class, then the experiments on the three datasets, and finally the comparison between the kd-tree and brute-force KNN.

The KNN class

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class KNN(BaseEstimator, ClassifierMixin):
    def __init__(self, k, method=2, quick_L_2=True):
        # method=2 => use L2 distance
        self.k = k
        self.method = method
        self.quick_L_2 = quick_L_2

    def fit(self, x, y):
        self.x = x
        self.y = y
        self.labels = np.unique(y)
        # self.y = np.array([self.labels[self.labels == i][0] for i in self.y])

    def predict(self, a):
        y_label = []
        if self.method == 2 and self.quick_L_2:
            # Vectorized squared-L2 distance matrix of shape (n_train, n_test),
            # using the expansion ||x - a||^2 = ||x||^2 - 2 x.a + ||a||^2.
            dis = -2 * np.dot(self.x, a.T)
            dis += np.einsum('ij,ij->i', self.x, self.x)[:, np.newaxis]
            dis += np.einsum('ij,ij->i', a, a)[np.newaxis, :]
            # Row indices of the k nearest training samples for each test column.
            idx = np.argpartition(dis, kth=self.k, axis=0)[0:self.k, :]
            for i in range(a.shape[0]):
                vote = dict(zip(self.labels, np.zeros_like(self.labels)))
                for j in range(self.k):
                    vote[self.y[idx[j, i]]] += 1
                y_label.append(max(vote, key=vote.get))
            return y_label
        for i in range(a.shape[0]):
            if self.method == 0:
                idx = np.argsort(np.max(np.abs(self.x - a[i, :]), axis=1))
            elif self.method == 1:
                idx = np.argsort(np.sum(np.abs(self.x - a[i, :]), axis=1))
            else:
                idx = np.argsort(np.sum((self.x - a[i, :]) ** 2, axis=1))
            vote = dict(zip(self.labels, np.zeros_like(self.labels)))
            for j in range(self.k):
                vote[self.y[idx[j]]] += 1
            y_label.append(max(vote, key=vote.get))
        return y_label

USPS experiment

import h5py
import numpy as np
import cv2
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC
from KNN import KNN
import time

# load data
# data from https://www.kaggle.com/bistaumanga/usps-dataset?select=usps.h5
# another data source https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#usps
path = 'usps.h5'
with h5py.File(path, 'r') as hf:
    train = hf.get('train')
    x_tr = train.get('data')[:]
    y_tr = train.get('target')[:]
    test = hf.get('test')
    x_te = test.get('data')[:]
    y_te = test.get('target')[:]


def check_pic(data, label, idx):
    pic = data[idx, :]
    pic = pic.reshape((16, 16))
    cv2.imshow(str(label[idx]), pic)
    cv2.waitKey(0)


# check picture
# check_pic(x_tr, y_tr, 10)

# RandomForest
def random_forest():
    randomForest = RandomForestClassifier()
    print(np.mean(cross_val_score(randomForest, np.concatenate((x_tr, x_te)),
                                  np.concatenate((y_tr, y_te)), cv=KFold(n_splits=5), n_jobs=8)))


def svm():
    SVM = SVC(gamma='scale', C=1.0, decision_function_shape='ovr', kernel='rbf')
    print(np.mean(cross_val_score(SVM, np.concatenate((x_tr, x_te)),
                                  np.concatenate((y_tr, y_te)), cv=KFold(n_splits=5), n_jobs=8)))


# KNN
def knn():
    ks = np.arange(1, 50, 2)
    acc = np.zeros((3, ks.shape[0]))
    for method in [2, 1, 0]:
        for (i, k) in enumerate(ks):
            acc[method, i] = (np.mean(cross_val_score(KNN(k=k, method=method), np.concatenate((x_tr, x_te)),
                                                      np.concatenate((y_tr, y_te)), cv=KFold(n_splits=5), n_jobs=8)))
            print(k, acc[method, i])
    np.save('acc.npy', acc)
    ks = np.arange(1, 50, 2)
    plt.plot(ks, acc[2, :], label='$L_2$ distance')
    plt.plot(ks, acc[1, :], label='$L_1$ distance')
    plt.plot(ks, acc[0, :], label='$L_\infty$ distance')
    plt.legend()
    plt.xlabel('k')
    plt.ylabel('Accuracy')
    plt.show()
    print(np.max(acc))
    return acc


def compare_knn_and_fast_knn():
    knn1 = KNN(k=1, quick_L_2=True)
    knn1.fit(x_tr, y_tr)
    tim = time.clock()
    knn1.predict(x_te)
    print(time.clock() - tim)
    knn2 = KNN(k=1, quick_L_2=False)
    knn2.fit(x_tr, y_tr)
    tim = time.clock()
    knn2.predict(x_te)
    print(time.clock() - tim)


if __name__ == '__main__':
    random_forest()
    svm()
    knn()

sonar experiment

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, LeaveOneOut
import matplotlib.pyplot as plt
from KNN import KNN

# load data
# data from http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/
path = 'sonar.all-data'
data = pd.read_csv(path).values
labels = data[:, -1]
data = data[:, :-1]

ks = np.arange(1, 50, 2)
acc = np.zeros((3, ks.shape[0]))
for method in range(3):
    for (i, k) in enumerate(ks):
        acc[method, i] = np.mean(cross_val_score(KNN(k=k, method=method), data, labels, cv=LeaveOneOut()))
plt.plot(ks, acc[2, :], label='$L_2$ distance')
plt.plot(ks, acc[1, :], label='$L_1$ distance')
plt.plot(ks, acc[0, :], label='$L_\infty$ distance')
plt.legend()
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.show()
print(np.max(acc))

iris experiment

from sklearn import datasets
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.base import BaseEstimator, ClassifierMixin
import matplotlib.pyplot as plt
from KNN import KNN

iris = datasets.load_iris()
data = iris['data']
labels = iris['target']

ks = np.arange(1, 50, 2)
acc = np.zeros((3, ks.shape[0]))
for method in range(3):
    for (i, k) in enumerate(ks):
        acc[method, i] = np.mean(cross_val_score(KNN(k=k, method=method), data, labels, cv=LeaveOneOut()))
plt.plot(ks, acc[2, :], label='$L_2$ distance')
plt.plot(ks, acc[1, :], label='$L_1$ distance')
plt.plot(ks, acc[0, :], label='$L_\infty$ distance')
plt.legend()
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.show()
print(np.max(acc))

Comparison of KNN and the kd-tree

import h5py
import numpy as np
import cv2
from sklearn.neighbors import KNeighborsClassifier
import time

# load data
# data from https://www.kaggle.com/bistaumanga/usps-dataset?select=usps.h5
# another data source https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#usps
path = 'usps.h5'
with h5py.File(path, 'r') as hf:
    train = hf.get('train')
    x_tr = train.get('data')[:]
    y_tr = train.get('target')[:]
    test = hf.get('test')
    x_te = test.get('data')[:]
    y_te = test.get('target')[:]

# add random noise to avoid computation on a sparse matrix
x_tr += np.random.rand(*x_tr.shape) * 0.001
x_te += np.random.rand(*x_te.shape) * 0.001


def check_pic(data, label, idx):
    pic = data[idx, :]
    pic = pic.reshape((16, 16))
    cv2.imshow(str(label[idx]), pic)
    cv2.waitKey(0)


# check picture
# check_pic(x_tr, y_tr, 10)


# KNN
def knn():
    knn = KNeighborsClassifier(n_neighbors=5, algorithm='brute', n_jobs=1)
    tic = time.clock()
    knn.fit(x_tr, y_tr)
    res = knn.predict(x_te)
    print(time.clock() - tic)
    return res


def kd():
    kdt = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', n_jobs=1)
    tic = time.clock()
    kdt.fit(x_tr, y_tr)
    res = kdt.predict(x_te)
    print(time.clock() - tic)
    return res


if __name__ == '__main__':
    assert np.all(kd() == knn())
