Machine Learning: A Python Implementation of the KNN Algorithm

码生

Category: Machine Learning. Tags: machine learning, python, knn, python implementation

Article link: https://blog.csdn.net/iloveyousunna/article/details/77852653

Contents
- I. Theoretical Background
- II. Python Implementation
  - 1. Code
  - 2. Results
  - 3. Data

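I. Theoretical Background

1. Distance metric

The distance between two instance points in the feature space reflects how similar they are. Euclidean distance is the usual choice, but other metrics such as cosine distance or Manhattan distance can also be used.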
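
As a small illustration of the three metrics named above, here is a minimal NumPy sketch. The function names are chosen for this example and are not part of knn.py.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(manhattan_distance([0, 0], [3, 4]))  # 7.0
print(cosine_distance([1, 0], [0, 1]))     # 1.0 (orthogonal vectors)
```

Note that cosine distance ignores vector length, so two points that are far apart in the Euclidean sense can still have a cosine distance near zero.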

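2. Choosing k

A larger k means a simpler model: the approximation error grows while the estimation error shrinks, and the model tends to underfit.
A smaller k means a more complex model: the approximation error shrinks while the estimation error grows, the model tends to overfit, and predictions become sensitive to the few nearest instance points.

The best k is usually chosen by cross-validation.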
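
Choosing k by cross-validation could look roughly like the sketch below. It is not taken from knn.py: the helper names `knn_predict` and `cross_validate_k`, the synthetic two-blob data, and the candidate values of k are all assumptions made for illustration, and labels are assumed to be non-negative integers.

```python
import numpy as np

def knn_predict(x_train, y_train, x_test, k):
    """Brute-force KNN with majority vote; labels must be non-negative integers."""
    # Squared Euclidean distances, shape (m_test, m_train).
    diffs = x_test[:, None, :] - x_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k nearest training samples for every test sample.
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]  # shape (m_test, k)
    return np.array([np.bincount(row).argmax() for row in votes])

def cross_validate_k(x, y, candidate_ks, n_folds=5, seed=0):
    """Return {k: mean validation accuracy over n_folds folds}."""
    order = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(order, n_folds)
    scores = {}
    for k in candidate_ks:
        accuracies = []
        for i in range(n_folds):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            pred = knn_predict(x[train_idx], y[train_idx], x[val_idx], k)
            accuracies.append(np.mean(pred == y[val_idx]))
        scores[k] = float(np.mean(accuracies))
    return scores

# Synthetic data for illustration only: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

scores = cross_validate_k(x, y, candidate_ks=[1, 3, 5, 7])
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data the accuracy curve over k is usually less flat than on this toy example, which is what makes the validation step worthwhile.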

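3. Classification decision rule

Majority voting: the class of an input instance is the class held by the majority of its k nearest neighbors.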
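
The voting rule itself fits in a few lines. The sketch below uses `collections.Counter` from the standard library; the function name `majority_vote` and the tie-breaking choice are assumptions for this example, since the text above does not say how ties should be resolved.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the k nearest neighbors.

    Ties go to the label that appears first in the list (Counter keeps
    first-seen order for equal counts on Python 3.7+), which is the label
    of the closest neighbor when the list is ordered by distance.
    """
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_vote(["cat", "dog", "dog"]))  # dog
print(majority_vote(["cat", "dog"]))         # cat (tie, first label wins)
```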

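4. kd-tree

A kd-tree is a data structure for efficient k-nearest-neighbor search. The search resembles binary search, extended to a multi-dimensional space.
kd-trees work best when the number of training instances is much larger than the number of dimensions. As the number of dimensions approaches the number of training instances, search efficiency drops quickly, to nearly that of a linear scan.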
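
To make the "multi-dimensional binary search" idea concrete, here is a compact pure-Python sketch of building a kd-tree and finding the single nearest neighbor. It is independent of knn.py and simplifies one thing: the split axis cycles through the dimensions in round-robin order, whereas the code in section II picks the dimension with the largest variance. The names `Node`, `build`, and `nearest` are chosen for this example.

```python
import math

class Node:
    """One kd-tree node: a stored point, its split axis, and two subtrees."""
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points, depth=0):
    """Build a kd-tree by splitting at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Return (distance, point) for the stored point closest to target."""
    if node is None:
        return best
    d = math.dist(node.point, target)
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best so far.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (9, 2))[1])  # (8, 1)
```

The pruning test on `abs(diff)` is what saves work: whole subtrees are skipped when the splitting plane is already farther away than the best candidate. In high dimensions that test rarely succeeds, which matches the efficiency drop described above.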

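II. Python Implementation

The code implements both brute-force KNN search and kd-tree search. The kd-tree search currently finds only the single nearest neighbor (k=1); the case k>1 is not implemented yet. A first idea: search the kd-tree k times, deleting the nearest node found after each search, so that the k searches together yield the top-k neighbors. This would require implementing deletion and insertion on the kd-tree.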
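
For the k>1 case, a commonly used alternative to repeated search with deletion is to keep a bounded max-heap of the k best candidates during a single traversal. The sketch below shows that alternative; it is not the approach proposed above and not part of knn.py. The names `build` and `k_nearest` are chosen for this example, the tree uses round-robin split axes, and k is assumed to be at least 1.

```python
import heapq
import math

def build(points, depth=0):
    """Build a kd-tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def k_nearest(tree, target, k):
    """Return the k nearest points as (distance, point) pairs, closest first."""
    heap = []  # max-heap via negated distances; never holds more than k entries

    def visit(node):
        if node is None:
            return
        point, axis, left, right = node
        d = math.dist(point, target)
        if len(heap) < k:
            heapq.heappush(heap, (-d, point))
        elif d < -heap[0][0]:
            # Closer than the current k-th best: replace the worst candidate.
            heapq.heapreplace(heap, (-d, point))
        diff = target[axis] - point[axis]
        near, far = (left, right) if diff <= 0 else (right, left)
        visit(near)
        # The far subtree matters if the heap is not full yet, or if the
        # splitting plane is closer than the current k-th best distance.
        if len(heap) < k or abs(diff) < -heap[0][0]:
            visit(far)

    visit(tree)
    return sorted((-neg_d, p) for neg_d, p in heap)

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
print([p for _, p in k_nearest(build(points), (9, 2), 2)])  # [(8, 1), (7, 2)]
```

Compared with k separate searches, this visits the tree once and needs no deletion or insertion support, at the cost of a slightly looser pruning bound (the k-th best distance instead of the best).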

1. Code

knn.py

#encoding=utf-8

'''
implement the knn algorithm
'''

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from scipy.stats import mode
import matplotlib.pyplot as plt

class KNN:

    def __init__(self):
        pass

    def predict(self, x_train, y_train, x_test, k=3):
        self.k = k
        m_train = x_train.shape[0]
        m_test = x_test.shape[0]
        x_train = np.mat(x_train)
        y_train = np.mat(y_train)
        x_test = np.mat(x_test)

        #1. compute the distance between every test sample and every train sample;
        #the distance matrix has shape (m_test, m_train).
        dists = self.__distance__(x_train, x_test)
        #2. sort each row of distances and keep the sorted indices
        sort_idx = np.argsort(dists, axis=1)
        #3. build the row and column indices of the k nearest train samples
        x_idx = np.tile(np.mat(range(m_test)).T, [1, self.k])
        y_idx = sort_idx[:, 0 : self.k]
        #4. look up the labels of the k nearest samples; the result has shape (m_test, k)
        labels = np.tile(y_train.T, [m_test, 1])
        p_labels = labels[x_idx, y_idx]
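        #5. take the mode of each row, i.e. the most frequent label among the k neighbors;
        #keepdims=True (SciPy >= 1.9) keeps the result as an (m_test, 1) column
        y_predict = np.mat(mode(np.asarray(p_labels), axis=1, keepdims=True)[0])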
        return y_predict

    def __distance__(self, x_train, x_test):
        '''
        brute-force computation of the squared Euclidean distance between each test sample and each train sample;
        the square root is skipped because it does not change the ordering of the neighbors
        '''
        m_train = x_train.shape[0]
        m_test = x_test.shape[0]
        dists = np.zeros((m_test, m_train))
        count = 0
        for test in x_test:
            test =  np.tile(test, [m_train, 1])
            distance = np.sum(np.multiply(x_train - test, x_train - test), axis=1)
            dists[count] = distance.T
            count += 1
        return dists

    def create_kd_tree(self, datalist):
        '''
        create the kd-tree
        Args:
            datalist: training samples as an np.matrix, one sample per row
        Returns:
            the root node of the tree
        '''
        root = KDNode()
        self.build_tree(root, datalist)
        self.kd_tree = root
        return root

    def build_tree(self, parent, datalist):
        '''
        recursively build the kd-tree
        Args:
            parent: the node to fill in for this subtree
            datalist: the samples that belong to this subtree
        '''
        m = datalist.shape[0]
        #if the length of data is equal to 1, the node is a leaf node
        if m == 1:
            parent.data = datalist
            return

        #choose the split dimension as the one with the largest variance
        demension = np.argmax(np.var(datalist, axis=0))
        #sort the data by the chosen dimension
        sorted_index = np.argsort(datalist[:, demension], axis=0)
        #get the index of the median sample; floor division keeps it an integer on Python 3
        middle = m // 2
        #get the left data
        l_data = datalist[np.squeeze(sorted_index[0 : middle].getA()), :]
        #get the right data
        r_data = datalist[np.squeeze(sorted_index[middle + 1 : ].getA()), :]

        #assign the property of the parent node
        parent.data = datalist[np.squeeze(sorted_index[middle, :].getA())]
        parent.demension = demension
        parent.split_value = datalist[np.squeeze(sorted_index[middle, :].getA()), demension]

        #recursive build the child node if the length of rest data is not equal to zero
        if len(l_data) != 0:
            l_node = KDNode()
            parent.left = l_node
            self.build_tree(l_node, l_data)

        if len(r_data) != 0:
            r_node = KDNode()
            parent.right = r_node
            self.build_tree(r_node, r_data)

    def __distance_by_kd_tree__(self, x_test):
        '''
        get nearest neighbors matrix by kd_tree search
        '''
        m = x_test.shape[0]
        dists = np.zeros((m, 1))
        count = 0
        for x in x_test:
            dists[count] = self.__find_neighbor__(x, self.kd_tree)
            count += 1
        return np.mat(dists)


    def __find_neighbor__(self, x, node):
        '''
        recursively find the nearest neighbor of x in the kd-tree
        Args:
            x: the query sample, a row vector
            node: the root node of the current subtree

        steps:
            1. if the current node is a leaf, return its data as the nearest neighbor
            2. if the value of x on the split dimension is not greater than the split value,
               take the nearest neighbor found in the left subtree, then check whether the
               right subtree may hold a nearer one;
               otherwise do the same with the left and right subtrees swapped
            3. check whether the current node itself is nearer to x
        '''

        if node.demension is None:
            return node.data

        if (x[0, node.demension] <= node.split_value) and node.left:
            neighbor = self.__find_neighbor__(x, node.left)
            if node.right \
                and (np.abs(x[0, node.demension] - node.split_value) < self.__euclidean_distance__(x, neighbor)):
                    #search the other subtree only once and keep its result only if it is nearer
                    candidate = self.__find_neighbor__(x, node.right)
                    if self.__euclidean_distance__(candidate, x) < self.__euclidean_distance__(x, neighbor):
                        neighbor = candidate
        elif (x[0, node.demension] > node.split_value) and node.right:
            neighbor = self.__find_neighbor__(x, node.right)
            if node.left \
                and (np.abs(x[0, node.demension] - node.split_value)
