Machine Learning: A Python Implementation of the KNN Algorithm

I. Theory

1. Distance metric

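The distance between two instance points in feature space reflects how similar they are. Euclidean distance is the usual choice, but other metrics such as cosine distance or Manhattan distance can also be used.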
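
As a minimal sketch of the three metrics named above (the function names are illustrative and are not part of knn.py), each can be written in a few lines of NumPy:

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between the vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(manhattan_distance([0.0, 0.0], [3.0, 4.0]))  # 7.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))     # 1.0
```

Note that cosine distance is undefined for a zero vector, so this sketch assumes both inputs are non-zero.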

2. Choosing k


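  • A larger k means a simpler model: the approximation error grows while the estimation error shrinks, and the model tends to underfit;
  • A smaller k means a more complex model: the approximation error shrinks while the estimation error grows, and the model tends to overfit and becomes sensitive to the instance points closest to the query.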

The optimal k is usually chosen by cross-validation.
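
Choosing k by cross-validation can be sketched as follows. This is an illustration under stated assumptions, not code from knn.py: the helper names `knn_predict` and `cross_val_accuracy`, the synthetic two-blob data, the five folds, and the candidate list are all made up for the example, and labels are assumed to be non-negative integers.

```python
import numpy as np

def knn_predict(x_train, y_train, x_test, k):
    """Brute-force k-NN with majority vote (labels must be non-negative ints)."""
    # squared Euclidean distances, shape (m_test, m_train)
    d = ((x_test[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

def cross_val_accuracy(x, y, k, n_folds=5, seed=0):
    """Mean validation accuracy of k-NN over n_folds folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = knn_predict(x[train_idx], y[train_idx], x[val_idx], k)
        scores.append(np.mean(pred == y[val_idx]))
    return float(np.mean(scores))

# toy data: two Gaussian blobs (synthetic, for illustration only)
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

candidates = [1, 3, 5, 7, 9]
scores = {k: cross_val_accuracy(x, y, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Every candidate k is scored on the same folds (fixed seed), so the scores are directly comparable.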

3. Classification decision rule

Majority voting: the class of an input instance is the majority class among its k nearest neighbors.
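
A minimal sketch of the voting rule, assuming the labels of the k nearest neighbors have already been collected in order of increasing distance (`majority_vote` is an illustrative name, not a function from knn.py):

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the k nearest neighbors.

    Labels with equal counts are ranked in the order they were first seen,
    so a tie goes to the label of the closest neighbor among the tied ones.
    """
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_vote(["cat", "dog", "cat"]))  # cat
```

The tie-breaking behavior is a design choice of this sketch; knn.py uses `scipy.stats.mode` instead, which resolves ties differently.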

4. kd tree

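A kd tree makes k-nearest-neighbor search efficient. It works like binary search, except that it bisects a multi-dimensional space.
A kd tree is best suited to k-nearest-neighbor search when the number of training instances is much larger than the number of dimensions. As the dimensionality approaches the number of training instances, its efficiency drops quickly, almost to that of a linear scan.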

II. Python implementation

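The code implements brute-force KNN search as well as kd-tree search. The kd-tree search can only find the single nearest neighbor (k = 1); k > 1 is not implemented yet. A first idea: search the kd tree k times, deleting the nearest-neighbor node after each search before searching again, which yields the top-k nearest neighbors. This would require implementing deletion and insertion for the kd tree.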
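
The search-and-delete idea is not implemented here. As an alternative (a different technique swapped in for illustration, not the author's method), a single traversal can keep the k best candidates in a bounded max-heap, so no deletion or insertion is needed. The sketch below is self-contained and unrelated to the `KDNode` class used in knn.py: `Node`, `build`, and `knn_search` are hypothetical names, and the split axis simply cycles with depth instead of being chosen by largest variance.

```python
import heapq
import numpy as np

class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points, depth=0):
    """Build a kd-tree by splitting on the median, cycling through the axes."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[np.argsort(points[:, axis])]
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def knn_search(root, query, k):
    """Return the k points nearest to query, closest first."""
    heap = []    # max-heap via negated squared distance: (-dist2, tie_breaker, point)
    counter = 0  # unique tie-breaker so that arrays are never compared

    def visit(node):
        nonlocal counter
        if node is None:
            return
        dist2 = float(np.sum((node.point - query) ** 2))
        if len(heap) < k:
            heapq.heappush(heap, (-dist2, counter, node.point))
        elif dist2 < -heap[0][0]:
            heapq.heapreplace(heap, (-dist2, counter, node.point))
        counter += 1
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
        visit(near)
        # the far side can only help if the splitting plane is closer than the
        # current k-th best distance, or if fewer than k points have been found
        if len(heap) < k or diff ** 2 < -heap[0][0]:
            visit(far)

    visit(root)
    return [p for _, _, p in sorted(heap, key=lambda t: -t[0])]

points = np.array([[2.0, 3.0], [5.0, 4.0], [9.0, 6.0],
                   [4.0, 7.0], [8.0, 1.0], [7.0, 2.0]])
tree = build(points)
nearest = knn_search(tree, np.array([9.0, 5.0]), k=2)
print(nearest)
```

With k = 1 this reduces to the usual nearest-neighbor search; the heap only changes how many candidates are remembered and which distance is used for pruning.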

1. Code

knn.py

#encoding=utf-8

'''
implement the knn algorithm
'''

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from scipy.stats import mode
import matplotlib.pyplot as plt

class KNN:

    def __init__(self):
        pass

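    def predict(self, x_train, y_train, x_test, k=3):
        '''
        predict the labels of x_test by brute-force search
        Args:
            x_train: train samples, shape (m_train, n)
            y_train: train labels, one per train sample
            x_test: test samples, shape (m_test, n)
            k: number of nearest neighbors that vote
        '''
        self.k = k
        #np.asmatrix is used instead of the np.mat alias, which NumPy 2.0 removed
        x_train = np.asmatrix(x_train)
        #store the labels as a column so that both 1-d and (m_train, 1) inputs work
        y_train = np.asmatrix(y_train).reshape(-1, 1)
        x_test = np.asmatrix(x_test)
        m_test = x_test.shape[0]

        #1. get the distances between each train sample and each test sample,
        #the distance matrix's shape is (m_test, m_train).
        dists = self.__distance__(x_train, x_test)
        #2. sort the distances by row, and get the sort index
        sort_idx = np.argsort(dists, axis=1)
        #3. get the row and column indices of the k smallest distances for each test sample
        x_idx = np.tile(np.asmatrix(np.arange(m_test)).T, [1, self.k])
        y_idx = sort_idx[:, 0 : self.k]
        #4. get the labels of the k nearest neighbors, the matrix's shape is (m_test, k)
        labels = np.tile(y_train.T, [m_test, 1])
        p_labels = labels[x_idx, y_idx]
        #5. get the mode of each row, i.e. the most frequent label among the k neighbors;
        #reshape so that the result is (m_test, 1) regardless of the scipy version
        y_predict = np.asmatrix(mode(np.asarray(p_labels), axis=1)[0]).reshape(-1, 1)
        return y_predict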

    def __distance__(self, x_train, x_test):
        '''
        brute-force computation of the squared Euclidean distance between each
        test sample and each train sample; the square root is skipped because
        it does not change the order of the neighbors
        '''
        m_train = x_train.shape[0]
        m_test = x_test.shape[0]
        dists = np.zeros((m_test, m_train))
        for i, test in enumerate(x_test):
            diff = x_train - np.tile(test, [m_train, 1])
            dists[i] = np.sum(np.multiply(diff, diff), axis=1).T
        return dists

    def create_kd_tree(self, datalist):
        '''
        create the kd-tree and store its root in self.kd_tree
        Args:
            datalist: the samples, a matrix with one sample per row
        '''
        root = KDNode()
        self.build_tree(root, datalist)
        self.kd_tree = root
        return root

    def build_tree(self, parent, datalist):
        '''
        recursively build the kd-tree
        Args:
            parent: parent node
            datalist: the samples to store in the subtree rooted at parent
        '''
        m = datalist.shape[0]
        #if only one sample is left, the node is a leaf node
        if m == 1:
            parent.data = datalist
            return

        #choose the split dimension as the one with the largest variance
        demension = np.argmax(np.var(datalist, axis=0))
        #sort the data by the chosen dimension
        sorted_index = np.argsort(datalist[:, demension], axis=0)
        #get the index of the middle value in the datalist
        #(integer division: in Python 3, m / 2 is a float and cannot be used as an index)
        middle = m // 2
        #get the left data
        l_data = datalist[np.squeeze(sorted_index[0 : middle].getA()), :]
        #get the right data
        r_data = datalist[np.squeeze(sorted_index[middle + 1 : ].getA()), :]

        #assign the property of the parent node
        parent.data = datalist[np.squeeze(sorted_index[middle, :].getA())]
        parent.demension = demension
        parent.split_value = datalist[np.squeeze(sorted_index[middle, :].getA()), demension]

        #recursive build the child node if the length of rest data is not equal to zero
        if len(l_data) != 0:
            l_node = KDNode()
            parent.left = l_node
            self.build_tree(l_node, l_data)

        if len(r_data) != 0:
            r_node = KDNode()
            parent.right = r_node
            self.build_tree(r_node, r_data)

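    def __distance_by_kd_tree__(self, x_test):
        '''
        get the nearest neighbor of each test sample by kd-tree search;
        row i of the returned matrix is the nearest neighbor of x_test[i]
        '''
        #__find_neighbor__ returns a whole sample row, so stack the rows
        #instead of writing them into a single-column array
        neighbors = [self.__find_neighbor__(x, self.kd_tree) for x in x_test]
        return np.asmatrix(np.vstack(neighbors))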


    def __find_neighbor__(self, x, node):
        '''
        recursively find the nearest neighbor of x in the kd-tree
        Args:
            x: the query sample, a row matrix of shape (1, n)
            node: the root node of the current subtree

        steps:
            1. if the current node is a leaf, return its data as the nearest neighbor
            2. if the value of x in the split dimension is not greater than the split value,
               take the nearest neighbor found in the left subtree, then check whether
               the right subtree holds a nearer one;
               if the value of x is greater than the split value, do the same with
               left and right swapped;
            3. check whether the current node itself is nearer to x
        '''

        if node.demension is None:
            return node.data

        if (x[0, node.demension] <= node.split_value) and node.left:
            neighbor = self.__find_neighbor__(x, node.left)
            #search the right subtree only if the splitting plane is nearer than the current best
            if node.right \
                and (np.abs(x[0, node.demension] - node.split_value) < self.__euclidean_distance__(x, neighbor)):
                #search once and reuse the result instead of searching the subtree twice
                candidate = self.__find_neighbor__(x, node.right)
                if self.__euclidean_distance__(candidate, x) < self.__euclidean_distance__(x, neighbor):
                    neighbor = candidate
        elif (x[0, node.demension] > node.split_value) and node.right:
            neighbor = self.__find_neighbor__(x, node.right)
            if node.left \
                and (np.abs(x[0, node.demension] - node.split_value)