k-Nearest Neighbor (kNN) Classification Algorithm

A Python implementation of the k-Nearest Neighbor classification algorithm

Environment

Python 3.7.4
numpy==1.18.1
matplotlib==3.1.1

k-Nearest Neighbor learning

The k-Nearest Neighbor algorithm (KNN) is a simple and commonly used supervised learning method. Given a labeled training set and an input sample to be classified, KNN uses some distance metric (Euclidean distance, cosine similarity, etc.) to find the k training samples closest to the input; the predicted class is the one that appears most often among those k neighbors, determined by majority vote, and the input sample is assigned to that class.
In other words, KNN assumes that if most of a sample's k nearest neighbors in feature space belong to a certain class, then the sample also belongs to that class and shares the characteristics of samples of that class.
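
To make the two distance metrics mentioned above concrete, here is a minimal NumPy sketch; the vectors a and b are made up purely for illustration:

import numpy as np

# two hypothetical feature vectors
a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(((a - b) ** 2).sum())                    # 5.0

# cosine similarity: dot product divided by the product of the norms
cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0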

Choosing k

Different values of k can lead to different results. As shown in the figure below, for an input sample to be classified (shown in red), choosing k = 3 predicts the rectangle class, while choosing k = 10 predicts the triangle class.

This also creates a problem: if one class has relatively few samples while the other classes are large and k is chosen a bit too large, an input that actually belongs to the small class may still end up assigned to a larger class. The class sizes in the data set should therefore not differ too drastically, and the choice of k needs to be weighed carefully, as the sketch below illustrates.
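
Here is a small sketch of that effect with a made-up one-dimensional data set: class 'A' has only 3 samples and class 'B' has 12, the input point sits right inside the 'A' cluster, yet enlarging k flips the vote:

import numpy as np
from collections import Counter

# hypothetical 1-D data: 3 samples of class 'A' near x = 0, 12 samples of class 'B' near x = 5
points = np.array([0.0, 0.2, 0.4,
                   4.0, 4.2, 4.4, 4.6, 4.8, 5.0, 5.2, 5.4, 5.6, 5.8, 6.0, 6.2])
labels = ['A'] * 3 + ['B'] * 12

x = 0.3                                    # input sample, clearly inside the 'A' cluster
order = np.abs(points - x).argsort()       # indices sorted by distance to x

for k in (3, 9):
    votes = Counter(labels[i] for i in order[:k])
    print(k, votes.most_common(1)[0][0])   # k = 3 -> 'A', k = 9 -> 'B'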

Example

Given a collection of book samples, we use KNN to classify each book as either a data-structures book or an operating-systems book. The example generates its data randomly: each sample has two attributes, the number of occurrences of data-structure-related content in the book and the number of occurrences of operating-system-related content, and a sample is assumed to belong to the class whose associated value is higher.
As shown in the figure, the points in the upper left belong to the operating-systems class and the points in the lower right belong to the data-structures class.

Randomly generating the data

The data is generated randomly under the assumption that a sample belongs to the class whose associated value is higher; a quick sanity check of the result is sketched after the function.

def randomGenerate():

    # initialize a 100 x 2 integer array; the actual feature values are filled in below
    dataSet = np.zeros((100, 2), dtype=int)
    dataLabels = []

    # set labels
    labels = ['DS', 'OS']

    # assume each sample belongs to the label whose related value is largest:
    # the first half of the data set is labeled 'DS', the second half 'OS'
    for i in range(0, len(dataSet)):
        index = 0
        if i >= len(dataSet) // 2:
            index = 1

        dataLabels.append(labels[index])
        for j in range(0, len(dataSet[i])):
            if j == index:
                dataSet[i][j] = random.randint(10, 100)
            else:
                dataSet[i][j] = random.randint(0, 9)

    # generate 1 * 2 array randomly, values range from 0 to 100
    inputData = np.random.randint(0, 100, 2)
    # choose k randomly (at least 1 so that the vote is meaningful)
    k = random.randint(1, 50)

    return inputData, dataSet, dataLabels, labels, k
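
A quick sanity check of the generated data (a hypothetical usage snippet, not part of the original function) confirms the shape of the array and the label split:

inputData, dataSet, dataLabels, labels, k = randomGenerate()

print(dataSet.shape)                    # (100, 2)
print(dataLabels[0], dataLabels[-1])    # 'DS' for the first half, 'OS' for the second
print(labels, k)                        # ['DS', 'OS'] and the randomly chosen k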

Constructor

The constructor initializes the input sample, the data set, the labels of the samples in the data set, the list of all class labels, and k.

def __init__(self, inputData, dataSet, dataLabels, labels, k):
    self.inputData = inputData
    self.dataSet = dataSet
    self.dataLabels = dataLabels
    self.labels = labels
    self.k = k

Computing the distances

Compute the Euclidean distance between the input sample and every sample in the data set and return the distances as a NumPy array; an equivalent vectorized version is sketched after the code.

def calculateDistances(self):

    distances = []

    for data in self.dataSet:
        # squared differences between the input sample and this sample,
        # summed and square-rooted to give the Euclidean distance
        distance = (self.inputData - data) ** 2
        distance = distance.sum() ** 0.5
        distances.append(distance)

    # transform the list into a one-dimensional numpy array and return it
    return np.array(distances)
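
The same distances can be computed without an explicit Python loop by broadcasting the input sample against the whole data set; the method below is a sketch of an equivalent vectorized version, assuming self.dataSet is the NumPy array produced by randomGenerate:

def calculateDistancesVectorized(self):
    # subtract the input sample from every row of the data set (broadcasting),
    # then reduce along the feature axis to get one Euclidean distance per sample
    diff = self.dataSet - self.inputData
    return np.sqrt((diff ** 2).sum(axis = 1))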

Voting

Count a vote for each label among the k nearest neighbors and return the tallies; an alternative using collections.Counter is sketched after the code.

def vote(self, sortedDistancesIndex):
    frequencies = {}

    # initialize the frequency of every label to 0
    for i in range(0, len(self.labels)):
        frequencies[self.labels[i]] = 0

    # for the k elements nearest to the sample, read their labels and vote for frequencies
    for i in range(0, self.k):
        frequencies[self.dataLabels[sortedDistancesIndex[i]]] += 1

    return frequencies
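
For reference, the standard library's collections.Counter can express the same vote more compactly; the method below is a sketch, not part of the original class, and labels that never occur among the k neighbors simply get no entry instead of an explicit count of 0:

from collections import Counter

def voteWithCounter(self, sortedDistancesIndex):
    # count the labels of the k samples nearest to the input
    return Counter(self.dataLabels[i] for i in sortedDistancesIndex[:self.k])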

Classification

Compute the Euclidean distances between the new input sample and all samples in the data set, sort them to obtain the indices of the samples in ascending order of distance, run the vote over the k nearest neighbors, sort the tallies, and return the label with the highest count as the prediction; a small hand-made usage example follows the code.

def classify(self):

    # get distance between input sample and each sample in the data set
    distances = self.calculateDistances()

    # get indexes of the samples sorted by ascending distance
    sortedDistancesIndex = distances.argsort()

    # vote for the frequencies of appearance of the labels
    frequencies = self.vote(sortedDistancesIndex)

    # sort by value with descending order
    frequencies = sorted(frequencies.items(), key = lambda x : x[1], reverse = True)

    return frequencies[0][0]
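
A tiny hand-made usage example (hypothetical values, independent of the random generator) shows the expected behaviour of classify:

import numpy as np

inputData = np.array([80, 5])                    # clearly a data-structures book
dataSet = np.array([[90, 3], [70, 8], [2, 95], [5, 85]])
dataLabels = ['DS', 'DS', 'OS', 'OS']

knn = KNN(inputData, dataSet, dataLabels, ['DS', 'OS'], k = 3)
print(knn.classify())                            # expected output: DS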

Visualization

The data set is visualized with the matplotlib module: operating-systems points are drawn in black, data-structures points in red, the x axis is the frequency of data-structure-related content in a book, and the y axis is the frequency of operating-system-related content. An alternative based on plt.scatter is sketched after the code.

def visualizeDataSet(self):

    # set title and label
    plt.title('KNN')
    plt.xlabel(self.labels[0])
    plt.ylabel(self.labels[1])

    # draw points
    start = 0
    median = len(self.dataSet) // 2
    end = len(self.dataSet)
    plt.plot(self.dataSet[start : median, 0], self.dataSet[start : median, 1], "or")
    plt.plot(self.dataSet[median : end, 0], self.dataSet[median : end, 1], "ok")

    # draw the dividing line y = x
    divideLineX = np.arange(0, 100)
    divideLineY = divideLineX
    plt.plot(divideLineX, divideLineY, "g")

    # show figure
    plt.show()
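
An alternative, sketched below as a hypothetical extra method, uses plt.scatter with one colour per point derived from the labels instead of slicing the array in half; the visual result is the same:

def visualizeWithScatter(self):
    # map each label to a colour: 'DS' -> red, 'OS' -> black
    colours = ['r' if label == 'DS' else 'k' for label in self.dataLabels]
    plt.scatter(self.dataSet[:, 0], self.dataSet[:, 1], c = colours)
    plt.title('KNN')
    plt.xlabel(self.labels[0])
    plt.ylabel(self.labels[1])
    plt.show()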

Complete code

import numpy as np
import random
from matplotlib import pyplot as plt

def randomGenerate():

    '''
        Description:
            generate random data for the KNN algorithm
        Args:
            None
        Returns:
            inputData: data of the sample to be classified
            dataSet: classified data set
            dataLabels: labels of data set
            labels: all labels of data set
            k: number of nearest neighbors
    '''

    # initialize a 100 x 2 integer array; the actual feature values are filled in below
    dataSet = np.zeros((100, 2), dtype=int)
    dataLabels = []

    # set labels
    labels = ['DS', 'OS']

    # assume each sample belongs to the label whose related value is largest:
    # the first half of the data set is labeled 'DS', the second half 'OS'
    for i in range(0, len(dataSet)):
        index = 0
        if i >= len(dataSet) // 2:
            index = 1

        dataLabels.append(labels[index])
        for j in range(0, len(dataSet[i])):
            if j == index:
                dataSet[i][j] = random.randint(10, 100)
            else:
                dataSet[i][j] = random.randint(0, 9)

    # generate 1 * 2 array randomly, values range from 0 to 100
    inputData = np.random.randint(0, 100, 2)
    # choose k randomly (at least 1 so that the vote is meaningful)
    k = random.randint(1, 50)

    return inputData, dataSet, dataLabels, labels, k


class KNN():

    '''
        Description:
            A simple example of KNN algorithm
        Attributes:
            inputData: data of the sample to be classified
            dataSet: classified data set
            dataLabels: labels of data set
            labels: all labels of data set
            k: number of nearest neighbors
    '''

    def __init__(self, inputData, dataSet, dataLabels, labels, k):
        self.inputData = inputData
        self.dataSet = dataSet
        self.dataLabels = dataLabels
        self.labels = labels
        self.k = k

    def calculateDistances(self):
        '''
            Description:
                calculate the distances between the input sample to be classified and each sample in the classified data set
            Args:
                None
            Returns:
                distances: distances between the input sample and each sample in the classified data set
        '''

        distances = []

        for data in self.dataSet:
            # squared differences between the input sample and this sample,
            # summed and square-rooted to give the Euclidean distance
            distance = (self.inputData - data) ** 2
            distance = distance.sum() ** 0.5
            distances.append(distance)

        # transform the list into a one-dimensional numpy array and return it
        return np.array(distances)

    def vote(self, sortedDistancesIndex):
        '''
            Description:
                vote, calculate frequencies
            Args:
                sortedDistancesIndex: the indexes of sorted distances
            Return:
                frequencies: frequencies of the labels of the k nearest elements
        '''

        frequencies = {}

        # initialize the frequency of every label to 0
        for i in range(0, len(self.labels)):
            frequencies[self.labels[i]] = 0

        # for the k elements nearest to the sample, read their labels and vote for frequencies
        for i in range(0, self.k):
            frequencies[self.dataLabels[sortedDistancesIndex[i]]] += 1

        return frequencies

    def classify(self):
        '''
            Description:
                classify input sample, according to the classified data set
            Args:
                None
            Returns:
                the key with highest frequency
        '''

        # get distance between input sample and each sample in the data set
        distances = self.calculateDistances()

        # get indexes of the samples sorted by ascending distance
        sortedDistancesIndex = distances.argsort()

        # vote for the frequencies of appearance of the labels
        frequencies = self.vote(sortedDistancesIndex)

        # sort by value with descending order
        frequencies = sorted(frequencies.items(), key = lambda x : x[1], reverse = True)

        return frequencies[0][0]

    def visualizeDataSet(self):
        '''
            Description:
                visualize the data set by matplotlib module, display a figure
            Args:
                None
            Returns:
                None
        '''

        # set title and label
        plt.title('KNN')
        plt.xlabel(self.labels[0])
        plt.ylabel(self.labels[1])

        # draw points
        start = 0
        median = len(self.dataSet) // 2
        end = len(self.dataSet)
        plt.plot(self.dataSet[start : median, 0], self.dataSet[start : median, 1], "or")
        plt.plot(self.dataSet[median : end, 0], self.dataSet[median : end, 1], "ok")

        # draw the dividing line y = x
        divideLineX = np.arange(0, 100)
        divideLineY = divideLineX
        plt.plot(divideLineX, divideLineY, "g")

        # show figure
        plt.show()



if __name__ == "__main__":

    # get the data set randomly
    inputData, dataSet, dataLabels, labels, k = randomGenerate()

    knn = KNN(inputData, dataSet, dataLabels, labels, k)

    # output the labels
    print("labels: ", labels)

    # output the random input sample
    print(inputData)

    # output the classified result of the input sample
    print(knn.classify())

    knn.visualizeDataSet()


Test results

The randomly generated input sample to be classified is printed on the second line.

labels:  ['DS', 'OS']
[41 19]
DS

Closing remarks

  • The author's knowledge is limited and omissions are inevitable; readers are welcome to point out any mistakes at any time to avoid unnecessary misunderstandings!