Machine Learning Basics: A Detailed Walkthrough of the KNN Algorithm

Latest recommended article published on 2024-08-16 10:00:40

繁星风语

Tags: machine learning

Article link: https://blog.csdn.net/qq_38231061/article/details/81364807

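Recently, while studying machine learning, I went through a lot of material and found that most of it is scattered and incomplete. In particular, it tends to focus heavily on theory, and its treatment of source code is not very friendly to beginners. So I put together this detailed walkthrough of the KNN algorithm, hoping it helps those who come after me and also serves as a summary of my own learning.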

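1. What is the k-nearest neighbors algorithm?

To quote the classic textbook Machine Learning in Action (《机器学习实战》): the k-nearest neighbors algorithm classifies data by measuring the distances between different feature values. It works like this: there is an existing collection of sample data, also called the training set, and every item in it carries a label, so we know which class each sample belongs to. When new, unlabeled data arrives, each feature of the new data is compared with the corresponding feature of the samples, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general we only look at the k most similar samples, which is where the k in k-nearest neighbors comes from; k is usually an integer no greater than 20. Finally, the class that occurs most often among those k most similar samples is taken as the class of the new data.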

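2. General workflow of the k-nearest neighbors algorithm

Machine Learning in Action gives a set of implementation steps, but they are fairly high level. This post first quotes that description and then works through it later with source code and a concrete example.

For every point in the dataset whose class is unknown, do the following in turn:

(1) Compute the distance between the current point and each point in the dataset of known classes;

(2) Sort by increasing distance;

(3) Take the k points with the smallest distance to the current point;

(4) Determine how often each class occurs among those k points;

(5) Return the class that occurs most often among those k points as the predicted class of the current point.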
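
The five steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the book's implementation: the function name knn_predict is made up for this example, the distance is assumed to be Euclidean (math.dist, available from Python 3.8), and the four sample points are the ones used by the NumPy version later in this post.

```python
import math
from collections import Counter


def knn_predict(query, samples, sample_labels, k):
    """Classify `query` by majority vote among its k nearest samples (hypothetical helper)."""
    # (1) distance from the query point to every labelled sample
    distances = [math.dist(query, s) for s in samples]
    # (2) sample indices ordered by increasing distance
    order = sorted(range(len(samples)), key=lambda i: distances[i])
    # (3) keep the k closest samples
    nearest = order[:k]
    # (4) count how often each class appears among them
    votes = Counter(sample_labels[i] for i in nearest)
    # (5) the most frequent class is the prediction
    return votes.most_common(1)[0][0]


samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
sample_labels = ['A', 'A', 'B', 'B']
print(knn_predict((1.2, 1.0), samples, sample_labels, k=3))  # A
```

With k=3 the query (1.2, 1.0) has two 'A' neighbours and one 'B' neighbour, so the vote goes to 'A'.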

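3. Implementing KNN in Python

Now for the core part. Before writing the program, though, one thing needs to be made clear first: modules and packages in Python.

3.1 Importing modules

Implementing KNN requires creating our own module and importing it, but most tutorials just say "import the module" in passing. Here are the ways I recommend for importing a custom module:

Method 1: importing from the same directory

The big precondition here is that the .py file you run and the module sit in the same directory (the same parent directory), as shown in the figure below:

[Figure: the executable file and the KNN module in the same directory]

Then the statement from KNN import * imports the module's contents. Many beginners, as I did at first, write import KNN instead. That works, but it only binds the module name: the module's functions are not placed in the current file's namespace and have to be called as KNN.xxx. If you just want to use a .py file in the same directory as a helper, from xxx import xxx is more convenient. For beginners, same-directory imports are enough in most cases.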
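
To make the namespace difference concrete, here is a small sketch that uses the standard-library math module as a stand-in for KNN (an assumption made so the snippet runs without any extra files); a module of your own behaves the same way.

```python
import math               # binds only the name `math` in this file
print(math.sqrt(16.0))    # names inside the module must be qualified

from math import sqrt     # binds `sqrt` directly in this file's namespace
print(sqrt(16.0))         # no module prefix needed
```

Both calls print 4.0; the only difference is which names end up in the current file.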

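Method 2: importing via the sys module

If the executable file and the module are not in the same directory, a plain import cannot find the custom module. In that case the import can be done with Python's built-in sys module.

The steps for importing a custom module are therefore:

Import the sys module
Call sys.path.append(path) to add the directory containing the custom module to the search path
Import the custom module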
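
The three steps above might look like the sketch below. To keep it runnable on its own, it first writes a throwaway module into a temporary directory; the module name my_knn_demo and its greet function are invented for this example, and in practice path would simply be the directory that holds your own module.

```python
import importlib
import os
import sys      # step 1: import the sys module
import tempfile

# Hypothetical setup: create a tiny module in a directory that is not on the search path.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "my_knn_demo.py"), "w") as f:
    f.write("def greet():\n    return 'hello from my_knn_demo'\n")

# Step 2: add the module's directory to the module search path.
sys.path.append(module_dir)
importlib.invalidate_caches()  # the file was created at run time, so refresh the finders

# Step 3: import the custom module as usual.
import my_knn_demo

print(my_knn_demo.greet())
```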

3.2 Implementing the KNN algorithm

The implementation steps were given earlier; now we follow them to write the actual code.

from numpy import *
import operator

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


def classify0(intX, dataSet, labels, k):
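    dataSetSize = dataSet.shape[0]  # number of rows in the input matrix, i.e. the number of samples
    # Step 1: compute the distances.
    # Suppose:
    #   intX:    [1, 0, 2]
    #   dataSet: [1, 0, 1]
    #            [2, 1, 3]
    #            [1, 0, 2]
    # The computation is then:
    # 1. Take the difference (intX repeated once per row, minus dataSet)
    #   [1, 0, 2]     [1, 0, 1]     [ 0,  0,  1]
    #   [1, 0, 2]  -  [2, 1, 3]  =  [-1, -1, -1]
    #   [1, 0, 2]     [1, 0, 2]     [ 0,  0,  0]
    # 2. Square the differences
    #   [0, 0, 1]
    #   [1, 1, 1]
    #   [0, 0, 0]
    # 3. Sum the squared differences along each row
    #   [1]
    #   [3]
    #   [0]
    # 4. Take the square root of the previous step to get the distances
    #   [1]
    #   [1.73]
    #   [0]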

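    diffMat = tile(intX, (dataSetSize, 1)) - dataSet  # difference between the query (repeated per row) and every sample
    sqDiffMat = diffMat ** 2  # square the differences
    sqDistances = sqDiffMat.sum(axis=1)  # sum of squares along each row
    distances = sqDistances ** 0.5  # distance from the test point to every sample
    sortedDistIndicies = distances.argsort()  # indices that sort the distances in ascending order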

    classCount = {}  # number of occurrences of each class

    for i in range(k):  # count the classes of the k nearest points
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the training sample at this index
        # Accumulate the count per label as key/value pairs in classCount.
        # dict.get(key, default=None) returns the value for key,
        # or default when the key is not in the dictionary.
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # dict.items() yields the (key, value) pairs of the dictionary, e.g.
    #   d = {'Google': 'www.google.com', 'Runoob': 'www.runoob.com', 'taobao': 'www.taobao.com'}
    #   print(list(d.items()))
    # prints:
    #   [('Google', 'www.google.com'), ('Runoob', 'www.runoob.com'), ('taobao', 'www.taobao.com')]
    # Sort the counted labels in descending order of count; the result is a list of tuples.
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]  # the class that occurs most often

Test program

from KNN import *
# Build the dataset and class labels
dataSet, labels = createDataSet()
# Define a point of unknown class
testX = array([1.2, 1.0])
k = 3
# Call the classifier on the unknown point
outputLabel = classify0(testX, dataSet, labels, k)
print("Your input is:", testX, "and classified to class", outputLabel)
testX = array([0.1, 0.3])
outputLabel = classify0(testX, dataSet, labels, k)
print("Your input is:", testX, "and classified to class", outputLabel)

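Functions used in the program, explained

The shape attribute
shape gives the dimensions of an array: shape[0] is the number of rows and shape[1] the number of columns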
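
As a quick illustration, here is shape applied to the sample array from createDataSet. The import style (import numpy as np) is a choice made for this sketch; the program above uses from numpy import * instead.

```python
import numpy as np

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
print(group.shape)     # (4, 2)
print(group.shape[0])  # 4 -> number of rows (samples)
print(group.shape[1])  # 2 -> number of columns (features)
```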

The tile() function
tile(A, reps): builds an array by repeating A the number of times given by reps
Examples

>>> import numpy as np
>>> a = np.array([0, 1, 2])
>>> np.tile(a, 2)
array([0, 1, 2, 0, 1, 2])
>>> np.tile(a, (2, 2))
array([[0, 1, 2, 0, 1, 2],
       [0, 1, 2, 0, 1, 2]])
>>> np.tile(a, (2, 1, 2))
array([[[0, 1, 2, 0, 1, 2]],
       [[0, 1, 2, 0, 1, 2]]])
>>> b = np.array([[1, 2], [3, 4]])
>>> np.tile(b, 2)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
>>> np.tile(b, (2, 1))
array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])
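
In classify0, tile repeats the query point once per sample so that it can be subtracted from the dataset row by row. Below is a minimal sketch of just that step, reusing the four sample points from createDataSet; the names in_x and data_set are chosen for this example. NumPy broadcasting produces the same differences without tile, which is shown here only as a cross-check and is not the approach this post uses.

```python
import numpy as np

in_x = np.array([1.2, 1.0])
data_set = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])

# Repeat the query point once per row of the dataset, as classify0 does.
tiled = np.tile(in_x, (data_set.shape[0], 1))
print(tiled.shape)  # (4, 2)

diff_tile = tiled - data_set
diff_broadcast = in_x - data_set  # broadcasting stretches in_x across the rows

print(np.array_equal(diff_tile, diff_broadcast))  # True
```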

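axis
With axis=i, NumPy performs the operation along the direction in which the i-th index varies
For a 2-D array: axis=0 operates along the vertical axis (down the rows, one result per column); axis=1 operates along the horizontal axis (across the columns, one result per row)
Example 1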

>>> import numpy as np
>>> data = np.array([
     [1,2,1],
     [0,3,1],
     [2,1,4],
     [1,3,1]])

>>> np.sum(data, axis=1)
array([4, 4, 7, 5])

>>> np.min(data, axis=0)
array([0, 1, 1])

Example 2

>>> data = np.random.randint(0, 5, [4,3,2,3])
>>> data
array([[[[4, 1, 0],
         [4, 3, 0]],
        [[1, 2, 4],
         [2, 2, 3]],
        [[4, 3, 3],
         [4, 2, 3]]],

       [[[4, 0, 1],
         [1, 1, 1]],
        [[0, 1, 0],
         [0, 4, 1]],
        [[1, 3, 0],
         [0, 3, 0]]],

       [[[3, 3, 4],
         [0, 1, 0]],
        [[1, 2, 3],
         [4, 0, 4]],
        [[1, 4, 1],
         [1, 3, 2]]],

       [[[0, 1, 1],
         [2, 4, 3]],
        [[4, 1, 4],
         [1, 4, 1]],
        [[0, 1, 0],
         [2, 4, 3]]]])

With axis=0, NumPy sums along dimension 0: the first element is a0000+a1000+a2000+a3000=11, the second is a0001+a1001+a2001+a3001=5, and so on, which gives the following result

>>> data.sum(axis=0)
array([[[11,  5,  6],
        [ 7,  9,  4]],

       [[ 6,  6, 11],
        [ 7, 10,  9]],

       [[ 6, 11,  4],
        [ 7, 12,  8]]])

With axis=3, NumPy sums along dimension 3: the first element is a0000+a0001+a0002=5, the second is a0010+a0011+a0012=7, and so on, which gives the following result:

>>> data.sum(axis=3)
array([[[ 5,  7],
        [ 7,  7],
        [10,  9]],

       [[ 5,  3],
        [ 1,  5],
        [ 4,  3]],

       [[10,  1],
        [ 6,  8],
        [ 6,  6]],

       [[ 2,  9],
        [ 9,  6],
        [ 1,  9]]])


Functions that take an axis argument
The sort function

>>> data = np.random.randint(0, 5, [3,2,3])
>>> data
array([[[4, 2, 0],
        [0, 0, 4]],

       [[2, 1, 1],
        [1, 0, 2]],

       [[3, 0, 4],
        [0, 1, 3]]])
>>> np.sort(data)  # by default sorts along the last axis, which here is axis=2
array([[[0, 2, 4],
        [0, 0, 4]],

       [[1, 1, 2],
        [0, 1, 2]],

       [[0, 3, 4],
        [0, 1, 3]]])
>>> np.sort(data, axis=0)  # sort along axis 0: the original order a000->a100->a200 becomes a100->a200->a000
array([[[2, 0, 0],
        [0, 0, 2]],

       [[3, 1, 1],
        [0, 0, 3]],

       [[4, 2, 4],
        [1, 1, 4]]])
>>> np.sort(data, axis=1)  # sort along axis 1
array([[[0, 0, 0],
        [4, 2, 4]],

       [[1, 0, 1],
        [2, 1, 2]],

       [[0, 0, 3],
        [3, 1, 4]]])
>>> np.sort(data, axis=2)  # sort along axis 2
array([[[0, 2, 4],
        [0, 0, 4]],

       [[1, 1, 2],
        [0, 1, 2]],

       [[0, 3, 4],
        [0, 1, 3]]])
>>> np.sort(data, axis=None)  # flatten the array and sort all of the data
array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4])

The prod function

>>> np.prod([[1.,2.],[3.,4.]])
24.0

>>> np.prod([[1.,2.],[3.,4.]], axis=1)
array([  2.,  12.])

>>> np.prod([[1.,2.],[3.,4.]], axis=0)
array([ 3.,  8.])

The argsort() function
Returns the indices that would sort the elements in ascending order
Example 1
One-dimensional array:

>>> x = np.array([3, 1, 2])
>>> np.argsort(x)
array([1, 2, 0])

Two-dimensional array:

>>> x = np.array([[0, 3], [2, 2]])
>>> np.argsort(x, axis=0)  # sort down each column
array([[0, 1],
       [1, 0]])
>>> np.argsort(x, axis=1)  # sort along each row
array([[0, 1],
       [0, 1]])

Example 2

>>> x = np.array([3, 1, 2])
>>> np.argsort(x)  # ascending order
array([1, 2, 0])
>>> np.argsort(-x)  # descending order
array([0, 2, 1])

Example 3

>>> x[np.argsort(x)]  # the array sorted via its index array
array([1, 2, 3])
>>> x[np.argsort(-x)]
array([3, 2, 1])

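The get() function
The dictionary method get() returns the value for the given key, or a default value if the key is not in the dictionary
dict.get(key, default=None)
key: the key to look up in the dictionary; default: the value returned when the key is not present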
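
A small sketch of how get() supports the counting pattern used in classify0. The list votes and the dictionary class_count are illustrative names made up for this example.

```python
votes = ['A', 'B', 'A']
class_count = {}
for label in votes:
    # A missing label counts as 0, so the first occurrence becomes 1.
    class_count[label] = class_count.get(label, 0) + 1

print(class_count)              # {'A': 2, 'B': 1}
print(class_count.get('C'))     # None (no default given)
print(class_count.get('C', 0))  # 0
```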

The items() function
The dictionary method items() returns the (key, value) pairs as an iterable view (a list in Python 2)
Example

d = {'Google': 'www.google.com', 'Runoob': 'www.runoob.com', 'taobao': 'www.taobao.com'}
print("dict items: %s" % list(d.items()))

Output: dict items: [('Google', 'www.google.com'), ('Runoob', 'www.runoob.com'), ('taobao', 'www.taobao.com')]

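Using sorted with operator.itemgetter

The operator.itemgetter function
The itemgetter function from the operator module fetches the items of an object at given positions; its arguments are the indices (or keys) of the items to fetch

>>> import operator
>>> a = [1, 2, 3]
>>> b = operator.itemgetter(1)  # define a function b that fetches the item at index 1
>>> b(a)
2
>>> b = operator.itemgetter(1, 0)  # define a function b that fetches the items at index 1 and index 0
>>> b(a)
(2, 1)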

Multi-key sorting is also simple; just remember this: when the function given as key returns a tuple, the items are ordered by the tuple's elements in turn.
Below are the Group A standings from the 2010 World Cup group stage

from operator import itemgetter
teamitems = [{'team': 'France', 'P': 1, 'GD': -3, 'GS': 1, 'GA': 4},
             {'team': 'Uruguay', 'P': 7, 'GD': 4, 'GS': 4, 'GA': 0},
             {'team': 'SouthAfrica', 'P': 4, 'GD': -2, 'GS': 3, 'GA': 5},
             {'team': 'Mexico', 'P': 4, 'GD': 1, 'GS': 3, 'GA': 2}]
print(sorted(teamitems, key=itemgetter('P', 'GD', 'GS', 'GA'), reverse=True))
# print(sorted(teamitems, key=lambda x: (x['P'], x['GD'], x['GS'], x['GA']), reverse=True))

Output (wrapped here for readability)

[{'team': 'Uruguay', 'P': 7, 'GD': 4, 'GS': 4, 'GA': 0},
 {'team': 'Mexico', 'P': 4, 'GD': 1, 'GS': 3, 'GA': 2},
 {'team': 'SouthAfrica', 'P': 4, 'GD': -2, 'GS': 3, 'GA': 5},
 {'team': 'France', 'P': 1, 'GD': -3, 'GS': 1, 'GA': 4}]

The sorted() function
Sorting a dictionary

import operator

mydict = {5: 'D', 7: 'B', 3: 'C', 4: 'E', 8: 'A'}
print(sorted(mydict))  # sort by dictionary key
# out: [3, 4, 5, 7, 8]
print(sorted(mydict.values()))  # sort by dictionary value
# out: ['A', 'B', 'C', 'D', 'E']
# The items can also be sorted as below; if each value were a list, this would allow sorting on several fields
print(sorted(mydict.items(), key=operator.itemgetter(0)))
# out: [(3, 'C'), (4, 'E'), (5, 'D'), (7, 'B'), (8, 'A')]

Sorting a list of tuples

from operator import itemgetter

students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
sorted(students, key=lambda student: student[2])  # sort by age
# out: [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
sorted(students, key=itemgetter(2))  # sort by age
# out: [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
sorted(students, key=itemgetter(1, 2))  # sort by grade, then by age
# out: [('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]

Note on multi-key sorting: when the function given as key returns a tuple, the items are ordered by the tuple's elements in turn

Sorting a list of dictionaries

from operator import itemgetter

info = [{'ID': 11, 'name': 'lili', 'age': 20},
        {'ID': 2, 'name': 'jobs', 'age': 40},
        {'ID': 22, 'name': 'aces', 'age': 30},
        {'ID': 15, 'name': 'bob', 'age': 18}]
print(sorted(info, key=lambda x: x['ID']))  # sort by ID
# out:[{'ID': 2, 'name': 'jobs', 'age': 40}, {'ID': 11, 'name': 'lili', 'age': 20}, {'ID': 15, 'name': 'bob', 'age': 18}, {'ID': 22, 'name': 'aces', 'age': 30}]
print(sorted(info, key=itemgetter('age')))  # sort by age
# out:[{'ID': 15, 'name': 'bob', 'age': 18}, {'ID': 11, 'name': 'lili', 'age': 20}, {'ID': 22, 'name': 'aces', 'age': 30}, {'ID': 2, 'name': 'jobs', 'age': 40}]
# Multi-level sorting
print(sorted(info, key=lambda x: (x['name'], x['age'])))
print(sorted(info, key=itemgetter('name', 'age')))
# out:[{'ID': 22, 'name': 'aces', 'age': 30}, {'ID': 15, 'name': 'bob', 'age': 18},{'ID': 2, 'name': 'jobs', '

This post draws on several reference articles; my thanks to their authors

https://blog.csdn.net/fangjian1204/article/details/53055219

https://blog.csdn.net/handsomekang/article/details/9621823

https://blog.csdn.net/u011475210/article/details/77770751