The KNN Algorithm

孙子杨

Last modified: 2022-08-20 22:06:39

Tags: machine learning, algorithm study

First published: 2022-08-20 22:06:05

Article link: https://blog.csdn.net/qq_35594345/article/details/126445110

1. The KNN Algorithm

These notes follow the book Machine Learning in Action (《机器学习实战》).

1.1 Three ways to import a library or module (using numpy as the example)

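import numpy
- Note: imports the numpy library and binds the single name numpy. The import statement is short, but using the library is more verbose:
- Every function must be qualified with the numpy. prefix, e.g. numpy.tile(); calling tile() on its own raises a NameError.
- All three forms load the entire numpy module. They differ only in which names are bound in the current namespace, so this form costs no more time or memory than the other two.
from numpy import *
- Note: imports every public name from numpy; * is a wildcard. Unlike the first form, no library prefix is needed (this is the form used throughout these notes). The trade-off is that the imported names can shadow existing ones; for example, numpy's sum replaces the built-in sum.
from numpy import tile
- Note: imports only the tile() function from numpy, so only the name tile is bound.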
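As a quick check of the forms above, the following sketch shows that the qualified and unqualified spellings reach the same function. The variable names (same_function, r1, r2) are illustrative only.

```python
import numpy                    # binds only the name "numpy"
from numpy import tile          # binds only the name "tile"

a = [1, 2]

# Both spellings refer to the very same function object.
same_function = numpy.tile is tile

r1 = numpy.tile(a, 2)           # qualified call
r2 = tile(a, 2)                 # unqualified call

print(same_function)            # True
print(r1, r2)                   # [1 2 1 2] [1 2 1 2]
```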

1.2 The tile function in numpy

from numpy import *
a=array([[1,2,3],[4,5,6],[7,8,9]])
b=tile(a,(1,1))   # reps=(1,1): repeat a once along each axis, so b equals a
print("b=",b)

b= [[1 2 3]
 [4 5 6]
 [7 8 9]]
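With reps=(1,1) the output is identical to the input, so the effect of tile is easier to see with other repeat counts. The sketch below is a small illustration with made-up data; it uses import numpy as np rather than the wildcard import.

```python
import numpy as np

a = np.array([1, 2, 3])

rows = np.tile(a, (2, 1))   # repeat 2 times down the rows, 1 time across
cols = np.tile(a, (1, 2))   # repeat 1 time down the rows, 2 times across

print(rows)   # [[1 2 3]
              #  [1 2 3]]
print(cols)   # [[1 2 3 1 2 3]]
```

The first form, tile(x, (n, 1)), is the one the KNN function in section 1.9 relies on: it stacks n copies of the query point so that it can be subtracted from all n training samples at once.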

1.3 The shape attribute in numpy

from numpy import *
a=[[1,2,3],[4,5,6],[7,8,9]]
b=array([[1,2,3],[4,5,6]])
print(a)
print(b)
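# print(a.shape[0])  raises AttributeError: 'list' object has no attribute 'shape'; a is a plain list, and only numpy arrays have shape
print(b.shape[0]) # index 0 is the number of rows
print(b.shape[1]) # index 1 is the number of columns
print(b.shape)    # without an index, prints the (rows, columns) tuple

# c=b.shape
# print(c.shape)  raises AttributeError: 'tuple' object has no attribute 'shape'; shape returns a tuple, not an array, so .shape cannot be applied to it again. See https://www.php.cn/python-tutorials-424316.html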

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
[[1 2 3]
 [4 5 6]]
2
3
(2, 3)
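If the data starts out as a nested list, one way around the missing attribute is to convert it first. A minimal sketch, with illustrative variable names:

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # a plain list has no .shape
arr = np.array(nested)                        # converting it gives an ndarray

n_rows, n_cols = arr.shape                    # shape is a plain tuple, so it unpacks
print(n_rows, n_cols)                         # 3 3
print(np.shape(nested))                       # (3, 3); np.shape also accepts lists
```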

1.4 The sum function in numpy

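Python's built-in sum and numpy's sum are two different functions. After from numpy import *, the bare name sum refers to numpy's version, which is why sum(a,axis=0) works in the code below; the built-in sum does not accept an axis argument. For array work, numpy's version is the one to use. For details see: https://blog.csdn.net/Sophia_11/article/details/84975009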

from numpy import *
a=array([[1,2,3],[2,3,4],[5,6,7]])
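b=a.sum(axis=0) # sum down each column; the result is a 1-D array with one total per column
c=a.sum(axis=1) # sum across each row; the result is a 1-D array with one total per row
d=sum(a,axis=0) # function form of the same operation; gives the same result as a.sum(axis=0)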

print('a=',a)
print('b=',b)
print('c=',c)
print('d=',d)

a= [[1 2 3]
 [2 3 4]
 [5 6 7]]
b= [ 8 11 14]
c= [ 6  9 18]
d= [ 8 11 14]
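To make the difference between the two sum functions concrete, the sketch below calls each one explicitly on the same array. Importing builtins is only done here to name the built-in unambiguously; it is not something the original code does.

```python
import builtins
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

total = np.sum(a)                  # no axis: adds up every element
col_sums = np.sum(a, axis=0)       # one total per column: [ 8 11 14]
row_sums = np.sum(a, axis=1)       # one total per row:    [ 6  9 18]

# The built-in iterates over the rows and adds them together,
# so on a 2-D array it happens to produce the column sums.
builtin_result = builtins.sum(a)

print(total)                       # 33
```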

1.5 The **2 operator in numpy (squaring)

from numpy import *
a=array([[1,2,3],[2,3,4],[5,6,7]])
b=a**2

print('b=',b) # every element of the array is squared individually

b= [[ 1  4  9]
 [ 4  9 16]
 [25 36 49]]

1.6 Matrix multiplication in numpy

from numpy import *
A=array([[1,2,3],[2,3,4],[5,6,7]])
B=array([[2,3,4],[5,6,7],[8,9,10]])
C=A*B             # element-wise product
D=multiply(A,B)   # also element-wise; same result as A*B
E=dot(A,B)        # true matrix product of two matrices; use dot

print('A=',A) 
print('B=',B) 
print('C=',C) 
print('D=',D) 
print('E=',E)

A= [[1 2 3]
 [2 3 4]
 [5 6 7]]
B= [[ 2  3  4]
 [ 5  6  7]
 [ 8  9 10]]
C= [[ 2  6 12]
 [10 18 28]
 [40 54 70]]
D= [[ 2  6 12]
 [10 18 28]
 [40 54 70]]
E= [[ 36  42  48]
 [ 51  60  69]
 [ 96 114 132]]
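For two-dimensional arrays, the @ operator and np.matmul give the same matrix product as dot. The sketch below repeats the example with those spellings; the variable names E_dot, E_at and E_matmul are illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
B = np.array([[2, 3, 4], [5, 6, 7], [8, 9, 10]])

E_dot = np.dot(A, B)        # matrix product, as in the example above
E_at = A @ B                # operator spelling of the same product
E_matmul = np.matmul(A, B)  # function spelling of the same product

print(np.array_equal(E_dot, E_at))      # True
print(np.array_equal(E_dot, E_matmul))  # True
```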

1.7 Two ways to represent a matrix: array and matrix

Reference: https://blog.csdn.net/qq_42522262/article/details/86777426

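For the differences between the two, see https://blog.csdn.net/weixin_44340030/article/details/85929182
To summarize that article: mixing matrix and array in the same program quickly becomes confusing. Using only array covers everything matrix can do and makes the code easier to write and read, because a matrix is always two-dimensional while an array can have any number of dimensions. The two also behave differently in some operations, so these notes stick to array and set matrix aside for now.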
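One concrete difference is the meaning of the * operator. The following sketch assumes np.matrix is available in the installed numpy version (numpy's own documentation no longer recommends the class for new code); the 2x2 values are made up for illustration.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
M = np.matrix([[1, 2], [3, 4]])

elementwise = A * A     # array:  * multiplies element by element -> [[1, 4], [9, 16]]
matrix_prod = M * M     # matrix: * is the matrix product         -> [[7, 10], [15, 22]]
same_as_at = A @ A      # arrays get the matrix product through @ instead

print(np.array_equal(np.asarray(matrix_prod), same_as_at))   # True
```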

1.8 The argsort() sorting function in numpy

See: https://blog.csdn.net/qq_38486203/article/details/80967696?utm_medium=distribute.pc_relevant.none-task-blog-2_defaultbaidujs_baidulandingword~default-0-80967696-blog-84978772.t0_searchtargeting_v1&spm=1001.2101.3001.4242.1&utm_relevant_index=3

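from numpy import *
A=array([[1,2,3],[4,5,6],[7,8,9]])

B=A.argsort()   # default (last axis): sort each row in ascending order; returns the original positions of the elements, not their values
C=A.argsort(0)  # axis 0: sort down each column in ascending order; returns the original positions
D=A.argsort(1)  # axis 1: sort along each row in ascending order; returns the original positions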

print('A=',A)
print('B=',B)
print('C=',C)
print('D=',D)

A= [[1 2 3]
 [4 5 6]
 [7 8 9]]
B= [[0 1 2]
 [0 1 2]
 [0 1 2]]
C= [[0 0 0]
 [1 1 1]
 [2 2 2]]
D= [[0 1 2]
 [0 1 2]
 [0 1 2]]
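Because the array above is already sorted, every index comes back in its original order. The behaviour is easier to see on unsorted data. The sketch below uses a made-up one-dimensional array of distances, which is also how argsort is used in the KNN function.

```python
import numpy as np

# Hypothetical distances from a query point to four samples.
distances = np.array([56.6, 12.3, 90.1, 3.4])

order = distances.argsort()   # indices that would sort the array in ascending order
k_nearest = order[:2]         # positions of the two smallest distances

print(order)                  # [3 1 0 2]
print(k_nearest)              # [3 1]
```

Indexing the original array with order (distances[order]) returns the values themselves in ascending order.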

1.9 The KNN algorithm

from numpy import *
import operator

def creatDataset():
    group=array([[3,104],[2,100],[1,81],[101,10],[99,5],[98,2]])
    labels=['爱情片','爱情片','爱情片','动作片','动作片','动作片']  # 爱情片 = romance film, 动作片 = action film
    return group,labels

def KNN(inX,dataSet,labels,k):
    dataSetSize=dataSet.shape[0]                # number of training samples
    diffMat=tile(inX,(dataSetSize,1))-dataSet   # repeat inX once per sample, then subtract the samples
    sqDiffMat=diffMat**2    # square every element
    
    sqDistances = sqDiffMat.sum(axis=1) # sum across each row: one squared distance per sample
    distances = sqDistances**0.5        # element-wise square root gives the Euclidean distances
    sortedDistIndicies = distances.argsort()  # sample indices ordered from nearest to farthest
    classCount={}             # empty dict mapping label -> number of votes
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]           # label of the i-th nearest sample
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1  # add one vote: get() returns the current count, or 0 if this label has not been seen yet
            
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True) # list of (label, count) tuples, largest count first
    return sortedClassCount[0][0]  # label of the first tuple, i.e. the one with the most votes
   
#     return a,diffMat,sqDiffMat,sqDistances,distances,sortedDistIndicies



group,labels=creatDataset()
a=KNN([20,50],group,labels,3)
print(a)

# a,diffMat,sqDiffMat,sqDistances,distances,sortedDistIndicies=KNN([0,0],group,labels,3)


# print('a=\n',a)
# print('diffMat=\n',diffMat)
# print('sqDiffMat=\n',sqDiffMat)
# print('sqDistances=',sqDistances)
# print('distances=',distances)
# print('sortedDistIndicies=',sortedDistIndicies)

爱情片
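For comparison, here is an alternative sketch of the same steps that swaps tile for numpy broadcasting and the manual vote dictionary for collections.Counter. It is not the book's implementation; the function name knn_classify and the second query point [90, 20] are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k training points nearest to in_x."""
    data_set = np.asarray(data_set, dtype=float)
    # Broadcasting subtracts in_x from every row, so tile is not needed.
    diffs = data_set - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))    # Euclidean distance per sample
    nearest = distances.argsort()[:k]                # indices of the k closest samples
    votes = Counter(labels[i] for i in nearest)      # label -> number of votes
    return votes.most_common(1)[0][0]


group = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
labels = ['爱情片', '爱情片', '爱情片', '动作片', '动作片', '动作片']

result_a = knn_classify([20, 50], group, labels, 3)
result_b = knn_classify([90, 20], group, labels, 3)
print(result_a)   # 爱情片
print(result_b)   # 动作片
```

For the query [20,50] this agrees with the tile-based version above.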

1.9.1 Line 20 of the program above: classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1

Reference: https://www.cnblogs.com/EvilAnne/p/9740111.html

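The dict.get() method

dict.get(key, default)
Looks up key in the dictionary. If the key exists, its value is returned; otherwise default is returned.
If default is omitted and the key does not exist, get() returns None.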

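Dcountry = {"中国":"北京","美国":"华盛顿","法国":"巴黎"}  # country -> capital (China, USA, France)

a=Dcountry.get('美国',0)   # key exists: returns its value
b=Dcountry.get('韩国',0)   # key missing: returns the default 0
c=Dcountry.get('中国')     # key exists, no default given: returns its value
d=Dcountry.get('印度')     # key missing, no default given: returns None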

print('a=',a)
print('b=',b)
print('c=',c)
print('d=',d)

a= 华盛顿
b= 0
c= 北京
d= None
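The counting idiom used in the KNN function follows directly from this behaviour. A minimal sketch, with a made-up list of labels:

```python
# Hypothetical labels of the nearest neighbours.
votes = ['A', 'B', 'A', 'A', 'B']

counts = {}
for label in votes:
    # First time a label appears, get() returns 0; afterwards it returns the running count.
    counts[label] = counts.get(label, 0) + 1

print(counts)   # {'A': 3, 'B': 2}
```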

1.9.2 Line 22 of the program above: sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)

Reference: https://blog.csdn.net/weixin_40264772/article/details/102133138?utm_medium=distribute.pc_relevant.none-task-blog-2_defaultbaidujs_baidulandingword~default-0-102133138-blog-125149014.pc_relevant_multi_platform_whitelistv4&spm=1001.2101.3001.4242.1&utm_relevant_index=3

sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)

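classCount.items() returns a dict_items view that presents the classCount dictionary as a sequence of (key, value) tuples (see https://www.cnblogs.com/EvilAnne/p/9740111.html)
key=operator.itemgetter(1) sorts the tuples by their second element, which here is the vote count
reverse=True reverses the order, so the result runs from largest to smallest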
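The three pieces can be tried on a small dictionary. The vote counts below are made up for illustration; the max() line at the end is an alternative shown only for comparison and is not part of the original program.

```python
import operator

# Hypothetical vote counts.
class_count = {'A': 2, 'B': 5, 'C': 1}

pairs = list(class_count.items())   # [('A', 2), ('B', 5), ('C', 1)]
ranked = sorted(pairs, key=operator.itemgetter(1), reverse=True)
winner = ranked[0][0]               # label of the tuple with the largest count

# When only the winner is needed, max() avoids sorting the whole list.
same_winner = max(class_count, key=class_count.get)

print(ranked)   # [('B', 5), ('A', 2), ('C', 1)]
print(winner)   # B
```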