2. Machine Learning: the k-Nearest Neighbors Algorithm

攀爬←蜗牛

Tags: machine learning

Article link: https://blog.csdn.net/qq_44464349/article/details/114485966

k-Nearest Neighbors Algorithm

1. Overview
2. Writing the Algorithm in Python
3. Complete Code
- 3.1. Attachment: Sample Data

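1. Overview

The k-nearest neighbors algorithm classifies data by measuring the distances between feature values.
1. Start from a sample data set whose labels (classes) are known.

2. Compare each feature of the new data point with the corresponding feature of every sample in the data set.

3. Take the labels of the k samples whose features are most similar to the new point and use them as the basis for classifying it (this is the work the algorithm does).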
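
The three steps above can be sketched in plain Python before bringing in NumPy. This is only an illustration under stated assumptions: the function name knn_predict, the query point, and the four toy samples are made up for the example, Euclidean distance is assumed as the similarity measure, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(query, samples, sample_labels, k):
    """Classify `query` by majority vote among its k nearest samples (hypothetical helper)."""
    # Step 2: compare the query with every labeled sample (Euclidean distance)
    distances = [math.dist(query, s) for s in samples]
    # Step 3: keep the indices of the k samples closest to the query
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # The most common label among those k neighbors is the prediction
    votes = Counter(sample_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Step 1: a labeled sample set (toy values chosen for illustration)
samples = [(1.1, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
sample_labels = ['A', 'A', 'B', 'B']

prediction = knn_predict((0.2, 0.1), samples, sample_labels, k=3)
print(prediction)  # B
```

The query (0.2, 0.1) lies close to the two 'B' samples, so two of its three nearest neighbors vote 'B'.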

2. Writing the Algorithm in Python

2.1 Create a Python module named KNN.py

Add the following code (see the code comments for details):

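from numpy import *     # import the NumPy scientific computing package (array operations)
import operator         # import the operator module (used when sorting)
def createDataSet():    # createDataSet() returns group (the sample data) and labels (their classes)
      group = array([[1.1,1.1],[1.0,1.0],[0,0],[0,0.1]])  # each row is one sample; each number in a row is one feature
      labels = ['A','A','B','B']
      return group,labels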

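2.2 The kNN classification algorithm

1. Compute the distance between every point in the known data set and the current point.
2. Sort the points in ascending order of distance.
3. Select the k points closest to the current point.
4. Count how often each label occurs among those k points.
5. Return the most frequent label among the k points as the predicted class of the new data.

The nearest-neighbor program: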

def classify0(inX,dataSet,labels,k):                           #1 (see the notes below the code)
      dataSetSize = dataSet.shape[0]                         #2
      diffMat = tile(inX,(dataSetSize,1)) - dataSet     #3
      sqDiffMat = diffMat**2
      sqDistances = sqDiffMat.sum(axis=1)                   #4
      distances = sqDistances**0.5
      sortedDistIndicies = distances.argsort()   # indices that would sort the distances in ascending order
      classCount = {}           # dictionary mapping each label to its vote count
      for i in range(k):           # i takes the values 0 to k-1
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel,0)+1
      sortedClassCount = sorted(classCount.items(),key = operator.itemgetter(1),reverse = True)
      return sortedClassCount[0][0]

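Notes:
1. classify0 takes four parameters: inX, the data point to classify; dataSet, the sample data; labels, the labels of the sample data; and k, the number of nearest neighbors whose labels decide the class.
2. shape is an array attribute that holds the length of each dimension; shape[0] is the length of the first dimension. For example, array([[5,7,9],[1,2,3]]).shape is (2, 3): the first dimension holds 2 elements (the two inner lists) and the second holds 3. You can also read it as (number of rows, number of columns).
3. tile(inX,(dataSetSize,1)) repeats inX dataSetSize times along the rows and once along the columns, so the number of columns stays unchanged.
4. With axis=1, sum adds up the elements within each row, giving one value per row.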
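
To make the notes concrete, the intermediate values of the distance calculation can be inspected one step at a time. The following is a small sketch that reuses the toy samples from createDataSet and assumes a query point at the origin; it uses the np. prefix instead of the wildcard import, and the last line checks that NumPy broadcasting yields the same differences without tile.

```python
import numpy as np

dataSet = np.array([[1.1, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
inX = np.array([0.0, 0.0])   # query point, chosen for illustration

print(dataSet.shape)         # (number of rows, number of columns)

# tile repeats inX once per sample row so the two arrays have the same shape
tiled = np.tile(inX, (dataSet.shape[0], 1))
diffMat = tiled - dataSet

# Square every difference, then add the squares within each row
sqDistances = (diffMat ** 2).sum(axis=1)
distances = sqDistances ** 0.5

# argsort returns row indices ordered from nearest to farthest
order = distances.argsort()
print(order)

# Broadcasting produces the same differences without building the tiled copy
assert np.allclose(inX - dataSet, diffMat)
```

Here the nearest row is index 2 (the point identical to the query), followed by rows 3, 1, and 0.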

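2.3. Parsing the data to be analyzed

def file2matrix(filename):   # text-parsing function
      with open(filename) as fr:    # open the text file; it is closed when the block ends
            arrayOfLines = fr.readlines()   # readlines() reads every line of the file
      numberOfLines = len(arrayOfLines)  # number of lines
      returnMat = zeros((numberOfLines,3))  # one row per line, three feature columns
      classLabelVector = []
      index = 0
      for line in arrayOfLines:
            line = line.strip()   # strip() removes leading and trailing whitespace (spaces, newlines) by default
            listFromLine = line.split('\t') # split on tab characters
            returnMat[index,:] = listFromLine[0:3]  # the first three fields are the features
            classLabelVector.append(int(listFromLine[-1]))  # the last field on the line is the class label
            index += 1
      return returnMat,classLabelVector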
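
As an alternative to parsing line by line, NumPy's loadtxt can read a tab-separated numeric file in one call. The sketch below is not the author's method; it assumes the same layout (three feature columns followed by an integer label), and the helper name load_tab_file and the two data rows are made up for the example.

```python
import os
import tempfile
import numpy as np

def load_tab_file(filename):
    """Return (features, labels) from a tab-separated file: 3 features, then 1 integer label."""
    data = np.loadtxt(filename, delimiter='\t', ndmin=2)  # ndmin=2 keeps a 2-D array even for one row
    features = data[:, 0:3]                               # first three columns
    labels = data[:, -1].astype(int).tolist()             # last column as a list of ints
    return features, labels

# Made-up rows written to a temporary file so the example is self-contained
sample_text = "1000\t2.5\t0.5\t1\n2000\t7.5\t1.5\t2\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample_text)
    path = tmp.name

features, labels = load_tab_file(path)
os.remove(path)
print(features.shape, labels)
```

Note that loadtxt expects every field to be numeric, so this variant only fits files whose labels are stored as numbers.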

2.3. Creating a scatter plot with Matplotlib

Code:

import matplotlib
import matplotlib.pyplot as plt
def Drawing(Matrix):
      fig = plt.figure()
      ax = fig.add_subplot(111)  #1
      ax.scatter(Matrix[:,1],Matrix[:,2])  #2
      plt.show()

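Notes:
1. add_subplot(111) creates a 1×1 grid and draws in its only cell. For example, add_subplot(223) divides the figure into a 2×2 grid of 4 regions and draws in the third one.
2. scatter draws a scatter plot; here column 1 of Matrix supplies the x values and column 2 the y values. See the Matplotlib documentation for its full parameter list.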
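
A scatter plot becomes more informative for classification when each point is colored by its class. The sketch below is one way to do that, assuming integer class labels; the function name draw_by_class, the axis captions, and the three data rows are made up for illustration, and the Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show() on a desktop
import matplotlib.pyplot as plt
import numpy as np

def draw_by_class(matrix, labels):
    """Plot column 1 against column 2, coloring each point by its class label."""
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # c=labels maps each integer label to a color from the default colormap
    ax.scatter(matrix[:, 1], matrix[:, 2], c=labels)
    ax.set_xlabel("feature column 1")
    ax.set_ylabel("feature column 2")
    return fig, ax

# Made-up rows: three features per sample, one class label each
matrix = np.array([[1000, 2.5, 0.5], [2000, 7.5, 1.5], [3000, 4.0, 0.9]])
fig, ax = draw_by_class(matrix, [1, 2, 3])
```

Calling fig.savefig with a file name, or plt.show() with an interactive backend, then renders the figure.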

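2.4. Normalization

When some features have much larger values than others, they dominate the distance calculation and hurt classification accuracy. Normalization rescales every feature to the range 0 to 1 so that each feature carries comparable weight, which generally gives more accurate classification.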

def autoNorm(dataSet):
      minVals = dataSet.min(0)   # minimum of each column
      maxVals = dataSet.max(0)    # maximum of each column
      ranges = maxVals - minVals  # value range of each column
      normDataSet = zeros(shape(dataSet))
      m = dataSet.shape[0]        # number of rows (shape is indexed, not called)
      normDataSet = dataSet - tile(minVals,(m,1))
      normDataSet = normDataSet/tile(ranges,(m,1))
      return normDataSet,ranges,minVals
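
A small worked example shows what min-max normalization does, and why the function also returns the ranges and minimum values: a new sample has to be rescaled with the same numbers before it can be compared with the normalized data. This is a sketch using broadcasting rather than tile; the name min_max_normalize and all values are made up for illustration, and it assumes no column is constant (a zero range would cause a division by zero).

```python
import numpy as np

def min_max_normalize(data):
    """Rescale each column to [0, 1] using newValue = (oldValue - min) / (max - min)."""
    min_vals = data.min(axis=0)            # per-column minimum
    ranges = data.max(axis=0) - min_vals   # per-column range
    # Broadcasting applies the 1-D min_vals and ranges to every row
    return (data - min_vals) / ranges, ranges, min_vals

# Made-up data: the first feature is roughly 1000 times larger than the second
data = np.array([[1000.0, 2.0], [3000.0, 4.0], [2000.0, 10.0]])
norm, ranges, min_vals = min_max_normalize(data)
print(norm)

# A new sample must be scaled with the training ranges and minimums
new_sample = np.array([2500.0, 6.0])
new_norm = (new_sample - min_vals) / ranges
print(new_norm)
```

After rescaling, the first column becomes 0, 1, 0.5 and the second 0, 0.25, 1, so neither feature dominates the distance.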

2.5. Main code

def datingClassTest():
      hoRatio = 0.10      # fraction of the data held out for testing
      datingDataMat,datingLabels = file2matrix('datingTestSet.txt')
      normMat,ranges,minVals = autoNorm(datingDataMat)
      m = normMat.shape[0]
      numTestVecs = int(m*hoRatio)   # the first numTestVecs rows are the test set
      errorCount = 0.0
      index = 1           # running number of the test sample being printed
      for i in range(numTestVecs):
            classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
            message = "the classifier came back with: %d, the real answer is : %d ||%d"% (classifierResult,datingLabels[i],index)
            print(message)
            index +=1
            if (classifierResult != datingLabels[i]):errorCount += 1.0
      message = "the total error rate is : %f" % (errorCount/float(numTestVecs))
      print (message)
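
The hold-out idea behind datingClassTest can be isolated in a few lines: the first fraction of the rows is used only for testing, and the remaining rows serve as the reference set. The sketch below is self-contained and illustrative; the names classify and holdout_error and the two-cluster data are made up, and it assumes the rows are already in random order and that at least one row ends up in the test set.

```python
import numpy as np
from collections import Counter

def classify(query, train_x, train_y, k):
    """Majority vote among the k training rows nearest to `query` (Euclidean distance)."""
    distances = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def holdout_error(features, labels, ratio, k):
    """Error rate when the first `ratio` of the rows is tested against the rest."""
    m = features.shape[0]
    num_test = int(m * ratio)
    errors = 0
    for i in range(num_test):
        predicted = classify(features[i], features[num_test:], labels[num_test:], k)
        if predicted != labels[i]:
            errors += 1
    return errors / num_test

# Made-up, already normalized data forming two well-separated clusters
features = np.array([[0.1, 0.0], [0.9, 1.0], [0.0, 0.1], [0.1, 0.1], [0.0, 0.0],
                     [1.0, 1.0], [0.9, 0.9], [1.0, 0.9], [0.2, 0.1], [0.8, 1.0]])
labels = [1, 2, 1, 1, 1, 2, 2, 2, 1, 2]

error_rate = holdout_error(features, labels, ratio=0.2, k=3)
print("the total error rate is : %f" % error_rate)
```

With clusters this clean both test rows are classified correctly; on real data the error rate depends on hoRatio, on k, and on how the rows are ordered.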

2.6. Output

(Screenshot of the program output; the image is not available here.)

3. Complete Code

from numpy import *
import operator
# Open the file, then extract and organize the sample data
def file2matrix(filename):
      with open(filename) as fr:              # the file is closed when the block ends
            arrayOfLines = fr.readlines()
      numberOfLines = len(arrayOfLines)
      returnMat = zeros((numberOfLines,3))
      classLabelVector = []
      index = 0
      for line in arrayOfLines:
            line = line.strip()
            listFromLine = line.split('\t')
            returnMat[index,:] = listFromLine[0:3]
            classLabelVector.append(int(listFromLine[-1]))
            index += 1
      return returnMat,classLabelVector

# kNN classification
def classify0(inX,dataSet,labels,k):
      dataSetSize = dataSet.shape[0]
      diffMat = tile(inX,(dataSetSize,1)) - dataSet
      sqDiffMat = diffMat**2
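      sqDistances = sqDiffMat.sum(axis=1)                   # with axis=1, add up the elements within each row
      distances = sqDistances**0.5
      sortedDistIndicies = distances.argsort()              # indices that would sort the distances in ascending order
      classCount = {}                                       # dictionary mapping each label to its vote count
      for i in range(k):                                    # i takes the values 0 to k-1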
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel,0)+1
      sortedClassCount = sorted(classCount.items(),key = operator.itemgetter(1),reverse = True)
      return sortedClassCount[0][0]
import matplotlib
import matplotlib.pyplot as plt
def Drawing(Matrix):
      fig = plt.figure()
      ax = fig.add_subplot(111)
      ax.scatter(Matrix[:,1],Matrix[:,2])
      plt.show()
# Feature normalization
def autoNorm(dataSet):
      minVals = dataSet.min(0)
      maxVals = dataSet.max(0)
      ranges = maxVals - minVals
      normDataSet = zeros(shape(dataSet))
      m = dataSet.shape[0]
      normDataSet = dataSet - tile(minVals,(m,1))
      normDataSet = normDataSet/tile(ranges,(m,1))
      return normDataSet,ranges,minVals
# Main code
def datingClassTest():
      hoRatio = 0.1
      # Raw string so the backslashes in the Windows path are not treated as escapes; adjust the path to your machine
      datingDataMat,datingLabels = file2matrix(r'F:\桌面\py工作\约会\约会分类训练数据.txt')
      normMat,ranges,minVals = autoNorm(datingDataMat)
      m = normMat.shape[0]
      numTestVecs = int(m*hoRatio)
      errorCount = 0.0
      index = 1
      for i in range(numTestVecs):
            classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
            message = "the classifier came back with: %d, the real answer is : %d ||%d"% (classifierResult,datingLabels[i],index)
            print(message)
            index +=1
            if (classifierResult != datingLabels[i]):errorCount += 1.0
      message = "the total error rate is : %f" % (errorCount/float(numTestVecs))
      print (message)

3.1. Attachment: sample data

The sample data is available through the link in the original post.
