I. Overview
1. kNN (k-nearest neighbors) is a classification algorithm.
2. It is an instance-based, "lazy" learning method: no explicit model is built, and all of the work is deferred to prediction time.
II. Algorithm Details
1. Steps (a minimal sketch follows the list):
(1) To classify an unknown instance, take all instances with known class labels as the reference set.
(2) Choose the parameter k.
(3) Compute the distance between the unknown instance and every known instance (e.g., the Euclidean distance between two points).
(4) Select the k nearest known instances.
(5) By the majority-vote rule, assign the unknown instance to the class held by most of its k nearest neighbors.
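The steps above condense to a few lines of code. Below is a minimal sketch of steps (3)-(5), not the full implementation given later in this article; the 2-D sample points and the query point are made up purely for illustration.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # step (3): Euclidean distance from the query to every known instance
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # step (4): indices of the k nearest known instances
    nearest = np.argsort(dists)[:k]
    # step (5): majority vote among the labels of the k nearest instances
    return Counter(y_train[nearest]).most_common(1)[0][0]

# made-up 2-D sample data: two instances of class 0, two of class 1
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # prints 0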
2. Strengths and weaknesses:
Strengths: simple, easy to understand, easy to implement; with a suitable choice of k it gains robustness to noisy data.
Weaknesses: storing all known instances requires a large amount of memory;
the prediction cost is high (every known instance must be compared with the instance being classified);
when the class distribution is imbalanced, e.g., when one class dominates with far more instances than the others, a new unknown instance is easily assigned to that dominant class simply because its instances are so numerous, even though the unknown instance may not actually be close to it.
3. Improvement:
Take distance into account and weight each neighbor's vote by its distance, so that closer neighbors count for more (see the example below).
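In scikit-learn this improvement is a one-line change: KNeighborsClassifier accepts weights='distance', which weights each neighbor's vote by the inverse of its distance instead of counting every vote equally. A brief sketch (the value of k and the test instance are arbitrary choices for illustration):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# weights='uniform' (the default) counts every neighbor's vote equally;
# weights='distance' weights votes by 1/distance, so closer neighbors dominate
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(iris.data, iris.target)
print(knn.predict([[5.9, 3.0, 5.1, 1.8]]))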
III. Implementation
1. Using the ready-made classifier from scikit-learn:
# Import the required modules
from sklearn import neighbors
from sklearn import datasets
# Instantiate the kNN classifier
knn = neighbors.KNeighborsClassifier()
# Load the training data set
iris = datasets.load_iris()
print(iris)
# Build the model from the feature vectors and the target vector
knn.fit(iris.data, iris.target)
# Predict the target for a test instance
predictedLabel = knn.predict([[0.1, 0.2, 0.3, 0.4]])
print(predictedLabel)
Output (abridged):
{'DESCR': 'Iris Plants Database\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    ...', 'target_names': array(['setosa', 'versicolor', 'virginica'],
      dtype='<U10'), 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], 'target': array([0, 0, 0, ..., 2, 2, 2]), 'data': array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       ...,
       [ 6.2,  3.4,  5.4,  2.3],
       [ 5.9,  3. ,  5.1,  1.8]])}
[0]
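The test point [0.1, 0.2, 0.3, 0.4] above is an arbitrary vector, so the printed label [0] says little about model quality. A more meaningful check, assuming scikit-learn's train_test_split is available, holds out part of the iris data and measures accuracy, mirroring what the hand-written version in the next part does (test_size and random_state are arbitrary choices):
from sklearn import neighbors, datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# hold out one third of the data for testing, mirroring split = 0.67 below
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=0)
knn = neighbors.KNeighborsClassifier()
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # fraction of test instances classified correctly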
2. Implementing the kNN classifier algorithm in Python from scratch
First download the Iris data set from http://archive.ics.uci.edu/ml/:
click "Iris",
click "Data Folder",
download iris.data.
Code:
# -*- coding:utf-8 -*-
# Import the required modules
import csv
import random
import math
import operator

# Load the data set and randomly split it into a training set and a testing set
def load_dataset(fileName, split, trainingData=[], testingData=[]):
    with open(fileName, "rt") as csvfile:
        lines = csv.reader(csvfile)  # read the contents of the csv file
        dataset = list(lines)        # wrap the rows in a list
        for x in range(len(dataset) - 1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])  # convert the four features to float
            # split is the boundary: rows whose random number falls below split
            # go to the training set, the rest go to the testing set
            if random.random() < split:
                trainingData.append(dataset[x])
            else:
                testingData.append(dataset[x])

# Compute the distance between the test instance and a known instance
def cal_instance(instance_1, instance_2, dimen):
    distance = 0
    for x in range(dimen):
        distance += pow((instance_1[x] - instance_2[x]), 2)  # sum the squared differences over all dimensions
    return math.sqrt(distance)  # Euclidean distance

# Select the k known instances nearest to the test instance
def select_neark(trainingSet, testInstance, k):
    distance = []  # stores (training instance, distance to the test instance) pairs
    length = len(testInstance) - 1  # number of features (the last column is the class label)
    for x in range(len(trainingSet)):
        dst = cal_instance(testInstance, trainingSet[x], length)  # distance from each training instance to the test instance
        distance.append((trainingSet[x], dst))
    distance.sort(key=operator.itemgetter(1))  # sort by distance
    neighbors = []
    for x in range(k):
        neighbors.append(distance[x][0])  # keep the k nearest training instances
    return neighbors

# Classify the test instance by majority vote
def classify(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]  # the class label of this neighbor
        if response in classVotes:
            classVotes[response] += 1  # existing class: add one vote
        else:
            classVotes[response] = 1   # new class: record it with one vote
    # sort the vote counts in descending order and return the most frequent class
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

# Compute the prediction accuracy
def precision(testSet, prediction):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == prediction[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

# Main program
def main():
    trainingSet = []
    testingSet = []
    split = 0.67
    load_dataset(r"E:\demo_py\python\machine_learning\iris.data.txt", split, trainingSet, testingSet)
    print("trainSet: " + repr(len(trainingSet)))
    print("testSet: " + repr(len(testingSet)))
    prediction = []
    k = 3
    for x in range(len(testingSet)):
        neighbors = select_neark(trainingSet, testingSet[x], k)  # find the k nearest neighbors
        result = classify(neighbors)  # classify the test instance by its neighbors
        prediction.append(result)     # append the predicted label
        print("prediction: " + repr(result) + " actually: " + repr(testingSet[x][-1]))
    accuracy = precision(testingSet, prediction)
    print("accuracy: " + repr(accuracy) + "%")  # repr() converts the value to a string
main()
Output:
trainSet: 96
testSet: 54
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-setosa' actually: 'Iris-setosa'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-virginica' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-versicolor' actually: 'Iris-versicolor'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-versicolor' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
prediction: 'Iris-virginica' actually: 'Iris-virginica'
accuracy: 96.29629629629629%
Note a Python 2 / Python 3 incompatibility in this code: Python 3 removed the dict method iteritems(), so items() must be used in its place.
Remark:
The difference between items() in Python 3 and iteritems() in Python 2.x:
In Python 2.x, items() returns a copy of the list of all items (key/value pairs) in the dictionary, which costs extra memory.
iteritems() returns an iterator over all items (key/value pairs) in the dictionary, which costs no extra memory.
In Python 3.x, both iteritems() and viewitems() have been removed, and items() now gives the same result that viewitems() gave in 2.x. Replace iteritems() with items() in Python 3; the result can be traversed with a for loop, as the snippet below shows.
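A minimal demonstration of the Python 3 behavior (the vote counts in the dictionary are made up):
import operator

# made-up vote counts, shaped like those produced by classify() above
classVotes = {'Iris-setosa': 1, 'Iris-versicolor': 2}

# Python 3: items() returns a view that a for loop can iterate over directly
for label, votes in classVotes.items():
    print(label, votes)

# the same call classify() makes; under Python 2, iteritems() would avoid
# building an intermediate list here
sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
print(sortedVotes[0][0])  # prints Iris-versicolor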