Machine Learning in Action: Python Examples (1) KNN

xiaonannanxn

Category: Machine Learning. Tags: python, machine learning algorithms, KNN

Article link: https://blog.csdn.net/xiaonannanxn/article/details/52335920

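I have recently been tidying up some of my code and plan to post the results every few days. Part of it is based on the book Machine Learning in Action. The aim is to give readers who already have a rough understanding of the algorithms a more intuitive feel for them, so these posts skip the derivations and give only a brief explanation.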

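The first example is the KNN (k-nearest neighbors) algorithm. KNN uses the distance between samples in feature space to decide which class a new sample belongs to: if label y is the most common label among the k points closest to the new sample, the new sample is assigned label y. A sample with n attributes corresponds to a point in n-dimensional space.
The data records how much one user of a dating site liked different people. There are three attributes: percentage of time spent playing video games, frequent flyer miles earned per year, and liters of ice cream consumed per week. There are three labels, 1, 2 and 3, meaning dislike, neutral and like respectively. The data file datingTestSet2.txt can be downloaded from this link: http://download.csdn.net/detail/xiaonannanxn/9614733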
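
The voting rule described above can be sketched on a toy 2-D dataset before turning to the dating data. This is only an illustration: the function name knn_vote, the four points and the labels "A" and "B" are made up for the example and are not part of KNN.py.

```python
from collections import Counter

import numpy as np


def knn_vote(query, points, labels, k):
    """Return the majority label among the k points closest to query."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((points - query) ** 2).sum(axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # count the labels of those k points and take the most common one
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# hypothetical toy data: two small clusters in a 2-D feature space
points = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]

print(knn_vote(np.array([0.2, 0.1]), points, labels, k=3))
```

With k=3 the query point (0.2, 0.1) has two "B" neighbors and one "A" neighbor, so the vote goes to "B".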

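Looking at the data, one feature stands out: the three attributes are not on the same order of magnitude. As a result, attributes with large values affect the distance far more than attributes with small values, so the data needs to be normalized. The approach used here is min-max scaling, newValue = (oldValue - min) / (max - min), which is what autoNorm in the code computes. There are other normalization methods as well; I won't cover them here, and interested readers can look them up. Finally, here is the code, KNN.py: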

from numpy import *
import operator


# load the file and convert it into a feature matrix and a label list
def file2matrix(filename):
    # read all lines; the with-block closes the file afterwards
    with open(filename) as fr:
        arrayOfLines = fr.readlines()
    # get the number of samples
    numberOfLines = len(arrayOfLines)
    # there are three features
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOfLines:
        # strip the trailing newline and split the tab-separated fields
        line = line.strip().split("\t")
        returnMat[index, :] = line[0:3]
        classLabelVector.append(int(line[-1]))
        index += 1
    # return data, label
    return returnMat, classLabelVector


# min-max normalization: scale every feature to the range [0, 1]
def autoNorm(dataSet):
    # column-wise minimum and maximum of dataSet
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    numOfSamples = dataSet.shape[0]
    # tile repeats the row vector numOfSamples times to match the shape of dataSet
    normDataSet = dataSet - tile(minVals, (numOfSamples, 1))
    normDataSet = normDataSet / tile(ranges, (numOfSamples, 1))
    return normDataSet, ranges, minVals


# classify inX by majority vote among its k nearest neighbors in dataSet
def classify0(inX, dataSet, labels, k):
    numOfSamples = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = tile(inX, (numOfSamples, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # indices of the samples, ordered from nearest to farthest
    sortedDistIndicies = distances.argsort()
    # count the labels of the k nearest samples
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the (label, count) pairs by count, in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
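
To see why autoNorm matters, here is a small sketch comparing the distance between two samples before and after min-max scaling. The two samples, the feature ordering, and the minima and ranges are all hypothetical values chosen for illustration; they are not taken from datingTestSet2.txt.

```python
import numpy as np

# hypothetical samples; the feature order here is assumed for illustration:
# [frequent flyer miles per year, % of time playing games, liters of ice cream per week]
a = np.array([40000.0, 8.0, 1.0])
b = np.array([41000.0, 1.0, 0.1])

# raw Euclidean distance: the mileage difference (1000) swamps the other two features
raw = np.sqrt(((a - b) ** 2).sum())

# min-max scaling with assumed per-feature minima and ranges
min_vals = np.array([0.0, 0.0, 0.0])
ranges = np.array([100000.0, 20.0, 2.0])
a_norm = (a - min_vals) / ranges
b_norm = (b - min_vals) / ranges

# after scaling, every feature contributes on a comparable scale
scaled = np.sqrt(((a_norm - b_norm) ** 2).sum())

print(raw)
print(scaled)
```

The raw distance is almost exactly the mileage difference, even though the two samples differ much more, in relative terms, on the other two features. The expression (a - min_vals) / ranges relies on NumPy broadcasting, which gives the same result as the tile calls in autoNorm.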

Then create another file, init.py, to run the test:

from KNN import *

# hold out the first 10% of the samples as the test set
testRate = 0.1
datingDataMat, datingLabels = file2matrix("datingTestSet2.txt")
normMat, ranges, minVals = autoNorm(datingDataMat)
numOfSamples = normMat.shape[0]
numTestVecs = int(numOfSamples * testRate)
errorCount = 0.0
for i in range(numTestVecs):
    classifierResult = classify0(normMat[i, :], normMat[numTestVecs: numOfSamples, :],
                                 datingLabels[numTestVecs: numOfSamples], 3)
    print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
    if classifierResult != datingLabels[i]:
        errorCount += 1.0
print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
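
autoNorm also returns ranges and minVals, which the test above does not use. One plausible use, sketched below, is to scale a new, unseen sample with the training minima and ranges before classifying it. The helpers min_max_normalize and classify, and the six training rows, are hypothetical stand-ins written so that the example runs on its own; they are not the code from KNN.py.

```python
from collections import Counter

import numpy as np


def min_max_normalize(data):
    """Scale each column to [0, 1]; also return the ranges and minima."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    return (data - min_vals) / ranges, ranges, min_vals


def classify(query, data, labels, k):
    """Majority vote among the k rows of data nearest to query."""
    distances = np.sqrt(((data - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# hypothetical training data standing in for datingTestSet2.txt
train = np.array([
    [40000.0, 8.0, 1.0],
    [42000.0, 9.0, 0.9],
    [38000.0, 7.5, 1.1],
    [5000.0, 1.0, 0.2],
    [6000.0, 0.5, 0.3],
    [4000.0, 1.5, 0.1],
])
train_labels = [3, 3, 3, 1, 1, 1]

norm_train, ranges, min_vals = min_max_normalize(train)

# a new sample must be scaled with the training minima and ranges,
# otherwise it would not live in the same space as norm_train
new_sample = np.array([39000.0, 8.2, 1.0])
new_norm = (new_sample - min_vals) / ranges

print(classify(new_norm, norm_train, train_labels, k=3))
```

Here the new sample lies close to the first three training rows, so it is assigned label 3.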