Machine Learning In Action - Chapter 2 KNN

MrTriste

Category: Machine Learning In Action. Tags: machine learning, Machine Learning in Action, kNN algorithm

Article link: https://blog.csdn.net/wjc1182511338/article/details/76649121

Chapter 2 - KNN

KNN pseudocode

For every point in our dataset:
  calculate the distance between inX and the current point
sort the distances in increasing order
take k items with lowest distances to inX
find the majority class among these items
return the majority class as our prediction for the class of inX
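
The pseudocode above can be sketched in plain Python as follows. This is only a minimal illustration, not the book's implementation: the name knn_predict and the four toy points are made up for this example, and math.dist assumes Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(in_x, points, labels, k):
    """Return the majority label among the k points closest to in_x."""
    # Distance from in_x to every point in the dataset
    distances = [math.dist(in_x, p) for p in points]
    # Point indices ordered by increasing distance
    order = sorted(range(len(points)), key=lambda i: distances[i])
    # Labels of the k nearest points
    nearest = [labels[i] for i in order[:k]]
    # Majority class among those k labels
    return Counter(nearest).most_common(1)[0][0]

# Toy data, invented for illustration
points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
print(knn_predict((0.0, 0.0), points, labels, 3))  # B
```

With k = 3 the three nearest neighbours of (0, 0) carry the labels B, B, A, so the vote goes to B.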

Dating site

The labels in datingTestSet.txt are text and cannot be converted to int, so use datingTestSet2.txt instead.

Normalization

When dealing with values that lie in different ranges, it’s common to normalize them.

Common ranges to normalize them to are 0 to 1 or -1 to 1.

To scale everything from 0 to 1, you need to apply the following formula:

newValue = (oldValue-min)/(max-min)
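
As a worked example of the formula, here is a small sketch that scales one column of values to 0..1 and, with the same arithmetic stretched and shifted, to -1..1. The function name min_max_scale and the sample numbers are invented for illustration.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values so that min -> low and max -> high."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values equal: the formula would divide by zero
        raise ValueError("cannot scale a constant column")
    # (v - lo) / span is the 0..1 formula; the rest rescales to low..high
    return [low + (v - lo) / span * (high - low) for v in values]

# Made-up column of values
column = [10000.0, 40000.0, 25000.0]
print(min_max_scale(column))             # [0.0, 1.0, 0.5]
print(min_max_scale(column, -1.0, 1.0))  # [-1.0, 1.0, 0.0]
```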

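from numpy import tile

def autoNorm(dataSet):
    # Column-wise minimum and maximum (axis 0 runs down the rows)
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # Repeat minVals and ranges m times so they match the shape of dataSet
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise division
    return normDataSet, ranges, minVals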
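
autoNorm returns ranges and minVals alongside the normalized data. One likely use for them is scaling a new input with the same minimums and ranges as the training data before classifying it. The sketch below shows that idea in plain Python; fit_min_max, apply_min_max, and the sample rows are hypothetical names and data, not part of the original code.

```python
def fit_min_max(rows):
    """Return per-column minimums and ranges for a list of equal-length rows."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    return mins, ranges

def apply_min_max(row, mins, ranges):
    """Scale one row using minimums and ranges computed on the training rows."""
    return [(v - lo) / span for v, lo, span in zip(row, mins, ranges)]

# Made-up training rows with two features each
training_rows = [[10.0, 1.0], [20.0, 3.0], [30.0, 2.0]]
mins, ranges = fit_min_max(training_rows)

# A new point is scaled with the training statistics, not its own
query = apply_min_max([25.0, 2.0], mins, ranges)
print(query)  # [0.75, 0.5]
```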

Classification function:

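import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Indices that would sort the distances in increasing order
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Tally the labels of the k nearest points
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort once, after the voting loop, by vote count in decreasing order
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]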

Full test:

def datingClassTest():
    hoRatio = 0.30  # hold out the first 30% of the rows for testing
    # file2matrix is the loader from the book; it is not shown in this post
    datingDataMat, datingLabels = file2matrix("datingTestSet2.txt")
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Classify each held-out row against the remaining 70%, with k = 6
        res = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                        datingLabels[numTestVecs:m], 6)
        # print("the classifier came back with: %d, the real answer is: %d" % (res, datingLabels[i]))
        if res != datingLabels[i]:
            errorCount += 1.0
    print("error rate:", errorCount / float(numTestVecs))
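
The test above is a hold-out evaluation: the first hoRatio share of the rows is classified against the rest, and the error rate is the fraction of misses. A stripped-down sketch of that pattern follows. The names holdout_error_rate and nearest_label and the one-dimensional data are invented for illustration, and a 1-nearest-neighbour rule stands in for classify0.

```python
def nearest_label(x, train_x, train_y):
    """Stand-in classifier: label of the single closest training value."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]

def holdout_error_rate(samples, labels, classify, ho_ratio):
    """Test on the first ho_ratio share of the data, train on the rest."""
    n_test = int(len(samples) * ho_ratio)
    train_x, train_y = samples[n_test:], labels[n_test:]
    errors = 0
    for i in range(n_test):
        if classify(samples[i], train_x, train_y) != labels[i]:
            errors += 1
    return errors / n_test

# Made-up data: the second test sample sits closest to a "lo" training point
samples = [0.1, 0.4, 0.0, 0.2, 1.0, 0.8, 0.3, 0.7]
labels = ["lo", "hi", "lo", "lo", "hi", "hi", "lo", "hi"]
print(holdout_error_rate(samples, labels, nearest_label, 0.25))  # 0.5
```

Because the test rows are simply the first ones in the file, the estimate is only meaningful if the rows are in no particular order; data sorted by class would need shuffling first.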

Digit recognition

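from numpy import zeros

def img2vector(filename):
    # Flatten a 32x32 text image of 0/1 characters into a 1x1024 row vector
    returnVect = zeros((1, 1024))
    with open(filename) as fr:  # the file is closed when the block exits
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                # Row i, column j lands at flat index 32*i + j
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect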
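
The index 32*i+j stores the image row by row: row i starts at position 32*i and column j is the offset within it. The sketch below shows the same layout on a 4x4 image held in memory, so it can be run without the data files. The name text_image_to_vector and the glyph are made up for this example.

```python
def text_image_to_vector(lines, size):
    """Flatten size x size lines of '0'/'1' characters into a flat list."""
    vector = [0] * (size * size)
    for i in range(size):
        for j in range(size):
            # Row-major layout: row i, column j goes to index size*i + j
            vector[size * i + j] = int(lines[i][j])
    return vector

# Invented 4x4 glyph standing in for a 32x32 digit file
glyph = ["0110",
         "1001",
         "1001",
         "0110"]
vec = text_image_to_vector(glyph, 4)
print(vec)             # [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
print(vec[4 * 1 + 3])  # 1 (row 1, column 3)
```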

from os import listdir
from numpy import zeros

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    # Build the training matrix: one flattened image per row
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]       # drop the extension
        classNumStr = int(fileStr.split('_')[0])  # digit before the underscore
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir("testDigits")
    errorCount = 0.0
    mTest = len(testFileList)
    # Classify every test image with k = 4 and count the misses
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        res = classify0(vectorUnderTest, trainingMat, hwLabels, 4)
        if res != classNumStr:
            errorCount += 1
    print("error rate:", errorCount / mTest)
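
The code above reads each image's true label from its file name, taking the integer before the first underscore. A small sketch of that parsing step follows, assuming names of the form <digit>_<index>.txt as the two split calls imply. The helper label_from_filename and the sample names are hypothetical.

```python
import os

def label_from_filename(name):
    """Return the integer before the first underscore in a file name."""
    stem = os.path.splitext(name)[0]  # "9_45.txt" -> "9_45"
    return int(stem.split("_")[0])    # "9_45" -> 9

# Example names following the assumed <digit>_<index>.txt pattern
print(label_from_filename("9_45.txt"))  # 9
print(label_from_filename("0_3.txt"))   # 0
```

A file in the directory that does not follow the pattern would raise ValueError here, just as it would in the original int(...) call.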