Reference book: Machine Learning in Action
Experiment: predict whether a dating candidate is attractive to the user.
Input data: each candidate is described by three features: frequent-flyer miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per week. (My take: these three features roughly reflect a person's wealth, leisure habits, and diet.)
Sample set: data for 1000 dating candidates, each carrying one of three labels: not attractive at all, moderately attractive, very attractive.
Procedure:
1. Use 90% of the samples as the training set and the remaining 10% as the test set, and measure the error rate of classify.py.
2. Let the user enter a candidate's features and output the predicted label as a recommendation.
Code files:
file2Matrix.py: the sample set is stored in a txt file; this function loads it into memory as a NumPy array
plotDataSet.py: plots the sample set (only two of the three features can be shown per plot)
autoNorm.py: normalizes the features, since their value ranges differ widely
datingClassTest.py: measures the error rate
classify.py: the classification (prediction) function
classifyPerson.py: takes a person's features and outputs the prediction
knn.py: the main program
Source files:
file2Matrix.py: the sample set is stored in a txt file; this function loads it into memory as a NumPy array
__author__ = 'root'
import numpy as np

def file2Matrix(filename):
    # read all lines of the sample file; lines is a list of strings
    with open(filename, mode='r') as fileHandle:
        lines = fileHandle.readlines()
    # pre-allocate the feature matrix: one row per sample, three features
    datingDataSet = np.zeros((len(lines), 3))
    labels = []
    # parse each tab-separated line into three features and a label
    for i, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        datingDataSet[i, :] = listFromLine[0:3]
        labels.append(int(listFromLine[-1]))
    return datingDataSet, labels
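For reference, each line of datingTestSet2.txt holds three tab-separated feature values followed by an integer label from 1 to 3. A minimal self-contained sketch of the same parsing logic, using an in-memory string instead of the real file (the two sample lines are made up for illustration):

```python
import numpy as np

# two illustrative lines in the datingTestSet2.txt format:
# flyMiles<TAB>gameTime<TAB>iceCream<TAB>label
raw = "40920\t8.326976\t0.953952\t3\n14488\t7.153469\t1.673904\t2\n"

lines = raw.splitlines()
dataSet = np.zeros((len(lines), 3))
labels = []
for i, line in enumerate(lines):
    fields = line.strip().split('\t')
    dataSet[i, :] = fields[0:3]        # NumPy converts the strings to floats
    labels.append(int(fields[-1]))

print(dataSet.shape)   # (2, 3)
print(labels)          # [3, 2]
```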
plotDataSet.py: plots the sample set (only two of the three features can be shown per plot)
__author__ = 'root'
import numpy as np
import matplotlib.pyplot as plt

def plotDataSet(datingDataSet, labels):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # scale both marker size and color by the class label so the
    # three classes are visually distinguishable
    ax.scatter(datingDataSet[:, 0], datingDataSet[:, 1],
               15 * np.array(labels), 15 * np.array(labels))
    plt.show()
autoNorm.py: normalizes the features, since their value ranges differ widely
__author__ = 'root'
import numpy as np

def autoNorm(datingDataSet):
    # per-feature (column-wise) minimum and maximum
    dataSetMin = datingDataSet.min(axis=0)
    dataSetMax = datingDataSet.max(axis=0)
    # normalized = (value - min) / (max - min); NumPy broadcasting
    # applies the 1-D min/max row by row, so np.tile is unnecessary
    datingDataSet = (datingDataSet - dataSetMin) / (dataSetMax - dataSetMin)
    return datingDataSet, dataSetMin, dataSetMax
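The (value - min) / (max - min) rescaling can be checked on a toy matrix; the numbers below are invented for illustration, with the three columns deliberately on very different scales, as in the dating data:

```python
import numpy as np

# made-up samples: rows are candidates, columns are
# (fly miles, game-time percentage, liters of ice cream)
data = np.array([[40000.0, 8.0, 0.5],
                 [14000.0, 2.0, 1.5],
                 [75000.0, 5.0, 1.0]])

dataMin = data.min(axis=0)   # per-column minimum: [14000, 2, 0.5]
dataMax = data.max(axis=0)   # per-column maximum: [75000, 8, 1.5]
# broadcasting stretches the 1-D min/max over every row
normed = (data - dataMin) / (dataMax - dataMin)

print(normed.min(axis=0))    # [0. 0. 0.]
print(normed.max(axis=0))    # [1. 1. 1.]
```

After normalization every feature lies in [0, 1], so no single feature dominates the Euclidean distance.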
datingClassTest.py: measures the error rate
__author__ = 'root'
import numpy as np
import classify

def datingClassTest(datingDataSet, labels):
    # hold out the first 10% of the samples as the test set
    ratio = 0.1
    k = 4
    lenOfDataSet = datingDataSet.shape[0]
    numOfTest = int(ratio * lenOfDataSet)
    print(numOfTest)
    numOfError = 0
    # classify every test sample against the remaining 90%
    for i in range(numOfTest):
        inX = datingDataSet[i, :]
        label = labels[i]
        ans = classify.classify(inX, datingDataSet[numOfTest:lenOfDataSet, :],
                                labels[numOfTest:lenOfDataSet], k)
        if ans != label:
            numOfError += 1
            print('predict error')
    return numOfError / numOfTest
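The holdout evaluation can be sketched independently of the other modules. This is a simplified stand-in: 1-nearest-neighbour on made-up, well-separated 1-D data rather than the project's classify function, but the 10% split and error counting are the same:

```python
import numpy as np

# deterministic made-up data: class 1 clusters near 0, class 2 near 10
rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)])
labels = [1] * 50 + [2] * 50
order = rng.permutation(100)            # shuffle so the test slice mixes classes
data, labels = data[order], [labels[i] for i in order]

numOfTest = int(0.1 * len(data))        # first 10% held out for testing
errors = 0
for i in range(numOfTest):
    # 1-nearest-neighbour vote against the remaining 90%
    train = data[numOfTest:]
    nearest = np.abs(train - data[i]).argmin()
    if labels[numOfTest:][nearest] != labels[i]:
        errors += 1
print(errors / numOfTest)
```

With clusters this far apart the nearest neighbour always lands in the right class, so the error rate comes out 0; on the real dating data the book reports a small but nonzero error rate.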
classify.py: the classification (prediction) function
__author__ = 'root'
import numpy as np
import operator

def classify(inX, dataSet, labels, k):
    # Euclidean distance between inX and every sample in dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distance = sqDistances ** 0.5
    # indices of the samples sorted by distance, nearest first
    sortedDistIndicies = distance.argsort()
    # count how often each class appears among the k nearest neighbours
    classCount = {}
    for i in range(k):
        className = labels[sortedDistIndicies[i]]
        # get(className, 0) returns 0 when the class has no count yet
        classCount[className] = classCount.get(className, 0) + 1
    # sort the (class, count) pairs by count; reverse=True puts the
    # largest count first
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    # the majority class among the k nearest neighbours
    return sortedClassCount[0][0]
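The distance-plus-vote logic is easiest to see on a toy data set. The four 2-D points below are invented (in the spirit of the book's createDataSet example), and the function is the same algorithm condensed:

```python
import numpy as np
import operator

def classify(inX, dataSet, labels, k):
    # Euclidean distance from inX to every row of dataSet
    diff = np.tile(inX, (dataSet.shape[0], 1)) - dataSet
    dist = ((diff ** 2).sum(axis=1)) ** 0.5
    nearest = dist.argsort()[:k]        # indices of the k closest samples
    # majority vote among the k nearest neighbours
    counts = {}
    for idx in nearest:
        counts[labels[idx]] = counts.get(labels[idx], 0) + 1
    return sorted(counts.items(), key=operator.itemgetter(1), reverse=True)[0][0]

# two made-up points per class: class 1 near (1, 1), class 2 near (0, 0)
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = [1, 1, 2, 2]
print(classify(np.array([0.1, 0.1]), group, labels, 3))  # 2
print(classify(np.array([0.9, 1.0]), group, labels, 3))  # 1
```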
classifyPerson.py: takes a person's features and outputs the prediction
__author__ = 'root'
import numpy as np
import classify

def classifyPerson(datingDataSet, dataSetMin, dataSetMax, labels):
    resultList = ['not at all', 'a little like', 'like very much']
    k = 3
    # read the three features from the user
    flyMiles = float(input('please input fly miles per year:'))
    percOfVideoGames = float(input('please input percentage of time you spend playing video games:'))
    iceCream = float(input('please input how much ice cream you eat every week:'))
    # normalize the input with the same min/max as the training data
    inX = np.array([flyMiles, percOfVideoGames, iceCream])
    inX = (inX - dataSetMin) / (dataSetMax - dataSetMin)
    # predict, then map the numeric label (1..3) to a description
    ans = classify.classify(inX, datingDataSet, labels, k)
    print('you may feel this person:', resultList[ans - 1])
knn.py: the main program
__author__ = 'root'
import file2Matrix
import plotDataSet
import autoNorm
import datingClassTest
import classifyPerson

# load the sample set into memory
datingDataSetOri, labels = file2Matrix.file2Matrix('datingTestSet2.txt')
print('datingDataSetOri:\n', datingDataSetOri)
print('labels:\n', labels)
# plot two of the three features
plotDataSet.plotDataSet(datingDataSetOri, labels)
# normalize every feature to [0, 1]
datingDataSet, dataSetMin, dataSetMax = autoNorm.autoNorm(datingDataSetOri)
print('datingDataSet:\n', datingDataSet)
# measure the error rate on the held-out 10%
errorRate = datingClassTest.datingClassTest(datingDataSet, labels)
print('errorRate:', errorRate)
# predict for a person entered by the user
classifyPerson.classifyPerson(datingDataSet, dataSetMin, dataSetMax, labels)
Summary:
Advantages of kNN: the algorithm is simple and easy to implement.
Disadvantages of kNN: 1. prediction cost grows linearly with the number of samples, and also linearly with the number of features; 2. there is no training phase, so the algorithm cannot learn a compact feature representation of the samples.