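I. The kNN algorithm
1.1 Overview of the kNN algorithm
The k-nearest neighbors algorithm (K-Nearest Neighbor, abbreviated KNN) is a classic machine learning algorithm. Its principle is simple: for a new sample, it finds the K samples in the existing data that are most similar, or "closest", to it; if most of those K samples belong to one class, the new sample is assigned to that class.
1.2 Introduction to the kNN algorithm
KNN stands for K Nearest Neighbors, and the name already hints at how the algorithm works. Since it relies on the K nearest neighbors, the value of K clearly matters. As for "nearest neighbors": to predict the class of a new point x, KNN looks at the classes of the K points closest to x and assigns x to the majority class among them, as shown in the figure below.
Figure 1
The example shows that the choice of K matters. The next two subsections cover how to choose k and how to compute the distance between points.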
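1.2.1 Choosing k
If k is too small, noise has a large effect on the prediction. With k = 1, for example, the prediction is wrong whenever the single nearest point happens to be noise. A smaller k makes the overall model more complex and prone to overfitting (very high accuracy on the training set but low accuracy on the test set), so it fails to capture the true distribution of the data.
If k is too large, the prediction is made from training instances in a larger neighborhood, and the approximation error of learning increases: instances far from the input point also take part in the vote and can make the prediction wrong. A larger k makes the overall model simpler. In the extreme case k = N (N being the number of training samples), every input is assigned to the most frequent class in the training set, regardless of its features. At that point no real model is being used: the prediction is simply the largest class in the training data.
So k should be neither too large nor too small; in practice it is chosen by experiment (parameter tuning). An odd k is often preferred because it rules out tied votes when there are two classes. With three or more classes a tie is still possible even with an odd k, so the tie-breaking behavior of the implementation matters as well.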
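To make the effect of k concrete, here is a small sketch on made-up data. The helper `knn_predict`, the points, and the labels are all hypothetical and not part of the dating example; one point labeled "B" is placed inside the "A" cluster to play the role of noise.

```python
import math
from collections import Counter

def knn_predict(query, points, labels, k):
    """Majority vote among the k training points closest to query."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class "A" clusters near the origin, class "B" near (5, 5)
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3), (4.9, 4.7),
          (0.15, 0.15)]
# The last point lies inside the "A" cluster but is labeled "B" (noise)
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]

query = (0.16, 0.16)
for k in (1, 3, len(points)):
    print(k, knn_predict(query, points, labels, k))
```

With k = 1 the noise point decides the vote ("B"), with k = 3 the two nearby "A" points outvote it, and with k = N the answer is just the largest class in the training data ("B").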
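1.2.2 Computing the distance between points
This experiment uses the Euclidean distance. For two points in n-dimensional space (two n-dimensional vectors), a(x11, x12, …, x1n) and b(x21, x22, …, x2n), the Euclidean distance is:
$$d(a, b) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$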
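As a minimal sketch, the Euclidean distance between two n-dimensional points can be computed as follows. The function name `euclidean_distance` is chosen here for illustration only.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared coordinate differences, then take the square root
    return math.sqrt(sum((x1 - x2) ** 2 for x1, x2 in zip(a, b)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```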
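1.3 General procedure of the k-nearest neighbors algorithm
- Compute the distance between each point in the labeled data set and the current point;
- Sort the points by increasing distance;
- Select the k points closest to the current point;
- Count how often each class occurs among these k points;
- Return the most frequent class among the k points as the predicted class of the current point.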
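The five steps above can be sketched in a few lines of NumPy. This is an illustrative version on a tiny made-up data set; the name `knn_classify` and the sample points are assumptions, not the code used in the experiment below.

```python
from collections import Counter

import numpy as np

def knn_classify(query, data, labels, k):
    # Step 1: distance from every labeled point to the query point
    distances = np.sqrt(((data - query) ** 2).sum(axis=1))
    # Steps 2-3: sort by increasing distance and keep the k nearest indices
    nearest = np.argsort(distances)[:k]
    # Step 4: count how often each class appears among the k neighbors
    votes = Counter(labels[i] for i in nearest)
    # Step 5: return the most frequent class
    return votes.most_common(1)[0][0]

# Hypothetical labeled points: two near (1, 1) and two near the origin
data = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]
print(knn_classify(np.array([0.1, 0.2]), data, labels, k=3))  # B
```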
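II. A kNN example
2. Code walkthrough
Helen has been using an online dating site to look for a suitable partner. The site recommends different candidates, but she does not like all of them. After some reflection, she found that the people she had dated fall into three groups: people she did not like, people of average charm, and very charming people.
The samples Helen collected have three features:
- Frequent flyer miles earned per year
- Percentage of time spent playing video games
- Liters of ice cream consumed per week
2.1 Preparing the data
2.1.1 Importing the data
Helen has been collecting dating data for some time and stores it in the text file datingTestSet.txt, one sample per line, 1000 lines in total. Reference link: https://github.com/Jack-Cherish/Machine-Learning/blob/master/kNN/2.%E6%B5%B7%E4%BC%A6%E7%BA%A6%E4%BC%9A/kNN_test02.py
2.1.2 Parsing the text records into NumPy arrays and displaying the data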
import numpy as np

def file2matrix(filename):
    # Read all lines of the data file
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)  # number of lines in the file
    returnMat = np.zeros((numberOfLines, 3))  # NumPy feature matrix to return
    classLabelVector = []  # class labels to return
    index = 0
    for line in arrayOLines:
        # strip() removes the surrounding whitespace, including the newline
        line = line.strip()
        # split('\t') cuts the line into a list of fields at each tab
        listFromLine = line.split('\t')
        # The first three fields are the features
        returnMat[index, :] = listFromLine[0:3]
        # Index -1 is the last field, the text label found in the data file
        if listFromLine[-1] == 'didntLike':
            classLabelVector.append(1)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        index += 1
    return returnMat, classLabelVector

if __name__ == '__main__':
    # Name of the data file
    filename = "datingTestSet.txt"
    # Load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)
- Result
Figure 2
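In file2matrix, a line whose label is not one of the three known strings gets no entry in the label list, so features and labels can silently fall out of step. A possible alternative, sketched here under assumptions, is a dictionary lookup that fails loudly. The names `LABEL_TO_CLASS` and `parse_record` are hypothetical, and the sample line is made up in the same tab-separated layout.

```python
# Mapping from the text label in the last column to the integer class
LABEL_TO_CLASS = {"didntLike": 1, "smallDoses": 2, "largeDoses": 3}

def parse_record(line):
    """Parse one tab-separated record into (features, class label)."""
    fields = line.strip().split("\t")
    features = [float(value) for value in fields[0:3]]
    label = fields[-1]
    if label not in LABEL_TO_CLASS:
        # Fail loudly instead of skipping the label
        raise ValueError(f"unknown label: {label!r}")
    return features, LABEL_TO_CLASS[label]

# Made-up sample line: miles, game-time percentage, liters of ice cream, label
sample = "40920\t8.326976\t0.953952\tlargeDoses\n"
print(parse_record(sample))
```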
2.2 Analyzing the data: creating a scatter plot with Matplotlib
import numpy as np
import matplotlib.pyplot as plt

# file2matrix(filename) is the function defined in section 2.1.2

if __name__ == '__main__':
    # Name of the data file
    filename = "datingTestSet.txt"
    # Load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)
    # Visualize the data
    fig = plt.figure()
    # 111 splits the canvas into 1 row and 1 column and draws in the first cell,
    # counting from left to right and top to bottom
    ax = fig.add_subplot(111)
    # Column 1 (video game time percentage) against column 2 (liters of ice cream)
    ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
    plt.show()
- Result
Figure 3
2.3 Preparing the data: normalizing the values
import numpy as np

# file2matrix(filename) is the function defined in section 2.1.2

# Normalize the feature values
def autoNorm(dataSet):
    minVals = dataSet.min(0)  # minimum of each column
    maxVals = dataSet.max(0)  # maximum of each column
    ranges = maxVals - minVals  # value range of each feature
    m = dataSet.shape[0]
    # Subtract the minimum from each value, then divide by the range
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))  # element-wise division
    return normDataSet, ranges, minVals

if __name__ == '__main__':
    # Name of the data file
    filename = "datingTestSet.txt"
    # Load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)
    # Normalize the feature values
    normDataSet, ranges, minVals = autoNorm(datingDataMat)
    print(normDataSet)
    print(ranges)
    print(minVals)
- Result
Figure 4
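The same min-max normalization can be written more compactly with NumPy broadcasting, which makes the np.tile calls unnecessary. The sketch below is one possible variant, not the code used in the experiment: the name `min_max_normalize` is hypothetical, and mapping a constant column to 0 (instead of dividing by zero) is an assumption added here.

```python
import numpy as np

def min_max_normalize(data):
    """Scale every column to [0, 1] as (value - min) / (max - min)."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Assumption: a constant column (range 0) is mapped to 0 rather than NaN
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    # Broadcasting applies the per-column vectors to every row
    return (data - min_vals) / safe_ranges, ranges, min_vals

# Made-up data: the third column is constant
data = np.array([[0.0, 10.0, 5.0],
                 [50.0, 20.0, 5.0],
                 [100.0, 30.0, 5.0]])
norm, ranges, min_vals = min_max_normalize(data)
print(norm)
```

Each of the first two columns becomes 0, 0.5, 1, and the constant third column becomes all zeros.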
2.4 Testing the algorithm: validating the classifier as a complete program
import operator
import numpy as np

# file2matrix(filename) is the function defined in section 2.1.2
# autoNorm(dataSet) is the function defined in section 2.3

# kNN classifier
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # Indices of the points sorted by increasing distance
    sortedDistIndices = distances.argsort()
    classCount = {}
    # Count the labels of the k nearest points
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Turn classCount into a list of (label, count) tuples and sort it by count,
    # largest first (itemgetter(1) selects the count)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# Test the algorithm: validate the classifier as a complete program
def datingClassTest():
    filename = "datingTestSet.txt"
    datingDataMat, datingLabels = file2matrix(filename)
    hoRatio = 0.10  # fraction of the data held out for testing
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)  # number of test samples
    errorCount = 0.0
    for i in range(numTestVecs):
        # The first numTestVecs rows are the test set, the remaining rows the training set
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 4)
        print("the classifier came back with: %d,the real answer is :%d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

if __name__ == '__main__':
    # Name of the data file
    filename = "datingTestSet.txt"
    # Load and parse the data
    datingDataMat, datingLabels = file2matrix(filename)
    print(datingDataMat)
    print(datingLabels)
    # Normalize the feature values
    normDataSet, ranges, minVals = autoNorm(datingDataMat)
    print(normDataSet)
    print(ranges)
    print(minVals)
    # Test the algorithm
    datingClassTest()
- Result
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :3
the classifier came back with: 1,the real answer is :1
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :3
the classifier came back with: 3,the real answer is :3
the classifier came back with: 2,the real answer is :2
the classifier came back with: 1,the real answer is :1
the classifier came back with: 3,the real answer is :1
the total error rate is: 0.040000
The reported error rate is 4%. Changing hoRatio and k changes the error rate.
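One way to see how the hold-out ratio and k interact is to sweep both and compare the error rates. The sketch below does this on synthetic data, not on datingTestSet.txt; the names `knn_predict` and `holdout_error`, the three cluster centers, and the random seed are all assumptions made for illustration. Ties are resolved in favor of the smallest label, which is a choice of this sketch.

```python
import numpy as np

def knn_predict(query, data, labels, k):
    """Majority vote among the k nearest rows of data (ties go to the smallest label)."""
    distances = np.sqrt(((data - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[np.argmax(counts)]

def holdout_error(data, labels, ho_ratio, k):
    """Use the first ho_ratio share of the rows as test set, the rest as training set."""
    n_test = int(len(data) * ho_ratio)
    if n_test == 0:
        raise ValueError("ho_ratio is too small for this data set")
    errors = 0
    for i in range(n_test):
        predicted = knn_predict(data[i], data[n_test:], labels[n_test:], k)
        errors += int(predicted != labels[i])
    return errors / n_test

# Synthetic data: three overlapping Gaussian clusters with labels 0, 1, 2
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = rng.integers(0, 3, size=300)
data = centers[labels] + rng.normal(scale=1.0, size=(300, 2))

for ho_ratio in (0.1, 0.2):
    for k in (1, 3, 5, 9):
        print(ho_ratio, k, holdout_error(data, labels, ho_ratio, k))
```

Because the test set here is simply the first rows of the data, the result also depends on the order of the samples; shuffling before the split is a common refinement.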
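2.5 Using the algorithm: building a complete, usable system
# Uses file2matrix, autoNorm and classify0 from the previous sections
def classifyPerson():
    resultList = ['dislike', 'somewhat like', 'really like']
    percentTats = float(input("Percentage of time spent playing video games: "))
    ffMiles = float(input("Frequent flyer miles earned per year: "))
    iceCream = float(input("Liters of ice cream consumed per week: "))
    filename = "datingTestSet.txt"
    datingDataMat, datingLabels = file2matrix(filename)
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # The feature order must match the columns of the data file:
    # flyer miles, game-time percentage, liters of ice cream
    inArr = np.array([ffMiles, percentTats, iceCream])
    # Scale the new sample with the minimum and range of the training data
    norminArr = (inArr - minVals) / ranges
    classifierResult = classify0(norminArr, normMat, datingLabels, 3)
    print("You will probably %s this person" % (resultList[classifierResult - 1]))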
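A new sample has to be scaled with the minimum and range of the training data, not with its own values, so that it lands in the same coordinate system as the normalized training set. The following sketch illustrates this on two made-up training rows; `normalize_query` and the numbers are hypothetical.

```python
import numpy as np

# Made-up training rows: flyer miles, game-time percentage, liters of ice cream
train = np.array([[1000.0, 2.0, 0.5],
                  [50000.0, 10.0, 1.5]])
min_vals = train.min(axis=0)
ranges = train.max(axis=0) - min_vals

def normalize_query(sample, min_vals, ranges):
    """Scale a new sample with the statistics of the training data."""
    return (np.asarray(sample, dtype=float) - min_vals) / ranges

# A sample halfway between the training minimum and maximum in every feature
print(normalize_query([25500.0, 6.0, 1.0], min_vals, ranges))
# A sample outside the training range ends up outside [0, 1]
print(normalize_query([99000.0, 2.0, 0.5], min_vals, ranges))
```

The first sample maps to 0.5 in every feature; the second maps to 2.0 in the first feature, which shows that values outside the training range are not clipped.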
2.6 Code result
Figure 5
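III. Summary
3. Experiment summary
3.1 Errors encountered during the experiment
Error 1:
FileNotFoundError: [Errno 2] No such file or directory: 'datingTestSet.txt'
datingDataMat,datingLabels = kNN.file2matrix('datingTestSet.txt')
Solution: put datingTestSet.txt in the same folder as the .py file (the relative path is resolved against the working directory the script is run from), or pass the full path to the file.
Error 2:
ValueError: invalid literal for int() with base 10: 'largeDoses'
Solution: this error appears when the last column is converted with int() while reading datingTestSet.txt, whose last column is a text label; in datingTestSet2.txt the last column is numeric. Either read datingTestSet2.txt when converting with int(), or keep datingTestSet.txt and map the text labels to integers, as file2matrix does here.
Error 3:
AttributeError: 'dict' object has no attribute 'iteritems'
Solution: Python 2.x has dict.iteritems(), which was removed in Python 3.x; replace it with items().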
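For reference, a short sketch of the Python 3 form of the vote-sorting step. The dictionary contents are made up, and the `max` variant is an alternative offered here, not something used in the experiment.

```python
import operator

# Hypothetical vote counts: class label -> number of votes
class_count = {1: 2, 3: 1}

# Python 3: dict.items() takes the place of the Python 2 dict.iteritems()
sorted_counts = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_counts[0][0])  # 1, the label with the most votes

# Alternative that avoids sorting the whole dictionary
print(max(class_count, key=class_count.get))  # 1
```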
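3.2 Reflections on the experiment
The k-nearest neighbors (kNN) algorithm is a simple and effective machine learning algorithm that is commonly used for classification and regression. The basic idea is to compute the distance between the test sample and the training samples, take the labels of the k nearest samples, and use the most frequent label among them as the prediction for the test sample.
- Advantages of KNN:
1. Simple and easy to use. Compared with other algorithms, KNN is concise and transparent, and its principle can be understood without a strong mathematical background.
2. Fast training. KNN is a lazy learner: it has no explicit training phase and simply stores the training data.
3. Often gives good predictions.
4. Relatively insensitive to outliers, provided k is not too small.
- Disadvantages of KNN:
1. High memory requirements, because the algorithm stores all the training data.
2. The prediction phase can be slow, because each query is compared with every training sample.
3. Sensitive to irrelevant features and to the scale of the data.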
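The sensitivity to scale is easy to check with a little arithmetic. In the sketch below the two samples are made up, with features in the order flyer miles, game-time percentage, and liters of ice cream; it computes how much of the squared Euclidean distance comes from the first feature when the values are not normalized.

```python
import math

# Two made-up, unnormalized samples: miles, game-time percentage, liters of ice cream
a = (40000.0, 8.0, 0.9)
b = (41000.0, 1.0, 0.1)

squared_terms = [(x - y) ** 2 for x, y in zip(a, b)]
distance = math.sqrt(sum(squared_terms))
# Share of the squared distance contributed by the miles feature
share = squared_terms[0] / sum(squared_terms)
print(distance, share)
```

The miles feature contributes more than 99.9% of the squared distance here, so the other two features have almost no say in which neighbors are nearest unless the values are normalized first.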
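In this experiment, we studied the kNN algorithm and applied it to dating-site data. Working through data preparation, analysis, and preprocessing, we dealt with several common data-handling problems. The experiment made clear to me how strongly data quality and feature selection affect an algorithm's performance.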