Machine Learning Notes 1: Implementing the k-Nearest Neighbors Algorithm

The k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distances between feature values.
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high computational complexity, high space complexity.
Applicable data: numeric and nominal values.
The steps are as follows:
1. Compute the distance between the current point and every point in the dataset of known classes.
2. Sort the distances in increasing order.
3. Select the k points closest to the current point.
4. Count the frequency of each class among those k points.
5. Return the most frequent class among the k points as the predicted class of the current point.
Distance between points A and B in a two-dimensional coordinate system: [(xA0-xB0)^2+(xA1-xB1)^2]^(1/2)
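As a quick check of the formula (a small sketch that mirrors the distance computation done in classify0 below), the distance between an input point [0.3, 0.5] and the training point [0, 0.1] is [(0.3-0)^2+(0.5-0.1)^2]^(1/2) = 0.25^(1/2) = 0.5:
import numpy

# squared differences per coordinate, summed, then square-rooted
diff = numpy.array([0.3, 0.5]) - numpy.array([0, 0.1])
print((diff ** 2).sum() ** 0.5)   # prints 0.5 (up to floating-point rounding)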
Suppose the training set is [[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]] and the label vector is ['A','A','B','B']. Given an input vector, determine which class it belongs to:
import numpy
import operator


def createDataSet():
    # training set
    group = numpy.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    # label vector
    labels = ['A', 'A', 'B', 'B']
    return group, labels


def classify0(inX, dataSet, labels, k):
    # number of rows in the training set
    dataSetSize = dataSet.shape[0]
    # differences between the input vector and every training sample
    diffMat = numpy.tile(inX, (dataSetSize, 1)) - dataSet
    # Euclidean distance from the input to each training sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distance = sqDistances ** 0.5
    # indices that sort the distances in ascending order
    sortedDistIndicies = distance.argsort()
    classCount = {}
    # count how many of the k nearest neighbors belong to each class
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the classes by vote count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),
                              reverse=True)
    # return the class with the most votes (the class the input should belong to)
    return sortedClassCount[0][0]


def test():
    print(classify0([0.3, 0.5], group, labels, 2))


if __name__ == '__main__':
    group = numpy.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    test()
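Running this script should print B: the two nearest neighbors of the input [0.3, 0.5] are [0, 0.1] (distance 0.5) and [0, 0] (distance about 0.58), and both of them carry the label 'B'.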
A few functions used above that may be unfamiliar:
shape: returns the dimensions of an array; for example, shape[0] is the length of the first dimension (the number of rows).
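A quick illustration (assuming numpy has already been imported, as in the other examples below):
>>> numpy.array([[1,2],[3,4]]).shape
(2, 2)
>>> numpy.array([[1,2],[3,4]]).shape[0]
2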
tile: tile(x, y) repeats the array x, y times; for example:
>>> numpy.tile([1,1],10)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> numpy.tile([2,1],[2,3])
array([[2, 1, 2, 1, 2, 1],
  [2, 1, 2, 1, 2, 1]])
sum(axis=1): sums the array elements along the given axis:
>>> a = numpy.array([1,2])
>>> a.sum()
3
>>> a.sum(axis=1)
Traceback (most recent call last):
 File "<pyshell#27>", line 1, in <module>
a.sum(axis=1)
 File "F:\python3\lib\site-packages\numpy\core\_methods.py", line 32, in _sum
return umr_sum(a, axis, dtype, out, keepdims)
ValueError: 'axis' entry is out of bounds
>>> numpy.array([[1,2,4],[2,4,5]]).sum(axis=1)
array([ 7, 11])
argsort(): returns the indices that would sort the array in ascending order of its values
>>> a = numpy.array([8,6,7,9,10,5,7])
>>> a.argsort()
array([5, 1, 2, 6, 0, 3, 4], dtype=int32)
items(): returns the dictionary's (key, value) pairs (as a view object in Python 3)
itemgetter(): operator.itemgetter(n) returns a callable that fetches the item at index n from its operand; here it is used as the sort key to sort by vote count
sorted(): sorted(iterable, key=None, reverse=False) sorts the elements of an iterable and returns a new sorted list
iterable -- the iterable object to sort.
key -- a function of one argument, applied to each element of the iterable, that extracts the value to compare by.
reverse -- sort order: reverse=True for descending, reverse=False (the default) for ascending.
(Python 2 also accepted a cmp parameter, a two-argument comparison function returning 1, -1 or 0; it no longer exists in Python 3, which the code here targets.)
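To make the combination of items(), itemgetter() and sorted() concrete, here is a small illustration of the vote-sorting step in classify0 (the dictionary contents are made up for the example):
>>> import operator
>>> classCount = {'A': 1, 'B': 2}
>>> sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
[('B', 2), ('A', 1)]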
Example: predict how attractive a person is from the percentage of time they spend playing video games, the number of frequent-flyer miles they earn per year, and the liters of ice cream they consume per week.
1. Parse the data. The dataset given in the book has one issue: the class labels are strings, so they need to be converted to integers:
    def getValueOfClassLabel(ClassLabel):
        if ClassLabel not in ValueOfClassLabel.keys():
            ValueOfClassLabel[ClassLabel] = Value.pop()
        return ValueOfClassLabel[ClassLabel]
The complete file-parsing code:
def file2matrix(filename):
    '''
    Parse the training-set file.
    '''
    ValueOfClassLabel = {}
    Value = [1, 2, 3]

    def getValueOfClassLabel(ClassLabel):
        # assign a new integer the first time a label string is seen
        if ClassLabel not in ValueOfClassLabel.keys():
            ValueOfClassLabel[ClassLabel] = Value.pop()
        return ValueOfClassLabel[ClassLabel]

    file = open(filename)
    arrayOLines = file.readlines()
    file.close()
    # number of lines in the file
    numberOfLines = len(arrayOLines)
    # training set to be returned
    returnMat = numpy.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFormLine = line.split('\t')
        returnMat[index, :] = listFormLine[0:3]
        classLabelVector.append(getValueOfClassLabel(str(listFormLine[-1])))
        index += 1
    return returnMat, classLabelVector
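A quick usage sketch, assuming each line of datingTestSet.txt is tab-separated with three numeric features followed by a string class label (the sample line below is made up for illustration):
# a line of the file might look like:
# 40920	8.326976	0.953952	largeDoses
datingDataMat, datingLabels = file2matrix('f:\\datingTestSet.txt')
print(datingDataMat.shape)   # (number of lines in the file, 3)
print(datingLabels[:5])      # integer labels; the first distinct label string gets 3, the next 2, then 1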
The above program formats the file contents into the training set and label vector we need. Plotting the data lets us judge the relationships between the features visually.
import numpy
import kNN
import matplotlib
import matplotlib.pyplot as plt


fig = plt.figure()
ax = fig.add_subplot(111)
datingDataMat,datingLabels = kNN.file2matrix('f:\\datingTestSet.txt')
ax.scatter(datingDataMat[:,1],datingDataMat[:,2],
15.0*numpy.array(datingLabels),15.0*numpy.array(datingLabels))
plt.xlabel('Percentage of Time Spent Playing Video Games')
plt.ylabel('Liters of Ice Cream Consumed Per Week')
plt.show()
3-D plot:
import numpy
import kNN
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure()
ax = fig.add_subplot(111,projection='3d')
datingDataMat,datingLabels = kNN.file2matrix('f:\\datingTestSet.txt')
ax.scatter(datingDataMat[:,0],datingDataMat[:,1],datingDataMat[:,2],
           s=15.0*numpy.array(datingLabels),c=15.0*numpy.array(datingLabels))
ax.set_xlabel('Frequent Flyer Miles Earned Per Year')
ax.set_ylabel('Percentage of Time Spent Playing Video Games')
ax.set_zlabel('Liters of Ice Cream Consumed Per Week')
plt.show()
Multiple subplots:
import numpy
import kNN
import matplotlib
import matplotlib.pyplot as plt


fig = plt.figure()
ax1 = fig.add_subplot(311)
datingDataMat,datingLabels = kNN.file2matrix('f:\\datingTestSet.txt')
ax1.scatter(datingDataMat[:,0],datingDataMat[:,1],
  15.0*numpy.array(datingLabels),15.0*numpy.array(datingLabels))
ax1.set_xlabel('fly')
ax2 = fig.add_subplot(312)
ax2.scatter(datingDataMat[:,0],datingDataMat[:,2],
  15.0*numpy.array(datingLabels),15.0*numpy.array(datingLabels))
ax3 = fig.add_subplot(313)
ax3.scatter(datingDataMat[:,1],datingDataMat[:,2],
  15.0*numpy.array(datingLabels),15.0*numpy.array(datingLabels))
plt.show()
Functions that may be unfamiliar:
add_subplot: specifies the position of a subplot; for example, 111 means the figure is split into a 1x1 grid and we draw on the first (and only) subplot, while 311/312/313 above place three subplots in a 3x1 grid.
scatter: draws a scatter plot; the x and y coordinates are required, and color, marker size, etc. are optional.
zeros: creates an array filled with zeros.
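For instance, zeros is how file2matrix pre-allocates returnMat (a quick illustration, assuming numpy has been imported):
>>> numpy.zeros((2, 3)).shape
(2, 3)
>>> numpy.zeros((2, 3)).sum()
0.0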
Normalization:
When features have very different value ranges, the values usually need to be normalized. If every feature is rescaled to the range 0 to 1 (or -1 to 1), the following formula converts a feature value from an arbitrary range into the interval 0 to 1:
newValue = (oldValue - min) / (max - min)
where min and max are the smallest and largest values of that feature in the dataset. The program is as follows:
def autoNum(dataSet):
    # minimum of each column
    minVals = dataSet.min(0)
    # maximum of each column
    maxVals = dataSet.max(0)
    # range (max minus min) of each column
    ranges = maxVals - minVals

    # normalize every row
    normDataSet = numpy.zeros(numpy.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - numpy.tile(minVals, (m, 1))
    normDataSet = normDataSet / numpy.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
An easy point of confusion: min(0) returns the minimum of each column (not the minimum of column 0), min() returns the minimum over all values, and min(1) returns the minimum of each row.
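A small illustration of the difference (assuming numpy has been imported):
>>> a = numpy.array([[1, 2], [3, 4]])
>>> a.min()
1
>>> a.min(0)
array([1, 2])
>>> a.min(1)
array([1, 3])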
Test program:
def datingClassTest():
    '''
    Test the classifier.
    '''
    hoRatio = 0.10
    datingDataMating, datingLabels = file2matrix('f:\\datingTestSet.txt')
    normMat, ranges, minVals = autoNum(datingDataMating)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print('the classifier came back with: %d,the real answer is:%d' %
              (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print('the total error rate is:%f' % (errorCount / float(numTestVecs)))
Test results:
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:3
the classifier came back with: 1,the real answer is:1
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:3
the classifier came back with: 3,the real answer is:3
the classifier came back with: 2,the real answer is:2
the classifier came back with: 1,the real answer is:1
the classifier came back with: 3,the real answer is:1
the total error rate is:0.050000
The complete program (runs under Python 3):
import numpy
import operator


def createDataSet():
    '''
    Return a training set and a label vector.
    '''
    # training set
    group = numpy.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    # label vector
    labels = ['A', 'A', 'B', 'B']
    return group, labels


def classify0(inX, dataSet, labels, k):
    '''
    The kNN classifier: takes an input vector, a training set, a label
    vector and a value of k, and returns the predicted class of the input.
    '''
    # number of rows in the training set
    dataSetSize = dataSet.shape[0]
    # differences between the input vector and every training sample
    diffMat = numpy.tile(inX, (dataSetSize, 1)) - dataSet
    # Euclidean distance from the input to each training sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distance = sqDistances ** 0.5
    # indices that sort the distances in ascending order
    sortedDistIndicies = distance.argsort()
    classCount = {}
    # count how many of the k nearest neighbors belong to each class
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the classes by vote count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),
                              reverse=True)
    # return the class with the most votes (the class the input should belong to)
    return sortedClassCount[0][0]


def file2matrix(filename):
    '''
    Parse the training-set file.
    '''
    ValueOfClassLabel = {}
    Value = [1, 2, 3]

    def getValueOfClassLabel(ClassLabel):
        # assign a new integer the first time a label string is seen
        if ClassLabel not in ValueOfClassLabel.keys():
            ValueOfClassLabel[ClassLabel] = Value.pop()
        return ValueOfClassLabel[ClassLabel]

    file = open(filename)
    arrayOLines = file.readlines()
    file.close()
    # number of lines in the file
    numberOfLines = len(arrayOLines)
    # training set to be returned
    returnMat = numpy.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFormLine = line.split('\t')
        returnMat[index, :] = listFormLine[0:3]
        classLabelVector.append(getValueOfClassLabel(str(listFormLine[-1])))
        index += 1
    return returnMat, classLabelVector


def autoNum(dataSet):
    '''
    Normalize the data set to the range 0 to 1.
    '''
    # minimum of each column
    minVals = dataSet.min(0)
    # maximum of each column
    maxVals = dataSet.max(0)
    # range (max minus min) of each column
    ranges = maxVals - minVals

    # normalize every row
    normDataSet = numpy.zeros(numpy.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - numpy.tile(minVals, (m, 1))
    normDataSet = normDataSet / numpy.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals


def datingClassTest():
    '''
    Test the classifier.
    '''
    hoRatio = 0.10
    datingDataMating, datingLabels = file2matrix('f:\\datingTestSet.txt')
    normMat, ranges, minVals = autoNum(datingDataMating)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print('the classifier came back with: %d,the real answer is:%d' %
              (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print('the total error rate is:%f' % (errorCount / float(numTestVecs)))


def test():
    datingClassTest()


if __name__ == '__main__':
    test()
