Notes_on_MLIA_kNN

Kylin-Xu


# k-nearest neighbor algorithm
# function classify0
# arguments:
#   inX: the new observation to be labeled by the algorithm
#   dataSet: training samples, one row per sample
#   labels: labels of the training samples
#   k: number of nearest neighbors that vote
import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # indices of the training samples, nearest first
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() works on Python 2 and 3; iteritems() exists only on Python 2
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

 
 .shape returns the length of the array along each dimension. Indexing in Python starts at 0, so dataSet.shape[0] is the number of rows.
 
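 tile comes from the numpy package and repeats an array. In the code above, tile(inX, (dataSetSize, 1)) repeats inX dataSetSize times along the rows and once along the columns, i.e. no repetition across columns.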
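As a small illustration of the tile call, here is a sketch on made-up data (the 3x2 dataSet and the inX values are hypothetical, not from the book). It also checks that NumPy broadcasting, an alternative the original code does not use, produces the same difference matrix without tile.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
dataSet = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.1]])
inX = np.array([0.0, 0.0])

# Repeat inX once per training row: 3 copies down, 1 copy across
tiled = np.tile(inX, (dataSet.shape[0], 1))
print(tiled.shape)  # (3, 2)

diff_tile = tiled - dataSet
# Broadcasting stretches inX across the rows automatically
diff_broadcast = inX - dataSet
print(np.array_equal(diff_tile, diff_broadcast))  # True
```

Either form feeds the same values into the squared-distance step; tile just makes the repetition explicit.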
 
 .sum is numpy's method for summing an array along an axis. axis=1 sums across the columns, that is, it adds up the elements of each row.
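The axis argument is easy to mix up, so here is a quick check on a made-up 2x3 array (the values are hypothetical):

```python
import numpy as np

# Hypothetical 2x3 array
a = np.array([[1, 2, 3],
              [4, 5, 6]])

row_sums = a.sum(axis=1)  # collapses the column axis: one sum per row
col_sums = a.sum(axis=0)  # collapses the row axis: one sum per column
print(row_sums.tolist())  # [6, 15]
print(col_sums.tolist())  # [5, 7, 9]
```

In classify0 each row is one training sample, so axis=1 yields one squared distance per sample.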
 
 .argsort is a numpy method that returns the indices that would sort the array in ascending order; the array itself is left unchanged.
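A short sketch of what argsort gives back, using made-up distance values:

```python
import numpy as np

# Hypothetical distances from inX to four training samples
distances = np.array([1.49, 1.41, 0.0, 0.1])

order = distances.argsort()  # positions of the samples, nearest first
print(order.tolist())             # [2, 3, 1, 0]
print(distances[order].tolist())  # [0.0, 0.1, 1.41, 1.49]
print(distances.tolist())         # [1.49, 1.41, 0.0, 0.1], input is unchanged
```

This is why classify0 looks up labels[sortedDistIndicies[i]]: the sorted result holds positions, not distances.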
 
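 classCount = {} creates an empty dictionary. The dictionary's get method returns the value stored under a key, or the given default (0 here) when the key is missing, so the assignment to classCount works as a counter.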
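The counting idiom can be tried on its own. The vote list below is made up, and collections.Counter is shown only as a standard-library alternative that the original code does not use.

```python
from collections import Counter

# Hypothetical labels of the k nearest neighbors
votes = ['B', 'A', 'B']

classCount = {}
for label in votes:
    # get returns 0 the first time a label is seen
    classCount[label] = classCount.get(label, 0) + 1

print(classCount['B'], classCount['A'])  # 2 1
print(Counter(votes) == classCount)      # True
```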
 
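 sorted is a built-in function; run help(sorted) to see how to use it.
 
 itemgetter from the operator module does what its name says: itemgetter(X) fetches the element at index X.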
 
 
This code shows the classic way to sort a dictionary by value. A lambda function works as the key too. For more on sorting in Python, see: https://wiki.python.org/moin/HowTo/Sorting/
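Both ways of sorting a dictionary by value can be compared side by side. The vote counts below are made up for illustration.

```python
import operator

# Hypothetical vote counts
classCount = {'A': 1, 'B': 2, 'C': 1}

# key picks element 1 of each (label, count) pair, i.e. the count
by_itemgetter = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
by_lambda = sorted(classCount.items(), key=lambda pair: pair[1], reverse=True)

print(by_itemgetter[0][0])         # B
print(by_lambda == by_itemgetter)  # True
```

sorted is stable, so labels with equal counts keep their original relative order. In a tie for first place, the winner therefore depends on insertion order.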
 
 2.2 The function that reads the txt file has a small bug
 
 
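from numpy import zeros

def file2matrix(filename):
    # read all lines up front so the matrix can be preallocated
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()  # strips leading/trailing whitespace, tabs included
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]  # first three columns are the features
        classLabelVector.append(int(listFromLine[-1]))  # int() requires a numeric label
        index += 1
    return returnMat, classLabelVector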

 
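 This function calls line.strip() with no argument, which strips all leading and trailing whitespace, '\t' included. Tabs inside the line are not touched, so split('\t') still works on well-formed lines, but a line whose first or last field is empty loses a column. Changing it to line.strip('\n') removes only the newline. Also, the for loop in the original listing was missing its colon.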
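To see exactly what strip() removes, here is a small check on two made-up tab-separated lines (the values are hypothetical). Only the line with an empty first field behaves differently.

```python
# Hypothetical tab-separated lines
well_formed = "40920\t8.3\t0.95\t3\n"
empty_first = "\t8.3\t0.95\t3\n"

# Interior tabs survive strip(), so a well-formed line splits the same way
print(well_formed.strip().split('\t'))  # ['40920', '8.3', '0.95', '3']
print(well_formed.strip().split('\t') == well_formed.strip('\n').split('\t'))  # True

# A leading tab is stripped by strip() but kept by strip('\n')
fields_default = empty_first.strip().split('\t')
fields_newline = empty_first.strip('\n').split('\t')
print(len(fields_default), len(fields_newline))  # 3 4
```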
 
 
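Another bug: when building the label list, the int() call has to go. int() raises a ValueError whenever the last column holds a text label rather than digits.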
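Putting both fixes together, a version of the reader might look like the sketch below. The name file2matrix_fixed and the two-line sample are hypothetical, and the sketch assumes every line has three numeric columns followed by a label.

```python
import os
import tempfile
import numpy as np

def file2matrix_fixed(filename):
    """Sketch of file2matrix with strip('\\n') and labels kept as strings."""
    with open(filename) as fr:
        lines = fr.readlines()
    returnMat = np.zeros((len(lines), 3))
    classLabelVector = []
    for index, line in enumerate(lines):
        fields = line.strip('\n').split('\t')  # remove only the newline
        returnMat[index, :] = [float(x) for x in fields[0:3]]
        classLabelVector.append(fields[-1])    # no int(): label stays a string
    return returnMat, classLabelVector

# Hypothetical sample with text labels, written to a temporary file
sample = "40920\t8.3\t0.95\tlargeDoses\n14488\t7.15\t1.67\tsmallDoses\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

mat, labels = file2matrix_fixed(path)
os.remove(path)
print(mat.shape)  # (2, 3)
print(labels)     # ['largeDoses', 'smallDoses']
```

If the label column is known to be numeric, converting it with int() afterwards is still possible; the sketch just avoids assuming so.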