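First, a look at the numpy functions and Python syntax used in the algorithm.
1. Tuples and lists in Python, and arrays in numpy
Python has no built-in array type, but it does have tuples and lists. Both are initialized below. A tuple cannot be changed after it is created; a list can.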
testTuple = ((1, 2, 3), (4, 5, 6), (7, 8, 9))  # initialized with ()
testList = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # initialized with []
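The difference in mutability can be checked directly. The following is a minimal sketch that reuses the names testTuple and testList from above; the flag tuple_is_immutable is a hypothetical name introduced only for this illustration.

```python
testTuple = ((1, 2, 3), (4, 5, 6), (7, 8, 9))
testList = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# A list can be modified in place.
testList[0] = [10, 20, 30]

# A tuple cannot: item assignment raises TypeError.
try:
    testTuple[0] = (10, 20, 30)
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True  # hypothetical flag, for illustration only

print(testList[0])         # [10, 20, 30]
print(tuple_is_immutable)  # True
```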
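numpy provides the array data structure, along with many useful functions such as shape, tile, and sum.
1. shape
shape reports the length of a matrix along each dimension. It can be called as the function shape() with the matrix as its argument, or read as the attribute array.shape, which is a tuple. array.shape[0] is the length of the first dimension.
Note that the attribute form is only available when the matrix is a numpy array, for example one built with array(). The function shape() also accepts a nested list.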
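from numpy import array

arraySample = array([[1, 2, 3],
                     [5, 3, 7],
                     [4, 9, 6],
                     [5, 9, 0]])
print(arraySample.shape[0])  # length of the first dimension (rows)
print(arraySample.shape[1])  # length of the second dimension (columns)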
This prints:
4
3
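The two ways of reading a shape can be compared side by side. This is a small sketch; the names nested, arr, shape_from_function and shape_from_attribute are made up for the example and do not appear in the original text.

```python
import numpy as np

nested = [[1, 2, 3], [5, 3, 7], [4, 9, 6], [5, 9, 0]]  # plain nested list
arr = np.array(nested)                                  # numpy array

# The function form converts its argument first, so a nested list works.
shape_from_function = np.shape(nested)

# The attribute form exists only on numpy arrays.
shape_from_attribute = arr.shape

print(shape_from_function)       # (4, 3)
print(shape_from_attribute)      # (4, 3)
print(hasattr(nested, "shape"))  # False: a list has no .shape attribute
```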
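2. tile(arr, rep)
arr is the source array and rep describes how it is repeated. For example:
Array = [1, 2] ==> tile(Array, 2) == [1, 2, 1, 2]
==> tile(Array, (2, 3)) == [[1, 2, 1, 2, 1, 2], [1, 2, 1, 2, 1, 2]]
==> tile(Array, (2, 2, 3)) == [[[1, 2, 1, 2, 1, 2], [1, 2, 1, 2, 1, 2]], [[1, 2, 1, 2, 1, 2], [1, 2, 1, 2, 1, 2]]]
In tile(arr, rep), each element of the rep tuple gives the number of times arr is repeated along one dimension. The last element applies to the innermost dimension (the last axis), and the earlier elements apply to successively outer dimensions.
tile(Array, 2) repeats Array 2 times along its only dimension;
tile(Array, (2, 3)) repeats Array 3 times along the innermost dimension and 2 times along the next dimension out, giving shape (2, 6);
tile(Array, (2, 2, 3)) repeats Array 3 times along the innermost dimension, 2 times along the middle dimension, and 2 times along the outermost dimension, giving shape (2, 2, 6).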
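One way to check how the repetition counts map onto dimensions is to look at the shapes of the results. The sketch below uses made-up variable names (a, t1, t2, t3) for the three tile calls discussed above.

```python
import numpy as np

a = np.array([1, 2])

t1 = np.tile(a, 2)          # repeat twice along the only axis
t2 = np.tile(a, (2, 3))     # 3 times along the last axis, 2 along the new first axis
t3 = np.tile(a, (2, 2, 3))  # 3 times along the last axis, 2 along each outer axis

print(t1.shape)  # (4,)
print(t2.shape)  # (2, 6)
print(t3.shape)  # (2, 2, 6)
```

The last axis always has length 2 times the final element of rep, because the two-element source array is laid out along that axis.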
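3. array.sum(axis=k)
The axis argument of sum(axis=k) is best understood in the same way as the dimensions in tile. For a two-dimensional matrix, axis=0 gives the sum of each column and axis=1 gives the sum of each row. With more than two dimensions, it is clearer to think of the sum as taken along a dimension.
Suppose array has 5 dimensions. Then array.sum(axis=4) sums along the innermost dimension (the last axis), and array.sum(axis=0) sums along the outermost dimension (the first axis). For example: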
dataTest = array([[[1, 2, 3],
                   [3, 4, 5]],
                  [[5, 6, 7],
                   [8, 9, 10]]])
print(dataTest.sum(axis=0))
print(" ")
print(dataTest.sum(axis=1))
print(" ")
print(dataTest.sum(axis=2))
Output:
[[ 6  8 10]
 [11 13 15]]

[[ 4  6  8]
 [13 15 17]]

[[ 6 12]
 [18 27]]
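A quick way to see which dimension a sum removes is to compare shapes before and after. This sketch reuses dataTest from the example; the names shapes, m, col_sums and row_sums are introduced here only for illustration.

```python
import numpy as np

dataTest = np.array([[[1, 2, 3],
                      [3, 4, 5]],
                     [[5, 6, 7],
                      [8, 9, 10]]])

# dataTest has shape (2, 2, 3); summing along axis k removes that axis.
shapes = [dataTest.sum(axis=k).shape for k in range(dataTest.ndim)]
print(shapes)  # [(2, 3), (2, 3), (2, 2)]

# Two-dimensional case: axis=0 collapses the rows, axis=1 collapses the columns.
m = np.array([[1, 2],
              [3, 4]])
col_sums = m.sum(axis=0)  # [4 6], one sum per column
row_sums = m.sum(axis=1)  # [3 7], one sum per row
print(col_sums, row_sums)
```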
The final algorithm is as follows:
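from numpy import zeros

def file2matrix(filename):
    # each line holds three features and a class label, separated by tabs
    with open(filename) as fr:
        lines = fr.readlines()
    numberOfLines = len(lines)             # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))  # prepare matrix to return
    classLabelVector = []                  # prepare labels to return
    index = 0
    for line in lines:
        line = line.strip()
        listFromLine = line.split('\t')
        # the first three fields are the features; convert them to float explicitly
        returnMat[index, :] = [float(x) for x in listFromLine[0:3]]
        # the last field is the integer class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector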
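As an alternative sketch, the same parsing can be done with numpy.loadtxt. This is not the approach used in file2matrix above. It assumes every field, including the label, is numeric and that the file has more than one line (for a single line loadtxt returns a one-dimensional array). The sample text and the temporary file are made up for the example, using three rows of the data shown later in this post.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample: three tab-separated features followed by a label.
sample = (
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

data = np.loadtxt(path, delimiter="\t")  # shape (3, 4), dtype float
os.remove(path)

returnMat = data[:, 0:3]                             # feature columns
classLabelVector = data[:, -1].astype(int).tolist()  # label column as ints

print(returnMat.shape)   # (3, 3)
print(classLabelVector)  # [3, 2, 1]
```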
'''
dataSet = array([[ 4.09200000e+04, 8.32697600e+00, 9.53952000e-01],
                 [ 1.44880000e+04, 7.15346900e+00, 1.67390400e+00],
                 [ 2.60520000e+04, 1.44187100e+00, 8.05124000e-01],
                 [ 7.51360000e+04, 1.31473940e+01, 4.28964000e-01],
                 [ 3.83440000e+04, 1.66978800e+00, 1.34296000e-01],
                 [ 7.29930000e+04, 1.01417400e+01, 1.03295500e+00],
                 [ 3.59480000e+04, 6.83079200e+00, 1.21319200e+00],
                 [ 4.26660000e+04, 1.32763690e+01, 5.43880000e-01],
                 [ 6.74970000e+04, 8.63157700e+00, 7.49278000e-01],
                 [ 3.54830000e+04, 1.22731690e+01, 1.50805300e+00]])
inX = array([36661, 11.865402, 0.882810])
labels = [3, 2, 1, 1, 1, 1, 3, 3, 1, 3]
'''
import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # dataSetSize = 10
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    '''
    tile(inX, (dataSetSize,1)) = [[ 3.66610000e+04 1.18654020e+01 8.82810000e-01]
    [ 3.66610000e+04 1.18654020e+01 8.82810000e-01]
    [ 3.66610000e+04 1.18654020e+01 8.82810000e-01]
    [ 3.66610000e+04 1.18654020e+01 8.82810000e-01]
    [ 3.66610000e+04 1.18654020e+01 8.82810000e-01]
    [ 3.66610000e+04 1.18654020e+01 8.82810000e-01]
    [ 3.66610000e+04 1.18654020e+01 8.82810000e-01]
    [ 3.66610000e+04 1.18654020e+01 8.82810000e-01]
    [ 3.66610000e+04 1.18654020e+01 8.82810000e-01]
    [ 3.66610000e+04 1.18654020e+01 8.82810000e-01]]
    diffMat= [[ -4.25900000e+03 3.53842600e+00 -7.11420000e-02]
    [ 2.21730000e+04 4.71193300e+00 -7.91094000e-01]
    [ 1.06090000e+04 1.04235310e+01 7.76860000e-02]
    [ -3.84750000e+04 -1.28199200e+00 4.53846000e-01]
    [ -1.68300000e+03 1.01956140e+01 7.48514000e-01]
    [ -3.63320000e+04 1.72366200e+00 -1.50145000e-01]
    [ 7.13000000e+02 5.03461000e+00 -3.30382000e-01]
    [ -6.00500000e+03 -1.41096700e+00 3.38930000e-01]
    [ -3.08360000e+04 3.23382500e+00 1.33532000e-01]
    [ 1.17800000e+03 -4.07767000e-01 -6.25243000e-01]]
    '''
    sqDiffMat = diffMat**2
    '''
    sqDiffMat = [[ 1.81390810e+07 1.25204586e+01 5.06118416e-03]
    [ 4.91641929e+08 2.22023126e+01 6.25829717e-01]
    [ 1.12550881e+08 1.08649999e+02 6.03511460e-03]
    [ 1.48032562e+09 1.64350349e+00 2.05976192e-01]
    [ 2.83248900e+06 1.03950545e+02 5.60273208e-01]
    [ 1.32001422e+09 2.97101069e+00 2.25435210e-02]
    [ 5.08369000e+05 2.53472979e+01 1.09152266e-01]
    [ 3.60600250e+07 1.99082788e+00 1.14873545e-01]
    [ 9.50858896e+08 1.04576241e+01 1.78307950e-02]
    [ 1.38768400e+06 1.66273926e-01 3.90928809e-01]]
    '''
    sqDistances = sqDiffMat.sum(axis=1)
    # sqDistances = [1.81390935e+07 4.91641952e+08 1.12550990e+08 1.48032563e+09 2.83259351e+06 1.32001423e+09 5.08394456e+05 3.60600271e+07 9.50858906e+08 1.38768456e+06]
    distances = sqDistances**0.5
    # distances = [4259.00147048 22173.00051477 10609.00512094 38475.00002403 1683.03104868 36332.0000412 713.01785142 6005.00017533 30836.00016986 1178.0002365]
    sortedDistIndicies = distances.argsort()
    print(sortedDistIndicies)
    # sortedDistIndicies = [6 9 4 0 7 2 1 8 5 3], that is:
    # distances[6] < distances[9] < distances[4] < distances[0] < distances[7] < distances[2] < distances[1] < distances[8] < distances[5] < distances[3]
    classCount = {}
    # {} indicates classCount is a dictionary
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        # dict.get(key, default=None) returns the value for key; if key is not in the dict, it returns default
        # get(voteIlabel, 0) returns the current count for voteIlabel, or 0 if voteIlabel is not yet in classCount
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # reverse=True sorts in descending order; reverse=False sorts in ascending order
    # key=operator.itemgetter(1) sorts by the second item of each (label, count) pair, i.e. the count
    return sortedClassCount[0][0]
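For comparison, here is a compact variant of the same steps, run on the sample data listed in the comment above. It is a sketch rather than the original implementation: the function name classify_knn is hypothetical, it relies on numpy broadcasting instead of tile, and it counts votes with collections.Counter instead of a dict plus sorted. Tie-breaking between equally voted labels may differ from classify0.

```python
from collections import Counter

import numpy as np


def classify_knn(inX, dataSet, labels, k):
    """Hypothetical compact variant of classify0 (illustration only)."""
    diff = dataSet - inX                        # broadcasting replaces tile()
    distances = np.sqrt((diff ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]           # indices of the k closest rows
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]           # label with the most votes


dataSet = np.array([[40920, 8.326976, 0.953952],
                    [14488, 7.153469, 1.673904],
                    [26052, 1.441871, 0.805124],
                    [75136, 13.147394, 0.428964],
                    [38344, 1.669788, 0.134296],
                    [72993, 10.141740, 1.032955],
                    [35948, 6.830792, 1.213192],
                    [42666, 13.276369, 0.543880],
                    [67497, 8.631577, 0.749278],
                    [35483, 12.273169, 1.508053]])
labels = [3, 2, 1, 1, 1, 1, 3, 3, 1, 3]
inX = np.array([36661, 11.865402, 0.882810])

predicted = classify_knn(inX, dataSet, labels, 3)
print(predicted)  # 3: the three nearest rows are 6, 9 and 4, with labels 3, 3 and 1
```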