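The example has three features: flight miles, percentage of time spent playing video games, and liters of ice cream consumed per week. Because the flight-mile values are far larger than the other two, they dominate the distance calculation, even though we consider all three features equally important.
One way to fix this is to normalize the values, rescaling each feature to the range 0 to 1 or -1 to 1. The formula newValue = (oldValue - min)/(max - min), where min and max are the smallest and largest values of that feature in the dataset, maps a feature with any range of values into the interval from 0 to 1.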
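To make the effect concrete, here is a minimal sketch that applies the formula to a tiny made-up dataset. The three sample rows are hypothetical values chosen for illustration, not rows from datingTestSet.txt, and the variable names are assumptions of this sketch.

```python
import numpy as np

# Hypothetical samples: (flight miles, % time gaming, liters of ice cream per week)
samples = np.array([
    [40000.0,  8.0, 1.0],
    [10000.0,  2.0, 0.5],
    [70000.0, 14.0, 1.5],
])

# Euclidean distance between the first two rows on the raw values:
# the mileage difference (30000) swamps the other two features.
raw_dist = np.sqrt(((samples[0] - samples[1]) ** 2).sum())

# newValue = (oldValue - min) / (max - min), applied column by column
col_min = samples.min(axis=0)
col_max = samples.max(axis=0)
scaled = (samples - col_min) / (col_max - col_min)

# After scaling, every feature contributes on the same 0-to-1 scale.
scaled_dist = np.sqrt(((scaled[0] - scaled[1]) ** 2).sum())

print(raw_dist)
print(scaled)
print(scaled_dist)
```

On the raw values the distance is almost exactly the mileage difference; after scaling, each of the three features contributes equally to it.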
Add the autoNorm() function to kNN.py:
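# zeros, shape and tile come from NumPy (kNN.py is assumed to start with `from numpy import *`)
def autoNorm(dataSet):
    # column-wise minima (min over axis 0, i.e. over all rows)
    minVals = dataSet.min(0)
    # column-wise maxima (max over axis 0, i.e. over all rows)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    # repeat minVals and ranges m times so they match the shape of dataSet
    normDataSet = dataSet - tile(minVals, (m,1))
    # element-wise division, not matrix division
    normDataSet = normDataSet/tile(ranges, (m,1))
    return normDataSet, ranges, minVals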
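The tile() calls above exist only to stretch minVals and ranges to the shape of dataSet. As a hedged alternative, the sketch below relies on NumPy broadcasting to do the same thing. The name auto_norm_broadcast is hypothetical and not part of kNN.py, and mapping a constant column (range 0) to 0 instead of dividing by zero is an assumption of this sketch, not something the original function does.

```python
import numpy as np

def auto_norm_broadcast(data_set):
    """Min-max normalize each column of data_set to the interval [0, 1].

    Hypothetical variant of autoNorm() that uses broadcasting instead of tile().
    """
    data_set = np.asarray(data_set, dtype=float)
    min_vals = data_set.min(axis=0)            # column-wise minima
    ranges = data_set.max(axis=0) - min_vals   # column-wise max - min
    # Assumption: a constant column has range 0; divide by 1 instead,
    # so that column becomes all zeros rather than NaN.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    # min_vals and safe_ranges (shape (n,)) broadcast across all m rows
    norm_data_set = (data_set - min_vals) / safe_ranges
    return norm_data_set, ranges, min_vals

# Small made-up matrix; the third column is constant on purpose.
demo = [[1, 10, 5],
        [3, 20, 5],
        [5, 40, 5]]
norm, ranges, min_vals = auto_norm_broadcast(demo)
print(norm)
```

It returns the same three values as autoNorm(), so it could be called the same way.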
Run the code:
>>> import kNN
>>> datingDataMat, datingLabels = kNN.file2matrix('datingTestSet.txt')
>>> normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
>>> normMat
array([[0.44832535, 0.39805139, 0.56233353],
       [0.15873259, 0.34195467, 0.98724416],
       [0.28542943, 0.06892523, 0.47449629],
       ...,
       [0.29115949, 0.50910294, 0.51079493],
       [0.52711097, 0.43665451, 0.4290048 ],
       [0.47940793, 0.3768091 , 0.78571804]])
>>> ranges
array([9.1273000e+04, 2.0919349e+01, 1.6943610e+00])
>>> minVals
array([0. , 0. , 0.001156])
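autoNorm() returns ranges and minVals along with the normalized matrix, which is useful because a new sample has to be rescaled with the same values before it is compared with the normalized data. The sketch below shows that step under stated assumptions: the ranges and minVals arrays are copied from the output above, while new_sample is a made-up input chosen only for illustration.

```python
import numpy as np

# Values copied from the interpreter output above
ranges = np.array([9.1273000e+04, 2.0919349e+01, 1.6943610e+00])
min_vals = np.array([0.0, 0.0, 0.001156])

# Hypothetical new person: (flight miles, % time gaming, liters of ice cream)
new_sample = np.array([10000.0, 10.0, 0.5])

# Apply the same formula with the training-set minima and ranges
new_sample_norm = (new_sample - min_vals) / ranges
print(new_sample_norm)
```

A sample whose raw values fall outside the training minima and maxima would land outside the 0-to-1 interval, which is expected with this formula.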