【Machine Learning in Action】Chap1|Classification|kNN

tong_xin2010

Category: Machine Learning. Tags: data mining

Article link: https://blog.csdn.net/tong_xin2010/article/details/78222820

Comprehension of Listing 2.1

The inputs are:

inX is [0,0];

dataSet is:

labels is:

k is 3

Line 1:

We can use the built-in 'type' function to confirm that 'dataSet' is a numpy.ndarray. numpy.ndarray.shape[0] returns the number of rows of the matrix, which is 4 in this example.

So dataSetSize is 4.
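As a rough illustration of the inputs and of Line 1, the sketch below assumes the four-point toy data set from the book's createDataSet(); the concrete values are an assumption, since they are not listed in this post.

```python
import numpy as np

# Assumed inputs: the four-point toy set from the book's createDataSet()
dataSet = np.array([[1.0, 1.1],
                    [1.0, 1.0],
                    [0.0, 0.0],
                    [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
inX = [0, 0]
k = 3

print(type(dataSet))   # <class 'numpy.ndarray'>
print(dataSet.shape)   # (4, 2): 4 rows, 2 columns

dataSetSize = dataSet.shape[0]  # number of rows
print(dataSetSize)     # 4
```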

Line 2:

'tile' is a function from the numpy module that repeats a given pattern. Here we repeat inX once per row of dataSet so that the two matrices have the same shape and can be subtracted.

After this line, every row of dataSet has been subtracted from inX, and the result is stored in diffMat.
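A minimal sketch of what this line does, again assuming the book's four-point toy data (the values are an assumption). The last line shows that NumPy broadcasting should give the same result without 'tile'.

```python
import numpy as np

# Assumed toy data from the book's createDataSet()
dataSet = np.array([[1.0, 1.1],
                    [1.0, 1.0],
                    [0.0, 0.0],
                    [0.0, 0.1]])
inX = [0, 0]
dataSetSize = dataSet.shape[0]

# Repeat inX dataSetSize times down the rows and once across the columns
tiled = np.tile(inX, (dataSetSize, 1))
print(tiled.shape)  # (4, 2)

diffMat = tiled - dataSet  # with inX = [0, 0] this is just -dataSet

# Broadcasting alternative: NumPy stretches inX across the rows automatically
diffMat_broadcast = np.asarray(inX) - dataSet
```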

Line 3:

'**' is the exponentiation operator. For instance, 2**4 is 2 to the 4th power (16).

This line of code squares every element of the matrix.

Line 4:

numpy.ndarray.sum calculates the sum, as indicated by its name. It can also take the axis along which to add: 'axis = 0' sums along the vertical direction (down each column), and 'axis = 1' sums along the horizontal direction (across each row).

So the result holds the sum of every row of the matrix, that is, the squared distance from inX to each point.
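The difference between the two axes is easiest to see on a tiny matrix. The 2x2 matrix below is a made-up example, not data from the book.

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])

squared = m ** 2                 # element-wise square: [[1, 4], [9, 16]]
col_sums = squared.sum(axis=0)   # down each column:  [10, 20]
row_sums = squared.sum(axis=1)   # across each row:   [5, 25]

print(col_sums)  # [10 20]
print(row_sums)  # [ 5 25]
```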

Line 5:

As with '**2' above, '**0.5' takes the square root of every element, which turns the squared distances into Euclidean distances.
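Lines 2 to 5 together compute the Euclidean distance from inX to every row. The sketch below assumes the book's toy data (the values are an assumption) and cross-checks the result against np.linalg.norm, which is offered only as a comparison, not as what the listing uses.

```python
import numpy as np

# Assumed toy data from the book's createDataSet()
dataSet = np.array([[1.0, 1.1],
                    [1.0, 1.0],
                    [0.0, 0.0],
                    [0.0, 0.1]])
inX = np.array([0.0, 0.0])

diffMat = inX - dataSet                  # Line 2 (via broadcasting)
sqDiffMat = diffMat ** 2                 # Line 3
sqDistances = sqDiffMat.sum(axis=1)      # Line 4
distances = sqDistances ** 0.5           # Line 5
print(distances)  # approximately [1.4866 1.4142 0. 0.1]

# Same distances computed in one call, for comparison
check = np.linalg.norm(diffMat, axis=1)
```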

Line 6:

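This line calls numpy.ndarray.argsort, which does not return the sorted values but the indices that would sort the array in ascending order.

The result means that the distance at index 2 is the smallest, that at index 3 is the second smallest, and so on. We can check that this is true.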
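To check this by hand, the sketch below feeds argsort the distances from [0,0] to the points of the book's toy data set; the rounded values are an assumption based on that data.

```python
import numpy as np

# Distances from [0, 0] to the four assumed toy points (rounded)
distances = np.array([1.4866, 1.4142, 0.0, 0.1])

sortedDistIndicies = distances.argsort()
print(sortedDistIndicies)             # [2 3 1 0]

# Indexing with the result yields the distances in ascending order
print(distances[sortedDistIndicies])  # [0.     0.1    1.4142 1.4866]
```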

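Line 7~10: the for loop

classCount is a dictionary that maps each label to the number of votes it has received.

The for loop tallies the labels of the k points with the smallest distances: for each of them it looks up the label and increments that label's count in classCount.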
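The voting loop can be sketched as follows. The labels and the sorted indices are assumed values consistent with the book's toy data and with k = 3.

```python
# Assumed values: labels of the toy data and the argsort result for inX = [0, 0]
labels = ['A', 'A', 'B', 'B']
sortedDistIndicies = [2, 3, 1, 0]
k = 3

classCount = {}
for i in range(k):
    # Label of the i-th nearest point
    voteIlabel = labels[sortedDistIndicies[i]]
    # dict.get returns 0 the first time a label is seen
    classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

print(classCount)  # {'B': 2, 'A': 1}
```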

Line 11:

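First, let's check out the prototype of 'sorted' (this is the Python 2 signature):

Syntax of sorted:

sorted(iterable[, cmp[, key[, reverse]]])

Parameters:

· iterable -- an iterable object.

· cmp -- a comparison function that takes two arguments, both drawn from the iterable. It must return a positive number if the first is greater, a negative number if it is smaller, and 0 if they are equal.

· key -- a function of one argument that is applied to each element of the iterable to extract the value used for comparison.

· reverse -- the sort order: reverse = True sorts in descending order, reverse = False in ascending order (the default).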

Then we compare this with the usage in our example:

We can see that the first input parameter is "classCount.iteritems()", which corresponds to the iterable in the prototype.

The second input parameter is "key = operator.itemgetter(1)". Because it is passed by keyword, it does not fill the 'cmp' position but binds to 'key' directly, so 'cmp' is simply left out. itemgetter(1) picks element 1 of each (label, count) pair, so the pairs are sorted by count.

The third input parameter 'reverse=True' means the sorting is in descending order.

Line 12:

We return the label that has the largest count, which is the label in the first entry of the sorted list.
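A small sketch of Lines 11 and 12, using an assumed vote tally. Note that it is written for Python 3, where dict.iteritems() no longer exists and items() is used instead, and where sorted accepts only key and reverse (no cmp).

```python
import operator

# Assumed vote tally for k = 3 on the toy data
classCount = {'A': 1, 'B': 2}

# Sort the (label, count) pairs by count, largest first
sortedClassCount = sorted(classCount.items(),
                          key=operator.itemgetter(1),
                          reverse=True)
print(sortedClassCount)        # [('B', 2), ('A', 1)]

# The predicted class is the label of the first pair
predicted = sortedClassCount[0][0]
print(predicted)               # B
```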

********************************************************************************

Comprehension of Listing 2.2

Line 1:

Opens the file whose path is given by the input parameter.

Line 2:

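readlines returns a list that holds the entire content of the file, one line per element. The length of this list is therefore the number of lines in the file.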

Line 3:

Creates the matrix to be returned. Its dimensions must match the input file: one row per line of the file and one column per feature (3 here).

Line 4~5:

The reason we call 'open' again is that 'readlines' has already been called once, which leaves the file pointer at the end of the file. Since we need to call 'readlines' again, we have to reopen the file first.
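The file-pointer behaviour is easy to reproduce. The sketch below writes a small temporary file (its contents are made up) and also shows fr.seek(0), which is an alternative to reopening the file and is not what the book's listing does.

```python
import os
import tempfile

# Create a small throwaway file with three lines
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('a\nb\nc\n')
    path = tmp.name

fr = open(path)
first = fr.readlines()    # reads all 3 lines, pointer is now at end of file
second = fr.readlines()   # nothing left to read, so this is []
fr.seek(0)                # rewind to the start instead of reopening
third = fr.readlines()    # all 3 lines again
fr.close()
os.remove(path)

print(len(first), len(second), len(third))  # 3 0 3
```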

Later part:

Each iteration of the for loop takes one line from the file and splits it into fields according to the file format, which we already know.

We put the first 3 fields into 'returnMat', and we put the last field (given by the index '-1') into classLabelVector.
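The parsing loop can be sketched as below. The two sample lines are hypothetical, tab-separated stand-ins for the real data file, np.zeros is assumed as the way the matrix is created, and the explicit float() conversion is added here for clarity.

```python
import numpy as np

# Hypothetical tab-separated lines: three features, then a class label
lines = ['1.0\t2.0\t3.0\tlabelA\n',
         '4.0\t5.0\t6.0\tlabelB\n']

numberOfLines = len(lines)
returnMat = np.zeros((numberOfLines, 3))   # one row per line, 3 features
classLabelVector = []

for index, line in enumerate(lines):
    listFromLine = line.strip().split('\t')          # drop newline, split fields
    returnMat[index, :] = [float(x) for x in listFromLine[0:3]]
    classLabelVector.append(listFromLine[-1])        # last field is the label

print(returnMat.shape)     # (2, 3)
print(classLabelVector)    # ['labelA', 'labelB']
```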

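Caution! int() cannot convert a string that is not a numeric literal, so the code in the book raises an error when the last field is a text label. In that case, the int() call should be removed.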
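If the same code has to handle both numeric and text labels, one possible workaround is to fall back to the raw string when int() fails. The helper name parse_label and the sample labels are hypothetical and do not come from the book.

```python
def parse_label(text):
    """Return the label as an int when possible, otherwise as the original string."""
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for non-numeric strings
        return text

print(parse_label('3'))           # 3
print(parse_label('someLabel'))   # someLabel
```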