"Machine Learning in Action" Chapter 2

Latest recommended article published on 2021-09-27 23:00:09

Haibaral

This chapter covers

• The k-Nearest Neighbors classification algorithm

• Parsing and importing data from a text file

• Creating scatter plots with Matplotlib

• Normalizing numeric values

1. The k-Nearest Neighbors classification algorithm

1). Brief Description

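Given training samples and their labels, kNN classifies a test sample by selecting the k training samples nearest to it; the class that occurs most often among those k samples is the predicted label for the test sample.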

2). General approach to kNN

Collect: Any method.
Prepare: Numeric values are needed for a distance calculation. A structured data format is best.
Analyze: Any method.
Train: Does not apply to the kNN algorithm.
Test: Calculate the error rate.
Use: This application needs to get some input data and output structured numeric values. Next, the application runs the kNN algorithm on this input data and determines which class the input data should belong to. The application then takes some action on the calculated class.

3). Code implementation

(1) Create a DataSet

'''
Created on Sep 16, 2010
kNN: k Nearest Neighbors

Input:      inX: vector to compare to existing dataset (1xN)
            dataSet: size m data set of known vectors (NxM)
            labels: data set labels (1xM vector)
            k: number of neighbors to use for comparison (should be an odd number)

Output:     the most popular class label

@author: pbharrin
'''

from numpy import *
from os import listdir
import operator


def createDataSet():  # create a small example dataset
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

In this code, we import NumPy, our scientific computing package, and the operator module, which is used later in the kNN algorithm for sorting. The listdir function from the os module is used later, in the handwriting example, to list the files in a directory.

(2) k-Nearest Neighbors algorithm (Classifier)

For every point in our dataset:
calculate the distance between inX and the current point
sort the distances in increasing order
take k items with lowest distances to inX
find the majority class among these items
return the majority class as our prediction for the class of inX

k-Nearest Neighbors algorithm

def classify0(inX, dataSet, labels, k):  # simple kNN classifier
    dataSetSize = dataSet.shape[0]  # number of rows in the matrix
    # tile(): repeat inX dataSetSize times along the first axis and once along the second
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2  # squared difference matrix
    sqDistances = sqDiffMat.sum(axis=1)  # sum(axis=1): add up the elements of each row
    # Example of sum() on c = array([[0, 2, 1], [3, 5, 6], [0, 1, 1]]):
    #   c.sum()        gives 19
    #   c.sum(axis=0)  gives [3 8 8]    (one sum per column)
    #   c.sum(axis=1)  gives [ 3 14  2] (one sum per row)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()  # argsort(): indices that sort the values in ascending order
    classCount = {}
    for i in range(k):  # vote with the k nearest points
        voteIlabel = labels[sortedDistIndicies[i]]
        # get(): look up voteIlabel in classCount, returning 0 if it is absent
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sorted() accepts any iterable; classCount.items() yields the dict's (key, value) pairs.
    # key=operator.itemgetter(1) sorts by the second field (the count);
    # reverse=True sorts in descending order.
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1),
                              reverse=True)
    return sortedClassCount[0][0]  # return the label of the item occurring most frequently

The function classify0() takes four inputs: the input vector to classify, called inX; our full matrix of training examples, called dataSet; a vector of labels, called labels; and, finally, k, the number of nearest neighbors to use in the voting. The labels vector should have as many elements as there are rows in the dataSet matrix. You calculate the distances using the Euclidean distance. Following the distance calculation, the distances are sorted from least to greatest (this is the default). Next, the first k, or lowest k, distances are used to vote on the class of inX. The input k should always be a positive integer. Lastly, you take the classCount dictionary, decompose it into a list of tuples, and sort the tuples by the second item in each tuple using the itemgetter method from the operator module imported at the top of the program. This sort is done in reverse so you have largest to smallest. Finally, you return the label of the item occurring most frequently.
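
As a hedged illustration of the steps just described, the following sketch performs the same vote with NumPy broadcasting instead of tile() and with collections.Counter instead of a sorted dictionary. The name classify_knn and the use of Counter are assumptions of this sketch, not part of the book's kNN.py; the sample data matches what createDataSet() returns.

```python
from collections import Counter

import numpy as np


def classify_knn(in_x, data_set, labels, k):
    """Hypothetical helper: majority vote among the k nearest rows of data_set."""
    # Broadcasting subtracts in_x from every row, so tile() is not needed.
    diffs = data_set - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))  # Euclidean distance per row
    nearest = distances.argsort()[:k]              # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)    # count labels among the neighbors
    return votes.most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify_knn([0, 0], group, labels, 3))  # prints B
```

For the point [0, 0] the three nearest rows carry the labels B, B, and A, so the vote returns B.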

2. Parsing and importing data from a text file

Before we can use this data in our classifier, we need to change it to the format that our classifier accepts. In order to do this, we’ll add a new function to kNN.py called file2matrix. This function takes a filename string and outputs two things: a matrix of training examples and a vector of class labels.

Text record to NumPy parsing code

def file2matrix(filename):  # parse a tab-delimited text file into a matrix and a label list
    with open(filename) as fr:  # the file is closed when the block exits
        arrayOLines = fr.readlines()  # read the file once
    numberOfLines = len(arrayOLines)  # number of lines in the file
    # zeros(): create a NumPy array with numberOfLines rows and 3 columns to return
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []  # class labels to return
    index = 0
    for line in arrayOLines:
        line = line.strip()  # strip(): remove leading and trailing whitespace, including the newline
        # split the line into a list of elements delimited by the tab character '\t';
        # each element is one column of the line
        listFromLine = line.split('\t')
        # take the first three elements and store them in a row of the matrix
        returnMat[index, :] = listFromLine[0:3]
        # use negative indexing to get the last item and put it into classLabelVector
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

This code is a great place to demonstrate how easy it is to process text with Python. Initially, you'd like to know how many lines are in the file, so the code reads in the file and counts the lines. Next, you create a NumPy matrix (actually, it's a 2D array, but don't worry about that now) to populate and return. I've hard-coded the size of this to be numberOfLines x 3, but you could add some code to make this adaptable to various inputs. Finally, you loop over all the lines in the file and strip off the newline character with line.strip(). Next, you split the line into a list of elements delimited by the tab character: '\t'. You take the first three elements and store them in a row of your matrix, and you use the Python feature of negative indexing to get the last item from the list to put into classLabelVector.
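
One possible way to remove the hard-coded column count is sketched below. It assumes every line has the same number of tab-separated fields and that the last field is an integer class label; the name file_to_matrix and the demo values are made up for illustration and are not from the book.

```python
import os
import tempfile

import numpy as np


def file_to_matrix(filename, delimiter='\t'):
    """Hypothetical variant of file2matrix that infers the number of feature columns."""
    features, class_labels = [], []
    with open(filename) as fr:
        for line in fr:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            fields = line.split(delimiter)
            features.append([float(value) for value in fields[:-1]])
            class_labels.append(int(fields[-1]))  # assumption: last column is an integer label
    return np.array(features), class_labels


# Demo on a throwaway file with three features and a label per line (made-up values).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('40920\t8.3\t0.95\t3\n14488\t7.15\t1.67\t2\n')
    demo_path = tmp.name
demo_mat, demo_labels = file_to_matrix(demo_path)
os.remove(demo_path)
print(demo_mat.shape, demo_labels)  # (2, 3) [3, 2]
```

Collecting rows in a list and converting once at the end avoids a separate pass to count lines, at the cost of holding the parsed values in a Python list first.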

The following code is a small function called img2vector, which converts the image to a vector. The function creates a 1x1024 NumPy array, then opens the given file, loops over the first 32 lines in the file, and stores the integer value of the first 32 characters on each line in the NumPy array. This array is finally returned.

def img2vector(filename):
    returnVect = zeros((1, 1024))  # 1 row, 1024 columns
    with open(filename) as fr:  # the file is closed when the block exits
        for i in range(32):  # loop over rows
            lineStr = fr.readline()  # read one line of digits (a str)
            for j in range(32):  # loop over columns
                returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect
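
The index expression 32*i+j is row-major flattening. A minimal sketch, assuming the 32 lines are already in memory as strings, produces the same layout with reshape; the helper name lines_to_vector and the synthetic image are hypothetical.

```python
import numpy as np


def lines_to_vector(lines, size=32):
    """Hypothetical helper: flatten size text lines of 0/1 characters into a 1 x size*size array."""
    grid = np.array([[int(ch) for ch in line[:size]] for line in lines[:size]], dtype=float)
    return grid.reshape(1, size * size)  # row-major: element (i, j) lands at index size*i + j


# Synthetic 32x32 image: all zeros except a single 1 at row 2, column 5.
rows = ['0' * 32 for _ in range(32)]
rows[2] = '0' * 5 + '1' + '0' * 26
vect = lines_to_vector(rows)
print(vect.shape, vect[0, 32 * 2 + 5])  # (1, 1024) 1.0
```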

3. Creating scatter plots with Matplotlib

Let’s look at the data in further detail by making some scatter plots of the data from Matplotlib. This isn’t hard to do. From the Python console, type the following:

>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.scatter(datingDataMat[:,1], datingDataMat[:,2])
>>> plt.show()

You should see something like figure 2.3.

It’s hard to see any patterns in this data, but we have additional data we haven’t used yet—the class values. If we can plot these in color or use some other markers, we can get a better understanding of the data. The Matplotlib scatter function has additional inputs we can use to customize the markers. Type the previous code again, but this time use the following for a scatter function:

>>> from numpy import *
>>> ax.scatter(datingDataMat[:,1], datingDataMat[:,2],
...            15.0*array(datingLabels), 15.0*array(datingLabels))
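
The third and fourth positional arguments of scatter are the marker size (s) and marker color (c). The script below is a self-contained sketch of the same call with the keywords spelled out. It uses randomly generated stand-ins for datingDataMat and datingLabels (assumed values, not the real dataset) and the non-interactive Agg backend so that it runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for datingDataMat and datingLabels (assumed values, not the real data).
rng = np.random.default_rng(0)
data = rng.random((30, 3))
class_ids = rng.integers(1, 4, size=30)  # class labels 1, 2, or 3

fig = plt.figure()
ax = fig.add_subplot(111)
# s sets the marker size and c the marker color; both scale with the class label here.
points = ax.scatter(data[:, 1], data[:, 2], s=15.0 * class_ids, c=15.0 * class_ids)

buffer = io.BytesIO()  # pass a filename instead to write the image to disk
fig.savefig(buffer, format="png")
```

Because size and color both depend on the label, points from the same class share one appearance, which makes clusters easier to spot.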

4. Normalizing numeric values

When dealing with values that lie in different ranges, it’s common to normalize them. Common ranges to normalize them to are 0 to 1 or -1 to 1. To scale everything from 0 to 1, you need to apply the following formula:

newValue = (oldValue-min)/(max-min)

Data-normalizing code

def autoNorm(dataSet):  # Normalizing numeric values
    # The 0 in dataSet.min(0) allows you to take the minimums from the
    # columns, not the rows
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals  # max - min
    normDataSet = zeros(shape(dataSet))  # shape(dataSet): the two dimensions of the matrix
    m = dataSet.shape[0]  # number of rows
    normDataSet = dataSet - tile(minVals, (m, 1))  # oldValue - min
    # (oldValue - min)/(max - min) # element wise divide
    normDataSet = normDataSet/tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

In the autoNorm() function, you get the minimum values of each column and place this in minVals; similarly, you get the maximum values. The 0 in dataSet.min(0) allows you to take the minimums from the columns, not the rows. Next, you calculate the range of possible values seen in our data and then create a new matrix to return. To get the normalized values, you subtract the minimum values and then divide by the range. The problem with this is that our matrix is 1000x3, while the minVals and ranges are 1x3. To overcome this, you use the NumPy tile() function to create a matrix the same size as our input matrix and then fill it up with many copies, or tiles. Note that it is element-wise division. In other numeric software packages, the / operator can be used for matrix division, but in NumPy you need to use linalg.solve(matA,matB) for matrix division.
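
NumPy broadcasting can stand in for tile() here, because a 1xn vector is stretched across the rows automatically. The sketch below is a hypothetical variant named auto_norm, not the book's code. It assumes no column is constant, since a zero range would cause a division by zero.

```python
import numpy as np


def auto_norm(data_set):
    """Hypothetical broadcasting variant of autoNorm(): scale each column to [0, 1]."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting stretches the 1 x n vectors across all m rows, so tile() is not needed.
    norm_data_set = (data_set - min_vals) / ranges
    return norm_data_set, ranges, min_vals


sample = np.array([[10.0, 0.5], [20.0, 1.5], [40.0, 1.0]])
norm, ranges, min_vals = auto_norm(sample)
print(norm)
```

In the first column the values 10, 20, and 40 map to 0, 1/3, and 1, following newValue = (oldValue-min)/(max-min).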

5. Testing the classifier as a whole program

One way you can use the existing data is to take some portion, say 90%, to train the classifier. Then you’ll take the remaining 10% to test the classifier and see how accurate it is.

Classifier testing code for dating site

def datingClassTest():  # classifier testing code for dating site
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]  # total number of data points
    numTestVecs = int(m*hoRatio)  # number of data points held out for testing
    errorCount = 0.0
    for i in range(numTestVecs):  # compare each test prediction with the true label
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" %
              (classifierResult, datingLabels[i]))
        if(classifierResult != datingLabels[i]):
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))

This uses file2matrix and autoNorm() from earlier to get the data into a form you can use. Next, the number of test vectors is calculated, and this is used to decide which vectors from normMat will be used for testing and which for training. The two parts are then fed into our original kNN classifier, classify0. Finally, the error rate is calculated and displayed. Note that you’re using the original classifier; you spent most of this section manipulating the data so that you could apply it to a simple classifier. Getting solid data is important and will be the subject of chapter 20.
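
datingClassTest() holds out the first 10% of the rows, which works only if the file is not sorted in some meaningful order. A minimal sketch of a hold-out test that shuffles the rows first is given below. The names knn_label and holdout_error_rate, the seed parameter, and the toy data are assumptions for illustration, and the function expects ho_ratio to select at least one test row.

```python
import numpy as np


def knn_label(in_x, data_set, labels, k):
    """Compact kNN vote used for the demo (labels must be a NumPy array)."""
    nearest = ((data_set - in_x) ** 2).sum(axis=1).argsort()[:k]
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[counts.argmax()]


def holdout_error_rate(data_set, labels, classify, ho_ratio=0.10, k=3, seed=0):
    """Hypothetical helper: shuffle, hold out ho_ratio of the rows, return the error rate."""
    labels = np.asarray(labels)
    order = np.random.default_rng(seed).permutation(len(data_set))
    num_test = int(len(data_set) * ho_ratio)
    test_idx, train_idx = order[:num_test], order[num_test:]
    errors = sum(
        classify(data_set[i], data_set[train_idx], labels[train_idx], k) != labels[i]
        for i in test_idx
    )
    return float(errors) / num_test


# Toy data: two well-separated clusters with labels 1 and 2.
data = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
labels = [1] * 10 + [2] * 10
rate = holdout_error_rate(data, labels, knn_label, ho_ratio=0.2)
print(rate)  # 0.0 for these well-separated clusters
```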

6. Example 1: Improving matches from a dating site with kNN

Collect: Text file provided.
Prepare: Parse a text file in Python.
Analyze: Use Matplotlib to make 2D plots of our data.
Train: Doesn’t apply to the kNN algorithm.
Test: Write a function to use some portion of the data Hellen gave us as test examples. The test examples are classified against the non-test examples. If the predicted class doesn’t match the real class, we’ll count that as an error.
Use: Build a simple command-line program Hellen can use to predict whether she’ll like someone based on a few inputs.

1). Prepare: parsing data from a text file (see the code above)

>>> reload(kNN)
>>> datingDataMat,datingLabels = kNN.file2matrix('datingTestSet.txt')
>>> datingDataMat
array([[  7.29170000e+04,   7.10627300e+00,   2.23600000e-01],
       [  1.42830000e+04,   2.44186700e+00,   1.90838000e-01],
       [  7.34750000e+04,   8.31018900e+00,   8.52795000e-01],
       ...,
       [  1.24290000e+04,   4.43233100e+00,   9.24649000e-01],
       [  2.52880000e+04,   1.31899030e+01,   1.05013800e+00],
       [  4.91800000e+03,   3.01112400e+00,   1.90663000e-01]])
>>> datingLabels[0:20]
['didntLike', 'smallDoses', 'didntLike', 'largeDoses', 'smallDoses',
'smallDoses', 'didntLike', 'smallDoses', 'didntLike', 'didntLike',
'largeDoses', 'largeDoses', 'largeDoses', 'didntLike', 'didntLike',
'smallDoses', 'smallDoses', 'didntLike', 'smallDoses', 'didntLike']

2). Analyze: creating scatter plots with Matplotlib: See above

3). Prepare: normalizing numeric values (see the code above)

>>> reload(kNN)
>>> normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
>>> normMat
array([[ 0.33060119,  0.58918886,  0.69043973],
       [ 0.49199139,  0.50262471,  0.13468257],
       [ 0.34858782,  0.68886842,  0.59540619],
       ...,
       [ 0.93077422,  0.52696233,  0.58885466],
       [ 0.76626481,  0.44109859,  0.88192528],
       [ 0.0975718 ,  0.02096883,  0.02443895]])
>>> ranges
array([  8.78430000e+04,   2.02823930e+01,   1.69197100e+00])
>>> minVals
array([ 0.      ,  0.      ,  0.001818])

4). Test: testing the classifier as a whole program (see the code above)

>>> kNN.datingClassTest()
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
.
.
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the total error rate is: 0.024000

5). Use: putting together a useful system

Dating site predictor function

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0(
        (inArr-minVals)/ranges, normMat, datingLabels, 3)
    print("You will probably like this person: ",
          resultList[classifierResult - 1])

>>> kNN.classifyPerson()
        percentage of time spent playing video games?10
        frequent flier miles earned per year?10000
        liters of ice cream consumed per year?0.5
        You will probably like this person:  in small doses

7. Example: a handwriting recognition system

Collect: Text file provided.
Prepare: Write a function to convert from the image format to the list format used in our classifier, classify0().
Analyze: We’ll look at the prepared data in the Python shell to make sure it’s correct.
Train: Doesn’t apply to the kNN algorithm.
Test: Write a function to use some portion of the data as test examples. The test examples are classified against the non-test examples. If the predicted class doesn’t match the real class, you’ll count that as an error.
Use: Not performed in this example. You could build a complete program to extract digits from an image, such as a system used to sort the mail in the United States.

1). Prepare: converting images into test vectors

def img2vector(filename):
    returnVect = zeros((1, 1024))  # 1 row, 1024 columns
    with open(filename) as fr:  # the file is closed when the block exits
        for i in range(32):  # loop over rows
            lineStr = fr.readline()  # read one line of digits (a str)
            for j in range(32):  # loop over columns
                returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect

>>> testVector = kNN.img2vector('testDigits/0_13.txt')
>>> testVector[0,0:31]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])
>>> testVector[0,32:63]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])

2). Test: kNN on handwritten digits

Make sure to add from os import listdir to the top of the file. This imports one function, listdir, from the os module, so that you can see the names of files in a given directory.

Handwritten digits testing code

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('digits/trainingDigits')  # list of the file names in the directory
    m = len(trainingFileList)  # number of training files
    trainingMat = zeros((m, 1024))  # training matrix with m rows and 1024 columns (32*32)
    for i in range(m):  # store each file's class number in hwLabels and its contents in trainingMat
        fileNameStr = trainingFileList[i]  # full name of the i-th file
        fileStr = fileNameStr.split('.')[0]  # file name without the extension
        classNumStr = int(fileStr.split('_')[0])  # class number
        hwLabels.append(classNumStr)  # append the class number to hwLabels
        # store the contents of the i-th file in row i of trainingMat
        trainingMat[i, :] = img2vector('digits/trainingDigits/%s' % fileNameStr)
    testFileList = listdir('digits/testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        # Next, you do something similar for all the files in the testDigits directory
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        # but instead of loading them into a matrix, you test each vector individually with classify0
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" %
              (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount/float(mTest)))

In handwritingClassTest(), you first get the contents of the trainingDigits directory as a list. Then you see how many files are in that directory and call this m. Next, you create a training matrix with m rows and 1024 columns to hold each image as a single row. You parse out the class number from the filename. The filename is something like 9_45.txt, where 9 is the class number and it is the 45th instance of the digit 9. You then put this class number in the hwLabels vector and load the image with the function img2vector discussed previously. Next, you do something similar for all the files in the testDigits directory, but instead of loading them into a big matrix, you test each vector individually with our classify0 function. You didn’t use the autoNorm() function from section 4 because all of the values were already between 0 and 1.
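
The filename convention can be parsed in isolation. The helper below is a hypothetical sketch, not part of kNN.py. It uses os.path.splitext rather than split('.') and also returns the instance number.

```python
import os


def label_from_filename(file_name):
    """Hypothetical helper: '9_45.txt' gives (9, 45), i.e. (class number, instance number)."""
    stem, _ext = os.path.splitext(os.path.basename(file_name))
    class_str, instance_str = stem.split('_')
    return int(class_str), int(instance_str)


print(label_from_filename('9_45.txt'))             # (9, 45)
print(label_from_filename('testDigits/0_13.txt'))  # (0, 13)
```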

>>> kNN.handwritingClassTest()
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
.
.
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 6, the real answer is: 8
.
.
the classifier came back with: 9, the real answer is: 9
the total number of errors is: 11
the total error rate is: 0.011628

8. Summary

The k-Nearest Neighbors algorithm is a simple and effective way to classify data. The examples in this chapter should be evidence of how powerful a classifier it is. kNN is an example of instance-based learning, where you need to have instances of data close at hand to perform the machine learning algorithm. The algorithm has to carry around the full dataset; for large datasets, this implies a large amount of storage. In addition, you need to calculate the distance measurement for every piece of data in the database, and this can be cumbersome.
An additional drawback is that kNN doesn’t give you any idea of the underlying structure of the data; you have no idea what an “average” or “exemplar” instance from each class looks like. In the next chapter, we’ll address this issue by exploring ways in which probability measurements can help you do classification.

9. Python Notes

tile(A, reps)

Construct an array by repeating A the number of times given by reps.

>>> import numpy
>>> numpy.tile([0,0],5)  # repeat [0,0] 5 times along the columns; the row count defaults to 1
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> numpy.tile([0,0],(1,1))  # repeat [0,0] once along the columns and once along the rows
array([[0, 0]])
>>> numpy.tile([0,0],(2,1))  # repeat [0,0] once along the columns and twice along the rows
array([[0, 0],
       [0, 0]])
>>> numpy.tile([0,0],(3,1))
array([[0, 0],
       [0, 0],
       [0, 0]])
>>> numpy.tile([0,0],(1,3))  # repeat [0,0] 3 times along the columns and once along the rows
array([[0, 0, 0, 0, 0, 0]])
>>> numpy.tile([0,0],(2,3))  # repeat [0,0] 3 times along the columns and twice along the rows
array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]])
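
As a small sketch of how tile() is used in classify0(), the snippet below tiles an input vector to match a dataset and checks that NumPy broadcasting yields the same differences without building the tiled copy. The arrays are made-up values for illustration.

```python
import numpy as np

in_x = np.array([1.0, 2.0])
data_set = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 5.0]])

tiled = np.tile(in_x, (data_set.shape[0], 1))  # 3 rows, each a copy of in_x
print(tiled.shape)  # (3, 2)
# Broadcasting gives the same differences without building the tiled copy.
print(np.array_equal(tiled - data_set, in_x - data_set))  # True
```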