Introduction to Machine Learning (Part 1)

Latest recommended article published 2022-04-11 22:49:43

一只小菜皮卡丘

Column: 机器学习之路; Tags: introduction to machine learning, kNN, Machine Learning in Action

Article link: https://blog.csdn.net/weixin_40836993/article/details/95460728

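Studying Machine Learning in Action (kNN)

I plan to finish this book over the summer vacation and summarize as I read. This post covers Chapter 2, the k-nearest neighbors algorithm. I have added comments wherever I found the code hard to follow.

kNN classifies a sample by computing the distance between its feature values and every other sample in the data set, then assigning it to the class of its closest neighbors. Suppose the sample has features (1.0, 0.5, 1.5) and the data set contains (0.5, 0.0, 1.0). The distance between these two is $\sqrt{a}$, where $a = (1.0-0.5)^2 + (0.5-0.0)^2 + (1.5-1.0)^2$. Pick the k nearest samples, compute the share of each class among them, and the class with the largest share is the predicted class.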
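As a quick check of the distance formula, the two example points can be plugged into NumPy. This is only a worked calculation; the variable names `query` and `sample` are my own and do not come from the book.

```python
import numpy as np

# The two example points from the paragraph above
query = np.array([1.0, 0.5, 1.5])
sample = np.array([0.5, 0.0, 1.0])

# a is the sum of the squared per-feature differences
a = np.sum((query - sample) ** 2)
distance = np.sqrt(a)

print(a)         # 0.75
print(distance)  # roughly 0.866
```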

1. Dating site

# Imports
import numpy as np
import operator  # used to sort the vote dictionary by count
import matplotlib.pyplot as plt

# Parse the data from a text file; returns (ndarray, list)
def file2matrix(filename):
    with open(filename) as fr:  # the with block closes the file automatically
        array0Lines = fr.readlines()
    numberOfLines = len(array0Lines)
    returnMat = np.zeros((numberOfLines, 3))  # zero matrix with numberOfLines rows and 3 columns
    classLabelVector = []
    index = 0
    for line in array0Lines:
        line = line.strip()  # strip the trailing newline
        listFromLine = line.split('\t')
        returnMat[index, :] = [float(v) for v in listFromLine[0:3]]  # the first three fields are the features
        classLabelVector.append(int(listFromLine[-1]))  # the last field is the class label
        index += 1
    return returnMat, classLabelVector

Call it:

[Screenshot: calling the function]
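To make the expected file layout concrete, here is a small self-contained sketch that writes three tab-separated rows to a temporary file and parses them. The rows are made-up values for illustration, not lines copied from datingTestSet2.txt, and the parser is a condensed variant of file2matrix that also skips blank lines.

```python
import os
import tempfile
import numpy as np

def file2matrix(filename):
    # Condensed variant of the parser above (additionally skips blank lines)
    with open(filename) as fr:
        rows = [line.strip().split('\t') for line in fr if line.strip()]
    features = np.array([[float(v) for v in row[0:3]] for row in rows])
    labels = [int(row[-1]) for row in rows]
    return features, labels

# Hypothetical sample rows: three features, then an integer class label
sample_text = "40920\t8.3\t0.95\t3\n14488\t7.2\t1.67\t2\n26052\t1.4\t0.81\t1\n"

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample_text)
    path = tmp.name

mat, labels = file2matrix(path)
os.remove(path)  # clean up the temporary file

print(mat.shape)  # (3, 3)
print(labels)     # [3, 2, 1]
```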

# Classifier. Four parameters: inX: input vector to classify; dataSet: training samples;
# labels: label vector; k: number of nearest neighbors to use. Returns the winning label (an int: 1, 2 or 3 for the dating data)
def classify(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of rows in the training set
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # np.tile ("tile") repeats inX vertically, one copy per training row
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)  # sum along each row
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()  # indices that sort the distances in ascending order
    # help(np.argsort)
    print(type(sortedDistIndicies))  # debug output
    print(sortedDistIndicies)
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                             key=operator.itemgetter(1), reverse=True)  # sort by vote count, descending
    print(type(sortedClassCount))  # debug output
    print(sortedClassCount)
    return sortedClassCount[0][0]

Call it again:

[Screenshot: calling the function]
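The voting step is easier to follow on a tiny data set. The sketch below is my own compact variant, not the book's code: `knn_classify` is a hypothetical name, it relies on NumPy broadcasting instead of np.tile, and it counts votes with `collections.Counter`. The four training points are made up for illustration.

```python
from collections import Counter
import numpy as np

def knn_classify(inX, dataSet, labels, k):
    # Broadcasting subtracts inX from every row, so np.tile is not needed
    distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]             # indices of the k closest rows
    votes = Counter(labels[i] for i in nearest)   # class -> number of votes
    return votes.most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(knn_classify(np.array([0.0, 0.0]), group, labels, 3))  # B
print(knn_classify(np.array([1.0, 1.2]), group, labels, 3))  # A
```

For the query (0.0, 0.0) the three nearest rows carry the labels B, B and A, so B wins the vote two to one.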

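Some features take large values and others small ones, for example (1.0, 10000, 0.5) and (0.8, 9000, 0.4). With the distance formula above, the result depends almost entirely on the middle number, even though the three features should have roughly equal influence. So we want the middle values to lie between 0 and 1 as well. This process is called normalization.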
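How strongly the middle feature dominates can be checked numerically. The sketch below computes each feature's share of the squared distance for the two example points; `p`, `q` and `share` are names I chose for the illustration.

```python
import numpy as np

p = np.array([1.0, 10000.0, 0.5])
q = np.array([0.8, 9000.0, 0.4])

sq_diff = (p - q) ** 2            # squared difference per feature
share = sq_diff / sq_diff.sum()   # fraction of the squared distance per feature

print(sq_diff)
print(share)  # the middle entry is very close to 1
```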

# Normalize the feature values (map each numeric feature into the range 0 to 1)
def autoNorm(dataSet):
    minVals = dataSet.min(0)  # minimum of each column
    maxVals = dataSet.max(0)  # maximum of each column
    print(minVals)
    print(maxVals)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))  # subtract the column minimum
    normDataSet = normDataSet/np.tile(ranges, (m, 1))  # divide by the column range
    return normDataSet, ranges, minVals

In other words, every value is replaced by value = (value - minValue) / (maxValue - minValue).
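The same min-max scaling can be written with NumPy broadcasting, which makes the per-column arithmetic easy to verify. This is a minimal sketch on made-up data: the first two rows are the example points used earlier, and the third row is an extra value I added so that each column has a distinct minimum and maximum.

```python
import numpy as np

data = np.array([[1.0, 10000.0, 0.5],
                 [0.8,  9000.0, 0.4],
                 [0.2,  2000.0, 0.9]])

min_vals = data.min(axis=0)
ranges = data.max(axis=0) - min_vals
norm = (data - min_vals) / ranges   # broadcasting applies the formula column by column

print(norm)
print(norm.min(axis=0))  # [0. 0. 0.]
print(norm.max(axis=0))  # [1. 1. 1.]
```

For instance, 9000 in the middle column becomes (9000 - 2000) / (10000 - 2000) = 0.875.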

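Call it once more:

[Screenshot: calling the function]

That is most of the work. Next comes testing. Hold out one tenth of the data set as the test set; because the records are already in random order, simply taking the first 10% is enough.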
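The hold-out idea can be sketched on synthetic data before applying it to the dating file. Everything here is an assumption for illustration: the two Gaussian clusters stand in for real data, `predict` is a hypothetical helper, and the explicit shuffle covers the case where the records are not known to be in random order.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic stand-in data: two clusters with labels 1 and 2
n = 200
features = np.vstack([rng.normal(0.0, 1.0, (n // 2, 3)),
                      rng.normal(4.0, 1.0, (n // 2, 3))])
labels = np.array([1] * (n // 2) + [2] * (n // 2))

# Shuffle once so that "the first 10%" is a random sample
order = rng.permutation(n)
features, labels = features[order], labels[order]

ho_ratio = 0.10
num_test = int(n * ho_ratio)
test_X, test_y = features[:num_test], labels[:num_test]
train_X, train_y = features[num_test:], labels[num_test:]

def predict(x, k=3):
    # Majority vote among the k training rows closest to x
    nearest = np.argsort(((train_X - x) ** 2).sum(axis=1))[:k]
    values, counts = np.unique(train_y[nearest], return_counts=True)
    return values[np.argmax(counts)]

errors = sum(predict(x) != y for x, y in zip(test_X, test_y))
error_rate = errors / num_test
print("error rate:", error_rate)
```

The test rows are never part of the training rows, which is what makes the error rate a fair estimate.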

# Test the classifier
def datingClassTest():
    hoRatio = 0.10  # fraction of the data held out as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)  # boundary index: rows before it are test data, the rest is training data
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify(normMat[i,:], normMat[numTestVecs:m, :],
                                   datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
             % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))

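I find the way this function is written quite clever and learned something from it. Now call it.

The error rate is 5%, while the book reports 2%. The difference may come from the data.
Next we can put the algorithm to use.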

# Use the algorithm
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix("datingTestSet2.txt")
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = np.array([ffMiles, percentTats, iceCream])
    # Scale the input with the training minVals and ranges so it is comparable with normMat
    classifierResult = classify((inArr - minVals)/ranges, normMat, datingLabels, 3)
    print("You will probably like this person: ", resultList[classifierResult - 1])

Example:
[Screenshot: calling the function]
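One detail worth checking when classifying new input: the query has to be scaled with the same minimum and range values as the training data before it is compared with the normalized matrix. The sketch below shows the effect on three made-up training rows; the numbers and the names `norm_train` and `norm_query` are assumptions for illustration.

```python
import numpy as np

train = np.array([[40000.0,  8.0, 1.0],
                  [10000.0,  2.0, 0.5],
                  [70000.0, 12.0, 1.5]])

min_vals = train.min(axis=0)
ranges = train.max(axis=0) - min_vals
norm_train = (train - min_vals) / ranges

query = np.array([40000.0, 8.0, 1.0])     # identical to the first training row
norm_query = (query - min_vals) / ranges  # scaled with the training statistics

raw_dist = np.linalg.norm(norm_train - query, axis=1)
scaled_dist = np.linalg.norm(norm_train - norm_query, axis=1)

print(raw_dist)     # every distance is huge, so the neighbors are meaningless
print(scaled_dist)  # the first distance is 0, as expected for an identical row
```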

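2. Handwriting recognition system

This is a case study that uses kNN to classify images. The classifier is the same classify function from the example above.

Each file stores one 32*32 image. Treating every pixel as a feature gives 1024 features, which form one row of data. The code is fairly easy to follow, so I paste it here directly.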

import numpy as np
from os import listdir  # needed to list every file in a directory
import operator

# Read the data of one file into a 1x1024 vector
def img2vector(filename):
    returnVect = np.zeros((1, 1024))  # initialize the data row
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32*i+j] = int(lineStr[j])  # row i, column j goes to position 32*i+j
    return returnVect

# Test
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('digits/trainingDigits')
    m = len(trainingFileList)
    trainingMat = np.zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])  # the label is the number before the underscore
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('digits/trainingDigits/%s' % fileNameStr)
    testFileList = listdir('digits/testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        classifierResult = classify(vectorUnderTest, trainingMat, hwLabels, 3)
        print('the classifier came back with: %d, the real answer is: %d'
             % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount/float(mTest)))

Run it:
[Image: output of the run]
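The two pieces of bookkeeping in the code above, flattening the image and reading the label from the file name, can be tried on a smaller scale. This sketch uses a made-up 4x4 grid instead of a 32x32 file, and "3_12.txt" is a hypothetical file name that follows the label-before-underscore pattern the code parses.

```python
import numpy as np

# A tiny 4x4 stand-in for the 32x32 text images (made-up content)
lines = ["0110",
         "1001",
         "1001",
         "0110"]
size = 4

vect = np.zeros((1, size * size))
for i in range(size):
    for j in range(size):
        vect[0, size * i + j] = int(lines[i][j])  # row i, column j goes to position size*i + j

print(vect)

# The class label is the part of the file name before the underscore
fileNameStr = "3_12.txt"  # hypothetical file name
label = int(fileNameStr.split('.')[0].split('_')[0])
print(label)  # 3
```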

End of Chapter 2.
