[Machine Learning in Action] Decision Tree Algorithm: Predicting Contact Lens Types

Author: 爱写BUG的老冉

Last modified: 2022-12-07 17:32:35

Category: Machine Learning. Tags: decision tree, information entropy

First published: 2021-04-15 11:47:54

Article link: https://blog.csdn.net/RangeLZ/article/details/115720269

0. Collecting the data

The dataset used here is the lenses.txt file provided with Machine Learning in Action. Its contents are as follows:

young	myope	no	reduced	no lenses
young	myope	no	normal	soft
young	myope	yes	reduced	no lenses
young	myope	yes	normal	hard
young	hyper	no	reduced	no lenses
young	hyper	no	normal	soft
young	hyper	yes	reduced	no lenses
young	hyper	yes	normal	hard
pre	myope	no	reduced	no lenses
pre	myope	no	normal	soft
pre	myope	yes	reduced	no lenses
pre	myope	yes	normal	hard
pre	hyper	no	reduced	no lenses
pre	hyper	no	normal	soft
pre	hyper	yes	reduced	no lenses
pre	hyper	yes	normal	no lenses
presbyopic	myope	no	reduced	no lenses
presbyopic	myope	no	normal	no lenses
presbyopic	myope	yes	reduced	no lenses
presbyopic	myope	yes	normal	hard
presbyopic	hyper	no	reduced	no lenses
presbyopic	hyper	no	normal	soft
presbyopic	hyper	yes	reduced	no lenses
presbyopic	hyper	yes	normal	no lenses

The columns are, in order, age, prescript, astigmatic and tearRate. The last column is the contact lens type, which serves as the class label.

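1. Preparing the data: parsing tab-separated lines

Because the data file separates its columns with Tab characters, the first step is to read each line and split it into its fields.

The code is shown below. strip() removes leading and trailing whitespace from each line, including the trailing newline, and split('\t') breaks the line apart at every '\t' (that is, Tab).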

with open('lenses.txt') as fr: # open the dataset file; it is closed automatically afterwards
    lenses = [inst.strip().split('\t') for inst in fr.readlines()] # parse the tab-separated lines
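
To see what this parsing step produces without needing the file on disk, here is a minimal sketch that applies the same strip()/split('\t') logic to a two-line in-memory sample. The `sample` string and the use of `io.StringIO` are assumptions made for illustration and are not part of the original code.

```python
import io

# Two lines copied from lenses.txt, held in memory instead of on disk (illustrative assumption)
sample = "young\tmyope\tno\treduced\tno lenses\nyoung\tmyope\tno\tnormal\tsoft\n"

fr = io.StringIO(sample)  # behaves like an open text file
lenses = [inst.strip().split('\t') for inst in fr.readlines()]

for row in lenses:
    print(row)
```

Each row becomes a list of five strings. Because the split happens only at Tab characters, a label such as 'no lenses' stays together as a single field even though it contains a space.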

The lenses.txt file does not name its columns, so I store the column names in the lensesLabels variable.

lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']

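With the data ready, we can start building the decision tree.

2. Building the decision tree

Decision tree algorithm (Decision Tree): a decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class.

Advantages: low computational cost, results that are easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features.

Disadvantages: prone to overfitting.

Applicable data types: numeric and nominal. The tree-building code in this article works on nominal values only, so numeric features would have to be discretized first.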

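2.1 Information gain

The guiding principle for splitting a dataset is to make disordered data more ordered. The change in information before and after a split is called the information gain, and here we use Shannon entropy to measure it.

If the items to be classified can fall into several classes, the information of the symbol $x_i$ is defined as: $l(x_i) = -\log_2 p(x_i)$

where $p(x_i)$ is the probability of choosing that class.

To compute the entropy, we need the expected value of the information over all possible classes, which is given by the formula below (where $n$ is the number of classes):
$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$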

from math import log

# Compute the Shannon entropy of a dataset
def calcShannonEnt(dataSet):
    numEntries = len(dataSet) # total number of instances in the dataset
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1] # the last element of each row is the class label
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0 # first occurrence: start the count at 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0 # accumulates the Shannon entropy
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries # probability p(x_i) of this class
        shannonEnt -= prob * log(prob, 2) # add the term -p(x_i) * log2 p(x_i)
    return shannonEnt

The higher the entropy, the more mixed the class labels are, so we can use the Shannon entropy to decide how to split the dataset.
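
As a concrete check of the formula, the sketch below computes the entropy of the class column of lenses.txt. The label counts (15 "no lenses", 5 "soft", 4 "hard") are read off the listing in section 0. The helper name `entropy` and the use of `collections.Counter` are my own choices for this illustration, not the article's code.

```python
from collections import Counter
from math import log2

# Class labels of lenses.txt: 15 "no lenses", 5 "soft", 4 "hard" (counted from the listing)
labels = ["no lenses"] * 15 + ["soft"] * 5 + ["hard"] * 4

def entropy(class_labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(class_labels)
    counts = Counter(class_labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

base_entropy = entropy(labels)              # mixed labels: entropy well above 0
pure_entropy = entropy(["no lenses"] * 12)  # a subset with a single class: entropy 0
print(round(base_entropy, 4), pure_entropy)
```

The full dataset comes out at roughly 1.33 bits, while a subset containing a single class has entropy 0. That is the sense in which a good split turns mixed data into more ordered data.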

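2.2 Splitting the dataset

First, extract the rows that have a given value for the chosen feature, and drop that feature's column from them.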

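# Parameters: the dataset to split, the index of the feature to split on, and the feature value to keep
def splitDataSet(dataSet, axis, value):
    retDataSet = [] # build a new list so the original dataset is not modified
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis] # columns before the chosen feature
            reducedFeatVec.extend(featVec[axis + 1 :]) # columns after the chosen feature
            retDataSet.append(reducedFeatVec) # the row with the chosen feature removed
    return retDataSet # matching rows, without the chosen feature column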
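
A small sketch of what this function returns. The three toy rows below are hypothetical and were chosen only to keep the output short. The function body repeats the logic above in condensed form so the snippet runs on its own.

```python
def splitDataSet(dataSet, axis, value):
    # Same logic as the function above, repeated so this snippet is self-contained
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            retDataSet.append(featVec[:axis] + featVec[axis + 1:])
    return retDataSet

# Hypothetical toy rows: [astigmatic, tearRate, class label]
toy = [
    ['no', 'reduced', 'no lenses'],
    ['no', 'normal', 'soft'],
    ['yes', 'normal', 'hard'],
]

normal_rows = splitDataSet(toy, 1, 'normal')  # keep rows whose tearRate is 'normal'
print(normal_rows)
```

Only the two rows with tearRate equal to 'normal' are kept, and the tearRate column itself is dropped from them. The original `toy` list is left untouched, which matters because the same dataset is split repeatedly while the tree is built.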

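Next, compute the Shannon entropy that results from splitting on each feature in turn, and pick the feature that gives the best split, that is, the largest information gain.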

# Choose the best way to split the dataset
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1 # number of features per row (the last column is the class label)
    baseEntropy = calcShannonEnt(dataSet) # entropy of the class labels before any split
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        # collect the value of feature i from every row of dataSet
        uniqueVals = set(featList) # set() gives the distinct values of this feature
        newEntropy = 0.0
        for value in uniqueVals: # entropy after splitting on feature i
            subDataSet = splitDataSet(dataSet, i, value) # rows where feature i equals value
            prob = len(subDataSet) / float(len(dataSet)) # fraction of rows that fall in this subset
            newEntropy += prob * calcShannonEnt(subDataSet) # weighted sum of the subset entropies
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain: # keep the feature with the largest information gain so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
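
To make the information gain concrete, here is a worked sketch for the tearRate feature of lenses.txt. It works directly from class counts read off the listing in section 0 rather than from the file. The helper `entropy_from_counts` is a hypothetical name introduced only for this example.

```python
from math import log2

def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

# Class counts read off the lenses.txt listing, in the order [no lenses, soft, hard]
whole_set = [15, 5, 4]     # all 24 rows
tear_reduced = [12, 0, 0]  # the 12 rows with tearRate == 'reduced'
tear_normal = [3, 5, 4]    # the 12 rows with tearRate == 'normal'

base_entropy = entropy_from_counts(whole_set)
# Weighted sum of the subset entropies, as in chooseBestFeatureToSplit
new_entropy = sum(
    (sum(subset) / sum(whole_set)) * entropy_from_counts(subset)
    for subset in (tear_reduced, tear_normal)
)
info_gain = base_entropy - new_entropy
print(round(base_entropy, 4), round(new_entropy, 4), round(info_gain, 4))
```

Splitting on tearRate lowers the entropy from about 1.33 to about 0.78 bits, a gain of roughly 0.55. Repeating the same calculation by hand for the other three features gives smaller gains (astigmatic is next at roughly 0.38), so tearRate is the feature chosen for the first split.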

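At this point we can find the best split for the current data. A decision tree is not built from a single split, though. It keeps splitting level by level, so the next step is to build the tree recursively.

2.3 Building the decision tree recursively

How it works: start from the original dataset and split it on the best feature. A feature may have more than two values, so a split may produce more than two branches. After the first split, the data is passed down each branch to the next node, where it can be split again. This is why recursion is a natural way to process the dataset.

The recursion stops when the program has used up all the features available for splitting, or when all instances under a branch have the same class. If all instances have the same class, we get a leaf node (also called a terminating block). Any data that reaches a leaf node belongs to that leaf's class.

If all features have been used but the class labels in a branch are still not identical, the leaf is decided by majority vote. The function below takes a list of class names, builds a dictionary whose keys are the unique values in classList and whose values are the number of times each class label occurs, sorts the dictionary items by count with the help of operator, and returns the class name that occurs most often.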

import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount: classCount[vote] = 0
        classCount[vote] += 1
    # items() replaces the Python 2-only iteritems(); sort by count, largest first
    sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)
    return sortedClassCount[0][0] # name of the class that occurs most often
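
The same majority vote can be written more compactly with the standard library. The sketch below is an alternative for comparison, not the article's implementation. The name `majority_class` is hypothetical.

```python
from collections import Counter

def majority_class(class_list):
    """Return the most frequent label in class_list (hypothetical helper name)."""
    # most_common(1) returns a list holding one (label, count) pair
    return Counter(class_list).most_common(1)[0][0]

winner = majority_class(['soft', 'no lenses', 'no lenses', 'hard'])
print(winner)
```

Here 'no lenses' occurs twice and the other labels once, so it wins the vote. When two labels are tied, which one is returned depends on the order in which they were first seen, so a tie should not be read as a meaningful prediction.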

Now we can build the tree. The variable myTree holds nested dictionaries that describe the tree structure. Once the function returns, the tree is complete.

# Build the tree. Two input parameters: the dataset and the list of feature labels
def creatTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # collect the class label (the last element) of every row in dataSet
    if classList.count(classList[0]) == len(classList): # all classes identical: stop splitting
        return classList[0]
    if len(dataSet[0]) == 1: # all features used up: return the majority class
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet) # choose the best feature to split on
    bestFeatLabel = labels[bestFeat] # text label of that feature
    myTree = {bestFeatLabel : {}}

    # labels left after removing the chosen feature, built as a new list
    # so that the caller's list is not modified
    remainingLabels = labels[:bestFeat] + labels[bestFeat + 1:]
    # all distinct values of the chosen feature
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)

    for value in uniqueVals:
        subLabels = remainingLabels[:]
        myTree[bestFeatLabel][value] = creatTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

3. Drawing the tree in Python with Matplotlib annotations

This part is mostly about Matplotlib plotting and has little to do with machine learning, so the code is not explained in detail here.

import matplotlib.pyplot as plt

# Box styles for decision nodes and leaf nodes, and the arrow style
decisionNode = dict(boxstyle = "sawtooth", fc = "0.8")
leafNode = dict(boxstyle = "round4", fc = "0.8")
arrow_args = dict(arrowstyle = "<-")

# Draw a node as an annotation, with an arrow pointing from its parent
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy = parentPt, xycoords = 'axes fraction',
                            xytext = centerPt, textcoords = 'axes fraction',
                            va = "center", ha = "center", bbox = nodeType, arrowprops = arrow_args)

# Get the number of leaf nodes and the depth of the tree
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict): # a dict is a subtree, anything else is a leaf
            numLeafs += getNumLeafs(secondDict[key])
        else: numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else: thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

# Helpers for plotTree
# Write the branch's feature value halfway between the parent and child nodes
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

# Draw the tree recursively; node positions are derived from the tree's width and depth
def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    # label the branch with the feature value that leads to this node
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):  # a dict is a subtree, anything else is a leaf node
            plotTree(secondDict[key], cntrPt, str(key))  # recurse into the subtree
        else:  # leaf node: draw it directly
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

# Entry point for drawing: sets up the figure, then calls plotTree to lay out the nodes
def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)  # no ticks
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.axis('off') # hide the axes
    plt.show()
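
The layout above depends on two numbers, the leaf count (which sets the width) and the depth (which sets the height). The sketch below shows what getNumLeafs and getTreeDepth compute, using a small hand-written tree in the same nested-dictionary format. The tree and the function names `count_leaves` and `tree_depth` are assumptions for illustration. The tree is not the output of the code in this article.

```python
def count_leaves(tree):
    """Count leaf nodes in a nested-dict tree (same idea as getNumLeafs)."""
    root = next(iter(tree))  # the single key at this level is the feature name
    total = 0
    for child in tree[root].values():
        total += count_leaves(child) if isinstance(child, dict) else 1
    return total

def tree_depth(tree):
    """Number of decision levels in a nested-dict tree (same idea as getTreeDepth)."""
    root = next(iter(tree))
    return max(
        1 + tree_depth(child) if isinstance(child, dict) else 1
        for child in tree[root].values()
    )

# Hand-written, simplified tree used only for this example
toy_tree = {'tearRate': {'reduced': 'no lenses',
                         'normal': {'astigmatic': {'yes': 'hard', 'no': 'soft'}}}}

print(count_leaves(toy_tree), tree_depth(toy_tree))
```

This toy tree has three leaves and two levels of decisions, so createPlot would divide the x axis into three slots and the y axis into two steps.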

4. Using the algorithm

Main function code:

if __name__ == "__main__":
    with open('lenses.txt') as fr: # open the dataset file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()] # parse the tab-separated lines
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    lensesTree = creatTree(lenses, lensesLabels)
    createPlot(lensesTree)
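
The main function stops at drawing the tree. To actually predict a lens type for a new patient, the nested dictionary has to be walked from the root down to a leaf. The sketch below shows one way to do that. The `classify` function and the simplified `toy_tree` are assumptions added for illustration. They are not part of the code above, and the toy tree is hand-written rather than the output of creatTree.

```python
def classify(tree, feat_labels, sample):
    """Walk a nested-dict decision tree until a leaf (class label) is reached."""
    root = next(iter(tree))                  # feature tested at this node
    value = sample[feat_labels.index(root)]  # the sample's value for that feature
    child = tree[root][value]                # raises KeyError for a value not in the tree
    if isinstance(child, dict):              # subtree: keep descending
        return classify(child, feat_labels, sample)
    return child                             # leaf: the predicted class

# Hand-written, simplified tree for illustration; not the output of creatTree
toy_tree = {'tearRate': {'reduced': 'no lenses',
                         'normal': {'astigmatic': {'yes': 'hard', 'no': 'soft'}}}}
labels = ['age', 'prescript', 'astigmatic', 'tearRate']

prediction = classify(toy_tree, labels, ['young', 'myope', 'yes', 'normal'])
print(prediction)
```

With a tree built from the real data, the same function could be called as classify(lensesTree, labels, sample), where labels is the full list of four feature names in column order.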