Machine Learning in Action, Chapter 3: Decision Tree Algorithm (ID3)

wyypersist

Article link: https://blog.csdn.net/weixin_43749999/article/details/121548433

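Study notes, afternoon of 2021.11.23

In the flowchart, rectangles represent decision blocks and ovals represent terminating blocks.

The left and right arrows leading out of a decision block are called branches.

The main advantage of a decision tree is that its data representation is very easy to understand.

The process by which a machine creates rules from a dataset is the process of machine learning.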

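3.1 Constructing a decision tree

Pros and cons:

Pros: insensitive to missing intermediate values; can handle irrelevant feature data.

Cons: prone to overfitting.

Applicable data types: numeric and nominal.

The first question when constructing a decision tree is: which feature plays the decisive role in splitting the data?

Every feature has to be evaluated. Once all features have been evaluated, the original dataset is split on the best one into several data subsets.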

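General workflow for decision trees:

Collect data
Prepare data: numeric data must be discretized.
Analyze data
Train the algorithm
Test the algorithm
Use the algorithm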
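
The "prepare data" step says numeric values must be discretized before ID3 can split on them. A minimal sketch of one simple option, equal-width binning, is given here. The function name `discretize` and the parameter `num_bins` are illustrative assumptions; the notes do not say which discretization method should be used.

```python
def discretize(values, num_bins=3):
    """Map numeric values to integer bin indices 0..num_bins-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical, so they all fall into a single bin.
        return [0 for _ in values]
    # The maximum value would land in bin num_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

ages = [1.0, 2.5, 4.0, 7.0, 10.0]
print(discretize(ages))  # [0, 0, 1, 2, 2]
```

The resulting bin indices can then be treated as nominal feature values by the tree-building code.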

The algorithm used here to split the dataset is ID3.

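3.1.1 Information gain

Compute the information gain of each feature; the feature with the highest information gain is the best one for splitting the current dataset.

Entropy is defined as the expected value of the information.

Definition of information:

The information of $x_i$ is defined as: $l(x_i) = -\log_2 p(x_i)$

where $p(x_i)$ is the probability of choosing that class.

Next, the expected information over all possible classes is computed with the following formula:

$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where $n$ is the number of classes.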
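
To make the formula concrete, here is a small worked example. It is only a sketch: the helper name `entropy` and the five sample labels are assumptions chosen for illustration. With 2 "yes" and 3 "no" labels, H = -(0.4 log2 0.4 + 0.6 log2 0.6), which is about 0.971 bits.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

labels = ['yes', 'yes', 'no', 'no', 'no']
h = entropy(labels)
print(round(h, 4))  # 0.971
```

A list that contains only one class has entropy 0, and the entropy grows as the classes become more evenly mixed.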

The code for computing the Shannon entropy is as follows:

Python code:

from math import log

# Compute the Shannon entropy of a dataset
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the occurrences of each unique class label
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0

        labelCounts[currentLabel] += 1

    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt

The code that finds the best feature for splitting the dataset:

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)  # entropy of the original dataset

    # initial values for the best information gain and the best feature
    bestInfoGain = 0.0
    bestFeature = -1

    # compute the entropy after splitting on each feature to find the best one
    for i in range(numFeatures):  # iterate over all the features

        # collect the value of feature i from every example
        featList = [example[i] for example in dataSet]

        # reduce them to the set of unique values
        uniqueVals = set(featList)
        # weighted entropy of the dataset after the split
        newEntropy = 0.0
        # add up the entropy of the subset for each feature value
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)

        # information gain, i.e. the reduction in entropy
        infoGain = baseEntropy - newEntropy

        # compare this gain with the best gain so far
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain  # if better than the current best, keep it
            bestFeature = i

    return bestFeature  # returns an integer feature index
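
The function above calls `splitDataSet`, which these notes do not list. The sketch below shows one way such a helper could behave, under the assumption that it returns the rows whose feature at index `axis` equals `value`, with that feature column removed. The names `split_data_set`, `info_gain`, and `entropy`, as well as the five-row toy dataset, are illustrative and not taken from the notes.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels stored in the last column."""
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def split_data_set(rows, axis, value):
    """Rows whose feature `axis` equals `value`, with that column dropped."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

def info_gain(rows, axis):
    """Entropy of `rows` minus the weighted entropy after splitting on `axis`."""
    remainder = 0.0
    for value in set(row[axis] for row in rows):
        subset = split_data_set(rows, axis, value)
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

toy = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(split_data_set(toy, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
gains = [info_gain(toy, i) for i in range(2)]
print([round(g, 3) for g in gains])  # [0.42, 0.171]
```

Feature 0 has the larger gain on this toy data, so an ID3-style split would choose it first.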

The function that builds the decision tree recursively:

Python implementation:

# Build the decision tree
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]  # class labels of all examples
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all of the classes are equal

    if len(dataSet[0]) == 1:  # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)  # only the label column is left: return the majority class

    # choose the best feature to split on
    bestFeat = chooseBestFeatureToSplit(dataSet)

    # name of the best feature
    bestFeatLabel = labels[bestFeat]

    # initialize the tree: the feature name is the key and its value is a nested dict
    myTree = {bestFeatLabel: {}}

    # remove the chosen feature, working on a copy so the caller's list is not modified
    labels = labels[:]
    del(labels[bestFeat])

    # all values that the chosen feature takes in the dataset
    featValues = [example[bestFeat] for example in dataSet]
    # reduce them to the set of unique values
    uniqueVals = set(featValues)

    # build one subtree per unique value
    for value in uniqueVals:
        # copy the remaining labels so recursive calls don't mess up existing labels
        subLabels = labels[:]

        # recurse to fill in the subtree for this value
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
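
`createTree` falls back on `majorityCnt` when no features are left, but the notes do not list that helper. A minimal sketch, assuming it simply returns the most frequent class label (the name `majority_cnt` and the use of `collections.Counter` are choices made for this example):

```python
from collections import Counter

def majority_cnt(class_list):
    """Return the class label that occurs most often in class_list."""
    return Counter(class_list).most_common(1)[0][0]

print(majority_cnt(['yes', 'no', 'no']))  # no
```

This majority vote is what gives a leaf a label when examples with identical remaining features still disagree on the class.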

// Notes end here for 2021.11.23, 20:17

Notes resume on 2021.11.25, 18:13

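3.2 Annotating and plotting the decision tree in Python with the matplotlib visualization library

3.2.1 Matplotlib annotations

The matplotlib library provides an annotation tool, annotations.

Annotations make it easier to explain what the data means.

Drawing tree nodes with text annotations

Python code:

Style settings:

import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

The function that draws a tree node: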

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
             xytext=centerPt, textcoords='axes fraction',
             va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

The createPlot() function it relies on (simple version) is as follows:

def createPlot():
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks for demo purposes
    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()

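3.2.2 Constructing the annotated tree

Before constructing the annotated tree, two functions are needed to get the depth and the width of the tree:

The function that gets the width of the tree, which corresponds to the range of the x axis:

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]  # Python 3: dict.keys() cannot be indexed directly
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # dictionaries are subtrees, anything else is a leaf node
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs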

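The function that gets the depth of the tree, which corresponds to the range of the y axis:

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]  # Python 3: dict.keys() cannot be indexed directly
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # dictionaries are subtrees, anything else is a leaf node
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

The depth is obtained with a recursive depth-first search (DFS); the width is obtained recursively as well.

To make it easy to get a tree to test with, a function that stores some tree structures is used here: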

def retrieveTree(i):
    listOfTrees =[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                  {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                  ]
    return listOfTrees[i]
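
As a quick check of what the width and depth functions should report for the two stored trees, here is a compact Python 3 sketch. The names `count_leaves` and `tree_depth` are assumptions made for this example; they follow the same recursive idea as `getNumLeafs` and `getTreeDepth`.

```python
def count_leaves(tree):
    """Number of leaf nodes in a nested-dict decision tree."""
    root = next(iter(tree))  # the single top-level key is the feature name
    total = 0
    for child in tree[root].values():
        total += count_leaves(child) if isinstance(child, dict) else 1
    return total

def tree_depth(tree):
    """Number of decision levels on the longest path from the root to a leaf."""
    root = next(iter(tree))
    return max(
        (1 + tree_depth(child)) if isinstance(child, dict) else 1
        for child in tree[root].values()
    )

tree0 = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
tree1 = {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
print(count_leaves(tree0), tree_depth(tree0))  # 3 2
print(count_leaves(tree1), tree_depth(tree1))  # 4 3
```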

The plotTree() function is defined here:

Python implementation:

def plotTree(myTree, parentPt, nodeTxt):  # the first key tells you what feature was split on
    numLeafs = getNumLeafs(myTree)  # this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]  # the text label for this node (Python 3 form)
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # dictionaries are subtrees, anything else is a leaf node
            plotTree(secondDict[key], cntrPt, str(key))  # recursion
        else:  # it's a leaf node, so draw the leaf node
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD

# if you do get a dictionary you know it's a tree, and the first element will be another dict

A function that labels the edge between a parent node and a child node is also defined:

def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

The code of the new createPlot() function:

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)  # no ticks
    # createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

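Note:

The original Python 2 versions of these functions obtain the first key, which is used to look up secondDict, with myTree.keys()[0]. Unfortunately, Python 3 no longer supports this, because dict.keys() returns a view that cannot be indexed; use list(myTree.keys())[0] instead.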

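3.3 Testing and storing the classifier

3.3.1 Classifying with the decision tree

The function that classifies with the decision tree:

Python implementation:

def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]  # Python 3: dict.keys() cannot be indexed directly
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # position of this feature in the test vector
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):  # a subtree: keep descending
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:  # a leaf: this is the class label
        classLabel = valueOfFeat
    return classLabel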
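
A short usage sketch may help show how the tree is walked. It applies a compact Python 3 variant of the same idea to the first stored tree; the snake_case names and the two test vectors are assumptions made for this example.

```python
def classify(tree, feat_labels, test_vec):
    """Walk the nested-dict tree until a leaf (class label) is reached."""
    root = next(iter(tree))               # feature name tested at this node
    feat_index = feat_labels.index(root)  # position of that feature in test_vec
    subtree = tree[root][test_vec[feat_index]]
    if isinstance(subtree, dict):
        return classify(subtree, feat_labels, test_vec)
    return subtree

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify(tree, labels, [1, 0]))  # no
print(classify(tree, labels, [1, 1]))  # yes
```

A feature value that never appeared during training raises a KeyError in this sketch, as it also would in the function above.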

An advantage of decision trees:

A decision tree that has been built can be saved for later use.
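
The notes do not say how the tree is saved. One common option in Python is the standard `pickle` module; the sketch below assumes that approach. The helper names `store_tree` and `grab_tree` and the file name `classifier.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

def store_tree(tree, filename):
    """Serialize the nested-dict tree to a file."""
    with open(filename, 'wb') as f:
        pickle.dump(tree, f)

def grab_tree(filename):
    """Load a previously stored tree from a file."""
    with open(filename, 'rb') as f:
        return pickle.load(f)

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')
store_tree(tree, path)
restored = grab_tree(path)
print(restored == tree)  # True
```

Saving the tree means it does not have to be rebuilt from the training data every time a classification is needed.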

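3.4 Example: using a decision tree to predict contact lens types

Implementation:

import trees        # assumes the tree-building functions above are saved in trees.py
import treePlotter  # assumes the plotting functions above are saved in treePlotter.py

# Test: predict the contact lens type
# Split each line of the dataset on tabs
with open('lenses.txt') as fr:
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]

lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']

lensesTree = trees.createTree(lenses, lensesLabels)

print(lensesTree)

treePlotter.createPlot(lensesTree)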