Machine Learning in Action: Decision Trees

Latest recommended article published on 2022-10-05 19:00:27

晨风漱

Tags: decision tree, python, machine learning

Article link: https://blog.csdn.net/weixin_46819123/article/details/117230265


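This article walks through the decision tree learning process: how to construct a decision tree, compute information gain, split a dataset, and build the tree recursively. Python code shows how to create, plot, and test a decision tree, and how to store one with the pickle module. It also covers the strengths, weaknesses, and applicable data types of decision trees.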


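1.1 Constructing a Decision Tree
Decision trees
In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values. Each node in the tree represents an object, each branch represents a possible attribute value, and each leaf node corresponds to the value of the object described by the path from the root to that leaf. A decision tree has a single output; if several outputs are needed, a separate decision tree can be built for each one.
A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class.
Pros: low computational complexity, easy-to-understand output, insensitive to missing intermediate values, and able to handle irrelevant features.
Cons: prone to overfitting.
Works with: numeric and nominal values.
When constructing a decision tree, the first question to answer is which feature of the current dataset is decisive in splitting the data into classes. To find the decisive feature and get the best split, we must evaluate every feature. After this test, the original dataset is split into several subsets, which are distributed over all branches of the first decision point. If the data on a branch all belong to the same class, that data has been classified correctly and needs no further splitting. If the data in a subset do not all belong to the same class, the splitting process is repeated on that subset. Subsets are split with the same algorithm as the original dataset, until all data of the same class end up in one subset.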
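General approach to decision trees
(1) Collect: any method.
(2) Prepare: the tree-building algorithm used here works only on nominal values, so numeric values must be discretized.
(3) Analyze: any method. After the tree is built, inspect the plot to check that it matches expectations.
(4) Train: construct the tree data structure.
(5) Test: compute the error rate with the learned tree.
(6) Use: this step applies to any supervised learning algorithm; a decision tree additionally helps you understand the underlying meaning of the data.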
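Step (2) calls for discretizing numeric values but does not say how. One simple option, shown here only as an assumption and not as the method used in this article, is equal-width binning. The function name `discretize` and the sample values are hypothetical.

```python
def discretize(values, n_bins=3):
    """Map numeric values to integer bin labels 0..n_bins-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value lands in the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numeric feature column
heights = [1.0, 2.5, 4.0, 5.5, 7.0]
bins = discretize(heights, n_bins=3)
print(bins)  # [0, 0, 1, 2, 2]
```

The resulting bin labels can then be treated as nominal feature values by the tree-building code.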
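1.1.1 Information Gain
The change in information before and after splitting a dataset is called information gain. Once we know how to compute it, we can compute the information gain obtained by splitting the dataset on each feature; the feature with the highest information gain is the best choice.
Entropy is defined as the expected value of the information. Before making that precise, we need the definition of information. If the items to be classified can fall into several classes, the information of symbol x_i is defined as:
$$l(x_i) = -\log_2 p(x_i)$$
where p(x_i) is the probability of choosing that class.
To compute the entropy, we need the expected value of the information over all possible values of all classes, given by:
$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$
where n is the number of classes.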
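As a quick numeric check of the entropy formula, the sketch below evaluates it on two small label lists. The helper name `entropy` and the toy labels are assumptions made for illustration; they are not part of the article's listings.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # H = sum of -p * log2(p) over the observed classes
    return sum(-(c / total) * log2(c / total) for c in counts.values())


# Hypothetical toy labels: 2 "yes" and 3 "no"
mixed = ["yes", "yes", "no", "no", "no"]
# A set that contains a single class
pure = ["no", "no", "no"]

print(entropy(mixed))  # about 0.971 bits
print(entropy(pure))   # zero: there is no uncertainty left
```

The more evenly the classes are mixed, the higher the entropy; a subset with a single class has entropy zero, which is why the tree stops splitting there.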
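Listing 1-1 Computing the Shannon entropy of a given dataset

from math import log

# Compute the Shannon entropy of a given dataset
def calcShannonEnt(dataSet):
    # Number of instances in the dataset
    numEntries = len(dataSet)
    # Dictionary mapping each class label to its number of occurrences
    labelCounts = {}
    # Count the occurrences of each class label (last column of each row)
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        # Probability of this class, then its contribution to the entropy
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2)
    return shannonEnt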

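1.1.2 Splitting the Dataset
Listing 1-2 Splitting the dataset on a given feature

def splitDataSet(dataSet,axis,value):
    retDataSet = []
    for featVec in dataSet:
        # Keep only the rows whose feature at index axis equals value
        if featVec[axis] == value:
            # Copy the row without the feature that was split on
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet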
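To see what the split returns, here is a small usage sketch. It repeats the function in condensed form so the block runs on its own. The toy dataset is an assumption: the article does not list its data, so these rows were chosen to match the feature names "no surfacing" and "flippers" used in the example trees later on.

```python
def splitDataSet(dataSet, axis, value):
    """Rows whose feature at index axis equals value, with that feature removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            retDataSet.append(featVec[:axis] + featVec[axis + 1:])
    return retDataSet


# Hypothetical toy data: [no surfacing, flippers, class label]
data = [[1, 1, "yes"], [1, 1, "yes"], [1, 0, "no"], [0, 1, "no"], [0, 1, "no"]]

with_surfacing = splitDataSet(data, 0, 1)
without_surfacing = splitDataSet(data, 0, 0)
print(with_surfacing)     # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(without_surfacing)  # [[1, 'no'], [1, 'no']]
```

The split feature is dropped from every returned row, so each recursive call works with one column fewer.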

Listing 1-3 Choosing the best way to split the dataset

def chooseBestFeatureToSplit(dataSet):
    # Number of features per instance (the last column is the class label)
    numFeatures = len(dataSet[0]) - 1
    # Entropy (degree of disorder) of the original dataset
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    # Compute the information gain of every feature and keep the best one
    for i in range(numFeatures):
        # All values of feature i, e.g. featList = [1,1,0,1,1]
        featList = [example[i] for example in dataSet]
        # Set of distinct values of feature i
        uniqueVals = set(featList)
        # Weighted entropy of the subsets produced by splitting on feature i
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet,i,value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        # Information gain of splitting on feature i
        infoGain = baseEntropy - newEntropy
        # Keep the feature with the largest information gain seen so far
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    # Index of the most decisive feature
    return bestFeature
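The numbers behind this choice are easier to follow on a concrete case. The sketch below computes the information gain of each feature on a toy dataset. The helper names `entropy` and `info_gain` and the data are assumptions for illustration, written as a compact re-implementation and not as the article's own code.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Shannon entropy of the class labels (last column) of rows."""
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return sum(-(c / total) * log2(c / total) for c in counts.values())


def info_gain(rows, axis):
    """Entropy of rows minus the weighted entropy after splitting on column axis."""
    remainder = 0.0
    for value in {row[axis] for row in rows}:
        subset = [row for row in rows if row[axis] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


# Hypothetical toy data: [no surfacing, flippers, class label]
data = [[1, 1, "yes"], [1, 1, "yes"], [1, 0, "no"], [0, 1, "no"], [0, 1, "no"]]

gains = [info_gain(data, axis) for axis in range(2)]
best = max(range(2), key=lambda axis: gains[axis])
print(gains)  # roughly [0.420, 0.171]
print(best)   # 0
```

Splitting on feature 0 leaves one pure subset and one mostly pure subset, so it removes more uncertainty than feature 1.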

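1.1.3 Recursively Building the Decision Tree

import operator

# Return the class label that occurs most often in classList
def majorityCnt(classList):
    classCount= {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(),key = operator.itemgetter(1),reverse = True)
    return sortedClassCount[0][0]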

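Listing 1-4 Tree-building function

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    # Stop when all instances belong to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop when no features are left; return the majority class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    # Note: this deletes from the caller's list in place; pass a copy
    # (labels[:]) if the full label list is needed later, e.g. by classify
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        # Copy the labels so recursive calls do not interfere with each other
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
    return myTree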
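The recursion can be tried end to end with a condensed, self-contained sketch. The names `entropy`, `split`, `best_feature`, and `build_tree` and the toy dataset are assumptions for illustration; this is a compact re-implementation of the same idea, not the article's listings. Unlike `createTree`, it does not modify the label list passed in.

```python
from collections import Counter
from math import log2


def entropy(rows):
    total = len(rows)
    counts = Counter(r[-1] for r in rows)
    return sum(-(c / total) * log2(c / total) for c in counts.values())


def split(rows, axis, value):
    """Rows whose column axis equals value, with that column removed."""
    return [r[:axis] + r[axis + 1:] for r in rows if r[axis] == value]


def best_feature(rows):
    """Index of the feature with the largest information gain."""
    def gain(axis):
        values = {r[axis] for r in rows}
        remainder = sum(
            len(sub) / len(rows) * entropy(sub)
            for sub in (split(rows, axis, v) for v in values)
        )
        return entropy(rows) - remainder
    return max(range(len(rows[0]) - 1), key=gain)


def build_tree(rows, labels):
    classes = [r[-1] for r in rows]
    if len(set(classes)) == 1:
        # Pure node: every row has the same class
        return classes[0]
    if len(rows[0]) == 1:
        # No features left: fall back to the majority class
        return Counter(classes).most_common(1)[0][0]
    axis = best_feature(rows)
    # Build a new label list so the caller's list stays untouched
    rest = labels[:axis] + labels[axis + 1:]
    return {labels[axis]: {
        v: build_tree(split(rows, axis, v), rest)
        for v in sorted({r[axis] for r in rows})
    }}


# Hypothetical toy data: [no surfacing, flippers, class label]
data = [[1, 1, "yes"], [1, 1, "yes"], [1, 0, "no"], [0, 1, "no"], [0, 1, "no"]]
labels = ["no surfacing", "flippers"]
tree = build_tree(data, labels)
print(tree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```

The tree is a nested dictionary: each key is a feature label, each inner key a feature value, and each non-dictionary value a class label.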

1.2 Plotting Trees in Python with Matplotlib Annotations
Listing 1-5 Drawing tree nodes with text annotations

import matplotlib as mpl
import matplotlib.pyplot as plt
# Only needed when node labels contain Chinese text (requires the SimHei font)
mpl.rcParams["font.sans-serif"] = ["SimHei"]
# Define the text box and arrow styles
decisionNode = dict(boxstyle = "sawtooth",fc = "0.8")
leafNode = dict(boxstyle = "round4",fc = "0.8")
arrow_args = dict(arrowstyle = "<-")
# Draw an annotation with an arrow pointing from the parent to the node
def plotNode(nodeTxt,centerPt,parentPt,nodeType):
    createPlot.ax1.annotate(nodeTxt,xy=parentPt,xycoords = "axes fraction",
                            xytext = centerPt,textcoords = "axes fraction",
                            va = "center",ha = "center",bbox = nodeType,
                            arrowprops = arrow_args)
def createPlot():
    fig = plt.figure(1,facecolor = "white")
    fig.clf()
    # Store the axes as a function attribute so plotNode can reach them
    createPlot.ax1 = plt.subplot(111,frameon = False)
    plotNode("decision node",(0.5,0.1),(0.1,0.5),decisionNode)
    plotNode("leaf node",(0.8,0.1),(0.3,0.8),leafNode)
    plt.show()
createPlot()

1.2.2 Constructing the Annotated Tree
Listing 1-6 Getting the number of leaf nodes and the depth of the tree

def getNumLeafs(myTree):
    numLeafs = 0
    # The single top-level key is the feature label of this node
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        # A dict is a subtree; anything else is a leaf
        if type(secondDict[key]) == dict:
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        # Depth counts decision levels along the longest path
        if type(secondDict[key]) == dict:
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth
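A short sketch makes the two measures concrete. The helpers `count_leaves` and `tree_depth` are condensed stand-ins with hypothetical names, written so the block runs on its own; the two trees are the ones returned by `retrieveTree` in the next listing.

```python
def count_leaves(tree):
    """Number of leaf nodes in a nested-dict tree."""
    (children,) = tree.values()  # each level has exactly one feature label
    return sum(
        count_leaves(c) if isinstance(c, dict) else 1
        for c in children.values()
    )


def tree_depth(tree):
    """Number of decision levels along the longest root-to-leaf path."""
    (children,) = tree.values()
    return max(
        1 + tree_depth(c) if isinstance(c, dict) else 1
        for c in children.values()
    )


tree0 = {"no surfacing": {0: "no", 1: {"flippers": {0: "no", 1: "yes"}}}}
tree1 = {"no surfacing": {0: "no", 1: {"flippers": {
    0: {"head": {0: "no", 1: "yes"}}, 1: "no"}}}}

print(count_leaves(tree0), tree_depth(tree0))  # 3 2
print(count_leaves(tree1), tree_depth(tree1))  # 4 3
```

The leaf count sets the horizontal spacing of the plot and the depth sets the vertical spacing.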

Listing 1-7 The plotTree function

def retrieveTree(i):
    # Predefined trees for testing the plotting functions
    listOfTrees = [{"no surfacing":{0:"no",1:{"flippers":{0:"no",1:"yes"}}}},
                  {"no surfacing":{0:"no",1:{"flippers":{0:{"head":{0:"no",1:"yes"}},1:"no"}}}}
                  ]
    return listOfTrees[i]
def plotMidText(cntrPt,parentPt,txtString):
    # Write the branch value halfway between the parent and the child node
    xMid = (parentPt[0]-cntrPt[0])/2.0  + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0  + cntrPt[1]
    createPlot.ax1.text(xMid,yMid,txtString)
def plotTree(myTree,parentPt,nodeTxt):
    numLeafs = getNumLeafs(myTree)
    firstStr = list(myTree.keys())[0]
    # Center the decision node above the leaves of its subtree
    cntrPt = (plotTree.xOff + (1.0 +float(numLeafs))/2.0/plotTree.totalW,plotTree.yOff)
    plotMidText(cntrPt,parentPt,nodeTxt)
    plotNode(firstStr,cntrPt,parentPt,decisionNode)
    secondDict = myTree[firstStr]
    # Move down one level before drawing the children
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key])== dict:
            plotTree(secondDict[key],cntrPt,str(key))
        else:
            # Leaves are placed left to right, one slot each
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key],(plotTree.xOff,plotTree.yOff),cntrPt,leafNode)
            plotMidText((plotTree.xOff,plotTree.yOff),cntrPt,str(key))
    # Move back up once this subtree is finished
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
def createPlot(inTree):
    fig = plt.figure(1,facecolor = "white")
    fig.clf()
    axprops = dict(xticks = [],yticks = [])
    createPlot.ax1 = plt.subplot(111,frameon = False,**axprops)
    # Total width (leaf count) and total depth determine the spacing
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree,(0.5,1.0),"")
    plt.show()
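The offset bookkeeping in `plotTree` is easier to follow without the drawing calls. The sketch below mirrors the same arithmetic and returns the coordinates each node would get, so it needs no Matplotlib. The names `layout`, `count_leaves`, and `tree_depth` are hypothetical helpers introduced only for this illustration.

```python
def count_leaves(tree):
    (children,) = tree.values()
    return sum(count_leaves(c) if isinstance(c, dict) else 1
               for c in children.values())


def tree_depth(tree):
    (children,) = tree.values()
    return max(1 + tree_depth(c) if isinstance(c, dict) else 1
               for c in children.values())


def layout(tree):
    """Return [(text, x, y)] for every node, following plotTree's offsets."""
    total_w = float(count_leaves(tree))
    total_d = float(tree_depth(tree))
    state = {"x_off": -0.5 / total_w, "y_off": 1.0}
    placed = []

    def walk(node):
        (label, children), = node.items()
        # Center a decision node over the leaves beneath it
        x = state["x_off"] + (1.0 + count_leaves(node)) / 2.0 / total_w
        placed.append((label, x, state["y_off"]))
        state["y_off"] -= 1.0 / total_d      # go down one level
        for child in children.values():
            if isinstance(child, dict):
                walk(child)
            else:
                state["x_off"] += 1.0 / total_w  # next leaf slot
                placed.append((child, state["x_off"], state["y_off"]))
        state["y_off"] += 1.0 / total_d      # go back up

    walk(tree)
    return placed


tree0 = {"no surfacing": {0: "no", 1: {"flippers": {0: "no", 1: "yes"}}}}
positions = layout(tree0)
for text, x, y in positions:
    print(f"{text:>13}  x={x:.3f}  y={y:.3f}")
```

With three leaves, the leaf slots fall at x = 1/6, 1/2, and 5/6, and the root lands at x = 0.5 because it is centered over all of them.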

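1.3 Testing and Storing the Classifier
1.3.1 Test: Classifying with a Decision Tree
Listing 1-8 Classification function for a decision tree

def classify(inputTree,featLabels,testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    # Translate the feature label into an index into testVec
    featIndex = featLabels.index(firstStr)
    # Stays None if the test value matches no branch of this node
    classLabel = None
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            # Descend into the subtree, or return the class at a leaf
            if type(secondDict[key]) == dict:
                classLabel = classify(secondDict[key],featLabels,testVec)
            else:
                classLabel = secondDict[key]
    return classLabel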
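A brief usage sketch shows how a test vector travels down the tree. The function below is a condensed variant written so the block runs on its own, and the label list is an assumption matching the feature names in the first example tree. Note that `classify` needs the full label list, which is why it matters that `createTree` deletes entries from the list it is given.

```python
def classify(tree, feat_labels, test_vec):
    """Follow the branches that match test_vec until a leaf is reached."""
    (feature, branches), = tree.items()
    value = test_vec[feat_labels.index(feature)]
    subtree = branches[value]  # raises KeyError for a value never seen in training
    if isinstance(subtree, dict):
        return classify(subtree, feat_labels, test_vec)
    return subtree


tree = {"no surfacing": {0: "no", 1: {"flippers": {0: "no", 1: "yes"}}}}
labels = ["no surfacing", "flippers"]

print(classify(tree, labels, [1, 0]))  # no
print(classify(tree, labels, [1, 1]))  # yes
```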

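1.3.2 Use: Storing the Decision Tree
Listing 1-9 Storing a decision tree with the pickle module

import pickle

def storeTree(inputTree,filename):
    # Serialize the tree to a binary file
    with open(filename,"wb") as fw:
        pickle.dump(inputTree,fw)

def grabTree(filename):
    # Load a tree that was saved with storeTree
    with open(filename,"rb") as fr:
        return pickle.load(fr)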
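A round trip confirms that the stored tree comes back unchanged. This sketch writes to a temporary directory so it leaves no files behind; the file name `tree.pkl` is a hypothetical choice.

```python
import os
import pickle
import tempfile

tree = {"no surfacing": {0: "no", 1: {"flippers": {0: "no", 1: "yes"}}}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tree.pkl")  # hypothetical file name
    # Save the tree in binary mode
    with open(path, "wb") as fw:
        pickle.dump(tree, fw)
    # Load it back
    with open(path, "rb") as fr:
        restored = pickle.load(fr)

print(restored == tree)  # True
```

Storing the tree avoids rebuilding it for every classification. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.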
