Machine Learning in Action: Decision Trees

Latest recommended article published on 2022-05-11 21:06:03

小白终究会黑化

Category: Machine Learning in Action. Tags: decision tree, python, machine learning

Article link: https://blog.csdn.net/qq_34406071/article/details/109090108


Decision Trees

Constructing a Decision Tree
Plotting the Tree with Matplotlib Annotations
Testing and Storing the Classifier
Summary

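In this chapter

Introduction to decision trees
Measuring consistency in a data set
Building a decision tree with a recursive function
Plotting the tree with Matplotlib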

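One of the most important tasks of a decision tree is to make the knowledge contained in the data understandable. A decision tree can therefore be applied to an unfamiliar data set and extract a series of rules from it. This process, in which the machine creates rules from the data set, is the machine learning process.
Pros and cons of decision trees:

Pros: low computational cost, output that is easy to understand, insensitivity to missing intermediate values, and the ability to handle irrelevant features.
Cons: prone to overfitting.
Works with: numeric and nominal values.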

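Constructing a Decision Tree

When building a decision tree, the first question to answer is which feature of the current data set plays the decisive role in splitting the data into classes. Once every feature has been evaluated, the original data set is split into several subsets, which are distributed along the branches of the first decision point. If the data under a branch all belong to the same class, no further split is needed. Otherwise, the splitting process is repeated on that subset.
The pseudocode of the branch-creating function createBranch() is:

Check whether every item in the data set belongs to the same class:
If so return the class label
Else
    find the best feature for splitting the data
    split the data set
    create a branch node
        for each subset of the split
             call createBranch and add the result to the branch node
    return the branch node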

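Information Gain

The guiding principle for splitting data is to make disordered data more ordered.
Information gain is the change in information before and after the split.
The measure of the information in a set is called Shannon entropy, or entropy for short.
The information of symbol $x_i$ is defined as $l(x_i)=-\log_2 p(x_i)$, where $p(x_i)$ is the probability of choosing that class.
Entropy is defined as the expected value of the information: $H=-\displaystyle \sum_{i=1}^n p(x_i)\log_2 p(x_i)$, where $n$ is the number of classes.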

from math import log
# Compute the Shannon entropy of a data set
def calcShannonEnt(dataSet):
    numEntries=len(dataSet)
    labelCounts={}
    # Build a dictionary that counts every class that occurs
    for featVec in dataSet:
        currentLabel=featVec[-1]  # the key is the value in the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1
    shannonEnt=0.0
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries
        shannonEnt-=prob*log(prob,2)
    return shannonEnt

Enter the createDataSet() function:

def createDataSet():
    dataSet=[[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[1,0,'no'],[0,1,'no'],[0,1,'no']]
    labels=['no surfacing','flippers']
    return dataSet,labels
myDat,labels=createDataSet()
print(calcShannonEnt(myDat))

After calling the function, the computed entropy is 0.9182958340544896
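To build some intuition for this value, the sketch below computes the entropy directly from class counts. The helper name `entropy_from_counts` is hypothetical and not part of the original code, and it assumes every count is a positive integer.

```python
from collections import Counter
from math import log2

def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    counts = list(counts)
    total = sum(counts)
    # Sum of -p*log2(p) over all classes
    return sum(-(c / total) * log2(c / total) for c in counts)

# Last column of the six-row data set: two 'yes' and four 'no'
class_column = ['yes', 'yes', 'no', 'no', 'no', 'no']

print(entropy_from_counts(Counter(class_column).values()))  # about 0.918
print(entropy_from_counts([3, 3]))  # an even two-class split gives 1 bit
print(entropy_from_counts([6]))     # a pure set has zero entropy
```

The more evenly the classes are mixed, the higher the entropy. A set with a single class has none.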

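Splitting the Data Set

Compute the entropy of the result of splitting the data set on each feature, then decide which feature gives the best split.

Splitting the data set on a given feature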

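'''
Split the data set on a given feature
dataSet: the data set to split
axis: index of the feature to split on
value: the feature value that the returned rows must have
'''
def splitDataSet(dataSet,axis,value):
    retDataSet=[]
    # Extract the matching rows and drop the split feature from each
    for featVec in dataSet:
        if featVec[axis]==value:
            reduceFeatVec=featVec[:axis]
            reduceFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet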

print(splitDataSet(myDat,0,0))

Output: [[1, 'no'], [1, 'no']]

Choosing the best split: maximizing information gain

'''
Choose the best feature to split on: the entropy calculation tells us which split organizes the data set best
'''
def chooseBeatFeatureToSplit(dataSet):
    numFeatures=len(dataSet[0])-1
    baseEntropy=calcShannonEnt(dataSet)
    bestInfoGain=0.0
    bestFeature=-1
    for i in range(numFeatures):
        # Collect the unique values of feature i
        featList=[example[i] for example in dataSet]  # list comprehension builds a new list
        uniqueVals=set(featList)   # a set keeps each value once, the fastest way to get unique values from a list
        newEntropy=0.0
        # Compute the weighted entropy of this split
        for value in uniqueVals:
            subDataSet=splitDataSet(dataSet,i,value)
            prb=len(subDataSet)/float(len(dataSet))
            newEntropy+=prb*calcShannonEnt(subDataSet)
        infoGain=baseEntropy-newEntropy
        # Keep the feature with the largest information gain
        if (infoGain>bestInfoGain):
            bestInfoGain=infoGain
            bestFeature=i
    return bestFeature

print(chooseBeatFeatureToSplit(myDat))

Output: 0
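The result tells us that feature 0 is the best feature for splitting the data set. Referring to Table 3-1, splitting on the first feature puts the rows whose first feature is 1 in one group and the rows whose first feature is 0 in another. With the six-row data set defined above, the first group contains two fish and two non-fish, and the second group contains only non-fish.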
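The sketch below prints the information gain of each feature for the six-row data set, so the choice can be checked by hand. The helpers `entropy` and `info_gain` are hypothetical names written for this illustration. They are compact re-implementations, not the functions from the listing above.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    return sum(-(c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def info_gain(rows, axis):
    """Entropy before the split minus the weighted entropy after it."""
    after = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
        [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = [info_gain(data, axis) for axis in range(2)]
print(gains)  # both features give the same gain, about 0.2516
```

For this particular data set the two features tie. Because the selection loop compares with a strict `>`, the first feature examined (index 0) is the one returned.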

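Building the Decision Tree Recursively

The recursion stops when the program has used up all attributes available for splitting, or when all instances under a branch have the same class. If all instances have the same class, the result is a leaf node, also called a terminating block.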
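When the attributes are used up but the class labels are still mixed, a majority vote decides the class of the leaf node:

'''
Decide the class of a leaf node by majority vote
'''
import operator

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount:classCount[vote]=0
        classCount[vote]+=1
    # Sort only after every vote has been counted
    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]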

The function returns the name of the class that occurs most often.
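The same vote can be written more briefly with `collections.Counter` from the standard library. This is an alternative sketch, not the book's version, and `majority_label` is a hypothetical name.

```python
from collections import Counter

def majority_label(class_list):
    """Return the label that occurs most often in class_list."""
    # most_common(1) returns a list holding one (label, count) pair
    return Counter(class_list).most_common(1)[0][0]

print(majority_label(['yes', 'no', 'no']))
```

When two labels are tied, `most_common` keeps them in the order they were first seen, so the label that appears first in the list wins.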

# Code that builds the tree
def createTree(dataSet,labels):
    classList=[example[-1] for example in dataSet]
    # Stop splitting when all classes are identical
    if classList.count(classList[0])==len(classList):
        return classList[0]
    # When all features have been used, return the most common class
    if len(dataSet[0])==1:
        return majorityCnt(classList)
    bestFeat=chooseBeatFeatureToSplit(dataSet)
    bestFeatLabel=labels[bestFeat]
    myTree={bestFeatLabel:{}}
    # Labels for the subtrees: a new list without the chosen feature,
    # so the caller's list is left unchanged and can be reused by classify()
    subLabels=labels[:bestFeat]+labels[bestFeat+1:]
    # Collect all values the chosen feature takes
    featValues=[example[bestFeat] for example in dataSet]
    uniqueVals=set(featValues)
    for value in uniqueVals:
        myTree[bestFeatLabel][value]=createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
    return myTree

myTree=createTree(myDat,labels)
print(myTree)

Output:
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

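Reading from the left, the first key, no surfacing, is the name of the first feature used to split the data set, and its value is another dictionary. The keys of that second dictionary are the values of the no surfacing feature, and the values stored under them are the child nodes of the no surfacing node: either a class label (a leaf node) or another dictionary (a decision node).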
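One way to check how the nested dictionary should be read is to flatten it into one rule per leaf. The sketch below is an illustration only, and `tree_to_rules` is a hypothetical helper that does not appear in the book.

```python
def tree_to_rules(tree, path=()):
    """Yield (conditions, label) pairs, one per leaf of a nested-dict tree."""
    feature = next(iter(tree))                # the single key is the feature name
    for value, child in tree[feature].items():
        step = path + ((feature, value),)
        if isinstance(child, dict):           # a dictionary is a decision node
            yield from tree_to_rules(child, step)
        else:                                 # anything else is a class label
            yield step, child

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
for conditions, label in tree_to_rules(tree):
    print(' and '.join(f'{f} == {v}' for f, v in conditions), '->', label)
```

Each printed line corresponds to one path from the root to a leaf, which is the set of rules the tree has extracted from the data.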

Plotting the Tree with Matplotlib Annotations

Matplotlib Annotations

Matplotlib provides a very useful annotation tool, annotations, which can add text annotations to a plot.

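import matplotlib.pyplot as plt

# Define the text box and arrow styles
decisionNode=dict(boxstyle='sawtooth',fc='0.8')
leafNode=dict(boxstyle='round4',fc='0.8')
arrow_args=dict(arrowstyle='<-')

# Draw a node with an arrow from its parent. The original listing calls plotNode
# without defining it; this definition is assumed to match the book's helper.
def plotNode(nodeTxt,centerPt,parentPt,nodeType):
    createPlot.ax1.annotate(nodeTxt,xy=parentPt,xycoords='axes fraction',
                            xytext=centerPt,textcoords='axes fraction',
                            va='center',ha='center',bbox=nodeType,arrowprops=arrow_args)

def createPlot():
    fig=plt.figure(1,facecolor='white')
    fig.clf()
    createPlot.ax1=plt.subplot(111,frameon=False)
    plotNode('a decision node',(0.5,0.1),(0.1,0.5),decisionNode)
    plotNode('a leaf node',(0.8,0.1),(0.3,0.8),leafNode)
    plt.show()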


Constructing a Tree of Annotations

# Get the number of leaf nodes and the depth of the tree
def getNumLeafs(myTree):
    numLeafs=0
    firstStr=next(iter(myTree))
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        # Test whether the node is a dictionary (a decision node)
        if type(secondDict[key]).__name__=='dict':
            numLeafs+=getNumLeafs(secondDict[key])
        else:numLeafs+=1
    return numLeafs
def getTreeDepth(myTree):
    maxDepth=0
    firstStr=next(iter(myTree))
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            thisDepth=1+getTreeDepth(secondDict[key])
        else:thisDepth=1
        if thisDepth>maxDepth:maxDepth=thisDepth
    return maxDepth

In Python 3, myTree.keys() returns a dict_keys view rather than a list, so myTree.keys()[0] can no longer be used to get the feature name of a node. Use list(myTree.keys())[0] instead, or firstStr=next(iter(myTree)).
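The difference can be seen in a few lines. This is a small demonstration on a made-up one-level tree, not code from the book.

```python
tree = {'no surfacing': {0: 'no', 1: 'yes'}}

keys = tree.keys()
try:
    keys[0]                      # fine on a Python 2 list, fails on a dict_keys view
except TypeError as err:
    print('indexing failed:', err)

first_a = list(tree.keys())[0]   # builds a temporary list of all keys
first_b = next(iter(tree))       # takes the first key without building a list
print(first_a, first_b)
```

Both forms return the same key. `next(iter(...))` avoids copying the keys, which is why the listings here use it.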

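The retrieveTree() function returns pre-stored tree information, which avoids the trouble of building the tree from data every time the code is tested.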

def retrieveTree(i):
    listOfTrees=[{'no surfacing':{0:'no',1:{'flippers':\
                                                {0:'no',1:'yes'}}}},
                 {'no surfacing':{0:'no',1:{'flippers':\
                                                {0:{'head':{0:'no',1:'yes'}},1:'no'}}}}
                 ]
    return listOfTrees[i]

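Plotting the Decision Tree


'''
Label a directed edge with its attribute value
Parameters:
     cntrPt, parentPt -- used to compute the position of the label
     txtString -- the label text
Returns:
     None

'''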
def plotMidText(cntrPt,parentPt,txtString):
    xMid=(parentPt[0]-cntrPt[0])/2.0+cntrPt[0]
    yMid=(parentPt[1]-cntrPt[1])/2.0+cntrPt[1]
    createPlot.ax1.text(xMid,yMid,txtString)


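'''
Draw the decision tree
Parameters:
    myTree --- the decision tree (dictionary)
    parentPt --- position of the parent node
    nodeTxt --- text written on the edge that leads to this node
Returns:
    None
'''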

def plotTree(myTree,parentPt,nodeTxt):
    numLeafs=getNumLeafs(myTree)
    depth=getTreeDepth(myTree)
    firstStr=next(iter(myTree))
    cntrPt=(plotTree.xOff+(1.0+float(numLeafs))/2.0/plotTree.totalW,plotTree.yOff)
    plotMidText(cntrPt,parentPt,nodeTxt)
    plotNode(firstStr,cntrPt,parentPt,decisionNode)
    secondDict=myTree[firstStr]
    plotTree.yOff=plotTree.yOff-1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            plotTree(secondDict[key],cntrPt,str(key))
        else:
            plotTree.xOff=plotTree.xOff+1.0/plotTree.totalW
            plotNode(secondDict[key],(plotTree.xOff,plotTree.yOff),cntrPt,leafNode)
            plotMidText((plotTree.xOff,plotTree.yOff),cntrPt,str(key))
    plotTree.yOff=plotTree.yOff+1.0/plotTree.totalD

def createPlot(inTree):
    fig=plt.figure(1,facecolor='white')
    fig.clf()
    axprops=dict(xticks=[],yticks=[])
    createPlot.ax1=plt.subplot(111,frameon=False,**axprops)
    plotTree.totalW=float(getNumLeafs(inTree))
    plotTree.totalD=float(getTreeDepth(inTree))
    plotTree.xOff=-0.5/plotTree.totalW;plotTree.yOff=1.0;
    plotTree(inTree,(0.5,1.0),'')
    plt.show()
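
The layout arithmetic in plotTree can be checked without drawing anything. The sketch below reproduces only the x-coordinate bookkeeping. The function names `leaf_x_positions` and `node_center_x` are hypothetical and exist only for this illustration.

```python
def leaf_x_positions(total_leaves):
    """x coordinates given to the leaves, left to right, by the xOff updates."""
    x_off = -0.5 / total_leaves          # same starting value as plotTree.xOff
    positions = []
    for _ in range(total_leaves):
        x_off += 1.0 / total_leaves      # same step as in the leaf branch of plotTree
        positions.append(x_off)
    return positions

def node_center_x(x_off, num_leaves, total_leaves):
    """x coordinate of a decision node whose subtree spans num_leaves leaves."""
    return x_off + (1.0 + num_leaves) / 2.0 / total_leaves

print(leaf_x_positions(3))               # leaves sit at the centres of three equal slots
print(node_center_x(-0.5 / 3, 3, 3))     # the root is centred over all leaves
```

Starting xOff half a slot to the left of the axes means that each leaf lands in the middle of its slot, and each decision node is centred above the leaves of its subtree.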

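Testing and Storing the Classifier

Testing the algorithm: classifying with the decision tree

To classify data, we need the decision tree and the label vector that was used to build it. The program compares the test data with the values in the tree and recurses until it reaches a leaf node. The test data is then assigned the class of that leaf.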

'''
Classification function that uses the decision tree
Parameters:
    inputTree --- the decision tree
    featLabels --- feature labels used to build the tree
    testVec -- feature values of the sample to classify
Return:
    classLabel -- the predicted class
'''
def classify(inputTree,featLabels,testVec):
    firstStr=next(iter(inputTree))
    secondDict=inputTree[firstStr]
    featIndex=featLabels.index(firstStr)
    classLabel=None  # stays None if testVec holds a value the tree has never seen
    for key in secondDict.keys():
        if testVec[featIndex]==key:
            if type(secondDict[key]).__name__=='dict':
                classLabel=classify(secondDict[key],featLabels,testVec)
            else:
                classLabel=secondDict[key]
    return classLabel

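The feature label list helps the program locate where each feature is stored in the test vector. The index method finds the first element of the list that matches.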
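As a usage illustration, the sketch below classifies two samples against the stored tree. It uses an iterative variant named `classify_sketch`, a hypothetical helper written for this example, which walks the same nested dictionary without recursion.

```python
def classify_sketch(tree, feat_labels, test_vec):
    """Follow test_vec down a nested-dict tree and return the leaf label."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))                    # feature tested at this node
        value = test_vec[feat_labels.index(feature)]  # the sample's value for it
        node = node[feature][value]                   # KeyError if the value was never seen
    return node

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify_sketch(tree, labels, [1, 0]))
print(classify_sketch(tree, labels, [1, 1]))
```

A sample with no surfacing = 1 and flippers = 0 ends in the 'no' leaf. With flippers = 1 it ends in the 'yes' leaf.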

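Using the algorithm: storing the decision tree

Building a decision tree is time-consuming, but classifying with a tree that has already been built is fast. The Python module pickle can serialize the tree so that it can be stored on disk and reused.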

def storeTree(inputTree,filename):
    import pickle
    # 'wb': pickle writes bytes; the with block closes the file
    with open(filename,'wb') as fw:
        pickle.dump(inputTree,fw)

def grabTree(filename):
    import pickle
    with open(filename,'rb') as fr:
        return pickle.load(fr)


storeTree(myTree,'classifierStorage.txt')
print(grabTree('classifierStorage.txt'))

Output:
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

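Note: pickle.dump(inputTree, fw) may raise
TypeError: write() argument must be str, not bytes
The message says that write() expects a str and received bytes, which happens when the file is opened in text mode. Change fw=open(filename,'w') and fr=open(filename,'r') to fw=open(filename,'wb') and fr=open(filename,'rb').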
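
The round trip can be tried in isolation with a temporary directory, so that no file is left behind. The file name and variable names below are chosen for this illustration only.

```python
import os
import pickle
import tempfile

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'classifierStorage.pkl')
    with open(path, 'wb') as fw:      # binary mode: pickle writes bytes
        pickle.dump(tree, fw)
    with open(path, 'rb') as fr:      # binary mode again when reading
        restored = pickle.load(fr)

print(restored == tree)
```

Note that pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.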

Using a decision tree to predict contact lens type

I have not found the data set yet, so this section is pending.

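Summary

A decision tree classifier resembles a flowchart with terminating blocks, where each terminating block represents a classification result. When processing a data set, we first measure the inconsistency of the data, that is, its entropy. We then look for the best way to split the data set, and repeat until all the data in each subset belong to the same class.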
