Machine Learning in Action, Chapter 3: Decision Trees

Latest recommended article published on 2020-11-16 13:54:26

masiro_zhao


Category: Machine Learning in Action. Tags: python, machine learning, computing, big data, Machine Learning in Action

Article link: https://blog.csdn.net/weixin_44024159/article/details/90295511


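3.1 Constructing a Decision Tree

About the diagrams

Decision block -> rectangle

Terminating block -> oval

Path to another decision block or to a terminating block -> branch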

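Pros and cons

Pros: computationally cheap, insensitive to missing intermediate values, and able to handle irrelevant features; it can also distill a set of rules from an unfamiliar data set.

Cons: prone to overfitting.

Works with: numeric and nominal values.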

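Splitting the data set

To split the data we pick a feature, so we first need to find the decisive feature. If all the data under a branch belongs to the same class, that branch is correctly classified and needs no further splitting. If the data in a subset does not all belong to the same class, the subset has to be split again.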

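Decision tree workflow

Preparing the data: the tree-building algorithm here only works on nominal values, so numeric values must be discretized first.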
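Discretizing a numeric feature could look roughly like the sketch below. The cut points, the bin names and the function name `discretize` are made up for illustration and do not come from the book; `bisect_right` from the standard library finds the bin a number falls into.

```python
from bisect import bisect_right

# Hypothetical cut points: below 10 -> bin 0, from 10 to below 20 -> bin 1, 20 and above -> bin 2
CUT_POINTS = [10, 20]
BIN_NAMES = ['low', 'medium', 'high']  # assumed nominal names for the three bins

def discretize(value, cut_points=CUT_POINTS, names=BIN_NAMES):
    """Map a numeric value to a nominal bin name."""
    return names[bisect_right(cut_points, value)]

ages = [3, 10, 15, 20, 42]
print([discretize(a) for a in ages])
```

After this step each numeric column holds a small set of nominal values, which is what the splitting code in this chapter expects.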

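Procedure:

1 - Split the data set on each feature in turn and compute the entropy after each split.

2 - Compute the information gain: old entropy minus new entropy. Lower entropy means less disorder and a cleaner classification, so the lower the new entropy, and hence the larger the information gain, the better.

3 - Split the data set on the best feature. Then check whether the instances in each branch all have the same class label. If they do, that branch is done; if not, compute the information gain again on that subset and split on its best remaining feature.

4 - Repeat the steps above until every branch contains instances of a single class.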

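3.1.1 Information Gain

Information gain: the change in information after a data set is split, where information is measured by Shannon entropy (or simply entropy).

The higher the entropy, the more mixed the data. Adding more classes to a data set therefore raises its entropy.

Once the entropy can be computed, the data set is split on the feature that gives the largest information gain.

Put differently, information gain is the reduction in entropy, i.e. the reduction in disorder.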
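Written as formulas, the two quantities look like this. The symbols are notation chosen here, not taken from the post: D is the data set, p_i is the fraction of instances in class i, and D_v is the subset of D in which feature A takes the value v.

```latex
H(D) = -\sum_{i=1}^{n} p_i \log_2 p_i
\qquad
\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```

The first term of the gain is the "old entropy"; the weighted sum is the "new entropy" after the split.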

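Gini impurity: another measure of disorder. It is the probability that an item picked at random from the data set would be put into the wrong group if it were labeled at random according to the class distribution of the set.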
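The chapter itself only uses entropy, but for comparison here is a minimal sketch of Gini impurity, computed as one minus the sum of the squared class proportions. The function name `gini_impurity` is my own choice.

```python
from collections import Counter

def gini_impurity(class_labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(class_labels)
    if total == 0:
        return 0.0
    counts = Counter(class_labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

print(gini_impurity(['yes', 'yes', 'no', 'no', 'no']))  # 2 'yes' and 3 'no'
print(gini_impurity(['no', 'no']))                      # a pure set
```

The mixed list gives about 0.48 and the pure list gives 0.0, so like entropy the value shrinks as a set becomes purer.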

# Compute the Shannon entropy of a data set
from math import log
def calShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}  # class label -> number of occurrences
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2)  # accumulate -p * log2(p)
    return shannonEnt

# Build the sample data set
def createDataSet():
    dataSet = [[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[0,1,'no'],[0,1,'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

Test

myDat, labels = createDataSet()

myDat

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

labels

['no surfacing', 'flippers']

calShannonEnt(myDat)

0.9709505944546686

Conclusion: the higher the entropy, the more mixed the data. To see this, change one class label to a new, third class:

myDat[0][-1] = 'maybe'

myDat

[[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

calShannonEnt(myDat)

1.3709505944546687

Compared with the original data set, a new class label 'maybe' has been added, so the entropy goes up.


3.1.2 Splitting the Data Set

Compute the entropy of the result of splitting on each feature, then decide which feature gives the best split.

# Split the data set
# dataSet: the data set to split
# axis: index of the feature to split on
# value: the feature value to keep
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # copy the row without the feature column at axis
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

Test

myDat,labels = createDataSet()
myDat

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

# Feature 0: keep the instances whose value is 1
splitDataSet(myDat,0,1)

[[1, 'yes'], [1, 'yes'], [0, 'no']]

# Feature 1: keep the instances whose value is 1
splitDataSet(myDat,1,1)

[[1, 'yes'], [1, 'yes'], [0, 'no'], [0, 'no']]

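# Choose the best feature for splitting the data set
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1 # the last column is the class label, not a feature
    baseEntropy = calShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1

    # Try a split on every feature and compute the resulting entropy
    for i in range(numFeatures):
        # Collect the values of feature i over all instances, then deduplicate
        featList = []
        for example in dataSet:
            featList.append(example[i])
        uniqueVals = set(featList)
        # Split on each value of the feature and sum the subset entropies,
        # weighted by subset size, to get the new entropy.
        # The feature with the largest information gain is the best split.
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value) # split the data set
            prob = (len(subDataSet))/float(len(dataSet)) # fraction of instances in this subset
            newEntropy += prob * calShannonEnt(subDataSet)  # weighted entropy of this subset
        infoGain = baseEntropy - newEntropy # information gain = old entropy - new entropy
        if(infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature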

myDat,labels = createDataSet()
chooseBestFeatureToSplit(myDat)
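To see why one feature wins, the gain of each feature on the sample data can be worked out separately. The sketch below is self-contained and uses its own helper names (`entropy`, `info_gain`), which are not part of the code above; it follows the same "old entropy minus weighted new entropy" rule.

```python
from math import log2
from collections import Counter

def entropy(rows):
    """Shannon entropy of the class labels (last column) of rows."""
    counts = Counter(row[-1] for row in rows)
    return -sum((n / len(rows)) * log2(n / len(rows)) for n in counts.values())

def info_gain(rows, axis):
    """Entropy before the split minus the weighted entropy after splitting on column axis."""
    new_entropy = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        new_entropy += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - new_entropy

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
for axis, name in enumerate(['no surfacing', 'flippers']):
    print(name, round(info_gain(data, axis), 3))
```

The gain is about 0.42 for 'no surfacing' and about 0.17 for 'flippers', so feature 0 is the better first split.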

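3.1.3 Building the Tree Recursively

Majority vote: ID3, the algorithm used here, uses up one feature at every split (C4.5 and CART do not). Once all features are used up, the instances in a branch may still carry different class labels; in that case the class of the leaf is decided by majority vote.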

import operator

def majorityCnt(classList):
    classCount = {}  # class label -> number of votes
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # sort by vote count, highest first (items() replaces the Python 2 iteritems())
    sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)
    return sortedClassCount[0][0]
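The same vote can probably be written more compactly with `collections.Counter` from the standard library. This is only an alternative sketch; the name `majority_label` is mine and the rest of the chapter keeps using majorityCnt.

```python
from collections import Counter

def majority_label(class_list):
    """Return the most frequent label in class_list."""
    return Counter(class_list).most_common(1)[0][0]

print(majority_label(['no', 'yes', 'no']))
```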

# Build the decision tree recursively
def createTree(dataSet, labels):
    # classList holds the class label of every instance
    classList = []
    for example in dataSet:
        classList.append(example[-1])

    # Stop splitting when all instances have the same class label
    # and return that label
    if classList.count(classList[0]) == len(classList):
        return classList[0]

    # No features left (only the class label column remains):
    # vote and return the most frequent class label
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)

    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del (labels[bestFeat]) # drop the feature just used; note that this modifies the caller's list
    # Collect the distinct values of the best feature
    featValues = []
    for example in dataSet:
        featValues.append(example[bestFeat])
    uniqueVals = set(featValues)

    # Recursively build a subtree for each value of the feature
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
    return myTree

myDat, labels = createDataSet()
createTree(myDat,labels)

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

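3.2 Plotting Trees with Matplotlib Annotations

3.2.1 Matplotlib Annotations

Matplotlib's annotation tool adds text annotations to a plot.

Define the styles of the decision node, the leaf node, and the arrow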

import matplotlib.pyplot as plt

# Define the text box and arrow styles
decisionNode = dict(boxstyle="sawtooth", fc="0.8") # decision node box style
leafNode = dict(boxstyle="round4",fc="0.8") # leaf node box style
arrow_args = dict(arrowstyle="<-") # arrow style

Define a function that draws a node with an arrow from its parent

# Draw an annotated node with an arrow
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

Draw a sample plot

def createPlot():
    fig = plt.figure(1,facecolor='white')
    fig.clf()
    createPlot.ax1=plt.subplot(111,frameon=False)
    plotNode('decision node',(0.5,0.1),(0.1,0.5),decisionNode)
    plotNode('leaf node',(0.8,0.1),(0.3,0.8),leafNode)
    plt.show()

createPlot()

[Figure: sample plot showing a decision node and a leaf node]

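3.2.2 Constructing a Tree of Annotations

Counting the leaves: the number of leaf nodes, i.e. the number of terminal branches the tree ends in.

How it works:

Enter the dictionary and check whether each value on the next level is itself a dictionary.

If it is not a dictionary, that branch is fully classified, so the leaf count goes up by 1.

If it is a dictionary, that branch is not finished; recurse into it and count its leaves the same way.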

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict': # the value is another dictionary, i.e. a subtree with its own leaves
            numLeafs += getNumLeafs(secondDict[key]) 
        else:
            numLeafs += 1
    return numLeafs

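Getting the depth of the tree: the number of decision nodes on the longest path from the root to a leaf.

How it works:

Enter the dictionary and check whether each value on the next level is itself a dictionary.

If it is not a dictionary, the branch ends here, so its depth is 1.

If it is a dictionary, the branch continues, so its depth is 1 plus the depth of that subtree.

Keep the largest depth found over all branches.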

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for keys in secondDict.keys():
        if type(secondDict[keys]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[keys])
        else:
            thisDepth = 1
        if thisDepth > maxDepth : 
            maxDepth = thisDepth
    return maxDepth

Example

def retrieveTree(i):
    listOfTrees = [{'no surfacing':{0:'no',1:{'flippers':{0:'no',1:'yes'}}}},
                  {'no surfacing':{0:'no',1:{'flippers':{0:{'head':{0:'no',1:'yes'}},1:'no'}}}}
                  ]
    return listOfTrees[i]

myTree1 = retrieveTree(0)
print('myTree1 : ' + str(myTree1))
nbLeafs1 = getNumLeafs(myTree1)
print("number of leaf nodes of myTree1 is : " + str(nbLeafs1))
nbDepth1 = getTreeDepth(myTree1)
print("depth of myTree1 is : " + str(nbDepth1))


myTree2 = retrieveTree(1)
print('myTree2 : ' + str(myTree2))
nbLeafs2 = getNumLeafs(myTree2)
print("number of leaf nodes of myTree2 is : " + str(nbLeafs2))
nbDepth2 = getTreeDepth(myTree2)
print("depth of myTree2 is : " + str(nbDepth2))

myTree1 : {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
number of leaf nodes of myTree1 is : 3
depth of myTree1 is : 2
myTree2 : {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
number of leaf nodes of myTree2 is : 4
depth of myTree2 is : 3

Plotting the full tree (skimming this part is enough)

def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString)

def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt,parentPt,nodeTxt) # label the branch with the feature value leading to this node
    plotNode(firstStr,cntrPt,parentPt,decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD # move down one level
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key],cntrPt,str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key],(plotTree.xOff, plotTree.yOff),cntrPt,leafNode)
            plotMidText((plotTree.xOff,plotTree.yOff),cntrPt,str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree,(0.5,1.0),'')
    plt.show()

createPlot(myTree1)
createPlot(myTree2)

[Figure: the trees drawn by createPlot(myTree1) and createPlot(myTree2)]

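3.3 Testing and Storing the Classifier

3.3.1 Test: Using the Tree for Classification

inputTree: the decision tree that was built, e.g. {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

featLabels: the list of feature names, e.g. ['no surfacing', 'flippers']

testVec: the feature values of the animal to classify, e.g. [1, 0]; the result says whether it is a fish ('yes' or 'no')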

def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0] # the root decision node: no surfacing
    secondDict = inputTree[firstStr] # the second-level dictionary: {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}
    featIndex = featLabels.index(firstStr) # index of that feature in featLabels
    classLabel = None # stays None if the feature value matches no branch
    for key in secondDict.keys(): # keys of the second-level dictionary: 0, 1
        if testVec[featIndex] == key: # the test value selects this branch
            if type(secondDict[key]).__name__=='dict': # still a dictionary: recurse into the subtree
                classLabel = classify(secondDict[key],featLabels, testVec)
            else:
                classLabel = secondDict[key] # a leaf: this is the class label
    return classLabel

myDat, labels = createDataSet()
myTree = retrieveTree(0)

# Is an animal with first feature value 1 and second feature value 0 a fish? The answer is no
classify(myTree, labels, [1,0])

'no'
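A feature value that never appeared in the training data has no branch in the tree. One possible way to deal with that is sketched below: an iterative walk over the nested dictionary that falls back to a default label. The function name `classify_safe` and the `default` parameter are my own additions, not part of the book's code.

```python
def classify_safe(tree, feat_labels, test_vec, default=None):
    """Walk the nested-dict tree; return default when a feature value has no branch."""
    node = tree
    while isinstance(node, dict):
        feat_name = next(iter(node))              # the feature tested at this node
        value = test_vec[feat_labels.index(feat_name)]
        branches = node[feat_name]
        if value not in branches:                 # value never seen during training
            return default
        node = branches[value]
    return node                                   # a leaf: the class label

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify_safe(tree, labels, [1, 1]))
print(classify_safe(tree, labels, [2, 1], default='unknown'))
```

The first call follows both branches down to 'yes'; the second hits the unseen value 2 for 'no surfacing' and returns the default.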

3.3.2 Use: Persisting the Decision Tree

Storing the tree

def storeTree(inputTree, filename):
    import pickle
    with open(filename,'wb') as fw: # the file is closed automatically
        pickle.dump(inputTree, fw) # serialize the object into the open file

Loading the tree

def grabTree(filename):
    import pickle
    with open(filename,'rb') as fr:
        return pickle.load(fr) # deserialize the object from the file
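A quick round trip shows that a stored tree comes back unchanged. This sketch is self-contained and calls pickle directly instead of the two functions above; the file name `classifierStorage.pkl` and the temporary directory are just choices for the example.

```python
import os
import pickle
import tempfile

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'classifierStorage.pkl')  # hypothetical file name
    with open(path, 'wb') as fw:
        pickle.dump(tree, fw)       # binary mode is required for pickle
    with open(path, 'rb') as fr:
        restored = pickle.load(fr)  # load() reads from a file object; loads() expects bytes

print(restored == tree)
```

Storing the tree this way means it does not have to be rebuilt from the data every time something needs to be classified.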