[Machine Learning] ID3 Decision Trees with Hands-On Code

o(*￣︶￣*)o__小肉松

Category: Machine Learning. Tags: machine learning, ID3 decision tree

Article link: https://blog.csdn.net/made_in_china_too/article/details/78807578


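I. Theory: the ID3 decision tree learning algorithm (a decision tree that splits on information gain)

(1) Splitting strategy
Among the candidate attributes at the current branch, choose the one with the largest information gain as the splitting criterion.
(2) Formulas used
1. Information entropy
2. Information gain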
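
For reference, the standard definitions of these two quantities, which are also what calcShannonEnt and chooseBestFeatureToSplit in Section II compute, can be written as follows. The symbols are assumptions of this sketch: D is the sample set, p_k is the proportion of samples in D that belong to class k, a is an attribute with V possible values, and D^v is the subset of D whose value on a is the v-th one.

```latex
\mathrm{Ent}(D) = -\sum_{k=1}^{K} p_k \log_2 p_k

\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)
```

A larger gain means that splitting on a leaves subsets that are, on average, purer than D itself.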

(3) Example
1. Build a decision tree from the "Watermelon Dataset 2.0"

2. Construction process

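II. Building a decision tree from a dataset

(1) Dataset
The dataset is the small five-sample table defined in createDataSet() below: two binary features ('no surfacing' and 'flippers') and a yes/no class label.
(2) Code
1. Computing information entropy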

from math import log


def createDataSet():
    # Each row is [no surfacing, flippers, class label]
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']  # feature names, one per feature column
    return dataSet, labels


def calcShannonEnt(dataSet):
    """Return the information entropy (Shannon entropy) of dataSet."""
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the last element of featVec is the class label
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
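
As a quick check of the entropy calculation, the sketch below recomputes the entropy of the five-sample dataset. The helper name `entropy` is made up for this illustration, and it uses `collections.Counter` instead of the manual dictionary, so treat it as an equivalent sketch rather than the article's function.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Shannon entropy (base 2) of the class labels stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())


data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
base_entropy = entropy(data)
print(round(base_entropy, 4))  # 2 'yes' and 3 'no' labels give about 0.971
```

The more evenly the labels are mixed, the closer the value gets to 1 for a two-class problem; a set with a single label has entropy 0.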


2. Splitting the dataset
a.

def splitDataSet(dataSet, axis, value):
    """
    Select every sample in dataSet whose column `axis` equals `value`,
    remove that column from each selected sample, and return the result.
    """
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:  # column `axis` of this sample equals value
            reducedFeatVec = featVec[:axis]  # elements 0 .. axis-1 of this sample
            reducedFeatVec.extend(featVec[axis+1:])  # elements axis+1 .. end of this sample
            retDataSet.append(reducedFeatVec)
    return retDataSet
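
To make the effect of the split concrete, here is a small sketch on the five-sample dataset. The name `split_rows` is hypothetical; it is a list-comprehension variant that is meant to behave like splitDataSet.

```python
def split_rows(rows, axis, value):
    """Keep the rows whose column `axis` equals `value`, dropping that column."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]


data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

surfacing = split_rows(data, 0, 1)
print(surfacing)      # [[1, 'yes'], [1, 'yes'], [0, 'no']]

not_surfacing = split_rows(data, 0, 0)
print(not_surfacing)  # [[1, 'no'], [1, 'no']]
```

Each returned row is a new list, so the original dataset is left untouched, and the removed column means the subset can be passed straight into the next round of splitting.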


def chooseBestFeatureToSplit(dataSet):
    """
    Try every feature of dataSet, compute the information gain of splitting on it,
    and return the column index of the feature with the largest gain
    (the best splitting feature).
    """
    numFeatures = len(dataSet[0]) - 1  # number of features; the last column is the class label, hence -1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):  # information gain of splitting dataSet on column i
        featList = [example[i] for example in dataSet]  # all values in column i
        uniqueVals = set(featList)  # distinct values of column i
        newEntropy = 0.0
        for value in uniqueVals:  # weighted entropy of the subset where column i equals value
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # information gain of splitting on column i
        if infoGain > bestInfoGain:  # keep the largest gain seen so far and its column
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature  # column index of the best splitting feature
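
The selection step can be traced by hand on the five-sample dataset. The sketch below is self-contained and uses made-up helper names (`entropy`, `split_rows`, `info_gain`); it is an illustration of the same calculation, not the article's code.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Shannon entropy (base 2) of the class labels in the last column."""
    counts = Counter(row[-1] for row in rows)
    return -sum((c / len(rows)) * log2(c / len(rows)) for c in counts.values())


def split_rows(rows, axis, value):
    """Rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]


def info_gain(rows, axis):
    """Entropy of rows minus the size-weighted entropy of the subsets after the split."""
    remainder = 0.0
    for value in set(row[axis] for row in rows):
        subset = split_rows(rows, axis, value)
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = [info_gain(data, i) for i in range(len(data[0]) - 1)]
print([round(g, 3) for g in gains])  # about [0.42, 0.171]
best = max(range(len(gains)), key=gains.__getitem__)
print(best)                          # 0, i.e. 'no surfacing'
```

Column 0 wins because one of its two subsets is already pure (both samples with value 0 are 'no'), whereas column 1 leaves a four-sample subset that is split evenly between the two classes.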


3. Building the decision tree recursively
Code:

import operator


def majorityCnt(classList):
    """
    classList is the last column of dataSet (the list of class labels).
    Count how often each label occurs, sort the counts in descending order,
    and return the most frequent label.
    """
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # sortedClassCount holds (label, count) pairs sorted by count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def createTree(dataSet, labels):
    """
    Build the tree recursively. labels holds the feature name of each column of dataSet.
    """
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # every sample in this branch has the same label: return it as a leaf
    if len(dataSet[0]) == 1:  # no feature columns left: return the most frequent label as a leaf
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)  # column index of the best splitting feature
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}  # myTree is a nested dictionary
    # Python passes a reference to the list object, so deleting from labels here would
    # also change the caller's list. Build a new list without the chosen feature instead.
    subLabels = labels[:bestFeat] + labels[bestFeat+1:]
    featValues = [example[bestFeat] for example in dataSet]  # column bestFeat of dataSet
    uniqueVals = set(featValues)  # distinct values in column bestFeat
    for value in uniqueVals:
        # each recursive call gets its own copy of the remaining feature names
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels[:])
    return myTree
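
The note about argument passing is easy to trip over: a list handed to a function is the same object the caller holds, so removing an element inside the function is visible outside it. The sketch below contrasts the two behaviours; `drop_in_place` and `drop_copy` are hypothetical helpers written only for this illustration.

```python
def drop_in_place(labels, index):
    """Remove one entry by mutating the list object the caller passed in."""
    del labels[index]
    return labels


def drop_copy(labels, index):
    """Return a new list without the entry; the caller's list is not touched."""
    return labels[:index] + labels[index + 1:]


original = ['no surfacing', 'flippers']
remaining = drop_copy(original, 0)
print(original)   # ['no surfacing', 'flippers']
print(remaining)  # ['flippers']

aliased = ['no surfacing', 'flippers']
drop_in_place(aliased, 0)
print(aliased)    # ['flippers']
```

This matters for the feature-name list in particular, because the same list is needed again later to look up feature positions when classifying a sample.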


III. Plotting the tree with Matplotlib annotations

(1)
Code:

import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")  # box style for decision nodes
leafNode = dict(boxstyle="round4", fc="0.8")  # box style for leaf nodes
arrow_args = dict(arrowstyle="<-")  # arrow style


def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    """
    Add a node labelled nodeTxt at centerPt, drawn with the box style nodeType,
    plus an arrow pointing from parentPt to centerPt.
    """
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)


def createPlot1():
    """
    Create a new figure and add two components:
      1. a decision node named "a decision node" at (0.5, 0.1), with an arrow from (0.1, 0.5) to (0.5, 0.1);
      2. a leaf node named "a leaf node" at (0.8, 0.1), with an arrow from (0.3, 0.8) to (0.8, 0.1).
    """
    fig = plt.figure(1, facecolor='white')  # create a new figure
    fig.clf()  # clear the figure
    # ax1 is stored as an attribute of createPlot (defined in part (2)) because plotNode reads it from there
    createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks kept for demo purposes
    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()


(2)
Code:

def retrieveTree(i):
    """Return one of two hard-coded decision trees used for testing."""
    listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                   {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                   ]
    return listOfTrees[i]


def getNumLeafs(myTree):
    """Return the number of leaf nodes of the decision tree myTree."""
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict:
        if isinstance(secondDict[key], dict):  # a dict is a subtree; anything else is a leaf
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs


def getTreeDepth(myTree):
    """Return the depth of the decision tree myTree."""
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict:
        if isinstance(secondDict[key], dict):  # a dict is a subtree; anything else is a leaf
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth
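
For the two test trees these helpers should report 3 leaves at depth 2 and 4 leaves at depth 3. The sketch below checks that with a compact recursive variant; `count_leaves` and `tree_depth` are made-up names, and they treat any non-dict value as a leaf, which is the same convention the article's functions use.

```python
def count_leaves(tree):
    """Number of leaves below a node; any non-dict value counts as one leaf."""
    if not isinstance(tree, dict):
        return 1
    branches = next(iter(tree.values()))  # the single feature's {value: child} mapping
    return sum(count_leaves(child) for child in branches.values())


def tree_depth(tree):
    """Number of decision levels on the longest path from this node to a leaf."""
    if not isinstance(tree, dict):
        return 0
    branches = next(iter(tree.values()))
    return 1 + max(tree_depth(child) for child in branches.values())


tree0 = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
tree1 = {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}

print(count_leaves(tree0), tree_depth(tree0))  # 3 2
print(count_leaves(tree1), tree_depth(tree1))  # 4 3
```

The leaf count later fixes the horizontal spacing of the plot and the depth fixes the vertical spacing.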

def plotMidText(cntrPt, parentPt, txtString):
    """Write txtString at the midpoint of the line between cntrPt and parentPt."""
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center")


def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)  # determines the x width of this subtree
    firstStr = list(myTree.keys())[0]  # the first key names the feature this node splits on
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD  # move down one level
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):  # a dict is a subtree; anything else is a leaf
            plotTree(secondDict[key], cntrPt, str(key))  # recursion
        else:  # leaf node: advance x by one cell and draw it
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD  # move back up one level
# if a value is a dictionary it is a subtree, and its first key is the next feature to split on

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)  # no ticks
    # createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))  # total width, measured in leaves
    plotTree.totalD = float(getTreeDepth(inTree))  # total depth, measured in levels
    plotTree.xOff = -0.5/plotTree.totalW  # start half a cell to the left of the first cell
    plotTree.yOff = 1.0  # start at the top of the axes
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()


if __name__ == "__main__":
    myTree = retrieveTree(0)
    createPlot(myTree)

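The reasoning behind cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff) in plotTree(myTree, parentPt, nodeTxt), and behind the initial values plotTree.xOff = -0.5/plotTree.totalW and plotTree.yOff = 1.0 in createPlot(inTree), is as follows.
The canvas is divided evenly according to the number of leaf nodes and the depth of the tree, and the x-axis has a total length of 1:

[Figure: the canvas divided into a grid of equal cells; squares mark non-leaf nodes and @ marks leaf nodes]

1. In the figure, squares mark the positions of non-leaf nodes and @ marks the positions of leaf nodes. Each cell of the grid therefore has width 1/plotTree.totalW, but a leaf should sit at the @ position, the center of its cell. That is why plotTree.xOff starts at -0.5/plotTree.totalW, half a cell to the left of the first cell: every later @ position can then be reached by adding a whole multiple of 1/plotTree.totalW.
2. The following line in plotTree:
cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
plotTree.xOff is the x-coordinate of the most recently drawn leaf. To place the current node we only need the number of leaves below it. Those leaves span a total width of float(numLeafs)/plotTree.totalW*1 (the total length is 1), so the current node sits in the middle of that span, at an offset of float(numLeafs)/2.0/plotTree.totalW*1. Because plotTree.xOff did not start at 0 but was shifted left by half a cell, another half cell, 1/2/plotTree.totalW*1, has to be added. Together this gives an offset of (1.0 + float(numLeafs))/2.0/plotTree.totalW*1, so the x position is plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW.
3. The parentPt argument of the first plotTree call is (0.5, 1.0).
The root node needs no connecting line, so its parent position must coincide with its own position, and by the formula in item 2 the root's position is (0.5, 1.0).
Summary: increasing x step by step and decreasing y step by step accounts for both the number of leaves and the depth of the tree, so the logical proportions of the figure are well defined. There is no need to care about the size of the output: when the figure is resized, it is simply redrawn. Had the figure been drawn in pixel units, scaling it would be much harder.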
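
The x-coordinate bookkeeping can be checked without drawing anything. The sketch below replays the same arithmetic on the first test tree and records where each node would land; `count_leaves` and `layout_x` are hypothetical helpers, and the running offset is kept in a small dict instead of function attributes.

```python
def count_leaves(tree):
    """Number of leaves below a node; any non-dict value counts as one leaf."""
    if not isinstance(tree, dict):
        return 1
    return sum(count_leaves(child) for child in next(iter(tree.values())).values())


def layout_x(tree, total_w, state, out):
    """Record (name, x) pairs using the same offsets as plotTree."""
    name = next(iter(tree))
    # center of this subtree: last leaf position plus half of (leaf span + one cell)
    center = state['x_off'] + (1.0 + count_leaves(tree)) / 2.0 / total_w
    out.append((name, center))
    for child in tree[name].values():
        if isinstance(child, dict):
            layout_x(child, total_w, state, out)
        else:
            state['x_off'] += 1.0 / total_w  # advance one cell per leaf
            out.append((child, state['x_off']))
    return out


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
total_w = float(count_leaves(tree))
positions = layout_x(tree, total_w, {'x_off': -0.5 / total_w}, [])
for name, x in positions:
    print(name, round(x, 3))
```

With three leaves the cells are 1/3 wide, so the leaves land at 1/6, 3/6 and 5/6, the root lands at 0.5, and the 'flippers' node lands at 4/6, midway between its two leaves.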

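IV. Storing the decision tree

Building a decision tree from a large dataset takes a long time, so a finished tree should be saved to disk; the next time it is needed it can be read back directly instead of being rebuilt.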

import pickle


def storeTree(inputTree, filename):
    """Serialize the decision tree inputTree and write it to the file filename."""
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)


def grabTree(filename):
    """Read a serialized decision tree from the file filename and return it."""
    with open(filename, 'rb') as fr:
        return pickle.load(fr)
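
A round trip through a temporary file shows that the nested dictionary survives serialization unchanged. This is only a sketch: the file name `classifierStorage.pkl` is an arbitrary choice for the example, and the temporary directory is used so that nothing is left on disk.

```python
import os
import pickle
import tempfile

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'classifierStorage.pkl')  # hypothetical file name
    with open(path, 'wb') as fw:   # pickle writes bytes, so open in binary mode
        pickle.dump(tree, fw)
    with open(path, 'rb') as fr:
        restored = pickle.load(fr)

print(restored == tree)  # True
```

Because unpickling can execute arbitrary code, a stored tree should only be loaded from files you created or otherwise trust.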


V. Classifying with the decision tree

def classify(inputTree, featLabels, testVec):
    """
    inputTree is a decision tree built from some dataSet, and featLabels lists the
    feature names of that dataSet in column order.
    testVec is the sample vector to classify.
    """
    firstStr = list(inputTree.keys())[0]  # name of the current node
    secondDict = inputTree[firstStr]  # branches of the subtree rooted at the current node
    featIndex = featLabels.index(firstStr)  # position of this feature in featLabels, i.e. its column in dataSet
    key = testVec[featIndex]  # the value of testVec that this node tests
    # valueOfFeat is whatever hangs off the branch for key: either a class label
    # or a subtree keyed by the next feature to test
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):  # a subtree: keep classifying recursively
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        classLabel = valueOfFeat
    return classLabel
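
The lookup amounts to walking down the nested dictionary until a non-dict value is reached. The sketch below does the same walk with a loop; `classify_sketch` and its `default` parameter are assumptions added for this illustration, the latter to show one way of handling a feature value that the tree has never seen.

```python
def classify_sketch(tree, feature_names, sample, default=None):
    """Follow the branches selected by sample until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))                      # feature tested at this node
        value = sample[feature_names.index(feature)]    # the sample's value for it
        branches = node[feature]
        if value not in branches:
            return default                              # value not covered by the tree
        node = branches[value]
    return node


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
names = ['no surfacing', 'flippers']

print(classify_sketch(tree, names, [1, 0]))  # no
print(classify_sketch(tree, names, [1, 1]))  # yes
print(classify_sketch(tree, names, [2, 1]))  # None, since 2 is not a known value
```

Note that the feature-name list must still contain every feature in its original column order, which is why it should not be modified while the tree is being built.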


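VI. Predicting contact lens types with a decision tree

Even a decision tree built from a small dataset can teach us a lot: with the resulting tree, an eye doctor needs to ask at most four questions to determine which type of lenses a patient should wear. The dataset used here, the contact lenses dataset, is a well-known one. It contains observations of patients' eye conditions together with the type of contact lenses the doctor recommended. The lens types are hard, soft, and not suitable for contact lenses. The data comes from the UCI repository.

Code: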

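import treePlotter  # the plotting code from Section III, saved as treePlotter.py


def tree_lenses(filename):
    # each line of the file is one tab-separated sample; the last field is the class label
    with open(filename) as fr:
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    lensesTree = createTree(lenses, lensesLabels)
    print(lensesTree)
    treePlotter.createPlot(lensesTree)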
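
The parsing step assumes one sample per line, with four tab-separated feature values followed by the class label. The sketch below runs the same list comprehension on two rows held in memory; the rows are illustrative stand-ins written for this example in the assumed format, not lines verified against the actual data file.

```python
import io

# Two illustrative rows in the assumed format: age, prescript, astigmatic, tearRate, label
sample_text = (
    "young\tmyope\tno\treduced\tno lenses\n"
    "young\tmyope\tno\tnormal\tsoft\n"
)

with io.StringIO(sample_text) as fr:
    lenses = [line.strip().split('\t') for line in fr]

print(lenses[0])  # ['young', 'myope', 'no', 'reduced', 'no lenses']
print(len(lenses), len(lenses[0]))  # 2 rows, 5 fields each
```

Splitting on the tab character rather than on whitespace keeps multi-word labels such as "no lenses" in one field, and every value stays a string, which is fine because the tree-building code only compares values for equality.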