Machine Learning in Action: Decision Trees

leeeein

Category: Notes. Tags: decision tree

Article link: https://blog.csdn.net/weixin_37655032/article/details/84386107

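Machine Learning in Action

Chapter 3: Decision Trees

Decision trees

Pros: computationally cheap, results are easy to interpret, tolerant of missing values, and able to handle irrelevant features.

Cons: prone to overfitting.

Works with: numeric and nominal values.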

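Algorithm outline

Check whether every item in the dataset belongs to the same class:

   if so: return that class label

   else:

      find the best feature to split on

      split the dataset

      create a branch node

      for each subset produced by the split:

         call this algorithm recursively and attach the result to the branch node

      return the branch node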

Preparing the data

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing','flippers']
    #change to discrete values
    return dataSet, labels

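Computing Shannon entropy

Splitting a dataset with a decision tree amounts to taking relatively disordered data and dividing it into several subsets that are each more ordered.

The information of an outcome $x_i$ is defined as: $l(x_{i})=\log_{2}{\frac{1}{p(x_{i})}}$

From this formula, the less probable a value $x_i$ is in the original dataset, the more information it carries; and if only one value ever occurs, so that $p(x_i)=1$, its information is zero.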
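
To make the formula concrete, the sketch below evaluates it for a few probabilities. The helper name `self_information` is made up for illustration and does not come from the book.

```python
from math import log2

def self_information(p):
    """Information, in bits, of an outcome with probability p (0 < p <= 1)."""
    return log2(1.0 / p)

# Rarer outcomes carry more information; a certain outcome carries none.
for p in (1.0, 0.5, 0.25, 0.125):
    print(p, self_information(p))
```

Each halving of the probability adds exactly one bit, which is why the logarithm is taken in base 2.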

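Information entropy is defined as $H=\sum_{i=1}^{n}p(x_{i})\log_{2}{\frac{1}{p(x_{i})}}$, the expected value of the information.

It follows that the larger the entropy, the more information the data carries and the more disordered it is.

The idea behind ID3 is to find, at each step, the feature whose split leaves the purest subsets. For that feature the weighted entropy after the split (the conditional entropy) is the smallest, so the information gain, meaning the drop from the original entropy, is the largest.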

# Compute the Shannon entropy of a dataset
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    # Counting loop: tally how often each class label (the last column) occurs
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts: labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    # Turn each count into a probability and accumulate the entropy
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2) #log base 2
    return shannonEnt
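
As a quick check of the entropy definition, here is a compact, self-contained variant applied to the sample data from createDataSet. The name `shannon_entropy` and the use of `collections.Counter` are choices made for this sketch, not the book's code.

```python
from collections import Counter
from math import log2

def shannon_entropy(rows):
    """Entropy, in bits, of the class labels stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

sample = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

# Two 'yes' and three 'no' labels give roughly 0.971 bits.
print(round(shannon_entropy(sample), 4))
```

A dataset whose rows all share one label has entropy zero, which is the first stopping condition of the tree-building recursion.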

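Splitting the dataset

The book does not give an explicit formula for information gain at this point. Information gain is the difference between the original entropy and the conditional entropy after the split; it is not the original entropy minus the plain sum of the subsets' entropies.

For example, if feature i currently takes three values, an entropy can be computed for each of the three resulting subsets. The correct newEntropy combines them according to:

$IG(T)=H(C)-H(C|T)$, where $H(C|T)=\sum_{t}p(t)\,H(C|T=t)$. In words: the original entropy minus the entropy that remains after splitting on T. This is why the code multiplies each subset's Shannon entropy by that subset's probability before adding it to newEntropy.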
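
The weighting can be verified on the sample data with a short, self-contained sketch. The names `entropy` and `info_gain` are illustrative assumptions; the book folds this logic into chooseBestFeatureToSplit.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, axis):
    """H(C) - H(C|T), where T is the feature stored in column `axis`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[axis]].append(row[-1])      # class labels grouped by feature value
    # Conditional entropy: each subset's entropy weighted by its share of the rows
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - conditional

sample = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
for axis in (0, 1):
    print(axis, round(info_gain(sample, axis), 3))
```

On this data the gain is about 0.420 for feature 0 and about 0.171 for feature 1, so feature 0 ('no surfacing') is the better first split.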

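# Function that splits the dataset; the tree-building function calls it repeatedly
# In effect it works a bit like a SQL SELECT statement:
# select from dataset where featVec[axis] = value
# axis is chosen by chooseBestFeatureToSplit below
# The feature that was used is dropped, because later splits no longer need it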
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
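
The filter-and-drop behaviour is easy to see on the sample data. The sketch below is an equivalent list-comprehension version; `split_rows` is a hypothetical name used only here.

```python
def split_rows(rows, axis, value):
    """Rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

sample = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

print(split_rows(sample, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(split_rows(sample, 0, 0))   # [[1, 'no'], [1, 'no']]
```

Slicing builds new lists, so the original rows are left unmodified, just as in splitDataSet.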

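# Choose the best feature to split on
# Suppose the current dataset has m features, each taking values (i1, i2, ..., in)
# The outer loop visits the m features in turn and starts a fresh entropy accumulator for each
# The inner loop visits each value of that feature and adds the subset's weighted Shannon entropy;
# once it finishes, the information gain is computed and compared with the best so far
# Returns the index of the feature with the largest information gain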

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)     
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

Building the decision tree

# Majority vote: return the class label that occurs most often
import operator

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    # dict.items() replaces the Python 2-only iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
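
For comparison, the same vote can be written with the standard library's `collections.Counter`. This is an alternative sketch, not the book's implementation, and `majority_label` is a made-up name.

```python
from collections import Counter

def majority_label(class_list):
    """Return the most frequent label in class_list."""
    return Counter(class_list).most_common(1)[0][0]

print(majority_label(['yes', 'no', 'no']))   # no
```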

# Build the tree recursively
# There are two exits from the recursion:
# if every label in the current subset is the same, return that label directly
# if all features have been used up, return the label chosen by the majority vote
def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): 
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    # Find the best feature to split on
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    # Initialize the subtree
    myTree = {bestFeatLabel:{}}
    subLabels=labels[:]
    del(subLabels[bestFeat])
    # Collect the distinct values of the chosen feature
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    # Recurse into each subset
    for value in uniqueVals:
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree
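
The result is a nested dictionary: each key is a feature label, and each value maps a feature value either to a class label or to another subtree. The post stops at building the tree, so the following is only a sketch of how such a structure could be used for prediction. The `classify` helper is an assumption added here, and the hard-coded `tree` is what createTree would be expected to return for the sample data.

```python
def classify(tree, feat_labels, test_vec):
    """Walk a nested-dict tree until a leaf (a non-dict value) is reached."""
    node = tree
    while isinstance(node, dict):
        feat_label = next(iter(node))                    # the single key names the feature
        feat_index = feat_labels.index(feat_label)       # its column in test_vec
        node = node[feat_label][test_vec[feat_index]]    # KeyError if the value was never seen
    return node

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']

print(classify(tree, labels, [1, 0]))   # no
print(classify(tree, labels, [1, 1]))   # yes
```

Note that the lookup needs the full, original label list, which is why createTree works on a copy (subLabels) instead of deleting from labels in place.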
