Decision Trees

Latest recommended article published on 2022-03-02 11:17:34

huixinbuding

Category: Machine Learning. Tags: Machine Learning

Article link: https://blog.csdn.net/huixinbuding/article/details/78676399

Machine Learning: Decision Trees

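How decision trees work

A decision tree can be viewed as a set of if-then rules: each path from the root node to a leaf node forms one rule. The features tested at the internal nodes along the path are the rule's conditions, and the class at the leaf node is the rule's conclusion.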
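
To make the rule view concrete, here is a minimal sketch that lists one rule per root-to-leaf path. It assumes the tree is stored as a nested dict of the form {feature: {value: subtree_or_label}}; the function name tree_to_rules and the hand-written toy_tree are hypothetical and serve only as an illustration.

```python
def tree_to_rules(tree, conditions=()):
    """Yield one (conditions, conclusion) pair per root-to-leaf path."""
    if not isinstance(tree, dict):
        # Leaf: the stored class label is the conclusion of the rule.
        yield list(conditions), tree
        return
    feature = next(iter(tree))  # each internal node tests exactly one feature
    for value, subtree in tree[feature].items():
        yield from tree_to_rules(subtree, conditions + ((feature, value),))


# Hypothetical tree, written by hand for illustration only.
toy_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

for conds, label in tree_to_rules(toy_tree):
    clause = ' and '.join(f'{feat} == {val}' for feat, val in conds)
    print(f'if {clause} then {label}')
```

Each printed line corresponds to one path through the tree, so a tree with three leaves gives three rules.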

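The usual feature-selection criteria for decision trees are information gain (used by ID3) and information gain ratio (used by C4.5).

Entropy: a measure of how disordered a system is; the concept is used in many fields.

Information entropy: a measure of how disordered (uncertain) information is. The more ordered the information, the lower its entropy.

Information gain: the change in information before and after splitting a data set, that is, the reduction in entropy (in the disorder of the data) that the split achieves.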
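
Written out in symbols, these are the standard textbook definitions. The notation is introduced here for illustration and is not from the original post: D is the data set, p_k the fraction of samples in class k, A a feature, and D_v the subset of D where A takes the value v.

```latex
H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k

\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v)

\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{H_A(D)},
\qquad
H_A(D) = -\sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}
```

A data set whose samples all share one class has H(D) = 0, and a split is useful exactly to the extent that the weighted entropy of its subsets falls below H(D).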

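How is a decision tree built?

Pseudocode

Check whether all items in the data set have the same class label:
    If so, return the class label
    Else:
        find the best feature to split on (the one giving the lowest entropy after the split, i.e. the largest information gain)
        split the data set
        create a branch node
        for each subset of the split:
            call createBranch (the branch-creating function, i.e. this procedure itself) and add the result to the branch node
        return the branch node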

Python implementation

Computing the Shannon entropy (or simply the entropy)

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # length of the list, i.e. the number of training samples
    labelCounts = {}
    for featVec in dataSet:  # count how often each class label occurs
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries  # relative frequency of this label
        shannonEnt -= prob * log(prob, 2)  # log base 2, so the entropy is in bits
    return shannonEnt
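
As a quick sanity check, here is a compact sketch of the same calculation using collections.Counter, run on a small toy data set. The name shannon_entropy and the toy_data values are hypothetical and chosen only for illustration; the last column holds the class label, as in the code above.

```python
from collections import Counter
from math import log2


def shannon_entropy(data_set):
    """Entropy (in bits) of the class labels stored in the last column."""
    counts = Counter(row[-1] for row in data_set)
    total = len(data_set)
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Hypothetical toy data: two binary features followed by the class label.
toy_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

print(shannon_entropy(toy_data))  # about 0.971 bits for a 2-versus-3 label mix
```

The more evenly the labels are mixed, the higher the value; a data set with a single label has entropy zero.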

Splitting the data set on a given feature

def splitDataSet(dataSet, axis, value):  # arguments: the data set, the index of the feature to split on, the feature value to keep
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # chop out the column used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
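
The effect of the split is easiest to see on a few rows. The sketch below is an equivalent list-comprehension version under the hypothetical name split_data_set, applied to made-up toy data; it keeps the rows whose chosen column matches the value and drops that column from the result.

```python
def split_data_set(data_set, axis, value):
    """Rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in data_set if row[axis] == value]


# Hypothetical toy data: two binary features followed by the class label.
toy_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

print(split_data_set(toy_data, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(split_data_set(toy_data, 0, 0))  # [[1, 'no'], [1, 'no']]
```

Because slicing builds new lists, the original rows are left untouched, which matters when the same data set is split several times on different features.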

Choosing the best way to split the data

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column holds the class labels
    baseEntropy = calcShannonEnt(dataSet)  # entropy before any split
    bestInfoGain = 0.0
    bestFeature = -1  # stays -1 if no feature gives a positive gain
    for i in range(numFeatures):  # iterate over all features and evaluate a split on each
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)  # the distinct values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))  # weight of this subset
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # information gain, i.e. the reduction in entropy
        if infoGain > bestInfoGain:  # compare with the best gain so far
            bestInfoGain = infoGain  # if better than the current best, keep it
            bestFeature = i
    return bestFeature
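
A worked example helps to see what the selection compares. The sketch below computes the information gain of every feature on a toy data set; the names entropy and information_gains and the data values are hypothetical, and the code is a self-contained illustration rather than the post's own implementation.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gains(data_set):
    """Information gain of each feature column (the last column is the label)."""
    base = entropy([row[-1] for row in data_set])
    gains = []
    for i in range(len(data_set[0]) - 1):
        groups = defaultdict(list)  # feature value -> labels of the matching rows
        for row in data_set:
            groups[row[i]].append(row[-1])
        # Weighted average entropy of the subsets produced by splitting on feature i.
        remainder = sum(len(g) / len(data_set) * entropy(g) for g in groups.values())
        gains.append(base - remainder)
    return gains


# Hypothetical toy data: two binary features followed by the class label.
toy_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

gains = information_gains(toy_data)
print(gains)  # roughly [0.420, 0.171]
print(max(range(len(gains)), key=gains.__getitem__))  # 0
```

On this data, splitting on the first feature removes more uncertainty than splitting on the second, so the first feature would be chosen.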

Building the decision tree

def createTree(dataSet, labels):  # the data set and the list of feature names
    classList = [example[-1] for example in dataSet]  # class labels from the last column of dataSet
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:  # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)  # majorityCnt must return the most frequent class label
    # the two checks above are the stopping conditions of the recursion

    bestFeat = chooseBestFeatureToSplit(dataSet)
    if bestFeat == -1:  # no feature reduces the entropy, so fall back to a majority vote
        return majorityCnt(classList)
    bestFeatLabel = labels[bestFeat]  # name of the best feature to split on
    myTree = {bestFeatLabel: {}}  # initialise myTree
    copylabels = labels.copy()
    del copylabels[bestFeat]  # remove the feature that has just been used
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)  # set() removes duplicates and returns an unordered collection of distinct values
    for value in uniqueVals:
        subLabels = copylabels[:]  # copy the labels so recursive calls don't modify the shared list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
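
createTree calls majorityCnt, which is not listed in this post. A minimal sketch of such a helper, assuming it only has to return the most frequent class label, could look like this; the implementation with collections.Counter is an assumption for illustration, not the author's code.

```python
from collections import Counter


def majorityCnt(classList):
    """Return the most frequent class label in classList."""
    # most_common(1) returns a one-element list of (label, count) pairs.
    return Counter(classList).most_common(1)[0][0]


print(majorityCnt(['yes', 'no', 'no']))  # no
```

This majority vote is what createTree falls back on when the features are used up but the remaining samples still carry mixed labels.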

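Project case 1: telling fish from non-fish

Only the resulting decision tree is inserted here.

Project case 2: predicting contact lens type with a decision tree

Only the resulting decision tree is inserted here.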

PS :

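The book Machine Learning in Action uses Python 2, so a few things behave slightly differently under Python 3.
This chapter contains the line:
firstStr = myTree.keys()[0] # get the first key of the tree
Under Python 3 this raises an error: 'dict_keys' object does not support indexing (newer Python 3 releases word it as: 'dict_keys' object is not subscriptable)
The reason is that in Python 3 dict.keys() returns a dict_keys view object, which is iterable but not indexable.
So it has to be converted to a list first, that is, changed to: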

firstStr = list(myTree.keys())[0]
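
To show where the first key is needed, here is a minimal sketch of walking a nested-dict tree to classify one sample. The function classify, the hand-written toy_tree, and the feature names are hypothetical illustrations; next(iter(tree)) is used as an alternative to list(tree.keys())[0] that avoids building a list.

```python
def classify(tree, feat_labels, test_vec):
    """Follow one root-to-leaf path of a nested-dict tree and return the leaf label."""
    while isinstance(tree, dict):
        first_str = next(iter(tree))  # same key as list(tree.keys())[0]
        feat_index = feat_labels.index(first_str)  # column of this feature in test_vec
        # Raises KeyError if the sample has a feature value the tree has no branch for.
        tree = tree[first_str][test_vec[feat_index]]
    return tree


# Hypothetical tree and feature names, written by hand for illustration only.
toy_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
toy_labels = ['no surfacing', 'flippers']

print(classify(toy_tree, toy_labels, [1, 0]))  # no
print(classify(toy_tree, toy_labels, [1, 1]))  # yes
```

Both forms work in Python 3; the list conversion stays closest to the book's original line.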

For more detailed explanations and code, visit ApacheCN.
