ML: Decision Trees

Latest recommended article published on 2021-09-10 17:10:28

jiuniangyuanzikk

Category: Machine Learning

Article link: https://blog.csdn.net/jiuniangyuanzikk/article/details/74066233

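A decision tree is a classification algorithm that splits the data recursively. The idea is to build a tree in which each internal node tests one feature, and samples that share the same value of that feature are sent to the same child node. The tree can be binary (a left and a right child) or can have more than two branches per node.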

The core of a decision tree is finding the best feature on which to split the data set. A standard way to do this is to use Shannon entropy.

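Shannon entropy:

$$H = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)$$

Here p(x_i) is the fraction of training samples that belong to class x_i, and H is the entropy of the class labels in the training set. Once we have the entropy, we choose the best split feature by picking the one that gives the largest information gain.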
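As a small worked example (the class counts are made up for illustration), take 5 samples of which 2 are labelled "yes" and 3 are labelled "no". Assuming the usual definition of entropy as the negative sum of p log2 p over the class proportions:

$$H = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.529 + 0.442 = 0.971 \text{ bits}$$

A set containing only one class has H = 0, and a two-class set split exactly in half has H = 1.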

Python code:

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count how often each class label occurs
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries  # fraction of samples in this class
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
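To see the entropy calculation on concrete numbers, here is a minimal self-contained sketch. It is not the post's own function: `shannon_entropy` and `toy_data` are names made up for this example, and it uses `collections.Counter` in place of the hand-written counting loop.

```python
from collections import Counter
from math import log2

def shannon_entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical toy data set: two features, class label in the last column.
toy_data = [
    [1, 1, "yes"],
    [1, 1, "yes"],
    [1, 0, "no"],
    [0, 1, "no"],
    [0, 1, "no"],
]

labels = [row[-1] for row in toy_data]
entropy = shannon_entropy(labels)
print(round(entropy, 3))  # 0.971
```

The more evenly the classes are mixed, the higher the value; a set with a single class gives 0.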

The entropy computed this way is then used to choose the split feature, based on the information gain.
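Written out, the information gain of splitting a data set D on a feature A is usually defined as follows. The symbols D, A and D_v are notation introduced here, not taken from the post; H denotes the Shannon entropy of the class labels:

$$\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)$$

D_v is the subset of D in which feature A takes the value v. The second term is the weighted average entropy after the split, so the gain is the reduction in entropy achieved by splitting on A.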

# splitDataSet is not shown in the original post; this version is an assumption:
# it returns the rows whose feature `axis` equals `value`, with that column removed.
def splitDataSet(dataSet, axis, value):
    return [featVec[:axis] + featVec[axis+1:] for featVec in dataSet if featVec[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column holds the class labels
    baseEntropy = calcShannonEnt(dataSet)  # entropy before any split
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):  # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of feature i
        uniqueVals = set(featList)  # the distinct values of feature i
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))  # weight of this subset
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # info gain, i.e. the reduction in entropy
        if infoGain > bestInfoGain:  # compare this to the best gain so far
            bestInfoGain = infoGain  # if better than the current best, keep it
            bestFeature = i
    return bestFeature  # index of the best feature, or -1 if no split reduces the entropy

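bestFeature is the index, within the training set, of the best feature to split on at the current node. As the code above shows, the function iterates over every feature, computes the information gain (the reduction in Shannon entropy) obtained by splitting all the samples on that feature, and keeps the feature with the largest gain.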
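The selection step can be tried end to end with the following self-contained sketch. It is an illustration rather than the post's exact code: `entropy`, `split_rows`, `info_gains` and `toy_data` are hypothetical names, and `split_rows` is an assumed stand-in for the `splitDataSet` helper, which the post calls without showing.

```python
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels stored in the last column."""
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())

def split_rows(rows, axis, value):
    """Assumed helper: rows whose feature `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

def info_gains(rows):
    """Information gain of each feature column (every column except the last)."""
    base = entropy(rows)
    gains = []
    for i in range(len(rows[0]) - 1):
        remainder = 0.0  # weighted entropy after splitting on feature i
        for value in {row[i] for row in rows}:
            subset = split_rows(rows, i, value)
            remainder += len(subset) / len(rows) * entropy(subset)
        gains.append(base - remainder)
    return gains

# Hypothetical toy data set: two features, class label in the last column.
toy_data = [
    [1, 1, "yes"],
    [1, 1, "yes"],
    [1, 0, "no"],
    [0, 1, "no"],
    [0, 1, "no"],
]

gains = info_gains(toy_data)
best_feature = max(range(len(gains)), key=gains.__getitem__)
print([round(g, 3) for g in gains], best_feature)  # [0.42, 0.171] 0
```

On this data, splitting on feature 0 leaves one pure subset and one mixed subset, while splitting on feature 1 leaves a half-and-half subset, so feature 0 has the larger gain. Unlike the function above, `max` always returns an index, even when no feature has a positive gain.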