Machine Learning In Action - Chapter 3 Decision Tree

The kNN algorithm in chapter 2 did a great job of classifying, but it didn’t lead to
any major insights about the data. One of the best things about decision trees is that
humans can easily understand the data.

  • Characteristics
Decision trees
Pros: Computationally cheap to use, easy for humans to understand learned results,
missing values OK, can deal with irrelevant features
Cons: Prone to overfitting
Works with: Numeric values, nominal values
  • createBranch() pseudocode
Check if every item in the dataset is in the same class:
  If so
      return the class label
  Else
      find the best feature to split the data
      split the dataset
      create a branch node
      for each split
          call createBranch and add the result to the branch node
      return branch node
  • Calculating the Shannon entropy

$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    # count how many instances of each class label appear in the dataset
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the last element of each row is the class label
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # H = -sum(p * log2(p)) over all class labels
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
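
As a quick sanity check, calcShannonEnt can be run on a tiny dataset in the expected row format (feature values followed by the class label); the rows below are only illustrative:

# Toy dataset: each row is [feature 1, feature 2, class label]
myDat = [[1, 1, 'yes'],
         [1, 1, 'yes'],
         [1, 0, 'no'],
         [0, 1, 'no'],
         [0, 1, 'no']]

# Two 'yes' and three 'no' labels:
# H = -(2/5)*log2(2/5) - (3/5)*log2(3/5) ≈ 0.971
print(calcShannonEnt(myDat))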
  • Information gain

https://www.zhihu.com/question/22104055/answer/67014456

Entropy: the uncertainty of a random variable.

Conditional entropy: the uncertainty of a random variable once some condition is known.

Information gain: entropy minus conditional entropy, i.e. how much the uncertainty about the variable is reduced once the condition is known.

Informally: let X (rain tomorrow) be a random variable whose entropy we can compute, and let Y (overcast tomorrow) be another random variable. If we also know the entropy of rain given that it is overcast (which requires the joint distribution, or an estimate of it from data), that is the conditional entropy. Subtracting the two gives the information gain. For example, if the entropy of rain tomorrow is 2 and the conditional entropy is 0.01 (because rain is very likely when it is overcast, so little uncertainty remains), the difference is 1.99: after learning that it is overcast, the uncertainty about rain drops by 1.99, which is a lot. The information gain is large, which means "overcast" is highly informative about "rain".

This is why information gain is commonly used for feature selection: a feature with large IG (information gain) is important for classification, and this is exactly how a decision tree chooses its features.
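
In formula form, consistent with the entropy definition above, the information gain of a condition $X$ about a variable $Y$ is

$IG(Y, X) = H(Y) - H(Y \mid X), \qquad H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x)$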

  • Choosing the best feature to split on
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)  # entropy before splitting
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)         # all distinct values of feature i
        newEntropy = 0.0
        # conditional entropy: weighted sum of the entropies of the subsets
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # information gain of splitting on feature i
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
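
The code above calls splitDataSet, which is not listed in this section. A minimal sketch consistent with how it is used here (keep the rows whose feature at index axis equals value, and drop that feature column) might look like this:

def splitDataSet(dataSet, axis, value):
    # Keep only the rows where feature `axis` equals `value`,
    # and remove that feature column from the returned rows.
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis] + featVec[axis + 1:]
            retDataSet.append(reducedFeatVec)
    return retDataSet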
  • Building the decision tree

There are two kinds of labels that must not be confused: feature labels and class labels.

For each dataset, maintain a feature_label_list holding the feature labels still available for that dataset:
1. Collect the class of each of the n instances in the dataset; if they are all the same, return that class label.
2. If there are no features left to split on, return the majority class label of the current dataset.
3. Otherwise, find the best feature to split on and remove it from feature_label_list,
    then split the dataset into one subset per value of that feature, and recursively create a tree for each subset.
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # stop if all instances share the same class label
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # stop if no features are left (only the class label column remains)
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])          # this feature is used up for the subtrees below
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]      # copy, so recursion does not mutate the caller's list
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
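
createTree falls back to majorityCnt when the features are exhausted; that helper is not shown in this section either. A minimal sketch that returns the most frequent class label, followed by an illustrative call on the toy dataset from the entropy example (the feature label names are assumptions, not from this section):

import operator

def majorityCnt(classList):
    # Return the class label that occurs most often in classList.
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# Illustrative usage with the toy dataset defined earlier:
labels = ['no surfacing', 'flippers']
myTree = createTree(myDat, labels[:])   # pass a copy; createTree mutates the label list
print(myTree)
# e.g. {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}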