【机器学习】数据挖掘算法-决策树-------机器学习~了解一下！

最新推荐文章于 2024-11-08 13:43:44 发布

AntioniaMao

最新推荐文章于 2024-11-08 13:43:44 发布

阅读量362

点赞数 1

分类专栏：机器学习文章标签： python 机器学习读书笔记

本文链接：https://blog.csdn.net/AntioniaMao/article/details/79505802

版权

机器学习专栏收录该内容

3 篇文章 0 订阅

订阅专栏

数据挖掘算法-决策树

目标：通过选择最佳的特征对数据集进行划分，以此将无序的数据变得有序，获得最佳的分类

数据集：前n-1列为特征，第n列为类别标签

算法流程：

计算原始数据集信息熵->【香农熵】集合信息的度量方式（本文后续给出了详细计算方法）

遍历使用每个特征对数据集进行划分（ID3算法）->【划分原则】将无序的数据变得更加有序

找到最佳的划分特征->递归CreateTree->

直至满足停止划分条件（后续给出具体停止条件）

图3-1【简单决策树示意图】

程序1-1 计算给定数据集的香农熵

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet: #the the number of unique elements and their occurance
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2) #log base 2
    return shannonEnt

根据香农熵的计算公式，我们可知，如果要计算香农熵，关键在于求出每一种类标签（即数据集最后一列）的次数

具体计算流程如下：

首先，使用for循环遍历数据集（dataSet）每一条数据，并取出类标签

判断该类标签是否存在于我们设定的labelCounts字典中，如果不存在则设为0，且无论是否存在，都在下一步进行自增操作

现在即得到了每一个类标签的次数，接下来将进行香农熵的计算

使用一个for循环，遍历我们的labelCounts字典，计算每一个类标签的概率，和数据集的香农熵

程序1-2 按照给定特征划分数据集

功能：挑选出符合给定特征值的数据，并形成新集合

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

输入参数：dataSet：待划分的数据集 axis：划分数据集的特征 value：特征的返回值

具体流程：

通过for循环，遍历数据集的每一个数据，判断给定的特征列是否满足特定值

如果满足则将此行数据除特征列外的其余数据提取出来并加入到一个新集合中

【注意】：python的函数传递为引用传递，函数内部对列表的修改将会影响列表本身，故声明为了保证不影响原始数据集，创建一个新的列表对象retDataSet

程序1-3 选择最好的数据集划分方式

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)     
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

目标：选择出最佳的划分特征

原理：使用每一个特征对数据集进行划分，计算划分后的香农熵，选择最佳划分方案

要求：1.数据必须是一种由列表元素组成的列表，且所有列表元素具有相同的数据长度

2.数据的最后一列或者最后一个元素必须是当前实例的类别标签

具体流程：

既然要选择出最佳的划分方案，那我们则要将划分后的数据集和划分前的数据集的香农熵进行对比

所以首先计算原数据集的香农熵

然后根据特征数目，进行for循环

提取每一个特征（列）的种类（featset=set（featlist））

然后使用该特征对数据集进行划分

计算香农熵（按照每一个种类的比例得出香农熵：newEntropy +=prob*calcShannonEnt(subDataSet）)

比较所有特征中的信息增益，得出最佳划分特征，返回最好特征划分的索引值

(1)baseEntropy

计算整个数据集的原始香农熵，保证最初的无序度量值，用于与划分完之后的数据集计算的熵值进行比较

(2)列表推导:用于提取所有特征，计算香农熵

featList=[example[i] for example in dataSet]

示例：

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

a = [example[1] for example in myDat]

a : [1, 1, 0, 1, 1]

构建决策树的模块我们已经全部编写完成了，其中包括[香农熵的计算]、[选择最佳划分特征]、[按照给定特征划分数据集]

接下里我们将开始构建决策树（好官方的说法）

程序1-4 选择频率最高的标签

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys():classCount[vote]=0
        classCount[vote] =+1
    sortedClassCount=sorted(classCount.iteritems(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

程序1-5 生成决策树（递归）

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree