Machine Learning (5): Decision Trees (Part 2) - Algorithm Implementation

This post covers the ID3 and C4.5 implementations of the decision tree algorithm: the overall algorithm framework, entropy calculation, splitting into sub-datasets, and feature selection by information gain and information gain ratio. A complete tree-building process is implemented in Python, together with a main function, a plotting helper, and the resulting classification tree. Pruning is not covered; without pruning, ID3 and C4.5 produce the same tree here.

Decision tree

The previous post, Machine Learning (5): Decision Trees (Part 1) - Principles, introduced how decision trees are grown and pruned and walked through the CART, ID3, and C4.5 algorithms. CART supports both regression and classification and is based on Gini impurity; it is not implemented here. This post implements ID3 and C4.5, which are based on information entropy. Since no pruning is involved, both algorithms end up producing the same result. Let us first look at the overall ID3 framework (C4.5 is essentially the same, differing only in how features are selected):

  • Algorithm 4.1 ID3(D)
  • Input: an attribute-valued dataset D
  • Output: a decision tree
    1. if D is "pure" OR the attribute set is empty then
    2.   return class
    3. end if
    4. for all attributes a ∈ D do
    5.   compute the information gain and select the best feature
    6. end for
    7. a_best = best attribute (feature)
    8. Tree = create a decision node that tests a_best in the root
    9. D_v = induced sub-dataset of D based on a_best
    10. for all D_v do
    11.    Tree_v = ID3(D_v)
    12. end for
    13. return Tree
Algorithm Implementation

(1) Create the training dataset:
Read the data from a .txt file, strip whitespace, split each line, and return the dataset together with the attribute (feature) names.

# process training data set
# input: directory
# output: data_set, attribute

def proData(path):
    with open(path) as fileset:                       # load the data file
        dataset = [data.strip().split('\t') for data in fileset.readlines()]
    attribute = dataset[0]                            # first line holds the attribute names
    del(dataset[0])
    return dataset, attribute
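
For reference, proData expects a tab-separated file whose first line holds the attribute names and whose last column is the class label. The lines below only illustrate that layout; they are inferred from the tree printed later and are not copied from the actual lenses.txt:

age	prescriptor	astigmatic	tearRate	class
young	myope	no	reduced	no lenses
young	myope	no	normal	soft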

(2) Calculate the information entropy:
First count the total number of training samples, then count how many samples fall in each class label, convert the counts to probabilities, and finally compute the entropy:

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i

# calculate the information entropy
# input: dataset
# output: entropy

from math import log

def calcEntropy(dataset):
    numEntries = len(dataset)
    labelCounts = {}
    for item in dataset:                              # count how many samples carry each class label (last column)
        currentLabel = item[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    entropy = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries     # probability of this class label
        entropy -= prob * log(prob, 2)
    return entropy
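
As a quick sanity check (an illustrative snippet, not part of the original code), two samples of one class and one of another give H = -(2/3)log2(2/3) - (1/3)log2(1/3) ≈ 0.918:

sample = [['sunny', 'yes'], ['rainy', 'yes'], ['sunny', 'no']]
print(calcEntropy(sample))   # ≈ 0.9183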

(3) Split the sub-dataset:
After the best splitting feature has been chosen, build the new sub-datasets according to that feature's values, grouping the samples and removing the already-used attribute (feature) column.

# split the dataset on a given feature value
# input: dataset, axis (index of the feature), value (feature value to match)
# output: sub-dataset of the matching rows with that feature column removed
def splitData(dataset, axis, value):
    splitdata = []
    for feature in dataset:
        if feature[axis] == value:
            tempFeaVec = feature[:axis]               # copy everything before the chosen column
            tempFeaVec.extend(feature[axis+1:])       # ...and everything after it
            splitdata.append(tempFeaVec)
    return splitdata
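
For example (with illustrative values), splitting on column 0 with value 'young' keeps the matching rows and drops that column:

rows = [['young', 'myope', 'no lenses'],
        ['pre',   'myope', 'no lenses'],
        ['young', 'hyper', 'soft']]
print(splitData(rows, 0, 'young'))   # [['myope', 'no lenses'], ['hyper', 'soft']]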

(4) Select the best feature:
ID3 selects the best feature by information gain, while C4.5 selects it by information gain ratio.
ID3: information gain
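
For a feature a that splits D into subsets D_1, ..., D_V, the code below computes the information gain as the base entropy minus the weighted entropy of the split:

Gain(D, a) = H(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|} H(D_v)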

# calculate the information gain of each feature (ID3) and pick the best one
# input: dataset
# output: index of the best feature
def selectBestFeature(dataset):
    numFeatures = len(dataset[0]) - 1
    baseEntropy = calcEntropy(dataset)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [features[i] for features in dataset] # all values of feature i
        uniqueVals = set(featList)                       # distinct values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitData(dataset, i, value)
            prob = float(len(subDataSet))/len(dataset)
            newEntropy += prob * calcEntropy(subDataSet) 
        infoGain = baseEntropy - newEntropy
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

C4.5: information gain ratio
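
C4.5 divides the gain by the split information of the feature itself, which penalizes features with many distinct values (this is the splitEntropy term in the code below):

GainRatio(D, a) = \frac{Gain(D, a)}{SplitInfo(D, a)}, \qquad SplitInfo(D, a) = -\sum_{v=1}^{V} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}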

# calculate the information gain ratio of each feature (C4.5) and pick the best one
# input: dataset
# output: index of the best feature
def selectBestFeature_C4(dataset):
    numFeatures = len(dataset[0]) - 1
    baseEntropy = calcEntropy(dataset)
    bestInfoGainRatio = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [features[i] for features in dataset] # all values of feature i
        uniqueVals = set(featList)                       # distinct values of this feature
        newEntropy = 0.0; splitEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitData(dataset, i, value)
            prob = float(len(subDataSet))/len(dataset)
            newEntropy += prob * calcEntropy(subDataSet)
            splitEntropy -= prob * log(prob, 2)
        if splitEntropy == 0.0:                          # only one value for this feature: skip to avoid dividing by zero
            continue
        infoGainRatio = (baseEntropy - newEntropy)/splitEntropy
        if (infoGainRatio > bestInfoGainRatio):
            bestInfoGainRatio = infoGainRatio
            bestFeature = i
    return bestFeature

(5) Create the decision tree:
First compute the information gain (or gain ratio) of every attribute (feature) with respect to the empirical entropy and select the best attribute accordingly. Then split the original dataset into sub-datasets by the values of that attribute and recursively build the tree for each sub-dataset, until a sub-dataset can no longer be split or the attribute set is empty.
ID3 tree generation

# train decision tree ID3
# input: dataset, attributes
# output: decision tree (nested dict)

import operator

def createTreeID3(dataset, attributes):
    classList = [example[-1] for example in dataset]
    classCount = {}
    if classList.count(classList[0]) == len(classList):
        return classList[0]                             # stop splitting when all data share the same label
    if len(dataset[0]) == 1:                            # stop splitting when no attribute is left: return the majority class
        for value in classList:
            if value not in classCount.keys():
                classCount[value] = 0
            classCount[value] += 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
    bestFeature = selectBestFeature(dataset)
    bestAttribute = attributes[bestFeature]
    myTree = {bestAttribute: {}}
    del(attributes[bestFeature])
    featureValues = [example[bestFeature] for example in dataset] # values of the chosen feature define the child nodes
    uniqueVals = set(featureValues)
    for value in uniqueVals:
        subattributes = attributes[:]                   # copy so sibling branches see the same attribute list
        myTree[bestAttribute][value] = createTreeID3(splitData(dataset, bestFeature, value), subattributes)
    return myTree

C4.5 tree generation

# train decision tree C4.5
# input: dataset, attributes
# output: decision tree (nested dict)
def createTreeC4(dataset, attributes):
    classList = [example[-1] for example in dataset]
    classCount = {}
    if classList.count(classList[0]) == len(classList):
        return classList[0]                             # stop splitting when all data share the same label
    if len(dataset[0]) == 1:                            # stop splitting when no attribute is left: return the majority class
        for value in classList:
            if value not in classCount.keys():
                classCount[value] = 0
            classCount[value] += 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
    bestFeature = selectBestFeature_C4(dataset)
    bestAttribute = attributes[bestFeature]
    myTree = {bestAttribute: {}}
    del(attributes[bestFeature])
    featureValues = [example[bestFeature] for example in dataset] # values of the chosen feature define the child nodes
    uniqueVals = set(featureValues)
    for value in uniqueVals:
        subattributes = attributes[:]                   # copy so sibling branches see the same attribute list
        myTree[bestAttribute][value] = createTreeC4(splitData(dataset, bestFeature, value), subattributes)
    return myTree

(6) Main function:
Give the location of the data file and output the final result.

# main function
if __name__ == "__main__":
    # dataset processing (createPlot is defined in the plotting code in section (7))
    path = r'F:\Program\Python\Machine_Learning\Decision_tree\lenses.txt'
    dataset, attributes = proData(path)
    myTreeID3 = createTreeID3(dataset, attributes)
    dataset, attributes = proData(path)                 # reload: the attribute list is consumed while building the tree
    myTreeC4 = createTreeC4(dataset, attributes)
    print(myTreeID3)
    createPlot(myTreeID3)
    print(myTreeC4)
    createPlot(myTreeC4)

(7) Plotting function:
The generated decision tree is hard to read as plain text. With a small plotting helper, the tree can be drawn directly and inspected visually.

# Project: Machine learning-decision tree
# Author: Lyndon
# date: 2015/10/27

from matplotlib import pyplot as plt

# define the format of text and arrow
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrowArgs = dict(arrowstyle="<-")

# calculate the number of leaves and the depth of the tree
# input: decision tree
# output: number of leaves, depth of the tree
def calNumLeaves(tree):
    numLeaves = 0
    maxDepth = 0
    firstNode = list(tree.keys())[0]
    secondDict = tree[firstNode]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':        #check if the node is leaf
            subnumLeaves,submaxDepth = calNumLeaves(secondDict[key])
            numLeaves += subnumLeaves
            thisDepth = 1 +submaxDepth
        else: 
            numLeaves +=1
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return numLeaves,maxDepth

# plot a node and the edge label from its parent
# input: node text, edge text, child position (center), parent position, node style
# output: null
def plotsubtree(node, text, center, parent, nodeType):
    createPlot.ax1.annotate(node, xy=parent, xycoords='axes fraction',
                            xytext=center, textcoords='axes fraction',
                            va='center', ha='center', bbox=nodeType, arrowprops=arrowArgs)
    xMid = (parent[0]-center[0])/2.0 + center[0]         # midpoint of the edge, where the edge label is drawn
    yMid = (parent[1]-center[1])/2.0 + center[1]
    createPlot.ax1.text(xMid, yMid, text, va='center', ha='center', rotation=30)

# plot the tree
# input: tree
# output: null
def plotTree(tree,parent,nodetxt):
    numLeaves, depth = calNumLeaves(tree)
    firstNode = list(tree.keys())[0]
    center = (plotTree.xOff+(1+float(numLeaves))/2.0/plotTree.num,plotTree.yOff )
    plotsubtree(firstNode, nodetxt, center, parent, decisionNode)
    secondDict = tree[firstNode]
    plotTree.yOff -=1.0/plotTree.depth 
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict': 
            plotTree(secondDict[key], center, str(key))
        else:
            plotTree.xOff += 1.0/plotTree.num
            plotsubtree(secondDict[key], str(key), (plotTree.xOff,plotTree.yOff), center, leafNode)
    plotTree.yOff += 1.0/plotTree.depth

# plot the Tree
# input: Tree
# output: Null
def createPlot(tree):
    fig = plt.figure(1,facecolor='white')
    fig.clf()
    axprops = dict(xticks=[],yticks=[])
    createPlot.ax1 = plt.subplot(111,frameon=False,**axprops) 
    plotTree.num, plotTree.depth = calNumLeaves(tree)
    plotTree.xOff = -0.5/plotTree.num; plotTree.yOff = 1.0
    plotTree(tree,(0.5,1.0),'')
    plt.show()

(8) Classification result:
Decision tree as text output:
{'tearRate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'prescriptor': {'hyper': {'age': {'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age': {'pre': 'soft', 'presbyopic': {'prescriptor': {'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}
Decision tree plot:
(figure: the generated decision tree as drawn by createPlot)
Since there is no pruning step in this example, the ID3 and C4.5 implementations produce exactly the same final tree.
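
The nested-dict tree can also be used to classify new samples. The helper below is not part of the original code; it is a minimal sketch that assumes a sample's values are given in the same attribute order as the training data:

# walk the nested-dict tree until a leaf (class label) is reached
def classify(tree, attributes, sample):
    node = list(tree.keys())[0]                 # attribute tested at this node
    branches = tree[node]
    value = sample[attributes.index(node)]      # the sample's value for that attribute
    child = branches[value]
    if isinstance(child, dict):                 # internal node: keep descending
        return classify(child, attributes, sample)
    return child                                # leaf: class label

# hypothetical usage with a fresh (unconsumed) attribute list:
# classify(myTreeID3, ['age', 'prescriptor', 'astigmatic', 'tearRate'],
#          ['young', 'myope', 'yes', 'normal'])   # -> 'hard' according to the tree printed above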
PS:
This post implemented the ID3 and C4.5 decision tree algorithms in Python, simply applying information gain and information gain ratio for classification. The code follows Machine Learning in Action. Complete code and data
