Decision Trees

Lanbocsdn

Category: Machine Learning. Tags: decision tree, algorithm

Article link: https://blog.csdn.net/LanboCSDN/article/details/78429629


Building a decision tree usually involves three steps:

Feature selection
Tree generation
Tree pruning
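
For the feature-selection step, the ID3 example later in this post scores each candidate feature by information gain, which is computed from Shannon entropy. As a reference, the two quantities can be written as follows. The notation is chosen here and does not come from the original post: $D$ is the data set, $p_k$ the fraction of samples in class $k$, and $D_v$ the subset of $D$ where feature $A$ takes the value $v$.

```latex
H(D) = -\sum_{k} p_k \log_2 p_k
\qquad
\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```

The feature with the largest gain is the one chosen for the split.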

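General workflow for decision trees

Collect data: any method can be used.
Prepare data: the tree-construction algorithm used in the example below (ID3) only works on nominal data, so numeric data must be discretized first.
Analyze data: any method can be used; once the tree is built, check that the plotted tree looks as expected.
Train the algorithm: build the tree data structure.
Test the algorithm: compute the error rate with the learned tree.
Use the algorithm: this step applies to any supervised learning algorithm; a decision tree has the added benefit of making the structure of the data easier to understand.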
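
The "prepare data" step says numeric data must be discretized. A minimal sketch of one way to do that is binning by fixed cut points. The function name `discretize`, the cut points and the category names are all made up for illustration and are not part of the original post.

```python
from bisect import bisect_right

def discretize(value, cut_points, names):
    """Map a numeric value to a nominal label using sorted cut points."""
    # bisect_right returns how many cut points are <= value,
    # which is the index of the bin the value falls into.
    return names[bisect_right(cut_points, value)]

# Hypothetical example: ages binned into three nominal categories.
AGE_CUTS = [30, 50]                        # bins: below 30, 30 to 49, 50 and above
AGE_NAMES = ['young', 'middle', 'senior']  # one name per bin

ages = [22, 30, 49, 50, 71]
age_labels = [discretize(a, AGE_CUTS, AGE_NAMES) for a in ages]
print(age_labels)
```

How the cut points are chosen (equal width, equal frequency, domain knowledge) is a separate decision that this sketch leaves open.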

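Python in practice
scikit-learn provides two decision tree estimators, both built on an optimized version of the CART algorithm.
Regression tree (DecisionTreeRegressor)
DecisionTreeRegressor()
Parameters
- criterion: a string specifying how the quality of a split is measured. In the scikit-learn release this post was written against, the default and only supported value was 'mse' (mean squared error). Since scikit-learn 1.0 this value is named 'squared_error', and 'friedman_mse', 'absolute_error' and 'poisson' are also accepted.
- splitter: a string specifying the split strategy: 'best' chooses the best split; 'random' chooses the best of randomly drawn candidate splits.
- max_features: an integer, float, string or None, specifying how many features are considered when looking for the best split.
1) If an integer, each split considers max_features features.
2) If a float, each split considers max_features * n_features features (max_features is a fraction).
3) If None, max_features equals n_features. If 'sqrt', max_features equals sqrt(n_features). The legacy value 'auto' meant n_features for the regressor and sqrt(n_features) for the classifier; it has been removed from recent releases.
4) If 'log2', max_features equals log2(n_features).
- max_depth: an integer or None, specifying the maximum depth of the tree.
1) If None, the depth is unlimited: nodes are expanded until every leaf is pure (for classification, all samples in the leaf belong to one class) or contains fewer than min_samples_split samples.
2) The documentation of older releases states that max_depth is ignored when max_leaf_nodes is not None; newer releases no longer document this, and both limits apply.
- min_samples_split: an integer (newer releases also accept a float fraction), the minimum number of samples an internal (non-leaf) node must contain to be split.
- min_samples_leaf: an integer, the minimum number of samples each leaf node must contain.
- min_weight_fraction_leaf: a float, the minimum fraction of the total sample weight that a leaf node must hold.
- max_leaf_nodes: an integer or None, specifying the maximum number of leaf nodes.
1) If None, the number of leaf nodes is unlimited.
2) If not None, see the note on max_depth above.
- class_weight: a dict, a list of dicts, the string 'balanced' or None, specifying the class weights in the form {class_label: weight}. This parameter belongs to DecisionTreeClassifier only; DecisionTreeRegressor does not accept it.
1) If None, every class has weight 1.
2) 'balanced' sets each class weight inversely proportional to the frequency of that class in the training data.
- random_state: an integer, a RandomState instance or None; seeds the random number generator so that results are reproducible.
- presort: a boolean specifying whether to presort the data to speed up the search for the best split. Setting it to True slows down training on large data sets, but speeds it up on a small data set or when the maximum depth is limited. This parameter was deprecated in scikit-learn 0.22 and removed in later releases.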

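Attributes

- feature_importances_: the importance of each feature. The higher the value, the more important the feature (also known as Gini importance).
- max_features_: the inferred value of max_features
- n_features_: the number of features seen during fit (renamed n_features_in_ in scikit-learn 1.0)
- n_outputs_: the number of outputs seen during fit
- tree_: a Tree object, the underlying decision tree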

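Methods

- fit(X, y[, sample_weight, check_input, …]): train the model
- predict(X[, check_input]): predict with the model and return the predicted values
- score(X, y[, sample_weight]): return the coefficient of determination R^2 of the predictions on (X, y)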
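
To make the regressor's parameters, attributes and methods concrete, here is a minimal sketch. It assumes a reasonably recent scikit-learn and NumPy are installed; the synthetic sine data and the particular parameter values are made up for illustration and are not from the original post.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (invented for this sketch): y = sin(x) plus a little noise.
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# Alternate rows go to the training and test sets.
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

# Limit depth and leaf size so the tree cannot memorize the noise.
reg = DecisionTreeRegressor(max_depth=4, min_samples_leaf=5, random_state=0)
reg.fit(X_train, y_train)

pred = reg.predict(X_test)        # one predicted value per test row
r2 = reg.score(X_test, y_test)    # coefficient of determination R^2

print("R^2 on held-out points:", r2)
print("feature importances:", reg.feature_importances_)
print("depth of fitted tree:", reg.tree_.max_depth)
```

With a single input feature, feature_importances_ has one entry; on real data it shows how the splits are distributed across features.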

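Classification tree (DecisionTreeClassifier)
DecisionTreeClassifier()

Parameters

- criterion: a string specifying how the quality of a split is measured: 'gini' uses Gini impurity; 'entropy' uses entropy (information gain).
- splitter: a string specifying the split strategy: 'best' chooses the best split; 'random' chooses the best of randomly drawn candidate splits.
The remaining parameters have the same meaning as described for DecisionTreeRegressor above:
- max_features
- max_depth
- min_samples_split
- min_samples_leaf
- min_weight_fraction_leaf
- max_leaf_nodes
- class_weight
- random_state
- presort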

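Attributes

- classes_: the class labels
- feature_importances_: the importance of each feature. The higher the value, the more important the feature (also known as Gini importance).
- max_features_: the inferred value of max_features
- n_classes_: the number of classes
- n_features_: the number of features seen during fit (renamed n_features_in_ in scikit-learn 1.0)
- n_outputs_: the number of outputs seen during fit
- tree_: a Tree object, the underlying decision tree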

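Methods

- fit(X, y[, sample_weight, check_input, …]): train the model
- predict(X[, check_input]): predict with the model and return the predicted values
- predict_log_proba(X): return an array whose entries are the log-probabilities of each sample in X belonging to each class
- predict_proba(X): return an array whose entries are the probabilities of each sample in X belonging to each class
- score(X, y[, sample_weight]): return the accuracy of the predictions on (X, y)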
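
A short sketch of the classifier API described above, using the iris data set bundled with scikit-learn. It assumes a reasonably recent scikit-learn; the data set, the split and the parameter values are chosen only for illustration and are not from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# 'entropy' selects splits by information gain; max_depth keeps the tree small.
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test[:5])   # one row per sample, one column per class
accuracy = clf.score(X_test, y_test)    # fraction of correct predictions

print("classes:", clf.classes_)
print("number of classes:", clf.n_classes_)
print("class probabilities for 5 samples:")
print(proba)
print("test accuracy:", accuracy)
```

Each row of predict_proba sums to 1, and predict returns the class whose column holds the largest value in that row.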

Example: predicting contact lens type with a decision tree

'''
Created on Oct 12, 2010
Decision Tree Source Code for Machine Learning in Action Ch. 3
@author: Peter Harrington
'''
from math import log
import operator
# treePlotter is the companion plotting module (treePlotter.py) from the same book chapter
from treePlotter import retrieveTree,createPlot
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing','flippers']
    #change to discrete values
    return dataSet, labels

### Compute the Shannon entropy of a given data set
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet: #count the unique class labels and their occurrences
        currentLabel = featVec[-1]   #the class label is the last column
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2) #log base 2
    return shannonEnt

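myDat,labels=createDataSet()
myDat
calcShannonEnt(myDat)   #about 0.971 for 2 'yes' and 3 'no'

#adding a third class makes the data more mixed, so the entropy rises
myDat[0][-1]='maybe'
calcShannonEnt(myDat)   #about 1.371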

### Split the data set on a given feature
### Three arguments: the data set to split, the index of the feature to split on, and the feature value to keep
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

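### Illustrate the difference between append and extend
a=[1,2,3]
b=[4,5,6]
a.append(b)
a   #[1, 2, 3, [4, 5, 6]]: b is added as a single nested element

a=[1,2,3]
a.extend(b)
a   #[1, 2, 3, 4, 5, 6]: the elements of b are added one by one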

myDat,labels=createDataSet()
myDat
splitDataSet(myDat,0,1)
splitDataSet(myDat,0,0)
splitDataSet(myDat,1,0)

### Choose the best feature to split the data set on
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)     
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer
chooseBestFeatureToSplit(myDat)

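def majorityCnt(classList):
    #return the class label that occurs most often in classList
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    #items() replaces Python 2's iteritems(), which no longer exists in Python 3
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]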

### Build the tree recursively
def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): 
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    subLabels = labels[:]       #work on a copy so the caller's label list is not modified
    del(subLabels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree                            
myTree=createTree(myDat,labels)
myTree

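def classify(inputTree,featLabels,testVec):
    #the single key of the dict is the feature tested at this node
    #(dict.keys() cannot be indexed in Python 3, so take the first key with next(iter(...)))
    firstStr = next(iter(inputTree))
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict): 
        classLabel = classify(valueOfFeat, featLabels, testVec)   #descend into the subtree
    else: classLabel = valueOfFeat                                 #reached a leaf
    return classLabel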
myDat,labels=createDataSet()
labels
myTree=retrieveTree(0)
myTree
classify(myTree,labels,[1,1])

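### Store the decision tree with the pickle module
def storeTree(inputTree,filename):
    import pickle
    with open(filename,'wb') as fw:   #pickle writes bytes, so open in binary mode
        pickle.dump(inputTree,fw)
#Load a stored tree back from disk
def grabTree(filename):
    import pickle
    with open(filename,'rb') as fr:   #binary mode is required for reading as well
        return pickle.load(fr)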

storeTree(myTree,'clastro.txt')
grabTree('clastro.txt')

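### Example: predicting contact lens type with a decision tree
### The lens types are hard, soft, and not suitable for contact lenses
with open('lenses.txt') as fr:
    lenses=[inst.strip().split('\t') for inst in fr.readlines()]   #tab-separated, class label in the last column
lensesLabels=['age','prescript','astigmatic','tearRate']
lensesTree=createTree(lenses,lensesLabels)
lensesTree
createPlot(lensesTree)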

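[Figure: the lenses decision tree drawn by createPlot()]
The tree is hard to make out in its text form, so createPlot() is called to draw it as a diagram.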

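Following the different branches of the tree gives the type of contact lens that different patients should wear. The tree also shows that a doctor needs to ask at most 4 questions to determine which type of lens a patient needs.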
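
The "at most 4 questions" observation corresponds to the depth of the tree. A minimal sketch of how that depth could be read off a nested-dict tree is shown here. The helper name `tree_depth` is made up for this illustration, and `sample_tree` is a small hand-written tree in the same nested-dict format, not the lenses tree itself.

```python
def tree_depth(tree):
    """Return the number of questions on the longest root-to-leaf path."""
    if not isinstance(tree, dict):
        return 0                      # a leaf is a class label: no question asked
    feature = next(iter(tree))        # the single key is the feature tested here
    branches = tree[feature]          # maps each feature value to a subtree or leaf
    return 1 + max(tree_depth(child) for child in branches.values())

# Hand-written toy tree in the nested-dict format used by createTree
sample_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
depth = tree_depth(sample_tree)
print(depth)
```

For the toy tree the longest path asks about 'no surfacing' and then 'flippers', so the depth is 2.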

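The tree above fits the experimental data very well, but it may contain too many branches tailored to that data; this problem is called overfitting. To reduce overfitting, the tree can be pruned by removing unnecessary leaf nodes: if a leaf adds only a little information, it can be deleted and merged into another leaf.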
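
The ID3 code in this post has no pruning step. As an illustration of the idea only, scikit-learn offers minimal cost-complexity pruning through the `ccp_alpha` parameter, which is a different technique from the leaf-merging described above but serves the same purpose. The sketch assumes scikit-learn 0.22 or later; the iris data set and the value 0.02 are chosen arbitrarily for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unpruned tree: grown until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pruned tree: subtrees whose benefit does not justify their size
# (as weighed by ccp_alpha) are collapsed into a single leaf.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

n_full = full.get_n_leaves()
n_pruned = pruned.get_n_leaves()
print("leaves without pruning:", n_full)
print("leaves with ccp_alpha=0.02:", n_pruned)
```

A larger `ccp_alpha` prunes more aggressively; in practice the value is usually picked by cross-validation rather than fixed by hand.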

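This example uses the ID3 algorithm, which is a good algorithm but not a perfect one. ID3 cannot handle numeric data directly. Numeric data can be converted to nominal data by quantization, but if a feature ends up with too many split values, ID3 still runs into other problems.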
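
The post does not say which "other problems" it means. One commonly cited issue, offered here as an interpretation, is that information gain favors features with many distinct values. The sketch below shows this on made-up data: a unique row id achieves the maximum possible gain even though it is useless for predicting new samples. The function names and the data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    """Information gain of splitting rows (last item = class label) on column col."""
    base = entropy([r[-1] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])   # class labels per feature value
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Made-up data: column 0 is a unique row id, column 1 is a weakly informative feature.
rows = [
    [0, 'a', 'yes'],
    [1, 'a', 'yes'],
    [2, 'b', 'no'],
    [3, 'b', 'no'],
    [4, 'a', 'no'],
    [5, 'b', 'yes'],
]
gain_id = info_gain(rows, 0)     # every subset has one row, so its entropy is 0
gain_feat = info_gain(rows, 1)
print("gain of row id:", gain_id)
print("gain of real feature:", gain_feat)
```

Because every id value isolates a single row, the split looks perfect to ID3 and would be selected first, which is why finely quantized features can mislead the algorithm.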
