决策树分类算法（包含隐形眼镜分类的代码）

最新推荐文章于 2024-08-23 19:32:33 发布

heda3

最新推荐文章于 2024-08-23 19:32:33 发布

阅读量1.8k

点赞数 1

分类专栏：机器学习文章标签：决策树 ID3 C4.5 CART

本文链接：https://blog.csdn.net/heda3/article/details/83019800

版权

机器学习专栏收录该内容

28 篇文章 5 订阅

订阅专栏

决策树-（Detection Tree）

首先它是一个有监督学习算法、属于判别模型、非线性分类

优缺点：

优点：（1）能够处理数值型和类别型数据（2）不需要先验知识和参数假设（3）适合高维数据（4）准确性高，计算量小缺点：（1）决策树的结果不太稳定，数据很小变化会导致生成一个完全不同的树（2）决策树学习基于启发式，每次寻找每个节点的局部最优就决策，无法保证全局最优（3）决策树可以创建很复杂的树，但是可能无法推广，也即是过拟合，为此剪枝很关键
适用数据类型：数值型和标称型。

应用场景：常用于分类和回归，市场营销和生物医药

描述：

参考：决策树学习的本质： 从训练集中归纳出一组分类规则，或者说是由训练数据集估计条件概率模型？

包含3个步骤：特征选择、决策树的生成、决策树的修剪

有3个典型的算法：ID3（使用信息增益生成决策树）、C4.5（使用信息增益比生成决策树）、CART

框架：模型、策略（损失函数）、算法

学习模型：目的是找到一个决策模型使得对数据进行正确分类

策略：

损失函数通常是正则化的极大似然函数，策略是以此损失函数为目标函数的最小化

算法：

学习的问题转变为在损失函数意义下选择最优决策树的问题

解决方法：采用递归算法

决策树包含3个步骤：特征选择、决策树的生成、决策树的修剪

1、特征选择

特征数量多时只留下对数据有足够分类能力的特征。

特征选择的准则是：信息增益，当然数据集的熵很大时可以采用信息增益比的方式。

信息增益的计算：

在已知的特征A下使得类Y的信息不确定性减少的程度

经验熵：熵表示为随机变量不确定性的大小，值越大表示不确定性越大

应用于此决策树，i则表示为不同类别，pi表示不同类别所属的概率

经验条件熵：

应用于此决策树，X表示特征数，具体可能为k个特征A1、A2...A k，Y表示数据集；考虑对一个特征A1分析，其它特征同理；首先根据此特征A1的取值不同有不同的样本集划分Di,例如A1有3个取值，则3个取值下有3个样本集，此3个样本集构成数据集Y；

则上述的求和次数为n=3；H（Y|X=xi）相当于H（Di）表示每个取值下的数据集的经验熵；（可参考青年中年老年的贷款例子）

pi表示特征取不同值的概率。

2、决策树生成：

分为两种使用信息增益和信息增益比。也即是ID3算法和C4.5算法；

其中ID3算法是：从根结点出发，对结点计算所有的信息增益，选择信息增益最大的特征作为结点的特征

，由该特征的不同取值建立子结点（不同的样本），再对子结点（样本集）递归调用上述计算信息增益的过程，构建出决策树，直到所有特征的信息增益均很小或者是没有特征可以选择时结束，最后得到一个决策树模型。

C4.5算法类似ID3，不同之处在于选择特征使用的信息增益比

信息增益计算，哪个信息增益最大：

（1）

https://www.cnblogs.com/zy230530/p/6813250.html

关于字符串的和列表的访问详细的介绍https://blog.csdn.net/qq_26442553/article/details/81507972

使用决策树预测隐形眼镜类型解读：

使用的ID3算法

1）递归的构造决策树

# -*- coding: utf-8 -*-
"""
Created on Tue Sep 24 14:55:39 2019

@author: Administrator
"""
from math import log
import operator
import treePlotter as tP
import trees as tr_f
#递归构建决策树
def createTree(dataSet, labels):
    # classList 表示类别
    classList = [example[-1] for example in dataSet]#把dataset中的每一行放入example，然后找到相应的example[-1] 组成列表
   # 参考https://blog.csdn.net/jiangsujiangjiang/article/details/84313227
    #https://blog.csdn.net/weixin_41580067/article/details/82888699
    #两种特殊情况1、最终只剩下一个样本 2、只剩下一个特征
    if classList.count(classList[0]) == len(classList):
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)#多数表决法决定该叶子节点的分类
    #寻找最佳的特征 bestFeat表示索引
    bestFeat = chooseBestFeatureToSplit(dataSet)#选择最好的数据集划分方式
    bestFeatLabel = labels[bestFeat]#找出特征作为根节点
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])#删除此特征
    
    #特征下对应的值
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)#找出不重复的特征值（该特征下可能对应多个特征值）
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

#多数表决法决定该叶子节点的分类
def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)#降序
    return sortedClassCount[0][0]    
#选择最好的数据集划分方式
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)#计算原始数据的香农熵
    bestInfoGain = 0.0; bestFeature = -1
    # 依据特征数选择最佳的划分特征
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        #创建此特征取值的不重复类型
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        #计算特征下（列）对应的经验条件熵
        for value in uniqueVals:
            #找出第i个特征下对应值的数据样本个数
            subDataSet = splitDataSet(dataSet, i, value)#划分数据集: 按照给定特征划分数据集
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)#计算熵
        #计算信息增益
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature
#划分数据集: 按照给定特征划分数据集
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
#熵的定义
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    #统计当前数据集的类别标签出现的次数，例如两个类别 类别1出现的次数为x1 类别2出现的次数为x2 总数=x1+x2
    for featVec in dataSet: #the the number of unique elements and their occurance
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob, 2) #log base 2
    return shannonEnt

2）测试

#使用决策树执行分类
    #解析https://www.cnblogs.com/wyuzl/p/7700872.html
    #依据类别标签和待测数据的的值，找到对应的树子结点，递归的查找。并返回最终分类的值
def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree)[0]#当前树的根节点的特征名称 
    secondDict = inputTree[firstStr]#根节点的所有子节点
    featIndex = featLabels.index(firstStr)#找到根节点特征对应的下标  
    key = testVec[featIndex] #找出待测数据的特征值  
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):#判断valueOfFeat是否是dict类型，若是dict类型则说明不是叶结点（叶结点为str）
        classLabel = classify(valueOfFeat, featLabels, testVec)#递归的进入下一层结点
    else: classLabel = valueOfFeat#如果是叶结点，则确定待测数据的分类
    return classLabel

3）实际执行（读取数据--构建决策树---分类--存储决策树-绘制决策树）

# main
  #读取眼镜数据并构建树
fr = open('lenses.txt')
lenses = [inst.strip().split('\t') for inst in fr.readlines()]
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
lensesTree = createTree(lenses,lensesLabels)
print(lensesTree)

# plot tree
tP.createPlot(lensesTree)
#对新数据进行分类
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
testVec=['young','hyper','yes','normal']
result=classify(lensesTree,lensesLabels, testVec)
print(result)

#存储构建的树并加载树
tr_f.storeTree(lensesTree,'ClassfyTree_lenses.txt')
load_tree=tr_f.grabTree('ClassfyTree_lenses.txt')
print(load_tree)

具体见：https://github.com/codehgq/ML-note/tree/master/Lense%20Classify

欢迎Star！