Continuing from the previous post, [Machine Learning] Decision Tree Algorithm - 1 (Algorithm Introduction), this post turns the merit-student selection table into a worked code example.
1. The overall decision tree module (starting with discrete data)
**Jimei University merit-student ("三好学生") selection table**

| # | Failed a course? | Scholarships won | Comprehensive evaluation | Physical fitness up to standard? | Dormitory assessment | Qualifies? |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | no | 4 | excellent | yes | excellent | yes |
| 2 | no | 1 | good | no | excellent | no |
| 3 | no | 0 | excellent | yes | excellent | yes |
| 4 | no | 1 | excellent | no | excellent | no |
| 5 | no | 2 | good | yes | excellent | yes |
| 6 | no | 1 | excellent | yes | excellent | no |
| 7 | no | 1 | excellent | yes | excellent | yes |
| 8 | yes | 0 | good | yes | excellent | no |
| 9 | no | 2 | good | yes | good | no |
| 10 | no | 2 | excellent | yes | excellent | yes |
| 11 | yes | 2 | excellent | yes | excellent | no |
| 12 | yes | 0 | good | yes | good | no |
| 13 | yes | 0 | excellent | yes | pass | no |
| 14 | no | 4 | excellent | yes | excellent | yes |
| 15 | no | 2 | excellent | yes | excellent | yes |
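As a sanity check before the code, the base entropy of the table can be computed by hand: the "Qualifies?" column has 7 yes and 8 no out of 15 samples, so

$$
H(D) = -\frac{7}{15}\log_2\frac{7}{15} - \frac{8}{15}\log_2\frac{8}{15} \approx 0.997
$$

The `calcShannonEnt` helper used below should return this value for the full dataset.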
```python
import operator


def createDataSet1():
    # One row per student from the table above; the last element of each
    # sample is the class label ("qualifies?").
    dataSet = [['no',  '4', 'excellent', 'yes', 'excellent', 'yes'],
               ['no',  '1', 'good',      'no',  'excellent', 'no'],
               ['no',  '0', 'excellent', 'yes', 'excellent', 'yes'],
               ['no',  '1', 'excellent', 'no',  'excellent', 'no'],
               ['no',  '2', 'good',      'yes', 'excellent', 'yes'],
               ['no',  '1', 'excellent', 'yes', 'excellent', 'no'],
               ['no',  '1', 'excellent', 'yes', 'excellent', 'yes'],
               ['yes', '0', 'good',      'yes', 'excellent', 'no'],
               ['no',  '2', 'good',      'yes', 'good',      'no'],
               ['no',  '2', 'excellent', 'yes', 'excellent', 'yes'],
               ['yes', '2', 'excellent', 'yes', 'excellent', 'no'],
               ['yes', '0', 'good',      'yes', 'good',      'no'],
               ['yes', '0', 'excellent', 'yes', 'pass',      'no'],
               ['no',  '4', 'excellent', 'yes', 'excellent', 'yes'],
               ['no',  '2', 'excellent', 'yes', 'excellent', 'yes']]
    labels = ['Failclass', 'Scholarship-num', 'Grade-ranking',
              'Physically-fit', 'Hostel-assessment']
    return dataSet, labels
```
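The functions below call two helpers, `calcShannonEnt` and `splitDataSet`, that were introduced in the previous post and are not repeated in this listing. So that the code here runs standalone, a minimal sketch of the standard ID3-style definitions follows (my reconstruction; the versions in the earlier post may differ in detail):

```python
from math import log


def calcShannonEnt(dataSet):
    # Shannon entropy of the class-label column (last element of each sample).
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt


def splitDataSet(dataSet, axis, value):
    # Rows whose feature `axis` equals `value`, with that feature removed.
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
```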
```python
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # Stop when every sample in this branch carries the same class label.
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop when only the label column is left; fall back to a majority vote.
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        # Copy the remaining labels so sibling branches do not interfere.
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

def majorityCnt(classList):
    # Majority vote: return the most frequent class label.
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1   # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0
        for value in uniqueVals:
            # Weighted entropy of the subsets obtained by splitting on feature i.
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        print("conditional entropy: %f" % newEntropy)
        infoGain = baseEntropy - newEntropy
        # Keep the feature with the largest information gain.
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
```
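A minimal driver (my own addition, not part of the original listing) to build and print the tree from the table above:

```python
if __name__ == '__main__':
    dataSet, labels = createDataSet1()
    # Pass a copy of labels: createTree deletes entries from the list it receives.
    myTree = createTree(dataSet, labels[:])
    print(myTree)
```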