Study Notes on Statistical Learning Methods - Decision Trees (3): Implementing a Decision Tree in Python (Based on the C4.5 Algorithm)

Latest recommended article published on 2022-07-04 23:46:21

三岁就很萌@D

Categories: Machine Learning Algorithms, Statistical Learning Methods

Article link: https://blog.csdn.net/qq_44822951/article/details/109320395

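In the earlier posts, Decision Tree Models Explained (1): How to Select Features
and Decision Tree Models Explained (2): How to Generate and Prune a Decision Tree, we covered the three steps of the decision tree algorithm: feature selection, tree generation, and pruning.
This post shows how to build a decision tree with the C4.5 generation algorithm, and how to prune it.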
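
As a quick refresher on the feature-selection criterion that C4.5 relies on, here is a minimal standalone sketch of the information gain ratio. It is independent of the class developed in this post: the function names `entropy` and `gain_ratio` and the toy data are illustrative assumptions, and it uses plain Python lists rather than NumPy arrays so that it runs with the standard library alone.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical entropy (in bits) of a list of discrete values."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(rows, feature, label):
    """Information gain of column `feature`, divided by that column's own entropy."""
    labels = [r[label] for r in rows]
    feature_values = [r[feature] for r in rows]
    h_d = entropy(labels)
    # Conditional entropy: entropy of each block of rows sharing a feature
    # value, weighted by the block's share of the data.
    h_d_a = 0.0
    for value in set(feature_values):
        block = [r[label] for r in rows if r[feature] == value]
        h_d_a += len(block) / len(rows) * entropy(block)
    gain = h_d - h_d_a
    h_a = entropy(feature_values)
    # A feature with a single value has zero entropy; treat its ratio as 0.
    return gain / h_a if h_a > 0 else 0.0


# Toy data (hypothetical): columns 0 and 1 are features, column 2 is the label.
toy = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]
print(gain_ratio(toy, feature=0, label=2))  # feature 0 determines the label
print(gain_ratio(toy, feature=1, label=2))  # feature 1 carries no information
```

In this toy data, feature 0 would be chosen for the split because it has the larger gain ratio.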

A branch of the decision tree

class Edge:  # represents one branch of the tree
    def __init__(self, child, value):
        self.child = child  # the node this edge connects to
        self.value = value  # the value of this edge

A node of the decision tree

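class Node:
    def __init__(self, data, edges, feature, value):
        # data: the data contained in this node
        # edges: the edges to this node's children (a C4.5 tree is not a binary tree)
        # feature: the feature this node uses to split into child nodes
        # value: the class label; only leaf nodes have a value
        self.data = data
        self.edges = edges
        self.feature = feature
        self.value = value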
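
To make the roles of `Edge` and `Node` concrete, the sketch below builds a tiny tree by hand and walks it to classify a sample. The two classes are repeated so the block runs on its own. The `predict` helper, the convention that a leaf has an empty `edges` list, and the weather-style values are assumptions made for illustration; they are not taken from the implementation in this post, whose traversal may differ.

```python
class Edge:
    def __init__(self, child, value):
        self.child = child  # node this edge leads to
        self.value = value  # feature value this edge stands for


class Node:
    def __init__(self, data, edges, feature, value):
        self.data = data
        self.edges = edges
        self.feature = feature
        self.value = value


def predict(node, sample):
    """Hypothetical helper: follow matching edges from `node` down to a leaf."""
    while node.edges:  # assumed convention: internal nodes have outgoing edges
        for edge in node.edges:
            if edge.value == sample[node.feature]:
                node = edge.child
                break
        else:
            # No edge matches this feature value: stop and use this node's value.
            return node.value
    return node.value


# Hand-built tree that splits on feature column 0, which takes three values,
# so the root has three children rather than two.
leaves = {
    v: Node(data=None, edges=[], feature=None, value=label)
    for v, label in [("sunny", "no"), ("overcast", "yes"), ("rainy", "yes")]
}
root = Node(
    data=None,
    edges=[Edge(child=leaf, value=v) for v, leaf in leaves.items()],
    feature=0,
    value=None,
)

print(predict(root, ["overcast", "high"]))  # prints "yes"
```

Because each edge carries one feature value, a node can have as many children as its splitting feature has distinct values.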

The decision tree model

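import math  # math.log2 is used by information_entropy

class Decision_Tree:  # decision tree based on the C4.5 algorithm
    def __init__(self, Train, features, label, a, b, feature_name):
        # features is the set of currently available features; initially it holds all features
        self.train = Train
        self.features = features
        self.label = label  # column that holds the label
        self.b = b  # threshold on the information gain ratio
        self.a = a  # parameter of the loss function
        self.T = 0  # total number of leaf nodes
        self.CT = 0
        self.feature_name = feature_name
        self.root = self.create_tree(Train)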
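    def information_entropy(self, data, label):
        # Entropy of the data set data; label is the column the entropy is computed over
        labels = self.class_num(data, label)  # count of each distinct value in that column
        ie = 0
        for k, v in labels.items():  # accumulate the entropy
            p = v / data.shape[0]
            if p != 0:
                ie += -p * math.log2(p)
        return ie  # the entropy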
    def conditional_entropy(self, data, feature):
        # Conditional entropy of the label (column self.label) given column feature
        fv = self.class_num(data, feature)
        datas = {}  # data blocks, one per distinct value of feature
        ce = 0
        for k, v in fv.items():
            d = data[(data[:, feature] == k), :]  # rows whose feature column equals k
            datas[k] = d
        for k, v in datas.items():
            p = fv[k] / data.shape[0]
            ie = self.information_entropy(v, self.label)
            ce += p * ie
        return ce  # the conditional entropy
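    def information_gain(self, data, feature):
        # Information gain of a feature. The gain-ratio step (ig / ief) is
        # commented out below, so this returns the plain information gain.
        ie = self.information_entropy(data, self.label)
        ce = self.conditional_entropy(data, feature)
        ief = self.information_entropy(data, feature)  # entropy of data with respect to the values of feature
        # print(ie)
        # print(ce)
        ig = ie - ce
        # if ief != 0:
        #     return ig / ief
        # else:
        return ig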

    def processing_continuous_values(self, data, feature):  # handle continuous values
        data = data