【机器学习】【决策树】C4.5算法，详解+python代码实现

最新推荐文章于 2024-04-20 22:55:55 发布

CV_ML_DP

最新推荐文章于 2024-04-20 22:55:55 发布

阅读量1.2w

点赞数 6

分类专栏：人工智能机器学习跟我一起学机器学习 Machine Learning

本文链接：https://blog.csdn.net/u012421852/article/details/79808223

版权

ID3算法只有树的生成，所以该算法生成的树容易产生过拟合，C4.5在生成的过程，用信息增益比来选择特征。

ID3算法以信息增益作为划分训练数据集的特征，存在缺点：偏向于选择特征值较多的特征。

C4.5算法使用信息增益比(information gain ratio)，可以对ID3算法这一缺点进行校正

注意：先计算数据集所有特征的信息增益比的平均值，然后找到信息增益比大于平均值的特征，然后从这些特征中找最大信息增益比的特征作为子树的根节点，否则结束子树的生成。

1.使用的样本集

    ''''' 
    提供训练样本集 
     
    每个example由多个特征值+1个分类标签值组成 
    比如第一个example=['youth', 'no', 'no', '1', 'refuse'],此样本的含义可以解读为： 
    如果一个人的条件是：youth age，no working, no house, 信誉值credit为1 
    那么此类人会被分类到refuse一类中，即在相亲中被拒绝 
     
    每个example的特征值类型为： 
    ['age', 'working', 'house', 'credit'] 
     
    每个example的分类标签class_label取值范围为：'refuse'或者'agree' 
    '''  
    data_list = [['youth', 'no',  'no',   '1', 'refuse'],  
                 ['youth', 'no',  'no',   '2', 'refuse'],  
                 ['youth', 'yes', 'no',   '2', 'agree'],  
                 ['youth', 'yes', 'yes',  '1', 'agree'],  
                 ['youth', 'no',  'no',   '1', 'refuse'],  
                 ['mid',   'no',  'no',   '1', 'refuse'],  
                 ['mid',   'no',  'no',   '2', 'refuse'],  
                 ['mid',   'yes', 'yes',  '2', 'agree'],  
                 ['mid',   'no',  'yes',  '3', 'agree'],  
                 ['mid',   'no',  'yes',  '3', 'agree'],  
                 ['elder', 'no',  'yes',  '3', 'agree'],  
                 ['elder', 'no',  'yes',  '2', 'agree'],  
                 ['elder', 'yes', 'no',   '2', 'agree'],  
                 ['elder', 'yes', 'no',   '3', 'agree'],  
                 ['elder', 'no',  'no',   '1', 'refuse']]  
    feat_list = ['age', 'working', 'house', 'credit']

2. 运行得到的样本集的决策树

每个特征值的字典中有一个gain_ratio关键字和对应value，使用gain_ratio选择子树的根节点。

下面是样本集的统计信息字典，为中间结果不作为绘制决策树使用，为生成决策树树字典而生：

=============== samples statistic dict ===================
len( samples statistic dict ): 6
type( samples statistic dict ): <class 'dict'>
np.shape( samples statistic dict ): ()
len(dict_shape): 0
samples statistic dict = {
         samples_ent :  0.9709505944546686
         samples_cnt :  15
         credit :{
                 cond_ent :  0.6079610319175832
                 2 :{
                         agree : {'p_self': 0.6666666666666666, 'cnt': 4}
                         ent :  0.9182958340544896
                         p_self :  0.4
                         cnt :  6
                         refuse : {'p_self': 0.3333333333333333, 'cnt': 2}
                         }
                 info_gain :  0.36298956253708536
                 gain_ratio :  0.2318538812872422
                 3 :{
                         agree : {'p_self': 1.0, 'cnt': 4}
                         ent :  0.0
                         p_self :  0.26666666666666666
                         cnt :  4
                         refuse : {'p_self': 0.0, 'cnt': 0}
                         }
                 1 :{
                         agree : {'p_self': 0.2, 'cnt': 1}
                         ent :  0.7219280948873623
                         p_self :  0.3333333333333333
                         cnt :  5
                         refuse : {'p_self': 0.8, 'cnt': 4}
                         }
                 }
         house :{
                 no :{
                         agree : {'p_self': 0.3333333333333333, 'cnt': 3}
                         ent :  0.9182958340544896
                         p_self :  0.6
                         cnt :  9
                         refuse : {'p_self': 0.6666666666666666, 'cnt': 6}
                         }
                 cond_ent :  0.5509775004326937
                 info_gain :  0.4199730940219749
                 yes :{
                         agree : {'p_self': 1.0, 'cnt': 6}
                         ent :  0.0
                         p_self :  0.4
                         cnt :  6
                         refuse : {'p_self': 0.0, 'cnt': 0}
                         }
                 gain_ratio :  0.4325380677663126
                 }
         working :{
                 no :{
                         agree : {'p_self': 0.4, 'cnt': 4}
                         ent :  0.9709505944546686
                         p_self :  0.6666666666666666
                         cnt :  10
                         refuse : {'p_self': 0.6, 'cnt': 6}
                         }
                 cond_ent :  0.6473003963031123
                 info_gain :  0.32365019815155627
                 yes :{
                         agree : {'p_self': 1.0, 'cnt': 5}
                         ent :  0.0
                         p_self :  0.3333333333333333
                         cnt :  5
                         refuse : {'p_self': 0.0, 'cnt': 0}
                         }
                 gain_ratio :  0.3524465495205019
                 }
         age :{
                 youth :{
                         agree : {'p_self': 0.4, 'cnt': 2}
                         ent :  0.9709505944546686
                         p_self :  0.3333333333333333
                         cnt :  5
                         refuse : {'p_self': 0.6, 'cnt': 3}
                         }
                 cond_ent :  0.8879430945988998
                 elder :{
                         agree : {'p_self': 0.8, 'cnt': 4}
                         ent :  0.7219280948873623
                         p_self :  0.3333333333333333
                         cnt :  5
                         refuse : {'p_self': 0.2, 'cnt': 1}
                         }
                 info_gain :  0.08300749985576883
                 gain_ratio :  0.05237190142858302
                 mid :{
                         agree : {'p_self': 0.6, 'cnt': 3}
                         ent :  0.9709505944546686
                         p_self :  0.3333333333333333
                         cnt :  5
                         refuse : {'p_self': 0.4, 'cnt': 2}
                         }
                 }
         }
======

生成的决策树字典：

=============== samples decision-tree dict ===================
len( samples decision-tree dict ): 1
type( samples decision-tree dict ): <class 'dict'>
np.shape( samples decision-tree dict ): ()
len(dict_shape): 0
samples decision-tree dict = {
house :{
no :{
working : {None: 'refuse', 'yes': 'agree'}
}
yes : agree
}
}

======

根据决策树字典绘制的决策树：

3.代码

# -*- coding: utf-8 -*-  
""" 
@author: 蔚蓝的天空Tom 
Talk is cheap,show me the code 
Aim:C4.5算法生成决策树(字典存储), 并绘制决策树图形
"""  

import numpy as np  
import math
import matplotlib.pyplot as plt
varnamestr = lambda v,nms: [ vn for vn in nms if id(v)==id(nms[vn])][0]  
class CUtileTool(object):  
    '''提供有用的方法 比如dump_list方法，可以打印给定的list的相关信息'''  
    def dump_list(self, src_list, src_li

最低0.47元/天解锁文章

CV_ML_DP

关注

6
点赞
踩
68

收藏

觉得还不错? 一键收藏
2
评论
【机器学习】【决策树】C4.5算法，详解+python代码实现

ID3算法只有树的生成，所以该算法生成的树容易产生过拟合，C4.5在生成的过程，用信息增益比来选择特征。ID3算法以信息增益作为划分训练数据集的特征，存在缺点：偏向于选择特征值较多的特征。C4.5算法使用信息增益比(information gain ratio)，可以对ID3算法这一缺点进行校正注意：先计算数据集所有特征的信息增益比的平均值，然后找到信息增益比大于平均值的特征，然后从这些特征中找最...
复制链接

扫一扫