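[Machine Learning][Decision Trees] Implementing a Python class by hand: sample set in; feature distribution, probability density, entropy, conditional entropy, information gain, and information gain ratio out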

Article link: https://blog.csdn.net/u012421852/article/details/79790773

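This article shows how to implement a Python class that computes, for a sample set, the feature distribution, probability density, Shannon entropy, conditional entropy, and information gain. With this class it is easy to find the root node of a decision tree, and the article also discusses the role of the information gain ratio in building the tree.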


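To follow the code you first need to understand sample-space distribution, probability density, Shannon entropy, conditional entropy, and information gain. Without these concepts the code will be hard to follow; if any of them are unfamiliar, see my earlier blog posts.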
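As a quick reminder of the prerequisite concepts, the quantities used in this post can be written as follows. The notation is an assumption that follows the common textbook convention and is not taken from the author's code: $D$ is the sample set, $p_k$ is the fraction of samples in class $k$, and $D_i$ is the subset of $D$ in which feature $A$ takes its $i$-th value. The base-2 logarithm matches the values in the run log (for example, a 1/3 vs 2/3 split gives about 0.918).

```latex
\begin{aligned}
H(D)        &= -\sum_{k} p_k \log_2 p_k
            && \text{(Shannon entropy)} \\
H(D \mid A) &= \sum_{i} \frac{|D_i|}{|D|}\, H(D_i)
            && \text{(conditional entropy)} \\
g(D, A)     &= H(D) - H(D \mid A)
            && \text{(information gain)} \\
g_R(D, A)   &= \frac{g(D, A)}{H_A(D)}, \qquad
H_A(D) = -\sum_{i} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}
            && \text{(information gain ratio)}
\end{aligned}
```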

1. Overview

1.1 The class to implement

class CSamplesTool(object)

1.2 Input sample set

The input sample set is provided by the following function:

def create_samples():
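    '''
    Provide the training sample set.

    Each example consists of several feature values plus one class label.
    For instance, the first example is ['youth', 'no', 'no', '1', 'refuse'], which reads:
    a person who is young (age='youth'), has no job (working='no'), owns no house
    (house='no'), and has a credit rating of '1' is classified as 'refuse',
    i.e. rejected in matchmaking.

    The feature types of each example are:
    ['age', 'working', 'house', 'credit']

    The class label of each example is either 'refuse' or 'agree'.
    '''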
    data_list = [['youth', 'no',  'no',   '1', 'refuse'],
                 ['youth', 'no',  'no',   '2', 'refuse'],
                 ['youth', 'yes', 'no',   '2', 'agree'],
                 ['youth', 'yes', 'yes',  '1', 'agree'],
                 ['youth', 'no',  'no',   '1', 'refuse'],
                 ['mid',   'no',  'no',   '1', 'refuse'],
                 ['mid',   'no',  'no',   '2', 'refuse'],
                 ['mid',   'yes', 'yes',  '2', 'agree'],
                 ['mid',   'no',  'yes',  '3', 'agree'],
                 ['mid',   'no',  'yes',  '3', 'agree'],
                 ['elder', 'no',  'yes',  '3', 'agree'],
                 ['elder', 'no',  'yes',  '2', 'agree'],
                 ['elder', 'yes', 'no',   '2', 'agree'],
                 ['elder', 'yes', 'no',   '3', 'agree'],
                 ['elder', 'no',  'no',   '1', 'refuse']]
    feat_type_list = ['age', 'working', 'house', 'credit']
    return data_list, feat_type_list
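
Before looking at per-feature statistics, it may help to see the Shannon entropy of the class labels alone, since every information gain in the log below is measured against it. The following is a minimal sketch, not the author's CSamplesTool code; the helper name `shannon_entropy` and the variables `labels`, `label_counts`, and `h_d` are hypothetical names introduced only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Return the Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Class labels of the 15 samples above, in order (last column of data_list).
labels = ['refuse', 'refuse', 'agree', 'agree', 'refuse',
          'refuse', 'refuse', 'agree', 'agree', 'agree',
          'agree', 'agree', 'agree', 'agree', 'refuse']

label_counts = Counter(labels)   # 9 'agree' and 6 'refuse'
h_d = shannon_entropy(labels)    # roughly 0.971 bits
print(label_counts, h_d)
```

With 9 of 15 samples labelled 'agree' and 6 labelled 'refuse', the entropy of the whole set is about 0.971 bits.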

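1.3 Output: a dictionary of the sample set's feature distribution, probability density, Shannon entropy, conditional entropy, and information gain

For the sample set above, the class computes and prints the following target dictionary, which holds the Shannon entropy, feature distribution, feature probability density, conditional entropy, and information gain:

1.3.1 Screenshot of the run result

1.3.2 Run log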

feat_dict = {
         house :{
                 condition_entropy :  0.5509775004326937
                 cnt :  15
                 info_gain :  0.4199730940219749
                 yes : {'cnt': 6, 'refuse': 0, 'shannon_entropy': 0.0, 'p_agree': 1.0, 'p_refuse': 0.0, 'p_house': 0.4, 'agree': 6}
                 no : {'cnt': 9, 'refuse': 6, 'shannon_entropy': 0.9182958340544896, 'p_agree': 0.3333333333333333, 'p_refuse': 0.6666666666666666, 'p_house': 0.6, 'agree': 3}
                 }
         credit :{
                 condition_entropy :  0.6079610319175832
                 cnt :  15
                 3 : {'cnt': 4, 'refuse': 0, 'p_credit': 0.26666666666666666, 'shannon_entropy': 0.0, 'p_agree': 1.0, 'p_refuse': 0.0, 'agree': 4}
                 1 : {'cnt': 5, 'refuse': 4, 'p_credit': 0.3333333333333333, 'shannon_entropy': 0.7219280948873623, 'p_agree': 0.2, 'p_refuse': 0.8, 'agree': 1}
                 info_gain :  0.36298956253708536
                 2 : {'cnt': 6, 'refuse': 2, 'p_credit': 0.4, 'shannon_entropy': 0.9182958340544896, 'p_agree': 0.6666666666666666, 'p_refuse': 0.3333333333333333, 'agree': 4}
                 }
         working :{
                 condition_entropy :  0.6473003963031123
                 cnt :  15
                 info_gain :  0.32365019815155627
                 yes : {'cnt': 5, 'refuse': 0, 'shannon_entropy': 0.0, 'p_working': 0.3333333333333333, 'p_agree': 1.0, 'p_refuse': 0.0, 'agree': 5}
                 no : {'cnt': 10, 'refuse': 6, 'shannon_entropy': 0.9709505944546686, 'p_working': 0.6666666666666666, 'p_agree': 0.4, 'p_refuse': 0.6, 'agree': 4}
                 }
         age :{
                 condition_entropy :  0.8879430945988998
                 cnt :  15
                 info_gain :  0.08300749985576883
                 elder : {'cnt': 5, 'refuse': 1, 'p_age': 0.3333333333333333, 'shannon_entropy': 0.7219280948873623, 'p_agree': 0.8, 'p_refuse': 0.2, 'agree': 4}
                 youth : {'cnt': 5, 'refuse': 3, 'p_age': 0.3333333333333333, 'shannon_entropy': 0.9709505944546686, 'p_agree': 0.4, 'p_refuse': 0.6, 'agree': 2}
                 mid : {'cnt': 5, 'refuse': 2, 'p_age': 0.3333333333333333, 'shannon_entropy': 0.9709505944546686, 'p_agree': 0.6, 'p_refuse': 0.4, 'agree': 3}
                 }
         }
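
The conditional entropy and information gain values in this log can be cross-checked independently. The sketch below is not the author's class; it is a compact re-computation under the usual definitions, and the names `entropy`, `feature_stats`, `stats`, and `root` are hypothetical. It also computes the information gain ratio mentioned in the title, assuming the common definition of gain divided by the entropy of the feature's own value distribution.

```python
import math
from collections import Counter, defaultdict

data = [['youth', 'no',  'no',  '1', 'refuse'],
        ['youth', 'no',  'no',  '2', 'refuse'],
        ['youth', 'yes', 'no',  '2', 'agree'],
        ['youth', 'yes', 'yes', '1', 'agree'],
        ['youth', 'no',  'no',  '1', 'refuse'],
        ['mid',   'no',  'no',  '1', 'refuse'],
        ['mid',   'no',  'no',  '2', 'refuse'],
        ['mid',   'yes', 'yes', '2', 'agree'],
        ['mid',   'no',  'yes', '3', 'agree'],
        ['mid',   'no',  'yes', '3', 'agree'],
        ['elder', 'no',  'yes', '3', 'agree'],
        ['elder', 'no',  'yes', '2', 'agree'],
        ['elder', 'yes', 'no',  '2', 'agree'],
        ['elder', 'yes', 'no',  '3', 'agree'],
        ['elder', 'no',  'no',  '1', 'refuse']]
feat_names = ['age', 'working', 'house', 'credit']


def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def feature_stats(rows, col):
    """Return (conditional entropy, info gain, gain ratio) for column `col`."""
    groups = defaultdict(list)            # feature value -> class labels
    for row in rows:
        groups[row[col]].append(row[-1])
    total = len(rows)
    cond = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy([row[-1] for row in rows]) - cond
    # Assumes the feature takes at least two values, so split_info > 0.
    split_info = entropy([row[col] for row in rows])
    return cond, gain, gain / split_info


stats = {name: feature_stats(data, i) for i, name in enumerate(feat_names)}
root = max(stats, key=lambda name: stats[name][1])   # largest info gain
for name, (cond, gain, ratio) in stats.items():
    print(name, cond, gain, ratio)
print('root feature:', root)
```

Under these assumptions the feature with the largest information gain is 'house', which agrees with the log above, where 'house' has the highest info_gain (about 0.42).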

2. Without further ado, the code

# -*- coding: utf-8 -*-
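"""
@author: 蔚蓝的天空Tom
Talk is cheap, show me the code
Aim: for each feature type in the sample set, compute the value distribution, probability distribution, Shannon entropy, and conditional entropy
Aim: each feature type has several feature values; extract the sample data for each feature value
Aim: compute the class probability distribution for each feature value
"""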


import numpy as np
import math


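'''Tool Function'''
# Return the name bound to object v in the namespace dict nms (e.g. locals()).
# Matching is by object identity, so the first name that refers to v is returned.
varnamestr = lambda v,nms: [ vn for vn in nms if id(v)==id(nms[vn])][0]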


#======&#