Decision Tree Algorithms in Python: Principles and a NumPy-Based Implementation

1. Basic Principle

The principle behind a decision tree is easy to grasp: the model keeps answering binary choice questions, using one feature after another to route each sample down the tree until a class label is output at a leaf.
Here is a simple example. (The data is entirely made up.)

| No. | Height (cm) | Hair length (cm) | Voice pitch | Gender |
|-----|-------------|------------------|-------------|--------|
| 1   | 160         | 36               | High        | F      |
| 2   | 165         | 15               | Low         | F      |
| 3   | 163         | 28               | High        | F      |
| 4   | 170         | 22               | High        | F      |
| 5   | 177         | 3                | Low         | M      |
| 6   | 183         | 5                | Low         | M      |
| 7   | 170         | 2                | Low         | M      |
| 8   | 176         | 10               | Low         | M      |
| 9   | 184         | 13               | Low         | M      |

Suppose we want to train a model that tells men and women apart, building a decision tree from three variables: height, hair length, and voice pitch. These three variables generate three binary choice questions of the following form:

a. Is the height above *** cm?
b. Is the hair longer than *** cm?
c. Is the voice low-pitched?

For example, we could construct a decision tree like this:
[Figure: an example decision tree built from these three questions]

2. How a Decision Tree Is Constructed

To construct a decision tree that classifies samples correctly and as efficiently as possible, three questions have to be settled first:

  1. How to measure the impurity of information
  2. The best split point of a feature
  3. The order in which features split in the tree

2.1 Impurity of Information

Impurity is the main criterion for judging the quality of a split in a decision tree: it measures how accurately the samples are separated into the correct classes after splitting on a given feature.

There are three main ways to compute it, corresponding to three families of decision trees:

| Decision tree type | Impurity criterion |
|--------------------|--------------------|
| ID3  | Information Gain |
| C4.5 | Information Gain Ratio |
| CART | Gini Index |

Before introducing these three impurity criteria, let us first look at how information entropy is computed.

2.1.1 Computing Information Entropy

Information entropy is defined as:

$$H(D)=-\sum_{i=1}^{n}p_i\log_2 p_i$$

where:
$n$ is the number of classes in the sample set $D$; in this example the classes are male and female, so $n=2$.
$p_i$ is the proportion of class $i$ in the whole sample; e.g. with 9 samples of which 4 are girls, $p_{girl}=\frac{4}{9}$.

(Recall the sample table from Section 1.)

So, before any split, the entropy of the full sample set is:

$$H(D)=-\frac{4}{9}\log_2\frac{4}{9}-\frac{5}{9}\log_2\frac{5}{9}=0.9911$$
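As a quick sanity check, this value can be reproduced with a few lines of NumPy (a standalone helper written for this article, not part of the implementation in Section 3):

import numpy as np

def entropy(labels):
    # class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # H(D) = -sum(p_i * log2(p_i))
    return -(p * np.log2(p)).sum()

# 0 = girl (4 samples), 1 = boy (5 samples), as in the table above
gender = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
print(round(entropy(gender), 4))  # 0.9911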

2.1.2 Information Gain (ID3)

Information gain is defined as:

$$Information\;Gain = g(D, A) = H(D) - H(D|A)$$

where:
$H(D)$ is the entropy of the samples before the split;
$H(D|A)$ is the conditional entropy after splitting on feature $A$: the entropy of each child node, weighted by the fraction of samples that falls into it.

In our example, if we split on voice pitch:

High-pitch samples: 3

  • 3 girls, 0 boys

Low-pitch samples: 6

  • 1 girl, 5 boys

The information gain of this split is then:

$$g(D,\text{voice pitch}) = H(D) - H(D|\text{voice pitch})\\
= H(D) - \left[\tfrac{3}{9}H(D_\text{high}) + \tfrac{6}{9}H(D_\text{low})\right]\\
= 0.9911 - \left[\tfrac{3}{9}\cdot 0 + \tfrac{6}{9}\left(-\tfrac{1}{6}\log_2\tfrac{1}{6}-\tfrac{5}{6}\log_2\tfrac{5}{6}\right)\right]\\
= 0.9911 - \tfrac{6}{9}\cdot 0.6500 = 0.9911 - 0.4333 = 0.5577$$
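Using the entropy helper above, the same number can be checked in code (a sketch; the arrays just encode the class counts of the two child nodes):

# voice-pitch split: 3 high-pitch samples (all girls),
# 6 low-pitch samples (1 girl, 5 boys); 0 = girl, 1 = boy
high = np.array([0, 0, 0])
low = np.array([0, 1, 1, 1, 1, 1])
parent = np.concatenate([high, low])

# conditional entropy: child entropies weighted by child size
h_cond = (len(high) * entropy(high) + len(low) * entropy(low)) / len(parent)
gain = entropy(parent) - h_cond
print(round(gain, 4))  # 0.5577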

2.1.3 Information Gain Ratio (C4.5)

$$IGR = \frac{g(D,A)}{H(A)}$$

where:
$H(A)$ is the split information of feature $A$: the entropy of the partition itself, computed from the sizes of the child nodes.

The larger the information gain ratio, the better the split separates the samples, i.e. the lower the impurity.

(Recall the sample table from Section 1.)

Sticking with the same example, this time we use "height > 175cm" as the split condition:

Height ≤ 175cm: 5 samples

  • 4 girls, 1 boy

Height > 175cm: 4 samples

  • 0 girls, 4 boys

First compute the information gain:

$$g(D,\text{height}) = H(D) - \left[\tfrac{5}{9}H(D_{\le 175cm}) + \tfrac{4}{9}H(D_{>175cm})\right]\\
= 0.9911 - \left[\tfrac{5}{9}\left(-\tfrac{4}{5}\log_2\tfrac{4}{5}-\tfrac{1}{5}\log_2\tfrac{1}{5}\right) + \tfrac{4}{9}\cdot 0\right]\\
= 0.9911 - \tfrac{5}{9}\cdot 0.7219 = 0.9911 - 0.4011 = 0.5900$$

As an aside, we have now computed the information gain of two candidate features, "voice pitch" and "height > 175cm": 0.5577 and 0.5900 respectively. The larger the information gain, the more information the feature contributes, so in this example the height split actually separates the samples slightly better than the voice-pitch split.

Next, compute $H(A)$.
Since 5 samples satisfy "height ≤ 175cm" and 4 satisfy "height > 175cm":

$$H(A) = -\tfrac{5}{9}\log_2\tfrac{5}{9} - \tfrac{4}{9}\log_2\tfrac{4}{9} = 0.9911$$

Therefore:

$$IGR(D, \text{height} > 175cm) = \frac{g(D,\text{height})}{H(A)} = \frac{0.5900}{0.9911} = 0.5953$$
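The gain ratio can likewise be verified with the helper from above (illustrative only):

# split info H(A): the entropy of the partition itself (5 left, 4 right)
split_sizes = np.array([5, 4])
split_info = entropy(np.repeat([0, 1], split_sizes))   # 0.9911

left = np.array([0, 0, 0, 0, 1])   # height <= 175cm: 4 girls, 1 boy
right = np.array([1, 1, 1, 1])     # height > 175cm: 4 boys
parent = np.concatenate([left, right])
gain = entropy(parent) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print(round(gain / split_info, 4))  # 0.5953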

2.1.4 Gini Index (CART)

$$Gini(p)=\sum_{i=1}^{n}p_i(1-p_i)=1-\sum_{i=1}^{n}p_i^2$$

where:

$n$ is the number of classes in the sample set $D$; in this example the classes are male and female, so $n=2$.
$p_i$ is the proportion of class $i$ in the whole sample; e.g. with 9 samples of which 4 are girls, $p_{girl}=\frac{4}{9}$.

The smaller the Gini index, the more information the split contributes, i.e. the lower the impurity.

(Recall the sample table from Section 1.)

This time we use hair length as the example, with the condition "hair longer than 20cm":

"Hair longer than 20cm": 3 samples

  • 3 girls, 0 boys

"Hair ≤ 20cm": 6 samples

  • 1 girl, 5 boys

The Gini index of this split is the size-weighted sum of the child Gini values:

$$Gini(D,\text{hair} > 20cm) = \tfrac{3}{9}\,Gini(D_{>20cm}) + \tfrac{6}{9}\,Gini(D_{\le 20cm})$$

where

$$Gini(D_{>20cm}) = 1 - \left(\tfrac{3}{3}\right)^2 = 0$$
$$Gini(D_{\le 20cm}) = 1 - \left[\left(\tfrac{1}{6}\right)^2 + \left(\tfrac{5}{6}\right)^2\right] = \tfrac{5}{18}$$

Therefore:

$$Gini(D,\text{hair} > 20cm) = \tfrac{3}{9}\cdot 0 + \tfrac{6}{9}\cdot\tfrac{5}{18} = \tfrac{5}{27} \approx 0.1852$$

(The implementation in Section 3 maximizes the complement $1 - 0.1852 = 0.8148$, which is equivalent to minimizing the Gini index; this way all three criteria share one "larger is better" search.)
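Again, a quick numerical check (sketch):

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

long_hair = np.array([0, 0, 0])            # 3 girls
short_hair = np.array([0, 1, 1, 1, 1, 1])  # 1 girl, 5 boys
weighted = (3 * gini(long_hair) + 6 * gini(short_hair)) / 9
print(round(weighted, 4))      # 0.1852
print(round(1 - weighted, 4))  # 0.8148, the score maximized in Section 3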

With impurity settled, the remaining two questions, which are the two core problems of constructing a decision tree, become much easier:

  1. The best split point of a feature
  2. The order in which features split in the tree

2.2 The Best Split Point of a Feature

In our example, for the question "is the height above *** cm", which height threshold is the best split point, and how do we compute it?

The answer is the impurity measure from the previous section.

Features fall into two types:

  • Discrete variables
  • Continuous variables

2.2.1 Best Split Points for Discrete Variables

A discrete variable can be branched on in two ways: as a multiway split or as a binary split.

Generally, CART trees are binary, while C4.5 and ID3 may use multiway splits.

Binary splits on a discrete variable:
Break the multi-valued variable into several "yes/no" questions. For example:

Suppose "voice pitch" takes 3 values: high, medium, low.
Then in the tree this feature is turned into (number of categories - 1) binary attributes:

  • Is the voice high-pitched?
  • Is the voice medium-pitched?

[Figure: binary splits on a discrete variable]

Multiway split on a discrete variable:
[Figure: a multiway split on a discrete variable]

In practice, however, most libraries do not train on discrete variables directly, largely for performance reasons. In Python's sklearn, for example, all tree models are binary trees and the input must be numeric, so discrete variables have to be transformed beforehand with one-hot or binary encoding.
The downside is that the encoded input easily becomes very sparse; categories with few samples then produce tiny information gains, and many samples end up poorly classified.

LightGBM, by contrast, does support categorical features natively.
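For reference, a minimal one-hot transform in plain NumPy might look like this (illustrative only):

# each sample's category, e.g. the voice-pitch column of the toy table
pitch = np.array(['high', 'low', 'high', 'high', 'low',
                  'low', 'low', 'low', 'low'])
categories = np.unique(pitch)                         # ['high' 'low']
one_hot = (pitch[:, None] == categories).astype(int)  # one column per category
print(one_hot[:3])  # a single 1 per row, in the column of its category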

2.2.2 Best Split Points for Continuous Variables

  1. Deduplicate the values of the continuous variable and sort them in ascending order, written $A=\{a_1,a_2,\dots,a_n\}$
  2. Compute the averages of adjacent pairs $\{\frac{a_1+a_2}{2},\frac{a_2+a_3}{2},\dots,\frac{a_{n-1}+a_n}{2}\}$, written $B=\{b_1,b_2,\dots,b_{n-1}\}$
  3. Iterate over $B$, take each point as a candidate split of the variable, and compute the impurity score of that split, giving a score set of length $n-1$, written $C=\{c_1,c_2,\dots,c_{n-1}\}$
  4. The candidate whose score in $C$ is largest (e.g. the largest information gain) is the best split point.
(Recall the sample table from Section 1.)

Take the height variable, as sketched below:

  1. Sort ascending (after deduplication): 160, 163, 165, 170, 176, 177, 183, 184
  2. Average each adjacent pair: 161.5, 164, 167.5, 173, 176.5, 180, 183.5
  3. Take 161.5 as a split point and compute its score, here using information gain as an example (see Section 2.1.2 for the computation)
  4. Compute the score of every candidate point; the point with the largest score is the best split point for this variable.
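These four steps translate almost directly into NumPy (reusing the entropy helper from Section 2.1.1; the implementation in Section 3 follows the same idea):

height = np.array([160, 165, 163, 170, 177, 183, 170, 176, 184])
gender = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])  # 0 = girl, 1 = boy

values = np.unique(height)                   # step 1: dedup + sort
candidates = (values[:-1] + values[1:]) / 2  # step 2: adjacent midpoints

def info_gain(split):
    left, right = gender[height <= split], gender[height > split]
    h_cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(gender)
    return entropy(gender) - h_cond

gains = np.array([info_gain(c) for c in candidates])      # step 3
print(candidates[gains.argmax()], round(gains.max(), 4))  # step 4: 173.0, 0.59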

2.3 The Order in Which Features Split

The split order is simply the order in which the tree branches on the different variables. Once each variable's best split point is known, we can compute the score obtained by splitting there (information gain, gain ratio, Gini-based score, etc.). The variable with the largest score splits first and the one with the smallest splits last, so the earlier a sample is routed, the more discriminative the split that routes it. A short sketch of this ranking follows.
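Ranking the three features of the toy table by the gain at their own best split might look like this (voice pitch is encoded as 1 = high, 0 = low; the data follows the reconstructed table above and reuses the earlier helpers):

data = np.array([
    [160, 36, 1], [165, 15, 0], [163, 28, 1], [170, 22, 1], [177, 3, 0],
    [183, 5, 0], [170, 2, 0], [176, 10, 0], [184, 13, 0],
])  # columns: height, hair length, voice pitch
gender = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

def best_gain(col):
    values = np.unique(col)
    cands = (values[:-1] + values[1:]) / 2
    def gain(c):
        left, right = gender[col <= c], gender[col > c]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(gender)
        return entropy(gender) - h
    return max(gain(c) for c in cands)

order = sorted(range(data.shape[1]), key=lambda j: best_gain(data[:, j]), reverse=True)
print(order)  # the first index is the most discriminative feature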

2.4 ID3 vs. C4.5 vs. CART

| Type | Characteristics | Weaknesses |
|------|-----------------|------------|
| ID3 | Multiway tree; each feature is used at most once | 1. Features with many distinct values get larger gains, which encourages overfitting; 2. classification only; 3. discrete variables only; 4. sensitive to missing values |
| C4.5 | Multiway tree; each feature is used at most once; the gain ratio penalizes many-valued features, reducing overfitting; handles continuous variables; handles missing values | Continuous values are only handled by discretizing them into threshold splits |
| CART | Binary tree; features can be reused; also supports regression | - |

3. NumPy Implementation

Next we apply everything above and implement the decision tree using nothing but the numpy package.

The code was written a while ago; getting it to output a dict with the full branching details of each tree took considerable effort, so the code is somewhat bloated and the approach a bit convoluted, but the overall algorithm follows what was described above.

The ID3, C4.5 and CART variants differ only in the split criterion here; the other characteristics of each tree type are not fully reproduced.

3.1 Code

import numpy as np

class Tree():
    def __init__(self, node, impurity, depth, left, right, is_leaf, label, index, split):
        """
        Initialize the tree dict.
        This class builds tree dictionary, based on the growing path.

        Parameters:
        -----------
        node: dict
            A dictionary include the amount of the observations,and the amount for each label.
        impurity: float
            The impurity of this node.
        depth: int
            The depth of this node.
        left: dict
            The left child node.
        right: dict
            The right child node.
        is_leaf: boolean
            To determinate the leaf node
        label: int
            The label of this node
        index: int
            The index of variable for this node.
        split: int
            The best split point for this variable.
        """
        self.tree = {"num": node, "impurity": impurity, "depth": depth, "left": left, "right": right,
                     "is_leaf": is_leaf, "label": label, "index": index, "split": split}

    def generate_tree_path(self, path):
        """
        Generate the tree path.
        Based on tree path, which is the main function running order, 
        '0' represents the tree goes left child node,
        '1' represents the tree goes right child node.
        This function transforms the running order into a tree dictionary 
        indices.

        For example:
            >> # which means the tree growing order is: 
               # left > left > left > right > right > right > left > right
               path = '00011101' 
            >> generate_tree_path(path)
            >> ["right"]["right"]
        Parameters:
        -----------
        path: str
            A string of the tree path.

        Returns:
        -------
        str
            A string that represents the path of the 
            tree dict.
        """
        dict_index = ""
        for i in path:
            if i == "0":
                dict_index = dict_index + "0"
            else:
                last_l = dict_index.rfind("0")
                dict_index = dict_index[:last_l] + "1"
        return dict_index.replace("1", '["right"]').replace("0", '["left"]')

    def add_node(self, path, node, impurity, depth, left, right, is_leaf, label, index, split):
        """
        Add a node for the tree.
        Update the tree dict.

        Parameters:
        -----------
        path: str
            A string that represents the path of the tree dict.
        node: dict
            A dictionary with the total amount of the observations,and the amount for each label.
        impurity: float
            The impurity of this node.
        depth: int
            The depth of this node.
        left: dict
            The left child node.
        right: dict
            The right child node.
        is_leaf: boolean
            To determinate the leaf node.
        label: int
            The label of this node.
        index: int
            The index of variable for this node.
        split: int
            The best split point for this variable.
        """
        tree_path_index = self.generate_tree_path(path)
        set_dict = {"num": node, "impurity": impurity, "depth": depth, "left": left, "right": right, "is_leaf": is_leaf,
                    "label": label, "index": index, "split": split}
        exec("self.tree" + tree_path_index.__str__() + " = set_dict")


class btree():
    def __init__(self, method='ID3', sample_weight=None, depth=10, min_impurity=0, min_samples_split=2):
        """

        Parameters:
        -----------
        method: str (default='ID3')
            The node split method. 
            'ID3' for information gain
            'C45' for information gain ratio
            'CART' for gini.
        sample_weight: list (default=None)
            Sample weight.
        depth: int (default=10),
            Maximum depth of the tree.
        min_impurity: float (default=0)
            A node will split if its impurity is above this threshold,
            otherwise it is a leaf.
        min_samples_split: int (default=2)
            A node will split if its number of samples is greater or equal to
            this threshold.
        """
        self.method = method
        self.sample_weight = sample_weight if sample_weight is not None else []
        self.node_list = []
        self.feature_importance = []
        self.depth = depth
        self.min_impurity = min_impurity
        self.min_samples_split = min_samples_split
        self.t = None
        self.path = ''

    def group_count(self, array):
        """
        Calculate the amount for each label.

        Parameters:
        -----------
        array: array
            Array of the label

        Returns:
        -------
        count_array: array
            An array of amounts for each label.
        """
        groups = np.unique(array)
        count_array = np.array(list(map(lambda x: array[array == x].__len__(), groups)))
        return count_array.astype(int)

    def calc_entropy(self, label):
        """
        Calculate entropy.

        Parameters:
        -----------
        label: array
            Array of the label.

        Returns:
        -------
        entropy: float
            The entropy of the label.
        """
        label_count = self.group_count(label)
        return sum(-label_count / label_count.sum() * np.log2(label_count / label_count.sum()))

    def calc_gini(self, label):
        """
        Calculate gini.

        Parameters:
        -----------
        label: array
            Array of the label.

        Returns:
        -------
        gini: float
            The gini of the label.
        """
        label_count = self.group_count(label)
        return 1 - sum((label_count / label_count.sum()) ** 2)

    def calc_impurity(self, combine, left_combine, right_combine):
        """
        Calculate impurity depends on chosen method.

        Parameters:
        -----------
        combine: array
            The array of dataset.
        left_combine: array
            The array of dataset for left child node.
        right_combine: array
            The array of dataset for right child node.

        Returns:
        -------
        impurity: float 
            The impurity of the node.
        """
        total_entropy = self.calc_entropy(combine[:, -1])
        if self.method != 'CART':
            entropy_left = self.calc_entropy(left_combine[:, -1])
            entropy_right = self.calc_entropy(right_combine[:, -1])
            entropy_node = left_combine.shape[0] / combine.shape[0] * entropy_left + right_combine.shape[0] / \
                           combine.shape[0] * entropy_right
            if self.method == 'ID3':
                entropy_increment = total_entropy - entropy_node
                impurity = entropy_increment

            elif self.method == 'C45':
                entropy_increment = total_entropy - entropy_node
                split_node = np.hstack((left_combine[:, -1], right_combine[:, -1]))
                entropy_split_node = self.calc_entropy(split_node)
                entropy_ratio = entropy_increment / entropy_split_node
                impurity = entropy_ratio
        else:
            gini_left = self.calc_gini(left_combine[:, -1])
            gini_right = self.calc_gini(right_combine[:, -1])
            gini = left_combine.shape[0] / combine.shape[0] * gini_left + right_combine.shape[0] / combine.shape[
                0] * gini_right
            impurity = 1 - gini
        return impurity

    # brute-force scan over every candidate split point at each node
    def continuous_variable_node(self, combine):
        """
        Calculate the best split point for every variable.

        Parameters:
        -----------
        combine: array
            The array of dataset.

        Returns:
        -------
        best_node_list: list
            The list of best split point for each variable.
        """
        sorted_data = np.sort(combine[:, :-1], axis=0)
        sorted_list = sorted_data.T.tolist()
        best_node_list = list()

        for index_ in range(len(sorted_list)):
            sorted_ = sorted_list[index_]
            sorted_set = sorted(list(set(sorted_)))
            max_impurity = -np.inf

            node_l = [(sorted_set[i] + sorted_set[i + 1]) / 2 for i in range(len(sorted_set)) if
                      i <= len(sorted_set) - 2]
            for node_ in node_l:
                left_combine = combine[np.where(combine[:, index_] <= node_)[0], :]
                right_combine = combine[np.where(combine[:, index_] > node_)[0], :]
                impurity = self.calc_impurity(combine, left_combine, right_combine)
                if impurity >= max_impurity:
                    max_impurity = impurity
                    best_node = node_

            # print(self.method + ' max value: ' + str(max_impurity))
            # print(combine_df[combine_df[index_] <= best_node].groupby(combine_df['label'])['label'].count())
            # print(combine_df[combine_df[index_] > best_node].groupby(combine_df['label'])['label'].count())
            best_node_list.append(best_node)
        return best_node_list

    ## A faster search for the best split point: one median pivot and one quartile pivot
    #    def continuous_variable_node(self,data,label):
    ##        data_df = pd.DataFrame(data)
    ##        label_df = pd.DataFrame(label,columns=['label'])
    ##        combine_df = pd.concat([data_df,label_df],axis=1)
    #        combine = np.column_stack((data,label))
    #        sorted_data = np.sort(data,axis=0)
    #        sorted_list = sorted_data.T.tolist()
    #        best_node_list = list()
    #        def run_m(node):
    #            left_combine = combine[np.where(combine[:,index_] <= node)[0],:]
    #            right_combine = combine[np.where(combine[:,index_] > node)[0],:]
    #            impurity = self.calc_impurity(combine,left_combine,right_combine)
    #            return impurity
    #        
    #        def find_two_pivot_index(list_):
    #            n = len(list_)
    #            if n <= 4 and n >= 2:
    #                return n-2,n-1
    #            if n == 1:
    #                return 0,1
    #            if n % 4 == 0:
    #                index_node_l = int(n/2 - 1)
    #                index_node_r = int(n/4 + n/2 - 1)
    #            else:
    #                index_node_l = int(np.floor(n/2))
    #                index_node_r = int(np.floor(n/4) + np.floor(n/2))
    #            return index_node_l,index_node_r
    #        
    #        def three_way(node_l):
    #            index_node_l,index_node_r = find_two_pivot_index(node_l)
    #            
    #            left_entropy = run_m(node_l[index_node_l])
    #            #print(left_entropy,end=' ')
    #            right_entropy = run_m(node_l[index_node_r])
    #            #print(right_entropy)
    #            if len(node_l)<=2:
    #                if left_entropy <= right_entropy:
    #                    return node_l[-1]
    #                else:
    #                    return node_l[0]
    #            else:
    #                if left_entropy <= right_entropy:
    #                    node_l = node_l[index_node_l:]
    #                else:
    #                    node_l = node_l[0:index_node_r]
    #                return three_way(node_l)
    #            
    #        for index_ in range(len(sorted_list)):
    #            sorted_ = sorted_list[index_]
    #            sorted_set = sorted(list(set(sorted_)))
    #            node_l = [(sorted_set[i] + sorted_set[i+1])/2 for i in range(len(sorted_set)) if i <= len(sorted_set)-2]
    #            best_node = three_way(node_l)
    #            best_node_list.append(best_node)
    #        return best_node_list

    def get_feature_importance_index(self, combine):
        """
        Get variable index, return new index with highest impurity 
        which was not in self.feature_importance for each time run this function.

        Parameters:
        -----------
        combine: array
            The array of dataset.

        Returns:
        -------
        int: 
            Return new index with highest impurity 
            which was not in self.feature_importance for each time run this function.
        """
        impurity_list = list()
        for index_, node_ in enumerate(self.node_list):
            left_combine = combine[np.where(combine[:, index_] <= node_)[0], :]
            right_combine = combine[np.where(combine[:, index_] > node_)[0], :]
            impurity = self.calc_impurity(combine, left_combine, right_combine)
            impurity_list.append(impurity)
        sorted_index = sorted(range(len(impurity_list)), key=lambda k: impurity_list[k], reverse=True)
        sorted_index_filtered = [i for i in sorted_index if i not in self.feature_importance]
        if len(sorted_index_filtered) != 0: return sorted_index_filtered[0]

    def cbind(self, data, label):
        """
        Column bind data and label arrays.
        """
        combine = np.column_stack((data, label))
        return combine

    # note: all nodes at the same depth split on the same feature
    def tree_growth(self, combine, depth):
        """
        Split data by the chosen variable by depth.

        Parameters:
        -----------
        combine: array
            The array of dataset.
        depth: int
            The depth of the tree.

        Returns:
        -------
        list:
            A list with left child node data and right child node data.
        """
        # selected_index = self.sorted_feature_importance[depth]
        self.feature_importance.append(self.get_feature_importance_index(combine))
        selected_index = self.feature_importance[depth]
        left_combine = combine[np.where(combine[:, selected_index] <= self.node_list[selected_index])[0], :]
        right_combine = combine[np.where(combine[:, selected_index] > self.node_list[selected_index])[0], :]
        return [left_combine, right_combine]

    def build_tree(self, combine, depth=0):
        """
        A recursive function to generate tree.

        Parameters:
        -----------
        combine: array
            The array of dataset.
        depth: int (default=0)
            The depth of the tree.

        """
        child_df = self.tree_growth(combine, depth)
        left, right = child_df[0], child_df[1]

        impurity = self.calc_impurity(combine, left, right)

        # start growth if satisfies the conditions
        if impurity > self.min_impurity and depth < self.depth and left.shape[0] >= self.min_samples_split and \
                right.shape[0] >= self.min_samples_split:
            if depth == 0:
                # initialize Tree class
                self.t = Tree({"total": combine.shape[0],
                               "group count": dict(zip(np.unique(combine[:, -1]), self.group_count(combine[:, -1])))},
                              impurity,
                              0,
                              dict(zip(np.unique(left[:, -1]), self.group_count(left[:, -1]))),
                              dict(zip(np.unique(right[:, -1]), self.group_count(right[:, -1]))),
                              False,
                              np.unique(combine[:, -1])[0] if len(
                                  self.group_count(combine[:, -1])) == 1 else self.group_count(combine[:, -1]).argmax(),
                              self.feature_importance[depth],
                              self.node_list[self.feature_importance[depth]])
            else:
                # add node for tree dict.
                self.t.add_node(self.path,
                                {"total": combine.shape[0],
                                 "group count": dict(zip(np.unique(combine[:, -1]), self.group_count(combine[:, -1])))},
                                impurity,
                                depth,
                                dict(zip(np.unique(left[:, -1]), self.group_count(left[:, -1]))),
                                dict(zip(np.unique(right[:, -1]), self.group_count(right[:, -1]))),
                                False,
                                np.unique(combine[:, -1])[0] if len(
                                    self.group_count(combine[:, -1])) == 1 else self.group_count(
                                    combine[:, -1]).argmax(),
                                self.feature_importance[depth],
                                self.node_list[self.feature_importance[depth]])
            # growing tree by left child node and right child node consecutively in a for loop.
            for i in range(len(child_df)):
                self.path = self.path + str(i)
                df = child_df[i]
                # here starts the recurse by input dataset and depth
                self.build_tree(df, depth + 1)
        # else add leaf node, where the parameter "is_leaf" would be True.
        else:
            self.t.add_node(self.path,
                            {"total": combine.shape[0],
                             "group count": dict(zip(np.unique(combine[:, -1]), self.group_count(combine[:, -1])))},
                            impurity,
                            depth,
                            dict(zip(np.unique(left[:, -1]), self.group_count(left[:, -1]))),
                            dict(zip(np.unique(right[:, -1]), self.group_count(right[:, -1]))),
                            True,
                            np.unique(combine[:, -1])[0] if len(
                                self.group_count(combine[:, -1])) == 1 else self.group_count(combine[:, -1]).argmax(),
                            self.feature_importance[depth],
                            self.node_list[self.feature_importance[depth]])

    def fit(self, data, label):
        """
        Fit the model by growing the tree.

        Parameters:
        -----------
        data: array
            The array of data.
        label: int
            The array of label.
        """
        combine = self.cbind(data, label)
        self.node_list = self.continuous_variable_node(combine)
        self.build_tree(combine)

    def predict_main(self, data):
        """
        The main predict function.

        Parameters:
        -----------
        data: array
            The array of dataset.

        Returns:
        -------
        label: array
            The prediction array for the dataset.
        """
        dic = self.t.tree
        while dic['is_leaf'] == False:
            if data[dic['index']] <= dic['split']:
                dic = dic['left']
            else:
                dic = dic['right']
        label = dic['label']
        return label

    def predict(self, data):
        """
        The predict function.

        Parameters:
        -----------
        data: array
            The array of dataset.

        Returns:
        -------
        label: array
            The prediction array for the dataset.
        """
        return np.apply_along_axis(self.predict_main, 1, data)

    def score(self, predict, test):
        """
        Calculate the accuracy for the model on the test dataset.

        Parameters:
        -----------
        predict: array
            The array of prediction array.
        test: array
            The array of test label data.

        Returns:
        -------
        int:
            The accuracy.
        """
        count = 0
        for i, j in zip(predict.tolist(), test.tolist()):
            if i == j:
                count += 1
        return count / len(predict)

    @property
    def tree(self):
        """
        A dictionary contains all information for every depth , including the best split
        point, variable index, observations amount and left/right child node.

        Returns:
        -------
        dict: 
            The tree dict.
        """
        return self.t.tree

3.2 Testing

import numpy as np
import pandas as pd
import sklearn.datasets as ds

def get_train_test_data(data, label, percentile=0.8):
    data_df = pd.DataFrame(data)
    label_df = pd.DataFrame(label, columns=['label'])
    combine_df = pd.concat([data_df, label_df], axis=1)
    label_count = label_df.groupby(label).count()
    train_df = pd.DataFrame()
    for label_name in label_count.index.tolist():
        tmp = combine_df[combine_df['label'] == label_name]
        index_list = tmp.index.tolist()
        random_select_index = np.random.choice(index_list, round(len(index_list) * percentile), replace=False)
        tmp_df = tmp.loc[random_select_index]
        train_df = pd.concat([train_df, tmp_df], axis=0)
    test_df = combine_df.drop(train_df.index)
    train_data, train_label, test_data, test_label = train_df[train_df.columns[:-1]], train_df['label'], test_df[
        test_df.columns[:-1]], test_df['label']
    return np.array(train_data), np.array(train_label), np.array(test_data), np.array(test_label)

# test data
d = ds.load_breast_cancer()
data = d['data']
label = d['target']

# split into train and test sets
train_data, train_label, test_data, test_label = get_train_test_data(data, label)

# train a depth-4 ID3 decision tree
bt1 = btree(method='ID3', depth=4)
bt1.fit(train_data, train_label)

# get the branching details of the tree
t_dict = bt1.tree

Inspecting the split details of the tree: each node in the dict records the sample counts ("num"), the impurity score, the depth, the split variable index, the split point, and its left/right children; leaf nodes are marked with "is_leaf": True and carry the predicted label.
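For a readable view of the nested dict, the standard-library pprint module works well (illustrative):

from pprint import pprint
pprint(t_dict, depth=2)  # cap the nesting depth to keep the output short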

View the best split point of each variable:

node_list1 = bt1.node_list

View the feature importance ranking.

It returns variable indices; the earlier an index appears, the more important the variable.

feature_importance = bt1.feature_importance

Prediction

def compare_result(predict, test):
    count = 0
    for i, j in zip(predict.tolist(), test.tolist()):
        if i == j:
            count += 1
    return count / len(predict)

y_predict = bt1.predict(test_data)
compare_result(y_predict, test_label)
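Note that compare_result duplicates the accuracy logic already provided by the class, so bt1.score(y_predict, test_label) returns the same number.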

