【Python机器学习】决策树ID3算法结果可视化附源代码 对UCI数据集Caesarian Section进行分类


决策树(Decision Tree)是在已知各种情况发生概率的基础上,通过构成决策树来求取净现值的期望值大于等于零的概率,评价项目风险,判断其可行性的决策分析方法,是直观运用概率分析的一种图解法。由于这种决策分支画成图形很像一棵树的枝干,故称决策树。 来源:决策树_百度百科

数据集使用UCI数据集 Caesarian Section Classification Dataset Data Set

与数据集相关的详细信息和下载地址

  • 本代码实现了决策树ID3算法,并使用决策树ID3算法进行预测。
  • 决策树算法写到中,实现代码复用,并在使用过程中降低复杂度。
  • 将logging日志等级调整为DEBUG,可以输出决策树每一步的详细过程
  • 通过使用mermaid的文本绘图格式对决策树进行了可视化

实现所用到的库

  • Python 3
  • Pandas
  • sklearn(仅用于切分数据集)
  • numpy

实现

经验熵计算

熵中的概率由数据估计(特别是最大似然估计)得到时,所对应的熵称为经验熵

经验熵计算公式

H = − ∑ i = 1 n p ( x i ) l o g 2 ( p ( x i ) ) H = -\sum^n_{i=1}p(x_i)log_2(p(x_i)) H=i=1np(xi)log2(p(xi))

    def empirical_entropy(self, dataset=None):
        """
        求经验熵
        $$H = -\sum^n_{i=1}p(x_i)log_2(p(x_i))$$
        :return: Float 经验熵
        """
        if dataset is None:
            dataset = self.DataSet
        columns_count = dataset.iloc[:, -1].value_counts()
        entropy = 0
        total_count = columns_count.sum()
        for count in columns_count:
            p = count / total_count
            entropy -= p * np.log2(p)
        return entropy

条件熵

条件熵 H(Y∣X)H(Y|X)H(Y∣X)表示在已知随机变量X的条件下随机变量Y的不确定性。

定义X给定条件下Y的条件概率分布的对X的数学期望

H ( Y ∣ X ) = ∑ i = 1 n p ( i ) H ( Y ∣ X = x i ) H(Y|X) = \sum_{i=1}^np(i)H(Y|X=x_i) H(YX)=i=1np(i)H(YX=xi)

信息增益

信息增益表示得知特征X的信息而使得类Y的信息不确定性减少的程度。
即:选择该特征对分类的帮助程度。

在分类问题困难时,也就是说在训练数据集经验熵大的时候,信息增益值会偏大反之信息增益值会偏小

使用信息增益比可以对这个问题进行校正,这是特征选择的另一个标准。

特征A对训练数据集D的信息增益g(D,A),定义为集合D的经验熵H(D)与特征A给定条件下D的经验条件熵H(D|A)之差:

g ( D , A ) = H ( D ) − H ( D ∣ A ) g(D,A) = H(D)-H(D|A) g(D,A)=H(D)H(DA)

ID3

简单来说,就是不断选取能够对分类提供最大效果的属性,然后根据属性的各个值选取接下来的最佳属性

选择信息增益最大的属性

因为条件经验熵越小(表示该分类的结果比较统一,即信息增益越大)表示该属性对于分类重要性越大

其中extract_dataset 相当于在符合指定条件下数据集,用于接下来计算条件经验熵,并获得信息增益。

    def extract_dataset(self, dataset: pd.DataFrame, column, label):
        """
        根据column和label筛选出指定的数据集
        :return: pd.DataFrame 筛选后的数据集
        """
        if type(column) == int:
            split_dataset = dataset[dataset.iloc[:, column] == label].drop(dataset.columns[column], axis=1)
        else:
            split_dataset = dataset[dataset.loc[:, column] == label].drop(column, axis=1)
        return split_dataset

    def best_empirical_entropy(self, dataset: pd.DataFrame = None):
        """
        选取数据集中的columns中,最好的column(经验熵最大)
        :param dataset: 带选取的数据集
        :return: 返回column
        """
        if dataset is None:
            dataset = self.DataSet
        columns = dataset.columns[:-1]
        total_count = dataset.shape[0]
        empirical_entropy = self.empirical_entropy(dataset)
        logging.debug(f"now dataset shape is {dataset.shape}, column is {dataset.columns.tolist()}")
        logging.debug(f"empirical_entropy is {empirical_entropy}")
        informationGain_max = -1
        best_column = None
        for column in columns:
            entropy_tmp = 0
            data_counts = dataset.loc[:, column].value_counts()
            data_labels = data_counts.index
            logging.debug(f"now is {column}")
            for label in data_labels:
                split_dataset = self.extract_dataset(dataset, column, label)
                count = split_dataset.shape[0]
                p = count / total_count
                entropy_tmp += p * self.empirical_entropy(split_dataset)
                logging.debug(f"now label is {label}, chooseData shape is {split_dataset.shape}, "
                              f"Ans count: {split_dataset.iloc[:, -1].value_counts().tolist()}, "
                              f"entropy: {self.empirical_entropy(split_dataset)}")
            informationGain = empirical_entropy - entropy_tmp
            logging.debug(f"entropy: {entropy_tmp}, {column} informationGain:{informationGain}")
            if informationGain > informationGain_max:
                best_column = column
                informationGain_max = informationGain
        logging.debug(f"Choose {best_column}:{informationGain_max}")
        return best_column

过程

  1. 选取信息增益最大的属性。
  2. 如果各个属性的最大的信息增益不够大,即对分类帮助有限,此时直接设定为结果分类中,数量最多的一个值
  3. 如果没有可以选取的属性(因为属性在之前已经选择完了),此时同样选取结果数量最多的一个值

    造成没有可以选取的原因:因为可能同一个属性,可能有不同结果。

  4. 选取当前属性的各个值,然后分别执行1;
  5. 当递归完毕,即每个属性的值最终都有一个值,即为决策树,如果在测试过程出现训练阶段没有出现的结果,可以为每一个属性单独设置一个其他值用于表示决策树中没有该属性的值时决策树的输出结果,这个值可以设置为当前属性数量最多的结果值
    def id3(self, dataset: pd.DataFrame = None):
        '''
        实现决策树的ID3算法
        :param dataset: 输入的数据集
        :return: dict 决策树节点
        '''
        if dataset is None:
            dataset = self.DataSet
        next_tree = {}
        result_count = dataset.iloc[:, -1].value_counts()
        result_max = result_count.idxmax()
        next_tree["其他"] = result_max
        if result_count.shape[0] == 1 or dataset.shape[1] < 2 or self.empirical_entropy(dataset) < self._threshold:
            self._leafCount += 1
            logging.debug(f"select decision {result_max}, result_type:{result_count.tolist()}, dataset column:{dataset.shape}, lower than threshold:{self.empirical_entropy(dataset) < self._threshold}")
            tree = {"next": next_tree}
        else:
            best_column = self.best_empirical_entropy(dataset)
            value_counts = dataset[best_column].value_counts()
            labels = value_counts.index
            for label in labels:
                logging.debug(f"now choose_column:{best_column}, label: {label}")
                split_dataset = self.extract_dataset(dataset, best_column, label)
                next_decision = self.id3(split_dataset)
                next_tree[label] = next_decision
            tree = {"column": best_column, "next": next_tree}
        return tree

拟合

    def fit(self, x: pd.DataFrame, y=None, algorithm: str = "id3", threshold=0.1):
        '''
        拟合函数,输入数据集进行拟合,其中如果y没有输入,则x的最后一列应包含分类结果
        :param x: pd.DataFrame数据集的属性(当y为None时,为整个数据集-包含结果)
        :param y: list like,shape=(-1,)数据集的结果
        :param algorithm: 选择算法(目前仅有ID3)
        :param threshold: 选择信息增益的阈值
        :return: 决策树的根节点
        '''
        self.check_dataset(x, dimension=2)
        self.check_dataset(y, dimension=1)
        self._threshold = threshold
        dataset = x
        if y is not None:
            dataset.insert(dataset.shape[1], 'DECISION_tempADD', y)
        self.decision_tree = eval("self." + algorithm)(dataset)
        logging.info(f"decision_tree leaf:{self._leafCount}")
        return self.decision_tree

预测

    def predict(self, x: pd.DataFrame):
        '''
        预测数据
        :param x:pd.DataFrame 输入的数据集
        :return: 分类结果
        '''
        self.y_predict = x.apply(self._predict_line, axis=1)
        return self.y_predict

    def _predict_line(self, line):
        """
        私有函数,用于在predict中,对每一行数据进行预测
        :param line: 输入的数据集的某一行数据
        :return: 该一行的分类结果
        """
        tree = self.decision_tree
        while True:
            try:
                if len(tree["next"]) == 1:
                    return tree["next"]["其他"]
                else:
                    value = line[tree["column"]]
                    tree = tree["next"][value]
            except:
                return tree["next"]["其他"]

评估

评估结果的准确度,精确度,召回率。

  • score评估函数:仅适用于二分类,对于多分类该算法不适用(但是决策树代码可以predict预测)
  • 同时score判断正例需要结果为1,反例结果为0。
    def score(self, y):
        '''
        评估函数,用于评估结果
        :param y: 输入实际的结果
        :return: None
        '''
        if self.y_predict is None:
            raise Exception("before score should predict first!")
        y_acutalTrue = y[(y == 1) & (self.y_predict == 1)].shape[0]
        y_acutalFalse = y[(y == 0) & (self.y_predict == 0)].shape[0]
        y_predictTrue = self.y_predict[self.y_predict == 1].shape[0]
        y_true = y[y == 1].shape[0]
        y_total = y.shape[0]
        logging.debug(f"y_acutalTrue:{y_acutalTrue}, y_acutalFalse:{y_acutalFalse}, y_predictTrue:{y_predictTrue}, "
                      f"y_true:{y_true}, y_total:{y_total}")
        Accuracy = (y_acutalTrue + y_acutalFalse) / y_total
        Precision = y_acutalTrue / y_predictTrue
        Recall = y_acutalTrue / y_true
        print("Accuracy: ", Accuracy,
              "Precision: ", Precision,
              "Recall: ", Recall
              )

决策树可视化

利用mermaid文本绘图,将预测的值做了合并,同一属性的不同值但是分类结果相同,则可视化时都指向同一个输出节点。

  • 可视化函数提供了两种输出格式
    • markdown格式
    • html格式(推荐,使用浏览器即可查看决策树)

决策树保存

    def save(self, savePath: str):
        open(savePath, "w").write(str(decisionTree.decision_tree))
        logging.info(f"决策树已保存,位置:{savePath}")

决策树读取

    def load(self, savePath: str):
        tree = eval(open(savePath, "r").read())
        if type(tree) == dict:
            self.decision_tree = tree
        else:
            raise Exception("Load Faild!")

效果图

示例图,非数据集分类结果图
在这里插入图片描述

    def visualOutput(self, savePath="", outputFormat="html", direction="TD"):
        '''
        将决策树可视化输出,格式为‘md'或’html'
        :param outputFormat: 设置输出格式
        :return: 对应输出格式的文本
        '''
        if self.decision_tree is None:
            raise Exception("should fit first!")
        text = ""
        if outputFormat == "md":
            text = self._format_md(direction=direction)
        elif outputFormat == "html":
            text = self._format_html(direction=direction)
        if savePath != "":
            open(savePath, "w", encoding="utf-8").write(text)
        return text


    def _format_html(self, direction="TD"):
        '''
        决策树的可视化为html格式
        :return: html代码
        '''
        html_start = '<!DOCTYPE html><html lang="en"><head>' \
                     '<meta charset="UTF-8">' \
                     '<meta name="viewport" content="width=device-width, initial-scale=1.0">' \
                     '<meta http-equiv="X-UA-Compatible" content="ie=edge">' \
                     '<title>DecisionTree</title>' \
                     '<script src="https://cdn.bootcss.com/mermaid/8.0.0-rc.8/mermaid.min.js">' \
                     '</script></head><body><div class="mermaid">{}</div>'
        html_end = '</body></html>'
        mermaid = self._format_md(end=";", direction=direction)
        mermaid = mermaid.replace("```mermaid\n", "").replace("```", "")
        html = html_start.replace("{}", mermaid)+html_end
        return html

    def _format_md(self, direction="TD", end="\n"):
        '''
        决策树的可视化为md代码(mermaid代码)
        :param end: 设置每行结尾符号
        :param direction: 设置方向
        :return:
        '''
        md = "```mermaid\n"
        md += f"graph {direction}{end}"
        total_node = 1
        current_nodeID = 0
        if len(self.decision_tree) != 2:
            code_line = f"{current_nodeID}(start)-->{self.decision_tree['next']['其他']}"
            return md + code_line + "\n```"

        queue = [self.decision_tree]
        while len(queue) > 0:
            node = queue.pop(0)
            ans_node = []
            for key in node["next"].keys():
                if type(node['next'][key]) == dict:
                    if len(node['next'][key]) == 1:
                        decision = node['next'][key]['next']['其他']
                        if decision not in ans_node:
                            ans_node.append(decision)
                        nodeID_ans = ans_node.index(decision)
                        code_line = f"{current_nodeID}({node['column']})--{key}-->" \
                                    f"L{current_nodeID}_{nodeID_ans}({decision})"
                    else:
                        code_line = f"{current_nodeID}({node['column']})--{key}-->{total_node}"
                        queue.append(node["next"][key])
                        total_node += 1
                else:
                    decision = node['next'][key]
                    if decision not in ans_node:
                        ans_node.append(decision)
                    nodeID_ans = ans_node.index(decision)
                    code_line = f"{current_nodeID}({node['column']})--{key}-->" \
                                f"L{current_nodeID}_{nodeID_ans}({decision})"
                # code_line_b = str(code_line.encode("utf-8")).lstrip("b'").rstrip("'")
                md += code_line+end
            current_nodeID += 1
        return md + "```"

总代码

如何获得每一步计算结果

不想要那么多过程,可以将开头的logging.basicConfig中的level设置为INFO即可。

即:
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")

修改为:
logging.basicConfig(level=logging.INFO, format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")

如果需要导出日志:
 参数filename为输出日志位置。
 参数filemode为输出日志写入模式。
logging.basicConfig(level=logging.DEBUG, filename='DecisionTree.log', filemode='w', format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")

运行代码可能存在问题

  • 数据集不对:Caesarian Section Classification Dataset下载后为arff格式,该代码使用的数据集格式为csv,需要将arff中的数据提取出来,可以使用记事本,将arff的数据部分保存为csv格式即可。
  • 此外本代码提供一个demo,无需外部数据集亦可运行。
  • score评估函数:仅适用于二分类,对于多分类该算法不适用(决策树可以predict),同时score判断正例需要结果为1,反例结果为0。
import pandas as pd
import numpy as np
import logging
from sklearn.model_selection import train_test_split

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
"""
application: Decision_tree-ID3
writer: Flysky
Date: 2020年10月14日
"""


class DecisionTree:
    def __init__(self):
        self.DataSet = None
        self._threshold = 0.1
        self._leafCount = 0
        self.decision_tree = None
        self.y_predict = None

    def check_dataset(self, dataset: pd.DataFrame, dimension=2):
        if len(dataset.shape) != dimension:
            raise ValueError(f"data dimension not {dimension} but {len(dataset.shape)}")

    def empirical_entropy(self, dataset=None):
        """
        求经验熵
        $$H = -\sum^n_{i=1}p(x_i)log_2(p(x_i))$$
        :return: Float 经验熵
        """
        if dataset is None:
            dataset = self.DataSet
        columns_count = dataset.iloc[:, -1].value_counts()
        entropy = 0
        total_count = columns_count.sum()
        for count in columns_count:
            p = count / total_count
            entropy -= p * np.log2(p)
        return entropy

    def extract_dataset(self, dataset: pd.DataFrame, column, label):
        """
        根据column和label筛选出指定的数据集
        :return: pd.DataFrame 筛选后的数据集
        """
        if type(column) == int:
            split_dataset = dataset[dataset.iloc[:, column] == label].drop(dataset.columns[column], axis=1)
        else:
            split_dataset = dataset[dataset.loc[:, column] == label].drop(column, axis=1)
        return split_dataset

    def best_empirical_entropy(self, dataset: pd.DataFrame = None):
        """
        选取数据集中的columns中,最好的column(经验熵最大)
        :param dataset: 带选取的数据集
        :return: 返回column
        """
        if dataset is None:
            dataset = self.DataSet
        columns = dataset.columns[:-1]
        total_count = dataset.shape[0]
        empirical_entropy = self.empirical_entropy(dataset)
        logging.debug(f"now dataset shape is {dataset.shape}, column is {dataset.columns.tolist()}")
        logging.debug(f"empirical_entropy is {empirical_entropy}")
        informationGain_max = -1
        best_column = None
        for column in columns:
            entropy_tmp = 0
            data_counts = dataset.loc[:, column].value_counts()
            data_labels = data_counts.index
            logging.debug(f"now is {column}")
            for label in data_labels:
                split_dataset = self.extract_dataset(dataset, column, label)
                count = split_dataset.shape[0]
                p = count / total_count
                entropy_tmp += p * self.empirical_entropy(split_dataset)
                logging.debug(f"now label is {label}, chooseData shape is {split_dataset.shape}, "
                              f"Ans count: {split_dataset.iloc[:, -1].value_counts().tolist()}, "
                              f"entropy: {self.empirical_entropy(split_dataset)}")
            informationGain = empirical_entropy - entropy_tmp
            logging.debug(f"entropy: {entropy_tmp}, {column} informationGain:{informationGain}")
            if informationGain > informationGain_max:
                best_column = column
                informationGain_max = informationGain
        logging.debug(f"Choose {best_column}:{informationGain_max}")
        return best_column

    def id3(self, dataset: pd.DataFrame = None):
        '''
        实现决策树的ID3算法
        :param dataset: 输入的数据集
        :return: dict 决策树节点
        '''
        if dataset is None:
            dataset = self.DataSet
        next_tree = {}
        result_count = dataset.iloc[:, -1].value_counts()
        result_max = result_count.idxmax()
        next_tree["其他"] = result_max
        if result_count.shape[0] == 1 or dataset.shape[1] < 2 or self.empirical_entropy(dataset) < self._threshold:
            self._leafCount += 1
            logging.debug(
                f"select decision {result_max}, result_type:{result_count.tolist()}, dataset column:{dataset.shape}, lower than threshold:{self.empirical_entropy(dataset) < self._threshold}")
            tree = {"next": next_tree}
        else:
            best_column = self.best_empirical_entropy(dataset)
            value_counts = dataset[best_column].value_counts()
            labels = value_counts.index
            for label in labels:
                logging.debug(f"now choose_column:{best_column}, label: {label}")
                split_dataset = self.extract_dataset(dataset, best_column, label)
                next_decision = self.id3(split_dataset)
                next_tree[label] = next_decision
            tree = {"column": best_column, "next": next_tree}
        return tree

    def fit(self, x: pd.DataFrame, y=None, algorithm: str = "id3", threshold=0.1):
        '''
        拟合函数,输入数据集进行拟合,其中如果y没有输入,则x的最后一列应包含分类结果
        :param x: pd.DataFrame数据集的属性(当y为None时,为整个数据集-包含结果)
        :param y: list like,shape=(-1,)数据集的结果
        :param algorithm: 选择算法(目前仅有ID3)
        :param threshold: 选择信息增益的阈值
        :return: 决策树的根节点
        '''
        self.check_dataset(x, dimension=2)
        self.check_dataset(y, dimension=1)
        self._threshold = threshold
        dataset = x
        if y is not None:
            dataset.insert(dataset.shape[1], 'DECISION_tempADD', y)
        self.decision_tree = eval("self." + algorithm)(dataset)
        logging.info(f"decision_tree leaf:{self._leafCount}")
        return self.decision_tree

    def leaf_count(self):
        '''
        统计叶子节点个数(此处的叶子节点即能确定分类的属性值所对应的分类结果值
        :return: 叶子节点个数
        '''
        return self._leafCount

    def predict(self, x: pd.DataFrame):
        '''
        预测数据
        :param x:pd.DataFrame 输入的数据集
        :return: 分类结果
        '''
        self.y_predict = x.apply(self._predict_line, axis=1)
        return self.y_predict

    def _predict_line(self, line):
        """
        私有函数,用于在predict中,对每一行数据进行预测
        :param line: 输入的数据集的某一行数据
        :return: 该一行的分类结果
        """
        tree = self.decision_tree
        while True:
            try:
                if len(tree["next"]) == 1:
                    return tree["next"]["其他"]
                else:
                    value = line[tree["column"]]
                    tree = tree["next"][value]
            except:
                return tree["next"]["其他"]

    def score(self, y):
        '''
        评估函数,用于评估结果
        :param y: 输入实际的结果
        :return: None
        '''
        if self.y_predict is None:
            raise Exception("before score should predict first!")
        y_acutalTrue = y[(y == 1) & (self.y_predict == 1)].shape[0]
        y_acutalFalse = y[(y == 0) & (self.y_predict == 0)].shape[0]
        y_predictTrue = self.y_predict[self.y_predict == 1].shape[0]
        y_true = y[y == 1].shape[0]
        y_total = y.shape[0]
        logging.debug(f"y_acutalTrue:{y_acutalTrue}, y_acutalFalse:{y_acutalFalse}, y_predictTrue:{y_predictTrue}, "
                      f"y_true:{y_true}, y_total:{y_total}")
        Accuracy = (y_acutalTrue + y_acutalFalse) / y_total
        Precision = y_acutalTrue / y_predictTrue
        Recall = y_acutalTrue / y_true
        print("Accuracy: ", Accuracy,
              "Precision: ", Precision,
              "Recall: ", Recall
              )

    def visualOutput(self, savePath="", outputFormat="html", direction="TD"):
        '''
        将决策树可视化输出,格式为‘md'或’html'
        :param outputFormat: 设置输出格式
        :return: 对应输出格式的文本
        '''
        if self.decision_tree is None:
            raise Exception("should fit first!")
        text = ""
        if outputFormat == "md":
            text = self._format_md(direction=direction)
        elif outputFormat == "html":
            text = self._format_html(direction=direction)
        if savePath != "":
            open(savePath, "w", encoding="utf-8").write(text)
        return text

    def _format_html(self, direction="TD"):
        '''
        决策树的可视化为html格式
        :return: html代码
        '''
        html_start = '<!DOCTYPE html><html lang="en"><head>' \
                     '<meta charset="UTF-8">' \
                     '<meta name="viewport" content="width=device-width, initial-scale=1.0">' \
                     '<meta http-equiv="X-UA-Compatible" content="ie=edge">' \
                     '<title>DecisionTree</title>' \
                     '<script src="https://cdn.bootcss.com/mermaid/8.0.0-rc.8/mermaid.min.js">' \
                     '</script></head><body><div class="mermaid">{}</div>'
        html_end = '</body></html>'
        mermaid = self._format_md(end=";", direction=direction)
        mermaid = mermaid.replace("```mermaid\n", "").replace("```", "")
        html = html_start.replace("{}", mermaid) + html_end
        return html

    def _format_md(self, direction="TD", end="\n"):
        '''
        决策树的可视化为md代码(mermaid代码)
        :param end: 设置每行结尾符号
        :param direction: 设置方向
        :return:
        '''
        md = "```mermaid\n"
        md += f"graph {direction}{end}"
        total_node = 1
        current_nodeID = 0
        if len(self.decision_tree) != 2:
            code_line = f"{current_nodeID}(start)-->{self.decision_tree['next']['其他']}"
            return md + code_line + "\n```"

        queue = [self.decision_tree]
        while len(queue) > 0:
            node = queue.pop(0)
            ans_node = []
            for key in node["next"].keys():
                if type(node['next'][key]) == dict:
                    if len(node['next'][key]) == 1:
                        decision = node['next'][key]['next']['其他']
                        if decision not in ans_node:
                            ans_node.append(decision)
                        nodeID_ans = ans_node.index(decision)
                        code_line = f"{current_nodeID}({node['column']})--{key}-->" \
                                    f"L{current_nodeID}_{nodeID_ans}({decision})"
                    else:
                        code_line = f"{current_nodeID}({node['column']})--{key}-->{total_node}"
                        queue.append(node["next"][key])
                        total_node += 1
                else:
                    decision = node['next'][key]
                    if decision not in ans_node:
                        ans_node.append(decision)
                    nodeID_ans = ans_node.index(decision)
                    code_line = f"{current_nodeID}({node['column']})--{key}-->" \
                                f"L{current_nodeID}_{nodeID_ans}({decision})"
                # code_line_b = str(code_line.encode("utf-8")).lstrip("b'").rstrip("'")
                md += code_line + end
            current_nodeID += 1
        return md + "```"

    def load(self, savePath: str):
        tree = eval(open(savePath, "r").read())
        if type(tree) == dict:
            self.decision_tree = tree
        else:
            raise Exception("Load Faild!")

    def save(self, savePath: str):
        open(savePath, "w").write(str(decisionTree.decision_tree))
        logging.info(f"决策树已保存,位置:{savePath}")


if __name__ == '__main__':
    # 初始化决策树
    decisionTree = DecisionTree()

    # 不需要外部数据集的demo

    demo_data = [[0, 2, 0, 0, 0],
                 [0, 2, 0, 1, 0],
                 [1, 2, 0, 0, 1],
                 [2, 1, 0, 0, 1],
                 [2, 0, 1, 0, 1],
                 [2, 0, 1, 1, 0],
                 [1, 0, 1, 1, 1],
                 [0, 1, 0, 0, 0],
                 [0, 0, 1, 0, 1],
                 [2, 1, 1, 0, 1],
                 [0, 1, 1, 1, 1],
                 [1, 1, 0, 1, 1],
                 [1, 2, 1, 0, 1],
                 [2, 1, 0, 1, 0]]
    dataset = pd.DataFrame(demo_data)
    dataset.columns = ['年龄', '有工作', '是学生', '信贷情况', "借贷"]

    # UCI数据集Caesarian Section Classification

    # dataset = pd.read_csv("caesarian.csv", header=None)
    # dataset.columns = ["Age", "Delivery_number", "Delivery_time", "Blood_of_Pressure", "Heart_Problem", "Caesarian"]
    # age = dataset["Age"].value_counts().sort_index()  # 将Age分为三层,低于24岁,低于31岁,高于30岁
    # dataset["Age"][dataset["Age"] < 24] = 0
    # dataset["Age"][(dataset["Age"] > 23) & (dataset["Age"] < 31)] = 1
    # dataset["Age"][30 < dataset["Age"]] = 2
    # print(dataset.info())

    # 将数据集的属性和结果分开
    X = dataset.iloc[:, :-1]
    Y = dataset.iloc[:, -1]

    # 使用skleran切分数据集
    # X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7, shuffle=True)
    # else直接使用数据集作为测试集
    X_train = X_test = X
    Y_train = Y_test = Y

    # 拟合
    e = decisionTree.fit(X_train, Y_train, threshold=-1)
    # 保存决策树
    decisionTree.save("decisionTree.txt")
    # 加载决策树
    decisionTree.load("decisionTree.txt")
    # 预测
    predict_y = decisionTree.predict(X_test)
    # 评估
    decisionTree.score(Y_test)
    # 可视化输出(html格式)
    # visualOutput可选参数outputFormat=["md", "html"],direction方向,设置决策树的方向=["LR","RL","TD","DT"],默认TD,从上到下
    decisionTree.visualOutput(savePath="decisionTree.html", outputFormat="html")

实验结果(决策树)

在这里插入图片描述

debug模式

使用demo数据集运行

2020-10-14 00:47:19,827-[root]	[DEBUG]	[best_empirical_entropy]: now dataset shape is (14, 5), column is ['年龄', '有工作', '是学生', '信贷情况', 'DECISION_tempADD']
2020-10-14 00:47:19,827-[root]	[DEBUG]	[best_empirical_entropy]: empirical_entropy is 0.9402859586706311
2020-10-14 00:47:19,831-[root]	[DEBUG]	[best_empirical_entropy]: now is 年龄
2020-10-14 00:47:19,849-[root]	[DEBUG]	[best_empirical_entropy]: now label is 2, chooseData shape is (5, 4), Ans count: [3, 2], entropy: 0.9709505944546686
2020-10-14 00:47:19,859-[root]	[DEBUG]	[best_empirical_entropy]: now label is 0, chooseData shape is (5, 4), Ans count: [3, 2], entropy: 0.9709505944546686
2020-10-14 00:47:19,865-[root]	[DEBUG]	[best_empirical_entropy]: now label is 1, chooseData shape is (4, 4), Ans count: [4], entropy: 0.0
2020-10-14 00:47:19,865-[root]	[DEBUG]	[best_empirical_entropy]: entropy: 0.6935361388961918, 年龄 informationGain:0.24674981977443933
2020-10-14 00:47:19,868-[root]	[DEBUG]	[best_empirical_entropy]: now is 有工作
2020-10-14 00:47:19,880-[root]	[DEBUG]	[best_empirical_entropy]: now label is 1, chooseData shape is (6, 4), Ans count: [4, 2], entropy: 0.9182958340544896
2020-10-14 00:47:19,889-[root]	[DEBUG]	[best_empirical_entropy]: now label is 2, chooseData shape is (4, 4), Ans count: [2, 2], entropy: 1.0
2020-10-14 00:47:19,896-[root]	[DEBUG]	[best_empirical_entropy]: now label is 0, chooseData shape is (4, 4), Ans count: [3, 1], entropy: 0.8112781244591328
2020-10-14 00:47:19,897-[root]	[DEBUG]	[best_empirical_entropy]: entropy: 0.9110633930116763, 有工作 informationGain:0.02922256565895487
2020-10-14 00:47:19,898-[root]	[DEBUG]	[best_empirical_entropy]: now is 是学生
2020-10-14 00:47:19,909-[root]	[DEBUG]	[best_empirical_entropy]: now label is 1, chooseData shape is (7, 4), Ans count: [6, 1], entropy: 0.5916727785823275
2020-10-14 00:47:19,917-[root]	[DEBUG]	[best_empirical_entropy]: now label is 0, chooseData shape is (7, 4), Ans count: [4, 3], entropy: 0.9852281360342515
2020-10-14 00:47:19,918-[root]	[DEBUG]	[best_empirical_entropy]: entropy: 0.7884504573082896, 是学生 informationGain:0.15183550136234159
2020-10-14 00:47:19,920-[root]	[DEBUG]	[best_empirical_entropy]: now is 信贷情况
2020-10-14 00:47:19,927-[root]	[DEBUG]	[best_empirical_entropy]: now label is 0, chooseData shape is (8, 4), Ans count: [6, 2], entropy: 0.8112781244591328
2020-10-14 00:47:19,937-[root]	[DEBUG]	[best_empirical_entropy]: now label is 1, chooseData shape is (6, 4), Ans count: [3, 3], entropy: 1.0
2020-10-14 00:47:19,937-[root]	[DEBUG]	[best_empirical_entropy]: entropy: 0.8921589282623617, 信贷情况 informationGain:0.04812703040826949
2020-10-14 00:47:19,937-[root]	[DEBUG]	[best_empirical_entropy]: Choose 年龄:0.24674981977443933
2020-10-14 00:47:19,940-[root]	[DEBUG]	[id3]: now choose_column:年龄, label: 2
2020-10-14 00:47:19,950-[root]	[DEBUG]	[best_empirical_entropy]: now dataset shape is (5, 4), column is ['有工作', '是学生', '信贷情况', 'DECISION_tempADD']
2020-10-14 00:47:19,950-[root]	[DEBUG]	[best_empirical_entropy]: empirical_entropy is 0.9709505944546686
2020-10-14 00:47:19,953-[root]	[DEBUG]	[best_empirical_entropy]: now is 有工作
2020-10-14 00:47:19,964-[root]	[DEBUG]	[best_empirical_entropy]: now label is 1, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:19,974-[root]	[DEBUG]	[best_empirical_entropy]: now label is 0, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:19,974-[root]	[DEBUG]	[best_empirical_entropy]: entropy: 0.9509775004326937, 有工作 informationGain:0.01997309402197489
2020-10-14 00:47:19,976-[root]	[DEBUG]	[best_empirical_entropy]: now is 是学生
2020-10-14 00:47:19,983-[root]	[DEBUG]	[best_empirical_entropy]: now label is 1, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:19,992-[root]	[DEBUG]	[best_empirical_entropy]: now label is 0, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:19,992-[root]	[DEBUG]	[best_empirical_entropy]: entropy: 0.9509775004326937, 是学生 informationGain:0.01997309402197489
2020-10-14 00:47:19,995-[root]	[DEBUG]	[best_empirical_entropy]: now is 信贷情况
2020-10-14 00:47:20,004-[root]	[DEBUG]	[best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [3], entropy: 0.0
2020-10-14 00:47:20,013-[root]	[DEBUG]	[best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,013-[root]	[DEBUG]	[best_empirical_entropy]: entropy: 0.0, 信贷情况 informationGain:0.9709505944546686
2020-10-14 00:47:20,013-[root]	[DEBUG]	[best_empirical_entropy]: Choose 信贷情况:0.9709505944546686
2020-10-14 00:47:20,015-[root]	[DEBUG]	[id3]: now choose_column:信贷情况, label: 0
2020-10-14 00:47:20,021-[root]	[DEBUG]	[id3]: select decision 1, result_type:[3], dataset column:(3, 3), lower than threshold:False
2020-10-14 00:47:20,021-[root]	[DEBUG]	[id3]: now choose_column:信贷情况, label: 1
2020-10-14 00:47:20,027-[root]	[DEBUG]	[id3]: select decision 0, result_type:[2], dataset column:(2, 3), lower than threshold:False
2020-10-14 00:47:20,028-[root]	[DEBUG]	[id3]: now choose_column:年龄, label: 0
2020-10-14 00:47:20,037-[root]	[DEBUG]	[best_empirical_entropy]: now dataset shape is (5, 4), column is ['有工作', '是学生', '信贷情况', 'DECISION_tempADD']
2020-10-14 00:47:20,037-[root]	[DEBUG]	[best_empirical_entropy]: empirical_entropy is 0.9709505944546686
2020-10-14 00:47:20,038-[root]	[DEBUG]	[best_empirical_entropy]: now is 有工作
2020-10-14 00:47:20,046-[root]	[DEBUG]	[best_empirical_entropy]: now label is 2, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,052-[root]	[DEBUG]	[best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:20,060-[root]	[DEBUG]	[best_empirical_entropy]: now label is 0, chooseData shape is (1, 3), Ans count: [1], entropy: 0.0
2020-10-14 00:47:20,060-[root]	[DEBUG]	[best_empirical_entropy]: entropy: 0.4, 有工作 informationGain:0.5709505944546686
2020-10-14 00:47:20,061-[root]	[DEBUG]	[best_empirical_entropy]: now is 是学生
2020-10-14 00:47:20,068-[root]	[DEBUG]	[best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [3], entropy: 0.0
2020-10-14 00:47:20,076-[root]	[DEBUG]	[best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,076-[root]	[DEBUG]	[best_empirical_entropy]: entropy: 0.0, 是学生 informationGain:0.9709505944546686
2020-10-14 00:47:20,077-[root]	[DEBUG]	[best_empirical_entropy]: now is 信贷情况
2020-10-14 00:47:20,085-[root]	[DEBUG]	[best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:20,092-[root]	[DEBUG]	[best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:20,092-[root]	[DEBUG]	[best_empirical_entropy]: entropy: 0.9509775004326937, 信贷情况 informationGain:0.01997309402197489
2020-10-14 00:47:20,092-[root]	[DEBUG]	[best_empirical_entropy]: Choose 是学生:0.9709505944546686
2020-10-14 00:47:20,094-[root]	[DEBUG]	[id3]: now choose_column:是学生, label: 0
2020-10-14 00:47:20,100-[root]	[DEBUG]	[id3]: select decision 0, result_type:[3], dataset column:(3, 3), lower than threshold:False
2020-10-14 00:47:20,100-[root]	[DEBUG]	[id3]: now choose_column:是学生, label: 1
2020-10-14 00:47:20,106-[root]	[DEBUG]	[id3]: select decision 1, result_type:[2], dataset column:(2, 3), lower than threshold:False
2020-10-14 00:47:20,106-[root]	[DEBUG]	[id3]: now choose_column:年龄, label: 1
2020-10-14 00:47:20,112-[root]	[DEBUG]	[id3]: select decision 1, result_type:[4], dataset column:(4, 4), lower than threshold:False
2020-10-14 00:47:20,112-[root]	[INFO]	[fit]: decision_tree leaf:5
2020-10-14 00:47:20,113-[root]	[INFO]	[save]: 决策树已保存,位置:decisionTree.txt
2020-10-14 00:47:20,123-[root]	[DEBUG]	[score]: y_acutalTrue:9, y_acutalFalse:5, y_predictTrue:9, y_true:9, y_total:14
以下是Python实现决策树算法的代码,不调用sklearn库,对鸢尾花数据集进行分类实现结果可视化: ``` import pandas as pd import numpy as np import matplotlib.pyplot as plt class Node: def __init__(self, feature_idx=None, threshold=None, left=None, right=None, value=None): self.feature_idx = feature_idx self.threshold = threshold self.left = left self.right = right self.value = value class DecisionTree: def __init__(self, max_depth=None): self.max_depth = max_depth self.root = None def fit(self, X, y): self.n_classes_ = len(set(y)) self.n_features_ = X.shape[1] self.root = self._grow_tree(X, y) def predict(self, X): return [self._traverse_tree(x, self.root) for x in X] def _grow_tree(self, X, y, depth=0): n_samples, n_features = X.shape n_labels = len(set(y)) if (depth is None) or (depth == self.max_depth) or (n_labels == 1) or (n_samples < 2): if n_samples == 0: return None leaf_value = self._leaf_value(y) return Node(value=leaf_value) feature_idxs = np.random.choice(n_features, int(np.sqrt(n_features)), replace=False) best_feature_idx, best_threshold = self._best_criteria(X, y, feature_idxs) left_idxs, right_idxs = self._split(X[:, best_feature_idx], best_threshold) left = self._grow_tree(X[left_idxs, :], y[left_idxs], depth+1) right = self._grow_tree(X[right_idxs, :], y[right_idxs], depth+1) return Node(best_feature_idx, best_threshold, left, right) def _best_criteria(self, X, y, feature_idxs): best_gain = -1 split_idx, split_threshold = None, None for feature_idx in feature_idxs: X_column = X[:, feature_idx] thresholds = np.unique(X_column) for threshold in thresholds: gain = self._information_gain(y, X_column, threshold) if gain > best_gain: best_gain = gain split_idx = feature_idx split_threshold = threshold return split_idx, split_threshold def _split(self, X_column, split_threshold): left_idxs = np.argwhere(X_column <= split_threshold).flatten() right_idxs = np.argwhere(X_column > split_threshold).flatten() return left_idxs, right_idxs def _information_gain(self, y, X_column, split_threshold): parent_entropy = self._entropy(y) left_idxs, right_idxs = self._split(X_column, split_threshold) if len(left_idxs) == 0 or len(right_idxs) == 0: return 0 n = len(y) n_l, n_r = len(left_idxs), len(right_idxs) e_l, e_r = self._entropy(y[left_idxs]), self._entropy(y[right_idxs]) child_entropy = (n_l / n) * e_l + (n_r / n) * e_r ig = parent_entropy - child_entropy return ig def _leaf_value(self, y): if self.n_classes_ == 1: return y[0] else: bincount = np.bincount(y, minlength=self.n_classes_) probs = bincount / len(y) idx = np.argmax(probs) return idx def _entropy(self, y): hist = np.bincount(y, minlength=self.n_classes_) ps = hist / np.sum(hist) return -np.sum([p * np.log2(p) for p in ps if p > 0]) def _traverse_tree(self, x, node): if node.value is not None: return node.value if x[node.feature_idx] <= node.threshold: return self._traverse_tree(x, node.left) else: return self._traverse_tree(x, node.right) def plot_decision_boundary(clf, X, y, plot_step = 0.02, cmap = "Paired"): x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5 y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5 xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)) Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) Z = Z.reshape(xx.shape) plt.figure(figsize = (10,6)) plt.contourf(xx, yy, Z , cmap=cmap, alpha= 0.5) plt.scatter(X[:,0], X[:,1], c=y, cmap=cmap, edgecolor = "black") if __name__ == "__main__": iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header=None).values np.random.seed(1) np.random.shuffle(iris) X_train = iris[:100 , [0,2]] y_train = iris[:100 , -1] y_train[y_train == 'Iris-setosa'] = 0 y_train[y_train == 'Iris-versicolor'] = 1 y_train[y_train == 'Iris-virginica'] = 2 y_train = y_train.astype(int) dt = DecisionTree(max_depth=3) dt.fit(X_train, y_train) plot_decision_boundary(dt, X_train, y_train) plt.show() ``` 注意:此处为Jupyter Notebook版本的代码,直接在Jupyter运行即可,否则可能需要对plt.show()进行更改。
评论 6
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值