机器学习-决策树

尘世中迷途小码农

已于 2024-04-15 18:18:49 修改

阅读量844

点赞数

分类专栏： python 机器学习文章标签：决策树 python 机器学习

于 2020-09-11 21:59:38 首次发布

本文链接：https://blog.csdn.net/funnyrand/article/details/108539212

版权

python 同时被 2 个专栏收录

29 篇文章 0 订阅

订阅专栏

机器学习

11 篇文章 2 订阅

订阅专栏

机器学习-决策树

本章介绍机器学习中一个非常重要的监督学习算法，决策树，决策树有很多分类，如CART（基于基尼系数，可用于分类或回归），ID3（基于信息增益，使用信息增益来选择特征），C4.5（基于信息增益，与ID3类似，但使用信息增益的比例来选择特征）等，这里介绍ID3。

包括以下内容：

决策树的原理
信息，熵，信息增益的概念
如何划分数据集
如何选择最好的数据划分
构建决策树
图形化查看决策树
测试决策树
存储和加载决策树
ID3的优缺点
案例 - 使用决策树预测隐形眼镜类型

部分内容引用自《Machine Learning in Action》

决策树的原理

问题提出：

假定某个数据集S包含M个元素，每个元素都有属性（特征）X1，X2，...，Xk，且每个元素都有具体的分类，分类的集合由C1，C2，...，Cn组成。若给定数据集S外的其它某个元素的属性(x1，x2，...，xk)，求该元素属于哪个分类。

思路：

决策树的的思想是先通过某种方式，将数据集转换成一颗树（决策树），再按此决策树中定义的顺序依次判断目标元素的属性值x1，x2，...，xk，最后定位到的叶子节点的分类就是目标元素的分类。决策树的构造相对比较费时，但判断分类是非常高效的。

举例：

假设通过某种方式，我们得到了以下用于判断是否去相亲的决策树：

现在给定一人A，其属性为（白，富，美），通过决策树可以得到结果“去”，说明可以去相亲。再给定另一人B，其属性为（不富，白，不美），通过决策树得到结果“犹豫”，说明需要考虑一下是否去相亲。可以看出，决策树可以非常快速的判断某个元素属于哪个分类。

信息，熵，信息增益的概念

在ID3的算法中，为了得到一棵合理的决策树，我们需要知道信息，熵和信息增益的概念。

信息：

假设数据集S中有n个分类，C1，C2，...，Cn，则第i个分类的信息定义为：

$\large l\left ( {C_{i}} \right ) = -\log_{2}p\left ( C_{i} \right )$

其中， $\large p\left ( C_{i} \right )$ 为 $C_{i}$ 在所有分类中出现的概率。通过此公式可知，概率越大，信息就越接近于0，反之越大于0。注意，信息是基于分类的，而不是基于集合中的元素。

熵：

熵可以称为香农熵，是克劳德香农在二十世纪发明的，其描述的是任意数据集S中元素的混乱程度。熵的值越大表示数据越混乱，越小表示越统一。熵定义为信息的期望：

$\large H = -\sum_{i=1}^{n} p\left ( C_{i} \right ) \log_{2}p\left ( C_{i} \right )$

注意，熵基于信息，所以熵也是针对数据集中的分类，而不是数据集中的元素。例如，假设数据集S中有100个元素，所有这些元素的属性值都不一样且相差很大，但这些元素都属于同一个分类Ci。那么该数据集的熵为0，表示所有元素绝对统一，没有任何混乱（都属于同一个分类，所以统一）。

信息增益：

信息增益定义为熵的减少。假设初始情况下数据集S的熵为 $\large H_{1}$ ，经过某些操作后其熵变为 $\large H_{2}$ ，则该操作带来的信息增益为：

$\large \Delta_{H}= H_{1} - H_{2}$

如果 $\large \Delta_{H}$ 为正数，表示经过某些操作后能减少S的混乱程度，能让S中的元素更加统一。但如果 $\large \Delta_{H}$ 为负数，表示该操作让S中的的元素更加混乱。下面通过实例计算数据集的熵。

创建Python模块 entropy.py，输入以下代码：

import math


def cal_entropy(data_set):
    """
    This function calculates entropy of a data set, no matter how many attributes that data set have, only depends on
    the last value of each element, it's the element category(label).
    :param data_set: data set
    :return: entropy of a data set
    """
    labels = {}
    for data in data_set:
        label = data[-1]
        if label not in labels.keys():
            labels[label] = 0
        labels[label] += 1
    entropy = 0.0
    for label in labels:
        prob = float(labels[label]) / len(data_set)
        entropy += (-prob * math.log2(prob))
    return entropy


if __name__ == '__main__':
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    entropy = cal_entropy(data_set)
    print(entropy)

    data_set = [[1, 1, 'yes'], [1, 1, 'not sure'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    entropy = cal_entropy(data_set)
    print(entropy)

运行结果：

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/entropy.py
0.9709505944546686
1.3709505944546687

Process finished with exit code 0

可以看出，数据集1中的元素相对于数据集2中的元素更加统一。直观来看，数据集1中只有两个分类，yes和no，而数据集2中增加了一个分类not sure，所以数据集2更加混乱。

如何划分数据集

一个数据集中的元素有很多属性（特征），我们需要确定划分属性的先后顺序。只要能得到属性划分的先后顺序，就很容易构建出最终的决策树。为了弄清楚这个问题，我们先要尝试划分数据集。

设原数据集为S，其属性有C1，C2，...，Cn，若按照第i个属性的某个值v划分数据集，则将得到新的数据子集Si。其中，Si是S中第i个属性的值为v的所有元素的集合，但是每个元素都去掉了第i个属性本身（减少属性，便于收敛）。

实例：

创建模块 split_data_set.py，输入以下代码：

def split_data_set(data_set, axis, value):
    """
    Split data set by giving axis and value, only return the sub data set which element value on axis equals to the
    giving value, and sub data will not include the value on axis index.
    :param data_set: data set
    :param axis: axis index
    :param value: giving value
    :return: sub data set
    """
    return_data_set = []
    for data in data_set:
        if data[axis] == value:
            reduced_data_set = data[:axis]
            reduced_data_set.extend(data[axis + 1:])
            return_data_set.append(reduced_data_set)
    return return_data_set


if __name__ == '__main__':
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    print("Original data set: %r" % data_set)
    result = split_data_set(data_set, 0, 1)
    print("Split index 0, value 1: %r" % result)
    result = split_data_set(data_set, 1, 0)
    print("Split index 1, value 0: %r" % result)

运行结果：

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/split_data_set.py
Original data set: [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
Split index 0, value 1: [[1, 'yes'], [1, 'yes'], [0, 'no']]
Split index 1, value 0: [[1, 'no']]

Process finished with exit code 0

如何选择最好的划分

现在我们知道了如何划分数据集，那么选择哪个属性进行划分才是最好的划分呢？我们应该优先选择能为我们带来最大信息增益的属性来划分数据集，因为这样划分后剩下的数据集将在最大程度上变得统一，能减少决策树的深度和结构。

假设数据集S有n个分类，C1，C2，...，Cn，按照第i个属性的所有值（vi1，vi2，...，vim）能划分出m个子集Si1，Si2，...，Sim，通过上面熵的计算方式可以得到这m个子集的熵分别为 Hi1，Hi2，...，Him，则这m个子集合并到一起的熵为：

$\large H_{i} = \sum_{j=1}^{m} p\left ( S_{ij} \right ) H_{ij}$

其中， $\large p\left ( S_{ij} \right )$ 为集合 $S_{ij}$ 相对于原集合S的概率，即 $S_{ij}$ 中元素的个数除以S中元素的个数。

由于 $S_{ij}$ 中的元素都没有第i个属性，所以其能够用于判断剩下的属性的熵，看其是否更加统一。依次计算按每个属性划分后的熵，选择信息增益最大的属性作为最好的数据划分。

实例：

创建模块 best_feature.py，输入以下代码：

import decision_tree.entropy as entropy
import decision_tree.split_data_set as sp


def choose_best_feature_to_split(data_set):
    feature_size = len(data_set[0]) - 1
    base_entropy = entropy.cal_entropy(data_set)
    best_info_gain = 0.0
    best_feature = -1
    for feature_index in range(feature_size):
        # All element feature values on i axis, this is a set, so will remove duplicated value
        feature_set = set([element[feature_index] for element in data_set])
        feature_entropy = 0.0
        for feature_value in feature_set:
            sub_data_set = sp.split_data_set(data_set, feature_index, feature_value)
            prob = len(sub_data_set) / float(len(data_set))
            feature_entropy += prob * entropy.cal_entropy(sub_data_set)
        info_gain = base_entropy - feature_entropy
        # print("Info gain on feature %r is %r" % (feature_index, info_gain))
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = feature_index
    return best_feature


if __name__ == "__main__":
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    best_feature = choose_best_feature_to_split(data_set)
    print("The best feature is: %r" % best_feature)

运行结果：

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/best_feature.py
The best feature is: 0

Process finished with exit code 0

输出为0，说明在当前数据集中，按照第一个属性的划分是最好的划分。

构建决策树

可以用递归的方式依次找出给定集合中的最优划分属性，然后根据该属性将给定数据集划分成若干个子集，再找出每个子集中的最优划分属性进行子集划分。递归退出有两个条件，满足一个即可，1）子集中所有类别都相同，2）检查完所有属性，如果此时存在不同类别，则选择类别最多的作为最终类别。

实例：

创建模块 create_tree.py，输入以下代码：

import operator

import decision_tree.best_feature as bf
import decision_tree.split_data_set as sp


def majority_count(class_list):
    class_count = {}
    for value in class_list:
        if value not in class_count.keys():
            class_count[value] = 0
        class_count[value] += 1
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]


def create_decision_tree(data_set, labels):
    class_list = [element[-1] for element in data_set]
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]
    if len(data_set[0]) == 1:
        return majority_count(class_list)
    best_feature = bf.choose_best_feature_to_split(data_set)
    best_feature_label = labels[best_feature]
    tree = {best_feature_label: {}}
    del (labels[best_feature])
    feature_set = set([element[best_feature] for element in data_set])
    for feature in feature_set:
        sub_labels = labels[:]
        sub_data_set = sp.split_data_set(data_set, best_feature, feature)
        tree[best_feature_label][feature] = create_decision_tree(sub_data_set, sub_labels)
    return tree


if __name__ == "__main__":
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    labels = ['A', 'B']
    tree = create_decision_tree(data_set, labels)
    print(tree)

注：A表示第一个属性的名称是A，B表示第二个属性的名称是B。

运行结果：

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/create_tree.py
{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}

Process finished with exit code 0

输出的决策树是一个dict格式，key代表的是label和该属性的所有值。注意，这里的决策树是普通的树，不是二叉树，因为任何属性都可能有多个值。

图形化查看决策树

可以使用 matplotlib 库画出决策树的图像，便于直观理解。

实例：

针对上面的决策树 {'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}，我们画出其图形。创建模块 plot_tree.py，并输入以下代码：

import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")


def get_num_leafs(tree):
    num_leafs = 0
    first_str = list(tree.keys())[0]
    second_dict = tree[first_str]
    for key in second_dict.keys():
        if type(second_dict[
                    key]).__name__ == 'dict':  # test to see if the nodes are dictonaires, if not they are leaf nodes
            num_leafs += get_num_leafs(second_dict[key])
        else:
            num_leafs += 1
    return num_leafs


def get_tree_depth(tree):
    max_depth = 0
    first_str = list(tree.keys())[0]
    second_dict = tree[first_str]
    for key in second_dict.keys():
        if type(second_dict[
                    key]).__name__ == 'dict':  # test to see if the nodes are dictonaires, if not they are leaf nodes
            this_depth = 1 + get_tree_depth(second_dict[key])
        else:
            this_depth = 1
        if this_depth > max_depth: max_depth = this_depth
    return max_depth


def plot_node(node_txt, center_pt, parent_pt, node_type):
    create_plot.ax1.annotate(node_txt, xy=parent_pt, xycoords='axes fraction',
                             xytext=center_pt, textcoords='axes fraction',
                             va="center", ha="center", bbox=node_type, arrowprops=arrow_args)


def plot_mid_text(center_pt, parent_pt, txt_str):
    x_mid = (parent_pt[0] - center_pt[0]) / 2.0 + center_pt[0]
    y_mid = (parent_pt[1] - center_pt[1]) / 2.0 + center_pt[1]
    create_plot.ax1.text(x_mid, y_mid, txt_str, va="center", ha="center", rotation=30)


def plot_tree(tree, parent_pt, node_txt):  # if the first key tells you what feat was split on
    num_leafs = get_num_leafs(tree)  # this determines the x width of this tree
    depth = get_tree_depth(tree)
    first_str = list(tree.keys())[0]  # the text label for this node should be this
    cntr_pt = (plot_tree.xOff + (1.0 + float(num_leafs)) / 2.0 / plot_tree.totalW, plot_tree.yOff)
    plot_mid_text(cntr_pt, parent_pt, node_txt)
    plot_node(first_str, cntr_pt, parent_pt, decisionNode)
    second_dict = tree[first_str]
    plot_tree.yOff = plot_tree.yOff - 1.0 / plot_tree.totalD
    for key in second_dict.keys():
        if type(second_dict[
                    key]).__name__ == 'dict':  # test to see if the nodes are dictonaires, if not they are leaf nodes
            plot_tree(second_dict[key], cntr_pt, str(key))  # recursion
        else:  # it's a leaf node print the leaf node
            plot_tree.xOff = plot_tree.xOff + 1.0 / plot_tree.totalW
            plot_node(second_dict[key], (plot_tree.xOff, plot_tree.yOff), cntr_pt, leafNode)
            plot_mid_text((plot_tree.xOff, plot_tree.yOff), cntr_pt, str(key))
    plot_tree.yOff = plot_tree.yOff + 1.0 / plot_tree.totalD


def create_plot(tree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    create_plot.ax1 = plt.subplot(111, frameon=False, **axprops)  # no ticks
    # createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo puropses
    plot_tree.totalW = float(get_num_leafs(tree))
    plot_tree.totalD = float(get_tree_depth(tree))
    plot_tree.xOff = -0.5 / plot_tree.totalW;
    plot_tree.yOff = 1.0;
    plot_tree(tree, (0.5, 1.0), '')
    plt.show()


def retrieve_tree(i):
    list_of_tree = [{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}},
                    {'A': {0: 'no', 1: {'B': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                    ]
    return list_of_tree[i]


if __name__ == '__main__':
    tree = retrieve_tree(0)
    print(tree)
    create_plot(tree)

运行结果：

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/plot_tree.py
{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}

测试决策树

构建完决策树以后，我们希望通过该决策树来判断给定数据的分类。方法是根据给定数据的属性值依次遍历决策树的节点，从而找到最后的分类。

实例：

创建模块 classify.py，输入以下代码：

import decision_tree.create_tree as ct


def classify(tree, feature_labels, test_vec):
    first_str = list(tree.keys())[0]
    second_dict = tree[first_str]
    feat_index = feature_labels.index(first_str)
    for value in second_dict.keys():
        if test_vec[feat_index] == value:
            if type(second_dict[value]).__name__ == 'dict':
                class_label = classify(second_dict[value], feature_labels, test_vec)
            else:
                class_label = second_dict[value]
    return class_label


if __name__ == '__main__':
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    labels = ['A', 'B']
    tree = ct.create_decision_tree(data_set, ['A', 'B'])
    print(tree)
    obj_leabl = classify(tree, labels, [1, 0])
    print("The label of [1, 0] is %r" % obj_leabl)
    obj_leabl = classify(tree, labels, [1, 1])
    print("The label of [1, 1] is %r" % obj_leabl)
    obj_leabl = classify(tree, labels, [0, 1])
    print("The label of [0, 1] is %r" % obj_leabl)

运行结果：

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/classify.py
{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}
The label of [1, 0] is 'no'
The label of [1, 1] is 'yes'
The label of [0, 1] is 'no'

Process finished with exit code 0

存储和加载决策树

构建决策树的过程往往会消耗比较多的时间，我们可以将构建好的决策树存储到磁盘空间，需要使用的时候再加载，这样就可以不用重复构建决策树了。这可以通过Python内置模块 pickle来实现。

实例：

创建模块 store.py，并输入以下代码：

import decision_tree.create_tree as ct
import decision_tree.classify as cl
import pickle


def store_tree(tree, file_path):
    with open(file_path, 'wb') as f:
        pickle.dump(tree, f)


def restore_tree(file_path):
    with open(file_path, 'rb') as f:
        return pickle.load(f)


if __name__ == '__main__':
    data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    tree = ct.create_decision_tree(data_set, ['A', 'B'])
    print(tree)
    store_tree(tree, './my_tree.data')

    tree = restore_tree('./my_tree.data')
    print(tree)

    obj_leabl = cl.classify(tree, ['A', 'B'], [1, 1])
    print("The label of [1, 1] is %r" % obj_leabl)

运行结果：

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/store.py
{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}
{'A': {0: 'no', 1: {'B': {0: 'no', 1: 'yes'}}}}
The label of [1, 1] is 'yes'

Process finished with exit code 0

可以看出此时能够将构建好的决策树存储到文件 my_tree.data，并从该文件加载决策树。

ID3的优缺点

我们目前讨论的决策树都基于ID3，该算法的优点是相对比较简单，便于实现。缺点是如果匹配项太多可能出现过渡匹配现象，其无法裁剪不必要的叶子节点。另外，ID3只能划分标称型数据，无法处理连续的数值数据。

案例 - 使用决策树预测隐形眼镜类型

最后，我们通过一个案例来综合演练决策树的实际应用，使用决策树来预测是否应该给患者配隐形眼镜，以及配什么材质的隐形眼镜。

数据集属性和取值：

属性	取值
age（年龄）	pre（小孩） young（年轻人） presbyopic（老年人）
prescript（症状）	myopia（近视眼） hyperopia（远视眼）
astigmatic（是否散光）	yes no
tearRate（眼泪数量）	normal（正常） reduced（减少）

注：括号内是中文说明，加粗黑体表示属性名和取值。

数据集分类：

分类	说明
no lenses	不能使用隐形眼镜
soft	使用软材质隐形眼镜
hard	使用硬材质隐形眼镜

数据集保存到文件 lenses.txt，每列分别代表年龄，症状，是否散光，眼泪数量，分类：

young	myope	no	reduced	no lenses
young	myope	no	normal	soft
young	myope	yes	reduced	no lenses
young	myope	yes	normal	hard
young	hyper	no	reduced	no lenses
young	hyper	no	normal	soft
young	hyper	yes	reduced	no lenses
young	hyper	yes	normal	hard
pre	myope	no	reduced	no lenses
pre	myope	no	normal	soft
pre	myope	yes	reduced	no lenses
pre	myope	yes	normal	hard
pre	hyper	no	reduced	no lenses
pre	hyper	no	normal	soft
pre	hyper	yes	reduced	no lenses
pre	hyper	yes	normal	no lenses
presbyopic	myope	no	reduced	no lenses
presbyopic	myope	no	normal	no lenses
presbyopic	myope	yes	reduced	no lenses
presbyopic	myope	yes	normal	hard
presbyopic	hyper	no	reduced	no lenses
presbyopic	hyper	no	normal	soft
presbyopic	hyper	yes	reduced	no lenses
presbyopic	hyper	yes	normal	no lenses

创建模块 lenses.py，并输入以下代码：

import decision_tree.create_tree as ct
import decision_tree.plot_tree as pt
import decision_tree.classify as cf


def get_tree():
    with open('./lenses.txt') as f:
        lenses = [inst.strip().split('\t') for inst in f.readlines()]
        tree = ct.create_decision_tree(lenses, get_lense_labels())
        return tree


def get_lense_labels():
    labels = ['age', 'prescript', 'astigmatic', 'tearRate']
    return labels


def plot_tree():
    tree = get_tree()
    pt.create_plot(tree)


def test_classify():
    tree = get_tree()
    print(tree)
    labels = get_lense_labels()
    test_data = ['young', 'hyper', 'yes', 'normal']
    result = cf.classify(tree, labels, test_data)
    print("The label of data %r is %r" % (test_data, result))


if __name__ == '__main__':
    test_classify()
    plot_tree()

运行结果：

D:\work\python_workspace\machine_learning\venv\Scripts\python.exe D:/work/python_workspace/machine_learning/decision_tree/lenses.py
{'tearRate': {'normal': {'astigmatic': {'no': {'age': {'pre': 'soft', 'presbyopic': {'prescript': {'myope': 'no lenses', 'hyper': 'soft'}}, 'young': 'soft'}}, 'yes': {'prescript': {'myope': 'hard', 'hyper': {'age': {'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}}}}}, 'reduced': 'no lenses'}}
The label of data ['young', 'hyper', 'yes', 'normal'] is 'hard'