Machine Learning Assignment 4 - Decision Trees and Pruning

Latest recommended article published 2024-08-12 16:23:26

拉克因


Article link: https://blog.csdn.net/dapanbest/article/details/78281201


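Decision Trees and Two Kinds of Pruning (Pre-pruning and Post-pruning)

First, let me complain about how brutal this assignment was! Hand-writing a decision tree would have been enough, but it also had to be pruned, and in two different ways! My hands and eyes were worn out by the end, but I just about met the requirements. That said, my code certainly has room for improvement and may even contain bugs, so if you spot any problem while reading, please point it out in a comment or a private message. Thanks in advance!

So, the assignment itself: use information entropy as the splitting criterion to build a decision tree, then use a UCI dataset to compare three variants: no pruning, pre-pruning, and post-pruning.

Step 1: Plant the Tree

Let's start by planting a decision tree. As usual, the algorithm itself is shown as a figure (mostly because I am too lazy to type it out). The figure is from Chapter 4 of Zhou Zhihua's Machine Learning.

[Figure: the decision tree learning algorithm]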

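A decision tree chooses its splits based on information entropy. The rule in the book is to split on the attribute that yields the largest information gain, which can be written as:

$D - \sum_{i=1}^{j} D_i \cdot \frac{c_i}{c}$

where $D$ is the information entropy before the split, $j$ is the number of distinct values of the attribute, $D_i$ is the information entropy of the $i$-th subset after splitting on the attribute, $c_i$ is the number of samples in the $i$-th subset, and $c$ is the total number of samples. For example, to evaluate a split of the watermelon dataset on the "texture" attribute, which takes the three values clear, slightly blurry, and blurry, the information gain is the entropy before the split minus the weighted average of the entropies of the three resulting subsets.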
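As a quick arithmetic illustration with made-up numbers (not taken from the watermelon dataset): suppose 8 samples have entropy $D = 1$ before the split, and a three-valued attribute divides them into subsets of sizes 4, 2, and 2 whose entropies are roughly 0.811, 1, and 0. Under those assumptions the information gain works out to:

$1 - \left(0.811 \cdot \frac{4}{8} + 1 \cdot \frac{2}{8} + 0 \cdot \frac{2}{8}\right) \approx 1 - 0.656 = 0.344$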

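Information entropy can loosely be described as a weighted sum of the proportions of positive and negative samples within one branch after the split. That description is not accurate, of course, since it ignores the logarithm, so here is the formula:

$-\sum_{i=1}^{j} p_i \log_2 p_i$

where $p_i$ is the proportion of samples in the branch that belong to class $i$, and $j$ here counts the classes. Lower entropy is better, and higher information gain is better. In the information gain formula, $D$ is the same for a given set of samples no matter which attribute is tried, so the smaller $\sum_{i=1}^{j} D_i \cdot \frac{c_i}{c}$ is, the better the split. So I took a shortcut: minimize $\sum_{i=1}^{j} D_i \cdot \frac{c_i}{c}$ in order to maximize the information gain. The corresponding function is: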

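import numpy as np

# Return the weighted sum of branch entropies when splitting on a given attribute
# label: labels of the data
# attr: values of the attribute used for the split (one per sample)
# Declared static because it takes no self but is called as self.__get_info_entropy(...)
@staticmethod
def __get_info_entropy(label, attr):
    result = 0.0
    for this_attr in np.unique(attr):
        # Labels of the samples that fall into this branch
        sub_label, entropy = label[np.where(attr == this_attr)[0]], 0.0
        for this_label in np.unique(sub_label):
            p = len(np.where(sub_label == this_label)[0]) / len(sub_label)
            entropy -= p * np.log2(p)
        # Weight the branch entropy by the fraction of samples in the branch
        result += len(sub_label) / len(label) * entropy
    return result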
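To convince myself that the shortcut is safe, here is a minimal standalone sketch using only the standard library. The helper names `entropy` and `weighted_entropy` and the toy data are my own assumptions for illustration, not part of the assignment code. It checks that the attribute with the smallest weighted entropy is also the one with the largest information gain:

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def weighted_entropy(labels, attr_values):
    """Size-weighted sum of branch entropies after splitting on one attribute."""
    branches = {}
    for label, value in zip(labels, attr_values):
        branches.setdefault(value, []).append(label)
    return sum(len(b) / len(labels) * entropy(b) for b in branches.values())


# Hypothetical toy data: 4 samples and 2 candidate attributes.
labels = [1, 1, 0, 0]
attrs = {
    'separates_classes': ['a', 'a', 'b', 'b'],
    'uninformative': ['a', 'b', 'a', 'b'],
}

parent = entropy(labels)
by_entropy = min(attrs, key=lambda name: weighted_entropy(labels, attrs[name]))
by_gain = max(attrs, key=lambda name: parent - weighted_entropy(labels, attrs[name]))
print(by_entropy, by_gain)
```

Both selections name the same attribute because the parent entropy is a constant that is subtracted from every candidate.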

Then, following the algorithm in the book, the core of the decision tree looks like this:

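# Recursively build a decision tree
# label: 1-D integer array of length N holding the class label of each sample
# attr: N * M array, one row of attribute values per sample; the number of columns shrinks as the tree grows
# attr_idx: index of each remaining attribute in the original attribute set, used when building the tree
# pre_pruning: whether to pre-prune
# check_attr: attributes of the validation data used for pre-pruning (keeps all original columns)
# check_label: labels of the validation data used for pre-pruning
def __run_build(self, label, attr, attr_idx,
                pre_pruning, check_attr=None, check_label=None):
    node, right_count = {}, None
    max_type = np.argmax(np.bincount(label))
    if len(np.unique(label)) == 1:
        # If all samples belong to the same class C, mark the node as C
        node['type'] = label[0]
        return node
    if attr is None or attr.shape[1] == 0 or len(np.unique(attr, axis=0)) == 1:
        # If attr is empty or all samples take identical values on attr,
        # mark the node as the class with the most samples
        node['type'] = max_type
        return node
    attr_trans = np.transpose(attr)
    min_entropy, best_attr = np.inf, None
    # Compute the weighted entropy of every candidate split (it plays the same role
    # as information gain) and use it to find the best attribute to split on
    if pre_pruning:
        # Correct predictions on the validation data if this node stays a leaf
        right_count = len(np.where(check_label == max_type)[0])
    for this_attr in attr_trans:
        entropy = self.__get_info_entropy(label, this_attr)
        if entropy < min_entropy:
            min_entropy = entropy
            best_attr = this_attr
    # branch_attr_idx: position of the chosen attribute among the remaining attributes
    branch_attr_idx = np.where((attr_trans == best_attr).all(1))[0][0]
    # orig_attr_idx: position of the chosen attribute in the original attribute set;
    # the validation data keeps every original column, so it must be indexed with this
    orig_attr_idx = attr_idx[branch_attr_idx]
    if pre_pruning:
        sub_right_count = 0
        check_attr_trans = check_attr.transpose()
        for val in np.unique(best_attr):
            # Tentatively split on the chosen attribute and count correct predictions
            # branch_data_idx: indices of the training samples that fall into branch val
            branch_data_idx = np.where(best_attr == val)[0]
            # predict_label: predicted class of this branch after one split
            predict_label = np.argmax(np.bincount(label[branch_data_idx]))
            # check_data_idx: indices of the validation samples whose chosen attribute equals val
            check_data_idx = np.where(check_attr_trans[orig_attr_idx] == val)[0]
            # check_branch_label: labels of the validation samples that fall into this branch
            check_branch_label = check_label[check_data_idx]
            # A validation sample is classified correctly if its label equals the branch prediction
            sub_right_count += len(np.where(check_branch_label == predict_label)[0])
        if sub_right_count <= right_count:
            # If splitting is no more accurate than not splitting, prune
            node['type'] = max_type
            return node
    values = []
    for val in np.unique(best_attr):
        values.append(val)
        branch_data_idx = np.where(best_attr == val)[0]
        if len(branch_data_idx) == 0:
            new_node = {'type': np.argmax(np.bincount(label))}
        else:
            # Build the data for this branch and recurse
            branch_label = label[branch_data_idx]
            branch_attr = np.delete(attr_trans, branch_attr_idx,
                                    axis=0).transpose()[branch_data_idx]
            sub_check_attr, sub_check_label = check_attr, check_label
            if pre_pruning:
                # Only validation samples that reach this branch should judge its sub-splits
                check_data_idx = np.where(check_attr[:, orig_attr_idx] == val)[0]
                sub_check_attr = check_attr[check_data_idx]
                sub_check_label = check_label[check_data_idx]
            new_node = self.__run_build(branch_label, branch_attr,
                                        np.delete(attr_idx, branch_attr_idx, axis=0),
                                        pre_pruning, sub_check_attr, sub_check_label)
        node[str(val)] = new_node
    node['attr'] = orig_attr_idx
    node['type'] = max_type
    node['values'] = values
    return node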
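The builder returns a nested dict: an internal node stores 'attr' (the original attribute index), 'values', 'type' (the majority class), and one child per attribute value under the key str(val), while a leaf stores only 'type'. The verification script later calls a predict method that is not shown in this post, so the following is only a hedged sketch of how such a lookup might work. The standalone function `predict`, its fallback for unseen values, and the hand-built tree are all assumptions of mine:

```python
def predict(node, sample):
    """Walk a nested-dict tree and return the predicted class for one sample."""
    while 'attr' in node:
        key = str(sample[node['attr']])
        if key not in node:
            # Value never seen at this node during training:
            # fall back to the node's majority class (an assumed policy).
            break
        node = node[key]
    return node['type']


# Hand-built tree in the same shape as the builder's output (hypothetical values).
tree = {
    'attr': 0, 'type': 1, 'values': [0, 1],
    '0': {'type': 0},
    '1': {'attr': 2, 'type': 1, 'values': [0, 1],
          '0': {'type': 1},
          '1': {'type': 0}},
}

print(predict(tree, [0, 5, 1]))  # attribute 0 is 0, so the left leaf decides
print(predict(tree, [1, 5, 0]))  # attribute 0 is 1, then attribute 2 is 0
print(predict(tree, [7, 5, 0]))  # unseen value 7, falls back to the root's 'type'
```

Because a pruned node keeps 'type' but loses 'attr', the same loop also treats pruned nodes as leaves.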

Step 2: Prune

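Pre-pruning happens while the tree is being built. In the tree-planting code, if pre-pruning is enabled, each candidate split is checked against the validation data before it is made. First we count the correct predictions without the split, right_count, then we sum the correct predictions over all branches after the split. If that sum is less than or equal to right_count, the split is not made. The pre-pruning logic is the code inside the if pre_pruning: blocks above.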
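Stripped of the tree bookkeeping, the pre-pruning test for a single candidate split can be sketched as below. This is a standalone illustration using plain lists, and the names `majority` and `split_helps` as well as the toy data are hypothetical, not taken from the assignment code:

```python
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def split_helps(train_labels, train_values, val_labels, val_values):
    """Return True if splitting on one attribute raises validation hits.

    train_values / val_values hold that attribute's value for each sample.
    """
    # Hits on the validation data if the node stays a leaf.
    leaf_label = majority(train_labels)
    hits_without_split = sum(1 for y in val_labels if y == leaf_label)

    # Hits if every branch predicts its own training majority.
    hits_with_split = 0
    for value in set(train_values):
        branch = [y for y, v in zip(train_labels, train_values) if v == value]
        branch_label = majority(branch)
        hits_with_split += sum(
            1 for y, v in zip(val_labels, val_values)
            if v == value and y == branch_label
        )
    # Same rule as described above: split only on a strict improvement.
    return hits_with_split > hits_without_split


train_labels = [1, 1, 1, 0, 0]
train_values = ['a', 'a', 'a', 'b', 'b']
print(split_helps(train_labels, train_values, [1, 0, 0], ['a', 'b', 'b']))
print(split_helps(train_labels, train_values, [1, 1, 1], ['a', 'b', 'b']))
```

In the first call the split lifts the validation hits from 1 to 3, so it is kept. In the second call the leaf already gets all 3 right and the split would drop that to 1, so the node would be pruned.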

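Post-pruning first builds the complete tree and then prunes it recursively. The test it applies is essentially the same as in pre-pruning. Here is the code: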

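# Post-pruning
# node: the node currently being examined and possibly pruned
# check_attr: attributes of the validation data
# check_label: labels of the validation data
# Returns the number of validation samples the (possibly pruned) subtree classifies correctly
def __post_pruning(self, node, check_attr, check_label):
    check_attr_trans = check_attr.transpose()
    if node.get('attr') is None:
        # A missing attr marks a leaf node
        return len(np.where(check_label == node['type'])[0])
    sub_right_count = 0
    for val in node['values']:
        sub_node = node[str(val)]
        # Indices of the validation samples that fall into branch val at this node
        idx = np.where(check_attr_trans[node['attr']] == val)[0]
        # Recurse into the child node with only those samples
        sub_right_count += self.__post_pruning(sub_node, check_attr[idx], check_label[idx])
    if sub_right_count <= len(np.where(check_label == node['type'])[0]):
        # The subtree is no better than a single leaf: collapse this node into a leaf
        for val in node['values']:
            del node[str(val)]
        del node['values']
        del node['attr']
        # np.where returns a tuple, so take [0] before counting the matches
        return len(np.where(check_label == node['type'])[0])
    return sub_right_count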
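To see the bottom-up behaviour on something small, here is a rough pure-Python re-sketch of the same idea that works on plain lists instead of NumPy arrays. The function name `post_prune` and the two hand-built trees are assumptions for illustration only:

```python
def post_prune(node, val_attr, val_label):
    """Prune bottom-up; return validation hits of the (possibly pruned) subtree.

    val_attr is a list of samples (each a list of attribute values),
    val_label is the matching list of labels.
    """
    leaf_hits = sum(1 for y in val_label if y == node['type'])
    if 'attr' not in node:
        return leaf_hits
    subtree_hits = 0
    for value in node['values']:
        rows = [i for i, x in enumerate(val_attr) if x[node['attr']] == value]
        subtree_hits += post_prune(node[str(value)],
                                   [val_attr[i] for i in rows],
                                   [val_label[i] for i in rows])
    if subtree_hits <= leaf_hits:
        # The split does not beat a single leaf: collapse this node.
        for value in node['values']:
            del node[str(value)]
        del node['values'], node['attr']
        return leaf_hits
    return subtree_hits


def make_tree():
    """Hypothetical one-split tree on attribute 0."""
    return {'attr': 0, 'type': 1, 'values': [0, 1],
            '0': {'type': 0}, '1': {'type': 1}}


val_attr = [[0], [0], [1]]

pruned = make_tree()
pruned_hits = post_prune(pruned, val_attr, [1, 1, 1])  # the split hurts here
kept = make_tree()
kept_hits = post_prune(kept, val_attr, [0, 0, 1])      # the split helps here
print(pruned_hits, pruned)
print(kept_hits, 'attr' in kept)
```

With the first set of validation labels a single leaf gets all 3 samples right while the split gets only 1, so the node collapses to a leaf. With the second set the split gets 3 right against 1 for the leaf, so it survives.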

Finally, the constructor gets the control flow and the data preparation code:

def __init__(self, label, attr, pruning=None):
    self.__root = None
    # The first third of the data is held out as validation data for pruning;
    # the remaining two thirds are used to build the tree
    boundary = len(label) // 3
    if pruning is None:
        self.__root = self.__run_build(label[boundary:], attr[boundary:],
                                       np.array(range(len(attr.transpose()))), False)
        return
    if pruning == 'Pre':
        self.__root = self.__run_build(label[boundary:], attr[boundary:],
                                       np.array(range(len(attr.transpose()))),
                                       True, attr[0:boundary], label[0:boundary])
    elif pruning == 'Post':
        # Build the full tree first, then prune it against the validation data
        self.__root = self.__run_build(label[boundary:], attr[boundary:],
                                       np.array(range(len(attr.transpose()))), False)
        self.__post_pruning(self.__root, attr[0:boundary], label[0:boundary])
    else:
        raise RuntimeError('Unrecognized argument: %s' % pruning)
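The slicing in the constructor amounts to a simple hold-out split: the first third of the (already shuffled) data becomes the validation set and the rest becomes the training set. A minimal sketch of just that step, with a hypothetical helper name `holdout_split` and dummy data:

```python
def holdout_split(samples, labels):
    """First third -> validation set, remaining two thirds -> training set."""
    boundary = len(labels) // 3
    validation = (samples[:boundary], labels[:boundary])
    training = (samples[boundary:], labels[boundary:])
    return training, validation


# Dummy data: 10 one-attribute samples with alternating labels.
samples = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]
(train_x, train_y), (val_x, val_y) = holdout_split(samples, labels)
print(len(train_x), len(val_x))
```

Since the boundary is computed with integer division, 10 samples give 3 for validation and 7 for training.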

That completes the decision tree code.

Step 3: Verify

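Verification uses car.data from the UCI datasets. The attributes in the original data are written as English strings, so after downloading it I replaced them with numbers and also dropped some attribute values (for example v-high and 5more), which reduced the dataset from over 1,000 rows to a little over 700. The modified data can be found in the Data directory of this assignment's folder.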
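I did the replacement by hand, and the exact mapping I used is not recorded here. For anyone who wants to automate it, one possible approach is sketched below: number each column's categories in order of first appearance. The function `encode_columns` and the three sample rows are illustrative assumptions (the rows only imitate the car.data format), and the sketch does not reproduce the step of dropping attribute values:

```python
def encode_columns(rows):
    """Map each column's string categories to integers 0, 1, 2, ...

    Codes are assigned per column in order of first appearance.
    Returns the encoded rows and the per-column mappings.
    """
    mappings = [{} for _ in rows[0]]
    encoded = []
    for row in rows:
        encoded.append([
            # setdefault assigns the next free code to a value seen for the first time
            mappings[col].setdefault(value, len(mappings[col]))
            for col, value in enumerate(row)
        ])
    return encoded, mappings


# Illustrative lines in the comma-separated car.data style.
lines = [
    'vhigh,vhigh,2,2,small,low,unacc',
    'high,med,4,4,big,high,acc',
    'vhigh,low,2,more,med,med,unacc',
]
rows = [line.strip().split(',') for line in lines]
encoded, mappings = encode_columns(rows)
for row in encoded:
    print(','.join(str(v) for v in row))
```

Each printed line has seven integers, which matches the shape the loading code below expects (six attributes followed by the label).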

First read the file, then shuffle the rows, train on the shuffled data and labels, and finally run the verification.

import numpy as np
import DecisionTree  # the module that defines the Tree class shown above

if __name__ == '__main__':
    print('Preparing data and planting trees...')
    # Each line holds 7 comma-separated integers: 6 attributes followed by the label
    with open('Data/car.data') as file:
        lines = file.readlines()
    raw_data = np.zeros([len(lines), 7], np.int32)
    for idx in range(len(lines)):
        raw_data[idx] = np.array(lines[idx].split(','), np.int32)
    np.random.shuffle(raw_data)
    data = raw_data[:, 0:6]
    label = raw_data[:, 6]
    tree_no_pruning = DecisionTree.Tree(label, data, None)
    tree_pre_pruning = DecisionTree.Tree(label, data, 'Pre')
    tree_post_pruning = DecisionTree.Tree(label, data, 'Post')
    # The first third of the shuffled data: the same slice the constructor
    # holds out from tree building and uses for pruning
    test_count = len(label) // 3
    test_data, test_label = data[0:test_count], label[0:test_count]
    times_no_pruning, times_pre_pruning, times_post_pruning = 0, 0, 0
    print('Checking results (%d validation samples in total)' % test_count)
    for idx in range(test_count):
        if tree_no_pruning.predict(test_data[idx]) == test_label[idx]:
            times_no_pruning += 1
        if tree_pre_pruning.predict(test_data[idx]) == test_label[idx]:
            times_pre_pruning += 1
        if tree_post_pruning.predict(test_data[idx]) == test_label[idx]:
            times_post_pruning += 1
    print('[No pruning]: %d hits, hit rate %.2f%%' % (times_no_pruning,
          times_no_pruning * 100 / test_count))
    print('[Pre-pruning]: %d hits, hit rate %.2f%%' % (times_pre_pruning,
          times_pre_pruning * 100 / test_count))
    print('[Post-pruning]: %d hits, hit rate %.2f%%' % (times_post_pruning,
          times_post_pruning * 100 / test_count))