Watermelon Book: Decision Tree Code

云墨山海

Last modified: 2023-12-26 23:25:05

Tags: decision tree, algorithm, machine learning

First published: 2023-12-23 21:02:54

Article link: https://blog.csdn.net/qq_46346707/article/details/135172956

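This post implements the decision tree from Chapter 4 of Professor Zhou Zhihua's Machine Learning in code. I am a beginner, so corrections are welcome.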

    # Attribute set: attribute name -> column index in each sample
    A = {'色泽': 0, '根蒂': 1, '敲声': 2, '纹理': 3, '脐部': 4, '触感': 5, '好瓜': 6}
    # Possible values of each attribute, listed in column order
    B = [
        ['青绿', '乌黑', '浅白'],
        ['蜷缩', '稍蜷', '硬挺'],
        ['浊响', '沉闷', '清脆'],
        ['清晰', '稍糊', '模糊'],
        ['凹陷', '稍凹', '平坦'],
        ['硬滑', '软粘'],
        ['是', '否']
    ]
    # Watermelon dataset 2.0
    D = [
        ['青绿', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '是'],
        ['乌黑', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', '是'],
        ['乌黑', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '是'],
        ['青绿', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', '是'],
        ['浅白', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '是'],
        ['青绿', '稍蜷', '浊响', '清晰', '稍凹', '软粘', '是'],
        ['乌黑', '稍蜷', '浊响', '稍糊', '稍凹', '软粘', '是'],
        ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '硬滑', '是'],
        ['乌黑', '稍蜷', '沉闷', '稍糊', '稍凹', '硬滑', '否'],
        ['青绿', '硬挺', '清脆', '清晰', '平坦', '软粘', '否'],
        ['浅白', '硬挺', '清脆', '模糊', '平坦', '硬滑', '否'],
        ['浅白', '蜷缩', '浊响', '模糊', '平坦', '软粘', '否'],
        ['青绿', '稍蜷', '浊响', '稍糊', '凹陷', '硬滑', '否'],
        ['浅白', '稍蜷', '沉闷', '稍糊', '凹陷', '硬滑', '否'],
        ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '软粘', '否'],
        ['浅白', '蜷缩', '浊响', '模糊', '平坦', '硬滑', '否'],
        ['青绿', '蜷缩', '沉闷', '稍糊', '稍凹', '硬滑', '否']
    ]

Information entropy

Information entropy formula: $Ent(D)=-\sum_{k=1}^{|\mathcal{Y}|}p_k\log_2 p_k$
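As a quick check of the formula, here is a minimal sketch that evaluates it directly on a list of class labels. The function name `entropy` is an assumption made for illustration and is not part of the code in this post. The label list reproduces the class balance of watermelon dataset 2.0 (8 positive, 9 negative samples).

```python
import math
from collections import Counter


def entropy(labels):
    """Ent(D) for a list of class labels.

    Classes that do not occur never show up in the Counter,
    so the 0 * log2(0) case is avoided by construction.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


# Class balance of watermelon dataset 2.0: 8 good melons, 9 bad ones
labels = ['是'] * 8 + ['否'] * 9
print(round(entropy(labels), 3))  # 0.998
```

A set that contains a single class has entropy 0, and a perfectly balanced two-class set has entropy 1.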

Computing the entropy of the root node

import numpy as np

# Number of samples in D whose column `attribute` equals `attribute_val`
def attribute_val_count_in_all(D, attribute, attribute_val):
    count = 0
    for row in D:
        if row[attribute] == attribute_val:
            count += 1
    return count

# Information entropy
def ent(D, A, attribute=None, attribute_val=None):
    # Number of positive samples in D
    positive_example = attribute_val_count_in_all(D, A['好瓜'], '是')
    # Root node: entropy of the whole sample set
    if attribute is None and attribute_val is None:
        # Total number of samples
        count = len(D)
        # Number of positive samples
        positive_example_count = positive_example
        # Number of negative samples
        negative_example_count = count - positive_example_count
    else:
        ...
    if count == 0:
        return 0
    # Proportion of positives; a proportion of 0 is replaced by 1 so that its term becomes 1 * log2(1) = 0
    positive_proportion = positive_example_count / count if positive_example_count / count != 0 else 1
    # Proportion of negatives, with the same substitution
    negative_proportion = negative_example_count / count if negative_example_count / count != 0 else 1
    # Information entropy, rounded to three decimals as in the book
    res = -(positive_proportion * np.log2(positive_proportion) + negative_proportion * np.log2(negative_proportion))
    return np.around(res, 3)

When attribute=None and attribute_val=None, the function computes the entropy of the root node; otherwise it computes the entropy of a branch node.

Computing the entropy of a branch node

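# Information entropy
def ent(D, A, attribute=None, attribute_val=None):
    ...
    if attribute is None and attribute_val is None:
        ...
    else:
        # Number of samples in D that take this attribute value
        count = attribute_val_count_in_all(D, attribute, attribute_val)
        positive_example_count = 0
        # Positive samples that take this attribute value.
        # Checking the label of every row avoids assuming that the positives come first in D.
        for row in D:
            if row[attribute] == attribute_val and row[A['好瓜']] == '是':
                positive_example_count += 1
        # Negative samples that take this attribute value
        negative_example_count = count - positive_example_count
    if count == 0:
        return 0
    # Proportion of positives; a proportion of 0 is replaced by 1 so that its term becomes 1 * log2(1) = 0
    positive_proportion = positive_example_count / count if positive_example_count / count != 0 else 1
    # Proportion of negatives, with the same substitution
    negative_proportion = negative_example_count / count if negative_example_count / count != 0 else 1
    # Information entropy, rounded to three decimals as in the book
    res = -(positive_proportion * np.log2(positive_proportion) + negative_proportion * np.log2(negative_proportion))
    return np.around(res, 3)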

Test

The test results match the answers given in the book.
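To make the branch-node numbers concrete, the sketch below recomputes the entropy of each 色泽 branch. It is independent of the `ent` function in this post: the helper `branch_entropy` and the `pairs` list are assumptions for illustration, with `pairs` holding only the 色泽 and 好瓜 columns copied from the 17 rows of the dataset.

```python
import math

# (色泽, 好瓜) columns of watermelon dataset 2.0, in the original row order
pairs = [
    ('青绿', '是'), ('乌黑', '是'), ('乌黑', '是'), ('青绿', '是'),
    ('浅白', '是'), ('青绿', '是'), ('乌黑', '是'), ('乌黑', '是'),
    ('乌黑', '否'), ('青绿', '否'), ('浅白', '否'), ('浅白', '否'),
    ('青绿', '否'), ('浅白', '否'), ('乌黑', '否'), ('浅白', '否'),
    ('青绿', '否'),
]


def branch_entropy(pairs, value):
    """Entropy of the labels of the samples whose attribute equals `value`."""
    labels = [label for v, label in pairs if v == value]
    if not labels:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / len(labels)
        result -= p * math.log2(p)
    return result


for value in ['青绿', '乌黑', '浅白']:
    print(value, round(branch_entropy(pairs, value), 3))
# 青绿 1.0
# 乌黑 0.918
# 浅白 0.722
```

The three branches hold 6, 6 and 5 samples, with 3, 4 and 1 positives respectively.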

Information gain

Information gain formula: $Gain(D,a)=Ent(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$

# Information gain
def gain(D, B, A, attribute):
    sum_ = 0
    for attribute_val in B[A[attribute]]:
        # Number of samples in D that take this attribute value
        count = attribute_val_count_in_all(D, A[attribute], attribute_val)
        # Weighted branch entropy, rounded to three decimals at every step
        sum_ += np.around((count / len(D)) * ent(D, A, A[attribute], attribute_val), 3)
    res = ent(D, A) - sum_
    return np.around(res, 3)
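The same computation can be written from class counts alone, which makes the formula easy to check by hand. This is a sketch under assumptions: `entropy_from_counts` and `gain_from_counts` are hypothetical helpers, not functions from this post, and the counts are the positive/negative tallies of the 色泽 split (青绿 3/3, 乌黑 4/2, 浅白 1/4).

```python
import math


def entropy_from_counts(counts):
    """Entropy of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gain_from_counts(parent, children):
    """Gain(D, a) = Ent(D) - sum(|D^v| / |D| * Ent(D^v)) without intermediate rounding."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy_from_counts(child)
                   for child in children)
    return entropy_from_counts(parent) - weighted


# Root node: 8 positive, 9 negative; children: the three 色泽 branches
g = gain_from_counts((8, 9), [(3, 3), (4, 2), (1, 4)])
print(round(g, 3))  # 0.108
```

Unrounded arithmetic gives about 0.108 here. Rounding every intermediate value to three decimals, as the `gain` function in this post does, yields 0.998 - 0.889 = 0.109, which is the figure printed in the book.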

Test output: [figure: screenshot of the test output]

Algorithm flow

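Three conditions end the recursion:

1. All samples at the current node carry the same label.
2. The current attribute set is empty, or all samples take the same value on every attribute.
3. The sample set at the current node is empty.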
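The three conditions can be sketched as one small predicate. This is an illustration only, not the code used later in this post: the function `stopping_condition`, its return strings, and the convention that `attribute_columns` is a list of column indices with the class label in the last column are all assumptions.

```python
def stopping_condition(rows, attribute_columns):
    """Return the stopping condition that applies, or None to keep splitting.

    rows: samples, each a list whose last element is the class label.
    attribute_columns: indices of the attributes still available for splitting.
    """
    # Condition 3: nothing reached this node
    if not rows:
        return 'empty sample set'
    # Condition 1: every sample has the same label
    if len({row[-1] for row in rows}) == 1:
        return 'single class'
    # Condition 2: no attribute left, or no remaining attribute separates the samples
    first = rows[0]
    same_values = all(row[i] == first[i]
                      for row in rows for i in attribute_columns)
    if not attribute_columns or same_values:
        return 'no attribute can split'
    return None


print(stopping_condition([['青绿', '是'], ['青绿', '否']], [0]))
# no attribute can split
```

In the first and second cases the node becomes a leaf labelled from its own samples; in the empty case the leaf takes the majority class of the parent node.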

import copy

def decision_tree(D, B, A):
    tree_root = TreeNode('root')
    # Condition 1: all samples at this node carry the same label
    if is_sample_equal(D):
        tree_root.set_attribute('好瓜' if D[0][-1] == '是' else '坏瓜')
        return tree_root
    # Condition 2: only the '好瓜' label column is left in A,
    # or all samples take the same value on every attribute
    if len(A) <= 1 or is_attribute_val_equal(D):
        # Net vote: +1 for each positive sample, -1 for each negative sample
        positive = 0
        for d in D:
            positive = positive + 1 if d[-1] == '是' else positive - 1
        tree_root.set_attribute('好瓜' if positive > 0 else '坏瓜')
        return tree_root
    # Choose the optimal splitting attribute
    max_gain_attribute = choice_optimal_divide_attribute(D, B, A)
    tree_root.set_attribute(max_gain_attribute)
    # B[A[max_gain_attribute]] holds the values of the optimal splitting attribute
    for attribute_val in B[A[max_gain_attribute]]:
        # Subset of samples that take this attribute value
        D_v = get_specify_attribute_val_list(D, attribute_val)
        # Condition 3: the sample set for this branch is empty
        if not D_v:
            # Label the leaf with the majority class of the parent set D
            positive = 0
            for d in D:
                positive = positive + 1 if d[-1] == '是' else positive - 1
            node = TreeNode('好瓜' if positive > 0 else '坏瓜')
            node.set_attribute_val(attribute_val)
            tree_root.add_child(node)
            # No early return here: the remaining attribute values still need their branches
        else:
            A_ = copy.deepcopy(A)
            # Remove the chosen attribute from the attribute set
            del A_[max_gain_attribute]
            # Recurse on the subset
            node = decision_tree(D_v, B, A_)
            node.set_attribute_val(attribute_val)
            tree_root.add_child(node)
    return tree_root
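`decision_tree` calls `get_specify_attribute_val_list`, which is not listed in this post. The sketch below is a guess that fits the call signature `(D, attribute_val)`, not the author's implementation. It relies on an assumption that holds for watermelon dataset 2.0: every attribute value belongs to exactly one attribute, so looking for the value anywhere among the attribute columns is unambiguous.

```python
def get_specify_attribute_val_list(D, attribute_val):
    """Return the samples of D that take `attribute_val` on some attribute.

    Assumed behaviour: the last column is the class label and is ignored,
    and no two attributes share a value name.
    """
    return [row for row in D if attribute_val in row[:-1]]


rows = [
    ['青绿', '硬滑', '是'],
    ['乌黑', '软粘', '否'],
    ['青绿', '软粘', '否'],
]
print(get_specify_attribute_val_list(rows, '青绿'))
# [['青绿', '硬滑', '是'], ['青绿', '软粘', '否']]
```

For data in which two attributes can share a value name, passing the column index explicitly and comparing `row[column] == attribute_val` would be the safer design.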

# Check whether all samples belong to the same class
def is_sample_equal(D):
    first_type = D[0][-1]
    for d in D[1:]:
        if d[-1] != first_type:
            return False
    return True

Check whether all samples take the same value on every attribute

def is_attribute_val_equal(D):
    # Compare attribute columns only; the last column is the class label
    first_row = D[0][:-1]
    for row in D[1:]:
        if row[:-1] != first_row:
            return False
    return True

Choosing the optimal splitting attribute

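def choice_optimal_divide_attribute(D, B, A):
    max_gain = -1
    max_gain_attribute = ''
    # Skip the last key, '好瓜', which is the class label rather than an attribute
    for attribute in list(A.keys())[:-1]:
        gain_ = gain(D, B, A, attribute)
        # Strict comparison: on a tie the earlier attribute is kept
        if max_gain < gain_:
            max_gain = gain_
            max_gain_attribute = attribute
    return max_gain_attribute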

Tree node

class TreeNode:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.attribute_val = ''

    def add_child(self, child_node):
        self.children.append(child_node)

    def set_attribute(self, attribute):
        self.attribute = attribute

    def set_attribute_val(self, attribute_val):
        self.attribute_val = attribute_val
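To inspect a finished tree, a small text dump is often enough. The sketch below assumes only the three fields of `TreeNode` shown above; `format_tree` is a hypothetical helper that is not part of this post, and the class is repeated in compact form so that the example runs on its own.

```python
class TreeNode:
    """Compact copy of the TreeNode fields used in this post."""

    def __init__(self, attribute):
        self.attribute = attribute      # splitting attribute, or the class for a leaf
        self.children = []
        self.attribute_val = ''         # value on the edge from the parent


def format_tree(node, depth=0):
    """Return the tree as a list of indented lines, one line per node."""
    prefix = '  ' * depth
    edge = f'[{node.attribute_val}] ' if node.attribute_val else ''
    lines = [f'{prefix}{edge}{node.attribute}']
    for child in node.children:
        lines.extend(format_tree(child, depth + 1))
    return lines


# Hand-built two-level tree, for illustration only
root = TreeNode('纹理')
leaf = TreeNode('坏瓜')
leaf.attribute_val = '模糊'
root.children.append(leaf)
print('\n'.join(format_tree(root)))
# 纹理
#   [模糊] 坏瓜
```

Each line shows the edge value in brackets followed by the attribute tested at that node, or the predicted class for a leaf.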