Building and Pruning Decision Trees

一只小爪磕

Published 2024-06-29 23:57:20

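The previous part covered the basic concepts and principles of decision trees. This part looks at how a tree is built, how recursive splitting and stopping conditions work, why trees overfit, and how pruning (both pre-pruning and post-pruning) is implemented and how the two approaches compare.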

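How a Decision Tree Is Built

Building a decision tree involves the following main steps:

Choose the best feature: at each node, pick the feature that best separates the data according to a criterion such as information gain or Gini impurity.
Split the dataset: partition the data into subsets according to the chosen feature.
Build the tree recursively: repeat the two steps above on each subset until a stopping condition is met, such as reaching the maximum depth or a subset that cannot be split further.
Create leaf nodes: when no further split is possible, create a leaf and label it with the final classification or regression result.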

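Choosing the Best Feature

Choosing the best feature is the key step in building a decision tree. At each node we pick one feature to split on so that the resulting subsets are purer than the parent. Information gain and Gini impurity (also called the Gini index) are the usual criteria.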
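To make the two criteria concrete, here is a minimal sketch in plain Python. The function names (entropy, gini, information_gain) and the toy labels are our own, chosen for illustration; they do not come from any library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini impurity: chance that two labels drawn with replacement differ."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # each child is pure
useless_split = [["yes", "no"], ["yes", "no"]]   # children look like the parent

print(gini(parent), entropy(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that leaves the children as mixed as the parent has zero gain, while a split that produces pure children recovers the parent's full entropy. This is why the tree prefers the feature with the highest gain (or, equivalently for Gini, the lowest weighted impurity).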

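Splitting the Dataset

Given the chosen feature, we partition the data into subsets. Each subset holds the samples whose value for that feature equals a particular value (categorical features) or falls in a particular range (numeric features).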
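The two kinds of partition can be sketched as follows. The helper names and the toy rows (outlook, temp) are hypothetical and only serve as an illustration.

```python
from collections import defaultdict

def split_by_value(rows, feature):
    """Categorical feature: one subset per distinct value."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[feature]].append(row)
    return dict(subsets)

def split_by_threshold(rows, feature, threshold):
    """Numeric feature: binary split into <= threshold and > threshold."""
    left = [row for row in rows if row[feature] <= threshold]
    right = [row for row in rows if row[feature] > threshold]
    return left, right

rows = [
    {"outlook": "sunny", "temp": 30},
    {"outlook": "rainy", "temp": 18},
    {"outlook": "sunny", "temp": 22},
]
by_outlook = split_by_value(rows, "outlook")
cool, warm = split_by_threshold(rows, "temp", 22)
print(by_outlook)
print(cool, warm)
```

The code in the rest of this article uses only the binary threshold form, which is also what CART-style trees do.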

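Building the Tree Recursively

For each subset we recursively repeat feature selection and splitting until a stopping condition is met: the maximum depth is reached, the subset cannot be split further (for example, all of its samples belong to the same class), or some other preset condition holds.

Creating Leaf Nodes

When no further split is possible we create a leaf node, whose value is the prediction. For classification the leaf usually holds a class label (typically the majority class of its samples); for regression it holds a continuous value (typically the mean target of its samples).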
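As a small sketch of how a leaf value could be computed, assuming majority vote for classification and the mean for regression (leaf_value and its task argument are illustrative names):

```python
from collections import Counter
from statistics import mean

def leaf_value(targets, task="classification"):
    """Prediction stored in a leaf built from the given training targets."""
    if task == "classification":
        # Majority vote over the class labels that reached this leaf.
        return Counter(targets).most_common(1)[0][0]
    # Regression: average of the numeric targets.
    return mean(targets)

print(leaf_value(["benign", "malignant", "benign"]))
print(leaf_value([1.0, 2.0, 3.0], task="regression"))
```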

Pseudocode for Tree Construction

The following Python-style sketch outlines the construction process:

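import numpy as np

class Leaf:
    def __init__(self, value):
        self.value = value  # predicted class label

class Node:
    def __init__(self, feature, threshold, left, right):
        self.feature, self.threshold = feature, threshold  # go left if x[feature] <= threshold
        self.left, self.right = left, right

class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def fit(self, X, y):
        self.tree = self._build_tree(np.asarray(X), np.asarray(y))

    def _build_tree(self, X, y, depth=0):
        # Stop at the depth limit or when the node is already pure.
        if depth == self.max_depth or len(np.unique(y)) == 1:
            return self._create_leaf(y)

        best_feature, best_threshold = self._choose_best_feature(X, y)
        if best_feature is None:  # no split lowers the impurity
            return self._create_leaf(y)

        left_indices, right_indices = self._split(X[:, best_feature], best_threshold)
        left_subtree = self._build_tree(X[left_indices], y[left_indices], depth + 1)
        right_subtree = self._build_tree(X[right_indices], y[right_indices], depth + 1)
        return Node(feature=best_feature, threshold=best_threshold, left=left_subtree, right=right_subtree)

    def _gini(self, y):
        # Gini impurity of a label array.
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def _choose_best_feature(self, X, y):
        # Exhaustive search for the (feature, threshold) pair with the lowest weighted Gini impurity.
        best_feature, best_threshold, best_score = None, None, self._gini(y)
        for feature in range(X.shape[1]):
            for threshold in np.unique(X[:, feature])[:-1]:  # dropping the max keeps both sides non-empty
                left, right = self._split(X[:, feature], threshold)
                score = (len(left) * self._gini(y[left]) + len(right) * self._gini(y[right])) / len(y)
                if score < best_score:
                    best_feature, best_threshold, best_score = feature, threshold, score
        return best_feature, best_threshold

    def _split(self, feature_column, threshold):
        left_indices = np.where(feature_column <= threshold)[0]
        right_indices = np.where(feature_column > threshold)[0]
        return left_indices, right_indices

    def _create_leaf(self, y):
        # Classification leaf: predict the majority class of the samples in the node.
        values, counts = np.unique(y, return_counts=True)
        return Leaf(values[np.argmax(counts)])

    def predict(self, X):
        return np.array([self._predict_one(x, self.tree) for x in np.asarray(X)])

    def _predict_one(self, x, node):
        while isinstance(node, Node):
            node = node.left if x[node.feature] <= node.threshold else node.right
        return node.value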

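In the code above, the _build_tree method builds the tree recursively: at each call it chooses the best feature, splits the data, and recurses until a stopping condition is met.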

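Recursive Splitting and Stopping Conditions

Recursive Splitting

Recursive splitting is the core of decision tree construction. At each node we split the data into two subsets on the chosen feature, then repeat the process on each subset. The goal of every split is to maximize the purity of the resulting subsets.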
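The recursion is easiest to see on a single numeric feature. The sketch below is a deliberately tiny illustration in plain Python, assuming Gini impurity as the purity measure; best_threshold, build, and the nested-dict tree format are our own illustrative choices.

```python
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_threshold(values, labels):
    """Threshold with the lowest weighted Gini, or None if no split helps."""
    best, best_score = None, gini(labels)
    for t in sorted(set(values))[:-1]:  # skipping the maximum keeps both sides non-empty
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best, best_score = t, score
    return best

def build(values, labels, depth=0, max_depth=3):
    """Recursively split one numeric feature; returns a nested dict or a leaf label."""
    t = None if depth == max_depth else best_threshold(values, labels)
    if t is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    left = [(x, y) for x, y in zip(values, labels) if x <= t]
    right = [(x, y) for x, y in zip(values, labels) if x > t]
    return {
        "threshold": t,
        "left": build([x for x, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build([x for x, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

tree = build([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
print(tree)
```

On this toy data a single split at 3 separates the two classes, so both children are pure and the recursion stops after one level.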

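Stopping Conditions

Stopping conditions determine when recursion ends. Common ones include:

Maximum depth: cap the depth of the tree to prevent excessive splitting.
Minimum number of samples: a node must contain at least a given number of samples to be split; otherwise it becomes a leaf.
Purity threshold: stop splitting once a node's impurity (entropy or Gini impurity) has fallen to a preset threshold.
No further split possible: when all samples have identical feature values, or all belong to the same class, no useful split exists.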
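The four conditions above can be collected into one check. This is a sketch under our own naming (should_stop, impurity_threshold) with Gini impurity standing in for the purity measure; the default values are arbitrary.

```python
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def should_stop(rows, labels, depth, max_depth=5, min_samples_split=2, impurity_threshold=0.0):
    """Return the reason splitting should stop at this node, or None to keep splitting."""
    if max_depth is not None and depth >= max_depth:
        return "max depth reached"
    if len(labels) < min_samples_split:
        return "too few samples"
    if gini(labels) <= impurity_threshold:
        return "node is pure enough"
    if all(row == rows[0] for row in rows):
        return "identical feature values"
    return None

print(should_stop([[1], [2]], ["a", "b"], depth=5))
print(should_stop([[1], [1]], ["a", "b"], depth=0))
print(should_stop([[1], [2]], ["a", "b"], depth=0))
```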

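Overfitting and Its Causes

Overfitting

Overfitting means a model performs well on the training data but poorly on test data. For decision trees it usually shows up as an overly complex tree that fits the noise and incidental details of the training data very precisely.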

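Causes of Overfitting

Decision trees overfit mainly for the following reasons:

Excessive depth: the deeper the tree, the more complex the model and the more easily it fits noise in the training data.
Overly fine splits: after repeated splitting the subsets contain too few samples, so leaf predictions depend on a handful of individual samples.
No regularization: with no constraint on tree complexity, the model is too flexible.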
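One way to observe the effect of depth is to grow scikit-learn's DecisionTreeClassifier to several depths and compare training and test accuracy. This is only a sketch: the depth grid and the random_state values are arbitrary choices of ours, and the exact numbers depend on the scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scores = {}
for max_depth in (1, 3, 5, None):  # None lets the tree grow until it cannot split further
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X_train, y_train)
    scores[max_depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    train_acc, test_acc = scores[max_depth]
    print(f"max_depth={max_depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A training accuracy that keeps rising with depth while the test accuracy stalls or drops is the typical signature of an overfitted tree.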

Pruning: Pre-pruning and Post-pruning

Pruning simplifies a decision tree to prevent overfitting. There are two main kinds: pre-pruning and post-pruning.

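Pre-pruning

Pre-pruning limits the complexity of the tree by stopping its growth early, during construction. Common pre-pruning methods include:

Limit the maximum depth: set a maximum depth so the tree cannot grow too deep.
Limit the minimum number of samples: require each node to contain a minimum number of samples before it can be split, so that splits do not become too fine.
Purity threshold: set a threshold and stop splitting once a node is pure enough.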
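For comparison, scikit-learn's DecisionTreeClassifier exposes similar knobs as constructor parameters. The sketch below uses max_depth, min_samples_split, and min_impurity_decrease; the last one is not a purity threshold in the strict sense (it requires each split to reduce the weighted impurity by at least the given amount), but it is the closest built-in relative. The parameter values are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)
pre_pruned = DecisionTreeClassifier(
    max_depth=4,                 # cap on tree depth
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_impurity_decrease=0.01,  # a split must lower weighted impurity by at least this much
    random_state=0,
).fit(X, y)

print("unrestricted:", unrestricted.get_depth(), "levels,", unrestricted.get_n_leaves(), "leaves")
print("pre-pruned:  ", pre_pruned.get_depth(), "levels,", pre_pruned.get_n_leaves(), "leaves")
```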

Implementing Pre-pruning

The following code implements pre-pruning:

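import numpy as np

class Leaf:
    def __init__(self, value):
        self.value = value  # predicted class label

class Node:
    def __init__(self, feature, threshold, left, right):
        self.feature, self.threshold = feature, threshold  # go left if x[feature] <= threshold
        self.left, self.right = left, right

class DecisionTreePrePruning:
    def __init__(self, max_depth=None, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = None

    def fit(self, X, y):
        self.tree = self._build_tree(np.asarray(X), np.asarray(y))

    def _build_tree(self, X, y, depth=0):
        num_samples, num_features = X.shape
        # Pre-pruning: stop early when the node is too small or the tree is deep enough.
        if num_samples < self.min_samples_split or depth == self.max_depth:
            return self._create_leaf(y)

        best_feature, best_threshold = self._choose_best_feature(X, y)
        if best_feature is None:  # node is pure, or no split lowers the impurity
            return self._create_leaf(y)

        left_indices, right_indices = self._split(X[:, best_feature], best_threshold)
        left_subtree = self._build_tree(X[left_indices], y[left_indices], depth + 1)
        right_subtree = self._build_tree(X[right_indices], y[right_indices], depth + 1)
        return Node(feature=best_feature, threshold=best_threshold, left=left_subtree, right=right_subtree)

    def _gini(self, y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def _choose_best_feature(self, X, y):
        # Exhaustive search for the (feature, threshold) pair with the lowest weighted Gini impurity.
        best_feature, best_threshold, best_score = None, None, self._gini(y)
        for feature in range(X.shape[1]):
            for threshold in np.unique(X[:, feature])[:-1]:  # dropping the max keeps both sides non-empty
                left, right = self._split(X[:, feature], threshold)
                score = (len(left) * self._gini(y[left]) + len(right) * self._gini(y[right])) / len(y)
                if score < best_score:
                    best_feature, best_threshold, best_score = feature, threshold, score
        return best_feature, best_threshold

    def _split(self, feature_column, threshold):
        left_indices = np.where(feature_column <= threshold)[0]
        right_indices = np.where(feature_column > threshold)[0]
        return left_indices, right_indices

    def _create_leaf(self, y):
        # Classification leaf: predict the majority class of the samples in the node.
        values, counts = np.unique(y, return_counts=True)
        return Leaf(values[np.argmax(counts)])

    def predict(self, X):
        return np.array([self._predict_one(x, self.tree) for x in np.asarray(X)])

    def _predict_one(self, x, node):
        while isinstance(node, Node):
            node = node.left if x[node.feature] <= node.threshold else node.right
        return node.value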

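In the code above, the max_depth and min_samples_split parameters implement pre-pruning: they cap the depth of the tree and set the minimum number of samples a node needs in order to be split.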

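Post-pruning

Post-pruning simplifies the tree by removing nodes after it has been fully grown. Post-pruning methods include:

Validation-based pruning (reduced-error pruning): evaluate the tree on a validation set and remove subtrees whose removal does not hurt validation performance.
Minimum-error pruning: estimate the error at each node and replace a subtree with a leaf when the leaf's estimated error is no higher than the subtree's.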
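Library implementations often use a third variant. scikit-learn, for instance, offers minimal cost-complexity pruning through the ccp_alpha parameter, which is a different technique from the two listed above. The sketch below assumes we pick ccp_alpha by accuracy on a held-out validation split; the split size and random_state values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate alphas at which successive subtrees of the full tree are pruned away.
path = full.cost_complexity_pruning_path(X_train, y_train)
alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative round-off

best_tree, best_score = full, full.score(X_val, y_val)
for alpha in alphas:
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties prefer the larger alpha, i.e. the smaller tree
        best_tree, best_score = candidate, score

print("leaves before pruning:", full.get_n_leaves())
print("leaves after pruning: ", best_tree.get_n_leaves())
print("validation accuracy:  ", best_score)
```

Because the alphas are visited in increasing order and ties go to the later candidate, the loop ends up with the smallest tree among those with the best validation accuracy.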

Implementing Post-pruning

The following code implements post-pruning:

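import numpy as np

class Leaf:
    def __init__(self, value):
        self.value = value  # predicted class label

class Node:
    def __init__(self, feature, threshold, left, right, value=None):
        self.feature, self.threshold = feature, threshold  # go left if x[feature] <= threshold
        self.left, self.right = left, right
        self.value = value  # majority training class at this node, used if the node is pruned

class DecisionTreePostPruning:
    def __init__(self, max_depth=None, val_fraction=0.25, random_state=0):
        self.max_depth = max_depth
        # Assumption: a share of the data passed to fit() is held out to judge pruning.
        self.val_fraction = val_fraction
        self.random_state = random_state
        self.tree = None

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # Judging pruning on the samples the tree was grown on would rarely find a subtree
        # worth removing, so pruning decisions use a held-out validation split.
        order = np.random.default_rng(self.random_state).permutation(len(y))
        num_val = int(len(y) * self.val_fraction)
        val, train = order[:num_val], order[num_val:]
        self.tree = self._build_tree(X[train], y[train])
        self.tree = self._post_prune(self.tree, X[val], y[val])

    def _build_tree(self, X, y, depth=0):
        # Stop at the depth limit or when the node is already pure.
        if depth == self.max_depth or len(np.unique(y)) == 1:
            return self._create_leaf(y)

        best_feature, best_threshold = self._choose_best_feature(X, y)
        if best_feature is None:  # no split lowers the impurity
            return self._create_leaf(y)

        left_indices, right_indices = self._split(X[:, best_feature], best_threshold)
        left_subtree = self._build_tree(X[left_indices], y[left_indices], depth + 1)
        right_subtree = self._build_tree(X[right_indices], y[right_indices], depth + 1)
        return Node(feature=best_feature, threshold=best_threshold, left=left_subtree,
                    right=right_subtree, value=self._majority_class(y))

    def _post_prune(self, node, X, y):
        # Bottom-up pruning; X and y are the validation samples that reach this node.
        if isinstance(node, Leaf):
            return node

        left_indices, right_indices = self._split(X[:, node.feature], node.threshold)
        node.left = self._post_prune(node.left, X[left_indices], y[left_indices])
        node.right = self._post_prune(node.right, X[right_indices], y[right_indices])

        if isinstance(node.left, Leaf) and isinstance(node.right, Leaf):
            # Validation errors with the split kept versus the node collapsed into one leaf.
            error_no_pruning = (self._calculate_error(y[left_indices], node.left.value)
                                + self._calculate_error(y[right_indices], node.right.value))
            error_pruning = self._calculate_error(y, node.value)
            if error_pruning <= error_no_pruning:
                return Leaf(node.value)  # replace the subtree with a single leaf
        return node

    def _gini(self, y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def _choose_best_feature(self, X, y):
        # Exhaustive search for the (feature, threshold) pair with the lowest weighted Gini impurity.
        best_feature, best_threshold, best_score = None, None, self._gini(y)
        for feature in range(X.shape[1]):
            for threshold in np.unique(X[:, feature])[:-1]:  # dropping the max keeps both sides non-empty
                left, right = self._split(X[:, feature], threshold)
                score = (len(left) * self._gini(y[left]) + len(right) * self._gini(y[right])) / len(y)
                if score < best_score:
                    best_feature, best_threshold, best_score = feature, threshold, score
        return best_feature, best_threshold

    def _split(self, feature_column, threshold):
        left_indices = np.where(feature_column <= threshold)[0]
        right_indices = np.where(feature_column > threshold)[0]
        return left_indices, right_indices

    def _majority_class(self, y):
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]

    def _create_leaf(self, y):
        # Classification leaf: predict the majority class of the samples in the node.
        return Leaf(self._majority_class(y))

    def _calculate_error(self, y, value):
        # Number of samples in y that a leaf predicting `value` would misclassify.
        return int(np.sum(y != value))

    def predict(self, X):
        return np.array([self._predict_one(x, self.tree) for x in np.asarray(X)])

    def _predict_one(self, x, node):
        while isinstance(node, Node):
            node = node.left if x[node.feature] <= node.threshold else node.right
        return node.value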

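In the code above, the _post_prune method implements post-pruning. The full tree is built first; each node is then evaluated recursively, from the bottom up, to decide whether it should be pruned to simplify the tree.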

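Pre-pruning vs. Post-pruning

Both pre-pruning and post-pruning help prevent overfitting, but they differ in effect and in where they fit best.

Pre-pruning:
- Advantages: simple to build, low computational cost.
- Disadvantages: may stop growth too early, so the tree can perform worse than a post-pruned one.
Post-pruning:
- Advantages: can produce a better tree structure and usually improves generalization.
- Disadvantages: requires growing the full tree first, so the computational cost is higher.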

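Implementation and Performance Comparison

To see the effect of pre-pruning and post-pruning more concretely, we run an experiment on a real dataset and compare the two methods.

Preparing the Dataset

We use the well-known breast cancer dataset that ships with scikit-learn (load_breast_cancer). It contains 569 samples and 30 features and is used for binary classification.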

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Pre-pruning Experiment

We build a decision tree with pre-pruning and evaluate it on the test set.

# Instantiate the pre-pruned decision tree
pre_pruning_tree = DecisionTreePrePruning(max_depth=5, min_samples_split=10)
pre_pruning_tree.fit(X_train, y_train)

# Predict and evaluate
y_pred = pre_pruning_tree.predict(X_test)
print(f"Accuracy of the pre-pruned decision tree: {accuracy_score(y_test, y_pred)}")

Post-pruning Experiment

We build a decision tree with post-pruning and evaluate it on the test set.

# Instantiate the post-pruned decision tree
post_pruning_tree = DecisionTreePostPruning(max_depth=10)
post_pruning_tree.fit(X_train, y_train)

# Predict and evaluate
y_pred = post_pruning_tree.predict(X_test)
print(f"Accuracy of the post-pruned decision tree: {accuracy_score(y_test, y_pred)}")