Decision Tree Construction: A Step-by-Step Guide

Latest recommended article published on 2025-04-24 06:19:55

北辰alk

Category: AI. Tags: decision tree, algorithm, machine learning

Article link: https://blog.csdn.net/qq_16242613/article/details/147347115

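A decision tree is a tree-structured model that mimics the way people make decisions through a sequence of tests. The complete construction procedure and its technical details follow.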

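1. Basic construction steps

1.1 Data preparation

Missing values: impute (mean or mode) or drop the affected rows
Discrete variables: One-Hot or Label Encoding
Continuous variables: equal-width or equal-frequency binning (optional)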
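The three preparation steps above can be sketched with pandas. This is a minimal illustration under stated assumptions: the toy DataFrame and its column names (`age`, `city`, `income`) are hypothetical and not part of the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 58.0, 41.0, 29.0],
    "city": ["A", "B", "A", None, "C", "B"],
    "income": [18.0, 42.0, 35.0, 90.0, 61.0, 27.0],
})

# 1. Missing values: mean for the numeric column, mode for the categorical one.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 2. Discrete variable: one-hot encoding (adds city_A, city_B, city_C).
df = pd.get_dummies(df, columns=["city"])

# 3. Optional binning of a continuous variable.
df["income_equal_width"] = pd.cut(df["income"], bins=3, labels=False)  # equal-width bins
df["income_equal_freq"] = pd.qcut(df["income"], q=3, labels=False)     # equal-frequency bins

print(df)
```

Binning is optional because the split search can handle raw continuous values directly; it mainly reduces the number of candidate thresholds.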

1.2 Core build procedure

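# Skeleton only: should_stop, create_leaf, find_best_split, split_data and
# DecisionNode are helpers that the implementation has to supply.
def build_tree(data, depth=0):
    # Check the stopping conditions first
    if should_stop(data, depth):
        return create_leaf(data)

    # Choose the best split; fall back to a leaf when no valid split exists
    # (assumes find_best_split returns None in that case)
    best_split = find_best_split(data)
    if best_split is None:
        return create_leaf(data)

    # Partition the data according to the chosen split
    left_data, right_data = split_data(data, best_split)

    # Recursively build the two subtrees
    left_tree = build_tree(left_data, depth + 1)
    right_tree = build_tree(right_data, depth + 1)

    return DecisionNode(
        feature=best_split.feature,
        threshold=best_split.threshold,
        left=left_tree,
        right=right_tree
    )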

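2. Key steps in detail

2.1 Choosing the best split

2.1.1 Common impurity measures

Classification:
- Gini index: $1 - \sum_i p_i^2$, where $p_i$ is the proportion of class $i$ in the node
- Information gain: $H(parent) - \sum_i \frac{N_i}{N}H(child_i)$, where $H$ is the entropy and $N_i$ is the number of samples in child $i$
- Information gain ratio (C4.5): information gain divided by the split information $-\sum_i \frac{N_i}{N}\log_2\frac{N_i}{N}$
Regression:
- Variance reduction: $Var(parent) - \sum_i \frac{N_i}{N}Var(child_i)$, where $Var = \frac{1}{N}\sum_j(y_j - \bar{y})^2$ is computed over the samples in the node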
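To make these measures concrete, here is a small pure-Python sketch that evaluates each of them on hand-made data. The function names and the toy labels are illustrative assumptions, not taken from the article.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

def variance(values):
    """Population variance of the targets in a node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, children):
    """Parent variance minus the size-weighted variance of the children."""
    n = len(parent)
    return variance(parent) - sum(len(ch) / n * variance(ch) for ch in children)

# Classification: a 5/5 parent split into a 4/1 child and a 1/4 child.
parent = ["yes"] * 5 + ["no"] * 5
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4
print("parent gini:", gini(parent))
print("left gini:", round(gini(left), 4))
print("information gain:", round(information_gain(parent, [left, right]), 4))

# Regression: two well-separated groups of targets.
targets = [1, 2, 3, 10, 11, 12]
print("variance reduction:", round(variance_reduction(targets, [targets[:3], targets[3:]]), 4))
```

A perfectly mixed binary node has Gini 0.5 and entropy 1 bit; both measures fall to 0 when the node is pure.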

2.1.2 Split evaluation example

import numpy as np

def calculate_gini(y):
    """Gini index of a label array: 1 - sum(p_i^2)."""
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / len(y)
    return 1 - np.sum(probabilities**2)

def find_best_split(X, y):
    """Search every numeric column of DataFrame X for the lowest weighted Gini."""
    best_gini = float('inf')
    best_feature, best_thresh = None, None

    for feature in X.columns:
        values = np.sort(X[feature].unique())
        thresholds = (values[:-1] + values[1:]) / 2  # midpoints as candidate thresholds

        for thresh in thresholds:
            left_idx = X[feature] <= thresh
            gini_left = calculate_gini(y[left_idx])
            gini_right = calculate_gini(y[~left_idx])

            # Weight each child's impurity by its share of the samples
            weighted_gini = (len(y[left_idx]) * gini_left +
                             len(y[~left_idx]) * gini_right) / len(y)

            if weighted_gini < best_gini:
                best_gini = weighted_gini
                best_feature = feature
                best_thresh = thresh

    # Both values stay None when no feature has two distinct values
    return best_feature, best_thresh

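2.2 Stopping conditions

Condition	Typical value	Purpose
Maximum depth	3-10	Prevents overfitting
Minimum samples to split	2-20	Avoids splitting tiny nodes
Minimum samples per leaf	1-5	Keeps leaf estimates statistically meaningful
Minimum impurity decrease	0.01-0.1	Filters out splits that barely help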
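The conditions in the table could be checked roughly as follows. This is a sketch, not a reference implementation: `StopConfig`, `should_stop` and `split_is_acceptable` are hypothetical names, the defaults are arbitrary picks from the typical ranges, and the impurity decrease is compared unweighted.

```python
from dataclasses import dataclass

@dataclass
class StopConfig:
    # Hypothetical defaults chosen from the typical ranges in the table.
    max_depth: int = 5
    min_samples_split: int = 10
    min_samples_leaf: int = 2
    min_impurity_decrease: float = 0.01

def should_stop(labels, depth, cfg):
    """Checks made before searching for a split."""
    if depth >= cfg.max_depth:
        return True                      # tree is deep enough
    if len(labels) < cfg.min_samples_split:
        return True                      # node too small to split
    if len(set(labels)) <= 1:
        return True                      # node is already pure
    return False

def split_is_acceptable(n_left, n_right, impurity_decrease, cfg):
    """Checks made after a candidate split has been found."""
    if min(n_left, n_right) < cfg.min_samples_leaf:
        return False                     # one child would be too small
    return impurity_decrease >= cfg.min_impurity_decrease

cfg = StopConfig()
print(should_stop([0, 1] * 10, depth=2, cfg=cfg))
print(should_stop([0, 1] * 10, depth=5, cfg=cfg))
print(split_is_acceptable(1, 19, impurity_decrease=0.2, cfg=cfg))
```

Depth, node size and purity can be tested before the split search, while the leaf-size and impurity-decrease conditions can only be tested once a candidate split is known.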

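2.3 Creating leaf nodes

Classification tree: predict the majority class

def create_leaf(y):
    # np.bincount requires non-negative integer labels
    counts = np.bincount(y)
    return np.argmax(counts)

Regression tree: predict the mean

def create_leaf(y):
    return np.mean(y)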
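With leaf creation defined, the pieces from sections 1.2 to 2.3 can be assembled into a small working classifier. The sketch below is one possible arrangement under simplifying assumptions: NumPy arrays instead of a DataFrame, non-negative integer labels, nodes stored as plain dicts, and only two stopping conditions. The toy data is made up.

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature_index, threshold, weighted_gini), or None if nothing can be split."""
    best = None
    n = len(y)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])                 # sorted unique values
        for t in (values[:-1] + values[1:]) / 2:    # midpoints as candidates
            mask = X[:, j] <= t
            score = (mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best

def majority(y):
    return int(np.bincount(y).argmax())

def build_tree(X, y, depth=0, max_depth=3, min_samples_split=2):
    # Stop on depth, node size, or purity
    if depth >= max_depth or len(y) < min_samples_split or gini(y) == 0.0:
        return majority(y)
    split = best_split(X, y)
    if split is None:                               # all rows identical
        return majority(y)
    j, t, _ = split
    mask = X[:, j] <= t
    return {
        "feature": j,
        "threshold": float(t),
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples_split),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples_split),
    }

def predict_one(node, x):
    # Walk down until a leaf (a plain int) is reached
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node

# Made-up, linearly separable toy data
X = np.array([[2.0, 1.0], [3.0, 1.5], [1.0, 0.5], [7.0, 3.0], [8.0, 2.5], [9.0, 3.5]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = build_tree(X, y)
predictions = [predict_one(tree, row) for row in X]
print(tree)
print(predictions)
```

On this data a single split on the first feature separates the classes, so the tree has one decision node and two leaves.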

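3. Advanced optimization techniques

3.1 Pruning

Pre-pruning vs. post-pruning: pre-pruning stops growth early through the stopping conditions in section 2.2; post-pruning grows the full tree first and then removes subtrees that do not justify their complexity.

Cost-complexity pruning:

def prune_tree(node, alpha):
    # Assumed node fields: node.impurity is the total impurity (error) summed over
    # the leaves of this subtree, and node.subtree_size is its number of leaves.
    # Both must be refreshed after a child has been collapsed.
    if node.is_leaf:
        return

    # Prune the subtrees first (bottom-up)
    prune_tree(node.left, alpha)
    prune_tree(node.right, alpha)

    # Cost with the subtree kept vs. collapsed into a single leaf
    before_prune = node.impurity + alpha * node.subtree_size
    after_prune = calculate_leaf_impurity(node) + alpha * 1

    # Collapse when a single leaf costs no more than the subtree
    if after_prune <= before_prune:
        node.convert_to_leaf()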
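For comparison, scikit-learn provides cost-complexity pruning through the `ccp_alpha` parameter and the `cost_complexity_pruning_path` method. The sketch below uses the Iris dataset and a 70/30 split purely as an illustrative setup; neither comes from the article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Effective alphas at which successive subtrees of the full tree get pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

leaf_counts = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    leaf_counts.append(clf.get_n_leaves())
    print(f"alpha={alpha:.4f}  leaves={clf.get_n_leaves()}  test_acc={clf.score(X_test, y_test):.3f}")
```

Larger values of alpha penalize leaves more heavily, so the fitted trees shrink as alpha grows. In practice alpha is usually picked by cross-validation rather than from a single held-out split.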

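3.2 Handling special data

Continuous features:

Sort the values and use the midpoints between adjacent values as candidate split points
Optimization: use quantiles to reduce the number of candidates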
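The difference between the two candidate-generation strategies can be seen in a short NumPy sketch. The helper names and the choice of 10 bins are assumptions made for illustration.

```python
import numpy as np

def midpoint_thresholds(values):
    """All midpoints between adjacent distinct values (exhaustive search)."""
    v = np.unique(values)  # sorted unique values
    return (v[:-1] + v[1:]) / 2

def quantile_thresholds(values, n_bins=10):
    """At most n_bins - 1 thresholds placed at the interior quantiles."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.unique(np.quantile(values, qs))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

exhaustive = midpoint_thresholds(x)
reduced = quantile_thresholds(x, n_bins=10)
print(len(exhaustive), "midpoint candidates")
print(len(reduced), "quantile candidates")
```

The quantile version trades a small loss in split precision for far fewer impurity evaluations per feature, which matters most on large datasets.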

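Categorical features:

Binary partitioning: $2^{k-1}-1$ possible splits (k is the number of categories)
Optimization: sort the categories by mean target value and search only the $k-1$ splits along that order (this finds the optimal split for regression and binary classification)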
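The gap between exhaustive partitioning and the ordering shortcut can be illustrated in pure Python. The function names and the colour data below are hypothetical.

```python
from itertools import combinations

def all_binary_partitions(categories):
    """Every split of the categories into two non-empty groups: 2^(k-1) - 1 of them."""
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    partitions = []
    # Pin the first category to the left side so mirror images are not counted twice
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(cats) - left
            if right:
                partitions.append((left, right))
    return partitions

def ordered_partitions(categories, targets):
    """The k - 1 splits obtained after sorting categories by mean target value."""
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    order = sorted(sums, key=lambda c: sums[c] / counts[c])
    return [(set(order[:i]), set(order[i:])) for i in range(1, len(order))]

# Hypothetical feature with k = 4 categories and a binary target
colours = ["red", "green", "blue", "red", "green", "blue", "gray", "gray"]
target = [1, 0, 1, 1, 0, 0, 0, 0]

exhaustive = all_binary_partitions(set(colours))
ordered = ordered_partitions(colours, target)
print(len(exhaustive), "exhaustive splits")
print(len(ordered), "ordered splits")
for left, right in ordered:
    print(sorted(left), "|", sorted(right))
```

With k = 4 the saving is modest, but the exhaustive count doubles with every extra category while the ordered count grows by one.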

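4. Comparison of mainstream algorithms

Algorithm	Split criterion	Supported tasks	Characteristics
ID3	Information gain	Classification	Biased toward features with many values
C4.5	Information gain ratio	Classification	Handles continuous features
CART	Gini index (classification), variance reduction (regression)	Classification/regression	Binary tree structure
CHAID	Chi-square test	Classification	Multiway splits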

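5. Practical recommendations

Feature scaling: decision trees do not need standardization, because splits depend only on the ordering of feature values; normalization can still help with visualization.

Parameter tuning grid: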

param_grid = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_impurity_decrease': [0, 0.01, 0.1]
}
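A grid like this one can be passed to scikit-learn's `GridSearchCV`. The sketch below is self-contained, so it repeats the grid; the Iris dataset, 5-fold cross-validation and accuracy scoring are illustrative choices, not recommendations from the article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 5, 10],
    "min_impurity_decrease": [0.0, 0.01, 0.1],
}

# 3 x 3 x 3 = 27 parameter combinations, each evaluated with 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a tree refitted on all of the data with the winning parameters.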

Visualization and interpretation (Graphviz example):

from sklearn.tree import export_graphviz
export_graphviz(
    tree_model,
    out_file="tree.dot",
    feature_names=feature_names,
    class_names=target_names,
    rounded=True
)
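The call above assumes that `tree_model`, `feature_names` and `target_names` already exist. A self-contained variant might look like the following sketch, where the Iris dataset and `max_depth=2` are illustrative choices. It passes `out_file=None` so that the DOT source is returned as a string, and also prints a plain-text rendering with `export_text`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

iris = load_iris()
feature_names = list(iris.feature_names)
target_names = list(iris.target_names)

# A shallow tree keeps the diagram readable
tree_model = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_model.fit(iris.data, iris.target)

# out_file=None returns the DOT source instead of writing a file
dot_source = export_graphviz(
    tree_model,
    out_file=None,
    feature_names=feature_names,
    class_names=target_names,
    rounded=True,
)
print(dot_source[:200])

# Text rendering that needs no Graphviz installation
print(export_text(tree_model, feature_names=feature_names))
```

If the DOT source is written to tree.dot, the Graphviz command `dot -Tpng tree.dot -o tree.png` renders it as an image.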