30. Decision Trees: The Smart Algorithm That Asks Questions

Decision Trees: How the Algorithm Works and How to Use It

Original post, published 2025-08-14 17:15:38 · 969 views

License: CC 4.0 BY-SA

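🎯 Introduction: The "Detective Conan" of AI

Remember playing "20 Questions" as a kid? You think of something, and I try to guess it by asking up to 20 yes-or-no questions. "Is it an animal?" "Can it fly?" "Is it bigger than a cat?" Each question narrows the range until the answer is pinned down.

A decision tree is the "Detective Conan" of machine learning: it cracks the secrets in data by asking a series of well-chosen questions. The difference is that Conan needs genius-level deduction, while a decision tree only needs enough data and a sound algorithm. Today we'll see how this "AI detective" solves complex problems by asking questions!

Imagine you have to design a loan-approval system for a bank. The traditional approach might need a pile of hand-written rules and expert experience. A decision tree says: "Just let me ask a few questions!" That is its appeal: simple, intuitive, and often good enough in practice.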

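🌳 What Is a Decision Tree?

A decision tree is a tree-structured machine learning algorithm that learns a series of if-then rules to perform classification or regression. Think of it as a smart flowchart: each internal node asks a question, and each branch is one possible answer.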

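Decision Trees in Everyday Life

We actually use decision-tree thinking every day:

A decision tree for choosing lunch:

What do I want to eat today?
└─ Want rice?
   ├─ Yes → Want something spicy?
   │  ├─ Yes → Spicy stir-fry pot (mala xiangguo)
   │  └─ No → Poached chicken with rice
   └─ No → Want noodles?
      ├─ Yes → Beef noodles
      └─ No → Burger

See? That is a perfectly good decision tree. Each question narrows the options until we reach a final answer.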

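Components of a Decision Tree

Root node: the starting point of the tree; it contains all the data
Internal node: a test on one feature
Leaf node: a final prediction
Branch: an edge connecting two nodes; it represents one outcome of a test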
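To make these four parts concrete, here is a small sketch that fits a shallow tree and walks scikit-learn's internal `tree_` arrays. The iris dataset and the helper name `describe_nodes` are assumptions chosen for illustration; they do not come from the article.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
tree = clf.tree_  # low-level arrays describing the fitted tree


def describe_nodes(tree):
    """Label every node as root, internal, or leaf (hypothetical helper)."""
    rows = []
    for node_id in range(tree.node_count):
        # scikit-learn marks a leaf by setting its child pointers to -1
        if tree.children_left[node_id] == -1:
            kind = "leaf"
        elif node_id == 0:
            kind = "root"
        else:
            kind = "internal"
        rows.append((node_id, kind))
    return rows


nodes = describe_nodes(tree)
for node_id, kind in nodes:
    if kind == "leaf":
        print(f"node {node_id}: leaf, samples={tree.n_node_samples[node_id]}")
    else:
        # Branches: go left when the test is true, right otherwise
        name = iris.feature_names[tree.feature[node_id]]
        print(f"node {node_id}: {kind}, asks '{name} <= {tree.threshold[node_id]:.2f}'")
```

Each non-leaf line printed is one "question", and the two children of that node are its branches.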

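🔍 How a Decision Tree Works

Building a decision tree is like a good detective working a case:

1. Choose the best question (feature selection)

Not all questions are equally valuable. A good detective asks the most informative one first. Decision trees use the following criteria to pick the best feature to split on:

Information gain: how much does asking this question reduce uncertainty (entropy)?
Gini impurity (often called the Gini index): how much purer do the subsets become after asking this question?
Gain ratio: information gain normalized by the entropy of the split itself, which penalizes features with many distinct values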
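The first two criteria can be computed by hand. The sketch below uses only the standard library; the function names and the toy label lists are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: chance that two random draws have different labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5
perfect_split = [["yes"] * 5, ["no"] * 5]            # each child is pure
useless_split = [["yes", "yes", "no", "no"],          # each child is still 50/50
                 ["yes", "yes", "yes", "no", "no", "no"]]

print("entropy of parent:", entropy(parent))
print("gini of parent:", gini(parent))
print("gain of perfect split:", information_gain(parent, perfect_split))
print("gain of useless split:", information_gain(parent, useless_split))
```

A question that separates the classes completely earns the full parent entropy as gain, while one that leaves every subset as mixed as before earns nothing.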

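2. Recursive splitting

Once a question is chosen, the data is split into subsets according to the answer, and the process repeats on each subset until:

All samples in the subset belong to the same class
No features are left to split on
A preset stopping condition is reached (for example, a maximum depth or a minimum number of samples)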
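The recursion and its three stopping conditions can be sketched in a few dozen lines of plain Python. This is a simplified illustration, not how scikit-learn implements it: it assumes numeric features, binary `<=` splits, and Gini impurity, and the toy rows (a red-cap flag and a stem length) are made up for the example.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini the most, or None."""
    best, best_score, n = None, gini(labels), len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop 1: the subset is pure. Stop 3: a preset limit (max_depth) is reached.
    if len(set(labels)) == 1 or depth >= max_depth:
        return {"leaf": majority}
    split = best_split(rows, labels)
    # Stop 2: no remaining split improves purity.
    if split is None:
        return {"leaf": majority}
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    }


def predict(tree, row):
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]


# Toy data: (cap_is_red, stem_length) -> 1 means poisonous
rows = [(1, 7.0), (1, 8.0), (1, 5.0), (1, 4.0), (0, 7.5), (0, 3.0), (0, 6.5), (1, 6.5)]
labels = [1, 1, 0, 0, 0, 0, 0, 1]
toy_tree = build_tree(rows, labels)
print(toy_tree)
print(predict(toy_tree, (1, 9.0)), predict(toy_tree, (0, 9.0)))
```

The exhaustive threshold search is quadratic in the number of rows, which is fine for a toy example; production implementations sort each feature once instead.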

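3. Pruning

To avoid overfitting, the tree needs to be "pruned":

Pre-pruning: stop growing early during construction
Post-pruning: grow the full tree first, then remove branches that do not help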
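In scikit-learn, pre-pruning corresponds to growth limits such as `max_depth` and `min_samples_leaf`, while post-pruning is available as minimal cost-complexity pruning through the `ccp_alpha` parameter. The sketch below contrasts the two on the built-in breast cancer dataset, which is an assumption for illustration. Taking the middle alpha from the pruning path is an arbitrary choice to keep the example short; in practice it would normally be selected by cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Pre-pruning: limit growth while the tree is being built
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

# Post-pruning: grow the full tree, then collapse its weakest branches
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
path = full.cost_complexity_pruning_path(X_tr, y_tr)
alphas = path.ccp_alphas                 # candidate pruning strengths, increasing
mid_alpha = alphas[len(alphas) // 2]     # arbitrary pick for this sketch
post = DecisionTreeClassifier(random_state=0, ccp_alpha=mid_alpha).fit(X_tr, y_tr)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"train={model.score(X_tr, y_tr):.3f}, test={model.score(X_te, y_te):.3f}")
```

Larger values of `ccp_alpha` remove more of the tree; a value of 0 leaves it unpruned.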

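⚖️ Strengths and Weaknesses of Decision Trees

Strengths: the decision tree's "superpowers"

Easy to interpret: the result can be expressed as simple if-then rules
Little preprocessing: no feature scaling (standardization or normalization) is needed, because splits depend only on the ordering of values
Handles mixed data: works with both numeric and categorical features (scikit-learn still requires categorical features to be encoded as numbers first)
Built-in feature selection: uninformative features tend not to be chosen for splits
Computationally cheap: training is fast, and a prediction only walks one path from root to leaf

Weaknesses: the decision tree's "Achilles' heel"

Prone to overfitting: like a student with too good a memory, it memorizes every detail
Sensitive to noise: a single bad data point can change a split and everything below it
Biased toward high-cardinality features: information gain favors features with many distinct values
Unstable: a small change in the data can produce a completely different tree
Struggles with linear relationships: axis-aligned splits can only approximate a slanted boundary with a staircase, so data that a linear model separates easily may need a deep tree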
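The last weakness is easy to see on synthetic data. In this sketch the class flips across the diagonal line x0 + x1 = 0; the data, the variable names, and the comparison against logistic regression are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the true boundary is the diagonal x0 + x1 = 0
rng = np.random.default_rng(0)
X_diag = rng.uniform(-1, 1, size=(2000, 2))
y_diag = (X_diag[:, 0] + X_diag[:, 1] > 0).astype(int)
X_fit, X_eval = X_diag[:1500], X_diag[1500:]
y_fit, y_eval = y_diag[:1500], y_diag[1500:]

shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_fit, y_fit)
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
linear_model = LogisticRegression().fit(X_fit, y_fit)

for name, model in [("tree, depth 2", shallow_tree),
                    ("tree, unrestricted", deep_tree),
                    ("logistic regression", linear_model)]:
    print(f"{name}: test accuracy = {model.score(X_eval, y_eval):.3f}")

# The unrestricted tree needs many leaves to trace the diagonal as a staircase
print("leaves in the unrestricted tree:", deep_tree.get_n_leaves())
```

The linear model describes the boundary with two coefficients, while the tree has to stack many axis-aligned cuts to approximate the same line.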

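🎯 Classification Tree in Practice

Let's use a decision tree on a classic problem: deciding whether a mushroom is poisonous.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Create a simulated mushroom dataset
np.random.seed(42)
n_samples = 1000

# Generate features (column names and category values stay in Chinese
# because later code refers to them)
data = {
    '帽子颜色': np.random.choice(['红色', '白色', '棕色'], n_samples),  # cap color: red / white / brown
    '帽子形状': np.random.choice(['圆形', '锥形', '扁平'], n_samples),  # cap shape: round / conical / flat
    '茎长度': np.random.normal(5, 2, n_samples),                      # stem length
    '茎厚度': np.random.normal(1, 0.5, n_samples),                    # stem thickness
    '生长环境': np.random.choice(['森林', '草地', '湿地'], n_samples),  # habitat: forest / grassland / wetland
    '季节': np.random.choice(['春季', '夏季', '秋季'], n_samples)       # season: spring / summer / autumn
}

# Build the DataFrame
df = pd.DataFrame(data)

# Create the target variable '是否有毒' (1 = poisonous, 0 = edible)
# Simplified rule: red cap + forest habitat + stem length > 6 = poisonous
df['是否有毒'] = (
    (df['帽子颜色'] == '红色') &
    (df['生长环境'] == '森林') &
    (df['茎长度'] > 6)
).astype(int)

print("Mushroom dataset preview:")
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"Share of poisonous mushrooms: {df['是否有毒'].mean():.2%}")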

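Data Preprocessing

from sklearn.preprocessing import LabelEncoder

# Encode the categorical features as integers.
# Note: LabelEncoder is designed for target labels; it works here, but the
# integer codes impose an arbitrary order on the categories.
categorical_features = ['帽子颜色', '帽子形状', '生长环境', '季节']
label_encoders = {}

for feature in categorical_features:
    le = LabelEncoder()
    df[feature + '_编码'] = le.fit_transform(df[feature])  # '_编码' = "_encoded"
    label_encoders[feature] = le

# Prepare the features and the target
feature_columns = ['帽子颜色_编码', '帽子形状_编码', '茎长度', '茎厚度', 
                  '生长环境_编码', '季节_编码']
X = df[feature_columns]
y = df['是否有毒']

# Split the data (stratify keeps the class ratio the same in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)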

Training the Decision Tree

# Create the decision tree classifier
dt_classifier = DecisionTreeClassifier(
    max_depth=5,  # limit the depth of the tree
    min_samples_split=20,  # minimum samples required to split an internal node
    min_samples_leaf=10,   # minimum samples required in a leaf
    random_state=42
)

# Train the model
dt_classifier.fit(X_train, y_train)

# Predict
y_pred = dt_classifier.predict(X_test)

# Evaluate the model
print("Classification report:")
print(classification_report(y_test, y_pred, target_names=['edible', 'poisonous']))

# Compute accuracy
accuracy = dt_classifier.score(X_test, y_test)
print(f"\nAccuracy: {accuracy:.3f}")

Visualizing the Confusion Matrix

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['edible', 'poisonous'],
            yticklabels=['edible', 'poisonous'])
plt.title('Mushroom classification confusion matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

Feature Importance Analysis

# Feature importance
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': dt_classifier.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature importance ranking:")
print(feature_importance)

# Visualize feature importance
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature')
plt.title('Decision tree feature importance')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

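📈 Regression Tree in Practice

Decision trees can do more than classify; they can also predict continuous values. Let's use one to predict house prices:

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic "house price" data.
# The _reg suffix keeps the mushroom classification split (X_train, y_train, ...)
# intact, because later sections fit classifiers on it.
X_reg, y_reg = make_regression(
    n_samples=1000,
    n_features=8,
    noise=0.1,
    random_state=42
)

# Feature names: area, rooms, floor, year built, distance to city center,
# nearby amenities, transport access, school district quality
feature_names = ['面积', '房间数', '楼层', '建造年份', '距离市中心', 
                '周边设施', '交通便利性', '学区质量']

# Build the DataFrame ('房价' = house price)
house_df = pd.DataFrame(X_reg, columns=feature_names)
house_df['房价'] = y_reg

# Split the data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Create the regression tree
dt_regressor = DecisionTreeRegressor(
    max_depth=10,
    min_samples_split=20,
    min_samples_leaf=5,
    random_state=42
)

# Train the model
dt_regressor.fit(X_train_reg, y_train_reg)

# Predict
y_pred_reg = dt_regressor.predict(X_test_reg)

# Evaluate
mse = mean_squared_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"Mean squared error: {mse:.3f}")
print(f"R² score: {r2:.3f}")

# Visualize predictions against true values
plt.figure(figsize=(10, 6))
plt.scatter(y_test_reg, y_pred_reg, alpha=0.6)
plt.plot([y_test_reg.min(), y_test_reg.max()],
         [y_test_reg.min(), y_test_reg.max()], 'r--', lw=2)
plt.xlabel('True price')
plt.ylabel('Predicted price')
plt.title('Decision tree regression: predicted vs. true')
plt.show()

# Feature importance
reg_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': dt_regressor.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature importance for house price prediction:")
print(reg_importance)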

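🔧 Tuning Decision Tree Hyperparameters

Decision trees have many hyperparameters to adjust. Like a piano tuner at work, you need to find the best combination:

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# Create the decision tree
dt = DecisionTreeClassifier(random_state=42)

# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    dt, param_grid, cv=5, 
    scoring='accuracy', 
    n_jobs=-1, 
    verbose=1
)

# Fit. X_train / y_train must be the mushroom classification split here:
# a classifier cannot be fitted on the continuous house-price target.
grid_search.fit(X_train, y_train)

# Best parameters
print("Best parameters:")
print(grid_search.best_params_)
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")

# Predict with the best model
best_model = grid_search.best_estimator_
best_pred = best_model.predict(X_test)
best_accuracy = best_model.score(X_test, y_test)

print(f"Test set accuracy: {best_accuracy:.3f}")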

What the Parameters Do

# Compare the effect of different parameter settings
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Effect of different max_depth values
depths = [3, 5, 10, None]
for i, depth in enumerate(depths):
    ax = axes[i//2, i%2]
    
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    
    train_score = dt.score(X_train, y_train)
    test_score = dt.score(X_test, y_test)
    
    ax.bar(['Train', 'Test'], [train_score, test_score])
    ax.set_title(f'max_depth={depth}')
    ax.set_ylabel('Accuracy')
    ax.set_ylim(0, 1.1)
    
    # Add value labels above the bars
    for j, score in enumerate([train_score, test_score]):
        ax.text(j, score + 0.01, f'{score:.3f}', ha='center')

plt.tight_layout()
plt.show()

📊 Visualizing a Decision Tree

One of the biggest advantages of a decision tree is that it can be drawn. Let's see how this "AI detective" thinks:

from sklearn.tree import export_text, plot_tree
import matplotlib.pyplot as plt

# Train a simple decision tree
simple_dt = DecisionTreeClassifier(max_depth=3, random_state=42)
simple_dt.fit(X_train, y_train)

# The tree as text rules
tree_rules = export_text(simple_dt, feature_names=feature_columns)
print("Decision tree rules:")
print(tree_rules)

# The tree as a figure
plt.figure(figsize=(20, 10))
plot_tree(simple_dt, 
          feature_names=feature_columns,
          class_names=['edible', 'poisonous'],
          filled=True,
          rounded=True,
          fontsize=10)
plt.title('Mushroom classification decision tree')
plt.show()

Drawing a Nicer Tree with Graphviz

# Requires the graphviz Python package and the Graphviz system binaries
from sklearn.tree import export_graphviz
import graphviz

# Export to DOT format
dot_data = export_graphviz(simple_dt,
                          feature_names=feature_columns,
                          class_names=['edible', 'poisonous'],
                          filled=True,
                          rounded=True,
                          special_characters=True)

# Create the graph
graph = graphviz.Source(dot_data)
graph.render('decision_tree')  # saved as PDF by default
print("Decision tree figure saved as decision_tree.pdf")

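🏢 Hands-On Project: Customer Churn Prediction

Let's build a complete project that uses a decision tree to predict whether a customer will churn:

# Generate customer data
np.random.seed(42)
n_customers = 2000

# Customer features (column names stay in Chinese because later code refers to them)
customer_data = {
    '年龄': np.random.randint(18, 80, n_customers),        # age
    '性别': np.random.choice(['男', '女'], n_customers),    # gender: male / female
    '收入': np.random.normal(50000, 20000, n_customers),   # income
    '使用时长': np.random.randint(1, 120, n_customers),     # tenure in months
    '月消费': np.random.normal(100, 50, n_customers),      # monthly spend
    '投诉次数': np.random.poisson(2, n_customers),          # number of complaints
    '客服评分': np.random.randint(1, 6, n_customers),       # support rating, 1 to 5
    '产品数量': np.random.randint(1, 6, n_customers)        # number of products
}

# Build the DataFrame
customer_df = pd.DataFrame(customer_data)

# Keep the values in a plausible range
customer_df['收入'] = np.clip(customer_df['收入'], 10000, 200000)
customer_df['月消费'] = np.clip(customer_df['月消费'], 0, 500)

# Create the churn label from simple business logic:
# churn probability depends on complaints, support rating, and tenure
churn_probability = (
    0.1 +  # base churn rate
    0.05 * customer_df['投诉次数'] +  # more complaints, higher churn probability
    0.03 * (6 - customer_df['客服评分']) +  # lower rating, higher churn probability
    0.002 * np.maximum(0, 60 - customer_df['使用时长'])  # shorter tenure, higher churn probability
)

# Add randomness: draw the label '是否流失' (churned) from the probability
customer_df['是否流失'] = np.random.binomial(1, np.clip(churn_probability, 0, 1))

print("Customer data preview:")
print(customer_df.head())
print(f"\nChurn rate: {customer_df['是否流失'].mean():.2%}")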

Data Exploration and Preprocessing

# Data exploration
plt.figure(figsize=(15, 10))

# Churn distribution
plt.subplot(2, 3, 1)
# sort_index() fixes the order to class 0 then class 1, so the labels always match
plt.pie(customer_df['是否流失'].value_counts().sort_index(),
        labels=['Stayed', 'Churned'], autopct='%1.1f%%')
plt.title('Customer churn distribution')

# Relationship between each feature and churn
numeric_features = ['年龄', '收入', '使用时长', '月消费', '投诉次数', '客服评分']

# Only the first five are plotted: the 2x3 grid has five free panels after the pie chart
for i, feature in enumerate(numeric_features[:5]):
    plt.subplot(2, 3, i+2)
    
    # Split by outcome
    stayed = customer_df[customer_df['是否流失'] == 0][feature]
    churned = customer_df[customer_df['是否流失'] == 1][feature]
    
    plt.hist(stayed, alpha=0.7, label='Stayed', density=True)
    plt.hist(churned, alpha=0.7, label='Churned', density=True)
    plt.xlabel(feature)
    plt.ylabel('Density')
    plt.title(f'Distribution of {feature}')
    plt.legend()

plt.tight_layout()
plt.show()

# Encode the categorical feature
customer_df['性别_编码'] = LabelEncoder().fit_transform(customer_df['性别'])

# Prepare the features
features = ['年龄', '性别_编码', '收入', '使用时长', '月消费', '投诉次数', '客服评分', '产品数量']
X = customer_df[features]
y = customer_df['是否流失']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Model Training and Evaluation

# Create the decision tree model
dt_churn = DecisionTreeClassifier(
    max_depth=6,
    min_samples_split=50,
    min_samples_leaf=20,
    class_weight='balanced',  # compensate for class imbalance
    random_state=42
)

# Train the model
dt_churn.fit(X_train, y_train)

# Predict labels and the probability of churn (class 1)
y_pred = dt_churn.predict(X_test)
y_pred_proba = dt_churn.predict_proba(X_test)[:, 1]

# Evaluate the model
print("Customer churn prediction results:")
print(classification_report(y_test, y_pred, target_names=['Stayed', 'Churned']))

# Confusion matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Stayed', 'Churned'],
            yticklabels=['Stayed', 'Churned'])
plt.title('Customer churn confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

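Business Insights and Recommendations

# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': features,
    'importance': dt_churn.feature_importances_
}).sort_values('importance', ascending=False)

print("Key factors behind customer churn:")
print(feature_importance)

# Visualize feature importance
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature', palette='viridis')
plt.title('Ranking of churn drivers')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

# Identify high-risk customers.
# y_pred_proba only covers the test set, so select test rows by their index labels
# (masking the full customer_df with it would fail on the length mismatch).
high_risk_threshold = 0.7
high_risk_index = X_test.index[y_pred_proba > high_risk_threshold]
high_risk_customers = customer_df.loc[high_risk_index]

print(f"\nNumber of high-risk customers: {len(high_risk_customers)}")
print("Profile of high-risk customers:")
print(high_risk_customers[features + ['是否流失']].describe())

# Extract the decision tree rules
tree_rules = export_text(dt_churn, feature_names=features)
print("\nDecision tree rules (first 10 lines):")
print('\n'.join(tree_rules.split('\n')[:10]))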

🌟 Decision Tree Variants

The decision tree family has many members, each with its own character:

1. CART（Classification and Regression Trees）

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

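# CART is the algorithm scikit-learn implements by default
cart_classifier = DecisionTreeClassifier(criterion='gini')
# 'squared_error' replaces the old name 'mse', which was removed in scikit-learn 1.2
cart_regressor = DecisionTreeRegressor(criterion='squared_error')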

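2. ID3 and C4.5 (concepts only)

# ID3 splits on information gain; C4.5 splits on gain ratio.
# scikit-learn implements an optimized CART only: criterion='entropy' gives
# information-gain splits on binary trees, and gain ratio is not available,
# so there is no true C4.5 equivalent.

dt_id3_like = DecisionTreeClassifier(criterion='entropy')  # closest to ID3
dt_c45_like = DecisionTreeClassifier(criterion='entropy')  # same model; only a rough stand-in for C4.5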
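Because scikit-learn does not expose gain ratio, the difference between the ID3 and C4.5 criteria is easiest to see by computing both by hand. In this sketch the function names and the toy columns (`row_id`, `color`) are assumptions for illustration: `row_id` is unique per row and therefore useless for prediction, yet plain information gain rates it highest.

```python
from collections import Counter
from math import log2


def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def info_gain(feature_values, labels):
    """ID3 criterion: label entropy minus the weighted entropy after the split."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(feature_values, labels):
    """C4.5 criterion: information gain divided by the entropy of the split itself."""
    split_info = entropy(feature_values)
    return info_gain(feature_values, labels) / split_info if split_info > 0 else 0.0


labels = [1, 1, 0, 0, 1, 0, 1, 0]
row_id = list(range(8))  # one distinct value per row
color = ["red", "red", "white", "white", "red", "white", "white", "brown"]

for name, column in [("row_id", row_id), ("color", color)]:
    print(f"{name}: gain={info_gain(column, labels):.3f}, "
          f"gain ratio={gain_ratio(column, labels):.3f}")
```

Information gain picks `row_id`, since splitting on a unique identifier makes every subset trivially pure, while gain ratio divides by the large split entropy and ranks `color` higher.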

3. Extremely Randomized Trees (Extra Trees)

from sklearn.ensemble import ExtraTreesClassifier

# Extra Trees: an ensemble of trees whose split thresholds are drawn at random
extra_trees = ExtraTreesClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)

extra_trees.fit(X_train, y_train)
extra_pred = extra_trees.predict(X_test)
extra_accuracy = extra_trees.score(X_test, y_test)

print(f"Extra Trees accuracy: {extra_accuracy:.3f}")

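🚀 Advanced Techniques

1. Handling missing values

# Trees can handle missing values in principle, but support depends on the implementation
X_with_missing = X_train.astype(float)
# Randomly blank out about 10% of the cells
mask = np.random.rand(*X_with_missing.shape) < 0.1
X_with_missing = X_with_missing.mask(mask)

# Note: scikit-learn trees before version 1.3 reject NaN inputs, so impute first.
# From 1.3 on, DecisionTreeClassifier accepts NaN directly; imputation works on any version.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_with_missing)

dt_imputed = DecisionTreeClassifier(random_state=42)
dt_imputed.fit(X_imputed, y_train)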

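2. Cost-sensitive learning

# Give the classes different misclassification costs.
# Assume that missing a churner costs 5 times as much as wrongly flagging a loyal customer.
# class_weight reweights samples per class; it approximates costs rather than
# taking a full cost matrix.
class_costs = {0: 1, 1: 5}  # weight 1 for class 0 (stayed), weight 5 for class 1 (churned)

dt_cost_sensitive = DecisionTreeClassifier(
    class_weight=class_costs,
    random_state=42
)

dt_cost_sensitive.fit(X_train, y_train)
cost_pred = dt_cost_sensitive.predict(X_test)

print("Cost-sensitive decision tree results:")
print(classification_report(y_test, cost_pred, target_names=['Stayed', 'Churned']))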

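3. Feature selection

from sklearn.feature_selection import SelectFromModel

# Tree-based feature selection: keep features whose importance is at least the median.
# fit_transform refits a clone of dt_churn on the training data.
selector = SelectFromModel(dt_churn, threshold='median')
X_selected = selector.fit_transform(X_train, y_train)

print(f"Original number of features: {X_train.shape[1]}")
print(f"Number of selected features: {X_selected.shape[1]}")
print(f"Selected features: {np.array(features)[selector.get_support()]}")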

🔧 Common Problems and Solutions

Problem 1: Overfitting

# What overfitting looks like: an unrestricted tree
overfitted_dt = DecisionTreeClassifier(random_state=42)
overfitted_dt.fit(X_train, y_train)

train_acc = overfitted_dt.score(X_train, y_train)
test_acc = overfitted_dt.score(X_test, y_test)

print("Overfitted decision tree:")
print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
print(f"Train-test gap: {train_acc - test_acc:.3f}")

# Solution: pruning (here pre-pruning, through limits on depth and node size)
pruned_dt = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42
)
pruned_dt.fit(X_train, y_train)

train_acc_pruned = pruned_dt.score(X_train, y_train)
test_acc_pruned = pruned_dt.score(X_test, y_test)

print("\nPruned decision tree:")
print(f"Training accuracy: {train_acc_pruned:.3f}")
print(f"Test accuracy: {test_acc_pruned:.3f}")
print(f"Train-test gap: {train_acc_pruned - test_acc_pruned:.3f}")

Problem 2: Class imbalance

# Ways to handle class imbalance
from sklearn.utils import class_weight

# 1. Use the class_weight parameter
balanced_dt = DecisionTreeClassifier(
    class_weight='balanced',
    random_state=42
)

# 2. Compute the class weights manually
classes = np.unique(y_train)
class_weights = class_weight.compute_class_weight(
    'balanced', 
    classes=classes, 
    y=y_train
)
# Map each actual class label to its weight (do not assume labels are 0..n-1)
weight_dict = dict(zip(classes, class_weights))

manual_balanced_dt = DecisionTreeClassifier(
    class_weight=weight_dict,
    random_state=42
)

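Problem 3: Misreading feature importance

# Feature importance can be misleading.
# Add a redundant feature: a noisy near-copy of '投诉次数' (number of complaints).
# The tree can split on either column, so the importance is shared between them.
X_redundant = X_train.copy()
X_redundant['冗余特征'] = X_redundant['投诉次数'] + np.random.normal(0, 0.1, len(X_redundant))

dt_redundant = DecisionTreeClassifier(random_state=42)
dt_redundant.fit(X_redundant, y_train)

importance_with_redundant = pd.DataFrame({
    'feature': features + ['冗余特征'],
    'importance': dt_redundant.feature_importances_
}).sort_values('importance', ascending=False)

print("Importance ranking with the redundant feature included:")
print(importance_with_redundant)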
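One common cross-check for impurity-based importance is permutation importance, computed on held-out data with `sklearn.inspection.permutation_importance`. The sketch below uses a synthetic dataset, which is an assumption for illustration: one informative column, a near-copy of it, and one pure-noise column. Correlated columns can still share or hide credit under permutation, so this is a second opinion rather than a definitive answer.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_rows = 1000
signal = rng.normal(size=n_rows)                                # the only informative column
copy_of_signal = signal + rng.normal(scale=0.01, size=n_rows)   # redundant near-copy
noise_col = rng.normal(size=n_rows)                             # unrelated to the target

X_imp = np.column_stack([signal, copy_of_signal, noise_col])
y_imp = (signal > 0).astype(int)
X_a, X_b, y_a, y_b = train_test_split(X_imp, y_imp, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_a, y_a)

# Impurity-based importance: derived from the training splits, always sums to 1
impurity_imp = model.feature_importances_

# Permutation importance: drop in held-out accuracy when one column is shuffled
perm = permutation_importance(model, X_b, y_b, n_repeats=10, random_state=0)

for i, name in enumerate(["signal", "copy_of_signal", "noise"]):
    print(f"{name}: impurity={impurity_imp[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f} ± {perm.importances_std[i]:.3f}")
```

Comparing the two columns of output shows how the credit for one underlying signal can be divided between a feature and its near-copy.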