[Study Notes] Chen Qiang, Machine Learning with Python, Ch11: Decision Tree

赛博机器喵

Last modified: 2024-08-21 21:40:33

Tags: machine learning, study notes, python

First published: 2024-08-21 21:25:58

Article link: https://blog.csdn.net/2201_76026029/article/details/141393002

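Series contents

Supervised learning: parametric methods

Supervised learning: nonparametric methods

[Study Notes and Exercises] Chen Qiang, Machine Learning with Python, Ch10: KNN

Contents

Preface
I. A nonparametric method: decision trees
II. Regression tree example
III. Classification tree example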

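Preface

These notes are mainly a memory aid for myself, shared for fellow learners to consult. If you disagree or have suggestions, friendly discussion is welcome.

All code and data used in these notes can be downloaded from Prof. Chen Qiang's personal homepage.

Reference book: Chen Qiang (陈强), 机器学习及Python应用 (Machine Learning and Python Applications). Beijing: Higher Education Press, 2021.

For the mathematical details, see Prof. Chen Qiang's PPT slides.

Also consulted:
"Python Machine Learning 08: Decision Tree Algorithms" by the blogger 阡之尘埃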

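I. A nonparametric method: decision trees

KNN ignores the information in the response variable y when it defines neighborhoods, so it is not robust to noise variables. Decision trees address this weakness.

A decision tree can be viewed as an "adaptive nearest neighbor" method: it uses the information in y when splitting nodes, so it is largely unaffected by noise variables and is suitable for high-dimensional data.

A decision tree used for a classification problem is called a classification tree.
A decision tree used for a regression problem is called a regression tree.

In essence, a binary tree recursively partitions the feature space. Every cut is axis-parallel and splits on a single feature $x_j$, so the space is divided into rectangular (or hyper-rectangular) regions.
A classification tree is therefore a classifier that works by partitioning the feature space (classifier as partition).

Splitting criterion for classification trees: define a node impurity function $\varphi(p_1, \dots, p_K) \ge 0$, where $p_k$ is the proportion of class $k$ in the node.
Two impurity functions are commonly used in practice: the Gini index and the information entropy.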
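To make the two impurity functions concrete, here is a minimal sketch in plain Python. The function names and the example proportions are my own choices for illustration; the last vector uses the yes/no shares of the bank data reported later in these notes. Entropy is computed in bits here; a different log base would only rescale it.

```python
import math

def gini(p):
    """Gini index of a node: sum over classes of p_k * (1 - p_k)."""
    return sum(pk * (1.0 - pk) for pk in p)

def entropy(p):
    """Information entropy in bits: sum of -p_k * log2(p_k), treating 0*log(0) as 0."""
    return sum(-pk * math.log2(pk) for pk in p if pk > 0)

pure = [1.0, 0.0]                 # a node containing a single class
mixed = [0.5, 0.5]                # a perfectly mixed two-class node
bank_root = [0.890507, 0.109493]  # no/yes shares of the bank data used in Part III

for name, p in [("pure", pure), ("mixed", mixed), ("bank_root", bank_root)]:
    print(name, round(gini(p), 4), round(entropy(p), 4))
```

Both functions are zero for a pure node and largest for the evenly mixed node, which is what a tree exploits when it looks for splits that make the child nodes purer.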

II. Regression tree example

This example uses the Boston housing data boston (see [Study Notes] Chen Qiang, Machine Learning with Python, Ch4: Linear Regression).

1. Loading the data

import pandas as pd
import numpy as np

# Load the data from the original source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Each record spans two rows of the raw file; stitch them back together
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# Build the DataFrame
columns = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", 
    "PTRATIO", "B", "LSTAT"
]
df = pd.DataFrame(data, columns=columns)
df['MEDV'] = target

# Separate the features from the response
X = df.drop(columns=['MEDV'])
y = df['MEDV']


# Split the data into a training set (70%) and a test set (30%)
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0) 

X_train.shape, X_test.shape, y_train.shape, y_test.shape  # shapes of the feature matrices and target vectors

Output: ((354, 13), (152, 13), (354,), (152,))

2. Fitting a regression tree

# Fit a regression tree
from sklearn.tree import DecisionTreeRegressor, export_text

model = DecisionTreeRegressor(
    max_depth=2,  # maximum depth 2: at most 3 levels (the root plus 2 levels of splits); every internal node has 2 branches
    random_state=123)
model.fit(X_train, y_train)
model.score(X_test, y_test)  # R^2 (goodness of fit) on the test set

Output: 0.622596538377147
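For a regressor, score() returns the coefficient of determination R^2. As a reminder of what that number means, here is a small hand computation in plain Python; the helper name and the toy values are made up for illustration and are not the Boston data.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0, 40.0]  # made-up responses

perfect = r_squared(y_true, y_true)                      # predictions equal the truth
mean_only = r_squared(y_true, [25.0] * 4)                # always predict the sample mean
reversed_fit = r_squared(y_true, [40.0, 30.0, 20.0, 10.0])  # worse than the mean

print(perfect, mean_only, reversed_fit)
```

A model that always predicts the mean scores 0, a perfect model scores 1, and a model worse than the mean gets a negative value, so the 0.62 above says the depth-2 tree removes about 62% of the test-set variation around the mean.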

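`Note: DecisionTreeRegressor()`

DecisionTreeRegressor is the decision tree model that scikit-learn provides for regression problems. It learns from the data by recursively splitting the sample into smaller and smaller regions, until each region (leaf node) holds observations with the same or similar target values or a stopping rule is reached.

# Basic syntax and parameters (defaults shown)
from sklearn.tree import DecisionTreeRegressor

# Create a regression tree instance
model = DecisionTreeRegressor(
    criterion='squared_error', 
    splitter='best',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features=None,
    random_state=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    ccp_alpha=0.0)

criterion: the function that measures the quality of a split. Options:
    'squared_error': minimizes the squared error (mean squared error); this is the default.
    'friedman_mse': mean squared error with Friedman's improvement score; may suit some datasets better.
    'absolute_error': minimizes the mean absolute error.
    'poisson': based on the Poisson deviance; intended for count data.

splitter: the strategy used to choose the split at each node:
    'best': choose the best split.
    'random': choose the best random split, which adds diversity to the model.

max_depth: the maximum depth of the tree. If None, the tree keeps growing until all leaves are pure or every leaf contains fewer than min_samples_split samples. Limiting the depth helps prevent overfitting.

min_samples_split: the minimum number of samples an internal node must contain before it can be split. Default 2. Larger values help prevent overfitting.

min_samples_leaf: the minimum number of samples required in a leaf node. Default 1. Larger values smooth the model's predictions.

min_weight_fraction_leaf: the minimum fraction of the total sample weight required in a leaf node. Default 0.0. Relevant when sample weights are supplied.

max_features: the number of features considered when looking for the best split. Can be:
    an integer, the number of features;
    a float, the fraction of features;
    'sqrt' or 'log2', meaning sqrt(n_features) or log2(n_features) features;
    None (the default), meaning all features.
    The 'auto' option seen in older tutorials has been deprecated and is not accepted by recent scikit-learn releases.

random_state: the seed of the random number generator, used to make results reproducible. Can be an integer, a RandomState instance, or None.

max_leaf_nodes: the maximum number of leaf nodes. If None, the number of leaves is unlimited. Controls the complexity of the tree.

min_impurity_decrease: the minimum impurity decrease required for a split. A node is split only if the split reduces the impurity by at least this value.

ccp_alpha: the complexity parameter of minimal cost-complexity pruning. Pruning reduces the complexity of the tree and guards against overfitting; the larger ccp_alpha is, the more is pruned.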
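To see what "choosing the best split" under the squared-error criterion amounts to, here is a rough sketch for a single feature in plain Python. It is only an illustration of the idea, not scikit-learn's implementation: the helper names are hypothetical, the data are made up, and placing the threshold at the midpoint between neighboring values is an assumption of this sketch.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y, min_samples_leaf=1):
    """Return (threshold, total_sse) for the best split x <= threshold, or None."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(min_samples_leaf, len(pairs) - min_samples_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0  # midpoint (assumption)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Made-up data with an obvious jump between x = 3 and x = 10
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]

threshold, total = best_split(x, y)
print(threshold, total)

# Requiring 4 samples per leaf leaves no admissible split for 6 observations
print(best_split(x, y, min_samples_leaf=4))
```

A full tree repeats this search over every feature at every node and recurses into the two children, which is where parameters such as max_depth and min_samples_leaf act as stopping rules.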

1) The tree in text form

feature_names = columns  # the list of all feature names defined above
print(export_text(model, feature_names=feature_names))

2) Drawing the tree with plot_tree()

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plot_tree(model, 
          feature_names=feature_names,  # label the splits with the feature names of the dataset
          node_ids=True,  # show the ID of every node
          rounded=True,  # draw nodes as rounded rectangles
          precision=2)  # two decimal places
plt.tight_layout()

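Reading the resulting tree:
node #0 (full training sample, 354 observations, mean price 22.75) splits on RM (number of rooms) <= 6.8
True $\implies$ node #1, "ordinary homes" (284 observations, mean price 19.61), splits on LSTAT (share of lower-status population) <= 14.4
True $\to$ node #2 (167 observations, mean price 22.98)
False $\to$ node #3 (117 observations, mean price 14.81)
False $\implies$ node #4, "large homes" (70 observations, mean price 35.45), splits on RM <= 7.43
True $\to$ node #5 (47 observations, mean price 30.92)
False $\to$ node #6 (23 observations, mean price 44.71)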
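The fitted depth-2 tree is nothing more than a few nested if/else rules. The sketch below writes them out by hand in plain Python, using the thresholds and leaf means listed above. The function name is my own, and the thresholds are the two-decimal values shown in the plot, so predictions very close to a threshold could differ from the real model.

```python
def predict_medv(rm, lstat):
    """Hand-written copy of the depth-2 regression tree described above."""
    if rm <= 6.8:              # node #0: split on RM
        if lstat <= 14.4:      # node #1: split on LSTAT
            return 22.98       # node #2
        return 14.81           # node #3
    if rm <= 7.43:             # node #4: split on RM again
        return 30.92           # node #5
    return 44.71               # node #6

# Hypothetical houses: (rooms, share of lower-status population)
for rm, lstat in [(6.0, 10.0), (6.0, 20.0), (7.0, 5.0), (8.0, 5.0)]:
    print(rm, lstat, predict_medv(rm, lstat))
```

Every observation that lands in the same leaf receives the same prediction, namely the mean price of the training observations in that leaf.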

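`Note: plot_tree()`

plot_tree() is the scikit-learn function for visualizing a decision tree. It draws the fitted tree as a diagram, which helps in understanding the model's structure and decision process.

# Basic syntax and parameters (defaults shown)
from sklearn.tree import plot_tree

# Draw the decision tree
plot_tree(
    decision_tree, 
    max_depth=None, 
    feature_names=None,
    class_names=None,
    label='all',
    filled=False,
    impurity=True,
    node_ids=False,
    proportion=False,
    rounded=False,
    precision=3,
    ax=None,
    fontsize=None)

decision_tree:
    Required. The fitted tree to draw, an instance of DecisionTreeClassifier or DecisionTreeRegressor.

max_depth: optional.
    The maximum depth to draw.
    If given, only the levels up to max_depth are drawn.
    If None, the whole tree is drawn.

feature_names: optional.
    A list of feature names used to label the split at each node. If omitted, generic names based on the feature indices are used.

class_names: optional.
    A list of class names used to show the target labels in a classification task. Can be omitted for a regression model.

label: optional.
    Controls where the informative labels (impurity, samples, value and so on) are shown.
    Can be 'all', 'root', or 'none':
        'all' shows the labels at every node,
        'root' shows them only at the root node,
        'none' shows them at no node.

filled: optional.
    Boolean. Whether to color the nodes to indicate the majority class (classification) or the magnitude of the value (regression).
    If True, the nodes are filled with color;
    if False, they are not.

impurity: optional.
    Boolean. Whether to show the impurity at each node.

node_ids: optional.
    Boolean. Whether to show the ID number of each node.

proportion: optional.
    Boolean. Whether to display proportions instead of counts.
    If True, 'value' and 'samples' are shown as proportions and percentages;
    if False, they are shown as counts.

rounded: optional.
    Boolean. Whether to round the corners of the node boxes.
    If True, nodes are drawn as rounded rectangles;
    if False, as plain rectangles.

precision: optional.
    The number of decimal digits used for the floating-point values (impurity, threshold, value) shown in each node.

ax: optional.
    A matplotlib Axes object on which to draw the tree.
    If None, the current axes are used.

fontsize: optional.
    The font size of the node text.
    If None, the size is chosen automatically to fit the figure.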

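3. Optimal tree size: the best generalization performance

The model above uses only 2 of the 13 features (RM and LSTAT), yet its test-set R^2 already reaches 0.62.
The optimal size of a decision tree can be determined by cross-validating the cost-complexity parameter ccp_alpha.

`Note: ccp_alpha`

ccp_alpha is the parameter that controls pruning in scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor. It sets the pruning strength and thereby helps improve the model's generalization.
ccp_alpha is a non-negative float, the complexity parameter. With ccp_alpha = 0 (the default) no pruning is done and no branch is removed. The larger ccp_alpha is, the more branches are pruned and the simpler the model becomes.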
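The idea behind ccp_alpha is the cost-complexity criterion: among the candidate subtrees, prefer the one that minimizes (training error) + alpha * (number of leaves). The following sketch illustrates that trade-off in plain Python. The candidate subtrees and their errors are hypothetical numbers I made up, and scikit-learn finds the candidates by weakest-link pruning rather than from a fixed list like this one.

```python
# Hypothetical nested subtrees: (number of leaves, total training error)
candidates = [(1, 84.0), (2, 46.0), (4, 26.0), (8, 16.0), (16, 12.0)]

def cost_complexity(n_leaves, error, alpha):
    """Penalized cost of a subtree: error + alpha * number of leaves."""
    return error + alpha * n_leaves

def best_subtree(alpha):
    """Pick the candidate with the smallest penalized cost for a given alpha."""
    return min(candidates, key=lambda c: cost_complexity(c[0], c[1], alpha))

for alpha in [0.0, 1.0, 5.0, 50.0]:
    n_leaves, error = best_subtree(alpha)
    print(f"alpha={alpha}: keep the subtree with {n_leaves} leaves (error {error})")
```

With alpha = 0 the largest tree wins because only the training error counts; as alpha grows, each extra leaf has to pay for itself, so smaller subtrees are selected.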

1) Cost-complexity pruning: the cost_complexity_pruning_path() method

model_123 = DecisionTreeRegressor(random_state=123)
path = model_123.cost_complexity_pruning_path(X_train, y_train)  # compute the cost-complexity pruning path of the tree
max(path.ccp_alphas), max(path.impurities)

Output: (39.791826179538845, 84.76451346994725)

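`Note: cost_complexity_pruning_path()`

cost_complexity_pruning_path is a method of scikit-learn's DecisionTreeRegressor and DecisionTreeClassifier. It computes the pruning path of minimal cost-complexity pruning, which is then used to prune the tree.
Moving along the path with different pruning strengths adjusts the complexity of the tree.
The goal of pruning is to reduce the complexity of the tree and prevent overfitting while preserving predictive power.

# Basic syntax and parameters
DecisionTreeRegressor.cost_complexity_pruning_path(
    X,  # feature data (array or DataFrame), shape (n_samples, n_features)
    y,  # target values (array), shape (n_samples,)
    sample_weight=None)  # optional sample weights

Return value of cost_complexity_pruning_path
A dictionary-like object with two entries:
 ccp_alphas (array)
    The effective values of the cost-complexity parameter α along the pruning path. They control the pruning strength: a larger α means stronger pruning, that is, fewer leaf nodes.
 impurities (array)
    The total leaf impurity of the subtree that corresponds to each α in ccp_alphas. Impurity normally rises as α increases, because the tree is pruned into a simpler one.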
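Here is a small self-contained check of how the pruning path behaves. It is only a sketch under my own assumptions: it uses scikit-learn's bundled diabetes data instead of the Boston data (so it runs without a download), and the three α values picked from the path are an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

# Bundled regression dataset, used here only as a stand-in
X_demo, y_demo = load_diabetes(return_X_y=True)

path_demo = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(
    X_demo, y_demo)
alphas, impurities = path_demo.ccp_alphas, path_demo.impurities
print(len(alphas), alphas.max(), impurities.max())

# Refit with a small, a middle, and the largest alpha from the path
leaves = []
for alpha in [alphas[0], alphas[len(alphas) // 2], alphas[-1]]:
    alpha = max(0.0, float(alpha))  # guard against tiny negative rounding errors
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_demo, y_demo)
    leaves.append(pruned.get_n_leaves())
print(leaves)
```

The two arrays have the same length, and the number of leaves shrinks as α moves from the start of the path to its end.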

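2) Relationship between α (the cost-complexity parameter) and the total leaf mean squared error (MSE)

For a regression tree, the impurity is the MSE.

plt.plot(path.ccp_alphas,  # pruning-strength values α from cost_complexity_pruning_path; the larger α is, the simpler the pruned tree
         path.impurities,  # total leaf impurity for each α; smaller values mean a closer fit to the training data, though an overly complex tree may overfit
         marker='o', 
         drawstyle='steps-post')  # 'steps-post' style: the line stays flat after each point and then jumps, giving a staircase
plt.xlabel('alpha (cost-complexity parameter)')
plt.ylabel('Total Leaf MSE')
plt.title('Total Leaf MSE vs alpha for Training Set')

As α increases, pruning removes more of the tree, so the total leaf impurity on the training set rises. It is lowest at α = 0 (the unpruned tree) and highest when only the root remains. The plot is therefore an increasing staircase that shows the trade-off between model complexity and training impurity.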

3) Choosing the best ccp_alpha by cross-validation

from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV

param_grid = {'ccp_alpha': path.ccp_alphas} 
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(DecisionTreeRegressor(random_state=123), param_grid, cv=kfold)
model.fit(X_train, y_train)
# Get the best parameter
model.best_params_

Output: {'ccp_alpha': 0.03671186440677543}
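To show roughly what the grid search does under the hood, the sketch below runs the same kind of search as an explicit loop: for each candidate α it averages the validation scores over the folds and keeps the α with the highest average. Several things here are my own assumptions for the sake of a quick, self-contained example: the bundled diabetes data replaces the Boston data, the grid is thinned to every 20th α, and 5 folds are used instead of 10.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = load_diabetes(return_X_y=True)  # stand-in dataset

# Candidate alphas from the pruning path, clipped at 0 and thinned for speed
path_demo = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(
    X_demo, y_demo)
alpha_grid = np.unique(np.clip(path_demo.ccp_alphas, 0.0, None))[::20]

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=1)
mean_scores = []
for alpha in alpha_grid:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    fold_scores = cross_val_score(tree, X_demo, y_demo, cv=kfold_demo)  # R^2 per fold
    mean_scores.append(fold_scores.mean())

best_alpha = alpha_grid[int(np.argmax(mean_scores))]
print(best_alpha, max(mean_scores))
```

GridSearchCV additionally refits the winning model on the whole training set, which is why best_estimator_ can be used directly afterwards.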

4) The best model

model = model.best_estimator_  # best_estimator_ is the best-performing model found by GridSearchCV
model.score(X_test, y_test)  # R^2 on the test set

Output: 0.6705389109763318

5) Plotting the tree of the best model

plot_tree(model, 
          feature_names=feature_names,
          node_ids=True, 
          rounded=True, 
          precision=2)
plt.tight_layout()

6) Depth and number of leaves of the best model

# Depth of the tree
model.get_depth()
# Number of leaf nodes
model.get_n_leaves()

Output: 10
71

4. Variable importance

1) Inspecting variable importance

model.feature_importances_

Output: array([0.07403082, 0.002995  , 0.01108218, 0.        , 0.00842927,
       0.60539031, 0.01294712, 0.06840243, 0.00158878, 0.00650786,
       0.0253731 , 0.0081025 , 0.17515063])

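2) Bar chart of variable importance

# argsort() returns the indices that sort the importance array in ascending order.
sorted_index = model.feature_importances_.argsort()

# plt.barh() draws a horizontal bar chart.
plt.barh(range(X.shape[1]),  # bar positions on the y axis
         model.feature_importances_[sorted_index])  # importances in sorted order

# plt.yticks() sets the tick positions and labels of the y axis.
plt.yticks(np.arange(X.shape[1]),  # tick positions on the y axis
           X.columns[sorted_index])  # feature names in order of importance
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Decision Tree')
plt.tight_layout()

The bar chart shows that RM is the most important variable, followed by LSTAT, then CRIM, and then DIS and PTRATIO.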
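The ranking can also be read off without a plot. The sketch below pairs the feature names with the importance values printed above and sorts them in plain Python; the variable names are my own. scikit-learn normalizes the importances, so they should add up to 1.

```python
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Values copied from the feature_importances_ output above
importances = [0.07403082, 0.002995, 0.01108218, 0.0, 0.00842927,
               0.60539031, 0.01294712, 0.06840243, 0.00158878, 0.00650786,
               0.0253731, 0.0081025, 0.17515063]

# Sort (name, importance) pairs from most to least important
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
top5 = [name for name, _ in ranked[:5]]
total = sum(importances)

for name, value in ranked[:5]:
    print(f"{name:8s} {value:.4f}")
print("sum of importances:", round(total, 6))
```

RM alone accounts for about 61% of the total impurity reduction and LSTAT for about 18%, which matches the two variables the depth-2 tree picked.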

5. Prediction

pred = model.predict(X_test)  # predict on the test set
# Scatter plot of the predicted response pred against the actual values y_test
plt.scatter(pred, y_test, 
            alpha=0.6)  # opacity 0.6, so that overlapping points remain visible
w = np.linspace(min(pred), 
                max(pred), 
                100)  # 100 evenly spaced points from the smallest to the largest prediction
plt.plot(w, w)  # the diagonal y = x, where prediction and actual value agree exactly; deviations from it show the prediction errors
plt.xlabel('pred')
plt.ylabel('y_test')
plt.title('Tree Prediction')

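III. Classification tree example

This example uses bank-additional.csv, a marketing dataset from a Portuguese bank.
Response variable y: takes the value yes or no, indicating whether the customer bought the bank's term deposit product after receiving a direct marketing call.

1. Loading the data

1) Reading the CSV file

import pandas as pd
import numpy as np

# Path of the CSV file (the author's local path; adjust it to your own machine)
csv_path = r'D:\桌面文件\Python\【陈强-机器学习】MLPython-PPT-PDF\MLPython_Data\bank-additional.csv'
bank = pd.read_csv(csv_path, sep=';')
bank.shape

Output: (4119, 21)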

2) Preparing the raw data

# Check the class proportions of y
bank.y.value_counts(normalize=True)

Output:
y
no     0.890507
yes    0.109493
Name: proportion, dtype: float64

# Drop the variable duration, which is not needed
bank = bank.drop('duration', axis=1)
# Check the data types
bank.dtypes

Output:
age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

# Feature variables X
X_raw = bank.iloc[:, :-1]
X = pd.get_dummies(X_raw)  # create dummy variables for the categorical columns
# Extract y
y = bank.iloc[:, -1]
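For readers who have not used pd.get_dummies before, here is a tiny illustration of what it does to a mixed DataFrame. The toy table is made up and only borrows two column names from the bank data.

```python
import pandas as pd

# Made-up miniature of the bank data: one numeric and one categorical column
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["single", "married", "single"],
})

dummies = pd.get_dummies(toy)
print(dummies)
print(list(dummies.columns))
```

Numeric columns pass through unchanged, while each categorical column is replaced by one indicator column per category. This is why the feature matrix of the bank data ends up with many more columns than the raw table.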

3) Splitting the sample

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=1000, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Output: ((3119, 62), (1000, 62), (3119,), (1000,))

2. Classification tree

1) Fitting a classification tree

# Fit a decision tree
from sklearn.tree import DecisionTreeClassifier, plot_tree
model = DecisionTreeClassifier(max_depth=2, 
                               random_state=123)
model.fit(X_train, y_train)

model.score(X_test, y_test)

Output: 0.904

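`Note: DecisionTreeClassifier()`

DecisionTreeClassifier is a machine learning model for classification tasks and belongs to the decision tree family.

# Basic syntax and parameters (defaults shown)
from sklearn.tree import DecisionTreeClassifier
# Initialize the decision tree classifier and set its parameters
clf = DecisionTreeClassifier(
    criterion='gini',  # measure of split quality; default 'gini' (Gini index), alternative 'entropy' (information entropy)
    splitter='best',  # strategy for choosing the split: 'best' picks the best split, 'random' picks the best random split
    max_depth=None,  # maximum depth of the tree; limiting it helps prevent overfitting; None means no limit
    min_samples_split=2,  # minimum number of samples an internal node needs before it can be split; default 2
    min_samples_leaf=1,  # minimum number of samples in a leaf node, so that every leaf has enough samples; default 1
    max_features=None,  # maximum number of features considered at each split; None means all features; may be an int, a float, 'sqrt' or 'log2'
    max_leaf_nodes=None,  # maximum number of leaf nodes; None means no limit
    min_impurity_decrease=0.0,  # minimum impurity decrease required for a split; limits excessive splitting; default 0.0
    class_weight=None,  # class weights for handling class imbalance; default None (equal weights); may be 'balanced' or a dict of class weights
    random_state=None)  # random seed for reproducible results; default None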
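Since the bank data are imbalanced (about 89% no versus 11% yes), it may help to see what class_weight='balanced' would do. scikit-learn computes the balanced weights as n_samples / (n_classes * class_count). The sketch below applies that formula in plain Python. The class counts are my own derivation from the 4119 rows and the proportions reported above, and I use the full-sample counts for illustration although the model itself is fitted on the training set.

```python
# Class counts derived from 4119 rows with shares 0.890507 (no) and 0.109493 (yes)
counts = {"no": 3668, "yes": 451}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * count of the class)
balanced = {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

for label, weight in balanced.items():
    print(label, round(weight, 4), "total weight:", round(weight * counts[label], 1))
```

After weighting, both classes carry the same total weight, so the rare yes class is no longer swamped by the no class when splits are evaluated.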

2) Drawing the classification tree

import matplotlib.pyplot as plt

plot_tree(model, 
          feature_names=X.columns, 
          node_ids=True, 
          rounded=True, 
          precision=2)
plt.tight_layout()

3. Optimal tree size

1) Computing the cost-complexity pruning path

model_123 = DecisionTreeClassifier(random_state=123)
path = model_123.cost_complexity_pruning_path(X_train, y_train)
max(path.ccp_alphas),  max(path.impurities)

Output: (0.029949526543893212, 0.19525458100457016)

2) Relationship between α (the cost-complexity parameter) and the total leaf impurity

plt.plot(path.ccp_alphas, path.impurities, 
         marker='o', drawstyle='steps-post')
plt.xlabel('alpha (cost-complexity parameter)')
plt.ylabel('Total Leaf Impurities')
plt.title('Total Leaf Impurities vs alpha for Training Set')

3) Choosing the best ccp_alpha by cross-validation

# 10-fold cross-validated grid search for the best hyperparameter, the penalty coefficient ccp_alpha
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV

param_grid = {'ccp_alpha': path.ccp_alphas} 
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(DecisionTreeClassifier(random_state=123), 
                     param_grid, cv=kfold)
model.fit(X_train, y_train)
 
model.best_params_

Output: {'ccp_alpha': 0.0021510777681259807}

4) The best model

model = model.best_estimator_
model.score(X_test,y_test)

Output: 0.904

5) Plotting the tree of the best model

plot_tree(model, 
          feature_names=X.columns, 
          node_ids=True, 
          rounded=True, 
          precision=2)
plt.tight_layout()

4. Variable importance

1) Inspecting variable importance

model.feature_importances_

Output:
array([0. , 0. , 0.16460096, 0. , 0. ,
0. , 0.05995227, 0. , 0.77544677, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ])

2) Bar chart of variable importance

sorted_index = model.feature_importances_.argsort()
plt.barh(range(X_train.shape[1]), 
         model.feature_importances_[sorted_index])
plt.yticks(np.arange(X_train.shape[1]),
           X_train.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Decision Tree')
plt.tight_layout()

5. Prediction

1) Predicting on the test set and computing the confusion matrix

pred = model.predict(X_test)
table = pd.crosstab(y_test, pred, 
                    rownames=['Actual'], 
                    colnames=['Predicted'])
table

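2) Computing sensitivity and kappa

# Sensitivity
table = np.array(table)
Sensitivity  = table[1, 1] / (table[1, 0] + table[1, 1])
Sensitivity
# Cohen's kappa
from sklearn.metrics import cohen_kappa_score
cohen_kappa_score(y_test, pred)

Output:
0.22018348623853212 (sensitivity: the model identifies only 22% of the customers who actually buy)
0.2960328518002493 (kappa: the agreement between predicted and actual values is only fair)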
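To see where these two numbers come from, sensitivity and Cohen's kappa can be recomputed by hand from a 2x2 confusion matrix. The original confusion matrix image is not reproduced in these notes, so the counts below are my own reconstruction: they are the counts consistent with a test set of 1000, an accuracy of 0.904, and the sensitivity reported above. Treat them as an assumption rather than the original output.

```python
# Rows: actual (no, yes); columns: predicted (no, yes).
# Reconstructed from the reported metrics, not copied from the original figure.
table = [[880, 11],
         [85, 24]]

n = sum(sum(row) for row in table)
accuracy = (table[0][0] + table[1][1]) / n               # observed agreement
sensitivity = table[1][1] / (table[1][0] + table[1][1])  # true positives / actual positives

# Agreement expected by chance, from the row and column totals
row_totals = [sum(row) for row in table]
col_totals = [table[0][j] + table[1][j] for j in range(2)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2

# Cohen's kappa: agreement beyond chance, relative to the maximum possible
kappa = (accuracy - expected) / (1 - expected)

print(accuracy, sensitivity, kappa)
```

Because almost 90% of customers say no, a classifier that mostly predicts no already agrees with the truth about 86% of the time by chance, which is why an accuracy of 0.904 translates into a kappa of only about 0.30.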

3) Computing each test-set individual's probability of buying

prob = model.predict_proba(X_test)
prob

Output:
array([[0.94008876, 0.05991124],
       [0.94008876, 0.05991124],
       [0.94008876, 0.05991124],
       ...,
       [0.94008876, 0.05991124],
       [0.94008876, 0.05991124],
       [0.94008876, 0.05991124]])
The first column is the probability of not buying and the second column is the probability of buying.

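4) Predicting with a threshold of 0.1

# Take the probability of buying and predict with a threshold of 0.1
prob_yes = prob[:, 1]
pred_new = (prob_yes >= 0.1)

# Recompute the confusion matrix from the new predictions
table = pd.crosstab(y_test, pred_new, rownames=['Actual'], colnames=['Predicted'])
table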
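The effect of lowering the threshold is easy to see on a few made-up probabilities. The sketch below is plain Python with hypothetical values and a helper name of my own. It also returns the labels yes/no, whereas the boolean pred_new above makes the crosstab columns read False/True.

```python
# Made-up probabilities of buying for five hypothetical customers
prob_yes_demo = [0.0599, 0.35, 0.08, 0.62, 0.10]

def classify(p, threshold):
    """Label a customer 'yes' when the buying probability reaches the threshold."""
    return "yes" if p >= threshold else "no"

default_labels = [classify(p, 0.5) for p in prob_yes_demo]  # roughly what predict() does
lowered_labels = [classify(p, 0.1) for p in prob_yes_demo]  # the 0.1 threshold used here

print(default_labels)
print(lowered_labels)
```

Lowering the threshold can only turn no predictions into yes predictions, so sensitivity goes up at the cost of more false positives.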

5) Accuracy and sensitivity of the predictions with threshold 0.1

table = np.array(table)
Accuracy = (table[0, 0] + table[1, 1]) / np.sum(table)
print(Accuracy)
 
Sensitivity  = table[1, 1] / (table[1, 0] + table[1, 1])
Sensitivity

Output: 0.88
0.5412844036697247

6. Fitting the classification tree with information entropy

1) Selecting the best model

# Select the best model by cross-validation
# (path is still the pruning path computed in step 3 with the default Gini criterion)
param_grid = {'ccp_alpha': path.ccp_alphas}
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', random_state=123), 
    param_grid, cv=kfold)
 
model.fit(X_train, y_train)     
model.score(X_test, y_test)

Output: 0.904

2) Prediction

pred = model.predict(X_test)
pd.crosstab(y_test, pred, 
            rownames=['Actual'], 
            colnames=['Predicted'])

The result is the same as with the Gini index.

7. Classification tree: decision boundary plot (data: iris)

# Load the data
from sklearn.datasets import load_iris

X,y = load_iris(return_X_y=True)
X2 = X[:, 2:4]  # keep only petal length and petal width

# Fit a classification tree
model = DecisionTreeClassifier(random_state=123)
# Select the best model
path = model.cost_complexity_pruning_path(X2, y)
param_grid = {'ccp_alpha': path.ccp_alphas}
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(DecisionTreeClassifier(random_state=123), param_grid, cv=kfold)
model.fit(X2, y)
# Accuracy (in-sample: computed on the same data used for fitting)
model.score(X2, y)

Output: 0.9933333333333333

# Plot the decision boundary
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X2, y, model)
plt.xlabel('petal_length')
plt.ylabel('petal_width')
plt.title('Decision Boundary for Decision Tree')
