Chen Qiang, Machine Learning and Python Applications: 11.10 Regression Tree Case Study

E'ureka

Last modified 2024-09-07 21:04:15

First published 2024-09-07 15:51:53

Article link: https://blog.csdn.net/wjjdkwj/article/details/141995407

Contents

Preface
I. Data Preprocessing
II. Regression Tree
III. Selecting the Optimal Hyperparameter

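Preface

I am new to machine learning and use these posts as a study log; corrections are welcome.
Reference book: 机器学习及Python应用 (Machine Learning and Python Applications) by Chen Qiang
The dataset can be downloaded from Professor Chen Qiang's homepage.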

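I. Data Preprocessing

1. Data description

This case uses the Boston housing data, which records the 1970 median house value in the Boston area together with factors that may affect it. The dataset has 506 observations with 14 attributes each: 13 features, such as the per-capita crime rate by town (CRIM) and the proportion of residential land (ZN), and one target variable, the median house value (MEDV).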

2. Importing modules and the data file

1) Import all modules needed for this case

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.tree import plot_tree

2) Load the data

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # the 13 features
target = raw_df.values[1::2, 2]  # the target, MEDV (median house value)
boston = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :3]])  # features and target together
names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
Boston = pd.DataFrame(boston, columns=names)  # attach the variable names to the data
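
The slicing above is easier to follow on a toy array. The sketch below assumes, as the indexing in the loading code implies, that each record occupies two rows of the parsed table: 11 values on the first row and the remaining 3 on the second, padded with NaN. The array `records` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical data: 2 records with 14 values each (1..14 and 15..28).
n_first, n_rest = 11, 3
records = np.arange(1, 2 * 14 + 1, dtype=float).reshape(2, 14)

# Mimic the parsed file: even rows hold the first 11 values,
# odd rows hold the last 3 values followed by NaN padding.
raw = np.full((4, n_first), np.nan)
raw[::2, :] = records[:, :n_first]
raw[1::2, :n_rest] = records[:, n_first:]

# Same slicing as in the loading code.
features = np.hstack([raw[::2, :], raw[1::2, :2]])  # 11 + 2 = 13 feature columns
target = raw[1::2, 2]                               # the 14th value of each record

print(features.shape)
print(target)
```

Each pair of rows is stitched back into one record: the even rows contribute 11 columns, the odd rows contribute 2 more features, and the third value on each odd row is the target.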

3. Random train/test split

# Hold out 30% of the observations as the test set
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=0)

II. Regression Tree

1. The DecisionTreeRegressor class

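def __init__(self,
             *,
             criterion: Any = "squared_error",  # function measuring the quality of a split: "squared_error", "friedman_mse", "absolute_error" or "poisson" (the older names "mse" and "mae" are no longer accepted in recent scikit-learn releases). "squared_error" minimizes the L2 loss using the mean of each leaf; "absolute_error" minimizes the L1 loss using the median of each leaf.
             splitter: Any = "best",  # strategy used to choose the split at each node: "best" picks the best split, "random" picks the best random split
             max_depth: Any = None,  # maximum depth of the tree (None means no limit)
             min_samples_split: Any = 2,  # minimum number of samples required to split an internal node
             min_samples_leaf: Any = 1,  # minimum number of samples required at a leaf node
             min_weight_fraction_leaf: Any = 0.0,  # minimum weighted fraction of the total sample weight required at a leaf
             max_features: Any = None,  # number of features considered when looking for the best split
             random_state: Any = None,  # seed for the random number generator
             max_leaf_nodes: Any = None,  # maximum number of leaf nodes (None means no limit)
             min_impurity_decrease: Any = 0.0,  # a node is split only if the split reduces impurity by at least this amount
             ccp_alpha: Any = 0.0,  # cost-complexity pruning parameter (0.0 means no pruning)
             monotonic_cst: Any = None) -> None  # monotonicity constraints on the features (None means none)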
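
To make the "squared_error" criterion concrete, here is a minimal pure-Python sketch of how a single split on one feature could be chosen: try the midpoint between each pair of adjacent sorted values and keep the threshold that minimizes the summed squared error of the two children. The helper names `sse` and `best_split` and the toy data are made up for illustration; this is not scikit-learn's implementation, which is far more efficient and also handles sample weights and several features.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the best split 'x <= threshold'.

    Returns (None, sse(y)) if no split improves on leaving the node whole.
    """
    pairs = sorted(zip(x, y))
    best = (None, sse(list(y)))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical toy data with an obvious jump between x = 3 and x = 10.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, total = best_split(x, y)
print(threshold, total)
```

A full tree repeats this search over every feature at every node and recurses on the two children until a stopping rule such as max_depth or min_samples_leaf applies.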

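2. Creating an instance

model = DecisionTreeRegressor(max_depth=2, random_state=123)  # max_depth=2 limits the tree to a maximum depth of 2
model.fit(X_train, y_train)

print('Test R^2:', model.score(X_test, y_test))  # score() returns R^2, here on the test set

Test R^2: 0.622596538377147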
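
For a regressor, `score` reports the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers; `r2_manual` is a hypothetical helper written for this note, not a library function.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]                      # hypothetical targets
perfect = r2_manual(y_true, y_true)                # predictions equal the targets
baseline = r2_manual(y_true, [6.0, 6.0, 6.0, 6.0]) # always predict the mean
print(perfect, baseline)
```

Perfect predictions give 1, and always predicting the mean gives 0, so a test value of about 0.62 means the depth-2 tree explains roughly 62% of the variance in the test-set targets.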

3. The tree in text format

print(export_text(model,feature_names=list(names[:-1])))

|--- RM <= 6.80
|   |--- LSTAT <= 14.40
|   |   |--- value: [22.98]
|   |--- LSTAT >  14.40
|   |   |--- value: [14.81]
|--- RM >  6.80
|   |--- RM <= 7.43
|   |   |--- value: [30.92]
|   |--- RM >  7.43
|   |   |--- value: [44.71]
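
The printed tree is just a set of nested rules. As a reading aid, the function below transcribes it into plain `if`/`else` logic; `predict_medv` is a hypothetical name, and the thresholds and leaf values are copied from the two-decimal printout above, so they are rounded versions of what the fitted model stores.

```python
def predict_medv(rm, lstat):
    """Hand transcription of the depth-2 tree printed by export_text."""
    if rm <= 6.80:
        if lstat <= 14.40:
            return 22.98
        return 14.81
    if rm <= 7.43:
        return 30.92
    return 44.71


# A few hypothetical inputs, one per leaf.
print(predict_medv(6.0, 10.0))
print(predict_medv(6.0, 20.0))
print(predict_medv(7.0, 5.0))
print(predict_medv(8.0, 5.0))
```

Only RM and LSTAT appear in the rules, and every observation receives one of just four predicted values.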

4. Plotting the tree

plot_tree(model,feature_names=list(names[:-1]),node_ids=True,rounded=True,precision=2)
plt.show()

[Figure: the fitted regression tree with max_depth=2]

5. Cost-complexity parameter versus total leaf impurity (total leaf MSE)

model = DecisionTreeRegressor(random_state=123)
path = model.cost_complexity_pruning_path(X_train, y_train)  # path.ccp_alphas is the sequence of cost-complexity parameters; path.impurities is the corresponding total leaf MSE
plt.plot(path.ccp_alphas, path.impurities, marker='o', drawstyle='steps-post')  # drawstyle='steps-post' draws a step plot
plt.xlabel('alpha (cost-complexity parameter)')
plt.ylabel('Total Leaf MSE')
plt.title('Total Leaf MSE vs alpha for Training Set')
plt.show()

[Figure: step plot of total leaf MSE versus alpha on the training set]
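
For reference, the quantity being traded off along this path can be written as follows. This is the standard minimal cost-complexity criterion as I understand it from the scikit-learn documentation; the notation is my own, with $\widetilde{T}$ the set of leaves of a subtree $T$, $N_t$ the number of training samples in leaf $t$, and $N$ the total number of training samples.

```latex
R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|,
\qquad
R(T) = \sum_{t \in \widetilde{T}} \frac{N_t}{N}\,\mathrm{MSE}_t
```

With $\alpha = 0$ the full tree minimizes $R_\alpha$. As $\alpha$ grows, each extra leaf costs more, so smaller subtrees win and the total leaf MSE $R(T)$ on the training set can only stay the same or rise, which is why the plot is a non-decreasing staircase.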

III. Selecting the Optimal Hyperparameter

1. Ten-fold cross-validation

param_grid = {'ccp_alpha': path.ccp_alphas}  # candidate alphas taken from the pruning path
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(DecisionTreeRegressor(random_state=123), param_grid, cv=kfold)  # grid search
model.fit(X_train, y_train)
print('Best hyperparameter:', model.best_params_)

model = model.best_estimator_  # keep the best tree, refit on the whole training set
print('Test R^2 of the best tree:', model.score(X_test, y_test))

Best hyperparameter: {'ccp_alpha': np.float64(0.03671186440677543)}
Test R^2 of the best tree: 0.6705389109763318
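
To show what ten-fold cross-validation does to the training set, here is a minimal pure-Python sketch that partitions sample indices into folds. `kfold_indices` is a hypothetical helper; it skips the shuffling that `KFold(shuffle=True)` performs, and the sample count 354 is an assumption (the training size one would expect from a 70/30 split of 506 rows).

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        size = base + (1 if k < extra else 0)  # the first `extra` folds get one more sample
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(354, 10))  # 354 is an assumed training-set size
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)
```

For each candidate ccp_alpha, the grid search fits a tree on the nine training folds, scores it on the held-out fold, averages the ten scores, and keeps the alpha with the best average.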

2. Plotting the optimal tree

plot_tree(model,feature_names=list(names[:-1]),node_ids=True,rounded=True,precision=2)
plt.show()

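[Figure: the pruned regression tree selected by cross-validation]
Inspect the depth of the tree and its number of leaves:

print('Tree depth:', model.get_depth())
print('Number of leaves:', model.get_n_leaves())

Tree depth: 10
Number of leaves: 71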

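3. Feature importance

A feature's importance is the total reduction in impurity from the splits on that feature, normalized so that the importances sum to 1.

print('Feature importances:\n', model.feature_importances_)
sorted_index = model.feature_importances_.argsort()  # feature indices in ascending order of importance
print(sorted_index)

Feature importances:
 [0.07403082 0.002995   0.01108218 0.         0.00842927 0.60539031
 0.01294712 0.06840243 0.00158878 0.00650786 0.0253731  0.0081025
 0.17515063]

[ 3  8  1  9 11  4  2  6 10  7  0 12  5]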
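
The index array is hard to read on its own, so the sketch below pairs the printed importances with the feature names from the `names` list and ranks them in pure Python. The numbers are copied from the output above; the variable names are made up for illustration.

```python
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
importances = [0.07403082, 0.002995, 0.01108218, 0.0, 0.00842927, 0.60539031,
               0.01294712, 0.06840243, 0.00158878, 0.00650786, 0.0253731,
               0.0081025, 0.17515063]  # copied from the printed output

# Ascending order of importance, like argsort() on the NumPy array.
ascending = sorted(range(len(importances)), key=lambda i: importances[i])
# The three most important features, most important first.
top_three = [feature_names[i] for i in reversed(ascending[-3:])]

print(ascending)
print(top_three)
print(round(sum(importances), 6))
```

RM alone accounts for about 61% of the total impurity reduction and LSTAT for about 18%, while CHAS has importance 0, meaning the pruned tree never splits on it.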

Bar chart of feature importances

X = pd.DataFrame(data, columns=names[:-1])
plt.barh(range(X.shape[1]), model.feature_importances_[sorted_index])  # horizontal bar chart, least important at the bottom
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Decision Tree')
plt.tight_layout()
plt.show()

[Figure: horizontal bar chart of feature importances]

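4. Scatter plot of predictions

# Scatter plot of predicted versus actual values
pred = model.predict(X_test)
plt.scatter(pred, y_test, alpha=0.6)  # alpha=0.6 sets the transparency of the points
w = np.linspace(min(pred), max(pred), 100)
plt.plot(w, w)  # 45-degree line: points on it are predicted exactly
plt.xlabel('pred')
plt.ylabel('y_test')
plt.title('Tree Prediction')
plt.show()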