Using Decision Trees for Classification and Regression

This article builds decision tree models on the wine quality dataset. The data is first preprocessed by encoding the color feature as an integer. It is then split into training and test sets, and an untuned decision tree is evaluated on both. Next, a grid search with cross-validation tunes the tree's parameters to improve test-set accuracy. Finally, with residual sugar as the target variable, decision tree regression is applied, the model's error on the training and test data is evaluated, and the predictions are shown in a scatter plot.

In this exercise we will use the wine quality dataset. The dataset contains various chemical properties of each wine, such as acidity, sugar, pH, and alcohol content, plus two columns giving the wine's quality (3–9, higher is better) and its color (red or white). The data is stored in the file Wine_Quality_Data.csv.

Step 1:

  • Import the data and check the types of the features
  • Predict color (white or red) from all the other features; the color feature must first be encoded as an integer
# Read in the data
import pandas as pd

data = pd.read_csv("Wine_Quality_Data.csv")

data
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | red |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 | red |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 | red |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | red |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 6492 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 | white |
| 6493 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 | white |
| 6494 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 | white |
| 6495 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 | white |
| 6496 | 6.0 | 0.21 | 0.38 | 0.8 | 0.020 | 22.0 | 98.0 | 0.98941 | 3.26 | 0.32 | 11.8 | 6 | white |

6497 rows × 13 columns

# Check the data types of the features
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         6497 non-null   float64
 1   volatile_acidity      6497 non-null   float64
 2   citric_acid           6497 non-null   float64
 3   residual_sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free_sulfur_dioxide   6497 non-null   float64
 6   total_sulfur_dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  color                 6497 non-null   object 
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB
# Encode the color feature as an integer ('white' is 0, 'red' is 1)
data["color"] = data.color.map(lambda x: 1 if x=='red' else 0)

data
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | 1 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 | 1 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | 1 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 6492 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 | 0 |
| 6493 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 | 0 |
| 6494 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 | 0 |
| 6495 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 | 0 |
| 6496 | 6.0 | 0.21 | 0.38 | 0.8 | 0.020 | 22.0 | 98.0 | 0.98941 | 3.26 | 0.32 | 11.8 | 6 | 0 |

6497 rows × 13 columns
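As an aside (not in the original post), the same 0/1 encoding can be written without a lambda, as a boolean comparison cast to int; a minimal equivalent one-liner, to be run instead of (not after) the map above:

# Equivalent encoding: the comparison yields True for red wines,
# and astype(int) turns True/False into 1/0.
data["color"] = (data["color"] == "red").astype(int)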

Step 2:

  • Build X and y (use all columns except 'color' as X and the 'color' column as y)
  • Split into training and test sets so that the test set contains 1000 rows
  • Check the number of examples of each class in the training and test sets
# Build X and y
y = data.color

X = data[data.columns[:-1]]
# Split into training and test sets, with 1000 rows in the test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1000, random_state = 101)

X_test
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 868 | 6.8 | 0.56 | 0.22 | 1.8 | 0.074 | 15.0 | 24.0 | 0.99438 | 3.40 | 0.82 | 11.2 | 6 |
| 5097 | 6.4 | 0.30 | 0.36 | 2.0 | 0.052 | 18.0 | 141.0 | 0.99273 | 3.38 | 0.53 | 10.5 | 6 |
| 5450 | 5.9 | 0.17 | 0.29 | 3.1 | 0.030 | 32.0 | 123.0 | 0.98913 | 3.41 | 0.33 | 13.7 | 7 |
| 5850 | 7.0 | 0.24 | 0.24 | 1.8 | 0.047 | 29.0 | 91.0 | 0.99251 | 3.30 | 0.43 | 9.9 | 6 |
| 2192 | 6.4 | 0.45 | 0.07 | 1.1 | 0.030 | 10.0 | 131.0 | 0.99050 | 2.97 | 0.28 | 10.8 | 5 |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| 1548 | 11.2 | 0.40 | 0.50 | 2.0 | 0.099 | 19.0 | 50.0 | 0.99783 | 3.10 | 0.58 | 10.4 | 5 |
| 2500 | 8.0 | 0.29 | 0.29 | 13.2 | 0.046 | 26.0 | 113.0 | 0.99830 | 3.25 | 0.37 | 9.7 | 6 |
| 4320 | 8.2 | 0.37 | 0.64 | 13.9 | 0.043 | 22.0 | 171.0 | 0.99873 | 2.99 | 0.80 | 9.3 | 5 |
| 3430 | 7.4 | 0.49 | 0.24 | 15.1 | 0.030 | 34.0 | 153.0 | 0.99530 | 3.13 | 0.51 | 12.0 | 7 |
| 3099 | 7.4 | 0.19 | 0.49 | 9.3 | 0.030 | 26.0 | 132.0 | 0.99400 | 2.99 | 0.32 | 11.0 | 7 |

1000 rows × 12 columns

# Check the number of examples of each class in the training and test sets
print(y_train.value_counts())

print(y_test.value_counts())
0    4147
1    1350
Name: color, dtype: int64
0    751
1    249
Name: color, dtype: int64

Step 3:

  • Train a decision tree classifier on the training set with no restrictions on maximum depth, features, or leaf nodes
  • Plot and display the decision tree
  • Evaluate the tree's predictions on the training data and the test data (accuracy, recall, precision, F1) and consider what the results show
# Train a decision tree classifier with no restrictions on maximum depth, features, or leaf nodes
from sklearn.tree import DecisionTreeClassifier

treeclf = DecisionTreeClassifier()

treeclf.fit(X_train, y_train)
DecisionTreeClassifier()
import graphviz
from sklearn import tree

dot_data = tree.export_graphviz(treeclf, out_file=None, 
                         feature_names=X_train.columns.values.tolist(),
                         class_names=["white", "red"],  # 0 = white, 1 = red
                         filled=True, rounded=True,  
                         special_characters=True)
graph = graphviz.Source(dot_data)

graph

[Figure: the unconstrained decision tree rendered by graphviz]
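Rendering via graphviz requires the Graphviz binaries to be installed separately. If they are not available, scikit-learn's own sklearn.tree.plot_tree draws the same tree with matplotlib; a minimal sketch (not in the original post):

# Graphviz-free alternative: draw the fitted tree with matplotlib.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(treeclf, feature_names=X_train.columns.tolist(),
          class_names=["white", "red"], filled=True, rounded=True, ax=ax)
plt.show()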

# Evaluate the tree's predictions on the training data and the test data
from sklearn import metrics

y_train_pred = treeclf.predict(X_train)
y_test_pred = treeclf.predict(X_test)

print("训练数据")
print("accruacy:", metrics.accuracy_score(y_train, y_train_pred))
print("precision:", metrics.precision_score(y_train, y_train_pred, average = "micro"))
print("recall:", metrics.recall_score(y_train, y_train_pred, average = "micro"))
print("fscore:", metrics.f1_score(y_train, y_train_pred, average = "micro"))

print("测试数据")
print("accruacy:", metrics.accuracy_score(y_test, y_test_pred))
print("precision:", metrics.precision_score(y_test, y_test_pred, average = "micro"))
print("recall:", metrics.recall_score(y_test, y_test_pred, average = "micro"))
print("fscore:", metrics.f1_score(y_test, y_test_pred, average = "micro"))
Training data
accuracy: 0.9996361651810078
precision: 0.9996361651810078
recall: 0.9996361651810078
fscore: 0.9996361651810078
Test data
accuracy: 0.987
precision: 0.987
recall: 0.987
fscore: 0.987

The training data was used to fit the tree, so its predictions agree almost perfectly with the training labels.

The test data was not used in training, so its predictions deviate somewhat from the true test labels.
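The near-perfect training score is the signature of an unconstrained tree: it keeps splitting until its leaves are (almost) pure, effectively memorizing the training set. A quick way to see how large the tree grew, using two accessors of the fitted classifier (a sketch, not in the original post):

# Inspect the complexity of the unconstrained tree.
print("tree depth:", treeclf.get_depth())
print("leaf count:", treeclf.get_n_leaves())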

Step 4:

  • Use grid search with cross-validation (tuning the 'max_depth' and 'max_features' parameters, with 'accuracy' as the scoring metric) to obtain a better decision tree classifier
  • Plot and display the new decision tree
  • Evaluate the new tree's predictions on the training data and the test data (accuracy, recall, precision, F1) and compare with the results of Step 3
# Use grid search with cross-validation to obtain a better decision tree classifier
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': range(1,21), 'max_features': range(1,13)}

grid = GridSearchCV(treeclf, param_grid, cv = 14, scoring = "accuracy")

grid.fit(X_train,y_train)

grid.best_params_
{'max_depth': 8, 'max_features': 10}
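A usage note: with GridSearchCV's default refit=True, the search object already holds a tree refit on the whole training set with the winning parameters, so re-creating the classifier by hand (as done below) is optional; a minimal sketch:

# The refit winner and its mean cross-validated accuracy.
best_tree = grid.best_estimator_
print(grid.best_score_)  # mean CV accuracy of the best parameter combination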
# Plot and display the new decision tree
treeclf_new = DecisionTreeClassifier(max_depth = 8, max_features = 10)

treeclf_new.fit(X_train, y_train)

dot_data = tree.export_graphviz(treeclf_new, out_file=None, 
                         feature_names=X_train.columns.values.tolist(),
                         class_names=["white", "red"],  # 0 = white, 1 = red
                         filled=True, rounded=True,  
                         special_characters=True)
graph = graphviz.Source(dot_data)

graph

[Figure: the tuned decision tree rendered by graphviz]

# Evaluate the new tree's predictions on the training data and the test data
y1_train_pred = treeclf_new.predict(X_train)
y1_test_pred = treeclf_new.predict(X_test)

print("训练数据")
print("accruacy:", metrics.accuracy_score(y_train, y1_train_pred))
print("precision:", metrics.precision_score(y_train, y1_train_pred, average = "micro"))
print("recall:", metrics.recall_score(y_train, y1_train_pred, average = "micro"))
print("fscore:", metrics.f1_score(y_train, y1_train_pred, average = "micro"))

print("测试数据")
print("accruacy:", metrics.accuracy_score(y_test, y1_test_pred))
print("precision:", metrics.precision_score(y_test, y1_test_pred, average = "micro"))
print("recall:", metrics.recall_score(y_test, y1_test_pred, average = "micro"))
print("fscore:", metrics.f1_score(y_test, y1_test_pred, average = "micro"))
Training data
accuracy: 0.9963616518100782
precision: 0.9963616518100782
recall: 0.9963616518100782
fscore: 0.9963616518100782
Test data
accuracy: 0.991
precision: 0.991
recall: 0.991
fscore: 0.991

On the training data, accuracy, recall, precision, and F1 are slightly lower than before (0.9964 vs. 0.9996): the constrained tree no longer memorizes the training set.

On the test data, all four metrics rise from 0.987 to 0.991, i.e. the tuned tree generalizes better.
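Accuracy alone does not show which class the remaining errors fall on, and red wines are the minority class here. A confusion matrix on the test set makes the error distribution explicit; a minimal sketch (not in the original post):

# Rows are the true class (0 = white, 1 = red), columns the predicted class.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y1_test_pred))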

Step 5:

  • Rebuild X and y, using the residual_sugar column as y and the other columns as X
  • Split into training and test sets
  • Use cross-validation to find a good decision tree regression model
  • Evaluate it on the training and test data (mean squared error)
  • Plot the true residual_sugar values from the test data against the predicted values as a scatter plot
# Rebuild X and y: residual_sugar as y, all other columns as X.
# Note: color was already encoded as 0/1 in Step 1; re-running the
# 'red' -> 1 map here would compare integers to the string 'red'
# and zero out the whole column, so it is omitted.
y = data.residual_sugar

X = data.drop(columns = ["residual_sugar"])
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
# Sweep max_depth with cross-validation to find a good regression tree
# (a manual parameter sweep scored by cross-validated RMSE, in place of GridSearchCV)
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

max_depth_range = range(1, 29)

RMSE_scores = []

for depth in max_depth_range:

    treereg = DecisionTreeRegressor(max_depth = depth, random_state=1)
    
    # cross_val_score returns the negated MSE; negate and take the root for RMSE
    MSE_scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))
    
list(zip(max_depth_range, RMSE_scores))
[(1, 4.7412390726331335),
 (2, 3.4544154395481383),
 (3, 3.1639654273825113),
 (4, 2.9418918287093807),
 (5, 2.7628958295364643),
 (6, 2.4710769387347433),
 (7, 2.4884012677347624),
 (8, 2.3269275890462864),
 (9, 2.5139415494435156),
 (10, 2.275007010763719),
 (11, 2.2776945241660984),
 (12, 2.4592336881204067),
 (13, 2.369192162832904),
 (14, 2.4710654294832137),
 (15, 2.593863770797681),
 (16, 2.5910821121559917),
 (17, 2.591568468456111),
 (18, 2.5952083144938025),
 (19, 2.4394399583590807),
 (20, 2.4633668453055244),
 (21, 2.544115322564612),
 (22, 2.4799181648559623),
 (23, 2.5974270853875256),
 (24, 2.457877260514561),
 (25, 2.4852951159726215),
 (26, 2.3251638290482655),
 (27, 2.338758743194139),
 (28, 2.434645775126495)]
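Scanning this list for the minimum (depth 10, RMSE ≈ 2.275) is easier visually; a short plot of cross-validated RMSE against depth shows the error dropping, bottoming out, and then creeping up as deeper trees overfit (a sketch, not in the original post):

# Visualize the depth sweep: CV RMSE bottoms out near max_depth = 10.
import matplotlib.pyplot as plt

plt.plot(max_depth_range, RMSE_scores)
plt.xlabel("max_depth")
plt.ylabel("cross-validated RMSE")
plt.show()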
# Evaluate the model on the training and test data (root mean squared error);
# max_depth = 10 gave the lowest cross-validated RMSE in the sweep above
import numpy as np

treereg_new = DecisionTreeRegressor(max_depth = 10, random_state=1)

treereg_new.fit(X_train, y_train)

y_train_pred = treereg_new.predict(X_train)
y_test_pred = treereg_new.predict(X_test)

print("训练集RMSE:", np.sqrt(metrics.mean_squared_error(y_train,y_train_pred)))
print("测试集RMSE:", np.sqrt(metrics.mean_squared_error(y_test,y_test_pred)))
训练集RMSE: 1.0457068726114853
测试集RMSE: 1.8401674997587358
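The gap between training RMSE (≈1.05) and test RMSE (≈1.84) shows the depth-10 tree still overfits somewhat. To see which inputs the regression tree actually relies on when predicting residual_sugar, the fitted model exposes feature_importances_; a minimal sketch (not in the original post):

# Rank features by their importance in the fitted regression tree.
importances = pd.Series(treereg_new.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))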
# Scatter plot of the true vs. predicted residual_sugar values on the test data
import matplotlib.pyplot as plt

plt.scatter(y_test, y_test_pred)
<matplotlib.collections.PathCollection at 0x16fd2b025b0>

[Figure: scatter plot of true vs. predicted residual_sugar on the test data]
