sklearn-第四节（决策树）

最新推荐文章于 2022-12-22 15:38:10 发布

~一段浮华

最新推荐文章于 2022-12-22 15:38:10 发布

阅读量276

点赞数 1

文章标签：决策树机器学习 sklearn

本文链接：https://blog.csdn.net/qq_42258383/article/details/122138048

版权

决策树

1.基本流程

决策树(decision tree) 是一类常见的机器学习方法.

有关决策树的基本知识，可见机器学习（第四章）4.决策树

以二分类任务为例，希望从给定训练数据集中学得一个模型用以对新示例进行分类，将样本分类的任务，可以看作是对于“当前样本是否为正类”这个问题的“决策”或“判定”过程。此决策过程如下图所示：

在这里插入图片描述

决策过程的最终结论对应了我们所希望的判定结果，例如"是"或"不是"好瓜;
决策过程中提出的每个判定问题都是对某个属性的"测试"，例如"色泽=?" "根蒂=?“
每个测试的结果或是导出最终结论，或是导出进一步的判定问题，其考虑范围是在上次决策结果的限定范围之内，

一般的，一棵决策树包含一个根结点、若干个内部结点和若干个叶结点;

2.代码逻辑

构建一个决策树分类模型，实现对鸢尾花的分类

鸢尾花（iris）数据集是一个经典数据集，在统计学习和机器学习领域都经常被用作示例。数据集内包含 3 类共 150 条记录，每类各 50 个数据，每条记录都有 4 项特征：花萼长度、花萼宽度、花瓣长度、花瓣宽度，可以通过这4个特征预测鸢尾花卉属于（iris-setosa, iris-versicolour, iris-virginica）中的哪一品种

2.1引入库

import seaborn as sns
from pandas import plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree

2.2查看相关信息

# 加载数据集
data = load_iris() 
# 转换成.DataFrame形式
df = pd.DataFrame(data.data, columns = data.feature_names)
# 添加品种列
df['Species'] = data.target
# 查看数据集信息
print(f"数据集信息：\n{df.info()}")
# 查看前5条数据
print(f"前5条数据：\n{df.head()}")
# 查看各特征列的摘要信息
df.describe()

2.3通过Violinplot和 Pointplot，分别从数据分布和斜率，观察各特征与品种之间的关系

# 设置颜色主题
antV = ['#1890FF', '#2FC25B', '#FACC14', '#223273', '#8543E0', '#13C2C2', '#3436c7', '#F04864'] 
# 绘制violinplot
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True) # 删除上方和右方坐标轴上不需要的边框，这在matplotlib中是无法通过参数实现的
sns.violinplot(x='Species', y=df.columns[0], data=df, palette=antV, ax=axes[0, 0])
sns.violinplot(x='Species', y=df.columns[1], data=df, palette=antV, ax=axes[0, 1])
sns.violinplot(x='Species', y=df.columns[2], data=df, palette=antV, ax=axes[1, 0])
sns.violinplot(x='Species', y=df.columns[3], data=df, palette=antV, ax=axes[1, 1])
plt.show()
# 绘制pointplot
f, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True)
sns.despine(left=True)
sns.pointplot(x='Species', y=df.columns[0], data=df, color=antV[1], ax=axes[0, 0])
sns.pointplot(x='Species', y=df.columns[1], data=df, color=antV[1], ax=axes[0, 1])
sns.pointplot(x='Species', y=df.columns[2], data=df, color=antV[1], ax=axes[1, 0])
sns.pointplot(x='Species', y=df.columns[3], data=df, color=antV[1], ax=axes[1, 1])
plt.show()
# g = sns.pairplot(data=df, palette=antV, hue= 'Species')
# 安德鲁曲线
plt.subplots(figsize = (8,6))
plotting.andrews_curves(df, 'Species', colormap='cool')

plt.show()

可以看出**’petal_length’, 'petal_width’特征重叠的值比较少**，这意味这两种特征区分能力更强。最后我们也可以看到，这两种基本上起了决定作用。

2.4训练决策树

# 加载数据集
data = load_iris() 
# 转换成.DataFrame形式
df = pd.DataFrame(data.data, columns = data.feature_names)
# 添加品种列
df['Species'] = data.target

# 用数值替代品种名作为标签
target = np.unique(data.target)
target_names = np.unique(data.target_names)
targets = dict(zip(target, target_names))
df['Species'] = df['Species'].replace(targets)

# 提取数据和标签
X = df.drop(columns="Species")
y = df["Species"]
feature_names = X.columns
labels = y.unique()

X_train, test_x, y_train, test_lab = train_test_split(X,y,
                                                 test_size = 0.4,
                                                 random_state = 42)
model = DecisionTreeClassifier(max_depth =3, random_state = 42)
model.fit(X_train, y_train) 
# 以文字形式输出树     
text_representation = tree.export_text(model)
print(text_representation)
# 用图片画出
plt.figure(figsize=(30,10), facecolor ='g') #
a = tree.plot_tree(model,
                   feature_names = feature_names,
                   class_names = labels,
                   rounded = True,
                   filled = True,
                   fontsize=14)
plt.show()

最后得到一棵决策树模型如下

决策树模型

控制台输出如下

在这里插入图片描述

实验结果表明：

先判断第 2 个特征的值，根据其值划分出两个分支；> 2.45 的分支会优先选择第 3 个特征，再根据特征值2的值进行分类。

3.总结

决策树优点：

1）简单直观，生成的决策树很直观。
2）基本不需要预处理，不需要提前归一化，处理缺失值。
3）使用决策树预测的代价是 $O\left(\log _{2} m\right)$ 。 m为样本数。
4）既可以处理离散值也可以处理连续值。很多算法只是专注于离散值或者连续值。
5）可以处理多维度输出的分类问题。
6）相比于神经网络之类的黑盒分类模型，决策树在逻辑上可以得到很好的解释
7）可以交叉验证的剪枝来选择模型，从而提高泛化能力。
8）对于异常点的容错能力好，健壮性高。

决策树算法的缺点:

1）决策树算法非常容易过拟合，导致泛化能力不强。可以通过设置节点最少样本数量和限制决策树深度来改进。
2）决策树会因为样本发生一点点的改动，就会导致树结构的剧烈改变。这个可以通过集成学习之类的方法解决。
3）寻找最优的决策树是一个NP难的问题，我们一般是通过启发式方法，容易陷入局部最优。可以通过集成学习之类的方法来改善。
4）有些比较复杂的关系，决策树很难学习，比如异或。这个就没有办法了，一般这种关系可以换神经网络分类方法来解决。
5）如果某些特征的样本比例过大，生成决策树容易偏向于这些特征。这个可以通过调节样本权重来改善。