Decision Tree: Wine Dataset Analysis

Decision Trees

A decision tree is a non-parametric supervised learning method.

The decision tree algorithm has to solve two core problems:

1) How do we find the best node and the best split from the data?
2) How do we stop the tree from growing, to prevent overfitting?

The lower the impurity, the better the decision tree fits the training set.

The criterion parameter determines how impurity is computed. sklearn offers two choices:

  • pass "entropy" to use information entropy (Entropy)
  • pass "gini" to use the Gini impurity (Gini Impurity)
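As a quick aside (not part of the original post), both impurity measures are easy to compute by hand from a node's class-probability distribution:

```python
from math import log2

def entropy(probs):
    # information entropy: sum of -p * log2(p) over the class probabilities,
    # skipping zero probabilities (their contribution is defined as 0)
    return sum(-p * log2(p) for p in probs if p > 0)

def gini(probs):
    # Gini impurity: 1 - sum of p**2 over the class probabilities
    return 1 - sum(p * p for p in probs)

print(entropy([0.5, 0.5]))  # 1.0 -- the most impure two-class node
print(gini([0.5, 0.5]))     # 0.5
print(entropy([1.0]))       # 0.0 -- a pure node
```

A pure node scores 0 under both measures; the tree greedily chooses the split that lowers the chosen impurity the most.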
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.datasets import load_wine  # the wine dataset
from sklearn.model_selection import train_test_split

wine = load_wine()  # load the data
wine

My output:

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        ...,
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, ..., 1, 1, 1, ..., 2, 2, 2]),
 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='<U7'),
 'DESCR': 'Wine recognition dataset ... 178 instances, 13 numeric predictive
  attributes, 3 classes: class_0 (59), class_1 (71), class_2 (48) ...',
 'feature_names': ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash',
  'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
  'proanthocyanins', 'color_intensity', 'hue',
  'od280/od315_of_diluted_wines', 'proline']}

(The DESCR string is truncated above; it is sklearn's built-in description of the UCI wine data.)

x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3)

wine_model = DTC(criterion='entropy').fit(x_train, y_train)  # instantiate and fit on the training set
score = wine_model.score(x_test, y_test)  # evaluate the model on the test set
score  # 0.8888888888888888  (my result; it varies with the random split)
# Chinese display names for the 13 features, used when drawing the tree
chn_name = ['酒精','苹果酸','灰','灰的碱性','镁','总酚','类黄酮','非黄烷类酚类','花青素','颜色强度','色调','od280/od315稀释葡萄酒','脯氨酸']

import graphviz
from sklearn import tree

dot_data = tree.export_graphviz(wine_model
                             ,feature_names=chn_name
                             ,class_names=["香槟","冰酒","雪莉酒"]  # playful labels (Champagne, ice wine, sherry); the real classes are class_0/1/2
                             ,filled=True   # fill the nodes with color
                             ,rounded=True  # round the node corners
                             )
graph = graphviz.Source(dot_data)
graph  # draw the tree; "samples" is the node's sample count, "value" is the class breakdown of those samples

(figure: the decision tree rendered by graphviz)

[*zip(chn_name, wine_model.feature_importances_)]  # inspect the importance of each feature
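A sorted view is often easier to read than the raw list of pairs. The following is a self-contained sketch, not from the original post; it refits a tree on the full dataset with the English feature names, so the exact numbers may differ from the model above:

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
model = DecisionTreeClassifier(criterion="entropy", random_state=30).fit(wine.data, wine.target)

# pair each feature with its importance, drop unused features, sort descending
ranked = sorted(
    ((name, imp) for name, imp in zip(wine.feature_names, model.feature_importances_) if imp > 0),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```

Features with zero importance were never used for a split; the importances of the rest sum to 1.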

random_state sets the random seed used when splitting. It defaults to None. Randomness shows up more clearly in high-dimensional data; in low-dimensional data (e.g. the iris dataset) it is barely visible.

splitter controls the random choice among splits in the tree. It accepts two values:

  • pass "best": although splitting has a random component, the tree still prefers the more important features when branching (importance can be inspected through the feature_importances_ attribute)
  • pass "random": branching becomes more random; the tree grows deeper and larger because it absorbs more unnecessary information, and that unnecessary information lowers its fit to the training set.

When you expect your model to overfit, these two parameters can help reduce the chance of overfitting in the grown tree. That said, once the tree is built, we still rely on the pruning parameters to prevent overfitting.

# tune the parameters
wine_model = DTC(criterion="entropy"
                ,random_state=30
                ,splitter="random"
                ).fit(x_train, y_train)
score = wine_model.score(x_test, y_test)  # evaluate the model
score  # 0.9444444444444444

dot_data = tree.export_graphviz(wine_model
                             ,feature_names=chn_name
                             ,class_names=["香槟","冰酒","雪莉酒"]  # playful labels (Champagne, ice wine, sherry)
                             ,filled=True   # fill the nodes with color
                             ,rounded=True  # round the node corners
                             )
graph = graphviz.Source(dot_data)
graph

Pruning Parameters

Left unconstrained, a decision tree grows until the impurity measure is optimal or no more features are available to split on. Such a tree usually overfits: it performs well on the training set but badly on the test set.

The samples we collect can never match the whole population exactly, so when a tree explains the training data too well, the rules it finds inevitably include noise from the training samples, making it fit unseen data poorly.

Pruning strategies

max_depth: usually start trying from 3.

min_samples_leaf: usually used together with max_depth, starting from around 5. A branch whose leaves would contain fewer than min_samples_leaf samples is pruned away.

wine_model = DTC(criterion="entropy"
                ,random_state=30
                ,splitter="random"
                ,max_depth=3           # limit the tree's depth
                ,min_samples_leaf=3    # each leaf must keep at least 3 samples
                ,min_samples_split=10  # a node needs at least 10 samples to be split
                ).fit(x_train, y_train)
score = wine_model.score(x_test, y_test)  # evaluate the model
score  # 0.9629629629629629

dot_data = tree.export_graphviz(wine_model
                             ,feature_names=chn_name
                             ,class_names=["香槟","冰酒","雪莉酒"]  # playful labels (Champagne, ice wine, sherry)
                             ,filled=True   # fill the nodes with color
                             ,rounded=True  # round the node corners
                             )
graph = graphviz.Source(dot_data)
graph

import matplotlib.pyplot as plt

# fit a tree for each max_depth from 1 to 10 and record its test accuracy
test = []
for i in range(10):
    wine_model = DTC(max_depth=i+1, criterion='entropy', random_state=30)
    wine_model = wine_model.fit(x_train, y_train)
    score = wine_model.score(x_test, y_test)
    test.append(score)
plt.plot(range(1, 11), test, color='red', label='max_depth')
plt.legend()
plt.show()
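The curve above depends on one particular 70/30 split. A steadier way to compare depths is to average the score over several splits with cross-validation; this is a sketch using sklearn's cross_val_score, which the original post does not use:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
for depth in range(1, 6):
    model = DecisionTreeClassifier(max_depth=depth, criterion="entropy", random_state=30)
    # 5-fold cross-validation: fit on 4 folds, score on the held-out fold, 5 times
    scores = cross_val_score(model, wine.data, wine.target, cv=5)
    print(f"max_depth={depth}: mean accuracy {scores.mean():.3f}")
```

The depth where the mean accuracy stops improving is a reasonable place to cap the tree.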

(figure: test-set accuracy as max_depth grows from 1 to 10)
