Advantages of decision trees:
- The resulting model is easy to visualize and easy to interpret
- They are completely unaffected by feature scaling, so no data preprocessing (normalization or standardization) is required
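The scale-invariance point can be checked directly. A minimal sketch (not from the original article): splits depend only on the ordering of each feature's values, which standardization preserves, so a tree trained on raw features and one trained on standardized features should score the same.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
x1, x2, y1, y2 = train_test_split(data.data, data.target, random_state=0)

# Fit on the raw features
raw_tree = DecisionTreeClassifier(random_state=0).fit(x1, y1)

# Fit on standardized features (scaler fitted on the training split only)
scaler = StandardScaler().fit(x1)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(scaler.transform(x1), y1)

print(raw_tree.score(x2, y2))
print(scaled_tree.score(scaler.transform(x2), y2))  # should match the raw-feature score
```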
Classification Trees
Building a classification tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
Exploring the data
data = load_wine()
print(data.keys())
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
print(data.data)
print(data.target)
print(data.target_names)
print(data.feature_names)
[[1.423e+01 1.710e+00 2.430e+00 ... 1.040e+00 3.920e+00 1.065e+03]
[1.320e+01 1.780e+00 2.140e+00 ... 1.050e+00 3.400e+00 1.050e+03]
[1.316e+01 2.360e+00 2.670e+00 ... 1.030e+00 3.170e+00 1.185e+03]
...
[1.327e+01 4.280e+00 2.260e+00 ... 5.900e-01 1.560e+00 8.350e+02]
[1.317e+01 2.590e+00 2.370e+00 ... 6.000e-01 1.620e+00 8.400e+02]
[1.413e+01 4.100e+00 2.740e+00 ... 6.100e-01 1.600e+00 5.600e+02]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
['class_0' 'class_1' 'class_2']
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Building the classification tree
tree = DecisionTreeClassifier(random_state=0)
x1,x2,y1,y2 = train_test_split(data.data,data.target,random_state=0)
tree.fit(x1,y1)
print('Training set score:', tree.score(x1, y1))
print('Test set score:', tree.score(x2, y2))
Training set score: 1.0  (clearly overfitting; a complete, pruned version of this model appears at the end of the article)
Test set score: 0.9333333333333333
Viewing the full dataset
pd.concat([pd.DataFrame(data.data),pd.DataFrame(data.target)],axis=1)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 0
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
173 | 13.71 | 5.65 | 2.45 | 20.5 | 95.0 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740.0 | 2 |
174 | 13.40 | 3.91 | 2.48 | 23.0 | 102.0 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750.0 | 2 |
175 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835.0 | 2 |
176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 | 2 |
177 | 14.13 | 4.10 | 2.74 | 24.5 | 96.0 | 2.05 | 0.76 | 0.56 | 1.35 | 9.20 | 0.61 | 1.60 | 560.0 | 2 |
178 rows × 14 columns
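The unnamed 0–12 columns above are hard to read. A small sketch (added here, not in the original) builds the same table with proper column names by passing `feature_names`:

```python
import pandas as pd
from sklearn.datasets import load_wine

data = load_wine()
# Name the feature columns and append the class label as a 'target' column
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
print(df.shape)                    # (178, 14)
print(df.columns[:3].tolist())     # ['alcohol', 'malic_acid', 'ash']
```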
Important parameters
Impurity criterion: criterion
- Gini impurity: 'gini' (the default); usually the right choice, and faster to compute
- Information entropy: 'entropy'; worth trying when the model underfits, but slower to compute
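A quick sketch comparing the two criterion settings on the same wine split (an added illustration, not part of the original notebook):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
x1, x2, y1, y2 = train_test_split(data.data, data.target, random_state=0)

# Fit one tree per criterion and compare test accuracy
scores = {}
for crit in ('gini', 'entropy'):
    clf = DecisionTreeClassifier(criterion=crit, random_state=0).fit(x1, y1)
    scores[crit] = clf.score(x2, y2)
    print(crit, scores[crit])
```

On a dataset this small the two usually score similarly; the difference matters more on larger, noisier data.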
Plotting the tree
import graphviz
import sklearn.tree as lcy
dot_data = lcy.export_graphviz(tree
                               ,out_file=None
                               ,feature_names=data.feature_names
                               ,class_names=data.target_names
                               ,filled=True    # filled=True colors each node by its majority class
                               ,rounded=True)  # rounded=True draws nodes with rounded corners
graph = graphviz.Source(dot_data)
graph
The .feature_importances_ attribute reports each feature's importance; the importances sum to 1
a = zip(data.feature_names,tree.feature_importances_)
for i in a:
    print(i)
('alcohol', 0.0)
('malic_acid', 0.01888131743327655)
('ash', 0.0221650248129768)
('alcalinity_of_ash', 0.0)
('magnesium', 0.0)
('total_phenols', 0.0)
('flavanoids', 0.4324191914130787)
('nonflavanoid_phenols', 0.0)
('proanthocyanins', 0.0)
('color_intensity', 0.4031560026054714)
('hue', 0.0)
('od280/od315_of_diluted_wines', 0.0)
('proline', 0.1233784637351966)
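Most importances above are zero, so it helps to sort and keep only the nonzero ones. A small sketch (added illustration, not from the original):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
x1, x2, y1, y2 = train_test_split(data.data, data.target, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(x1, y1)

# Sort features by importance, highest first, and drop the zero entries
ranked = sorted(zip(data.feature_names, tree.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, imp in ranked:
    if imp > 0:
        print(f'{name}: {imp:.3f}')
```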
Pre-pruning
Without any pruning, a tree can easily grow without limit until its training score reaches 1.0, i.e. it overfits.
- max_depth: the maximum depth of the tree; a good starting value is 3
- min_samples_leaf = N: every child node produced by a split must contain at least N samples, otherwise the split is not made. Usually combined with max_depth; a good starting value is 5. A float is interpreted as a fraction of the total sample count.
- min_samples_split = N: a node must contain at least N samples before it is allowed to split
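The effect of pre-pruning can be seen by sweeping max_depth and watching the training and test scores diverge. A sketch (added illustration, not from the original):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
x1, x2, y1, y2 = train_test_split(data.data, data.target, random_state=0)

# None means unlimited depth: the tree grows until the training set is fit perfectly
results = {}
for depth in (1, 2, 3, 5, None):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(x1, y1)
    results[depth] = (clf.score(x1, y1), clf.score(x2, y2))
    print(depth, results[depth])
```

A shallow tree cannot memorize the training set; with no depth limit the training score reaches 1.0, which is the overfitting signal pre-pruning is meant to curb.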
Important attributes
- .feature_importances_ : how important each feature is to the model
Regression Trees
An important property is that they cannot extrapolate: a regression tree cannot make sensible predictions outside the range of the training data.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
# use numpy to generate noisy training data
rng = np.random.RandomState(1)
X = np.sort(5*rng.rand(80,1),axis=0)  # rand draws uniform values in [0, 1)
y = np.sin(X).ravel()
y[::5] += 3*(0.5 - rng.rand(16))      # add noise to every 5th sample
# fit two trees of different depths
tree3 = DecisionTreeRegressor(max_depth=3)
tree5 = DecisionTreeRegressor(max_depth=5)
# x1 , x2 , y1 , y2 = train_test_split(X , y , random_state=5)
tree3.fit(X , y)
tree5.fit(X ,y)
a = np.linspace(0,5,500).reshape(-1,1)
b3 = tree3.predict(a)
b5 = tree5.predict(a)
# plot the fitted curves against the noisy samples
# plt.plot(X , y , 'g.')
plt.scatter(X , y , c='g' , s=30 , label = "data" , alpha=.8)
plt.plot(a , b3 , 'r-' , label="max_depth=3" , linewidth=4)
plt.plot(a , b5 , 'b-' , label="max_depth=5" , linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("tree")
plt.legend()
plt.show()
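The no-extrapolation property mentioned above is easy to demonstrate: every query beyond the largest training x falls into the same rightmost leaf, so the prediction is a constant. A self-contained sketch (added illustration, not from the original):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)  # training inputs all lie in [0, 5)
y = np.sin(X).ravel()

reg = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Any input beyond the training range lands in the same leaf,
# so predictions there are one constant value
print(reg.predict([[6.0]]))
print(reg.predict([[100.0]]))  # same value as above
```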
tree2 = DecisionTreeClassifier(criterion='gini'
,max_depth=3
,min_samples_leaf=6
,random_state=0)
x11,x21,y11,y21 = train_test_split(data.data,data.target,random_state=0)
tree2.fit(x11,y11)
print('Training set score:',tree2.score(x11,y11))
print('Test set score:',tree2.score(x21,y21))
dot_data = lcy.export_graphviz(tree2
,out_file=None
,feature_names=data.feature_names
,class_names=data.target_names
,filled=True
,rounded=True)
graph = graphviz.Source(dot_data)
graph
Training set score: 1.0  (even at only three levels, the tree still fits the training set perfectly)
Test set score: 0.9333333333333333