Advantages of decision trees:
- The resulting model is easy to visualize and easy to interpret
- They are completely unaffected by feature scaling, so no data preprocessing (normalization or standardization) is required
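The scale-invariance point can be checked directly. A minimal sketch (not from the original article): splits depend only on the ordering of each feature's values, which standardization preserves, so a tree trained on raw features and one trained on standardized features should score the same.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
x1, x2, y1, y2 = train_test_split(data.data, data.target, random_state=0)

# Fit on the raw features
raw_tree = DecisionTreeClassifier(random_state=0).fit(x1, y1)

# Fit on standardized features (scaler fitted on the training split only)
scaler = StandardScaler().fit(x1)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(scaler.transform(x1), y1)

print(raw_tree.score(x2, y2))
print(scaled_tree.score(scaler.transform(x2), y2))  # should match the raw-feature score
```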
Classification Trees
Building a classification tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
Exploring the data
data = load_wine()
print(data.keys())
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
print(data.data)
print(data.target)
print(data.target_names)
print(data.feature_names)
[[1.423e+01 1.710e+00 2.430e+00 ... 1.040e+00 3.920e+00 1.065e+03]
[1.320e+01 1.780e+00 2.140e+00 ... 1.050e+00 3.400e+00 1.050e+03]
[1.316e+01 2.360e+00 2.670e+00 ... 1.030e+00 3.170e+00 1.185e+03]
...
[1.327e+01 4.280e+00 2.260e+00 ... 5.900e-01 1.560e+00 8.350e+02]
[1.317e+01 2.590e+00 2.370e+00 ... 6.000e-01 1.620e+00 8.400e+02]
[1.413e+01 4.100e+00 2.740e+00 ... 6.100e-01 1.600e+00 5.600e+02]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
['class_0' 'class_1' 'class_2']
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Building the classification tree
tree = DecisionTreeClassifier(random_state=0)
x1,x2,y1,y2 = train_test_split(data.data,data.target,random_state=0)
tree.fit(x1,y1)
print('Training set score:', tree.score(x1, y1))
print('Test set score:', tree.score(x2, y2))
Training set score: 1.0  (clearly overfitting; a complete, pruned version of this model appears at the end of the article)
Test set score: 0.9333333333333333
Viewing the full dataset
pd.concat([pd.DataFrame(data.data),pd.DataFrame(data.target)],axis=1)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 0
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
173 | 13.71 | 5.65 | 2.45 | 20.5 | 95.0 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740.0 | 2 |
174 | 13.40 | 3.91 | 2.48 | 23.0 | 102.0 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750.0 | 2 |
175 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835.0 | 2 |
176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 | 2 |
177 | 14.13 | 4.10 | 2.74 | 24.5 | 96.0 | 2.05 | 0.76 | 0.56 | 1.35 | 9.20 | 0.61 | 1.60 | 560.0 | 2 |
178 rows × 14 columns
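The unnamed 0–12 columns above are hard to read. A small sketch (added here, not in the original) builds the same table with proper column names by passing `feature_names`:

```python
import pandas as pd
from sklearn.datasets import load_wine

data = load_wine()
# Name the feature columns and append the class label as a 'target' column
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
print(df.shape)                    # (178, 14)
print(df.columns[:3].tolist())     # ['alcohol', 'malic_acid', 'ash']
```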
Important parameters
Impurity criterion: criterion
- Gini impurity: 'gini' (the default); usually the right choice, and faster to compute
- Information entropy: 'entropy'; worth trying when the model underfits, but slower to compute
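A quick sketch comparing the two criterion settings on the same wine split (an added illustration, not part of the original notebook):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
x1, x2, y1, y2 = train_test_split(data.data, data.target, random_state=0)

# Fit one tree per criterion and compare test accuracy
scores = {}
for crit in ('gini', 'entropy'):
    clf = DecisionTreeClassifier(criterion=crit, random_state=0).fit(x1, y1)
    scores[crit] = clf.score(x2, y2)
    print(crit, scores[crit])
```

On a dataset this small the two usually score similarly; the difference matters more on larger, noisier data.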
Plotting the tree
import graphviz
import sklearn.tree as lcy
dot_data = lcy.export_graphviz(tree
                               ,out_file=None
                               ,feature_names=data.feature_names
                               ,class_names=data.target_names
                               ,filled=True    # filled=True colors each node by its majority class
                               ,rounded=True)  # rounded=True draws nodes with rounded corners
graph = graphviz.Source(dot_data)
graph
The .feature_importances_ attribute reports each feature's importance; the importances sum to 1
a = zip(data.feature_names,tree.feature_importances_)
for i in a:
    print(i)
('alcohol', 0.0)
('malic_acid', 0.01888131743327655)
('ash', 0.0221650248129768)
('alcalinity_of_ash', 0.0)
('magnesium', 0.0)
('total_phenols', 0.0)
('flavanoids', 0.4324191914130787)
('nonflavanoid_phenols', 0.0)
('proanthocyanins', 0.0)
('color_intensity', 0.4031560026054714)
('hue', 0.0)
('od280/od315_of_diluted_wines', 0.0)
('proline', 0.1233784637351966)
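Most importances above are zero, so it helps to sort and keep only the nonzero ones. A small sketch (added illustration, not from the original):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
x1, x2, y1, y2 = train_test_split(data.data, data.target, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(x1, y1)

# Sort features by importance, highest first, and drop the zero entries
ranked = sorted(zip(data.feature_names, tree.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, imp in ranked:
    if imp > 0:
        print(f'{name}: {imp:.3f}')
```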
Pre-pruning
Without any pruning, a tree can easily grow without limit until its training score reaches 1.0, i.e. it overfits.
- max_depth: the maximum depth of the tree; a good starting value is 3
- min_samples_leaf = N: every child node produced by a split must contain at least N samples, otherwise the split is not made. Usually combined with max_depth; a good starting value is 5. A float is interpreted as a fraction of the total sample count.
- min_samples_split = N: a node must contain at least N samples before it is allowed to split
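The effect of pre-pruning can be seen by sweeping max_depth and watching the training and test scores diverge. A sketch (added illustration, not from the original):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_wine()
x1, x2, y1, y2 = train_test_split(data.data, data.target, random_state=0)

# None means unlimited depth: the tree grows until the training set is fit perfectly
results = {}
for depth in (1, 2, 3, 5, None):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(x1, y1)
    results[depth] = (clf.score(x1, y1), clf.score(x2, y2))
    print(depth, results[depth])
```

A shallow tree cannot memorize the training set; with no depth limit the training score reaches 1.0, which is the overfitting signal pre-pruning is meant to curb.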
Important attributes
- .feature_importances_ : how important each feature is to the model
Regression Trees
An important property is that they cannot extrapolate: a regression tree cannot make sensible predictions outside the range of the training data.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
# use numpy to generate noisy training data
rng = np.random.RandomState(1)
X = np.sort(5*rng.rand(80,1),axis=0)  # rand draws uniform values in [0, 1)
y = np.sin(X).ravel()
y[::5] += 3*(0.5 - rng.rand(16))      # add noise to every 5th sample
# fit two trees of different depths
tree3 = DecisionTreeRegressor(max_depth=3)
tree5 = DecisionTreeRegressor(max_depth=5)
# x1 , x2 , y1 , y2 = train_test_split(X , y , random_state=5)
tree3.fit(X , y)
tree5.fit(X ,y)
a = np.linspace(0,5,500).reshape(-1,1)
b3 = tree3.predict(a)
b5 = tree5.predict(a)
# plot the fitted curves against the noisy samples
# plt.plot(X , y , 'g.')
plt.scatter(X , y , c='g' , s=30 , label = "data" , alpha=.8)
plt.plot(a , b3 , 'r-' , label="max_depth=3" , linewidth=4)
plt.plot(a , b5 , 'b-' , label="max_depth=5" , linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("tree")
plt.legend()
plt.show()
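The no-extrapolation property mentioned above is easy to demonstrate: every query beyond the largest training x falls into the same rightmost leaf, so the prediction is a constant. A self-contained sketch (added illustration, not from the original):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)  # training inputs all lie in [0, 5)
y = np.sin(X).ravel()

reg = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Any input beyond the training range lands in the same leaf,
# so predictions there are one constant value
print(reg.predict([[6.0]]))
print(reg.predict([[100.0]]))  # same value as above
```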
tree2 = DecisionTreeClassifier(criterion='gini'
,max_depth=3
,min_samples_leaf=6
,random_state=0)
x11,x21,y11,y21 = train_test_split(data.data,data.target,random_state=0)
tree2.fit(x11,y11)
print('Training set score:',tree2.score(x11,y11))
print('Test set score:',tree2.score(x21,y21))
dot_data = lcy.export_graphviz(tree2
,out_file=None
,feature_names=data.feature_names
,class_names=data.target_names
,filled=True
,rounded=True)
graph = graphviz.Source(dot_data)
graph
Training set score: 1.0  (even at only three levels, the tree still fits the training set perfectly)
Test set score: 0.9333333333333333