Let's test the decision tree algorithm on a real example.
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import pandas as pd
import graphviz
# 1. data
wine = load_wine()
wine_df = pd.concat([pd.DataFrame(wine.data), pd.DataFrame(wine.target)], axis=1)
feature_name = ['酒精', '苹果酸', '灰', '灰的碱性', '镁',
'总酚', '类黄酮', '非黄烷类酚类', '花青素', '颜色强度',
'色调', 'od280/od315稀释葡萄酒', '脯氨酸']
wine_df.columns = feature_name + ["label"]
print(wine_df[:10])
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target,
test_size=0.3, random_state=2020)
# 2. train model
clf = tree.DecisionTreeClassifier(criterion="gini") # splitting criterion: "gini" or "entropy"
clf = clf.fit(Xtrain, Ytrain)
# 3. evaluate
score = clf.score(Xtest, Ytest)
print(score)
# 0.9260
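The criterion="gini" argument above means splits are chosen by Gini impurity, 1 − Σ_k p_k², computed over the class proportions p_k in a node. A minimal by-hand sketch of that quantity (the gini helper below is illustrative, not part of sklearn):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 1, 1]))  # maximally mixed two-class node -> 0.5
print(gini([0, 0, 0, 0]))  # pure node -> 0.0
```

A pure node scores 0, and the tree greedily picks the split that reduces this impurity the most.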
The dataset is the wine data that ships with sklearn:
1. It has 13 features (the Chinese labels in feature_name above): alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline;
2. It has 3 classes, labeled here as "琴酒" (Gin), "雪莉" (Sherry), and "贝尔摩德" (Vermouth);
The print(wine_df[:10]) call above shows a sample of the data.
The model is trained with the Gini index as the splitting criterion and reaches 92.6% accuracy on the held-out test data.
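As a quick sanity check, the same split can be refit with the other supported criterion, "entropy". A sketch under the same random_state (which criterion scores higher varies from split to split, so no general winner is implied):

```python
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=2020)

# fit one tree per criterion on the identical split and compare test accuracy
for criterion in ("gini", "entropy"):
    clf = tree.DecisionTreeClassifier(criterion=criterion, random_state=2020)
    clf.fit(Xtrain, Ytrain)
    print(criterion, round(clf.score(Xtest, Ytest), 4))
```

In practice the two criteria usually produce very similar trees; "gini" is slightly cheaper to compute and is the sklearn default.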
# 4. plot figure
dot_data = tree.export_graphviz(clf
,feature_names= feature_name
,class_names=["琴酒", "雪莉", "贝尔摩德"]
,filled=True #color each node; the lighter the color, the higher the impurity
,rounded=True #draw the nodes with rounded corners
)
graph = graphviz.Source(dot_data)
graph.render("Tree")
graph # display the tree inline (in a Jupyter notebook)
The resulting decision tree is shown in the figure below; from the plot you can see that the '颜色强度' (color intensity) feature carries a lot of importance.