A Decision Tree Classification Example (Code Implementation and Test Run)

qq_38220914

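Tags: decision tree, classification, machine learning

Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.

Article link: https://blog.csdn.net/qq_38220914/article/details/127567208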

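1 Requirements

Our task is to train a decision tree classifier that takes a person's height and weight as input and predicts whether the person is fat or thin.

The training data is shown below. It contains 10 samples, each with two attributes (height and weight); the third column is the class label, either "fat" or "thin". The data is saved in 1.txt.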

1.5 50 thin

1.5 60 fat

1.6 40 thin

1.6 60 fat

1.7 60 thin

1.7 80 fat

1.8 60 thin

1.8 90 fat

1.9 70 thin

1.9 80 fat

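2 Model Analysis

Decision trees branch very naturally on binary yes/no logic. But height and weight in this dataset are continuous values, so what do we do?

It takes a little more work, but it is not a real problem: we only need to find cut points that divide the continuous values into intervals, and the problem becomes binary logic again.

In this example, the decision tree's job is to find threshold values for height and weight, split the samples in two according to whether they fall at or below a threshold or above it, and build the tree from the top down.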
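To make the cut-point idea concrete, here is a minimal sketch of one common way to choose a threshold: take the midpoints between adjacent distinct values of a feature and score each by information gain. This is an illustration only, not the exact procedure scikit-learn runs internally. The helper names `entropy`, `candidate_thresholds`, and `best_split` are made up for this example, and the data is the ten samples from section 1.

```python
import math

# The ten samples from section 1: (height, weight, label)
SAMPLES = [
    (1.5, 50, "thin"), (1.5, 60, "fat"), (1.6, 40, "thin"), (1.6, 60, "fat"),
    (1.7, 60, "thin"), (1.7, 80, "fat"), (1.8, 60, "thin"), (1.8, 90, "fat"),
    (1.9, 70, "thin"), (1.9, 80, "fat"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log2(p)
    return result

def candidate_thresholds(values):
    """Midpoints between adjacent distinct values, in ascending order."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def best_split(samples, feature_index):
    """Return (threshold, information gain) for the best binary split on one feature.

    Returns (None, 0.0) when no threshold yields a positive gain.
    """
    labels = [s[-1] for s in samples]
    base = entropy(labels)
    best = (None, 0.0)
    for t in candidate_thresholds([s[feature_index] for s in samples]):
        left = [s[-1] for s in samples if s[feature_index] <= t]
        right = [s[-1] for s in samples if s[feature_index] > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(samples)
        gain = base - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

if __name__ == "__main__":
    print("height:", best_split(SAMPLES, 0))
    print("weight:", best_split(SAMPLES, 1))
```

On the full ten samples, every height value has exactly one thin and one fat sample, so no height threshold yields any gain at the root, while several weight thresholds do. A tree built this way would therefore split on weight first.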

3 Python Implementation

With Python's machine learning libraries, the implementation is quite simple and elegant.

# -*- coding: utf-8 -*-
import numpy as np
from sklearn import tree
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

# Read the data
data = []
labels = []
with open("d:\\python\\ml\\data\\1.txt") as ifile:
    for line in ifile:
        tokens = line.split()
        # Skip blank lines so they do not produce empty samples
        if not tokens:
            continue
        data.append([float(tk) for tk in tokens[:-1]])
        labels.append(tokens[-1])
x = np.array(data)
labels = np.array(labels)
y = np.zeros(labels.shape)
# Convert the labels to 0/1 (thin = 0, fat = 1)
y[labels == 'fat'] = 1

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

# Train the decision tree, using information entropy as the split criterion
clf = tree.DecisionTreeClassifier(criterion='entropy')
print(clf)
clf.fit(x_train, y_train)

# Write the structure of the decision tree to a file
with open("tree.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f)

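# feature_importances_ reflects each feature's influence: the larger the value,
# the bigger the role that feature plays in the classification
print(clf.feature_importances_)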

# Print the results on the training data
answer = clf.predict(x_train)
print(x_train)
print(answer)
print(y_train)
print(np.mean(answer == y_train))

# Precision and recall
precision, recall, thresholds = precision_recall_curve(y_train, clf.predict(x_train))
# classification_report expects class labels, so use predict rather than predict_proba
answer = clf.predict(x)
print(classification_report(y, answer, target_names = ['thin', 'fat']))

This produces the following output:

[ 0.2488562 0.7511438]

array([[ 1.6, 60. ],

[ 1.7, 60. ],

[ 1.9, 80. ],

[ 1.5, 50. ],

[ 1.6, 40. ],

[ 1.7, 80. ],

[ 1.8, 90. ],

[ 1.5, 60. ]])

array([ 1., 0., 1., 0., 0., 1., 1., 1.])

array([ 1., 0., 1., 0., 0., 1., 1., 1.])

1.0

             precision    recall  f1-score   support

       thin       0.83      1.00      0.91         5
        fat       1.00      0.80      0.89         5

avg / total       1.00      1.00      1.00         8

array([ 0., 1., 0., 1., 0., 1., 0., 1., 0., 0.])

array([ 0., 1., 0., 1., 0., 1., 0., 1., 0., 1.])

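As you can see, testing on the data the tree was trained on gives 100% accuracy. When all of the data is tested, however, one sample is misclassified.

This shows that the decision tree in this example has absorbed the rules of the training set very well, but its predictive ability is slightly weaker.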
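One way to look at predictive ability more directly is to score the classifier only on the samples that train_test_split held out, instead of on the training set or on all of the data. The following is a minimal sketch under a few assumptions: it uses a current scikit-learn (where train_test_split lives in sklearn.model_selection), embeds the ten samples inline instead of reading 1.txt, and fixes random_state=0, an arbitrary value chosen here only to make the run repeatable.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# The ten samples from section 1, embedded inline so the sketch is self-contained
x = np.array([[1.5, 50], [1.5, 60], [1.6, 40], [1.6, 60], [1.7, 60],
              [1.7, 80], [1.8, 60], [1.8, 90], [1.9, 70], [1.9, 80]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # 0 = thin, 1 = fat

# random_state=0 is an arbitrary choice made for reproducibility
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(x_train, y_train)

train_accuracy = clf.score(x_train, y_train)  # accuracy on samples the tree has seen
test_accuracy = clf.score(x_test, y_test)     # accuracy on the 2 held-out samples

print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

With only two held-out samples the test accuracy can only be 0, 0.5, or 1, so it is a very coarse estimate. On a dataset this small, the result also changes noticeably from one random split to another.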

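4 Saving the Decision Tree

Training a decision tree can be very computationally expensive, so once a tree has been trained it can be saved; predicting on new data then only requires loading the trained tree.

The code in this example already writes the structure of the decision tree to tree.dot. Opening that file makes it easy to draw the tree and shows more detail about the classification.

The tree.dot for this example looks like this: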
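As a rough sketch of saving and reloading a trained classifier, the standard-library pickle module can serialize a fitted scikit-learn estimator. The file name tree_model.pkl and the use of the system temp directory are assumptions made for this illustration, and the ten samples are embedded inline so the block runs on its own.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import tree

# The ten samples from section 1 (0 = thin, 1 = fat)
x = np.array([[1.5, 50], [1.5, 60], [1.6, 40], [1.6, 60], [1.7, 60],
              [1.7, 80], [1.8, 60], [1.8, 90], [1.9, 70], [1.9, 80]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(x, y)

# Hypothetical location for the saved model
model_path = os.path.join(tempfile.gettempdir(), "tree_model.pkl")

# Save the fitted classifier to disk
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Later (for example in another process), load it and predict without retraining
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[1.75, 85]]))
```

A pickled model should only be loaded from a source you trust, and ideally with the same scikit-learn version that produced it.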

digraph Tree {

0 [label="X[1] <= 55.0000\nentropy = 0.954434002925\nsamples = 8", shape="box"] ;

1 [label="entropy = 0.0000\nsamples = 2\nvalue = [ 2. 0.]", shape="box"] ;

0 -> 1 ;

2 [label="X[1] <= 70.0000\nentropy = 0.650022421648\nsamples = 6", shape="box"] ;

0 -> 2 ;

3 [label="X[0] <= 1.6500\nentropy = 0.918295834054\nsamples = 3", shape="box"] ;

2 -> 3 ;

4 [label="entropy = 0.0000\nsamples = 2\nvalue = [ 0. 2.]", shape="box"] ;

3 -> 4 ;

5 [label="entropy = 0.0000\nsamples = 1\nvalue = [ 1. 0.]", shape="box"] ;

3 -> 5 ;

6 [label="entropy = 0.0000\nsamples = 3\nvalue = [ 0. 3.]", shape="box"] ;

2 -> 6 ;

}

Based on this information, the decision tree should look like this:
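The same structure can also be read as nested if/else rules. The function below is a hand transcription of the tree.dot listing above, not something scikit-learn generates. It assumes that X[0] is height and X[1] is weight (the column order of the data file), that the first edge out of each node is the branch where the condition holds, and that value = [thin count, fat count] because thin was encoded as 0 and fat as 1. The name `classify` is made up for this illustration.

```python
def classify(height, weight):
    """Hand transcription of the tree in tree.dot (X[0] = height, X[1] = weight)."""
    if weight <= 55.0:        # node 0 true branch -> node 1, value = [2, 0]
        return "thin"
    if weight <= 70.0:        # node 2
        if height <= 1.65:    # node 3 true branch -> node 4, value = [0, 2]
            return "fat"
        return "thin"         # node 5, value = [1, 0]
    return "fat"              # node 6, value = [0, 3]

if __name__ == "__main__":
    for height, weight in [(1.6, 40), (1.5, 60), (1.7, 60), (1.8, 90)]:
        print(height, weight, classify(height, weight))
```

Reading the tree this way shows that weight alone decides most cases, and height is consulted only for weights above 55 and up to 70.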
