Machine Learning Notes: Implementing a Decision Tree and Viewing It with Graphviz

Author: 阿卡蒂奥

Category: Machine Learning. Tags: machine learning, graphviz, decision tree

Article link: https://blog.csdn.net/akadiao/article/details/77800909

Decision tree example

The example predicts whether a customer will buy a computer.
The data is stored in the file data.csv, with the following contents:

RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no
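
To reproduce the steps that follow, the table above can be saved as data.csv. The post does not show the raw file, so the sketch below assumes a comma-separated layout with the header row shown in the table. It is written for Python 3, and the helper name write_data_csv is hypothetical.

```python
import csv

HEADER = ["RID", "age", "income", "student", "credit_rating", "Class: buys_computer"]

# (age, income, student, credit_rating, class) for RID 1..14, copied from the table.
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def write_data_csv(path="data.csv"):
    """Write the header and the 14 training rows as a comma-separated file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        for rid, row in enumerate(ROWS, start=1):
            writer.writerow([rid, *row])
```

Calling write_data_csv() with no argument creates data.csv in the current directory.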

Decision tree implementation

1. Import the required modules (the code in this post uses Python 2 syntax):

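## scikit-learn expects numeric input and cannot take categorical strings directly, so the input data must be converted first
from sklearn.feature_extraction import DictVectorizer
## the csv module is used to read the csv file
import csv
from sklearn import preprocessing
from sklearn import tree
## StringIO is imported here but not used below; this import path no longer exists in recent scikit-learn releases
from sklearn.externals.six import StringIO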

2. Read the data from the csv file:

## open data.csv; allElectronicsData is the open file object
allElectronicsData=open(r'data.csv')
## csv.reader iterates over the file one row at a time
reader=csv.reader(allElectronicsData)
## read the first row, i.e. the column titles
headers=reader.next()
print headers

Output:

['RID', 'age', 'income', 'student', 'credit_rating', 'Class: buys_computer']
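
The reader.next() call and the print statement above work only under Python 2. Under Python 3 the same step might be sketched as follows. An in-memory string stands in for data.csv so that the snippet runs on its own; the sample text is an assumption about the file's comma-separated layout.

```python
import csv
import io

# Stand-in for open('data.csv'): the header plus the first data row.
sample = io.StringIO(
    "RID,age,income,student,credit_rating,Class: buys_computer\n"
    "1,youth,high,no,fair,no\n"
)

reader = csv.reader(sample)
headers = next(reader)  # Python 3 replacement for reader.next()
print(headers)
```

With a real file, opening it in a with block (with open('data.csv', newline='') as f:) also ensures that it is closed afterwards.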

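3. Preprocess the data:
scikit-learn's decision tree requires numeric feature values, so categorical values (such as high, medium, and low in the income attribute) must be encoded first. The class labels are converted to 0/1 in the same step.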

featureList = []
labelList = []

for row in reader:
    # the last column is the class label
    labelList.append(row[-1])
    rowDict = {}
    # skip column 0 (RID) and the last column (class label)
    for i in range(1,len(row)-1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)

### each dictionary in the list corresponds to one row of the original data (featureList[0] is the first data row)
print featureList
print type(featureList[0])

Output:
Each dictionary in the generated list corresponds to one row of the original data, for example {'credit_rating': 'fair', 'age': 'youth', 'student': 'no', 'income': 'high'}.

[{'credit_rating': 'fair', 'age': 'youth', 'student': 'no', 'income': 'high'}, 
{'credit_rating': 'excellent', 'age': 'youth', 'student': 'no', 'income': 'high'}, 
{'credit_rating': 'fair', 'age': 'middle_aged', 'student': 'no', 'income': 'high'}, 
{'credit_rating': 'fair', 'age': 'senior', 'student': 'no', 'income': 'medium'}, 
{'credit_rating': 'fair', 'age': 'senior', 'student': 'yes', 'income': 'low'}, 
{'credit_rating': 'excellent', 'age': 'senior', 'student': 'yes', 'income': 'low'}, 
{'credit_rating': 'excellent', 'age': 'middle_aged', 'student': 'yes', 'income': 'low'}, 
{'credit_rating': 'fair', 'age': 'youth', 'student': 'no', 'income': 'medium'}, 
{'credit_rating': 'fair', 'age': 'youth', 'student': 'yes', 'income': 'low'}, 
{'credit_rating': 'fair', 'age': 'senior', 'student': 'yes', 'income': 'medium'}, 
{'credit_rating': 'excellent', 'age': 'youth', 'student': 'yes', 'income': 'medium'}, 
{'credit_rating': 'excellent', 'age': 'middle_aged', 'student': 'no', 'income': 'medium'}, 
{'credit_rating': 'fair', 'age': 'middle_aged', 'student': 'yes', 'income': 'high'}, 
{'credit_rating': 'excellent', 'age': 'senior', 'student': 'no', 'income': 'medium'}]
<type 'dict'>
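
The row-to-dictionary loop above can also be written with csv.DictReader, which pairs each value with its column name automatically. This is a Python 3 sketch, using a two-row in-memory sample instead of data.csv; the sample text and the constant names ID_COLUMN and LABEL_COLUMN are illustrative.

```python
import csv
import io

sample = io.StringIO(
    "RID,age,income,student,credit_rating,Class: buys_computer\n"
    "1,youth,high,no,fair,no\n"
    "3,middle_aged,high,no,fair,yes\n"
)

ID_COLUMN = "RID"
LABEL_COLUMN = "Class: buys_computer"

feature_list = []
label_list = []
for row in csv.DictReader(sample):
    label_list.append(row[LABEL_COLUMN])
    # Keep every column except the row id and the class label.
    feature_list.append(
        {k: v for k, v in row.items() if k not in (ID_COLUMN, LABEL_COLUMN)}
    )

print(feature_list)
print(label_list)
```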

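Encode the data: one-hot encoding turns each categorical string value into a set of 0/1 columns.
For age, for example, youth becomes 001, middle_aged becomes 100, and senior becomes 010: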

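## one-hot encode the feature dictionaries
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()
print "dummyX:\n"+str(dummyX)
## column names, in the same order as the columns of dummyX
print vec.get_feature_names()
## convert the yes/no class labels to 1/0
lb = preprocessing.LabelBinarizer()
dummyY=lb.fit_transform(labelList)
print "dummyY:"+str(dummyY)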

Output:

dummyX:
[[ 0.  0.  1.  0.  1.  1.  0.  0.  1.  0.]
 [ 0.  0.  1.  1.  0.  1.  0.  0.  1.  0.]
 [ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  1.  1.  0.]
 [ 0.  1.  0.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  1.  0.  1.  0.  0.  1.  0.  0.  1.]
 [ 1.  0.  0.  1.  0.  0.  1.  0.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.  0.  1.  1.  0.]
 [ 0.  0.  1.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  1.  0.  0.  1.  0.  0.  1.  0.  1.]
 [ 0.  0.  1.  1.  0.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  1.  0.  0.  0.  1.  1.  0.]
 [ 1.  0.  0.  0.  1.  1.  0.  0.  0.  1.]
 [ 0.  1.  0.  1.  0.  0.  0.  1.  1.  0.]]
['age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'income=high', 'income=low', 'income=medium', 'student=no', 'student=yes']
dummyY:
[[0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]]
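
To make the column layout above concrete, here is a plain-Python sketch of the same encoding without scikit-learn. It assumes, as the printed feature names suggest, that columns are named feature=value and sorted alphabetically. The function names are hypothetical, and this is an illustration, not DictVectorizer's actual implementation.

```python
def build_columns(feature_dicts):
    """Collect every 'feature=value' pair seen in the data, sorted by name."""
    return sorted({f"{k}={v}" for d in feature_dicts for k, v in d.items()})


def one_hot(feature_dict, columns):
    """Return a 0/1 list with a 1 in each column that matches the row."""
    present = {f"{k}={v}" for k, v in feature_dict.items()}
    return [1 if c in present else 0 for c in columns]


FIELDS = ("age", "income", "student", "credit_rating")
# Feature values for RID 1..14, copied from the data table.
RAW = [
    ("youth", "high", "no", "fair"),
    ("youth", "high", "no", "excellent"),
    ("middle_aged", "high", "no", "fair"),
    ("senior", "medium", "no", "fair"),
    ("senior", "low", "yes", "fair"),
    ("senior", "low", "yes", "excellent"),
    ("middle_aged", "low", "yes", "excellent"),
    ("youth", "medium", "no", "fair"),
    ("youth", "low", "yes", "fair"),
    ("senior", "medium", "yes", "fair"),
    ("youth", "medium", "yes", "excellent"),
    ("middle_aged", "medium", "no", "excellent"),
    ("middle_aged", "high", "yes", "fair"),
    ("senior", "medium", "no", "excellent"),
]

feature_list = [dict(zip(FIELDS, r)) for r in RAW]
columns = build_columns(feature_list)
print(columns)
print(one_hot(feature_list[0], columns))
```

The first encoded row has the same 0/1 pattern as the first row of dummyX above, with integers in place of floats.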

Use a decision tree as the classifier:

## use entropy (information gain) as the split criterion
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX,dummyY)
print "clf:"+str(clf)

Output:

clf:DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
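
criterion='entropy' asks the classifier to choose splits by information gain. The sketch below works that calculation by hand for one candidate split, the one-hot column age=middle_aged, using the 14 class labels from the data table. It is plain Python written for illustration; the helper names are hypothetical, and this is not scikit-learn's internal code.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, mask):
    """Entropy reduction from splitting labels by a True/False mask."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# age and class columns for RID 1..14, copied from the data table.
ages = ["youth", "youth", "middle_aged", "senior", "senior", "senior", "middle_aged",
        "youth", "youth", "senior", "youth", "middle_aged", "middle_aged", "senior"]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

is_middle_aged = [a == "middle_aged" for a in ages]
print(round(entropy(labels), 3))
print(round(information_gain(labels, is_middle_aged), 3))
```

With 9 yes and 5 no labels the entropy of the whole set is about 0.940 bits. All four middle_aged rows are yes and the other ten rows split 5/5, so this split leaves a weighted entropy of about 0.714 bits, a gain of about 0.226 bits.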

Write the resulting decision tree to a dot file:

with open("allElectronicsData.dot","w") as f:
    # export_graphviz writes directly to f, so its return value is not needed
    tree.export_graphviz(clf,feature_names=vec.get_feature_names(),out_file=f)

Viewing the generated decision tree with Graphviz on Windows

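Download the Graphviz installer and follow the prompts to install it. Then open allElectronicsData.dot in GVEdit to see the generated decision tree.

(Figure: the generated decision tree displayed in GVEdit.)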
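
As an alternative to opening the file in GVEdit, Graphviz's dot command-line tool can render the dot file to an image. The sketch below builds and runs that command from Python 3. It assumes the dot executable is on PATH; the helper names and any output file name are hypothetical.

```python
import shutil
import subprocess


def dot_command(dot_file, out_file, fmt="png"):
    """Build the Graphviz command line: dot -T<fmt> <dot_file> -o <out_file>."""
    return ["dot", f"-T{fmt}", dot_file, "-o", out_file]


def render(dot_file, out_file, fmt="png"):
    """Run dot if it is installed; return True when the render was run."""
    if shutil.which("dot") is None:
        print("Graphviz 'dot' not found on PATH; skipping render")
        return False
    subprocess.run(dot_command(dot_file, out_file, fmt), check=True)
    return True
```

For example, render("allElectronicsData.dot", "tree.png") would write the tree to tree.png in the current directory.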

Predicting on new data:
Take the first row of X:

oneRowX = dummyX[0,:]
print "oneRowX:"+str(oneRowX)

The first row of X is:

oneRowX:[ 0.  0.  1.  0.  1.  1.  0.  0.  1.  0.]

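Set the first element of that row to 1 and the third to 0, which changes age from youth to middle_aged. The row is copied first so that oneRowX and dummyX are left unchanged:

newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 0
print "newRowX:"+str(newRowX)

The newly constructed row is:

newRowX:[ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.]

Predict the class of the new row:

predictedY = clf.predict(newRowX.reshape(1,-1))
print "predictedY:"+str(predictedY)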

Output:

predictedY:[1]

That is, the prediction is yes: the customer buys a computer.
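
To see which customer the modified row describes, the one-hot vector can be mapped back to attribute values using the feature names printed earlier. This is a small plain-Python sketch; the decode helper is hypothetical.

```python
FEATURE_NAMES = [
    "age=middle_aged", "age=senior", "age=youth",
    "credit_rating=excellent", "credit_rating=fair",
    "income=high", "income=low", "income=medium",
    "student=no", "student=yes",
]


def decode(row, names=FEATURE_NAMES):
    """Map a one-hot row back to an {attribute: value} dictionary."""
    decoded = {}
    for flag, name in zip(row, names):
        if flag == 1:
            attribute, value = name.split("=", 1)
            decoded[attribute] = value
    return decoded


new_row = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
print(decode(new_row))
```

The decoded row is a middle_aged customer with high income who is not a student and has a fair credit rating. Those are the same attribute values as row 3 of the training data, whose class is yes, so a prediction of 1 agrees with the training set.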
