Machine Learning: Applying Decision Trees

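1. Python

2. Python machine learning library: scikit-learn

2.1 Features:

Simple and efficient tools for data mining and machine learning analysis
Accessible to everyone and highly reusable in different contexts
Built on NumPy, SciPy, and matplotlib
Open source and commercially usable under the BSD license
Install Graphviz to convert .dot files to PDF and visualize the decision tree: dot -Tpdf input.dot -o output.pdf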

2.2 Problem areas covered

Classification, regression, clustering, dimensionality reduction
Model selection, preprocessing

3. Using scikit-learn

Install scikit-learn:
Install the required packages first: NumPy, SciPy, and matplotlib.
Two good articles on sklearn:

"Ensemble Learning with sklearn: Theory"
"Ensemble Learning with sklearn: Practice"
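
Before running the implementation further down, it can help to confirm that these packages are actually importable. The sketch below uses only the standard library; the helper name missing_packages is made up for this illustration. Note that scikit-learn is imported under the name sklearn.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names, not pip names: scikit-learn is imported as "sklearn".
required = ["numpy", "scipy", "matplotlib", "sklearn"]
print("missing:", missing_packages(required))
```

An empty list means all four packages were found.
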

4. Example

RID  age          income  student  credit_rating  class_buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no
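
The implementation in section 5 trains the tree with criterion='entropy'. As a rough illustration of what that criterion measures, the sketch below computes the information gain of each attribute on the 14 rows above using only the standard library. The function names entropy and information_gain are chosen for this example, and the code is not how scikit-learn computes splits internally (it works on the one-hot encoded columns).

```python
from collections import Counter
from math import log2

# The 14 training rows: (age, income, student, credit_rating, label).
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
ATTRIBUTES = ["age", "income", "student", "credit_rating"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Label entropy minus the weighted entropy after splitting on one column."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name}: {gain:.3f}")
```

On this table age has the largest gain (about 0.247), so an ID3-style tree would split on age first.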

5. Implementation

import csv

from sklearn import preprocessing
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

# sklearn expects numeric input, so the raw data has to be preprocessed first.
# Read the CSV file and collect the features into a list of dicts
# and the class labels into a separate list.
# Python 2.x:
# allElectronicsData = open(r'AllElectronics.csv', 'rb')
# reader = csv.reader(allElectronicsData)
# headers = reader.next()
# In Python 3.x the lines above fail with:
# '_csv.reader' object has no attribute 'next'
# In Python 3.x use the following instead:
allElectronicsData = open(r'AllElectronics.csv', 'rt', newline='')
reader = csv.reader(allElectronicsData)
headers = next(reader)

print(headers)
#['RID', 'age', 'income', 'student', 'credit_rating', 'class_buys_computer']

featureList = []
labelList = []

for row in reader:
    # The last column is the class label; column 0 (RID) is skipped.
    labelList.append(row[-1])
    rowDict = {}
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)
allElectronicsData.close()

print(featureList)
'''
[{'age': 'youth', 'credit_rating': 'fair', 'income': 'high', 'student': 'no'}, 
{'age': 'youth', 'credit_rating': 'excellent', 'income': 'high', 'student': 'no'}, 
{'age': 'middle_aged', 'credit_rating': 'fair', 'income': 'high', 'student': 'no'}, 
{'age': 'senior', 'credit_rating': 'fair', 'income': 'medium', 'student': 'no'}, 
{'age': 'senior', 'credit_rating': 'fair', 'income': 'low', 'student': 'yes'}, 
{'age': 'senior', 'credit_rating': 'excellent', 'income': 'low', 'student': 'yes'}, 
{'age': 'middle_aged', 'credit_rating': 'excellent', 'income': 'low', 'student': 'yes'}, 
{'age': 'youth', 'credit_rating': 'fair', 'income': 'medium', 'student': 'no'}, 
{'age': 'youth', 'credit_rating': 'fair', 'income': 'low', 'student': 'yes'}, 
{'age': 'senior', 'credit_rating': 'fair', 'income': 'medium', 'student': 'yes'}, 
{'age': 'youth', 'credit_rating': 'excellent', 'income': 'medium', 'student': 'yes'}, 
{'age': 'middle_aged', 'credit_rating': 'excellent', 'income': 'medium', 'student': 'no'},
{'age': 'middle_aged', 'credit_rating': 'fair', 'income': 'high', 'student': 'yes'}, 
{'age': 'senior', 'credit_rating': 'excellent', 'income': 'medium', 'student': 'no'}]
'''
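# Each row is stored as a dict, so the key order shown above carries no meaning.

# Vectorize features
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()

print("dummyX: " + str(dummyX))
# Each row is one-hot encoded. Conceptually, row 1 (youth, high, no, fair -> no) becomes:
# youth middle_aged senior | high medium low | yes no | fair excellent | buys
#   1        0        0    |  1     0     0  |  0   1 |  1       0     |  0
# DictVectorizer orders the real columns alphabetically by "feature=value",
# so the printed matrix uses a different column order than this sketch.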
'''
dummyX: 
[[ 0.  0.  1.  0.  1.  1.  0.  0.  1.  0.]
 [ 0.  0.  1.  1.  0.  1.  0.  0.  1.  0.]
 [ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  1.  1.  0.]
 [ 0.  1.  0.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  1.  0.  1.  0.  0.  1.  0.  0.  1.]
 [ 1.  0.  0.  1.  0.  0.  1.  0.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.  0.  1.  1.  0.]
 [ 0.  0.  1.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  1.  0.  0.  1.  0.  0.  1.  0.  1.]
 [ 0.  0.  1.  1.  0.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  1.  0.  0.  0.  1.  1.  0.]
 [ 1.  0.  0.  0.  1.  1.  0.  0.  0.  1.]
 [ 0.  1.  0.  1.  0.  0.  0.  1.  1.  0.]]
'''
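# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from 1.0 onward.
print(vec.get_feature_names_out().tolist())
'''
['age=middle_aged', 'age=senior', 'age=youth', 
 'credit_rating=excellent', 'credit_rating=fair', 
 'income=high', 'income=low', 'income=medium', 
 'student=no', 'student=yes']
'''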
print("labelList: " + str(labelList))
#labelList: 
#['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']

# Vectorize class labels
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY: " + str(dummyY))
'''
dummyY: 
[[0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]]
'''

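# Use a decision tree for classification
# clf = tree.DecisionTreeClassifier()
'''
clf is the decision tree classifier. The criterion parameter selects how split quality is measured;
'entropy' uses information gain, the measure ID3 is based on (scikit-learn itself builds an optimized CART tree).
'''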
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf: " + str(clf))


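# Visualize model
'''
Write a .dot file that stores the tree for visualization. The tree works on the one-hot encoded columns,
so to show the original attribute names in the output, pass them through the feature_names parameter
(get_feature_names_out() needs scikit-learn 1.0+; older releases used get_feature_names()).
'''
with open("allElectronicInformationGainOri.dot", 'w') as f:
    tree.export_graphviz(clf, feature_names=vec.get_feature_names_out().tolist(), out_file=f)

'''
Finally, convert the generated .dot file to a PDF: dot -Tpdf input.dot -o output.pdf
'''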

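# With the tree built, predict the class of a demo instance

# Take the first row of data and modify it slightly
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))
#oneRowX: [ 0.  0.  1.  0.  1.  1.  0.  0.  1.  0.]

# Copy the row first: plain assignment would alias dummyX[0] and change the training data.
# The change below switches age from youth to middle_aged.
newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 0
print("newRowX: " + str(newRowX))
#newRowX: [ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.]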
#predictedY = clf.predict(newRowX)
'''
Running the line above as-is raises the following error:
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[ 0.  0.  1.  0.  1.  1.  0.  0.  1.  0.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The message asks for a reshape, so pass newRowX.reshape(1, -1) instead.
For what reshape does, see http://www.cnblogs.com/iamxyq/p/6683147.html
'''
predictedY = clf.predict(newRowX.reshape(1,-1))
print("predictedY: " + str(predictedY))
#predictedY: [1]
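
To make the dummyX columns easier to read, here is a minimal pure-Python sketch of the one-hot encoding idea. It mimics the alphabetical "feature=value" column order that DictVectorizer produces for string values, but it is not DictVectorizer's implementation; the helper names one_hot_columns and one_hot_encode are made up for this illustration.

```python
def one_hot_columns(records):
    """Collect every "feature=value" pair seen in the records, sorted alphabetically."""
    return sorted({f"{key}={value}" for record in records for key, value in record.items()})


def one_hot_encode(record, columns):
    """Return a 0/1 vector with a 1 in each column that matches the record."""
    present = {f"{key}={value}" for key, value in record.items()}
    return [1.0 if column in present else 0.0 for column in columns]


# Two rows are enough to show the idea; the full 14-row table yields 10 columns.
records = [
    {"age": "youth", "income": "high", "student": "no", "credit_rating": "fair"},
    {"age": "senior", "income": "low", "student": "yes", "credit_rating": "excellent"},
]
columns = one_hot_columns(records)
print(columns)
for record in records:
    print(one_hot_encode(record, columns))
```

Each attribute value gets its own column, which is why four categorical attributes expand to ten numeric features in dummyX.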

The generated decision tree:
[Figure: the generated decision tree]

