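scikit-learn: a machine learning library for Python
Features:
Simple and efficient tools for data mining and machine learning analysis
Accessible to everybody and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source and commercially usable, released under the BSD license
Problem areas covered:
Classification, regression, clustering, dimensionality reduction
Model selection, preprocessing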
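Using scikit-learn
Install scikit-learn: pip install scikit-learn (pip ships with Python 3.4 and later). easy_install or a Windows installer can also be used, although easy_install is now deprecated.
Install the required packages: numpy, scipy, and matplotlib. Alternatively, use Anaconda, which bundles numpy, scipy, and other common scientific computing packages.
Things to check when installing: the Python interpreter version (2.7 or 3.4? Current scikit-learn releases require Python 3) and whether the system is 32-bit or 64-bit.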
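The checks above can be scripted. The following is a minimal sketch using only the standard library (Python 3.8 or later for `importlib.metadata`). The helper name `describe_environment` is hypothetical, and note that it reports the bitness of the running interpreter, which may differ from that of the operating system.

```python
import platform
import struct
from importlib import metadata


def describe_environment(packages=("scikit-learn", "numpy", "scipy", "matplotlib")):
    """Return interpreter details and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        # Pointer size in bits: 32 or 64 for the running interpreter.
        "bits": struct.calcsize("P") * 8,
        "packages": versions,
    }


info = describe_environment()
print("Python", info["python"], f"({info['bits']}-bit)")
for name, version in info["packages"].items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as "not installed" can then be added with pip or through Anaconda.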
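Documentation: http://scikit-learn.org/stable/modules/tree.html
Walking through the Python code
Install Graphviz:
http://www.graphviz.org , used for visualization: it converts the .dot file into a PDF rendering of the decision tree. The pip package below provides only the Python bindings; the Graphviz program itself is installed from the site above.
pip install graphviz
pip install numpy
pip install scipy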
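The dot-to-PDF conversion mentioned above is normally done with the Graphviz `dot` command. Here is a rough sketch of driving it from Python; the helper names `dot_to_pdf_command` and `render_dot` are hypothetical, and the sketch assumes the `dot` executable is on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def dot_to_pdf_command(dot_path):
    """Build the Graphviz command that renders a .dot file to a PDF next to it."""
    dot_path = Path(dot_path)
    pdf_path = dot_path.with_suffix(".pdf")
    return ["dot", "-Tpdf", str(dot_path), "-o", str(pdf_path)]


def render_dot(dot_path):
    """Run Graphviz if possible; return the PDF path, or None when it cannot run."""
    if shutil.which("dot") is None or not Path(dot_path).exists():
        return None  # Graphviz is not on the PATH, or the .dot file does not exist yet
    command = dot_to_pdf_command(dot_path)
    subprocess.run(command, check=True)
    return Path(command[-1])


print(dot_to_pdf_command("allElectronicInformationGainOri.dot"))
```

Calling `render_dot("allElectronicInformationGainOri.dot")` after the script has written that file should produce a PDF with the same base name; it returns None rather than raising when Graphviz is missing.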
Sample data:
RID | age | income | student | credit_rating | buys_computer |
1 | youth | high | no | fair | no |
2 | youth | high | no | excellent | no |
3 | middle_aged | high | no | fair | yes |
4 | senior | medium | no | fair | yes |
5 | senior | low | yes | fair | yes |
6 | senior | low | yes | excellent | no |
7 | middle_aged | low | yes | excellent | yes |
8 | youth | medium | no | fair | no |
9 | youth | low | yes | fair | yes |
10 | senior | medium | yes | fair | yes |
11 | youth | medium | yes | excellent | yes |
12 | middle_aged | medium | no | excellent | yes |
13 | middle_aged | high | yes | fair | yes |
14 | senior | medium | no | excellent | no |
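The script in this post reads the table from a CSV file whose first column is RID and whose last column is the class label. As a convenience, the sketch below writes the 14 rows above into that layout with the standard `csv` module. The helper name `write_sample_csv` is hypothetical, and the demo writes to a temporary directory so that no existing file is overwritten.

```python
import csv
import os
import tempfile

HEADERS = ["RID", "age", "income", "student", "credit_rating", "buys_computer"]
ROWS = [
    ["1", "youth", "high", "no", "fair", "no"],
    ["2", "youth", "high", "no", "excellent", "no"],
    ["3", "middle_aged", "high", "no", "fair", "yes"],
    ["4", "senior", "medium", "no", "fair", "yes"],
    ["5", "senior", "low", "yes", "fair", "yes"],
    ["6", "senior", "low", "yes", "excellent", "no"],
    ["7", "middle_aged", "low", "yes", "excellent", "yes"],
    ["8", "youth", "medium", "no", "fair", "no"],
    ["9", "youth", "low", "yes", "fair", "yes"],
    ["10", "senior", "medium", "yes", "fair", "yes"],
    ["11", "youth", "medium", "yes", "excellent", "yes"],
    ["12", "middle_aged", "medium", "no", "excellent", "yes"],
    ["13", "middle_aged", "high", "yes", "fair", "yes"],
    ["14", "senior", "medium", "no", "excellent", "no"],
]


def write_sample_csv(path):
    """Write the header row followed by the 14 sample rows to `path`."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(HEADERS)
        writer.writerows(ROWS)
    return path


# Demo: write the file, then read it back to confirm the layout.
demo_path = os.path.join(tempfile.mkdtemp(), "decisionTree.csv")
write_sample_csv(demo_path)
with open(demo_path, newline="", encoding="utf-8") as handle:
    demo_rows = list(csv.reader(handle))
print(len(demo_rows), "rows, header:", demo_rows[0])
```

Pointing `write_sample_csv` at whatever path the script opens gives it a file to read.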
The following Python code implements the decision tree application:
from sklearn.feature_extraction import DictVectorizer
import csv
from numpy import array
from sklearn import preprocessing
from sklearn import tree
# sklearn.externals.six has been removed from recent scikit-learn releases;
# the standard library provides StringIO.
from io import StringIO
# Read in the CSV file; the first row holds the column headers
allElectronicsData = open(r"C:\Users\yzh\PycharmProjects\deep-learning\DecisionTree\decisionTree.csv", 'r', encoding='utf-8')
reader = csv.reader(allElectronicsData)
headers = next(reader)
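# Extract the feature dictionaries and the class labels
featureList = []
labelList = []
for row in reader:
    # The last column is the class label
    labelList.append(row[len(row) - 1])
    rowDict = {}
    # Skip column 0 (RID) and the last column (the label)
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)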
# Convert the feature dictionaries into the numeric format the algorithm accepts
vec = DictVectorizer()
print(featureList)
dummyX = vec.fit_transform(featureList).toarray()
# print("dummyX:"+str(dummyX))
# [[0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
# [0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]
# [1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
# [0. 1. 0. 0. 1. 0. 0. 1. 1. 0.]
# [0. 1. 0. 0. 1. 0. 1. 0. 0. 1.]
# [0. 1. 0. 1. 0. 0. 1. 0. 0. 1.]
# [1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
# [0. 0. 1. 0. 1. 0. 0. 1. 1. 0.]
# [0. 0. 1. 0. 1. 0. 1. 0. 0. 1.]
# [0. 1. 0. 0. 1. 0. 0. 1. 0. 1.]
# [0. 0. 1. 1. 0. 0. 0. 1. 0. 1.]
# [1. 0. 0. 1. 0. 0. 0. 1. 1. 0.]
# [1. 0. 0. 0. 1. 1. 0. 0. 0. 1.]
# [0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
# Output of print(featureList) above:
# [{'income': 'high', 'age': 'youth', 'credit_rating': 'fair', 'student': 'no'},
#  {'income': 'high', 'age': 'youth', 'credit_rating': 'excellent', 'student': 'no'},
#  {'income': 'high', 'age': 'middle_aged', 'credit_rating': 'fair', 'student': 'no'},
#  {'income': 'medium', 'age': 'senior', 'credit_rating': 'fair', 'student': 'no'},
#  {'income': 'low', 'age': 'senior', 'credit_rating': 'fair', 'student': 'yes'},
#  {'income': 'low', 'age': 'senior', 'credit_rating': 'excellent', 'student': 'yes'},
#  {'income': 'low', 'age': 'middle_aged', 'credit_rating': 'excellent', 'student': 'yes'},
#  {'income': 'medium', 'age': 'youth', 'credit_rating': 'fair', 'student': 'no'},
#  {'income': 'low', 'age': 'youth', 'credit_rating': 'fair', 'student': 'yes'},
#  {'income': 'medium', 'age': 'senior', 'credit_rating': 'fair', 'student': 'yes'},
#  {'income': 'medium', 'age': 'youth', 'credit_rating': 'excellent', 'student': 'yes'},
#  {'income': 'medium', 'age': 'middle_aged', 'credit_rating': 'excellent', 'student': 'no'},
#  {'income': 'high', 'age': 'middle_aged', 'credit_rating': 'fair', 'student': 'yes'},
#  {'income': 'medium', 'age': 'senior', 'credit_rating': 'excellent', 'student': 'no'}]
# Get the feature names, which give the column order of dummyX
# (get_feature_names_out() needs scikit-learn 1.0+; older releases use get_feature_names())
# print(list(vec.get_feature_names_out()))
# ['age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'income=high', 'income=low', 'income=medium', 'student=no', 'student=yes']
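As an aside on what the vectorizing step does, the sketch below reproduces the one-hot layout by hand in plain Python. It is not DictVectorizer itself: it only mimics the behavior for string-valued features, where each distinct "feature=value" pair becomes one column and the columns are sorted by name. The function name `one_hot_encode` and the two sample records are illustrative.

```python
def one_hot_encode(records):
    """Encode dicts of categorical values as 0/1 rows, one column per feature=value pair."""
    # Collect every distinct "feature=value" pair and sort them to fix the column order.
    names = sorted({f"{key}={value}" for record in records for key, value in record.items()})
    index = {name: position for position, name in enumerate(names)}
    matrix = []
    for record in records:
        row = [0.0] * len(names)
        for key, value in record.items():
            row[index[f"{key}={value}"]] = 1.0  # mark the column for this value
        matrix.append(row)
    return names, matrix


records = [
    {"age": "youth", "income": "high", "student": "no", "credit_rating": "fair"},
    {"age": "senior", "income": "low", "student": "yes", "credit_rating": "excellent"},
]
names, matrix = one_hot_encode(records)
print(names)
for row in matrix:
    print(row)
```

With only two records there are eight columns rather than ten, because values such as age=middle_aged never occur in this small input.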
# Convert the class labels into the format the classifier accepts
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
# print("dummyY:"+str(dummyY))
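# Create the classifier, score splits by entropy (information gain, as in ID3), and train the model.
# Note: scikit-learn implements an optimized version of CART; criterion='entropy' only changes how splits are scored.
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)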
# Classifier configuration
# print("clf:"+str(clf))
# clf:DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
# max_features=None, max_leaf_nodes=None,
# min_impurity_decrease=0.0, min_impurity_split=None,
# min_samples_leaf=1, min_samples_split=2,
# min_weight_fraction_leaf=0.0, presort=False, random_state=None,
# splitter='best')
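# Save the decision tree as a .dot file
# (get_feature_names_out() needs scikit-learn 1.0+; older releases use get_feature_names())
with open("allElectronicInformationGainOri.dot", "w") as f:
    tree.export_graphviz(clf, feature_names=list(vec.get_feature_names_out()), out_file=f)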
# Pass in a test sample and check the prediction
testRowX = {'credit_rating': 'excellent', 'income': 'high', 'age': 'youth', 'student': 'no'}
# Convert the test sample to the same one-hot format
testFeatures = array(vec.transform(testRowX).toarray()).reshape(1, -1)
# print("testFeatures;"+str(testFeatures))
# testFeatures;[[0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]]
# predictedY = clf.predict(array(testRowX).reshape(1, -1))
predictedY = clf.predict(testFeatures)
# Print the prediction
print(predictedY)
# [0]  (0 is the binarized label 'no')
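To see what the entropy criterion used in training measures, the sketch below computes the entropy of buys_computer and the information gain of each attribute on the 14 sample rows, in plain Python. This is the multiway, per-attribute gain of textbook ID3. scikit-learn instead makes binary splits on the one-hot columns, so its tree need not match an ID3 tree node for node. The names `entropy` and `information_gain` are illustrative helpers, not library functions.

```python
from collections import Counter
from math import log2

COLUMNS = ["age", "income", "student", "credit_rating", "buys_computer"]
DATA = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total) for count in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the labels minus the weighted entropy after splitting on `column`."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(group) / len(rows) * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


gains = {COLUMNS[i]: information_gain(DATA, i) for i in range(len(COLUMNS) - 1)}
best = max(gains, key=gains.get)
print(f"H(buys_computer) = {entropy([row[-1] for row in DATA]):.3f}")
for name, gain in gains.items():
    print(f"gain({name}) = {gain:.3f}")
print("best first split:", best)
```

On this data, age has the largest gain, which is why an ID3-style tree would split on age first.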