Decision Trees (scikit-learn)

Latest recommended article published on 2024-01-27 15:56:32

辉兔子

Category: Machine Learning. Tags: decision tree, Python implementation

Article link: https://blog.csdn.net/yangzhihui0627/article/details/79173254

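scikit-learn: a machine learning library for Python

Features:

Simple and efficient tools for data mining and machine learning analysis

Accessible to everyone and reusable in a variety of contexts

Built on NumPy, SciPy, and matplotlib

Open source and commercially usable under the BSD license

Problem areas covered:

Classification, regression, clustering, dimensionality reduction

Model selection, preprocessing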

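Using scikit-learn

Install scikit-learn: pip install scikit-learn (Python 3.4 and later ship with pip). Older setups could also use easy_install or a Windows installer, but pip is the usual route today.

Install the required packages: numpy, scipy, and matplotlib. Alternatively, use Anaconda, which bundles numpy, scipy, and other common scientific computing packages.

Installation caveats: check the Python interpreter version (2.7 or 3.4?) and whether the system is 32-bit or 64-bit. Current scikit-learn releases support Python 3 only.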
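The version and architecture caveats above can be checked from Python itself. The following is a minimal sketch using only the standard library; the helper name describe_interpreter is an illustrative choice, not part of scikit-learn. Note that it reports the bitness of the interpreter build, which is what decides which binary packages pip will pick.

```python
import struct
import sys


def describe_interpreter():
    """Return (major, minor, bits) for the running Python interpreter."""
    # The size of a C pointer, in bytes, tells a 32-bit build from a 64-bit one.
    bits = struct.calcsize("P") * 8
    return sys.version_info.major, sys.version_info.minor, bits


major, minor, bits = describe_interpreter()
print(f"Python {major}.{minor}, {bits}-bit build")
```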

Documentation: http://scikit-learn.org/stable/modules/tree.html

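Install Graphviz (http://www.graphviz.org), which is used for visualization: it converts a dot file into a PDF rendering of the decision tree. Note that the pip package graphviz only provides Python bindings; the Graphviz binaries themselves must be installed from the site above.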

pip install graphviz

pip install numpy

pip install scipy
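For orientation, the dot-to-PDF conversion mentioned above is performed by the Graphviz command-line tool dot. The sketch below writes a small hand-written DOT file (an illustrative stand-in, not a tree produced by scikit-learn) and builds the conversion command. The file names and the helper dot_to_pdf_command are assumptions made for this example, and the command is run only if the dot executable is found on the PATH.

```python
import os
import shutil
import subprocess
import tempfile

# A tiny hand-written DOT graph standing in for an exported decision tree.
SAMPLE_DOT = """digraph Tree {
    node [shape=box];
    0 [label="age = middle_aged?"];
    1 [label="buys_computer = yes"];
    2 [label="check other features"];
    0 -> 1 [label="yes"];
    0 -> 2 [label="no"];
}
"""


def dot_to_pdf_command(dot_path, pdf_path):
    """Build the Graphviz command that renders a DOT file to PDF."""
    return ["dot", "-Tpdf", dot_path, "-o", pdf_path]


# Write the sample graph into a temporary directory.
workdir = tempfile.mkdtemp()
dot_path = os.path.join(workdir, "tree.dot")
pdf_path = os.path.join(workdir, "tree.pdf")
with open(dot_path, "w", encoding="utf-8") as f:
    f.write(SAMPLE_DOT)

command = dot_to_pdf_command(dot_path, pdf_path)
if shutil.which("dot"):
    # Graphviz binaries are installed, so try the conversion.
    result = subprocess.run(command, check=False)
    print("dot exited with code", result.returncode)
else:
    print("Graphviz 'dot' not found; would run:", " ".join(command))
```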

Sample data:

RID	age	income	student	credit_rating	buys_computer
1	youth	high	no	fair	no
2	youth	high	no	excellent	no
3	middle_aged	high	no	fair	yes
4	senior	medium	no	fair	yes
5	senior	low	yes	fair	yes
6	senior	low	yes	excellent	no
7	middle_aged	low	yes	excellent	yes
8	youth	medium	no	fair	no
9	youth	low	yes	fair	yes
10	senior	medium	yes	fair	yes
11	youth	medium	yes	excellent	yes
12	middle_aged	medium	no	excellent	yes
13	middle_aged	high	yes	fair	yes
14	senior	medium	no	excellent	no

Next, we implement the decision tree in Python:

import csv
from io import StringIO  # sklearn.externals.six was removed from newer scikit-learn releases

from numpy import array
from sklearn import preprocessing
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

# Read the CSV file; the first row holds the column headers
allElectronicesData = open(r"C:\Users\yzh\PycharmProjects\deep-learning\DecisionTree\decisionTree.csv", 'r', encoding='utf-8')
reader = csv.reader(allElectronicesData)
headers = next(reader)

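# Extract the feature dicts and the class labels
featureList = []
labelList = []
for row in reader:
    # The last column (buys_computer) is the class label
    labelList.append(row[-1])
    # The columns in between are features; column 0 (RID) is skipped
    rowDict = {}
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)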
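To try the extraction step without the file on disk, the same logic can be run on an in-memory copy of the first few sample rows. This is a sketch under the assumption that the CSV is comma-separated with the header row shown in the sample data; SAMPLE_CSV and split_features_labels are names made up for this example.

```python
import csv
import io

# First three sample rows, assumed to be comma-separated in the real file.
SAMPLE_CSV = """RID,age,income,student,credit_rating,buys_computer
1,youth,high,no,fair,no
2,youth,high,no,excellent,no
3,middle_aged,high,no,fair,yes
"""


def split_features_labels(text):
    """Return (feature_dicts, labels), skipping the RID column."""
    reader = csv.reader(io.StringIO(text))
    headers = next(reader)  # first row is the header
    features, labels = [], []
    for row in reader:
        labels.append(row[-1])  # last column is the class label
        # zip pairs each header with its value; [1:-1] drops RID and the label
        features.append(dict(zip(headers[1:-1], row[1:-1])))
    return features, labels


features, labels = split_features_labels(SAMPLE_CSV)
print(features[0])
print(labels)
```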

# Convert the feature dicts into the numeric format the algorithm expects
vec = DictVectorizer()
print(featureList)
dummyX = vec.fit_transform(featureList).toarray()

# print("dummyX:"+str(dummyX))
# [[0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
# [0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]
# [1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
# [0. 1. 0. 0. 1. 0. 0. 1. 1. 0.]
# [0. 1. 0. 0. 1. 0. 1. 0. 0. 1.]
# [0. 1. 0. 1. 0. 0. 1. 0. 0. 1.]
# [1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
# [0. 0. 1. 0. 1. 0. 0. 1. 1. 0.]
# [0. 0. 1. 0. 1. 0. 1. 0. 0. 1.]
# [0. 1. 0. 0. 1. 0. 0. 1. 0. 1.]
# [0. 0. 1. 1. 0. 0. 0. 1. 0. 1.]
# [1. 0. 0. 1. 0. 0. 0. 1. 1. 0.]
# [1. 0. 0. 0. 1. 1. 0. 0. 0. 1.]
# [0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]

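# Get the list of feature names. On scikit-learn 1.2 and later, get_feature_names()
# no longer exists; call get_feature_names_out() instead.
# print(str(vec.get_feature_names()))
# The two commented lines that follow show the output of print(featureList)
# and the feature names, in that order.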

# [{'income': 'high', 'age': 'youth', 'credit_rating': 'fair', 'student': 'no'}, {'income': 'high', 'age': 'youth', 'credit_rating': 'excellent', 'student': 'no'}, {'income': 'high', 'age': 'middle_aged', 'credit_rating': 'fair', 'student': 'no'}, {'income': 'medium', 'age': 'senior', 'credit_rating': 'fair', 'student': 'no'}, {'income': 'low', 'age': 'senior', 'credit_rating': 'fair', 'student': 'yes'}, {'income': 'low', 'age': 'senior', 'credit_rating': 'excellent', 'student': 'yes'}, {'income': 'low', 'age': 'middle_aged', 'credit_rating': 'excellent', 'student': 'yes'}, {'income': 'medium', 'age': 'youth', 'credit_rating': 'fair', 'student': 'no'}, {'income': 'low', 'age': 'youth', 'credit_rating': 'fair', 'student': 'yes'}, {'income': 'medium', 'age': 'senior', 'credit_rating': 'fair', 'student': 'yes'}, {'income': 'medium', 'age': 'youth', 'credit_rating': 'excellent', 'student': 'yes'}, {'income': 'medium', 'age': 'middle_aged', 'credit_rating': 'excellent', 'student': 'no'}, {'income': 'high', 'age': 'middle_aged', 'credit_rating': 'fair', 'student': 'yes'}, {'income': 'medium', 'age': 'senior', 'credit_rating': 'excellent', 'student': 'no'}]

# ['age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'income=high', 'income=low', 'income=medium', 'student=no', 'student=yes']
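The encoding above can be reproduced by hand to see what DictVectorizer does with string-valued features: each distinct key=value pair becomes one column, the columns are sorted by name, and a row gets 1.0 in the columns that match its values. The following is a plain-Python sketch of that idea, not scikit-learn's implementation; one_hot_encode is a hypothetical helper. With only two sample rows it produces 8 columns rather than the 10 seen for the full data set.

```python
def one_hot_encode(feature_dicts):
    """One-hot encode a list of {feature: category} dicts.

    Returns (column_names, rows), with columns sorted alphabetically.
    """
    # Collect every distinct "feature=value" pair and fix a column order.
    names = sorted({f"{k}={v}" for d in feature_dicts for k, v in d.items()})
    index = {name: i for i, name in enumerate(names)}
    rows = []
    for d in feature_dicts:
        row = [0.0] * len(names)
        for k, v in d.items():
            row[index[f"{k}={v}"]] = 1.0  # mark the matching column
        rows.append(row)
    return names, rows


sample = [
    {"age": "youth", "income": "high", "student": "no", "credit_rating": "fair"},
    {"age": "senior", "income": "low", "student": "yes", "credit_rating": "excellent"},
]
names, rows = one_hot_encode(sample)
print(names)
print(rows[0])
```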

# Convert the class labels into the format the classifier expects
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
# print("dummyY:"+str(dummyY))
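For a two-class problem such as this one, LabelBinarizer maps the alphabetically first class to 0 and the other to 1, so 'no' becomes 0 and 'yes' becomes 1, returned as a single column. A plain-Python sketch of that mapping follows; binarize_labels is an illustrative helper, and it deliberately handles only the two-class case.

```python
def binarize_labels(labels):
    """Map two class labels to 0/1 column entries, classes in sorted order."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    # Each label becomes a one-element row, giving an (n, 1) column layout.
    return classes, [[classes.index(label)] for label in labels]


classes, encoded = binarize_labels(["no", "no", "yes", "yes", "yes", "no"])
print(classes)
print(encoded)
```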

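# Create the classifier with information entropy as the split criterion (the measure ID3 uses;
# scikit-learn itself implements an optimized version of CART), then train the model
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)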
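With criterion='entropy', candidate splits are scored by how much they reduce entropy (information gain). As a worked illustration on the 14 sample rows, the sketch below computes the entropy of buys_computer and the gain of each feature in plain Python. It treats each feature as a multi-way categorical split, as ID3 would; the fitted scikit-learn tree instead makes binary splits on the one-hot columns, so the values at its nodes will differ. The helper names are made up for this example.

```python
from collections import Counter
from math import log2

# (age, income, student, credit_rating, buys_computer) from the sample table
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
FEATURES = ["age", "income", "student", "credit_rating"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the labels minus the weighted entropy after splitting on a column."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[column], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


print(f"entropy(buys_computer) = {entropy([r[-1] for r in ROWS]):.3f}")
for i, name in enumerate(FEATURES):
    print(f"gain({name}) = {information_gain(ROWS, i):.3f}")
```

On this data the label entropy is about 0.940 bits, and age has the largest gain (about 0.247), which is why an ID3-style tree would split on age first.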

# Classifier configuration

# print("clf:"+str(clf))

# clf:DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,

# max_features=None, max_leaf_nodes=None,

# min_impurity_decrease=0.0, min_impurity_split=None,