Implementing a Decision Tree with scikit-learn

boyan_RF

Categories: sklearn, machine learning. Tags: machine learning, scipy, python, decision tree

Article link: https://blog.csdn.net/zhongjunlang/article/details/78153160

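scikit-learn: a machine learning library for Python

1.1 Features:
Simple and efficient tools for data mining and machine learning analysis
Open to all users and highly reusable across different needs
Built on NumPy, SciPy, and matplotlib
Open source and commercially usable under the BSD license

1.2 Problem areas covered:
Classification, regression, clustering, dimensionality reduction
Model selection, preprocessing

Using scikit-learn:
Install scikit-learn with pip, easy_install, or the Windows installer
Install the required packages NumPy, SciPy, and matplotlib; Anaconda can be used instead, since it bundles NumPy, SciPy, and other common scientific computing packages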
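
After installing, it can help to confirm that the required packages are actually visible to the Python interpreter in use. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_versions` is made up for this illustration and is not part of scikit-learn.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as published on PyPI (scikit-learn is imported as `sklearn`).
report = installed_versions(["numpy", "scipy", "matplotlib", "scikit-learn"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

A package reported as "not installed" can then be added with pip, or by switching to an Anaconda environment.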

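Example: an organization surveys which people in a population buy a computer.

A decision tree can be built for this data by using information entropy to choose the best split at each node.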
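
To make the entropy idea concrete, here is a small sketch that computes the entropy of the class labels and the information gain of each attribute by hand, using only the standard library. The 14 rows are transcribed from the `dummyX` and `labelList` output printed further down in this article; the function names `entropy` and `information_gain` are chosen for this illustration only.

```python
from collections import Counter
from math import log2

# (age, income, student, credit_rating, class_buys_computer),
# reconstructed from the dummyX / labelList output shown in the article.
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
FEATURES = ["age", "income", "student", "credit_rating"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the labels minus the weighted entropy after splitting on `column`."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


base_entropy = entropy([row[-1] for row in ROWS])
gains = {name: information_gain(ROWS, i) for i, name in enumerate(FEATURES)}
best_feature = max(gains, key=gains.get)

print(f"entropy of the labels: {base_entropy:.3f}")
for name, gain in gains.items():
    print(f"gain({name}) = {gain:.3f}")
print("best first split:", best_feature)
```

On this data, `age` has the largest gain, so it is the natural first split. Note that this sketch splits on all values of an attribute at once, whereas scikit-learn makes binary splits on the one-hot encoded columns, so the tree it builds can look somewhat different.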

The code is as follows:

CSV file download: https://pan.baidu.com/s/1sluPilZ

# -*- coding: utf-8 -*-
import csv

from sklearn import preprocessing, tree
from sklearn.feature_extraction import DictVectorizer


# Read the CSV file; collect the features into a list of dicts and the class labels into a list.
# In Python 3, csv.reader needs a text-mode file, so open with 'r' and newline='' instead of 'rb'.
allElectronicsData = open(r'C:\Users\zmj\Desktop\AllElectronics2.csv', 'r', newline='')
reader = csv.reader(allElectronicsData)
headers = next(reader)  # the header row holds the feature (column) names

# print(headers)
# Printing the headers gives:
# ['RID', 'age', 'income', 'student', 'credit_rating', 'class_buys_computer']

featureList = []
labelList = []

for row in reader:
    labelList.append(row[-1])  # each row has 6 columns; the last one is the class label
    rowDict = {}
    for i in range(1, len(row) - 1):  # skip RID (first column) and the label (last column)
        rowDict[headers[i]] = row[i]
    # Once the inner loop has finished, rowDict looks like this, for example:
    # {'credit_rating': 'excellent', 'age': 'senior', 'student': 'no', 'income': 'medium'}
    featureList.append(rowDict)


# print(featureList)
# featureList is a list in which every element is a dict shaped like rowDict, e.g.
# {'credit_rating': 'fair', 'age': 'youth', 'student': 'no', 'income': 'high'}

# Vectorize features
vec = DictVectorizer()  # DictVectorizer one-hot encodes a list of dicts
dummyX = vec.fit_transform(featureList).toarray()  # dummyX is now a dense NumPy matrix

print(("dummyX: " + str(dummyX)))
# dummyX: [[ 0.  0.  1.  0.  1.  1.  0.  0.  1.  0.]
#  [ 0.  0.  1.  1.  0.  1.  0.  0.  1.  0.]
#  [ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.]
#  [ 0.  1.  0.  0.  1.  0.  0.  1.  1.  0.]
#  [ 0.  1.  0.  0.  1.  0.  1.  0.  0.  1.]
#  [ 0.  1.  0.  1.  0.  0.  1.  0.  0.  1.]
#  [ 1.  0.  0.  1.  0.  0.  1.  0.  0.  1.]
#  [ 0.  0.  1.  0.  1.  0.  0.  1.  1.  0.]
#  [ 0.  0.  1.  0.  1.  0.  1.  0.  0.  1.]
#  [ 0.  1.  0.  0.  1.  0.  0.  1.  0.  1.]
#  [ 0.  0.  1.  1.  0.  0.  0.  1.  0.  1.]
#  [ 1.  0.  0.  1.  0.  0.  0.  1.  1.  0.]
#  [ 1.  0.  0.  0.  1.  1.  0.  0.  0.  1.]
#  [ 0.  1.  0.  1.  0.  0.  0.  1.  1.  0.]]
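# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
print(vec.get_feature_names_out().tolist())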
#['age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'income=high',
# 'income=low', 'income=medium', 'student=no', 'student=yes']

print(("labelList: " + str(labelList)))
#labelList: ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']

# vectorize class labels
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print(("dummyY: " + str(dummyY)))
# dummyY: [[0]
#  [0]
#  [1]
#  [1]
#  [1]
#  [0]
#  [1]
#  [0]
#  [1]
#  [1]
#  [1]
#  [1]
#  [1]
#  [0]]

# Using decision tree for classification
# clf = tree.DecisionTreeClassifier()
clf = tree.DecisionTreeClassifier(criterion='entropy')  # create a classifier that uses information entropy as its split criterion
clf = clf.fit(dummyX, dummyY)  # train the model
print(("clf: " + str(clf)))
# clf: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
#             max_features=None, max_leaf_nodes=None,
#             min_impurity_decrease=0.0, min_impurity_split=None,
#             min_samples_leaf=1, min_samples_split=2,
#             min_weight_fraction_leaf=0.0, presort=False, random_state=None,
#             splitter='best')


# Visualize model
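with open("allElectronicInformationGainOri.dot", 'w') as f:
    # export_graphviz writes into f; its return value is not needed when out_file is given.
    tree.export_graphviz(clf, feature_names=vec.get_feature_names_out().tolist(), out_file=f)
# Use Graphviz's dot command to render the tree. After running it, the file
# allElectronicInformationGainOri.pdf appears in the current directory; open it to see the tree diagram.
# Note: run this command in a shell, not in Python:
# dot -Tpdf allElectronicInformationGainOri.dot -o allElectronicInformationGainOri.pdf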



# Test the model
oneRowX = dummyX[0, :]
print(("oneRowX: " + str(oneRowX)))
#oneRowX: [ 0.  0.  1.  0.  1.  1.  0.  0.  1.  0.]
newRowX = oneRowX.copy()  # copy, so that editing newRowX does not also change oneRowX and dummyX
newRowX[0] = 1
newRowX[2] = 0
print(("newRowX: " + str(newRowX)))
#newRowX: [ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.]

# predictedY = clf.predict(newRowX)
# The statement above would raise the following error:
# ValueError: Expected 2D array, got 1D array instead:
# array=[ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.].
# Reshape your data either using array.reshape(-1, 1) if your data has a single feature
# or array.reshape(1, -1) if it contains a single sample.
predictedY = clf.predict(newRowX.reshape(1, -1))
print(("predictedY: " + str(predictedY)))
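
If Graphviz is not installed, the fitted tree can also be inspected as plain text. The sketch below assumes scikit-learn 0.21 or later (where `sklearn.tree.export_text` is available) and, so that it runs without the CSV file, hard-codes the `dummyX` matrix, feature names, and labels exactly as printed in the listing above. `random_state=0` is an added assumption to make the result repeatable.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Feature names and one-hot matrix copied from the printed output in the article.
FEATURE_NAMES = [
    "age=middle_aged", "age=senior", "age=youth",
    "credit_rating=excellent", "credit_rating=fair",
    "income=high", "income=low", "income=medium",
    "student=no", "student=yes",
]
X = [
    [0, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1, 1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0, 0, 0, 1, 1, 0],
]
y = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]  # 1 = buys a computer

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Print the learned rules as indented text instead of a .dot file.
tree_text = export_text(clf, feature_names=FEATURE_NAMES)
print(tree_text)

# Same test as above: the first sample with age changed from youth to middle_aged.
new_row = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
predicted = clf.predict([new_row])  # wrap in a list to pass a 2D input
print("predicted:", predicted)
```

Passing `[new_row]` serves the same purpose as `reshape(1, -1)` in the listing above: `predict` expects one row per sample.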
