Implementing a Decision Tree with scikit-learn

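  1. scikit-learn: a machine learning library for Python

    1.1 Features:
    Simple and efficient tools for data mining and machine learning analysis
    Open to all users and highly reusable across different needs
    Built on NumPy, SciPy, and matplotlib
    Open source and commercially usable: BSD licensed

    1.2 Problem areas covered:
    Classification, regression, clustering, dimensionality reduction
    Model selection, preprocessing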

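  2. Using scikit-learn
    Install scikit-learn: pip, easy_install, or the Windows installer
    Install the required packages: NumPy, SciPy, and matplotlib. Alternatively, use Anaconda, which bundles NumPy, SciPy, and other common scientific-computing packages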
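
    After installing, a quick sanity check is to import each package and print its version. The sketch below is only an illustration; the helper name installed_versions is hypothetical and not part of any library.

```python
import importlib


def installed_versions(packages=("numpy", "scipy", "matplotlib", "sklearn")):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
            # Most scientific packages expose __version__; fall back if one does not.
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(name, version if version is not None else "NOT INSTALLED")
```

    Note that scikit-learn is imported under the name sklearn, even though the package is installed as scikit-learn.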

    Example: an organization surveys a group of people on whether they buy a computer.
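    The decision tree is chosen using information entropy: at each node, the attribute whose split reduces entropy the most (the highest information gain) is selected.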
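
    To make the entropy criterion concrete, here is a minimal pure-Python sketch that computes the entropy of the class labels and the information gain of splitting on age. The labels and the age column are read off the labelList and dummyX output printed in the script further down; the function names entropy and information_gain are my own, not scikit-learn API.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a categorical feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Class label and 'age' value of each of the 14 rows, in file order.
labels = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
          'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
age = ['youth', 'youth', 'middle_aged', 'senior', 'senior', 'senior', 'middle_aged',
       'youth', 'youth', 'senior', 'youth', 'middle_aged', 'middle_aged', 'senior']

print(round(entropy(labels), 3))                # about 0.940 bits (9 'yes', 5 'no')
print(round(information_gain(age, labels), 3))  # about 0.247 bits
```

    Every middle_aged row is labelled yes, so that branch has zero entropy, which is why splitting on age removes a sizeable share of the uncertainty.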
    The code is as follows:

    CSV file download: https://pan.baidu.com/s/1sluPilZ

# -*- coding:utf-8 -*-
import csv

from sklearn import preprocessing
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer


# Read the CSV file; collect the features as a list of dicts and the class labels as a list.
# csv.reader needs a text-mode file in Python 3 (opening with 'rb' raises an error).
allElectronicsData = open(r'C:\Users\zmj\Desktop\AllElectronics2.csv', 'r', newline='')
reader = csv.reader(allElectronicsData)
headers = next(reader)  # header row: the column (feature) names

# print(headers)
# Printing the headers gives:
# ['RID', 'age', 'income', 'student', 'credit_rating', 'class_buys_computer']

featureList = []
labelList = []

for row in reader:
    labelList.append(row[-1])  # the last of the 6 columns is the class label
    rowDict = {}
    for i in range(1, len(row)-1):  # skip column 0 (RID) and the label column
        rowDict[headers[i]] = row[i]
        # Once the inner loop has finished, rowDict looks like this:
        # {'credit_rating': 'excellent', 'age': 'senior', 'student': 'no', 'income': 'medium'}
    featureList.append(rowDict)


# print(featureList)
# featureList is a list in which every element is a dict shaped like rowDict, e.g.
# {'credit_rating': 'fair', 'age': 'youth', 'student': 'no', 'income': 'high'}

# Vectorize features
vec = DictVectorizer()  # DictVectorizer turns a list of dicts into a one-hot encoded matrix
dummyX = vec.fit_transform(featureList).toarray()  # dummyX is now a dense NumPy array

print("dummyX: " + str(dummyX))
# dummyX: [[ 0.  0.  1.  0.  1.  1.  0.  0.  1.  0.]
#  [ 0.  0.  1.  1.  0.  1.  0.  0.  1.  0.]
#  [ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.]
#  [ 0.  1.  0.  0.  1.  0.  0.  1.  1.  0.]
#  [ 0.  1.  0.  0.  1.  0.  1.  0.  0.  1.]
#  [ 0.  1.  0.  1.  0.  0.  1.  0.  0.  1.]
#  [ 1.  0.  0.  1.  0.  0.  1.  0.  0.  1.]
#  [ 0.  0.  1.  0.  1.  0.  0.  1.  1.  0.]
#  [ 0.  0.  1.  0.  1.  0.  1.  0.  0.  1.]
#  [ 0.  1.  0.  0.  1.  0.  0.  1.  0.  1.]
#  [ 0.  0.  1.  1.  0.  0.  0.  1.  0.  1.]
#  [ 1.  0.  0.  1.  0.  0.  0.  1.  1.  0.]
#  [ 1.  0.  0.  0.  1.  1.  0.  0.  0.  1.]
#  [ 0.  1.  0.  1.  0.  0.  0.  1.  1.  0.]]
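# get_feature_names() was removed in scikit-learn 1.2; on releases older than 1.0 call vec.get_feature_names() instead
print(vec.get_feature_names_out().tolist())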
#['age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'income=high',
# 'income=low', 'income=medium', 'student=no', 'student=yes']

print("labelList: " + str(labelList))
#labelList: ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']

# vectorize class labels
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY: " + str(dummyY))
# dummyY: [[0]
#  [0]
#  [1]
#  [1]
#  [1]
#  [0]
#  [1]
#  [0]
#  [1]
#  [1]
#  [1]
#  [1]
#  [1]
#  [0]]

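# Using decision tree for classification
# clf = tree.DecisionTreeClassifier()
# Create the classifier; criterion='entropy' makes it choose splits by information entropy
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)  # train the model
print("clf: " + str(clf))
# The repr below comes from an older scikit-learn release; recent releases print only non-default parameters.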
# clf: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
#             max_features=None, max_leaf_nodes=None,
#             min_impurity_decrease=0.0, min_impurity_split=None,
#             min_samples_leaf=1, min_samples_split=2,
#             min_weight_fraction_leaf=0.0, presort=False, random_state=None,
#             splitter='best')


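# Visualize model
with open("allElectronicInformationGainOri.dot", 'w') as f:
    # export_graphviz writes to f and returns None, so its result is not assigned
    tree.export_graphviz(clf, feature_names=list(vec.get_feature_names_out()), out_file=f)
# Graphviz's dot command converts the .dot file into a picture of the tree. After running
# the command below, allElectronicInformationGainOri.pdf appears in the current directory;
# open it to see the decision tree.
# Note: run this command in a shell, not in Python:
# dot -Tpdf allElectronicInformationGainOri.dot -o allElectronicInformationGainOri.pdf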



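# Test the model on a modified copy of the first row
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))
#oneRowX: [ 0.  0.  1.  0.  1.  1.  0.  0.  1.  0.]
newRowX = oneRowX.copy()  # copy, so that editing newRowX does not also change dummyX
newRowX[0] = 1  # set age=middle_aged
newRowX[2] = 0  # clear age=youth
print("newRowX: " + str(newRowX))
#newRowX: [ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.]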

# predictedY = clf.predict(newRowX)
# The call above fails with:
# ValueError: Expected 2D array, got 1D array instead:
# array=[ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.].
# Reshape your data either using array.reshape(-1, 1) if your data has a single feature
# or array.reshape(1, -1) if it contains a single sample.
# predict expects a 2D array of shape (n_samples, n_features), so reshape the single row:
predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY: " + str(predictedY))