Machine Learning (Decision Trees)

Lee_jiaqi

Article link: https://blog.csdn.net/zoinsung_lee/article/details/78387560

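Criteria for evaluating classification and prediction algorithms in machine learning:
Accuracy
Speed
Robustness
Scalability
Interpretability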

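1. Decision tree concept

A decision tree is a flowchart-like tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class or a class distribution. The topmost node of the tree is the root node.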
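
To make the flowchart picture concrete, here is a minimal sketch of one way to represent such a tree and classify a sample with it. The nested-dict layout, the names TREE and classify, and the hand-written tree itself are assumptions made for illustration; the tree is consistent with the training set used in section 5, but it is not the tree scikit-learn produces there.

```python
# Internal nodes are dicts holding the attribute to test and one branch per outcome;
# leaves are plain class labels ("yes" / "no" for buys_computer).
TREE = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "middle_aged": "yes",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"fair": "yes", "excellent": "no"},
        },
    },
}


def classify(node, sample):
    """Walk from the root to a leaf, following the branch that matches the sample."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node


sample = {"age": "youth", "income": "high", "student": "no", "credit_rating": "fair"}
print(classify(TREE, sample))
```

Each loop iteration performs one attribute test, so classifying a sample costs at most as many tests as the tree is deep.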

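2. Basic algorithm for constructing a decision tree

2.1. The concept of entropy

The amount of information in a message is directly related to its uncertainty: to resolve something highly uncertain, a large amount of information is needed. The measure of information is therefore the amount of uncertainty.

Information is measured in bits:

H(X) = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn))

where pi is the probability of the i-th outcome. The greater the uncertainty of a variable, the greater its entropy.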
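
As a quick check of the formula, the sketch below computes entropy from a list of class labels, estimating each probability as a relative frequency. The function name entropy and the sample label lists are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    # p * log2(1/p) is the same quantity as -p * log2(p).
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(labels).values()
    )


print(entropy(["yes", "no"]))              # two equally likely outcomes
print(entropy(["yes"] * 4))                # a single certain outcome
print(entropy(["yes"] * 9 + ["no"] * 5))   # class distribution of the data in section 5
```

The evenly split list gives the largest value for two classes, and the list with a single class gives zero, which matches the statement that more uncertainty means more entropy.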

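3. Decision tree induction algorithm

Selecting the attribute to test at a node

Information gain: Gain(A) = Info(D) - Info_A(D), i.e. how much information is obtained by using attribute A as the node that splits the data set D.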
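
The following sketch evaluates Gain(A) for every attribute of the 14-row training set that appears in the output of section 5. The helper names info and gain and the tuple layout of DATA are assumptions for illustration; Info_A(D) is taken to be the size-weighted average entropy of the partitions that A induces.

```python
import math
from collections import Counter, defaultdict

ATTRIBUTES = ("age", "income", "student", "credit_rating")
# (age, income, student, credit_rating, buys_computer), copied from the article's data.
DATA = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
rows = [dict(zip(ATTRIBUTES, record[:4])) for record in DATA]
labels = [record[4] for record in DATA]


def info(class_labels):
    """Info(D): entropy of the class labels, in bits."""
    total = len(class_labels)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(class_labels).values()
    )


def gain(rows, labels, attribute):
    """Gain(A) = Info(D) - Info_A(D)."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)
    info_a = sum(len(group) / len(labels) * info(group) for group in groups.values())
    return info(labels) - info_a


gains = {attribute: gain(rows, labels, attribute) for attribute in ATTRIBUTES}
for attribute, value in gains.items():
    print(f"{attribute}: {value:.3f}")
```

On this data age has the largest gain, so it would be chosen as the test attribute at the root.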

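Algorithm:

The tree starts as a single node representing the training samples (step 1).
If the samples all belong to the same class, the node becomes a leaf and is labeled with that class (steps 2 and 3).
Otherwise, the algorithm uses an entropy-based measure called information gain as a heuristic to select the attribute that best separates the samples into classes (step 6). This attribute becomes the "test" or "decision" attribute of the node (step 7). In this algorithm all attributes are categorical, i.e. discrete-valued; continuous attributes must be discretized.
A branch is created for each known value of the test attribute, and the samples are partitioned accordingly (steps 8 to 10).
The algorithm applies the same process recursively to form a decision tree for the samples in each partition. Once an attribute has been used at a node, it need not be considered in any of that node's descendants (step 13).
Recursive partitioning stops only when one of the following conditions holds:
(a) All samples at the given node belong to the same class (steps 2 and 3).
(b) No attributes remain on which the samples can be further partitioned (step 4). In this case majority voting is used (step 5): the node is converted into a leaf and labeled with the class that most of its samples belong to.
(c) A branch has no samples (step 11). In this case a leaf is created.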
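
The steps above can be sketched as a short recursive function. This is a minimal illustration under stated assumptions, not a reference implementation: the names info, split, gain and build_tree and the nested-dict tree format are invented here, and branches are created only for attribute values that actually occur in the samples, so stopping condition (c) never arises.

```python
import math
from collections import Counter

ATTRIBUTES = ["age", "income", "student", "credit_rating"]
# (age, income, student, credit_rating, buys_computer), copied from the article's data.
DATA = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def info(labels):
    """Entropy of the class labels, in bits."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


def split(rows, labels, attribute):
    """Partition (rows, labels) by the value each row takes for `attribute`."""
    parts = {}
    for row, label in zip(rows, labels):
        part_rows, part_labels = parts.setdefault(row[attribute], ([], []))
        part_rows.append(row)
        part_labels.append(label)
    return parts


def gain(rows, labels, attribute):
    """Information gain obtained by splitting on `attribute`."""
    parts = split(rows, labels, attribute)
    info_a = sum(len(ls) / len(labels) * info(ls) for _, ls in parts.values())
    return info(labels) - info_a


def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:          # (a) all samples belong to one class
        return labels[0]
    if not attributes:                 # (b) no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]  # never reuse an attribute below
    return {
        "attribute": best,
        "branches": {
            value: build_tree(part_rows, part_labels, remaining)
            for value, (part_rows, part_labels) in split(rows, labels, best).items()
        },
    }


rows = [dict(zip(ATTRIBUTES, record[:4])) for record in DATA]
labels = [record[4] for record in DATA]
learned_tree = build_tree(rows, labels, ATTRIBUTES)
print(learned_tree)
```

Because each recursive call receives only the attributes not yet used on the path from the root, the recursion depth is bounded by the number of attributes.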

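4. Advantages and disadvantages of decision trees

Advantages:
Intuitive, easy to understand, and effective on small data sets

Disadvantages:
Handles continuous attributes poorly
When there are many classes, the error rate rises quickly
Scalability is mediocre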

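5. Decision tree implementation

5.1. Save the training set in a .csv file. Opened in Excel, it looks like this:
[Image: the training set opened in Excel]

5.2. Implementation code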

# -*- coding:utf-8 -*-
import csv

from sklearn import preprocessing
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

# Read the training set. A raw string keeps the backslashes in the Windows path literal.
with open(r'E:\demo_py\python\machine_learning\CSV.csv', 'rt', newline='') as allElectronicsData:
    reader = csv.reader(allElectronicsData)
    headers = next(reader)
    print(headers)

    featureList = []
    labelList = []

    for row in reader:
        # The last column is the class label; column 0 (RID) is skipped.
        labelList.append(row[len(row)-1])
        rowDict = {}
        for i in range(1, len(row)-1):
            rowDict[headers[i]] = row[i]
        featureList.append(rowDict)

print(featureList)

# One-hot encode the categorical features.
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names().
featureNames = vec.get_feature_names_out().tolist()

print("dummyX:"+str(dummyX))
print(featureNames)

print("labelList:"+str(labelList))

# Encode the class labels as 0/1.
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY:"+str(dummyY))

# criterion="entropy" chooses splits by information gain.
clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(dummyX, dummyY)
print("clf:"+str(clf))

# Export the fitted tree in Graphviz .dot format.
with open(r"E:\demo_py\python\machine_learning\CSV.dot", "w") as f:
    tree.export_graphviz(clf, feature_names=featureNames, out_file=f)

oneRowX = dummyX[0, :]
print("oneRowX:"+str(oneRowX))

# Copy the row so that editing it does not also modify dummyX.
newRowX = oneRowX.copy()

# Change age from youth to middle_aged.
newRowX[0] = 1
newRowX[2] = 0
print("newRowX:"+str(newRowX))

# predict expects a 2-D array of shape (n_samples, n_features).
predicatedY = clf.predict(newRowX.reshape(1, -1))
print("predicatedY:"+str(predicatedY))

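Output (the DeprecationWarning near the end appears because a 1-D array was passed to clf.predict in the run that produced it):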

['RID', 'age', 'income', 'student', 'credit_rating', 'Class_buys_computer']
[{'student': 'no', 'credit_rating': 'fair', 'income': 'high', 'age': 'youth'}, {'student': 'no', 'credit_rating': 'excellent', 'income': 'high', 'age': 'youth'}, {'student': 'no', 'credit_rating': 'fair', 'income': 'high', 'age': 'middle_aged'}, {'student': 'no', 'credit_rating': 'fair', 'income': 'medium', 'age': 'senior'}, {'student': 'yes', 'credit_rating': 'fair', 'income': 'low', 'age': 'senior'}, {'student': 'yes', 'credit_rating': 'excellent', 'income': 'low', 'age': 'senior'}, {'student': 'yes', 'credit_rating': 'excellent', 'income': 'low', 'age': 'middle_aged'}, {'student': 'no', 'credit_rating': 'fair', 'income': 'medium', 'age': 'youth'}, {'student': 'yes', 'credit_rating': 'fair', 'income': 'low', 'age': 'youth'}, {'student': 'yes', 'credit_rating': 'fair', 'income': 'medium', 'age': 'senior'}, {'student': 'yes', 'credit_rating': 'excellent', 'income': 'medium', 'age': 'youth'}, {'student': 'no', 'credit_rating': 'excellent', 'income': 'medium', 'age': 'middle_aged'}, {'student': 'yes', 'credit_rating': 'fair', 'income': 'high', 'age': 'middle_aged'}, {'student': 'no', 'credit_rating': 'excellent', 'income': 'medium', 'age': 'senior'}]
dummyX:[[ 0.  0.  1.  0.  1.  1.  0.  0.  1.  0.]
 [ 0.  0.  1.  1.  0.  1.  0.  0.  1.  0.]
 [ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  1.  1.  0.]
 [ 0.  1.  0.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  1.  0.  1.  0.  0.  1.  0.  0.  1.]
 [ 1.  0.  0.  1.  0.  0.  1.  0.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.  0.  1.  1.  0.]
 [ 0.  0.  1.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  1.  0.  0.  1.  0.  0.  1.  0.  1.]
 [ 0.  0.  1.  1.  0.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  1.  0.  0.  0.  1.  1.  0.]
 [ 1.  0.  0.  0.  1.  1.  0.  0.  0.  1.]
 [ 0.  1.  0.  1.  0.  0.  0.  1.  1.  0.]]
['age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'income=high', 'income=low', 'income=medium', 'student=no', 'student=yes']
labelList:['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
dummyY:[[0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]]
clf:DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
oneRowX:[ 0.  0.  1.  0.  1.  1.  0.  0.  1.  0.]
newRowX:[ 1.  0.  0.  0.  1.  1.  0.  0.  1.  0.]
D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
predicatedY:[1]
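
The dummyX matrix in the output above can be reproduced by hand. The sketch below is a standard-library illustration of one-hot encoding that yields the same column order as the printed feature names; it is not how DictVectorizer is implemented, and the function name one_hot and the choice of three sample rows (rows 1, 6 and 12 of the training set, which together cover every attribute value) are assumptions.

```python
def one_hot(rows):
    """Encode dicts of categorical values as 0/1 vectors, one column per "name=value"."""
    feature_names = sorted({f"{key}={value}" for row in rows for key, value in row.items()})
    column = {name: i for i, name in enumerate(feature_names)}
    matrix = []
    for row in rows:
        encoded = [0.0] * len(feature_names)
        for key, value in row.items():
            encoded[column[f"{key}={value}"]] = 1.0
        matrix.append(encoded)
    return feature_names, matrix


sample_rows = [
    {"age": "youth", "income": "high", "student": "no", "credit_rating": "fair"},
    {"age": "senior", "income": "low", "student": "yes", "credit_rating": "excellent"},
    {"age": "middle_aged", "income": "medium", "student": "no", "credit_rating": "excellent"},
]
feature_names, matrix = one_hot(sample_rows)
print(feature_names)
for encoded in matrix:
    print(encoded)
```

Reading the columns this way also explains the edit made to newRowX in the code: column 0 is age=middle_aged and column 2 is age=youth, so setting them to 1 and 0 changes the sample's age from youth to middle_aged.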

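The decision tree is written to a .dot file, which is then converted to a .pdf file, as shown below:
[Image: converting the .dot file to a .pdf file]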

Type the following command at the cmd prompt:
[Image: the command entered at the cmd prompt]
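
The screenshot of the command is not preserved. As a hedged sketch of the conversion step, the snippet below builds the usual Graphviz invocation for rendering a .dot file to PDF and runs it only if the dot executable is installed. The helper names build_dot_command and convert_dot_to_pdf and the output name CSV.pdf are assumptions; CSV.dot is the file written by the code in section 5.2.

```python
import shutil
import subprocess


def build_dot_command(dot_path, pdf_path):
    """Return the Graphviz command line that renders dot_path to a PDF."""
    return ["dot", "-Tpdf", dot_path, "-o", pdf_path]


def convert_dot_to_pdf(dot_path, pdf_path):
    """Run Graphviz; requires the `dot` executable to be on PATH."""
    if shutil.which("dot") is None:
        raise RuntimeError("Graphviz 'dot' executable not found on PATH")
    subprocess.run(build_dot_command(dot_path, pdf_path), check=True)


# Show the command that would be run for the file exported in section 5.2.
print(" ".join(build_dot_command("CSV.dot", "CSV.pdf")))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when the path contains spaces.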

Visualized decision tree:
[Image: the rendered decision tree]
