Machine Learning: Experiment 6: Decision Tree

Nianf

Article link: https://blog.csdn.net/qq_43738932/article/details/115442221

Objective

In this exercise, you will implement a decision tree.

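Software Environment

Software environment: Anaconda3 (Spyder)
The decision tree in this experiment is implemented in Python, so you can install Spyder, or PyCharm works too. If you have not installed Anaconda yet, see my blog post on installing Anaconda.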

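Procedure and Content

1. Loading the Data

First, download ex6Data.zip and extract the file from the archive. It is a comma-separated (csv) file containing a wine dataset. This dataset is commonly used to evaluate classification algorithms; the task here is to determine whether a wine's quality score is 7 or higher. The quality scores have already been mapped to the binary classes 0 and 1: scores from 0 to 6 (inclusive) map to 0, and scores of 7 and above map to 1. You will perform binary classification on this dataset. Each row describes one wine using 12 columns: the first 11 columns describe the wine's characteristics, and the last column is the quality label (0/1). You must not use the last column as an input feature when classifying the data.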
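As a rough illustration of the layout described above, here is a minimal sketch that reads a header row plus numeric rows and separates the features from the 0/1 label in the last column. The helper name `load_rows`, the column names, and the two sample rows are made up for the example; they stand in for ex6Data.csv and are not taken from it.

```python
import csv
import io

def load_rows(csv_file):
    """Read a header row plus numeric rows; the last column is the 0/1 label."""
    reader = csv.reader(csv_file)
    header = next(reader)
    features, labels = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        features.append([float(v) for v in row[:-1]])
        labels.append(int(row[-1]))
    return header, features, labels

# Tiny made-up sample standing in for ex6Data.csv (illustrative values only).
sample = io.StringIO(
    "fixed acidity,alcohol,quality\n"
    "7.4,9.4,0\n"
    "7.8,12.8,1\n"
)
header, X, y = load_rows(sample)
print(header[:-1], X, y)
```

With the real file, the same function could be called on `open("ex6Data.csv", newline="")`, assuming the first row holds the column names.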

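2. Decision Tree

In this task, you will implement the well-known decision tree classifier. Its performance will be evaluated with 10-fold cross-validation on the provided dataset. Decision trees and cross-validation were covered in class. You will implement the decision tree classifier from scratch in Python or Matlab. You should not use existing machine learning or decision tree libraries.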

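2.1 Implementing Decision Tree

In your decision tree implementation, you may apply any variant you like (for example, entropy, the Gini index, or another measure; binary splits or multiway splits). Explain your approach and its effect on classification performance in the lab report. We provide skeleton code written in Python. It helps you set up the environment (loading the data and evaluating the model). You may use this skeleton, or write your own code from scratch in Python or Matlab.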

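2.2 Evaluation using Cross Validation

You will evaluate your decision tree using 10-fold cross-validation. See the lecture slides for details. In short, first split the provided data into 10 parts. Then hold out 1 part as the test set and use the remaining 9 parts as the training set. Train the decision tree on the training set and use the trained tree to classify the entries in the test set. Repeat this for all 10 parts so that every entry is used in a test set exactly once. To get the final accuracy, average the accuracies of the 10 folds. If both parts (decision tree and cross-validation) are implemented correctly, the classification accuracy should be around 0.78 or higher.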
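The fold procedure described above can be sketched as follows. This is only one possible way to form the folds (record i goes to fold i % k); the function names and the majority-class stand-in model are assumptions made for the example, not part of the assignment skeleton.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; sample i belongs to fold i % k."""
    for fold in range(k):
        test_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, test_idx

def cross_validate(data, k, train_fn, predict_fn):
    """Average accuracy over k folds. Each row of data ends with its label."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx])
        correct = sum(predict_fn(model, data[i][:-1]) == data[i][-1]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Stand-in "model" that always predicts the majority class of the training rows.
def train_majority(rows):
    labels = [r[-1] for r in rows]
    return int(labels.count(1) > labels.count(0))

def predict_majority(model, features):
    return model

# Made-up toy data: one feature, label 1 for every fourth row.
toy = [[float(i), 1 if i % 4 == 0 else 0] for i in range(40)]
mean_accuracy = cross_validate(toy, 10, train_majority, predict_majority)
print(mean_accuracy)
```

Swapping `train_majority` and `predict_majority` for a real tree builder and classifier gives the evaluation loop the assignment asks for.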

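2.3 Visualizing your decision tree

Preferably, visualize the decision tree with a visualization library in Python or Matlab, but you may also draw the decision tree flowchart with an existing drawing tool. Once you have the flowchart, examine its structure and include it in the lab report.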
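One library-free way to get a flowchart is to emit Graphviz DOT text directly. The sketch below assumes the nested-dict node format used by the code at the end of this post (internal nodes carry 'feature', 'position', 'left_tree', 'right_tree'; leaves have feature == -1 and a 'result'). The function name `tree_to_dot` and the toy tree are made up for illustration, and feature names containing double quotes are not handled.

```python
def tree_to_dot(tree, feature_names):
    """Return Graphviz DOT source for a nested-dict decision tree."""
    lines = ["digraph tree {"]
    counter = [0]  # mutable counter so the nested function can hand out unique ids

    def visit(node):
        node_id = "n%d" % counter[0]
        counter[0] += 1
        if node["feature"] == -1:  # leaf: show the predicted class
            lines.append('  %s [label="class %d", shape=box];' % (node_id, node["result"]))
        else:  # internal node: show the test, then link both children
            name = feature_names[node["feature"]]
            lines.append('  %s [label="%s < %.3f?"];' % (node_id, name, node["position"]))
            left_id = visit(node["left_tree"])
            right_id = visit(node["right_tree"])
            lines.append('  %s -> %s [label="yes"];' % (node_id, left_id))
            lines.append('  %s -> %s [label="no"];' % (node_id, right_id))
        return node_id

    visit(tree)
    lines.append("}")
    return "\n".join(lines)

# Toy tree with a single split, for illustration only.
toy_tree = {
    "feature": 0, "position": 10.5,
    "left_tree": {"feature": -1, "result": 0},
    "right_tree": {"feature": -1, "result": 1},
}
dot_source = tree_to_dot(toy_tree, ["alcohol"])
print(dot_source)
```

Saving the text to a file such as tree.dot and running `dot -Tpng tree.dot -o tree.png` renders it, provided the Graphviz executables are installed.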

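Conclusions and Reflections

How a decision tree works: a decision tree consists of nodes, edges, and leaves. A node represents an attribute to be tested, an edge represents an outcome of that test and connects to the next node or leaf, and a leaf represents a class label.
A good decision tree is therefore one that minimizes the generalization error.
A decision tree organizes a series of tests in a hierarchy to determine the class label of an example. It is built with a basic divide-and-conquer algorithm:

Select a test for the root node
➢ Create a branch for each possible outcome of the test
Split the instances into subsets
➢ One for each branch extending from the node
Repeat recursively for each branch, using only the instances that reach that branch
Stop the recursion for a branch if all of its instances have the same class

To judge whether an attribute is a good choice for the root node, we use entropy and the Gini index. In this experiment I used the Gini index and obtained an accuracy of 0.82.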
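The divide-and-conquer steps can be sketched on a toy dataset as follows. This is a simplified illustration, not the implementation from this post: it tries every midpoint between distinct feature values, uses Gini impurity, and stops only when a node is pure or cannot be split. All names and the four sample rows are made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_split(rows):
    """Return (weighted_gini, feature, threshold) for the best binary split, or None."""
    best = None
    for f in range(len(rows[0]) - 1):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between two distinct values
            left = [r[-1] for r in rows if r[f] < t]
            right = [r[-1] for r in rows if r[f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build(rows):
    """Recursively build a tree; each row is [features..., label]."""
    labels = [r[-1] for r in rows]
    split = best_split(rows)
    if len(set(labels)) == 1 or split is None:  # pure node or nothing left to split on
        return {"leaf": int(sum(labels) * 2 > len(labels))}  # majority label
    _, f, t = split
    return {"feature": f, "threshold": t,
            "left": build([r for r in rows if r[f] < t]),
            "right": build([r for r in rows if r[f] >= t])}

def predict(tree, x):
    """Follow the tests from the root down to a leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] < tree["threshold"] else tree["right"]
    return tree["leaf"]

rows = [[1.0, 5.0, 0], [2.0, 6.0, 0], [3.0, 1.0, 1], [4.0, 2.0, 1]]
tree = build(rows)
print(tree)
print(predict(tree, [1.5, 9.0]), predict(tree, [3.5, 9.0]))
```

Because every threshold lies strictly between two distinct values, both subsets are non-empty and the recursion always terminates.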

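Entropy formula, where p is the proportion of one class in the node:
$$H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$$
Gini index formula:
$$G(p) = 1 - p^2 - (1-p)^2$$
Translated into code: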

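from math import log2

def Entropy(p):  # binary entropy; p is the proportion of one class
    q=1-p
    p=max(1e-9,p)  # clamp to avoid log2(0)
    q=max(1e-9,q)
    E=-p*log2(p)-q*log2(q)
    return E
def Gini(p):  # binary Gini index
    q=1-p
    G=1-p*p-q*q
    return G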
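To see how these two measures rank a candidate split, here is a small worked example. It assumes a node with 5 samples of each class that is split into a left part with class counts (4, 1) and a right part with (1, 4); the counts and the helper `weighted_impurity` are made up for illustration.

```python
from math import log2

def entropy(p):
    """Binary entropy of class proportion p, clamped to avoid log2(0)."""
    q = 1 - p
    p = max(1e-9, p)
    q = max(1e-9, q)
    return -p * log2(p) - q * log2(q)

def gini(p):
    """Binary Gini index of class proportion p."""
    q = 1 - p
    return 1 - p * p - q * q

def weighted_impurity(left_counts, right_counts, measure):
    """Size-weighted impurity of a split; counts are (n_class0, n_class1) tuples."""
    total = sum(left_counts) + sum(right_counts)
    result = 0.0
    for n0, n1 in (left_counts, right_counts):
        n = n0 + n1
        if n:
            result += n / total * measure(n1 / n)
    return result

parent_gini = gini(0.5)  # impurity before the split
split_gini = weighted_impurity((4, 1), (1, 4), gini)
split_entropy = weighted_impurity((4, 1), (1, 4), entropy)
print(parent_gini, split_gini, split_entropy)
```

The lower the weighted value relative to the parent, the better the split separates the two classes; the split with the smallest weighted impurity is the one a tree builder would choose.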

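Problems Encountered During the Experiment

1. Installing graphviz failed

A plain pip install graphviz is not enough to get this library working: the pip package is only a Python interface and also needs the Graphviz executables (such as dot) to be installed and on the PATH. (It failed for me, so I am recording how I fixed it.)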

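a. Delete the third-party graphviz library (yes, remove it completely first; it is usually under D:\users\Anaconda3\Lib\site-packages).

b. Download Graphviz from the official website.

c. Double-click the downloaded file and extract it into the site-packages folder.

d. Add the bin directory of the extracted graphviz folder to the PATH environment variable: edit the variable, click New, then copy and paste the path of bin.

e. Open a command prompt (the Anaconda Prompt) and enter dot -version; the Graphviz version should be printed.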

f. Finally, run pip install graphviz and restart Spyder.
Restart! Restart! Restart!
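A quick way to confirm from Python that the dot executable is reachable is sketched below. It relies only on the standard library and on the `dot -V` version flag; the helper name `graphviz_dot_version` is made up for this example.

```python
import shutil
import subprocess

def graphviz_dot_version():
    """Return the version line printed by `dot -V`, or None if dot is not on PATH."""
    dot_path = shutil.which("dot")
    if dot_path is None:
        return None
    # The version banner may go to stderr or stdout, so check both.
    completed = subprocess.run([dot_path, "-V"], capture_output=True, text=True)
    return (completed.stderr or completed.stdout).strip() or None

version = graphviz_dot_version()
print(version if version else "dot was not found on PATH")
```

If this prints that dot was not found, the PATH entry from step d is most likely missing or the IDE has not been restarted since it was added.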

Code

# CSE6242/CX4242 Homework 4 Sketch Code
# Please use this outline to implement your decision tree. You can add any code around this.

import csv
from graphviz import Digraph  # third-party package; must be installed separately
from math import log2
# Enter Your Name Here
#myname = "Doe-John-" # or "Doe-Jane-"
# Implement your decision tree below
def Entropy(p):  # binary entropy; p is the proportion of one class
    q=1-p
    p=max(1e-9,p)  # clamp to avoid log2(0)
    q=max(1e-9,q)
    E=-p*log2(p)-q*log2(q)
    return E
def Gini(p):  # binary Gini index
    q=1-p
    G=1-p*p-q*q
    return G
child=0  # global node counter used by draw()
class DecisionTree():  # decision tree
    def __init__(self,data,threshold,level,*,way):
        self.threshold=threshold  # a node becomes a leaf when its best weighted impurity is below this
        self.levelfeature=level  # number of features (11 in this data set); also the minimum subset size for splitting
        self.way=way  # impurity function: Entropy or Gini
        self.tree=self.learn(data)
    def make_leaf(self,labels):  # leaf predicting the majority label (ties go to 0)
        return {'feature':-1,'result':1 if labels.count(1)>labels.count(0) else 0}
    def learn(self, training_set):  # learn a tree from the training set
        # evaluate the best split of every feature and keep the one with the lowest impurity
        select=sorted([(i,)+self.split(i,training_set) for i in range(self.levelfeature)],key=lambda example:example[1])
        feature,ans,pos=select[0]
        if(pos==-1 or ans<self.threshold):  # no usable split, or impurity already low: single-node tree
            return self.make_leaf([example[-1] for example in training_set])
        # classify() sends values below pos to the left and all others to the right
        left_tree=[example for example in training_set if example[feature]<pos]
        right_tree=[example for example in training_set if example[feature]>=pos]
        if(not left_tree or not right_tree):  # degenerate split: stop instead of recursing forever
            return self.make_leaf([example[-1] for example in training_set])
        tree={'feature':feature,'position':pos}
        # build the child subtrees
        for name,subset in (('left_tree',left_tree),('right_tree',right_tree)):
            labels=[example[-1] for example in subset]
            if(len(subset)<self.levelfeature or labels.count(1)==0 or labels.count(0)==0):
                tree[name]=self.make_leaf(labels)  # subset is too small or pure: stop here
            else:
                tree[name]=self.learn(subset)
        return tree
    def split(self,feature,training_set):  # find the best threshold for one feature
        training_set.sort(key=lambda sample:sample[feature])  # note: sorts the caller's list in place
        labels=[example[-1] for example in training_set]
        y0=labels.count(0)  # label-0 count on the right side
        y1=labels.count(1)  # label-1 count on the right side
        x0=0  # label-0 count on the left side
        x1=0  # label-1 count on the left side
        pos=-1  # -1 means no valid split was found
        ans=1  # minimum weighted impurity found so far
        t=len(training_set)
        for i in range(1,t):
            prev,cur=training_set[i-1],training_set[i]
            if(prev[-1]==1):  # move the previous sample from the right side to the left side
                x1+=1
                y1-=1
            else:
                x0+=1
                y0-=1
            # only consider boundaries where the label changes and the feature values differ
            if(prev[-1]!=cur[-1] and prev[feature]!=cur[feature]):
                h1=(prev[feature]+cur[feature])/2  # midpoint threshold
                h2=(x0+x1)/t*self.way(x0/(x0+x1))+(y0+y1)/t*self.way(y0/(y0+y1))  # weighted impurity of the two sides
                if(ans>h2):
                    ans=h2
                    pos=h1
        return ans,pos
    def classify(self, test_instance):
        # walk from the root to a leaf, going left when the feature value is below the threshold
        thetree=self.tree
        while thetree['feature']!=-1:
            feature=thetree['feature']
            pos=thetree['position']
            if(test_instance[feature]<pos):
                thetree=thetree['left_tree']
            else:
                thetree=thetree['right_tree']
        return thetree['result']
def preorder(tree,r):  # pre-order traversal: node, left subtree, right subtree
    if(tree['feature']==-1):
        print("%d result %d" %(tree['feature'],tree['result']))
        r.write("%d result %d\n" %(tree['feature'],tree['result']))
    else:
        # %g keeps the fractional part of the threshold, which %d would truncate
        print("%d position %g" %(tree['feature'],tree['position']))
        r.write("%d position %g\n" %(tree['feature'],tree['position']))
        preorder(tree['left_tree'],r)
        preorder(tree['right_tree'],r)
def inorder(tree,r):  # in-order traversal: left subtree, node, right subtree
    if(tree['feature']==-1):
        print("%d result %d" %(tree['feature'],tree['result']))
        r.write("%d result %d\n" %(tree['feature'],tree['result']))
    else:
        inorder(tree['left_tree'],r)
        print("%d position %g" %(tree['feature'],tree['position']))
        r.write("%d position %g\n" %(tree['feature'],tree['position']))
        inorder(tree['right_tree'],r)
dot = Digraph(comment='decision tree')
def draw(tree,fea,father):  # draw the tree; node ids come from the global counter child
    global child
    if(child>0):  # every node except the root gets an edge from its parent
        dot.edge("n%d" %(father),"n%d" %(child))
    if tree['feature']==-1:
        dot.node("n%d" %(child),"%d" %(tree['result']))
    else :
        dot.node("n%d" %(child),"feature:%s\nposition:%g" %(fea[tree['feature']],tree['position']))
        father=child
        child+=1
        draw(tree['left_tree'],fea,father)
        child+=1
        draw(tree['right_tree'],fea,father)
def run_decision_tree():
    # Load data set: the first row holds the column names
    with open("ex6Data.csv") as f:
        reader = csv.reader(f, delimiter=",")
        fea = next(reader)
        data = [tuple(line) for line in reader if line]
    print("Number of records: %d" % (len(data)))
    for i in range(len(data)):
        label=[int(data[i][-1])]
        data[i]=[float(x) for x in data[i][:-1]]+label
    # K-fold cross validation: record i belongs to fold i % K
    K = 10
    Accuracy=0  # sum of the per-fold accuracies
    r = open("result.txt", "w")
    for j in range(K):
        training_set = [x for i, x in enumerate(data) if i % K != j]
        test_set = [x for i, x in enumerate(data) if i % K == j]
        # Construct a tree using training set
        tree = DecisionTree(training_set,0.2,11,way=Gini)
        # Classify the test set using the tree we just constructed
        results = []
        for instance in test_set:
            result = tree.classify( instance[:-1] )
            results.append( result == instance[-1])
        # Accuracy of this fold
        accuracy = float(results.count(True))/float(len(results))
        Accuracy+=accuracy
        print("preorder traversal\nfeature index, split position or result")
        r.write("preorder traversal: feature index, split position or result\n")
        preorder(tree.tree,r)
        print("inorder traversal\nfeature index, split position or result")
        r.write("inorder traversal: feature index, split position or result\n")
        inorder(tree.tree,r)
        print("accuracy: %.4f" % accuracy)
        r.write("accuracy: %.4f\n" % accuracy)

    Accuracy/=K  # mean accuracy over the K folds
    draw(tree.tree,fea,0)  # draw the tree from the last fold
    dot.view()
    print("Accuracy: %.4f" % Accuracy)
    r.write("Accuracy: %.4f" % Accuracy)
    r.close()
if __name__ == "__main__":
    run_decision_tree()
