My Machine Learning Journey: Decision Trees

暮雨橙海

Article link: https://blog.csdn.net/qq_36130482/article/details/71512576

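1. What is a decision tree

Decision trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A decision tree is a flowchart-like tree structure: each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class or a class distribution. The topmost node of the tree is the root node.
2. Entropy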

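In 1948, Shannon introduced the concept of information entropy. The amount of information in a message is directly related to its uncertainty: to understand something highly uncertain, or something we know nothing about, we need a large amount of information. The measure of information is therefore equal to the amount of uncertainty. It can be computed with the following formula, where p(x_i) is the probability of the i-th possible outcome:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$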
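As a quick sanity check of the entropy definition, here is a minimal sketch that evaluates it for a few distributions. The helper name `entropy` is made up for this illustration and is not part of the article's code; the last call uses the class proportions of the sample data set from section 3 (2 'yes', 3 'no').

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Terms with p == 0 contribute nothing and are skipped to avoid log2(0)
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))            # fair coin: 1.0 bit
print(entropy([1.0]))                 # a certain outcome: 0.0 bits
print(round(entropy([2/5, 3/5]), 3))  # 2 'yes' vs 3 'no': about 0.971
```

The more evenly the probability is spread across outcomes, the higher the entropy; a distribution with a single certain outcome has none.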

3. Code

# -*- coding: utf-8 -*-
"""
Created on Mon May  8 13:49:04 2017

@author: ThinkCentre
"""
from math import log
import operator
#==============================================================================
# Compute the Shannon entropy
#==============================================================================
def calcshannonEnt(dataSet):
    # Number of samples (rows) in dataSet
    numEntries = len(dataSet)
    # Dictionary mapping each class label to its count
    labelCounts = {}
    # Iterate over every sample (feature vector) in dataSet
    for featVec in dataSet:
        # The class label ('yes' or 'no') is the last element
        currentLabel = featVec[-1]
        # Count the occurrences of each class label
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    # Accumulate -p * log2(p) over all class labels
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob*log(prob,2)
    return shannonEnt
#==============================================================================
# Build the data set
#==============================================================================
def creatDataSet():
    # Each row is [no surfacing, flippers, class label]
    dataSet = [[1,1,'yes'],
               [1,1,'yes'],
               [1,0,'no'],
               [0,1,'no'],
               [0,1,'no']]
    labels = ['no surfacing','flippers']
    return dataSet,labels
#==============================================================================
# Split the data set
#==============================================================================
def splitDataSet(dataSet,axis,value):
    retDataSet = []
    for featVec in dataSet:
        # Keep only the samples whose feature at index axis equals value
        if featVec[axis] == value:
            # reducedFeatVec is featVec with the feature at index axis removed
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            # Collect the reduced sample into the split data set
            retDataSet.append(reducedFeatVec)
    return retDataSet

#==============================================================================
# Choose the best feature to split on
#==============================================================================

def chooseBestFeatureToSplit(dataSet):
    # Number of features (the last column is the class label)
    numFeatures = len(dataSet[0])-1
    # Shannon entropy of the whole data set
    baseEntropy = calcshannonEnt(dataSet)

    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        # All values in column i
        featList = [example[i] for example in dataSet]
        # Distinct values of feature i
        uniqueVals = set(featList)
        newEntropy = 0.0
        # Iterate over every distinct value
        for value in uniqueVals:
            # Subset of samples where feature i equals value
            subDataSet = splitDataSet(dataSet,i,value)
            # Fraction of samples that fall into this subset
            prob = len(subDataSet)/float(len(dataSet))
            # Weighted Shannon entropy after the split
            newEntropy += prob*calcshannonEnt(subDataSet)
        # Information gain of splitting on feature i
        infoGain = baseEntropy-newEntropy
        # Keep the feature with the largest information gain
        if (infoGain>bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
#==============================================================================
# Return the class label that occurs most often
#==============================================================================
def majorityCnt(classList):
    classCount = {}
    # Count the occurrences of each class label
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort by count, largest first
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

#==============================================================================
# Build the decision tree
#==============================================================================
def creatTree(dateSet,labels):
    # Work on a copy so the caller's labels list is not modified
    labels = labels[:]
    # Class labels of all samples
    classList = [example[-1] for example in dateSet]
    # Stop splitting when every sample belongs to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # No features left (each row holds only the class label):
    # return the most common class
    if len(dateSet[0]) == 1:
        return majorityCnt(classList)
    # Feature that gives the best split
    bestFeat = chooseBestFeatureToSplit(dateSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    # Remove the chosen feature from the label list
    del(labels[bestFeat])
    # Distinct values taken by the chosen feature
    featValues = [example[bestFeat] for example in dateSet]
    uniqueVals = set(featValues)
    # Recursively build one subtree per value and store it in myTree
    for value in uniqueVals:
        subLabel = labels[:]
        myTree[bestFeatLabel][value] = creatTree(splitDataSet(dateSet,bestFeat,value),subLabel)
    return myTree
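To see which feature the tree will split on first, the information gain of each feature can be worked out separately. Information gain here means the entropy of the whole data set minus the weighted entropy of the subsets produced by the split. The sketch below is self-contained and uses its own helper names (`label_entropy`, `info_gain`), which are made up for this illustration and are not part of the module above; the data is the same five-row sample set.

```python
from collections import Counter
from math import log2

# Same sample data as creatDataSet(): [no surfacing, flippers, class label]
data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

def label_entropy(rows):
    """Shannon entropy (in bits) of the class labels in the last column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

def info_gain(rows, axis):
    """Entropy before the split minus the weighted entropy after it."""
    remainder = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        remainder += len(subset) / len(rows) * label_entropy(subset)
    return label_entropy(rows) - remainder

gains = [info_gain(data, axis) for axis in range(2)]
print([round(g, 3) for g in gains])  # [0.42, 0.171]
```

Feature 0 ('no surfacing') has the larger gain, so it becomes the root of the tree.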

4. Testing
Save the code above as DecisionTree.py, then create a new file test.py:

import DecisionTree
myDat, labels = DecisionTree.creatDataSet()
myTree = DecisionTree.creatTree(myDat,labels)
print(myTree)

Save and run it; the output is:

 {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
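The article stops at building the tree. As a rough sketch of how the resulting nested dictionary could be used to classify a new sample, the function below walks the tree from the root down to a leaf. The function `classify` and the variable `feature_names` are assumptions introduced for this example; they are not defined in the code above.

```python
def classify(tree, feature_names, sample):
    """Walk the nested-dict tree until a leaf (a class label) is reached."""
    # A leaf is anything that is not a dict, e.g. 'yes' or 'no'
    if not isinstance(tree, dict):
        return tree
    # Each internal node is a dict with a single key: the feature it tests
    feature = next(iter(tree))
    branches = tree[feature]
    # Look up the sample's value for that feature and follow the branch
    value = sample[feature_names.index(feature)]
    return classify(branches[value], feature_names, sample)

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feature_names = ['no surfacing', 'flippers']

print(classify(tree, feature_names, [1, 1]))  # yes
print(classify(tree, feature_names, [1, 0]))  # no
print(classify(tree, feature_names, [0, 1]))  # no
```

Note that this sketch raises a KeyError if a sample has a feature value that never appeared in the training data; handling that case is left out for brevity.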
