Machine Learning (4): Decision Trees
4.1 Overview of Decision Trees
A decision tree is a non-parametric supervised learning method: it learns rules from the training data and uses them to predict on new data; informally, it is a set of if-then rules. The goal is to make the samples in each node belong to one class as far as possible. The tree is built by recursively selecting the best feature and splitting the dataset on it, so that every subset ends up as correctly classified as possible.
Commonly used feature-selection criteria and the algorithms built on them:
Information gain: the ID3 algorithm
Information gain ratio: the C4.5 algorithm
Gini index: the CART algorithm
A comparison of the three algorithms:
Model | Task | Continuous values | Missing values
---|---|---|---
ID3 | Classification | Not supported | Not supported
C4.5 | Classification | Supported | Supported
CART | Classification & regression | Supported | Supported
4.1.1 The ID3 Algorithm
ID3 is a classification algorithm that uses information gain as its splitting criterion. It is built on entropy: the lower the entropy, the purer the set, so the feature with the smallest conditional entropy (i.e., the largest information gain) is chosen at each node.
The general procedure is as follows:
Let D be the dataset, |D| the number of samples, and K the number of classes, the k-th class being C_k. Suppose feature A takes j distinct values {a_1, ..., a_j}.
A then partitions D into j subsets D_1, ..., D_j.
D_ik denotes the samples of subset D_i that belong to class C_k.
Given
$p_{k}=\frac{\left | C_{k} \right |}{\left | D \right |},\qquad p_{i}=\frac{\left | D_{i} \right |}{\left | D \right |}$
① The total information entropy is
$$Entropy(D) = -\sum_{k=1}^{K} p_{k}\log_{2} p_{k} = -\sum_{k=1}^{K} \frac{\left | C_{k} \right |}{\left | D \right |}\log_{2}\frac{\left | C_{k} \right |}{\left | D \right |}$$
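For example, the watermelon dataset used in Section 4.2 contains 8 good and 9 bad melons, so $Entropy(D) = -\frac{8}{17}\log_{2}\frac{8}{17} - \frac{9}{17}\log_{2}\frac{9}{17} \approx 0.998$.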
② The empirical conditional entropy given feature A is
$$Entropy(D|A) = \sum_{i=1}^{j} p_{i}\,Entropy(D_{i}) = -\sum_{i=1}^{j}\frac{\left | D_{i} \right |}{\left | D \right |}\sum_{k=1}^{K}\frac{\left | D_{ik} \right |}{\left | D_{i} \right |}\log_{2}\frac{\left | D_{ik} \right |}{\left | D_{i} \right |}$$
③ The information gain of feature A is
$$Gain(D,A) = Entropy(D) - Entropy(D|A)$$
④ Apply ①-③ to compute the information gain of every candidate feature and split on the one with the largest value.
⑤ Repeat ①-④ until every leaf node is pure; the decision tree is then complete. A toy sketch of these steps follows.
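As a minimal sketch of steps ①-③ (a toy example with hypothetical helper names, separate from the watermelon implementation in Section 4.2):

import math
from collections import Counter

def entropy(labels):
    # step ①: Shannon entropy of a list of class labels
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    # step ②: conditional entropy of a discrete feature; step ③: the gain
    n = len(labels)
    cond = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        cond += len(subset) / n * entropy(subset)  # weighted child entropy
    return entropy(labels) - cond

rows = [(0,), (0,), (1,), (1,)]    # one binary feature
labels = [1, 1, 0, 1]
print(info_gain(rows, labels, 0))  # ~0.311, the gain of splitting on feature 0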
4.1.2 The C4.5 Algorithm
C4.5 uses the information gain ratio as its criterion. Building on ID3, it can handle continuous attributes as well as datasets with missing values.
Two extra steps are appended to steps ①-③ of ID3:
$$Split(D) = -\sum_{i=1}^{j} p_{i}\log_{2} p_{i} = -\sum_{i=1}^{j}\frac{\left | D_{i} \right |}{\left | D \right |}\log_{2}\frac{\left | D_{i} \right |}{\left | D \right |}$$
$$GainRate(A) = \frac{Gain(D,A)}{Split(D)}$$
The feature with the largest information gain ratio is chosen for the split.
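Continuing the toy sketch from 4.1.1 (reusing its entropy and info_gain helpers; the names are my own, not from the original code):

def split_info(rows, feature):
    # Split(D): entropy of the partition induced by the feature's own values
    n = len(rows)
    counts = Counter(row[feature] for row in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain_ratio(rows, labels, feature):
    # C4.5's criterion: information gain normalised by the split information
    si = split_info(rows, feature)
    return info_gain(rows, labels, feature) / si if si > 0 else 0.0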
4.1.3 The CART Algorithm
CART is a classification-and-regression algorithm based on the Gini index; the split with the smallest Gini index is chosen.
① Compute the Gini index of the whole dataset:
$$Gini(D) = 1 - \sum_{k=1}^{K}\left(\frac{\left | C_{k} \right |}{\left | D \right |}\right)^{2}$$
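With the same 8 good / 9 bad melons as before, $Gini(D) = 1 - \left(\frac{8}{17}\right)^{2} - \left(\frac{9}{17}\right)^{2} \approx 0.498$.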
② Compute the Gini index of each candidate split of each feature (a split divides D into two parts D_1 and D_2):
$$Gini(D,A) = \frac{\left | D_{1} \right |}{\left | D \right |}Gini(D_{1}) + \frac{\left | D_{2} \right |}{\left | D \right |}Gini(D_{2})$$
③ Choose the split with the smallest Gini index.
Continuous attributes:
Discretize by sorting the values; m values give m-1 candidate split points, each dividing the data into D_1 and D_2. Compute the Gini index of every split and keep the smallest, as in the sketch below.
Discrete (categorical) attributes:
Put one value into D_1 and the remaining values into D_2, compute the Gini index of each such binary split, and keep the smallest.
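A small sketch of the continuous case (hypothetical helper names; it assumes at least two distinct values):

from collections import Counter

def gini(labels):
    # Gini index of a list of class labels
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, threshold):
    # weighted Gini of the binary split: values <= threshold vs. > threshold
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    # try the m-1 midpoints of adjacent sorted values, keep the smallest Gini
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: gini_split(values, labels, t))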
4.2 Implementation
4.2.1 Dataset
The dataset is the watermelon dataset from Exercise 4.3 of Zhou Zhihua's Machine Learning.
4.2.2 Code
4.2.2.1 Building the decision tree
Compute the information gain of each feature and pick the best one to grow the tree.
Entropy: computes the Shannon entropy.
Entropy_Gain: computes the information gain of a feature. It first splits the samples into groups by the feature's value, then computes the entropy of each group, and finally the gain.
select_best_feature: iterates over all candidate features, computes each one's information gain, and returns the feature with the largest gain.
import math

# Watermelon dataset (Zhou Zhihua, Machine Learning, Exercise 4.3):
# columns 0-5 are the discrete attributes (color, root, knock sound,
# texture, navel, touch), column 6 is the density, column 7 the label
# (1 = good melon, 0 = bad melon)
data = [[0, 0, 0, 0, 0, 0, 0.697, 1],
[1, 0, 1, 0, 0, 0, 0.774, 1],
[1, 0, 0, 0, 0, 0, 0.634, 1],
[0, 0, 1, 0, 0, 0, 0.608, 1],
[2, 0, 0, 0, 0, 0, 0.556, 1],
[0, 1, 0, 0, 1, 1, 0.403, 1],
[1, 1, 0, 1, 1, 1, 0.481, 1],
[1, 1, 0, 0, 1, 0, 0.437, 1],
[1, 1, 1, 1, 1, 0, 0.666, 0],
[0, 2, 2, 0, 2, 1, 0.243, 0],
[2, 2, 2, 2, 2, 0, 0.245, 0],
[2, 0, 0, 2, 2, 1, 0.343, 0],
[0, 1, 0, 1, 0, 0, 0.639, 0],
[2, 1, 1, 1, 0, 0, 0.657, 0],
[1, 1, 0, 0, 1, 1, 0.360, 0],
[2, 0, 0, 2, 2, 0, 0.593, 0],
[0, 0, 1, 1, 1, 0, 0.719, 0]]
# 16 candidate split points for density: the midpoints of adjacent sorted values
divide_point = [0.244, 0.294, 0.351, 0.381, 0.420, 0.459, 0.518, 0.574, 0.600, 0.621, 0.636, 0.648, 0.661, 0.681, 0.708,
0.746]
def Entropy(melons):
    # Shannon entropy of a melon set (labels are in column 7)
    melons_num = len(melons)
    pos_num = 0
    for i in range(melons_num):
        if melons[i][7] == 1:
            pos_num = pos_num + 1
    nag_num = melons_num - pos_num
    p_pos = pos_num / melons_num
    p_nag = nag_num / melons_num
    # the tree-building code only calls this on mixed-class sets,
    # so neither probability is zero here
    entropy = -(p_pos * math.log(p_pos, 2) + p_nag * math.log(p_nag, 2))
    return entropy
def Entropy_Gain(melons, charac):
    # Information gain of feature `charac`:
    #   0-4: three-valued discrete attributes; 5: the binary touch attribute;
    #   >= 6: a binary split of density at divide_point[charac - 6]
    charac_entropy = 0
    entropy_gain = 0
    melons_num = len(melons)
    if charac >= 6:
        # count good/bad melons on each side of the density split point
        class0_small_num = 0
        class0_big_num = 0
        class1_small_num = 0
        class1_big_num = 0
for i in range(melons_num):
if melons[i][7] == 1:
if melons[i][6] > divide_point[charac - 6]:
class1_big_num = class1_big_num + 1
else:
class1_small_num = class1_small_num + 1
else:
if melons[i][6] > divide_point[charac - 6]:
class0_big_num = class0_big_num + 1
else:
class0_small_num = class0_small_num + 1
if class0_small_num == 0 and class1_small_num == 0:
p0_small = 0
p1_small = 0
else:
p0_small = class0_small_num / (class0_small_num + class1_small_num)
p1_small = class1_small_num / (class0_small_num + class1_small_num)
if class0_big_num == 0 and class1_big_num == 0:
p0_big = 0
p1_big = 0
else:
p0_big = class0_big_num / (class0_big_num + class1_big_num)
p1_big = class1_big_num / (class0_big_num + class1_big_num)
        # entropy of the "small" side, weighted by its share (log(0) guarded)
        if p0_small != 0 and p1_small != 0:
entropy_small = -(class0_small_num + class1_small_num) / melons_num * (
-(p0_small * math.log(p0_small, 2)
+ p1_small * math.log(p1_small, 2)))
elif p0_small == 0 and p1_small != 0:
entropy_small = -(class0_small_num + class1_small_num) / melons_num * (
-p1_small * math.log(p1_small, 2))
elif p0_small != 0 and p1_small == 0:
entropy_small = -(class0_small_num + class1_small_num) / melons_num * (
-p0_small * math.log(p0_small, 2))
else:
entropy_small = 0
        # entropy of the "big" side, weighted by its share
        if p0_big != 0 and p1_big != 0:
entropy_big = -(class0_big_num + class1_big_num) / melons_num * (
-(p0_big * math.log(p0_big, 2) + p1_big *
math.log(p1_big, 2)))
elif p0_big == 0 and p1_big != 0:
entropy_big = -(class0_big_num + class1_big_num) / melons_num * (
-p1_big * math.log(p1_big, 2))
elif p0_big != 0 and p1_big == 0:
entropy_big = -(class0_big_num + class1_big_num) / melons_num * (
-p0_big * math.log(p0_big, 2))
else:
entropy_big = 0
        # gain = H(D) - weighted side entropies (the minus signs are already inside)
        entropy_gain = Entropy(melons) + entropy_small + entropy_big
    elif charac == 5:  # the binary touch attribute
class0_melons = []
class1_melons = []
class_melons = [[], []]
for i in range(melons_num):
if melons[i][5] == 0:
class0_melons.append(melons[i][7])
else:
class1_melons.append(melons[i][7])
class_melons[0] = class0_melons
class_melons[1] = class1_melons
        for i in range(2):
            class0_num = 0
            class1_num = 0
            total_num = len(class_melons[i])
            if total_num == 0:  # no melon takes this value: skip it
                continue
            for j in range(total_num):
                if class_melons[i][j] == 0:
                    class0_num = class0_num + 1
                else:
                    class1_num = class1_num + 1
            p_class0 = class0_num / total_num
            p_class1 = class1_num / total_num
            if p_class0 != 0 and p_class1 != 0:  # guard against log(0)
entropy_class = -p_class0 * math.log(p_class0, 2) - p_class1 * math.log(p_class1, 2)
elif p_class0 == 0 and p_class1 != 0:
entropy_class = - p_class1 * math.log(p_class1, 2)
else:
entropy_class = -p_class0 * math.log(p_class0, 2)
charac_entropy = charac_entropy - total_num / melons_num * entropy_class
entropy_gain = Entropy(melons) + charac_entropy
    else:  # the three-valued discrete attributes 0-4
class0_melons = []
class1_melons = []
class2_melons = []
class_melons = [[], [], []]
for i in range(melons_num):
if melons[i][charac] == 0:
class0_melons.append(melons[i][7])
elif melons[i][charac] == 1:
class1_melons.append(melons[i][7])
else:
class2_melons.append(melons[i][7])
class_melons[0] = class0_melons
class_melons[1] = class1_melons
class_melons[2] = class2_melons
for i in range(3):
class0_num = 0
class1_num = 0
total_num = len(class_melons[i])
if total_num != 0:
for j in range(total_num):
if class_melons[i][j] == 0:
class0_num = class0_num + 1
else:
class1_num = class1_num + 1
p_class0 = class0_num / total_num
p_class1 = class1_num / total_num
                if p_class0 != 0 and p_class1 != 0:  # guard against log(0)
entropy_class = -p_class0 * math.log(p_class0, 2) - p_class1 * math.log(p_class1, 2)
elif p_class0 == 0 and p_class1 != 0:
entropy_class = - p_class1 * math.log(p_class1, 2)
else:
entropy_class = -p_class0 * math.log(p_class0, 2)
                charac_entropy = charac_entropy - total_num / melons_num * entropy_class
        # empty value-subsets contribute nothing to the conditional entropy
        entropy_gain = Entropy(melons) + charac_entropy
return [entropy_gain, charac]
def select_best_feature(melons, features):
    # scan every candidate feature and keep the [gain, feature] pair
    # with the largest information gain
    max_entropy = Entropy_Gain(melons, features[0])
    for i in range(1, len(features)):
        entropy = Entropy_Gain(melons, features[i])
        if entropy[0] > max_entropy[0]:
            max_entropy = entropy
    return max_entropy
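A quick way to try it (reusing the data list above; the return value is a [gain, feature] pair):

# best first split among all 22 candidates: 6 discrete features + 16 density split points
print(select_best_feature(data, list(range(22))))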
4.2.2.2 Prediction
Calling tree_generate with the training data data and the feature set A builds a decision tree keyed on features and split points. find_most returns the most frequent class and select_best_feature (from the previous listing) picks the best feature; together they drive the recursive splitting.
from Divide_Select import *
import numpy as np

# Training set `data` and feature set A:
# 0 color, 1 root, 2 knock sound, 3 texture, 4 navel, 5 touch;
# for density, every split point counts as one feature (16 points in all),
# i.e. features 6-21
A = list(range(22))

def find_most(x):
    # most frequent class label in x
    x = np.asarray(x)
    return sorted([(np.sum(x == i), i) for i in np.unique(x)])[-1][-1]
def tree_generate(melons, features):
    melons_y = [i[7] for i in melons]
    # case 1: all samples share one class -> return a leaf with that class
    if len(np.unique(melons_y)) == 1:
        return melons_y[0]
    # case 2: no features left, or the samples agree on every discrete
    # attribute -> return a leaf with the majority class
    same_flag = 1
    for i in range(6):
        if len(np.unique([j[i] for j in melons])) > 1:
            same_flag = 0
    if not features or same_flag == 1:
        return find_most(melons_y)
    [max_entropy, best_feature] = select_best_feature(melons, features)
    node = {best_feature: {}}
    division = list()
    to_divide = list()
    if best_feature < 6:
        # discrete attribute: branch on every value seen in the FULL dataset,
        # so values absent from this subset still get a (majority-class) branch
        division = [i[best_feature] for i in data]
        to_divide = [i[best_feature] for i in melons]
    else:
        # density split point: binary branch on > / <= divide_point
        for j in [i[6] for i in melons]:
            if j > divide_point[best_feature - 6]:
                to_divide.append(1)
            else:
                to_divide.append(0)
        division = [0, 1]
    for i in np.unique(division):
        # rows of `melons` that fall into this branch
        loc = np.where(np.asarray(to_divide) == i)
        if len(loc[0]) == 0:
            # empty branch: fall back to the parent's majority class
            node[best_feature][i] = find_most(melons_y)
        else:
            new_melons = [melons[k] for k in loc[0]]
            if best_feature in features:
                features.remove(best_feature)  # each feature is used at most once
            node[best_feature][i] = tree_generate(new_melons, features)
    return node
print(tree_generate(data, A))
4.2.2.3 Plotting
Call createPlot with the decision tree myTree to draw it. The tree classifies watermelons as good or bad from the given features and split points.
import matplotlib.pyplot as plt

# node and arrow styles for the annotation-based tree drawing
decisionNode = dict(boxstyle="square", pad=0.5, fc="0.8")
leafNode = dict(boxstyle="circle", fc="0.8")
arrow_args = dict(arrowstyle="<-")
plt.rcParams['font.sans-serif'] = ['SimHei']  # a font that can render Chinese labels
def getNumLeafs(myTree):
    # number of leaves = the horizontal extent of the drawing
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]) is dict:
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    # tree depth = the vertical extent of the drawing
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]) is dict:
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        maxDepth = max(maxDepth, thisDepth)
    return maxDepth
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    # draw a node box at centerPt with an arrow coming from parentPt
    createPlot.ax1.annotate(str(nodeTxt), xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def plotMidText(cntrPt, parentPt, txtString):
    # label the edge between parent and child at its midpoint
    xMid = (parentPt[0] - cntrPt[0]) / 2 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)
def plotTree(myTree, parentPt, nodeTxt):
    # the width of a subtree (its leaf count) decides how far apart leaves sit
    numLeafs = getNumLeafs(myTree)
    cntrPt = (plotTree.xOff + (1 + numLeafs) / 2 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    firstStr = list(myTree.keys())[0]
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1 / plotTree.totalD  # descend one level
    for key in secondDict.keys():
        if type(secondDict[key]) is dict:
            plotTree(secondDict[key], cntrPt, str(key))  # recurse into the subtree
        else:
            # leaf: advance one slot to the right and draw it
            plotTree.xOff = plotTree.xOff + 1 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1 / plotTree.totalD  # back up one level
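The listing above calls createPlot but never defines it; a minimal driver in the usual Machine Learning in Action style, which the plotTree helpers assume, would be:

def createPlot(inTree):
    # one borderless axes, stored on the function object so that
    # plotNode/plotMidText can reach it
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

createPlot(myTree)  # myTree: the dict returned by tree_generate(data, A)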
4.2.3 Results
4.3 Summary
4.3.1 Advantages
Fast to train and evaluate
Good accuracy
Handles both continuous and categorical fields
Needs no domain knowledge or parameter assumptions
Suitable for high-dimensional data
4.3.2 Disadvantages
Prone to overfitting
Ignores correlations between features
Performs poorly when class sample sizes are imbalanced
Feature selection (with information gain) is biased toward features with many distinct values
4.3.3 Further Thoughts
4.3.3.1 Ensemble learning: random forests
Ensemble learning grew out of the equivalence between strong and weak learnability: combining many classifiers produces a strong model with better generalization. Common ways to fuse the base learners are averaging and voting.
Common algorithms:
Bagging: train several classifiers independently and combine them by voting
Boosting: repeatedly build new models that correct the mistakes of the old ones
Stacking: a layered framework in which one group of learners feeds its predictions to the next before the final prediction is formed
A random forest builds on decision trees, random subspaces, and the Bagging idea: both the samples and the features used to grow each tree are drawn at random.
The basic steps are (see the sketch after this list):
Randomly sample the training set (with replacement)
Randomly select features
Build many decision trees
Let the forest vote
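As an illustration (a minimal scikit-learn sketch on synthetic data, not part of the original implementation):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
# every tree sees a bootstrap sample of the rows and, at each split,
# a random subset of about sqrt(n_features) candidate features
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]))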
Advantages:
Handles both classification and regression, with excellent generalization
Copes well with high-dimensional data
Tolerates missing data and needs no normalization
Offers ways to balance the error when classes are imbalanced
Trains fast, is highly parallel, and is easy to distribute
Disadvantages:
For regression it cannot produce genuinely continuous output and may overfit
Its inner workings are hard to control; one can only try different parameters and random seeds
Ignores correlations between features