Machine Learning (4): Decision Trees

4.1 Overview of Decision Trees

A decision tree is a non-parametric supervised learning method. It learns a set of rules from the training data and applies them to predict new data; informally, it is a collection of if-then rules. The goal is to make the samples falling into each node belong to the same class as far as possible, so that classification is accurate. The tree is built by recursively choosing the best feature and splitting the data set on it, so that every resulting subset is classified as well as possible.
Commonly used splitting criteria and the corresponding algorithms:
Information gain: ID3
Information gain ratio: C4.5
Gini index: CART
A comparison of the three algorithms:

| Model | Task | Continuous values | Missing values |
|-------|------|-------------------|----------------|
| ID3   | Classification | Not supported | Not supported |
| C4.5  | Classification | Supported | Supported |
| CART  | Classification and regression | Supported | Supported |
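
For a quick point of reference, scikit-learn's DecisionTreeClassifier implements an optimized version of CART and exposes the splitting criterion as a parameter, so Gini-based and entropy-based trees share one API (a minimal illustration, assuming scikit-learn is installed):

```python
from sklearn.tree import DecisionTreeClassifier

# CART-style trees with two different splitting criteria
tree_gini = DecisionTreeClassifier(criterion="gini")         # Gini index, CART's default
tree_entropy = DecisionTreeClassifier(criterion="entropy")   # entropy / information gain, ID3 and C4.5 style
```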


4.1.1 ID3 Algorithm

ID3 is a classification algorithm that uses information gain as its splitting criterion. It builds on entropy: the smaller the entropy, the purer the node. At each node, the feature with the smallest conditional entropy (equivalently, the largest information gain) is chosen for splitting.

The general procedure is as follows:

Let D be the data set and |D| the total number of samples; the samples fall into K classes, denoted C_k. A feature A takes j distinct values {a_1, ..., a_j}.
Splitting on A divides D into j subsets D_1, ..., D_j.
D_ik denotes the samples of subset D_i that belong to class C_k.

Given
$p_{k}=\frac{\left |C_{k} \right |}{\left |D \right |}$

$p_{i}=\frac{\left |D_{i} \right |}{\left |D \right |}$

① The total information entropy is
$Entropy(D) = -\sum_{k=1}^{K}p_{k}\log_{2}(p_{k}) = -\sum_{k=1}^{K}\frac{\left |C_{k} \right |}{\left |D \right |}\log_{2}\left(\frac{\left |C_{k} \right |}{\left |D \right |}\right)$

② The empirical conditional entropy of D given feature A is
$Entropy(D|A) = \sum_{i=1}^{j}p_{i}\,Entropy(D_{i}) = -\sum_{i=1}^{j}\frac{\left |D_{i} \right |}{\left |D \right |}\sum_{k=1}^{K}\frac{\left |D_{ik} \right |}{\left |D_{i} \right |}\log_{2}\left(\frac{\left |D_{ik} \right |}{\left |D_{i} \right |}\right)$

③ The information gain of feature A is
$Gain(D,A) = Entropy(D) - Entropy(D|A)$

④ Perform ①-③ for every candidate feature and expand the node with the feature whose information gain is largest.

⑤ Repeat ①-④ recursively until every leaf node contains a single class; the decision tree is then complete.
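
To make steps ①-④ concrete, here is a minimal sketch that computes the entropy of a label list and the information gain of a discrete feature on a toy example; the helper names entropy and info_gain are purely illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(values, labels):
    # information gain of splitting `labels` by the discrete feature `values`
    total = len(labels)
    cond = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        cond += len(subset) / total * entropy(subset)   # weighted entropy of each branch
    return entropy(labels) - cond

feature = [0, 0, 1, 1, 1, 0]   # toy feature with two values
label = [1, 1, 0, 0, 1, 1]     # toy binary labels
print(info_gain(feature, label))
```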

4.1.2 C4.5 Algorithm

C4.5 uses the information gain ratio as its splitting criterion. Building on ID3, it can handle continuous features as well as data sets with missing values.
Two extra steps are added after steps ①-③ of ID3:
$Split(D,A) = -\sum_{i=1}^{j}\frac{\left |D_{i} \right |}{\left |D \right |}\log_{2}\left(\frac{\left |D_{i} \right |}{\left |D \right |}\right)$

$GainRate(A) = \frac{Gain(D,A)}{Split(D,A)}$

The feature with the largest information gain ratio is chosen as the splitting node.
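
Continuing the toy sketch from the ID3 section, the gain ratio only needs the split information of the feature itself (again, illustrative names that reuse info_gain, feature and label from the previous sketch):

```python
def split_info(values):
    # intrinsic value (split information) of a discrete feature
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def gain_ratio(values, labels):
    si = split_info(values)
    return info_gain(values, labels) / si if si > 0 else 0.0

print(gain_ratio(feature, label))
```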

4.1.3 CART Algorithm

CART is a classification and regression algorithm based on the Gini index; the split with the smallest Gini index is chosen as the splitting node.
① Compute the overall Gini index:
$Gini(D) = 1 - \sum_{k=1}^{K}\left(\frac{\left |C_{k} \right |}{\left |D \right |}\right)^{2}$

② Compute the Gini index of every candidate split of each feature (a split divides D into two parts, D_1 and D_2):
$Gini(D,A) = \frac{\left |D_{1} \right |}{\left |D \right |}Gini(D_{1}) + \frac{\left |D_{2} \right |}{\left |D \right |}Gini(D_{2})$

③ Choose the split with the smallest Gini index as the splitting node.

Continuous features:
Discretize the continuous values: sort them in order; m distinct values give m-1 candidate split points, each dividing the samples into D_1 and D_2. Compute the Gini index for every candidate split and keep the smallest one.

Discrete (categorical) features:
Put one value into D_1 and the remaining values into D_2, compute the Gini index for each such split, and keep the smallest one.
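
A minimal sketch of the weighted Gini index of a binary split, together with the scan over the m-1 candidate split points of a continuous feature described above (the data values are illustrative, not the watermelon data):

```python
from collections import Counter

def gini(labels):
    # Gini index of a list of class labels
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, threshold):
    # weighted Gini index of splitting a continuous feature at `threshold`
    left = [l for x, l in zip(values, labels) if x <= threshold]
    right = [l for x, l in zip(values, labels) if x > threshold]
    total = len(labels)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

density = [0.243, 0.343, 0.437, 0.608, 0.697]   # toy continuous feature
label = [0, 0, 1, 1, 1]
points = sorted(density)
candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]   # the m-1 midpoints
best = min(candidates, key=lambda t: gini_split(density, label, t))
print(best, gini_split(density, label, best))
```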

4.2 Implementation

4.2.1 The Data Set

The data set is the watermelon data set from exercise 4.3 of Zhou Zhihua's Machine Learning. Each sample has six discrete attributes (color, root, knock sound, texture, navel, touch), a continuous density value, and a binary label (1 = good melon).

4.2.2 Code

4.2.2.1 Building the Decision Tree

The information gain of each feature is computed in order to pick the best feature when building the tree.
Entropy: computes the Shannon entropy of a sample set.
Entropy_Gain: computes the information gain of a feature. The function first partitions the samples according to the feature's values, then computes the entropy of each partition, and finally the information gain.
select_best_feature: selects the best feature. The function iterates over all candidate features, computes the information gain of each, and returns the feature with the largest gain.

import math
data = [[0, 0, 0, 0, 0, 0, 0.697, 1],
        [1, 0, 1, 0, 0, 0, 0.774, 1],
        [1, 0, 0, 0, 0, 0, 0.634, 1],
        [0, 0, 1, 0, 0, 0, 0.608, 1],
        [2, 0, 0, 0, 0, 0, 0.556, 1],
        [0, 1, 0, 0, 1, 1, 0.403, 1],
        [1, 1, 0, 1, 1, 1, 0.481, 1],
        [1, 1, 0, 0, 1, 0, 0.437, 1],
        [1, 1, 1, 1, 1, 0, 0.666, 0],
        [0, 2, 2, 0, 2, 1, 0.243, 0],
        [2, 2, 2, 2, 2, 0, 0.245, 0],
        [2, 0, 0, 2, 2, 1, 0.343, 0],
        [0, 1, 0, 1, 0, 0, 0.639, 0],
        [2, 1, 1, 1, 0, 0, 0.657, 0],
        [1, 1, 0, 0, 1, 1, 0.360, 0],
        [2, 0, 0, 2, 2, 0, 0.593, 0],
        [0, 0, 1, 1, 1, 0, 0.719, 0]]

divide_point = [0.244, 0.294, 0.351, 0.381, 0.420, 0.459, 0.518, 0.574, 0.600, 0.621, 0.636, 0.648, 0.661, 0.681, 0.708,
                0.746]

def Entropy(melons):
    # Shannon entropy of a sample set; column 7 holds the binary label (1 = good melon)
    melons_num = len(melons)
    pos_num = 0
    nag_num = 0
    for i in range(melons_num):
        if melons[i][7] == 1:
            pos_num = pos_num + 1
    nag_num = melons_num - pos_num
    if pos_num == 0 or nag_num == 0:    # a pure set has zero entropy; avoid log(0)
        return 0
    p_pos = pos_num / melons_num
    p_nag = nag_num / melons_num
    entropy = -(p_pos * math.log(p_pos, 2) + p_nag * math.log(p_nag, 2))
    return entropy

def Entropy_Gain(melons, charac):
    charac_entropy = 0
    entropy_gain = 0
    melons_num = len(melons)

    if charac >= 6:     # features 6-21 are the 16 candidate split points of the continuous density attribute
        density_entropy = list()
        density0 = list()
        density1 = list()
        class0_small_num = 0 
        class0_big_num = 0
        class1_small_num = 0
        class1_big_num = 0

        for i in range(melons_num):
            if melons[i][7] == 1:
                if melons[i][6] > divide_point[charac - 6]:
                    class1_big_num = class1_big_num + 1
                else:
                    class1_small_num = class1_small_num + 1
            else:
                if melons[i][6] > divide_point[charac - 6]:
                    class0_big_num = class0_big_num + 1
                else:
                    class0_small_num = class0_small_num + 1

        if class0_small_num == 0 and class1_small_num == 0:
            p0_small = 0
            p1_small = 0
        else:
            p0_small = class0_small_num / (class0_small_num + class1_small_num)
            p1_small = class1_small_num / (class0_small_num + class1_small_num)
        if class0_big_num == 0 and class1_big_num == 0:
            p0_big = 0
            p1_big = 0
        else:
            p0_big = class0_big_num / (class0_big_num + class1_big_num)
            p1_big = class1_big_num / (class0_big_num + class1_big_num)

        if p0_small != 0 and p1_small != 0:
            entropy_small = -(class0_small_num + class1_small_num) / melons_num * (
                -(p0_small * math.log(p0_small, 2)
                    + p1_small * math.log(p1_small, 2)))
        elif p0_small == 0 and p1_small != 0:
            entropy_small = -(class0_small_num + class1_small_num) / melons_num * (
                -p1_small * math.log(p1_small, 2))
        elif p0_small != 0 and p1_small == 0:
            entropy_small = -(class0_small_num + class1_small_num) / melons_num * (
                -p0_small * math.log(p0_small, 2))
        else:
            entropy_small = 0

        if p0_big != 0 and p1_big != 0:
            entropy_big = -(class0_big_num + class1_big_num) / melons_num * (
                -(p0_big * math.log(p0_big, 2) + p1_big *
                    math.log(p1_big, 2)))
        elif p0_big == 0 and p1_big != 0:
            entropy_big = -(class0_big_num + class1_big_num) / melons_num * (
                -p1_big * math.log(p1_big, 2))
        elif p0_big != 0 and p1_big == 0:
            entropy_big = -(class0_big_num + class1_big_num) / melons_num * (
                -p0_big * math.log(p0_big, 2))
        else:
            entropy_big = 0
        entropy_gain = Entropy(melons) + entropy_small + entropy_big

    elif charac == 5:   # feature 5 (touch) takes only two values
        class0_melons = []
        class1_melons = []
        class_melons = [[], []]
        for i in range(melons_num):
            if melons[i][5] == 0:
                class0_melons.append(melons[i][7])
            else:
                class1_melons.append(melons[i][7])
        class_melons[0] = class0_melons
        class_melons[1] = class1_melons


        for i in range(2):
            class0_num = 0
            class1_num = 0
            total_num = len(class_melons[i])
            for j in range(total_num):
                if class_melons[i][j] == 0:
                    class0_num = class0_num + 1
                else:
                    class1_num = class1_num + 1
            p_class0 = class0_num / total_num
            p_class1 = class1_num / total_num
            if p_class0 != 0 and p_class1 != 0:         # guard against log(0)
                entropy_class = -p_class0 * math.log(p_class0, 2) - p_class1 * math.log(p_class1, 2)
            elif p_class0 == 0 and p_class1 != 0:
                entropy_class = - p_class1 * math.log(p_class1, 2)
            else:
                entropy_class = -p_class0 * math.log(p_class0, 2)
            charac_entropy = charac_entropy - total_num / melons_num * entropy_class
            entropy_gain = Entropy(melons) + charac_entropy

    else:
        class0_melons = []
        class1_melons = []
        class2_melons = []
        class_melons = [[], [], []]
        for i in range(melons_num):
            if melons[i][charac] == 0:
                class0_melons.append(melons[i][7])
            elif melons[i][charac] == 1:
                class1_melons.append(melons[i][7])
            else:
                class2_melons.append(melons[i][7])
        class_melons[0] = class0_melons
        class_melons[1] = class1_melons
        class_melons[2] = class2_melons

        for i in range(3):
            class0_num = 0
            class1_num = 0
            total_num = len(class_melons[i])

            if total_num != 0:
                for j in range(total_num):
                    if class_melons[i][j] == 0:
                        class0_num = class0_num + 1
                    else:
                        class1_num = class1_num + 1
                p_class0 = class0_num / total_num
                p_class1 = class1_num / total_num
                if p_class0 != 0 and p_class1 != 0:             # guard against log(0)
                    entropy_class = -p_class0 * math.log(p_class0, 2) - p_class1 * math.log(p_class1, 2)
                elif p_class0 == 0 and p_class1 != 0:
                    entropy_class = - p_class1 * math.log(p_class1, 2)
                else:
                    entropy_class = -p_class0 * math.log(p_class0, 2)
                charac_entropy = charac_entropy - total_num / melons_num * entropy_class
                entropy_gain = Entropy(melons) + charac_entropy
            else:
                entropy_gain = 0
    return [entropy_gain, charac]


def select_best_feature(melons, features):
    # return [information gain, feature index] for the candidate feature with the largest gain
    max_entropy = Entropy_Gain(melons, features[0])
    for i in range(len(features)):
        entropy = Entropy_Gain(melons, features[i])
        if entropy[0] > max_entropy[0]:
            max_entropy = entropy
    return max_entropy
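
Assuming this file is saved as Divide_Select.py (the module name imported by the prediction script below), the root split can be inspected directly; the candidate set list(range(22)) covers the six discrete attributes plus the 16 density split points:

```python
# best split at the root of the tree: [information gain, feature index]
gain, feature = select_best_feature(data, list(range(22)))
print(gain, feature)
```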

4.2.2.2 Prediction

Calling tree_generate with the training data data and the feature set A generates a decision tree built on the candidate split points. find_most and select_best_feature are used, respectively, to find the most frequent class label and to choose the best feature for each split.

from Divide_Select import *
import numpy as np

# training set data, feature set A
# discrete attributes: 0 color, 1 root, 2 knock sound, 3 texture, 4 navel, 5 touch
# for density, each candidate split point counts as one feature: 16 split points, indices 6 to 21
A = list(range(22))

def find_most(x):
    # return the most frequent class label in x
    return sorted([(np.sum(x == i), i) for i in np.unique(x)])[-1][-1]

def tree_generate(melons, features):
    melons_y = [i[7] for i in melons]
    if len(np.unique(melons_y)) == 1:
        return melons_y[0]
    same_flag = 1
    for i in range(6):        # do all samples agree on every discrete attribute?
        if len(np.unique([j[i] for j in melons])) > 1:
            same_flag = 0
    if not features or same_flag == 1:
        return find_most(melons_y)
        return find_most(melons_y)

    [max_entropy, best_feature] = select_best_feature(melons, features)
    node = {best_feature: {}}
    division = list()
    to_divide = list()

    if best_feature < 6:
        division = [i[best_feature] for i in data]         # all possible values of the discrete attribute
        to_divide = [i[best_feature] for i in melons]      # the attribute value of each current sample
    else:                                                  # density: binarize by the chosen split point
        for j in [i[6] for i in melons]:
            if j > divide_point[best_feature - 6]:
                to_divide.append(1)
            else:
                to_divide.append(0)
        division = [0, 1]

    for i in np.unique(division):
        loc = list(np.where(to_divide == i))
        if len(loc[0]) == 0:    # no current sample takes this value: use the majority class
            node[best_feature][i] = find_most(melons_y)
        else:
            new_melons = []
            for k in range(len(loc[0])):
                new_melons.append(melons[loc[0][k]])
            # pass a copy of the remaining features so that sibling branches are not affected
            sub_features = [f for f in features if f != best_feature]
            node[best_feature][i] = tree_generate(new_melons, sub_features)
    return node
print(tree_generate(data, A))
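
The script above only builds and prints the tree. Below is a minimal sketch of how the nested-dict tree could be used to label a new sample, assuming the same feature encoding (discrete attributes 0-5, density in column 6); the classify helper is illustrative and not part of the original scripts:

```python
def classify(tree, sample):
    # walk the nested dict until a leaf label (0 or 1) is reached
    if not isinstance(tree, dict):
        return tree
    feature = list(tree.keys())[0]
    if feature < 6:
        branch = sample[feature]                                    # discrete attribute: branch on its value
    else:
        branch = 1 if sample[6] > divide_point[feature - 6] else 0  # density: above/below the split point
    return classify(tree[feature].get(branch, 0), sample)           # fall back to 0 if a branch is missing

my_tree = tree_generate(data, A)
print(classify(my_tree, data[0]))   # a training sample should normally get back its own label
```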

4.2.2.3 Plotting

Calling createPlot with the decision tree myTree draws the tree. This tree classifies watermelons as good or bad according to the given features and split points.

import matplotlib.pyplot as plt
from pylab import *

decisionNode = dict(boxstyle="square", pad=0.5,fc="0.8")
leafNode = dict(boxstyle="circle", fc="0.8")
arrow_args = dict(arrowstyle="<-")
mpl.rcParams['font.sans-serif'] = ['SimHei']

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]) is dict:
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]) is dict:
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        maxDepth = max(maxDepth, thisDepth)
    return maxDepth
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt,  xycoords='axes fraction', xytext=centerPt, textcoords='axes fraction', va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)
def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    cntrPt = (plotTree.xOff + (1 + numLeafs) / 2 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    firstStr = list(myTree.keys())[0]
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1 / plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]) is dict:
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1 / plotTree.totalD
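
The helpers above reference createPlot.ax1, which is created by the createPlot driver mentioned in the text but missing from the listing. A minimal version in the same style (a sketch following the usual Matplotlib-annotation layout these helpers expect) is:

```python
def createPlot(inTree):
    # blank, borderless axes; plotNode draws onto createPlot.ax1
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))   # number of leaves sets the horizontal scale
    plotTree.totalD = float(getTreeDepth(inTree))  # tree depth sets the vertical scale
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

# myTree = tree_generate(data, A)   # the dict produced by the prediction script
# createPlot(myTree)
```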

4.2.3 Results

[Figure: the decision tree generated for the watermelon data set]

4.3 Summary

4.3.1 Advantages

Fast
High accuracy
Handles both continuous and categorical fields
Requires no domain knowledge or parameter assumptions
Suitable for high-dimensional data

4.3.2 Disadvantages

Prone to overfitting
Ignores correlations between features
Sensitive to class imbalance between categories
Feature selection is biased toward features with many distinct values

4.3.3 Further Thoughts

4.3.3.1 An Ensemble Method: Random Forest

Ensemble learning originates from the equivalence between strong and weak learnability: combining several classifiers yields a strong model with better generalization. Common ways of combining them are averaging and voting.

Common algorithms:
Bagging: train several classifiers independently and combine them by voting
Boosting: repeatedly build new models that correct the mistakes of the earlier ones
Stacking: a layered framework in which one group of learners feeds its outputs to the next group before the final prediction is made

Random forest combines decision trees with the random subspace method and the Bagging idea: when each tree is grown, both the samples and the candidate features are selected at random.

The basic steps are as follows (a scikit-learn illustration follows the list):

Randomly sample the training set with replacement
Randomly select a subset of features
Build multiple decision trees
Combine the trees' votes to give the forest's prediction
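
These steps map directly onto scikit-learn's RandomForestClassifier. As an illustration, the watermelon data from section 4.2 can be fed to it; the hyperparameter values below are arbitrary, and `data` is assumed to be the list defined in section 4.2.2.1:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 7 feature columns, label in column 7
X = np.array(data)[:, :7]
y = np.array(data)[:, 7]

forest = RandomForestClassifier(
    n_estimators=100,      # many trees, each grown on a bootstrap sample (bagging)
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:3]))   # the prediction is the majority vote of the trees
```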

Advantages:

Handles both classification and regression with excellent generalization
Works well on high-dimensional data sets
Tolerates missing data and needs no normalization
Provides ways to balance the error when the classes are imbalanced
Fast to train, highly parallel, easy to implement in a distributed setting

Disadvantages:

For regression it cannot produce truly continuous output and may overfit
Its internal behaviour is hard to control; one can only try different parameters and random seeds
It ignores correlations between features