Machine Learning, Chapter 4: Decision Trees

4. Decision Trees

  • A decision tree makes decisions via a tree structure
    • Each internal node corresponds to a test on some attribute
    • Each branch corresponds to one possible outcome of that test (i.e., one value of the attribute)
    • Each leaf node corresponds to a prediction
  • Learning: analyze the training samples to determine the splitting attributes (the attributes attached to the internal nodes)
  • Prediction: starting at the root, route the test instance down the sequence of attribute tests until a leaf node is reached (a minimal traversal sketch follows this list)
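The structure and the prediction walk can be captured in a few lines. A minimal sketch in Python, assuming a simple node class; the names (`Node`, `predict`, the toy "texture" tree) are illustrative, not from the chapter:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Node:
    attribute: Optional[str] = None               # test attribute (internal node)
    children: Optional[Dict[str, "Node"]] = None  # one child per attribute value
    label: Optional[str] = None                   # prediction (leaf node)

def predict(node: Node, x: Dict[str, str]) -> str:
    """Route instance x from the root down the test sequence to a leaf."""
    while node.label is None:                  # internal node: apply its test
        node = node.children[x[node.attribute]]
    return node.label                          # leaf node: return its prediction

# A one-split toy tree on a hypothetical attribute "texture":
tree = Node(attribute="texture",
            children={"clear": Node(label="good melon"),
                      "blurry": Node(label="not good")})
print(predict(tree, {"texture": "clear"}))  # -> good melon
```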

[Figure: an example decision tree]

4.1 Basic Procedure

Strategy: "divide and conquer"

A recursive process from the root down to the leaves

At each internal node, find a "splitting" attribute

  • Three stopping conditions (see the recursion sketch after this list)
    • (1) All samples at the current node belong to the same class: no split is needed
    • (2) The attribute set is empty, or all samples take the same values on all remaining attributes: no split is possible
    • (3) The sample set at the current node is empty: no split can be made
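A minimal sketch of the divide-and-conquer recursion, reusing the `Node` class above; the placeholder attribute choice is illustrative (Section 4.2 supplies the real selection criterion):

```python
from collections import Counter

def tree_generate(samples, attributes, parent_majority=None):
    """`samples`: list of (features: dict, label) pairs; `attributes`: names
    still available for splitting."""
    # (3) empty sample set: leaf labelled with the parent's majority class
    if not samples:
        return Node(label=parent_majority)
    labels = [y for _, y in samples]
    majority = Counter(labels).most_common(1)[0][0]
    # (1) all samples belong to one class: no split needed
    if len(set(labels)) == 1:
        return Node(label=labels[0])
    # (2) no attributes left, or all samples identical on the remaining ones
    if not attributes or all(x[a] == samples[0][0][a]
                             for x, _ in samples for a in attributes):
        return Node(label=majority)
    a = attributes[0]  # placeholder: Section 4.2 picks this by, e.g., information gain
    children = {}
    for v in {x[a] for x, _ in samples}:
        subset = [(x, y) for x, y in samples if x[a] == v]
        children[v] = tree_generate(subset, [b for b in attributes if b != a], majority)
    return Node(attribute=a, children=children)
```

Because this sketch enumerates only the attribute values actually present at the node, case (3) arises only when the set of possible values is known in advance and some value has no samples; the textbook algorithm enumerates all values of the attribute, where such empty subsets do occur.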

[Figure: pseudocode of the basic decision-tree learning algorithm]

4.2 Split Selection

4.2.1 Information Gain

"Information entropy" is the most commonly used measure of the "purity" of a sample set. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$. The information entropy of $D$ is then defined as

$$\mathrm{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|}p_k\log_2 p_k,$$

with the convention that $p\log_2 p=0$ when $p=0$.

The smaller the value of $\mathrm{Ent}(D)$, the higher the purity of $D$; its minimum is $0$ and its maximum is $\log_2|\mathcal{Y}|$.
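A one-function sketch of this definition in Python (the $0\log_2 0=0$ convention is handled by skipping empty classes):

```python
import math

def entropy(counts):
    """Ent(D) = -sum_k p_k * log2(p_k); `counts` are the per-class sample
    counts of D. Classes with count 0 contribute nothing (0*log2(0) = 0)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(entropy([8, 9]), 3))  # 0.998 -- the root entropy computed below
```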

Information gain builds directly on information entropy: it measures the change in entropy brought about by a candidate split.

A discrete attribute $a$ takes values $\{a^1, a^2, \ldots, a^V\}$.

$D^v$: the subset of samples in $D$ whose value on $a$ equals $a^v$.

The information gain obtained by splitting the data set $D$ on attribute $a$ is

$$\mathrm{Gain}(D,a)=\mathrm{Ent}(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$
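The corresponding helper, building on `entropy` above (class counts are passed in directly, since the data table itself lives in a figure):

```python
def information_gain(total_counts, subset_counts):
    """Gain(D,a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v).
    `total_counts`: class counts of D; `subset_counts`: one class-count
    list per value a^v of the attribute a."""
    n = sum(total_counts)
    weighted = sum(sum(s) / n * entropy(s) for s in subset_counts)
    return entropy(total_counts) - weighted
```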

The table below contains 17 training examples, with $|\mathcal{Y}|=2$ (good melon vs. not a good melon). Positive examples account for $p_1=\frac{8}{17}$ and negative examples for $p_2=\frac{9}{17}$.
[Table: the 17-sample watermelon training set]

The information entropy of the root node is

$$\mathrm{Ent}(D)=-\sum_{k=1}^{2}p_k\log_2 p_k=-\left(\frac{8}{17}\log_2\frac{8}{17}+\frac{9}{17}\log_2\frac{9}{17}\right)=0.998$$

Take the attribute color (色泽) as an example; its three subsets are

$$D^1(\text{color}=\text{green}):\ \tfrac{3}{6}\ \text{positive},\ \tfrac{3}{6}\ \text{negative}$$
$$D^2(\text{color}=\text{dark}):\ \tfrac{4}{6}\ \text{positive},\ \tfrac{2}{6}\ \text{negative}$$
$$D^3(\text{color}=\text{light}):\ \tfrac{1}{5}\ \text{positive},\ \tfrac{4}{5}\ \text{negative}$$

$$\mathrm{Ent}(D^1)=-\left(\tfrac{3}{6}\log_2\tfrac{3}{6}+\tfrac{3}{6}\log_2\tfrac{3}{6}\right)=1$$
$$\mathrm{Ent}(D^2)=-\left(\tfrac{4}{6}\log_2\tfrac{4}{6}+\tfrac{2}{6}\log_2\tfrac{2}{6}\right)=0.918$$
$$\mathrm{Ent}(D^3)=-\left(\tfrac{1}{5}\log_2\tfrac{1}{5}+\tfrac{4}{5}\log_2\tfrac{4}{5}\right)=0.722$$

Thus the information gain of the attribute color is

$$\mathrm{Gain}(D,\text{color})=\mathrm{Ent}(D)-\sum_{v=1}^{3}\frac{|D^v|}{|D|}\mathrm{Ent}(D^v)=0.998-\left(\tfrac{6}{17}\times 1+\tfrac{6}{17}\times 0.918+\tfrac{5}{17}\times 0.722\right)=0.109$$

Similarly, the information gains of the other attributes are

$$\mathrm{Gain}(D,\text{root})=0.143\quad \mathrm{Gain}(D,\text{knocks})=0.141$$
$$\mathrm{Gain}(D,\text{texture})=0.381\quad \mathrm{Gain}(D,\text{navel})=0.289$$
$$\mathrm{Gain}(D,\text{touch})=0.006$$

The attribute texture (纹理) has the largest information gain and is chosen as the splitting attribute.
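Plugging the class counts read off above into the helpers reproduces the worked numbers (a quick check, not the chapter's code; the last digit differs slightly because the text rounds $\mathrm{Ent}(D)$ to 0.998 before subtracting):

```python
# D: 8 positive, 9 negative; color subsets: green (3+,3-), dark (4+,2-), light (1+,4-)
print(round(information_gain([8, 9], [[3, 3], [4, 2], [1, 4]]), 3))  # 0.108 ~ 0.109
```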

[Figure: the tree after splitting the root on texture]

Splitting each branch node further in the same way eventually yields the decision tree.

[Figure: the final decision tree]

Information gain is biased toward attributes with many possible values.

4.2.2 Gain Ratio

  • Gain ratio
    $$\mathrm{Gain\_ratio}(D,a)=\frac{\mathrm{Gain}(D,a)}{\mathrm{IV}(a)},\qquad \mathrm{IV}(a)=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$
    The more values attribute $a$ can take (i.e., the larger $V$ is), the larger $\mathrm{IV}(a)$ tends to be.

  • Heuristic: first select, from the candidate splitting attributes, those whose information gain is above average; then, among these, pick the one with the highest gain ratio (sketched below)
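A sketch of the gain ratio and this two-stage heuristic, reusing the helpers above (the `candidates` layout is illustrative):

```python
def gain_ratio(total_counts, subset_counts):
    """Gain_ratio(D,a) = Gain(D,a) / IV(a)."""
    n = sum(total_counts)
    iv = -sum(sum(s) / n * math.log2(sum(s) / n) for s in subset_counts)
    return information_gain(total_counts, subset_counts) / iv

def pick_attribute(candidates):
    """Two-stage heuristic: keep attributes with above-average information
    gain, then take the highest gain ratio among them.
    `candidates` maps attribute name -> (total_counts, subset_counts)."""
    gains = {a: information_gain(t, s) for a, (t, s) in candidates.items()}
    avg = sum(gains.values()) / len(gains)
    shortlist = [a for a, g in gains.items() if g >= avg]
    return max(shortlist, key=lambda a: gain_ratio(*candidates[a]))
```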

4.2.3 Gini Index

The Gini value reflects the probability that two samples drawn at random from $D$ carry different class labels:

$$\mathrm{Gini}(D)=\sum_{k=1}^{|\mathcal{Y}|}\sum_{k'\neq k}p_k p_{k'}=1-\sum_{k=1}^{|\mathcal{Y}|}p_k^2$$

The smaller $\mathrm{Gini}(D)$ is, the higher the purity of the data set $D$.

The Gini index of attribute $a$:

$$\mathrm{Gini\_index}(D,a)=\sum_{v=1}^{V}\frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$$

Among the candidate attributes, choose the one whose split yields the smallest Gini index.
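The same two definitions as helpers (class counts again stand in for the data table):

```python
def gini(counts):
    """Gini(D) = 1 - sum_k p_k^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_index(subset_counts):
    """Gini_index(D,a) = sum_v |D^v|/|D| * Gini(D^v)."""
    n = sum(sum(s) for s in subset_counts)
    return sum(sum(s) / n * gini(s) for s in subset_counts)

# The color split from the running example: green (3+,3-), dark (4+,2-), light (1+,4-)
print(round(gini_index([[3, 3], [4, 2], [1, 4]]), 3))  # ~0.427
```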

4.3 Pruning

Research shows that although the various split-selection criteria have a considerable effect on the size of the resulting tree, their effect on generalization performance is limited.

The pruning method and the extent of pruning have a far more pronounced effect on generalization performance. Pruning is the decision tree's main defense against overfitting: in trying to classify the training samples as correctly as possible, the tree may grow too many branches and overfit, and proactively removing some branches lowers that risk.

  • Basic strategies:
    • Pre-pruning (pre-pruning): stop growing certain branches early
    • Post-pruning (post-pruning): grow a complete tree first, then prune it back
  • Pruning requires comparing the quality of the tree before and after each prune, typically on a hold-out validation set (a post-pruning sketch follows)
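A minimal post-pruning sketch under the stated assumptions: it reuses `Node`/`predict` and the `Counter` import from the sketches above, routes a hold-out validation set down the tree, and collapses a subtree into a leaf whenever that does not reduce validation accuracy. For brevity the leaf label is the majority class of the validation samples reaching the node; the textbook procedure labels it with the training majority instead.

```python
def accuracy(node, data):
    """Fraction of (features, label) pairs classified correctly by `node`."""
    return sum(predict(node, x) == y for x, y in data) / len(data)

def post_prune(node, val):
    """Bottom-up post-pruning; `val` holds the validation samples that
    reach this node."""
    if node.label is not None or not val:
        return node
    for v, child in node.children.items():
        routed = [(x, y) for x, y in val if x[node.attribute] == v]
        node.children[v] = post_prune(child, routed)
    leaf = Node(label=Counter(y for _, y in val).most_common(1)[0][0])
    # keep the leaf if accuracy on the samples reaching this node does not drop
    return leaf if accuracy(leaf, val) >= accuracy(node, val) else node
```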

4.4 Continuous and Missing Values

4.4.1 Handling Continuous Values

  • A continuous attribute no longer has a finite set of possible values, so it is handled by bi-partition: choosing a threshold that splits the samples in two.
    [Figure: bi-partition of a continuous attribute]

    • Suppose attribute $a$ (here: density, 密度) has $n$ distinct values in the sample set $D$; sort them in ascending order:
      $$\{0.243, 0.245, 0.343, 0.360, 0.403, 0.437, 0.481, 0.556, 0.593, 0.608, 0.634, 0.639, 0.657, 0.666, 0.697, 0.719, 0.774\}$$

    • Build the candidate split-point set from the midpoints of adjacent values:
      $$T_a=\left\{\frac{a^i+a^{i+1}}{2}\ \middle|\ 1\leq i\leq n-1\right\}$$

      $$\{0.244, 0.294, 0.351, 0.381, 0.420, 0.459, 0.518, 0.574, 0.600, 0.621, 0.636, 0.648, 0.661, 0.681, 0.708, 0.746\}$$

    • Compute the information gain; the overall entropy is as before:
      $$\mathrm{Ent}(D)=-\sum_{k=1}^{2}p_k\log_2 p_k=-\left(\frac{8}{17}\log_2\frac{8}{17}+\frac{9}{17}\log_2\frac{9}{17}\right)=0.998$$

    • Take each element of the candidate set in turn as the split point $t$; let $D_t^-$ and $D_t^+$ denote the samples with $a\le t$ and $a>t$ respectively, and compute
      $$\mathrm{Gain}(D,a)=\max_{t\in T_a}\mathrm{Gain}(D,a,t)=\max_{t\in T_a}\ \mathrm{Ent}(D)-\sum_{\lambda\in\{-,+\}}\frac{|D_t^\lambda|}{|D|}\,\mathrm{Ent}(D_t^\lambda)$$

      • $t=0.244$
        $$D_t^-=\{0.243\},\quad D_t^+=\{0.245, 0.343, 0.360, 0.403, 0.437, 0.481, 0.556, 0.593, 0.608, 0.634, 0.639, 0.657, 0.666, 0.697, 0.719, 0.774\}$$
        $$\mathrm{Ent}(D_t^-)=-(0\times\log_2 0+1\times\log_2 1)=0$$
        $$\mathrm{Ent}(D_t^+)=-\left(\tfrac{8}{16}\log_2\tfrac{8}{16}+\tfrac{8}{16}\log_2\tfrac{8}{16}\right)=1$$
        $$\mathrm{Gain}(D,\text{density},0.244)=0.998-\left(\tfrac{1}{17}\times 0+\tfrac{16}{17}\times 1\right)=0.057$$

      • $t=0.294$
        $$D_t^-=\{0.243, 0.245\},\quad D_t^+=\{0.343, 0.360, 0.403, 0.437, 0.481, 0.556, 0.593, 0.608, 0.634, 0.639, 0.657, 0.666, 0.697, 0.719, 0.774\}$$
        $$\mathrm{Ent}(D_t^-)=-\left(0\times\log_2 0+\tfrac{2}{2}\log_2\tfrac{2}{2}\right)=0$$
        $$\mathrm{Ent}(D_t^+)=-\left(\tfrac{8}{15}\log_2\tfrac{8}{15}+\tfrac{7}{15}\log_2\tfrac{7}{15}\right)=0.997$$
        $$\mathrm{Gain}(D,\text{density},0.294)=0.998-\left(\tfrac{2}{17}\times 0+\tfrac{15}{17}\times 0.997\right)=0.118$$

      • $t=0.351$
        $$D_t^-=\{0.243, 0.245, 0.343\},\quad D_t^+=\{0.360, 0.403, 0.437, 0.481, 0.556, 0.593, 0.608, 0.634, 0.639, 0.657, 0.666, 0.697, 0.719, 0.774\}$$
        $$\mathrm{Ent}(D_t^-)=-\left(0\times\log_2 0+\tfrac{3}{3}\log_2\tfrac{3}{3}\right)=0$$
        $$\mathrm{Ent}(D_t^+)=-\left(\tfrac{8}{14}\log_2\tfrac{8}{14}+\tfrac{6}{14}\log_2\tfrac{6}{14}\right)=0.985$$
        $$\mathrm{Gain}(D,\text{density},0.351)=0.998-\left(\tfrac{3}{17}\times 0+\tfrac{14}{17}\times 0.985\right)=0.187$$

      • $t=0.381$
        $$D_t^-=\{0.243, 0.245, 0.343, 0.360\},\quad D_t^+=\{0.403, 0.437, 0.481, 0.556, 0.593, 0.608, 0.634, 0.639, 0.657, 0.666, 0.697, 0.719, 0.774\}$$
        $$\mathrm{Ent}(D_t^-)=-\left(0\times\log_2 0+\tfrac{4}{4}\log_2\tfrac{4}{4}\right)=0$$
        $$\mathrm{Ent}(D_t^+)=-\left(\tfrac{8}{13}\log_2\tfrac{8}{13}+\tfrac{5}{13}\log_2\tfrac{5}{13}\right)=0.961$$
        $$\mathrm{Gain}(D,\text{density},0.381)=0.998-\left(\tfrac{4}{17}\times 0+\tfrac{13}{17}\times 0.961\right)=0.262$$

      • $t=0.420$
        $$D_t^-=\{0.243, 0.245, 0.343, 0.360, 0.403\},\quad D_t^+=\{0.437, 0.481, 0.556, 0.593, 0.608, 0.634, 0.639, 0.657, 0.666, 0.697, 0.719, 0.774\}$$
        $$\mathrm{Ent}(D_t^-)=-\left(\tfrac{1}{5}\log_2\tfrac{1}{5}+\tfrac{4}{5}\log_2\tfrac{4}{5}\right)=0.722$$
        $$\mathrm{Ent}(D_t^+)=-\left(\tfrac{7}{12}\log_2\tfrac{7}{12}+\tfrac{5}{12}\log_2\tfrac{5}{12}\right)=0.980$$
        $$\mathrm{Gain}(D,\text{density},0.420)=0.998-\left(\tfrac{5}{17}\times 0.722+\tfrac{12}{17}\times 0.980\right)=0.094$$

      • Working through the remaining candidate points in the same way shows that the information gain is maximized at $t=0.381$, where it equals $0.262$.

    • The computation for sugar content (含糖率) is analogous: the gain is maximized at $t=0.126$, where it equals $0.349$.

    • The gains of the remaining (discrete) attributes were computed earlier:
      $$\mathrm{Gain}(D,\text{color})=0.109\quad \mathrm{Gain}(D,\text{root})=0.143$$
      $$\mathrm{Gain}(D,\text{knocks})=0.141\quad \mathrm{Gain}(D,\text{texture})=0.381$$
      $$\mathrm{Gain}(D,\text{navel})=0.289\quad \mathrm{Gain}(D,\text{touch})=0.006$$

    • Texture still has the largest information gain and is chosen as the root node (a code sketch of the bi-partition search follows the figure).
      [Figure: the tree built with bi-partitioned density and sugar content]
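A sketch of the bi-partition search, reusing `entropy`/`information_gain` from above. The verification loop feeds in the per-threshold class counts `[positive, negative]` read off the worked example; last digits can differ slightly from the text, which rounds intermediate entropies.

```python
def best_threshold(values, labels):
    """Bi-partition of a continuous attribute: candidates are the midpoints
    of adjacent sorted values; return (best threshold, best gain)."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best_t, best_g = None, -1.0
    for i in range(len(pairs) - 1):
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        below = [sum(1 for v, y in pairs if v <= t and y == c) for c in classes]
        above = [sum(1 for v, y in pairs if v > t and y == c) for c in classes]
        g = information_gain([b + a for b, a in zip(below, above)], [below, above])
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Re-check the worked candidates for density from their class counts:
for t, below, above in [(0.244, [0, 1], [8, 8]), (0.294, [0, 2], [8, 7]),
                        (0.351, [0, 3], [8, 6]), (0.381, [0, 4], [8, 5]),
                        (0.420, [1, 4], [7, 5])]:
    print(t, round(information_gain([8, 9], [below, above]), 3))
# t = 0.381 yields the largest gain, ~0.262, matching the text.
```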

4.4.2 Handling Missing Values

In real applications, missing attribute values are common. Using only the samples with no missing values would waste a great deal of data.

  • To learn from samples with missing values, two questions must be answered:
    • How do we select the splitting attribute?
    • Given a splitting attribute, how do we split a sample whose value on that attribute is missing?

Basic idea: assign each sample a weight and split by weight. The quality of a candidate splitting attribute is judged using only the samples that have no missing value on it.

  • Using the table as an example: at the start of learning, the root node contains all 17 samples of $D$, each with weight 1.

    • Take the attribute color: its subset of samples with no missing values on this attribute, $\tilde D$, contains 14 samples, whose information entropy is
      $$\mathrm{Ent}(\tilde D)=-\sum_{k=1}^{2}\tilde p_k\log_2\tilde p_k=-\left(\frac{6}{14}\log_2\frac{6}{14}+\frac{8}{14}\log_2\frac{8}{14}\right)=0.985$$

      Let $\tilde D^1,\tilde D^2,\tilde D^3$ denote the subsets taking the values green, dark, and light on color. Then
      $$\mathrm{Ent}(\tilde D^1)=-\left(\tfrac{2}{4}\log_2\tfrac{2}{4}+\tfrac{2}{4}\log_2\tfrac{2}{4}\right)=1$$
      $$\mathrm{Ent}(\tilde D^2)=-\left(\tfrac{4}{6}\log_2\tfrac{4}{6}+\tfrac{2}{6}\log_2\tfrac{2}{6}\right)=0.918$$
      $$\mathrm{Ent}(\tilde D^3)=-\left(\tfrac{0}{4}\log_2\tfrac{0}{4}+\tfrac{4}{4}\log_2\tfrac{4}{4}\right)=0$$
      Hence the information gain of color on the subset $\tilde D$ is
      $$\mathrm{Gain}(\tilde D,\text{color})=\mathrm{Ent}(\tilde D)-\sum_{v=1}^{3}\tilde r_v\,\mathrm{Ent}(\tilde D^v)=0.985-\left(\tfrac{4}{14}\times 1+\tfrac{6}{14}\times 0.918+\tfrac{4}{14}\times 0\right)=0.306$$
      where $\tilde r_v$ is the fraction of the non-missing samples taking value $a^v$ on the attribute.
      The information gain of color on the full sample set $D$ is then
      $$\mathrm{Gain}(D,\text{color})=\rho\times\mathrm{Gain}(\tilde D,\text{color})=\frac{14}{17}\times 0.306=0.252$$
      where $\rho$ is the fraction of samples with no missing value.
      The information gains of the other attributes on the data set follow in the same way:
      $$\mathrm{Gain}(D,\text{root})=0.171\quad \mathrm{Gain}(D,\text{knocks})=0.145$$
      $$\mathrm{Gain}(D,\text{texture})=0.424\quad \mathrm{Gain}(D,\text{navel})=0.289$$
      $$\mathrm{Gain}(D,\text{touch})=0.006$$

Texture attains the largest information gain of all attributes. Among the samples whose texture is not missing, clear accounts for $\frac{7}{15}$, slightly blurry for $\frac{5}{15}$, and blurry for $\frac{3}{15}$.

  • Splitting by weight
    • Samples 8 and 10 have missing texture values, so each enters all three branches simultaneously, with its weight scaled by $\frac{7}{15}$, $\frac{5}{15}$, and $\frac{3}{15}$ respectively (a weighted-gain sketch follows the figure)

[Figure: the decision tree learned from the data with missing values]
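A sketch of the weighted gain $\mathrm{Gain}(D,a)=\rho\times\mathrm{Gain}(\tilde D,a)$ under the stated assumptions; rows are `(value_or_None, label, weight)` triples, with `None` marking a missing value, and the example counts come from the color computation above (14 known samples: 6 positive, 8 negative; 3 missing: 2 positive, 1 negative):

```python
def weighted_gain_with_missing(rows):
    """Gain(D,a) = rho * [ Ent(D~) - sum_v r_v * Ent(D~^v) ], where D~ is the
    weight mass with a known value of the attribute and rho is its share."""
    def w_entropy(rs):
        total = sum(w for _, _, w in rs)
        ps = [sum(w for _, y, w in rs if y == c) / total
              for c in {y for _, y, _ in rs}]
        return -sum(p * math.log2(p) for p in ps if p > 0)

    known = [r for r in rows if r[0] is not None]
    rho = sum(w for _, _, w in known) / sum(w for _, _, w in rows)
    total_known = sum(w for _, _, w in known)
    gain = w_entropy(known)
    for v in {val for val, _, _ in known}:
        sub = [r for r in known if r[0] == v]
        r_v = sum(w for _, _, w in sub) / total_known  # \tilde r_v
        gain -= r_v * w_entropy(sub)
    return rho * gain

# Color: green 2+/2-, dark 4+/2-, light 0+/4-, plus 3 missing, all weights 1:
rows = ([("green", "+", 1.0)] * 2 + [("green", "-", 1.0)] * 2 +
        [("dark", "+", 1.0)] * 4 + [("dark", "-", 1.0)] * 2 +
        [("light", "-", 1.0)] * 4 +
        [(None, "+", 1.0)] * 2 + [(None, "-", 1.0)])
print(round(weighted_gain_with_missing(rows), 3))  # ~0.252
```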

4.5 Multivariate Decision Trees

The classification boundary formed by a decision tree has a distinctive property: it is axis-parallel, i.e., composed of segments each parallel to a coordinate axis.

As a result, the boundary can be far from simple, consisting of many segments. If oblique splitting boundaries were allowed, the tree model could be greatly simplified.

A **"multivariate decision tree"** is a decision tree that realizes such "oblique" splits, or even more complex ones: each internal node tests a linear combination of attributes rather than a single attribute (a minimal node-test sketch follows).
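A minimal sketch of an oblique node test on numeric attributes: the node stores a weight vector $w$ and a threshold $t$ and routes a sample by whether $\sum_i w_i x_i \le t$. The coefficients below are illustrative stand-ins, not fitted values:

```python
def oblique_test(w, t, x):
    """Multivariate node test: split on a linear combination of attributes,
    sum_i w[i]*x[i] <= t, instead of on a single attribute."""
    return sum(wi * xi for wi, xi in zip(w, x)) <= t

# E.g. one oblique split over (density, sugar content):
print(oblique_test([-0.8, -0.044], -0.313, [0.697, 0.460]))  # True -> left branch
```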
