Machine Learning Chapter 4: Decision Trees
4. Decision Trees
- A decision tree makes decisions through a tree structure
- Each "internal node" corresponds to a "test" on some attribute
- Each branch corresponds to one possible outcome of that test (i.e., one value of the attribute)
- Each "leaf node" corresponds to a "prediction"
- Learning: analyze the training samples to determine the "splitting attribute" at each internal node
- Prediction: starting from the root, send the test instance down the sequence of attribute tests until a leaf node is reached
4.1 Basic Flow
Strategy: "divide and conquer"
A recursive process from the root down to the leaves
At each internal node, find a "splitting" attribute
- Three stopping conditions:
    - (1) All samples at the current node belong to the same class; no split is needed
    - (2) The attribute set is empty, or all samples take the same values on all remaining attributes; no split is possible
    - (3) The sample set at the current node is empty; no split can be made
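The divide-and-conquer recursion with these three stopping conditions can be sketched as follows. This is a minimal ID3-style sketch, not the textbook's pseudocode; the dict-based sample format and helper names are illustrative assumptions:

```python
from collections import Counter
import math

def entropy(labels):
    """Information entropy Ent(D) = -sum_k p_k log2 p_k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(samples, labels, attrs):
    """samples: list of dicts attr -> value; labels: class labels; attrs: candidate attributes."""
    # (1) all samples belong to one class: return a leaf with that class
    if len(set(labels)) == 1:
        return labels[0]
    # (2) no attributes left, or all samples identical on the remaining attributes:
    #     return a leaf labelled with the majority class
    if not attrs or all(all(s[a] == samples[0][a] for a in attrs) for s in samples):
        return Counter(labels).most_common(1)[0][0]
    # choose the attribute with the largest information gain
    def gain(a):
        total = entropy(labels)
        for v in set(s[a] for s in samples):
            sub = [l for s, l in zip(samples, labels) if s[a] == v]
            total -= len(sub) / len(labels) * entropy(sub)
        return total
    best = max(attrs, key=gain)
    tree = {best: {}}
    for v in set(s[best] for s in samples):
        # (3) an empty branch (a value with no samples) would become a
        #     majority-class leaf in a full implementation; iterating only
        #     observed values sidesteps it here
        sub_s = [s for s in samples if s[best] == v]
        sub_l = [l for s, l in zip(samples, labels) if s[best] == v]
        tree[best][v] = build_tree(sub_s, sub_l, [a for a in attrs if a != best])
    return tree
```

The tree comes back as nested dicts keyed by attribute and value, with class labels at the leaves.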
4.2 Split Selection
4.2.1 Information Gain
"Information entropy" is the most commonly used measure of the "purity" of a sample set. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$; then the information entropy of $D$ is defined as

$$Ent(D)=-\sum_{k=1}^{|y|}p_k\log_2 p_k$$

with the convention that if $p=0$, then $p\log_2 p=0$.

The smaller $Ent(D)$, the higher the purity of $D$; its minimum value is $0$ and its maximum is $\log_2|y|$.
Information gain builds directly on information entropy: it measures the change in entropy caused by the current split.

A discrete attribute $a$ takes values $\{a^1, a^2, \ldots, a^V\}$.

$D^v$: the subset of samples in $D$ whose value on $a$ is $a^v$.
The information gain obtained by splitting data set $D$ on attribute $a$ is

$$Gain(D,a)=Ent(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$$
The table below contains 17 training examples, with $|y|=2$ (good melon vs. not a good melon). Positive examples account for $p_1=\frac{8}{17}$, negative examples for $p_2=\frac{9}{17}$.
The information entropy of the root node is

$$Ent(D)=-\sum_{k=1}^{2}p_k\log_2 p_k=-\left(\frac{8}{17}\log_2\frac{8}{17}+\frac{9}{17}\log_2\frac{9}{17}\right)=0.998$$
Taking the attribute "color" (色泽) as an example, its three corresponding subsets are

- $D^1$ (color = green 青绿): positive $\frac{3}{6}$, negative $\frac{3}{6}$
- $D^2$ (color = dark 乌黑): positive $\frac{4}{6}$, negative $\frac{2}{6}$
- $D^3$ (color = pale 浅白): positive $\frac{1}{5}$, negative $\frac{4}{5}$
$$Ent(D^1)=-\left(\frac{3}{6}\log_2\frac{3}{6}+\frac{3}{6}\log_2\frac{3}{6}\right)=1$$
$$Ent(D^2)=-\left(\frac{4}{6}\log_2\frac{4}{6}+\frac{2}{6}\log_2\frac{2}{6}\right)=0.918$$
$$Ent(D^3)=-\left(\frac{1}{5}\log_2\frac{1}{5}+\frac{4}{5}\log_2\frac{4}{5}\right)=0.722$$
Thus the information gain of attribute "color" is

$$Gain(D,\text{color})=Ent(D)-\sum_{v=1}^3\frac{|D^v|}{|D|}Ent(D^v)=0.998-\left(\frac{6}{17}\times 1+\frac{6}{17}\times 0.918+\frac{5}{17}\times 0.722\right)=0.109$$
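The calculation above can be checked numerically. A small sketch; the class counts per subset are taken from the worked example (note that exact arithmetic gives 0.108, and the 0.109 in the text comes from rounding the intermediate entropies to three decimals):

```python
import math

def entropy(pos, neg):
    """Ent(D) = -sum p_k log2 p_k, with 0*log2(0) treated as 0."""
    total = pos + neg
    ent = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            ent -= p * math.log2(p)
    return ent

# Root: 8 positive, 9 negative out of 17
ent_D = entropy(8, 9)                      # ≈ 0.998

# Subsets for "color": (pos, neg) per value green / dark / pale
subsets = [(3, 3), (4, 2), (1, 4)]
gain = ent_D - sum((p + n) / 17 * entropy(p, n) for p, n in subsets)
# gain ≈ 0.108 exactly; rounding the subset entropies first gives the text's 0.109
```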
Similarly, the information gains of the other attributes, root (根蒂), knock sound (敲声), texture (纹理), navel (脐部), and touch (触感), are

$$Gain(D,\text{root})=0.143\quad Gain(D,\text{knock})=0.141$$
$$Gain(D,\text{texture})=0.381\quad Gain(D,\text{navel})=0.289$$
$$Gain(D,\text{touch})=0.006$$
The attribute "texture" has the largest information gain and is selected as the splitting attribute.
Each branch node is split further in the same way, eventually yielding the decision tree.
Information gain is biased toward attributes with many possible values.
4.2.2 Gain Ratio

$$Gain\_ratio(D,a)=\frac{Gain(D,a)}{IV(a)},\qquad IV(a)=-\sum_{v=1}^V\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$

The more possible values attribute $a$ has (i.e., the larger $V$), the larger $IV(a)$ tends to be.
Heuristic: from the candidate splitting attributes, first pick those whose information gain is above average, then among them choose the one with the highest gain ratio.
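The intrinsic value $IV(a)$ is just the entropy of the split proportions, so it can be sketched directly from the subset sizes (the sizes 6/6/5 below are the "color" subsets from the example; the function names are illustrative):

```python
import math

def iv(sizes):
    """Intrinsic value IV(a) of a split into subsets of the given sizes."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)

def gain_ratio(gain, sizes):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a)."""
    return gain / iv(sizes)

# "color" splits the 17 samples into subsets of sizes 6, 6 and 5
iv_color = iv([6, 6, 5])   # ≈ 1.580
```

A many-valued attribute produces many small subsets, so `iv` grows and the gain ratio shrinks, which is exactly the penalty the heuristic exploits.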
4.2.3 Gini Index
The Gini index reflects the probability that two samples drawn at random from $D$ have different class labels.
$$Gini(D)=\sum_{k=1}^{|y|}\sum_{k'\neq k}p_k p_{k'}=1-\sum_{k=1}^{|y|}p_k^2$$

The smaller $Gini(D)$, the higher the purity of the data set $D$.
The Gini index of attribute $a$ is

$$Gini\_index(D,a)=\sum_{v=1}^V\frac{|D^v|}{|D|}Gini(D^v)$$

Among the candidate attributes, select the one whose split yields the smallest Gini index.
4.3 Pruning
Research shows that although the choice of split criterion strongly affects the size of the tree, its effect on generalization performance is limited.
The pruning method and its extent have a far greater impact on generalization performance.
Pruning is the decision tree's main weapon against overfitting.
To classify the training samples as correctly as possible, the tree may grow too many branches → overfitting.
The risk of overfitting can be reduced by actively removing some branches.
- Basic strategies:
    - Pre-pruning: stop growing certain branches early
    - Post-pruning: grow a complete tree first, then prune it "backwards"
- Pruning requires evaluating the quality of the tree before and after each prune
4.4 Continuous and Missing Values
4.4.1 Handling Continuous Values
- A continuous attribute no longer has a finite number of possible values, so bi-partition is used to handle it.
- Suppose attribute $a$ (here, "density" 密度 from the running example) takes $n$ distinct values on sample set $D$; sort them in ascending order:
$$\{0.243, 0.245, 0.343, 0.360, 0.403, 0.437, 0.481, 0.556, 0.593, 0.608, 0.634, 0.639, 0.657, 0.666, 0.697, 0.719, 0.774\}$$
- Build the set of candidate split points
$$T_a=\left\{\frac{a^i+a^{i+1}}{2}\,\middle|\,1\leq i\leq n-1\right\}=\{0.244, 0.294, 0.351, 0.381, 0.420, 0.459, 0.518, 0.574, 0.600, 0.621, 0.636, 0.648, 0.661, 0.681, 0.708, 0.746\}$$
- Compute the information gain:
$$Ent(D)=-\sum_{k=1}^{2}p_k\log_2 p_k=-\left(\frac{8}{17}\log_2\frac{8}{17}+\frac{9}{17}\log_2\frac{9}{17}\right)=0.998$$
- Take each candidate split point $t\in T_a$ in turn, compute $D_t^-$ and $D_t^+$, and
$$Gain(D,a)=\max_{t\in T_a}Gain(D,a,t)=\max_{t\in T_a}\ Ent(D)-\sum_{\lambda\in\{-,+\}}\frac{|D_t^\lambda|}{|D|}Ent(D_t^\lambda)$$
- $t=0.244$: $D_t^-=\{0.243\}$, $D_t^+$ contains the remaining 16 values (8 positive, 8 negative).
$$Ent(D_t^-)=-(0\times\log_2 0+1\times\log_2 1)=0$$
$$Ent(D_t^+)=-\left(\frac{8}{16}\log_2\frac{8}{16}+\frac{8}{16}\log_2\frac{8}{16}\right)=1$$
$$Gain(D,\text{density},0.244)=0.998-\left(\frac{1}{17}\times 0+\frac{16}{17}\times 1\right)=0.057$$
- $t=0.294$: $D_t^-=\{0.243, 0.245\}$, $D_t^+$ contains the remaining 15 values (8 positive, 7 negative).
$$Ent(D_t^-)=0,\qquad Ent(D_t^+)=-\left(\frac{8}{15}\log_2\frac{8}{15}+\frac{7}{15}\log_2\frac{7}{15}\right)=0.997$$
$$Gain(D,\text{density},0.294)=0.998-\left(\frac{2}{17}\times 0+\frac{15}{17}\times 0.997\right)=0.118$$
- $t=0.351$: $D_t^-=\{0.243, 0.245, 0.343\}$, $D_t^+$ contains the remaining 14 values (8 positive, 6 negative).
$$Ent(D_t^-)=0,\qquad Ent(D_t^+)=-\left(\frac{8}{14}\log_2\frac{8}{14}+\frac{6}{14}\log_2\frac{6}{14}\right)=0.985$$
$$Gain(D,\text{density},0.351)=0.998-\left(\frac{3}{17}\times 0+\frac{14}{17}\times 0.985\right)=0.187$$
- $t=0.381$: $D_t^-=\{0.243, 0.245, 0.343, 0.360\}$, $D_t^+$ contains the remaining 13 values (8 positive, 5 negative).
$$Ent(D_t^-)=0,\qquad Ent(D_t^+)=-\left(\frac{8}{13}\log_2\frac{8}{13}+\frac{5}{13}\log_2\frac{5}{13}\right)=0.961$$
$$Gain(D,\text{density},0.381)=0.998-\left(\frac{4}{17}\times 0+\frac{13}{17}\times 0.961\right)=0.262$$
- $t=0.420$: $D_t^-=\{0.243, 0.245, 0.343, 0.360, 0.403\}$ (1 positive, 4 negative), $D_t^+$ contains the remaining 12 values (7 positive, 5 negative).
$$Ent(D_t^-)=-\left(\frac{1}{5}\log_2\frac{1}{5}+\frac{4}{5}\log_2\frac{4}{5}\right)=0.722$$
$$Ent(D_t^+)=-\left(\frac{7}{12}\log_2\frac{7}{12}+\frac{5}{12}\log_2\frac{5}{12}\right)=0.980$$
$$Gain(D,\text{density},0.420)=0.998-\left(\frac{5}{17}\times 0.722+\frac{12}{17}\times 0.980\right)=0.094$$
- Computing the remaining values of $t$ in the same way and comparing, the information gain is maximized at $t=0.381$, where it equals $0.262$.
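This search over all candidate split points can be reproduced in code. A sketch; the pairing of each density value with a good/bad label is taken from the running watermelon data set and is an assumption, since this section lists only the sorted values:

```python
import math

def entropy(labels):
    """Ent = -sum p log2 p over class frequencies."""
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

# Density values with labels (+ good / - bad), assumed from the watermelon data
data = sorted([
    (0.243, '-'), (0.245, '-'), (0.343, '-'), (0.360, '-'), (0.403, '+'),
    (0.437, '+'), (0.481, '+'), (0.556, '+'), (0.593, '-'), (0.608, '+'),
    (0.634, '+'), (0.639, '-'), (0.657, '-'), (0.666, '-'), (0.697, '+'),
    (0.719, '-'), (0.774, '+'),
])
values = [v for v, _ in data]
labels = [l for _, l in data]

best_t, best_gain = None, -1.0
for i in range(len(values) - 1):
    t = (values[i] + values[i + 1]) / 2      # candidate split point (midpoint)
    left, right = labels[:i + 1], labels[i + 1:]
    gain = entropy(labels) - (len(left) / 17 * entropy(left)
                              + len(right) / 17 * entropy(right))
    if gain > best_gain:
        best_t, best_gain = t, gain
# best_t = 0.3815 (the text rounds it to 0.381), best_gain ≈ 0.262
```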
- For "sugar content" (含糖率), an analogous computation gives the maximum information gain $0.349$ at $t=0.126$.
- The information gains of the other attributes are as before:
$$Gain(D,\text{color})=0.109\quad Gain(D,\text{root})=0.143\quad Gain(D,\text{knock})=0.141$$
$$Gain(D,\text{texture})=0.381\quad Gain(D,\text{navel})=0.289\quad Gain(D,\text{touch})=0.006$$
- "Texture" has the largest information gain and is chosen as the root node.
4.4.2 Handling Missing Values
In real applications, "missing" attribute values are common.
Use only the samples without missing values? That would be an enormous waste of data.
- To use samples with missing values, two questions must be answered:
    - How to select the splitting attribute?
    - Given a splitting attribute, how to split a sample whose value on it is missing?
Basic idea: assign each sample a weight, and split the weights. Judge the quality of a splitting attribute using only the samples without missing values on it.
- Taking the table as an example again: at the start of learning, the root node contains all 17 samples in $D$, each with weight 1.
- For the attribute "color", the subset of samples without missing values on it contains 14 samples, whose information entropy is
$$Ent(\tilde D)=-\sum_{k=1}^2\tilde p_k\log_2\tilde p_k=-\left(\frac{6}{14}\log_2\frac{6}{14}+\frac{8}{14}\log_2\frac{8}{14}\right)=0.985$$
Let $\tilde D^1,\tilde D^2,\tilde D^3$ denote the subsets taking the values "green", "dark", and "pale" on "color". Then
$$Ent(\tilde D^1)=-\left(\frac{2}{4}\log_2\frac{2}{4}+\frac{2}{4}\log_2\frac{2}{4}\right)=1$$
$$Ent(\tilde D^2)=-\left(\frac{4}{6}\log_2\frac{4}{6}+\frac{2}{6}\log_2\frac{2}{6}\right)=0.918$$
$$Ent(\tilde D^3)=-\left(\frac{0}{4}\log_2\frac{0}{4}+\frac{4}{4}\log_2\frac{4}{4}\right)=0$$
Therefore, the information gain of "color" on the subset $\tilde D$ is
$$Gain(\tilde D,\text{color})=Ent(\tilde D)-\sum_{v=1}^3\tilde r_v\,Ent(\tilde D^v)=0.985-\left(\frac{4}{14}\times 1+\frac{6}{14}\times 0.918+\frac{4}{14}\times 0\right)=0.306$$
($\tilde r_v$: among the samples without missing values, the fraction taking value $v$ on attribute $a$.)
Thus, on the full sample set $D$, the information gain of "color" is
$$Gain(D,\text{color})=\rho\times Gain(\tilde D,\text{color})=\frac{14}{17}\times 0.306=0.252$$
($\rho$: the fraction of samples without missing values.)
Similarly, the information gains of all attributes on the data set can be computed:
$$Gain(D,\text{root})=0.171\quad Gain(D,\text{knock})=0.145$$
$$Gain(D,\text{texture})=0.424\quad Gain(D,\text{navel})=0.289\quad Gain(D,\text{touch})=0.006$$
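The two-step computation, gain on the complete-value subset, then scaling by $\rho$, can be sketched as follows; the per-value class counts for "color" are taken from the worked example:

```python
import math

def entropy(counts):
    """Ent = -sum p log2 p over class counts (0 log 0 treated as 0)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# "color": 14 of the 17 samples have a value; (pos, neg) per value green / dark / pale
subsets = [(2, 2), (4, 2), (0, 4)]
n_known, n_total = 14, 17

ent_known = entropy([6, 8])               # Ent(D~) ≈ 0.985
gain_known = ent_known - sum((p + n) / n_known * entropy([p, n])
                             for p, n in subsets)
rho = n_known / n_total                   # fraction of samples without missing values
gain = rho * gain_known                   # Gain(D, color) ≈ 0.252
```

In a weighted implementation, the raw counts above would be replaced by sums of sample weights, so the same code covers later nodes where weights are fractional.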
"Texture" achieves the largest information gain among all attributes. Among the samples with known texture, "clear" accounts for $\frac{7}{15}$, "slightly blurry" for $\frac{5}{15}$, and "blurry" for $\frac{3}{15}$.
- Weight splitting:
    - Samples 8 and 10 have missing values on "texture"; each is sent into all three branches simultaneously, with its weight multiplied by the proportions above.
4.5 Multivariate Decision Trees
The classification boundary formed by a decision tree has a distinctive property: it is axis-parallel, i.e., it consists of segments parallel to the coordinate axes.
Such a boundary can become quite complicated, broken into many segments. If oblique split boundaries could be used, the tree model would be greatly simplified.
A **"multivariate decision tree"** is a decision tree that realizes such "oblique splits", or even more complex ones.