Decision Tree Notes
ID3 rule: information gain (entropy-based)
- First compute the information entropy of the root node: H(D) = -\sum_{k=1}^{|Y|}{p_k \log p_k}
- Then compute the conditional entropy after splitting on a given feature: H(D|feature) = \sum_{v=1}^{V}{\frac{|D^v|}{|D|} H(D^v)} = -\sum_{v=1}^{V}{\frac{|D^v|}{|D|} \sum_{k=1}^{|Y|}{p_k \log p_k}}, where each p_k is the class proportion within the subset D^v
- Subtract the two to get the information gain: g(D,feature) = H(D) - H(D|feature) (see the sketch below)
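A minimal Python sketch of these two quantities (the helper names `entropy` and `info_gain` are my own, and the log base is left as a parameter because the worked examples later in these notes use base 10):

```python
import math
from collections import Counter

def entropy(labels, base=2):
    """H(D) = -sum_k p_k * log(p_k) over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(labels).values())

def info_gain(labels, feature_values, base=2):
    """g(D, feature) = H(D) - sum_v |D^v|/|D| * H(D^v)."""
    n = len(labels)
    subsets = {}
    for y, v in zip(labels, feature_values):
        subsets.setdefault(v, []).append(y)
    conditional = sum(len(s) / n * entropy(s, base) for s in subsets.values())
    return entropy(labels, base) - conditional
```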
C4.5 rule: gain ratio (entropy-based)
- First compute the information gain g(D,feature) following the ID3 rule
- Then compute the feature's "intrinsic value" (which acts like a normalization factor):
IV(feature)=-\sum_{v=1}^{V}{\frac{|D^v|}{|D|}\log_2{\frac{|D^v|}{|D|}}}
You can see that this formula looks a lot like entropy. In effect it treats each value of the feature as its own node, assumes every such node is 100% pure, and computes the resulting hypothetical entropy; it is used to cancel the spurious information gain that features with many distinct values would otherwise enjoy.
- The gain ratio is the quotient of the two (a code sketch follows): Gain\_ratio(D,feature)=\frac{g(D,feature)}{IV(feature)}
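Continuing the sketch with two more hypothetical helpers (reusing `math`, `Counter`, and `info_gain` from the ID3 sketch):

```python
def intrinsic_value(feature_values):
    """IV(feature) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)."""
    n = len(feature_values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(feature_values).values())

def gain_ratio(labels, feature_values):
    """Gain_ratio(D, feature) = g(D, feature) / IV(feature)."""
    # Gain is taken in base 2 here to match the base-2 log inside IV.
    return info_gain(labels, feature_values, base=2) / intrinsic_value(feature_values)
```

C4.5 is usually described as not simply maximizing the gain ratio: it first shortlists features whose information gain is above average and only then picks the one with the highest gain ratio.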
Example: the watermelon dataset.
If the sample ID is treated as a feature, its "intrinsic value" is:
IV(\text{ID})=- {\left(\overset{14\text{ values}}{\overbrace{\frac{1}{14}{\log_{2}\frac{1}{14}}~ + ...~}}\right)} = - 14*\left( {\frac{1}{14}{\log_{2}\frac{1}{14}}} \right) = 3.80735
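This figure is easy to verify: an ID feature taking 14 distinct values over 14 samples has intrinsic value exactly \log_2 14. A one-line check in Python:

```python
import math
print(math.log2(14))  # 3.8073549..., matching IV(ID) = 3.80735
```

Such a feature would also have maximal information gain (every "node" is pure), which is exactly the pathology the gain ratio is designed to penalize.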
CART rule: Gini index (based on the Gini value)
Gini value:
Gini(D)=1-\sum_{k=1}^{|Y|}{p_k^2}
Gini index:
Gini\_index(D,feature)=\sum_{v=1}^{V}{\frac{|D^v|}{|D|}Gini(D^v)}
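A matching sketch for the two Gini formulas (hypothetical names again; `Counter` comes from the earlier imports):

```python
def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(labels, feature_values):
    """Gini_index(D, feature) = sum_v |D^v|/|D| * Gini(D^v)."""
    n = len(labels)
    subsets = {}
    for y, v in zip(labels, feature_values):
        subsets.setdefault(v, []).append(y)
    return sum(len(s) / n * gini(s) for s in subsets.values())
```

For CART's binary splits on a multi-valued feature (see the note below), you would pass a boolean indicator such as `[v == 2 for v in platform]` rather than the raw values.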
Note that CART differs from the ID3 and C4.5 rules: a CART tree is binary and features can be reused, so when splitting on a feature with three or more values, several successive splits may be needed before the feature is fully resolved.
Example: a job dataset.
Gini\_index(D,\mathbf{salary})=\frac{3}{8}*(1-(\frac{3}{3})^2-(\frac{0}{3})^2)+\frac{5}{8}*(1-(\frac{3}{5})^2-(\frac{2}{5})^2)=\mathbf{0.3}
Gini\_index(D,\text{pressure})=\frac{3}{8}*(1-(\frac{2}{3})^2-(\frac{1}{3})^2)+\frac{5}{8}*(1-(\frac{1}{5})^2-(\frac{4}{5})^2)=0.37
Gini\_index(D,\mathbf{platform=0})=\frac{3}{8}*(1-(\frac{3}{3})^2-(\frac{0}{3})^2)+\frac{5}{8}*(1-(\frac{3}{5})^2-(\frac{2}{5})^2)=\mathbf{0.3}
Gini\_index(D,\text{platform=1})=\frac{3}{8}*(1-(\frac{2}{3})^2-(\frac{1}{3})^2)+\frac{5}{8}*(1-(\frac{4}{5})^2-(\frac{1}{5})^2)=0.37
Gini\_index(D,\text{platform=2})=\frac{2}{8}*(1-(\frac{1}{2})^2-(\frac{1}{2})^2)+\frac{6}{8}*(1-(\frac{4}{6})^2-(\frac{2}{6})^2)=0.46
The smaller the Gini index, the purer the resulting partition. So here either salary or platform=0 would be chosen as the splitting criterion; the code check below re-derives all four numbers.
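As a consistency check, the four Gini indices can be recomputed from just the per-branch class counts appearing in the fractions above (the underlying table is not reproduced in these notes, so the counts are read off the formulas; a self-contained sketch):

```python
def weighted_gini(branches):
    """Gini index of a split, given per-branch class counts such as [(3, 0), (3, 2)]."""
    n = sum(sum(branch) for branch in branches)
    return sum(
        sum(branch) / n * (1 - sum((c / sum(branch)) ** 2 for c in branch))
        for branch in branches
    )

print(weighted_gini([(3, 0), (3, 2)]))  # salary     -> 0.30
print(weighted_gini([(2, 1), (1, 4)]))  # pressure   -> 0.3667 (~0.37)
print(weighted_gini([(3, 0), (3, 2)]))  # platform=0 -> 0.30
print(weighted_gini([(2, 1), (4, 1)]))  # platform=1 -> 0.3667 (~0.37)
print(weighted_gini([(1, 1), (4, 2)]))  # platform=2 -> 0.4583 (~0.46)
```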
Question 1
The following are the 8 actual values of the target variable on a training set: [0, 0, 0, 1, 1, 1, 1, 1]. What is the entropy of the target variable?
A. -(5/8 \log(5/8) + 3/8 \log(3/8))
B. 5/8 \log(5/8) + 3/8 \log(3/8)
C. 3/8 \log(5/8) + 5/8 \log(3/8)
D. 5/8 \log(3/8) - 3/8 \log(5/8)
Answer: A
Explanation: the formula for information entropy is -\sum_{i}{p_i \log{p_i}}, and here the class proportions are (5/8, 3/8).
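Numerically, option A evaluates to about 0.954 with base-2 logs (the question itself leaves the base unspecified):

```python
import math
print(-(5/8 * math.log2(5/8) + 3/8 * math.log2(3/8)))  # ~0.9544 bits
```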
Question 2
Which of the following statements about the ID3 algorithm is incorrect?
A. ID3 requires features to be discrete.
B. Information gain is computed from entropy rather than the Gini coefficient.
C. The feature with the largest information gain is selected as the root node of the tree.
D. ID3 is a binary-tree model.
Answer: D
Explanation: ID3 (Iterative Dichotomiser 3) is a decision-tree algorithm invented by Ross Quinlan. It can be summarized as follows: for each attribute not yet used, compute the entropy of the partition it induces on the samples; select the attribute with the smallest resulting entropy (equivalently, the largest information gain); and create a node containing that attribute. Despite the name, ID3 splits each chosen attribute into one branch per value, so the tree is generally multiway rather than binary, which is why D is wrong. ID3's requirements on the data: 1) all attributes must be discrete; 2) every attribute of every training example must have a definite value; 3) identical factors must lead to identical conclusions, and training examples must be unique.
Question 3
What is the relationship between the entropy of a parent node and the entropy of its child nodes in a decision tree?
A. The parent node's entropy is larger
B. The child nodes' entropy is larger
C. The two are equal
D. It depends on the specific situation
Answer: D
Explanation: suppose a parent node holds 2 positive and 3 negative samples (base-10 logarithms are used in all of the numeric examples below):
H(D) = - \frac{2}{5}{\log\frac{2}{5}} - \frac{3}{5}{\log\frac{3}{5}} = 0.29229
Split it further.
Case 1: the two leaf nodes are (2 positive) and (3 negative). Compute the conditional entropies, taking 0 \log 0 = 0:
H\left( D_{1} \right) = - 0{\log 0} - 1{\log 1} = 0
H\left( D_{2} \right) = - 0{\log 0} - 1{\log 1} = 0
Compute the information gain:
g = 0.29229 - \left\lbrack {\frac{2}{5}*0 + \frac{3}{5}*0} \right\rbrack = 0.29229
Case 2: the two leaf nodes are (1 positive, 1 negative) and (1 positive, 2 negative). Compute the conditional entropies:
H\left( D_{1} \right) = - \frac{1}{2}{\log\frac{1}{2}} - \frac{1}{2}{\log\frac{1}{2}} = 0.30103
H\left( D_{2} \right) = - \frac{1}{3}{\log\frac{1}{3}} - \frac{2}{3}{\log\frac{2}{3}} = 0.27643
Compute the information gain:
g = 0.29229 - \left\lbrack {\frac{2}{5}*0.30103 + \frac{3}{5}*0.27643} \right\rbrack = 0.00602
Comparing the two cases: splitting does yield a positive information gain in both, but not every leaf node has lower entropy than the parent; in case 2, H(D_{1}) = 0.30103 exceeds the parent's 0.29229.
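Both cases can be reproduced with the `entropy` helper sketched in the ID3 section, passing base=10 to match the numbers above:

```python
parent = [1, 1, 0, 0, 0]         # 2 positive, 3 negative
left, right = [1, 0], [1, 0, 0]  # case 2's two leaves
print(entropy(parent, base=10))  # 0.29229
print(entropy(left, base=10))    # 0.30103, larger than the parent's entropy
print(entropy(right, base=10))   # 0.27643
print(entropy(parent, base=10)
      - (2/5 * entropy(left, base=10) + 3/5 * entropy(right, base=10)))  # 0.00602
```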
Question 4
The following table shows survey results on whether users use a certain product ( ). Among age, region, education, and income, determine which attribute has the largest information gain with respect to whether the user uses the surveyed product.
A. Age
B. Region
C. Education
D. Income
Answer: C
Explanation:
Question 5
Given a weather dataset, what is the optimal splitting feature?
First compute the entropy of the label itself (again with base-10 logs):
H(D) = - \frac{9}{14}{\log\frac{9}{14}} - \frac{5}{14}{\log\frac{5}{14}} = 0.28305
Compute the conditional entropy for Outlook:
H\left( D \middle| outlook = sunny \right) = \frac{5}{14}\left\lbrack {- \frac{2}{5}{\log\frac{2}{5}} - \frac{3}{5}{\log\frac{3}{5}}} \right\rbrack = 0.10439
H\left( D \middle| outlook = overcast \right) = 0
H\left( D \middle| outlook = rainy \right) = \frac{5}{14}\left\lbrack {- \frac{3}{5}{\log\frac{3}{5}} - \frac{2}{5}{\log\frac{2}{5}}} \right\rbrack = 0.10439
Compute the information gain for Outlook:
\mathbf{g}\left( {\mathbf{D},\mathbf{outlook}} \right) = 0.28305 - 0.10439*2 = \mathbf{0.07427}
Compute the conditional entropy for Humidity:
H\left( D \middle| Humidity = high \right) = \frac{7}{14}\left\lbrack {- \frac{3}{7}{\log\frac{3}{7}} - \frac{4}{7}{\log\frac{4}{7}}} \right\rbrack = 0.14829
H\left( D \middle| Humidity = normal \right) = \frac{7}{14}\left\lbrack {- \frac{6}{7}{\log\frac{6}{7}} - \frac{1}{7}{\log\frac{1}{7}}} \right\rbrack = 0.08906
Compute the information gain for Humidity:
g\left( {D, Humidity} \right) = 0.28305 - 0.14829 - 0.08906 = 0.0457
Compute the conditional entropy for Windy:
H\left( D \middle| Windy = FALSE \right) = \frac{8}{14}\left\lbrack {- \frac{2}{8}{\log\frac{2}{8}} - \frac{6}{8}{\log\frac{6}{8}}} \right\rbrack = 0.13955
H\left( D \middle| Windy = TRUE \right) = \frac{6}{14}\left\lbrack {- \frac{3}{6}{\log\frac{3}{6}} - \frac{3}{6}{\log\frac{3}{6}}} \right\rbrack = 0.12901
Compute the information gain for Windy:
g\left( {D, Windy} \right) = 0.28305 - 0.13955 - 0.12901 = 0.01449
Compute the conditional entropy for Temperature:
H\left( D \middle| Temperature = hot \right) = \frac{4}{14}\left\lbrack {- \frac{2}{4}{\log\frac{2}{4}} - \frac{2}{4}{\log\frac{2}{4}}} \right\rbrack = 0.08601
H\left( D \middle| Temperature = mild \right) = \frac{6}{14}\left\lbrack {- \frac{2}{6}{\log\frac{2}{6}} - \frac{4}{6}{\log\frac{4}{6}}} \right\rbrack = 0.11847
H\left( D \middle| Temperature = cool \right) = \frac{4}{14}\left\lbrack {- \frac{3}{4}{\log\frac{3}{4}} - \frac{1}{4}{\log\frac{1}{4}}} \right\rbrack = 0.06978
Compute the information gain for Temperature:
g\left( D, Temperature \right) = 0.28305 - 0.08601 - 0.11847 - 0.06978 = 0.00879
In summary, Outlook has the largest information gain, so Outlook is selected as the splitting feature. The sketch below cross-checks all four gains.
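All four gains can be recomputed from the per-value class counts used above. A self-contained sketch (`h` and `gain` are hypothetical names; logs are base 10 to match the numbers):

```python
import math

def h(counts, base=10):
    """Entropy of a class-count vector such as (9, 5); zero counts are skipped."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n, base) for c in counts if c)

def gain(branches, base=10):
    """H(D) - sum_v |D^v|/|D| * H(D^v), with one (yes, no) count pair per feature value."""
    totals = [sum(col) for col in zip(*branches)]
    n = sum(totals)
    return h(totals, base) - sum(sum(b) / n * h(b, base) for b in branches)

print(gain([(2, 3), (4, 0), (3, 2)]))  # Outlook     -> 0.07427
print(gain([(2, 2), (4, 2), (3, 1)]))  # Temperature -> 0.00879
print(gain([(3, 4), (6, 1)]))          # Humidity    -> 0.04570
print(gain([(6, 2), (3, 3)]))          # Windy       -> 0.01449
```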