信息增益计算和决策树生长过程
给定训练集S,下面以信息增益作为最佳划分的标准,演示信息增益的计算和决策树生长的过程:
根节点
(1)以“Outlook”被选做划分属性
总共有14条数据,打球9条,不打球的5条
根据Outlook进行划分:
- Sunny中有5条数据,其中打球2条,不打球3条
- Overcast中有4条数据,其中打球4条,不打球0条
- Rain中有5条数据,其中打球3条,不打球2条
如下图所示:
计算信息增益:
E
(
O
u
t
l
o
o
k
)
=
−
9
14
l
o
g
2
9
14
−
5
14
l
o
g
2
5
14
E
(
S
S
u
n
n
y
)
=
−
2
5
l
o
g
2
2
5
−
3
5
l
o
g
2
3
5
E
(
S
O
v
e
r
c
a
s
t
)
=
−
4
4
l
o
g
2
4
4
−
0
4
l
o
g
2
0
4
E
(
S
R
a
i
n
)
=
−
3
5
l
o
g
2
3
5
−
2
5
l
o
g
2
2
5
G
a
i
n
(
O
u
t
l
o
o
k
)
=
E
(
O
u
t
l
o
o
k
)
−
[
5
14
E
(
S
S
u
n
n
y
)
+
4
14
E
(
S
O
v
e
r
c
a
s
t
)
+
5
14
E
(
S
R
a
i
n
)
]
=
0.246
E(Outlook)=-\frac{9}{14}log_2\frac{9}{14}-\frac{5}{14}log_2\frac{5}{14} \\E(S_{Sunny})=-\frac{2}{5}log_2\frac{2}{5}-\frac{3}{5}log_2\frac{3}{5} \\E(S_{Overcast})=-\frac{4}{4}log_2\frac{4}{4}-\frac{0}{4}log_2\frac{0}{4} \\E(S_{Rain})=-\frac{3}{5}log_2\frac{3}{5}-\frac{2}{5}log_2\frac{2}{5} \\Gain(Outlook)=E(Outlook)-[\frac{5}{14}E(S_{Sunny})+\frac{4}{14}E(S_{Overcast})+\frac{5}{14}E(S_{Rain})] =0.246
E(Outlook)=−149log2149−145log2145E(SSunny)=−52log252−53log253E(SOvercast)=−44log244−40log240E(SRain)=−53log253−52log252Gain(Outlook)=E(Outlook)−[145E(SSunny)+144E(SOvercast)+145E(SRain)]=0.246
可见,用属性“Outlook”划分样本集S的信息增益为:
Gain(S,Outlook)=0.246
(2)以“Temperature”作为划分属性
根据Temperature进行划分:
- Hot中有4条数据,其中打球2条,不打球2条
- Mild中有6条数据,其中打球4条,不打球2条
- Cool中有4条数据,其中打球3条,不打球1条
如下图所示:
计算信息增益:
E
(
T
e
m
p
e
r
a
t
u
r
e
)
=
−
9
14
l
o
g
2
9
14
−
5
14
l
o
g
2
5
14
E
(
S
H
o
t
)
=
−
2
4
l
o
g
2
2
4
−
2
4
l
o
g
2
2
4
E
(
S
M
i
l
d
)
=
−
4
6
l
o
g
2
4
6
−
2
6
l
o
g
2
2
6
E
(
S
R
a
i
n
)
=
−
3
4
l
o
g
2
3
4
−
1
4
l
o
g
2
1
4
G
a
i
n
(
T
e
m
p
e
r
a
t
u
r
e
)
=
E
(
T
e
m
p
e
r
a
t
u
r
e
)
−
[
4
14
E
(
S
H
o
t
)
+
6
14
E
(
S
M
i
l
d
)
+
4
14
E
(
S
C
o
o
l
)
]
=
0.029
E(Temperature)=-\frac{9}{14}log_2\frac{9}{14}-\frac{5}{14}log_2\frac{5}{14} \\E(S_{Hot})=-\frac{2}{4}log_2\frac{2}{4}-\frac{2}{4}log_2\frac{2}{4} \\E(S_{Mild})=-\frac{4}{6}log_2\frac{4}{6}-\frac{2}{6}log_2\frac{2}{6} \\E(S_{Rain})=-\frac{3}{4}log_2\frac{3}{4}-\frac{1}{4}log_2\frac{1}{4} \\Gain(Temperature)=E(Temperature)-[\frac{4}{14}E(S_{Hot})+\frac{6}{14}E(S_{Mild})+\frac{4}{14}E(S_{Cool})] =0.029
E(Temperature)=−149log2149−145log2145E(SHot)=−42log242−42log242E(SMild)=−64log264−62log262E(SRain)=−43log243−41log241Gain(Temperature)=E(Temperature)−[144E(SHot)+146E(SMild)+144E(SCool)]=0.029
(3)以“Humidity”作为划分属性
根据Humidity进行划分:
- High中有7条数据,其中打球3条,不打球4条
- Normal中有7条数据,其中打球6条,不打球1条
如下图所示:
计算信息增益:
E
(
H
u
m
i
d
i
t
y
)
=
−
9
14
l
o
g
2
9
14
−
5
14
l
o
g
2
5
14
E
(
S
H
i
g
h
)
=
−
3
7
l
o
g
2
3
7
−
4
7
l
o
g
2
4
7
E
(
S
N
o
r
m
a
l
)
=
−
6
7
l
o
g
2
6
7
−
1
7
l
o
g
2
1
7
G
a
i
n
(
H
u
m
i
d
i
t
y
)
=
E
(
H
u
m
i
d
i
t
y
)
−
[
7
14
E
(
S
H
i
g
h
)
+
7
14
E
(
S
N
o
r
m
a
l
)
]
=
0.151
E(Humidity)=-\frac{9}{14}log_2\frac{9}{14}-\frac{5}{14}log_2\frac{5}{14} \\E(S_{High})=-\frac{3}{7}log_2\frac{3}{7}-\frac{4}{7}log_2\frac{4}{7} \\E(S_{Normal})=-\frac{6}{7}log_2\frac{6}{7}-\frac{1}{7}log_2\frac{1}{7} \\Gain(Humidity)=E(Humidity)-[\frac{7}{14}E(S_{High})+\frac{7}{14}E(S_{Normal})] =0.151
E(Humidity)=−149log2149−145log2145E(SHigh)=−73log273−74log274E(SNormal)=−76log276−71log271Gain(Humidity)=E(Humidity)−[147E(SHigh)+147E(SNormal)]=0.151
(4)以“Wind”作为划分属性
根据Wind进行划分:
- Weak中有8条数据,其中打球6条,不打球2条
- Strong中有6条数据,其中打球3条,不打球3条
如下图所示:
计算信息增益:
E
(
W
i
n
d
)
=
−
9
14
l
o
g
2
9
14
−
5
14
l
o
g
2
5
14
E
(
S
W
e
a
k
)
=
−
6
8
l
o
g
2
6
8
−
2
8
l
o
g
2
2
8
E
(
S
S
t
r
o
n
g
)
=
−
3
6
l
o
g
2
3
6
−
3
6
l
o
g
2
3
6
G
a
i
n
(
W
i
n
d
)
=
E
(
W
i
n
d
)
−
[
8
14
E
(
S
W
e
a
k
)
+
6
14
E
(
S
S
t
r
o
n
g
)
]
=
0.048
E(Wind)=-\frac{9}{14}log_2\frac{9}{14}-\frac{5}{14}log_2\frac{5}{14} \\E(S_{Weak})=-\frac{6}{8}log_2\frac{6}{8}-\frac{2}{8}log_2\frac{2}{8} \\E(S_{Strong})=-\frac{3}{6}log_2\frac{3}{6}-\frac{3}{6}log_2\frac{3}{6} \\Gain(Wind)=E(Wind)-[\frac{8}{14}E(S_{Weak})+\frac{6}{14}E(S_{Strong})] =0.048
E(Wind)=−149log2149−145log2145E(SWeak)=−86log286−82log282E(SStrong)=−63log263−63log263Gain(Wind)=E(Wind)−[148E(SWeak)+146E(SStrong)]=0.048
(5)根节点选择
比较四个以不同属性划分的信息增益:
- Gain(S,Outlook)=0.246
- Gain(S,Temperature)=0.029
- Gain(S,Humidity)=0.151
- Gain(S,Wind)=0.048
所以,对于当前节点,用“Outlook”划分样本集S的信息增益最大,被选为划分属性。
左儿子节点(Sunny)
(1)以“Temperature”作为划分属性
总共有5条数据,打球2条,不打球的3条
根据Temperature进行划分:
- Hot中有2条数据,其中打球0条,不打球2条
- Mild中有2条数据,其中打球1条,不打球1条
- Cool中有1条数据,其中打球1条,不打球0条
如下图所示:
计算信息增益:
E
(
T
e
m
p
e
r
a
t
u
r
e
)
=
−
2
5
l
o
g
2
2
5
−
3
5
l
o
g
2
3
5
E
(
S
H
o
t
)
=
−
0
2
l
o
g
2
0
2
−
2
2
l
o
g
2
2
2
E
(
S
M
i
l
d
)
=
−
1
2
l
o
g
2
1
2
−
1
2
l
o
g
2
1
2
E
(
S
R
a
i
n
)
=
−
1
1
l
o
g
2
1
1
−
0
1
l
o
g
2
0
1
G
a
i
n
(
T
e
m
p
e
r
a
t
u
r
e
)
=
E
(
T
e
m
p
e
r
a
t
u
r
e
)
−
[
2
5
E
(
S
H
o
t
)
+
2
5
E
(
S
M
i
l
d
)
+
1
5
E
(
S
C
o
o
l
)
]
=
0.5710
E(Temperature)=-\frac{2}{5}log_2\frac{2}{5}-\frac{3}{5}log_2\frac{3}{5} \\E(S_{Hot})=-\frac{0}{2}log_2\frac{0}{2}-\frac{2}{2}log_2\frac{2}{2} \\E(S_{Mild})=-\frac{1}{2}log_2\frac{1}{2}-\frac{1}{2}log_2\frac{1}{2} \\E(S_{Rain})=-\frac{1}{1}log_2\frac{1}{1}-\frac{0}{1}log_2\frac{0}{1} \\Gain(Temperature)=E(Temperature)-[\frac{2}{5}E(S_{Hot})+\frac{2}{5}E(S_{Mild})+\frac{1}{5}E(S_{Cool})] =0.5710
E(Temperature)=−52log252−53log253E(SHot)=−20log220−22log222E(SMild)=−21log221−21log221E(SRain)=−11log211−10log210Gain(Temperature)=E(Temperature)−[52E(SHot)+52E(SMild)+51E(SCool)]=0.5710
(2)以“Humidity”作为划分属性
总共有5条数据,打球2条,不打球的3条
根据Humidity进行划分:
- High中有3条数据,其中打球0条,不打球3条
- Normal中有2条数据,其中打球2条,不打球0条
如下图所示:
计算信息增益:
E
(
H
u
m
i
d
i
t
y
)
=
−
2
5
l
o
g
2
2
5
−
3
5
l
o
g
2
3
5
E
(
S
H
i
g
h
)
=
−
0
3
l
o
g
2
0
3
−
3
3
l
o
g
2
3
3
E
(
S
N
o
r
m
a
l
)
=
−
2
2
l
o
g
2
2
2
−
0
2
l
o
g
2
0
2
G
a
i
n
(
H
u
m
i
d
i
t
y
)
=
E
(
H
u
m
i
d
i
t
y
)
−
3
5
E
(
S
H
i
g
h
)
+
2
5
E
(
S
N
o
r
m
a
l
)
=
0.9710
E(Humidity)=-\frac{2}{5}log_2\frac{2}{5}-\frac{3}{5}log_2\frac{3}{5} \\E(S_{High})=-\frac{0}{3}log_2\frac{0}{3}-\frac{3}{3}log_2\frac{3}{3} \\E(S_{Normal})=-\frac{2}{2}log_2\frac{2}{2}-\frac{0}{2}log_2\frac{0}{2} \\Gain(Humidity)=E(Humidity)-\frac{3}{5}E(S_{High})+\frac{2}{5}E(S_{Normal}) =0.9710
E(Humidity)=−52log252−53log253E(SHigh)=−30log230−33log233E(SNormal)=−22log222−20log220Gain(Humidity)=E(Humidity)−53E(SHigh)+52E(SNormal)=0.9710
(3)以“Wind”作为划分属性
总共有5条数据,打球2条,不打球的3条
根据Wind进行划分:
- Weak中有3条数据,其中打球1条,不打球2条
- Strong中有2条数据,其中打球1条,不打球1条
如下图所示:
计算信息增益:
E
(
W
i
n
d
)
=
−
2
5
l
o
g
2
2
5
−
3
5
l
o
g
2
3
5
E
(
S
W
e
a
k
)
=
−
1
3
l
o
g
2
1
3
−
2
3
l
o
g
2
2
3
E
(
S
S
t
r
o
n
g
)
=
−
1
2
l
o
g
2
1
2
−
1
2
l
o
g
2
1
2
G
a
i
n
(
W
i
n
d
)
=
E
(
W
i
n
d
)
−
[
3
5
E
(
S
W
e
a
k
)
+
2
5
E
(
S
S
t
r
o
n
g
)
]
=
0.019973
E(Wind)=-\frac{2}{5}log_2\frac{2}{5}-\frac{3}{5}log_2\frac{3}{5} \\E(S_{Weak})=-\frac{1}{3}log_2\frac{1}{3}-\frac{2}{3}log_2\frac{2}{3} \\E(S_{Strong})=-\frac{1}{2}log_2\frac{1}{2}-\frac{1}{2}log_2\frac{1}{2} \\Gain(Wind)=E(Wind)-[\frac{3}{5}E(S_{Weak})+\frac{2}{5}E(S_{Strong})] =0.019973
E(Wind)=−52log252−53log253E(SWeak)=−31log231−32log232E(SStrong)=−21log221−21log221Gain(Wind)=E(Wind)−[53E(SWeak)+52E(SStrong)]=0.019973
(4)左儿子节点选择
比较四个以不同属性划分的信息增益:
- Gain(S,Temperature)=0.5710
- Gain(S,Humidity)=0.9710
- Gain(S,Wind)=0.019973
所以,对于当前节点,用“Humidity”划分样本集S的信息增益最大,被选为划分属性。
右儿子节点(Rain)
(1)以“Temperature”作为划分属性
总共有5条数据,打球3条,不打球的2条
根据Temperature进行划分:
- Mild中有3条数据,其中打球2条,不打球1条
- Cool中有2条数据,其中打球1条,不打球1条
如下图所示:
计算信息增益:
E
(
T
e
m
p
e
r
a
t
u
r
e
)
=
−
3
5
l
o
g
2
3
5
−
2
5
l
o
g
2
2
5
E
(
S
M
i
l
d
)
=
−
2
3
l
o
g
2
2
3
−
1
3
l
o
g
2
1
3
E
(
S
R
a
i
n
)
=
−
1
2
l
o
g
2
1
2
−
1
2
l
o
g
2
1
2
G
a
i
n
(
T
e
m
p
e
r
a
t
u
r
e
)
=
E
(
T
e
m
p
e
r
a
t
u
r
e
)
−
[
3
5
E
(
S
M
i
l
d
)
+
2
5
E
(
S
C
o
o
l
)
]
=
0.019973
E(Temperature)=-\frac{3}{5}log_2\frac{3}{5}-\frac{2}{5}log_2\frac{2}{5} \\E(S_{Mild})=-\frac{2}{3}log_2\frac{2}{3}-\frac{1}{3}log_2\frac{1}{3} \\E(S_{Rain})=-\frac{1}{2}log_2\frac{1}{2}-\frac{1}{2}log_2\frac{1}{2} \\Gain(Temperature)=E(Temperature)-[\frac{3}{5}E(S_{Mild})+\frac{2}{5}E(S_{Cool})] = 0.019973
E(Temperature)=−53log253−52log252E(SMild)=−32log232−31log231E(SRain)=−21log221−21log221Gain(Temperature)=E(Temperature)−[53E(SMild)+52E(SCool)]=0.019973
(2)以“Humidity”作为划分属性
总共有5条数据,打球3条,不打球的2条
根据Humidity进行划分:
- High中有2条数据,其中打球1条,不打球1条
- Normal中有3条数据,其中打球2条,不打球1条
如下图所示:
计算信息增益:
E
(
H
u
m
i
d
i
t
y
)
=
−
3
5
l
o
g
2
3
5
−
2
5
l
o
g
2
2
5
E
(
S
H
i
g
h
)
=
−
1
2
l
o
g
2
1
2
−
1
2
l
o
g
2
1
2
E
(
S
N
o
r
m
a
l
)
=
−
2
3
l
o
g
2
2
3
−
1
3
l
o
g
2
1
3
G
a
i
n
(
H
u
m
i
d
i
t
y
)
=
E
(
H
u
m
i
d
i
t
y
)
−
[
2
5
E
(
S
H
i
g
h
)
+
3
5
E
(
S
N
o
r
m
a
l
)
]
=
0.019973
E(Humidity)=-\frac{3}{5}log_2\frac{3}{5}-\frac{2}{5}log_2\frac{2}{5} \\E(S_{High})=-\frac{1}{2}log_2\frac{1}{2}-\frac{1}{2}log_2\frac{1}{2} \\E(S_{Normal})=-\frac{2}{3}log_2\frac{2}{3}-\frac{1}{3}log_2\frac{1}{3} \\Gain(Humidity)=E(Humidity)-[\frac{2}{5}E(S_{High})+\frac{3}{5}E(S_{Normal})] = 0.019973
E(Humidity)=−53log253−52log252E(SHigh)=−21log221−21log221E(SNormal)=−32log232−31log231Gain(Humidity)=E(Humidity)−[52E(SHigh)+53E(SNormal)]=0.019973
(3)以“Wind”作为划分属性
总共有5条数据,打球3条,不打球的2条
根据Wind进行划分:
- Weak中有3条数据,其中打球3条,不打球0条
- Strong中有2条数据,其中打球0条,不打球2条
如下图所示:
计算信息增益:
E
(
W
i
n
d
)
=
−
3
5
l
o
g
2
3
5
−
2
5
l
o
g
2
2
5
E
(
S
W
e
a
k
)
=
−
3
3
l
o
g
2
3
3
−
0
3
l
o
g
2
0
3
E
(
S
S
t
r
o
n
g
)
=
−
0
2
l
o
g
2
0
2
−
2
2
l
o
g
2
2
2
G
a
i
n
(
W
i
n
d
)
=
E
(
W
i
n
d
)
−
[
3
5
E
(
S
W
e
a
k
)
+
2
5
E
(
S
S
t
r
o
n
g
)
]
=
0.9710
E(Wind)=-\frac{3}{5}log_2\frac{3}{5}-\frac{2}{5}log_2\frac{2}{5} \\E(S_{Weak})=-\frac{3}{3}log_2\frac{3}{3}-\frac{0}{3}log_2\frac{0}{3} \\E(S_{Strong})=-\frac{0}{2}log_2\frac{0}{2}-\frac{2}{2}log_2\frac{2}{2} \\Gain(Wind)=E(Wind)-[\frac{3}{5}E(S_{Weak})+\frac{2}{5}E(S_{Strong})] =0.9710
E(Wind)=−53log253−52log252E(SWeak)=−33log233−30log230E(SStrong)=−20log220−22log222Gain(Wind)=E(Wind)−[53E(SWeak)+52E(SStrong)]=0.9710
(4)右儿子节点选择
比较四个以不同属性划分的信息增益:
- Gain(S,Temperature)= 0.019973
- Gain(S,Humidity)=0.019973
- Gain(S,Wind)=0.9710
所以,对于当前节点,用“Wind”划分样本集S的信息增益最大,被选为划分属性。