Age | Income | Student | Credit rating | Buy computer |
---|---|---|---|---|
<=30 | high | no | fair | no |
<=30 | high | no | excellent | no |
31-40 | high | no | fair | yes |
>40 | medium | no | fair | yes |
>40 | low | yes | fair | yes |
>40 | low | yes | excellent | no |
31-40 | low | yes | excellent | yes |
<=30 | medium | no | fair | no |
<=30 | low | yes | fair | yes |
>40 | medium | yes | fair | yes |
<=30 | medium | yes | excellent | yes |
31-40 | medium | no | excellent | yes |
31-40 | high | yes | fair | yes |
>40 | medium | no | excellent | no |
ID3
Split on the feature with the largest information gain.
1. Entropy
$$H(X) = -\sum_{i \in C} p(i) \log(p(i))$$
- X: feature X
- C: the set of all classes (distinct values) of feature X
- i: one class in C
- The logarithm is usually taken base 2 ($\log_2$)
Taking the Buy computer feature as an example:
$$
\begin{aligned}
p(\text{no}) &= \frac{5}{14} \\
p(\text{yes}) &= \frac{9}{14} \\
H(\text{Buy computer}) &= -\frac{5}{14} \log\left(\frac{5}{14}\right) - \frac{9}{14} \log\left(\frac{9}{14}\right)
\end{aligned}
$$
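As a quick numerical check of the entropy above, here is a minimal Python sketch. The function name `entropy` and the variable names are illustrative choices rather than part of the original notes, and base-2 logarithms are assumed, as the text suggests.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    # Counter only holds classes that actually occur, so a 0 * log(0)
    # term never has to be evaluated.
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


# "Buy computer" column of the table: 5 rows say "no", 9 rows say "yes".
buy_computer = ["no"] * 5 + ["yes"] * 9

h_buy = entropy(buy_computer)
print(f"H(Buy computer) = {h_buy:.3f}")  # H(Buy computer) = 0.940
```

With base 2 the result is measured in bits; a perfectly balanced two-class column would give exactly 1.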
2. Conditional Entropy
$$H(X \mid Y) = \sum_{i \in C_Y} p(i)\, H(X \mid Y = i)$$
- X: feature X
- Y: feature Y
- $C_Y$: the set of all classes (distinct values) of feature Y
- i: one class in $C_Y$
Taking X = Buy computer and Y = Age as an example:
$$
\begin{aligned}
p(\le 30) &= \frac{5}{14} \\
p(31\text{-}40) &= \frac{4}{14} \\
p(> 40) &= \frac{5}{14} \\
H(\text{Buy computer} \mid \text{Age}) &= \frac{5}{14} H(\text{Buy computer} \mid \le 30) + \frac{4}{14} H(\text{Buy computer} \mid 31\text{-}40) + \frac{5}{14} H(\text{Buy computer} \mid > 40) \\
p(\text{no} \mid \le 30) &= \frac{3}{5} \\
p(\text{yes} \mid \le 30) &= \frac{2}{5} \\
H(\text{Buy computer} \mid \le 30) &= -\frac{3}{5} \log\left(\frac{3}{5}\right) - \frac{2}{5} \log\left(\frac{2}{5}\right) \\
p(\text{no} \mid 31\text{-}40) &= 0 \\
p(\text{yes} \mid 31\text{-}40) &= 1 \\
H(\text{Buy computer} \mid 31\text{-}40) &= -0 \log(0) - 1 \log(1) = 0 \\
p(\text{no} \mid > 40) &= \frac{2}{5} \\
p(\text{yes} \mid > 40) &= \frac{3}{5} \\
H(\text{Buy computer} \mid > 40) &= -\frac{2}{5} \log\left(\frac{2}{5}\right) - \frac{3}{5} \log\left(\frac{3}{5}\right)
\end{aligned}
$$
Note that $0 \log(0)$ is taken to be $0$, since $p \log(p) \to 0$ as $p \to 0^+$.
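The weighted sum above can be sketched in Python as follows. The helper names `entropy` and `conditional_entropy` are hypothetical, the two lists simply copy the Age and Buy computer columns in table order, and base-2 logarithms are assumed.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def conditional_entropy(target, feature):
    """H(target | feature): per-group entropy weighted by group size."""
    groups = defaultdict(list)
    for t, f in zip(target, feature):
        groups[f].append(t)
    total = len(target)
    return sum(len(g) / total * entropy(g) for g in groups.values())


# Age and Buy computer columns, in the same row order as the table.
age = ["<=30", "<=30", "31-40", ">40", ">40", ">40", "31-40",
       "<=30", "<=30", ">40", "<=30", "31-40", "31-40", ">40"]
buy = ["no", "no", "yes", "yes", "yes", "no", "yes",
       "no", "yes", "yes", "yes", "yes", "yes", "no"]

h_buy_given_age = conditional_entropy(buy, age)
print(f"H(Buy computer | Age) = {h_buy_given_age:.3f}")  # H(Buy computer | Age) = 0.694
```

The 31-40 group contains only "yes" rows, so it contributes zero entropy, which is the $0 \log(0) = 0$ case handled implicitly: the absent class never enters the sum.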
3. Information Gain
$$IG(X, Y) = H(X) - H(X \mid Y)$$
- X: feature X, the target feature
- Y: feature Y
Taking Age as an example:
$$IG(\text{Buy computer}, \text{Age}) = H(\text{Buy computer}) - H(\text{Buy computer} \mid \text{Age})$$
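To see how the ID3 selection rule plays out on the whole table, the sketch below computes the information gain of every candidate feature and picks the largest. The names `COLUMNS`, `ROWS`, and `information_gain` are illustrative assumptions, not from the original notes; base-2 logarithms are assumed.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ["Age", "Income", "Student", "Credit rating", "Buy computer"]
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31-40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31-40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no", "excellent", "yes"),
    ("31-40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature_idx, target_idx=-1):
    """IG(target, feature) = H(target) - H(target | feature)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_idx]].append(row[target_idx])
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target_idx] for row in rows]) - conditional


# Gain of each of the four candidate features with respect to "Buy computer".
gains = {COLUMNS[i]: information_gain(ROWS, i) for i in range(4)}
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gain:.3f}")

best = max(gains, key=gains.get)
print("Split on:", best)  # Split on: Age
```

On this table Age has the largest gain (about 0.247 bits), so under the rule stated above it would be chosen for the first split.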
Gini
Split on the feature with the smallest Gini split.
Gini Index
$$gini(X) = 1 - \sum_{i \in C} p(i)^2$$
- X: feature X
- C: the set of all classes (distinct values) of feature X
- i: one class in C
Taking the Buy computer feature as an example:
$$
\begin{aligned}
p(\text{no}) &= \frac{5}{14} \\
p(\text{yes}) &= \frac{9}{14} \\
gini(\text{Buy computer}) &= 1 - \left(\frac{5}{14}\right)^2 - \left(\frac{9}{14}\right)^2
\end{aligned}
$$
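The Gini index above can be evaluated with a few lines of Python. This is only a sketch; the function name `gini` mirrors the notation in the formula and is otherwise an illustrative choice.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


# "Buy computer" column of the table: 5 rows say "no", 9 rows say "yes".
buy_computer = ["no"] * 5 + ["yes"] * 9

g_buy = gini(buy_computer)
print(f"gini(Buy computer) = {g_buy:.3f}")  # gini(Buy computer) = 0.459
```

The exact value is 90/196; a column with a single class would give 0, and a balanced two-class column would give 0.5.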
Gini Split
$$gini_{split}(X, Y) = \sum_{i \in C_Y} p(i)\, gini(X \mid Y = i)$$
- X: feature X, the target feature
- Y: feature Y
- $C_Y$: the set of all classes (distinct values) of feature Y
- i: one class in $C_Y$
Taking Age as an example:
$$
\begin{aligned}
p(\le 30) &= \frac{5}{14} \\
p(31\text{-}40) &= \frac{4}{14} \\
p(> 40) &= \frac{5}{14} \\
gini_{split}(\text{Buy computer}, \text{Age}) &= \frac{5}{14}\, gini(\text{Buy computer} \mid \le 30) + \frac{4}{14}\, gini(\text{Buy computer} \mid 31\text{-}40) + \frac{5}{14}\, gini(\text{Buy computer} \mid > 40) \\
p(\text{no} \mid \le 30) &= \frac{3}{5} \\
p(\text{yes} \mid \le 30) &= \frac{2}{5} \\
gini(\text{Buy computer} \mid \le 30) &= 1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2 \\
p(\text{no} \mid 31\text{-}40) &= 0 \\
p(\text{yes} \mid 31\text{-}40) &= 1 \\
gini(\text{Buy computer} \mid 31\text{-}40) &= 1 - 0^2 - 1^2 = 0 \\
p(\text{no} \mid > 40) &= \frac{2}{5} \\
p(\text{yes} \mid > 40) &= \frac{3}{5} \\
gini(\text{Buy computer} \mid > 40) &= 1 - \left(\frac{2}{5}\right)^2 - \left(\frac{3}{5}\right)^2
\end{aligned}
$$
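The Gini split of every candidate feature can be compared in the same way. The sketch below is an illustration under the definitions above; `COLUMNS`, `ROWS`, and `gini_split` are hypothetical names introduced here, not part of the original notes.

```python
from collections import Counter, defaultdict

COLUMNS = ["Age", "Income", "Student", "Credit rating", "Buy computer"]
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31-40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31-40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no", "excellent", "yes"),
    ("31-40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_split(rows, feature_idx, target_idx=-1):
    """Weighted average of the Gini index within each feature value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_idx]].append(row[target_idx])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


# Gini split of each of the four candidate features.
splits = {COLUMNS[i]: gini_split(ROWS, i) for i in range(4)}
for name, value in sorted(splits.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.3f}")

best = min(splits, key=splits.get)
print("Split on:", best)  # Split on: Age
```

On this table Age gives the smallest Gini split (about 0.343), so the Gini criterion picks the same first split as information gain does.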