We start with the classic play-tennis example.
Day | Outlook | Temperature | Humidity | Wind | Play Tennis |
1 | sunny | hot | high | weak | no |
2 | sunny | hot | high | strong | no |
3 | overcast | hot | high | weak | yes |
4 | rain | mild | high | weak | yes |
5 | rain | cool | normal | weak | yes |
6 | rain | cool | normal | strong | no |
7 | overcast | cool | normal | strong | yes |
8 | sunny | mild | high | weak | no |
9 | sunny | cool | normal | weak | yes |
10 | rain | mild | normal | weak | yes |
11 | sunny | mild | normal | strong | yes |
12 | overcast | mild | high | strong | yes |
13 | overcast | hot | normal | weak | yes |
14 | rain | mild | high | strong | no |
The data set contains 14 samples: 9 positive (yes) and 5 negative (no). The expected information (that is, the entropy) of these tuples is:
Info(D) = - 9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940
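The entropy figure can be checked numerically. The following is a minimal sketch; the helper name `entropy` and the variable `info_d` are illustrative names chosen here, not part of the original text.

```python
import math

def entropy(counts):
    """Base-2 Shannon entropy of a class distribution given as raw counts."""
    total = sum(counts)
    # Classes with zero count are skipped, since 0 * log2(0) is taken as 0.
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# 9 positive (yes) and 5 negative (no) samples, as counted in the table.
info_d = entropy([9, 5])
print(f"Info(D) = {info_d:.3f}")  # prints "Info(D) = 0.940"
```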
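Next, consider the expected information requirement of each attribute. For the attribute Outlook: sunny has 2 positive and 3 negative samples; overcast has 4 positive and 0 negative samples; rain has 3 positive and 2 negative samples.
The expected information after partitioning the samples by Outlook is:
Info_Outlook(D) = 5/14 * ( - 2/5 * log2(2/5) - 3/5 * log2(3/5)) + 4/14 * ( - 4/4 * log2(4/4)) + 5/14 * ( - 3/5 * log2(3/5) - 2/5 * log2(2/5)) = 0.694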
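The weighted sum for Outlook can be reproduced from the per-value counts alone. This is a small sketch; `outlook_counts` and `info_outlook` are illustrative names, and the counts are the (yes, no) pairs read off the table.

```python
import math

def entropy(counts):
    """Base-2 Shannon entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# (yes, no) counts for each value of Outlook.
outlook_counts = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}

n = sum(sum(c) for c in outlook_counts.values())  # total number of samples

# Each partition's entropy is weighted by its share of the samples.
info_outlook = sum(sum(c) / n * entropy(c) for c in outlook_counts.values())
print(f"{info_outlook:.3f}")  # prints "0.694"
```

The overcast partition is pure, so it contributes zero; the weights are 5/14, 4/14 and 5/14.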
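The information gain of Outlook is therefore:
Gain(Outlook) = 0.940 - 0.694 = 0.246
Computing the gain of the remaining attributes in the same way gives: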
Gain(Temperature) = 0.029
Gain(Humidity) = 0.151
Gain(Wind) = 0.048
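As a cross-check, all four gains can be computed directly from the table. The sketch below assumes the rows are transcribed as whitespace-separated strings; `DATA`, `ATTRS`, `entropy`, `gain` and `gains` are illustrative names.

```python
import math
from collections import Counter

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# The 14 rows of the table, without the Day column; the last field is the label.
DATA = [line.split() for line in """
sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no
""".strip().splitlines()]

def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def gain(rows, col):
    """Information gain obtained by splitting rows on column index col."""
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

gains = {name: gain(DATA, i) for i, name in enumerate(ATTRS)}
for name, g in gains.items():
    print(f"Gain({name}) = {g:.3f}")
```

Computed at full precision, the values for Outlook and Humidity come out about 0.001 higher than the figures listed above, which appear to be truncated or derived from rounded intermediate values. The ranking of the attributes is unaffected.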
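Outlook has the largest information gain, so it is chosen as the root of the tree. Repeating the information-gain calculation on each branch finally yields the following decision tree: Outlook is at the root; the sunny branch tests Humidity (high gives no, normal gives yes); the overcast branch is a yes leaf; the rain branch tests Wind (strong gives no, weak gives yes).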
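The recursive procedure can be sketched as a short ID3-style routine. This is an illustrative implementation under simple assumptions (categorical attributes, no pruning, majority vote when attributes run out), not necessarily the exact procedure the author used; all names (`id3`, `gain`, `DATA`, `ATTRS`, `tree`) are hypothetical.

```python
import math
from collections import Counter

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# The 14 rows of the table, without the Day column; the last field is the label.
DATA = [line.split() for line in """
sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no
""".strip().splitlines()]

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def gain(rows, col):
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

def id3(rows, cols):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:          # pure node: stop
        return labels[0]
    if not cols:                       # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(cols, key=lambda c: gain(rows, c))  # highest information gain
    rest = [c for c in cols if c != best]
    return {ATTRS[best]: {
        value: id3([r for r in rows if r[best] == value], rest)
        for value in sorted({r[best] for r in rows})
    }}

tree = id3(DATA, list(range(len(ATTRS))))
print(tree)
```

On this data the routine splits on Outlook first, then on Humidity under sunny and on Wind under rain, while overcast is already a pure yes leaf.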
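Taking (sunny, mild, normal, FALSE) as a test sample, where FALSE in the Wind position presumably means "not windy" (weak), the decision tree predicts yes: Outlook is sunny and Humidity is normal.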
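Classifying the test sample amounts to walking the tree from the root to a leaf. The sketch below hardcodes, as a nested dictionary, the tree that ID3 produces on this data (Outlook at the root, Humidity under sunny, Wind under rain). The names `TREE` and `classify` are illustrative, and mapping FALSE to `weak` is an assumption; it does not affect the result, because the sunny path never tests Wind.

```python
# Decision tree for the play-tennis data: inner nodes are
# {attribute: {value: subtree}}, leaves are class labels.
TREE = {"Outlook": {
    "sunny": {"Humidity": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rain": {"Wind": {"strong": "no", "weak": "yes"}},
}}

def classify(tree, sample):
    """Follow the branch matching the sample's attribute values until a leaf."""
    while isinstance(tree, dict):
        attr = next(iter(tree))          # the attribute tested at this node
        tree = tree[attr][sample[attr]]  # descend along the matching branch
    return tree

# Test sample (sunny, mild, normal, FALSE); FALSE is assumed to mean weak wind.
sample = {"Outlook": "sunny", "Temperature": "mild",
          "Humidity": "normal", "Wind": "weak"}
prediction = classify(TREE, sample)
print(prediction)  # prints "yes"
```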