ID3 Study Notes
I. Data Characteristics
DataSet
ID   Attribute1  Attribute2  ...  AttributeN  Class
1    value_11    value_12    ...  value_1N    class1
2    value_21    value_22    ...  value_2N    class2
...  ...         ...         ...  ...         ...
M    value_M1    value_M2    ...  value_MN    classM
Notes:
[1] Each record has N attributes
[2] Each attribute can take several values
[3] Each record belongs to exactly one class
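One possible in-memory form of such a data set is sketched below, assuming each record is a dict that maps attribute names to values and stores its class under a "Class" key. The sample values and the helper names attribute_values and class_labels are hypothetical and only illustrate notes [1] to [3].

```python
# Hypothetical sample: every record has the same attributes plus one class label.
records = [
    {"Attribute1": "a", "Attribute2": "x", "Class": "class1"},
    {"Attribute1": "b", "Attribute2": "x", "Class": "class2"},
    {"Attribute1": "a", "Attribute2": "y", "Class": "class1"},
]


def attribute_values(rows, attribute):
    """Return the set of distinct values that an attribute takes in rows."""
    return {row[attribute] for row in rows}


def class_labels(rows, target="Class"):
    """Return the class label of every record, in order."""
    return [row[target] for row in rows]


print(attribute_values(records, "Attribute1"))
print(class_labels(records))
```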
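II. Computing Entropy
Entropy
Entropy(S) = Σ_I -p(I) log2 p(I)
Notes:
S     is the data set
Σ_I   sums over every class I that occurs in S
p(I)  is the proportion of records in S that belong to class I
log2  is the base-2 logarithm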
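Examples:
[1] If S is a set of 14 records, 9 in class Yes and 5 in class No:
Entropy(S) = - (9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940
[2] If every element of S belongs to the same class, the entropy is 0 (the set is perfectly classified). With two classes, entropy ranges from 0 (fully determined) to 1 (classes evenly split); with k classes the maximum is log2 k.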
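The entropy formula can be checked with a few lines of Python. This is a minimal sketch, assuming S is given simply as a list of class labels; the function name entropy is chosen here for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # -p * log2(p) is written as p * log2(1 / p) so a pure set gives 0.0, not -0.0.
    return sum((n / total) * log2(total / n) for n in counts.values())


# Example [1]: 9 records in class Yes and 5 in class No.
labels = ["Yes"] * 9 + ["No"] * 5
print(f"{entropy(labels):.3f}")  # prints 0.940
```

A list whose labels are all identical yields 0, matching example [2].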
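III. Computing Information Gain
Gain
Gain(S, A), the information gain of data set S on attribute A, is defined as:
Gain(S, A) = Entropy(S) - Σ_v ((|Sv| / |S|) * Entropy(Sv))
Notes:
Σ_v  sums over every possible value v of attribute A
Sv   = the subset of S in which attribute A has value v
|Sv| = the number of elements in Sv
|S|  = the number of elements in S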
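The gain definition translates almost directly into code. The sketch below assumes records are dicts and that the class label is stored under a key passed as target; the function names and the four-record sample are illustrative, not taken from the original program.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def gain(rows, attribute, target):
    """Information gain of splitting rows (a list of dicts) on attribute."""
    # Group the class labels by the value v that the attribute takes: these are the Sv.
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    # Weighted sum of |Sv| / |S| * Entropy(Sv) over all values v.
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical sample in which attribute A separates the two classes perfectly.
sample = [
    {"A": "x", "Class": "yes"},
    {"A": "x", "Class": "yes"},
    {"A": "z", "Class": "no"},
    {"A": "z", "Class": "no"},
]
print(gain(sample, "A", "Class"))  # prints 1.0
```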
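IV. Building the Decision Tree Recursively
[1] Compute Entropy(S) for S, and compute Gain(S, A) for each attribute A of S;
[2] Take the attribute A with the largest Gain(S, A) as the root node p0 of the decision tree;
[3] For the values v1, v2, ..., vN of A, partition S into S1, S2, ..., SN; the values v1, v2, ..., vN label the branches leaving p0;
[4] Repeat steps [1], [2], and [3] on S1, S2, ..., SN to obtain the next decision nodes p1, p2, ..., pN, until leaf nodes are reached. A subset becomes a leaf once all of its records share one class, and the leaf holds that class value.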
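Steps [1] to [4] can be sketched as a short recursive function. Several details below are assumptions not stated in these notes: a tree is represented as either a class label (leaf) or an (attribute, {value: subtree}) pair, an attribute is not reused further down a branch, and a subset with no attributes left falls back to its majority class.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def gain(rows, attribute, target):
    """Information gain of splitting rows (a list of dicts) on attribute."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


def id3(rows, attributes, target):
    """Return a class label (leaf) or an (attribute, {value: subtree}) pair."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:
        return labels[0]  # all records share one class: leaf node
    if not attributes:
        # Assumption: with nothing left to split on, use the majority class.
        return Counter(labels).most_common(1)[0][0]
    # Steps [1] and [2]: pick the attribute with the largest gain.
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Step [3]: partition the rows by the value of the chosen attribute.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[best]].append(row)
    # Step [4]: recurse into every partition.
    return (best, {v: id3(sub, remaining, target) for v, sub in partitions.items()})


# Hypothetical sample: A determines the class, B carries no information.
sample = [
    {"A": "x", "B": "p", "Class": "yes"},
    {"A": "x", "B": "q", "Class": "yes"},
    {"A": "z", "B": "p", "Class": "no"},
    {"A": "z", "B": "q", "Class": "no"},
]
print(id3(sample, ["A", "B"], "Class"))
```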
V. Rule Generation
[0] Example DataSet
Day | Outlook | Temperature | Humidity | Wind | Play ball |
D1 | Sunny | Hot | High | Weak | No |
D2 | Sunny | Hot | High | Strong | No |
D3 | Overcast | Hot | High | Weak | Yes |
D4 | Rain | Mild | High | Weak | Yes |
D5 | Rain | Cool | Normal | Weak | Yes |
D6 | Rain | Cool | Normal | Strong | No |
D7 | Overcast | Cool | Normal | Strong | Yes |
D8 | Sunny | Mild | High | Weak | No |
D9 | Sunny | Cool | Normal | Weak | Yes |
D10 | Rain | Mild | Normal | Weak | Yes |
D11 | Sunny | Mild | Normal | Strong | Yes |
D12 | Overcast | Mild | High | Strong | Yes |
D13 | Overcast | Hot | Normal | Weak | Yes |
D14 | Rain | Mild | High | Strong | No |
Example: the information gain of Wind. Wind = Weak in 8 records (6 Yes, 2 No) and Wind = Strong in 6 records (3 Yes, 3 No), so:
Entropy(S) = - (9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.940
Entropy(Sweak) = - (6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.811
Entropy(Sstrong) = - (3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.00
Gain(S,Wind) = Entropy(S) - (8/14)*Entropy(Sweak) - (6/14)*Entropy(Sstrong)
             = 0.940 - (8/14)*0.811 - (6/14)*1.00
             = 0.048
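The hand calculation for Wind can be reproduced, and repeated for the other three attributes, with the sketch below. The 14 rows are copied from the table above; the helper names entropy and gain and the parsing of the table from a string are illustrative choices.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play ball"]
DATA = """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
# One dict per day D1..D14, keyed by column name.
rows = [dict(zip(COLUMNS, line.split())) for line in DATA.splitlines()]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def gain(rows, attribute, target):
    """Information gain of splitting rows (a list of dicts) on attribute."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


gains = {a: gain(rows, a, "Play ball") for a in COLUMNS[:-1]}
for attribute, value in gains.items():
    print(f"{attribute}: {value:.3f}")  # the Wind line reads 0.048
```

Outlook has the largest gain of the four, which is why it appears at the root of the tree for this data set.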
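[1] Picturing the decision tree: each leaf node is a class, and each non-leaf node is an attribute.
[2] Rule expressions
IF Attribute[i] = value[j] AND ... THEN Class = class[x]
Attribute[i] ranges over Attribute1 to AttributeN
value[j] ranges over v1 to vN
class[x] ranges over class1 to classN
The tree for the example above can be expressed as: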
IF outlook = sunny AND humidity = high THEN playball = no
IF outlook = sunny AND humidity = normal THEN playball = yes
IF outlook = overcast THEN playball = yes
IF outlook = rain AND wind = strong THEN playball = no
IF outlook = rain AND wind = weak THEN playball = yes
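Reading rules off a tree amounts to listing every root-to-leaf path. The sketch below assumes a hypothetical tree representation, a class label for a leaf and an (attribute, {value: subtree}) pair otherwise, and hard-codes the tree for the weather example as it appears in the program output in this section.

```python
def rules(tree, conditions=(), target="playball"):
    """Yield one IF ... THEN ... rule for every leaf of the tree."""
    if not isinstance(tree, tuple):
        # Leaf: tree is a class label; the path so far forms the IF part.
        prefix = "IF " + " AND ".join(conditions) + " " if conditions else ""
        yield f"{prefix}THEN {target} = {tree}"
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from rules(subtree, conditions + (f"{attribute} = {value}",), target)


# Assumed tree for the weather example (non-leaf nodes are attributes).
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rain": ("wind", {"strong": "no", "weak": "yes"}),
})

for rule in rules(tree):
    print(rule)
```

Each non-leaf node contributes one "attribute = value" condition per branch, so the number of rules equals the number of leaves.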
Output of my program:
IF Outlook = [Sunny] IF Humidity = [High] THEN IsPlay = No
Humidity = [ Normal ] THEN IsPlay = Yes
IF Outlook = [Overcast] THEN IsPlay = Yes
IF Outlook = [Rain] IF Wind = [Strong] THEN IsPlay = No
Wind = [Weak] THEN IsPlay = Yes