Author: chen_h
WeChat & QQ: 862251340
WeChat official account: coderpai
Decision trees remain a hot topic in the data science community. ID3 is the most common classic decision tree algorithm, but it has several bottlenecks: attributes must be nominal, the dataset must not contain missing values, and the algorithm tends to overfit. Ross Quinlan, the inventor of ID3, addressed these bottlenecks and created a new algorithm named C4.5. The new algorithm builds more general models: it supports continuous attributes and can handle missing data. Note that some tools, such as Weka, call this algorithm J48; that name actually refers to a reimplementation of C4.5 release 8.
We will build a decision tree for the following dataset, which records the factors behind each decision. If you have worked through the ID3 example above, the dataset will look familiar. The difference is that the Temp. and Humidity columns now hold continuous values instead of nominal ones.
Day | Outlook | Temp. | Humidity | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | 85 | 85 | Weak | No |
2 | Sunny | 80 | 90 | Strong | No |
3 | Overcast | 83 | 78 | Weak | Yes |
4 | Rain | 70 | 96 | Weak | Yes |
5 | Rain | 68 | 80 | Weak | Yes |
6 | Rain | 65 | 70 | Strong | No |
7 | Overcast | 64 | 65 | Strong | Yes |
8 | Sunny | 72 | 95 | Weak | No |
9 | Sunny | 69 | 70 | Weak | Yes |
10 | Rain | 75 | 80 | Weak | Yes |
11 | Sunny | 75 | 70 | Strong | Yes |
12 | Overcast | 72 | 90 | Strong | Yes |
13 | Overcast | 81 | 75 | Weak | Yes |
14 | Rain | 71 | 80 | Strong | No |
We will follow the same procedure as in the ID3 example. First, we compute the global entropy. The table above contains 14 instances: 9 with decision yes and 5 with decision no.
$$Entropy(Decision) = \sum -p(x)\log_2 p(x) = -p(yes)\log_2 p(yes) - p(no)\log_2 p(no) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$
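To double-check the arithmetic, here is a minimal sketch in Python (the `entropy` helper is our own illustration, not part of any particular library):

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class distribution, given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 9 "yes" decisions and 5 "no" decisions out of 14 instances
print(round(entropy([9, 5]), 3))  # → 0.94
```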
In the ID3 algorithm we computed the information gain of every attribute. Here, we need to compute the gain ratio instead of the gain.
$$GainRatio(A) = Gain(A) / SplitInfo(A)$$
$$SplitInfo(A) = -\sum \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$$
Wind factor
Wind is a nominal factor with two possible values: Weak and Strong.
$$Gain(Decision, Wind) = Entropy(Decision) - \sum p(Decision|Wind) \cdot Entropy(Decision|Wind)$$
$$Gain(Decision, Wind) = Entropy(Decision) - p(Decision|Wind=Weak) \cdot Entropy(Decision|Wind=Weak) - p(Decision|Wind=Strong) \cdot Entropy(Decision|Wind=Strong)$$
There are 8 Weak instances: 2 with decision no and 6 with decision yes.
$$Entropy(Decision|Wind=Weak) = -p(No)\log_2 p(No) - p(Yes)\log_2 p(Yes) = -(2/8)\log_2(2/8) - (6/8)\log_2(6/8) = 0.811$$
$$Entropy(Decision|Wind=Strong) = -(3/6)\log_2(3/6) - (3/6)\log_2(3/6) = 1$$
$$Gain(Decision, Wind) = 0.940 - (8/14) \cdot 0.811 - (6/14) \cdot 1 = 0.940 - 0.463 - 0.428 = 0.049$$
The Wind=Weak branch contains 8 instances and the Wind=Strong branch contains 6.
$$SplitInfo(Decision, Wind) = -(8/14)\log_2(8/14) - (6/14)\log_2(6/14) = 0.461 + 0.524 = 0.985$$
$$GainRatio(Decision, Wind) = Gain(Decision, Wind) / SplitInfo(Decision, Wind) = 0.049 / 0.985 = 0.049$$
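The Wind numbers can be verified with a short script (again a sketch with our own `entropy` helper; full-precision arithmetic gives a gain of 0.048, a hair below the hand-rounded 0.049):

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class distribution, given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Wind splits the 14 instances into Weak (2 no / 6 yes) and Strong (3 no / 3 yes)
weak, strong = [2, 6], [3, 3]
n = 14
gain = entropy([9, 5]) - 8 / n * entropy(weak) - 6 / n * entropy(strong)
split_info = -(8 / n) * log2(8 / n) - (6 / n) * log2(6 / n)
print(round(gain, 3), round(split_info, 3), round(gain / split_info, 3))
# → 0.048 0.985 0.049
```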
Outlook factor
Outlook is also a nominal factor; its possible values are Sunny, Overcast and Rain.
$$Gain(Decision, Outlook) = Entropy(Decision) - \sum p(Decision|Outlook) \cdot Entropy(Decision|Outlook)$$
$$Gain(Decision, Outlook) = Entropy(Decision) - p(Decision|Outlook=Sunny) \cdot Entropy(Decision|Outlook=Sunny) - p(Decision|Outlook=Overcast) \cdot Entropy(Decision|Outlook=Overcast) - p(Decision|Outlook=Rain) \cdot Entropy(Decision|Outlook=Rain)$$
There are 5 Sunny instances: 3 with decision no and 2 with decision yes.
$$Entropy(Decision|Outlook=Sunny) = -p(No)\log_2 p(No) - p(Yes)\log_2 p(Yes) = -(3/5)\log_2(3/5) - (2/5)\log_2(2/5) = 0.441 + 0.528 = 0.970$$
$$Entropy(Decision|Outlook=Overcast) = -p(No)\log_2 p(No) - p(Yes)\log_2 p(Yes) = -(0/4)\log_2(0/4) - (4/4)\log_2(4/4) = 0$$
$$Entropy(Decision|Outlook=Rain) = -p(No)\log_2 p(No) - p(Yes)\log_2 p(Yes) = -(2/5)\log_2(2/5) - (3/5)\log_2(3/5) = 0.528 + 0.441 = 0.970$$
$$Gain(Decision, Outlook) = 0.940 - (5/14) \cdot 0.970 - (4/14) \cdot 0 - (5/14) \cdot 0.970 = 0.246$$
There are 5 Sunny instances, 4 Overcast instances and 5 Rain instances.
$$SplitInfo(Decision, Outlook) = -(5/14)\log_2(5/14) - (4/14)\log_2(4/14) - (5/14)\log_2(5/14) = 1.577$$
$$GainRatio(Decision, Outlook) = Gain(Decision, Outlook) / SplitInfo(Decision, Outlook) = 0.246 / 1.577 = 0.155$$
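The same check works for Outlook (a sketch; at full precision the values come out as 0.247, 1.577 and 0.156, matching the hand-rounded figures above to within rounding):

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class distribution, given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Outlook partitions: Sunny (3 no / 2 yes), Overcast (0 / 4), Rain (2 no / 3 yes)
parts = [[3, 2], [0, 4], [2, 3]]
n = 14
gain = entropy([9, 5]) - sum(sum(p) / n * entropy(p) for p in parts)
split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in parts)
print(round(gain, 3), round(split_info, 3), round(gain / split_info, 3))
# → 0.247 1.577 0.156
```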
Humidity factor
Unlike the previous two, Humidity is a continuous factor, so we need to convert its continuous values into nominal ones. The C4.5 algorithm proposes a binary split based on a threshold, and the threshold should be the value that yields the maximum gain for that attribute. Let's focus on the Humidity factor: first, sort the instances by Humidity from smallest to largest.
Day | Humidity | Decision |
---|---|---|
7 | 65 | Yes |
6 | 70 | No |
9 | 70 | Yes |
11 | 70 | Yes |
13 | 75 | Yes |
3 | 78 | Yes |
5 | 80 | Yes |
10 | 80 | Yes |
14 | 80 | No |
1 | 85 | No |
2 | 90 | No |
12 | 90 | Yes |
8 | 95 | No |
4 | 96 | Yes |
Now we iterate over the humidity values, at each step splitting the dataset into the instances less than or equal to the current value and the instances greater than it. We compute the gain (and gain ratio) for each split; the value that maximizes the gain becomes the threshold.
Let's first take 65 as the humidity threshold and compute the gain ratio:
$$Entropy(Decision|Humidity \le 65) = -p(No)\log_2 p(No) - p(Yes)\log_2 p(Yes) = -(0/1)\log_2(0/1) - (1/1)\log_2(1/1) = 0$$
$$Entropy(Decision|Humidity > 65) = -(5/13)\log_2(5/13) - (8/13)\log_2(8/13) = 0.530 + 0.431 = 0.961$$
$$Gain(Decision, Humidity <> 65) = 0.940 - (1/14) \cdot 0 - (13/14) \cdot 0.961 = 0.048$$
The <> notation above denotes the two branches of the binary split (Humidity ≤ 65 and Humidity > 65); it does not mean that humidity is different from 65.
$$SplitInfo(Decision, Humidity <> 65) = -(1/14)\log_2(1/14) - (13/14)\log_2(13/14) = 0.371$$
$$GainRatio(Decision, Humidity <> 65) = 0.126$$
Next, check 70 as the humidity threshold.
$$Entropy(Decision|Humidity \le 70) = -(1/4)\log_2(1/4) - (3/4)\log_2(3/4) = 0.811$$
$$Entropy(Decision|Humidity > 70) = -(4/10)\log_2(4/10) - (6/10)\log_2(6/10) = 0.970$$
$$Gain(Decision, Humidity <> 70) = 0.940 - (4/14) \cdot 0.811 - (10/14) \cdot 0.970 = 0.940 - 0.232 - 0.693 = 0.015$$
$$SplitInfo(Decision, Humidity <> 70) = -(4/14)\log_2(4/14) - (10/14)\log_2(10/14) = 0.863$$
$$GainRatio(Decision, Humidity <> 70) = 0.017$$
Next, check 75 as the humidity threshold.
$$Entropy(Decision|Humidity \le 75) = -(1/5)\log_2(1/5) - (4/5)\log_2(4/5) = 0.721$$
$$Entropy(Decision|Humidity > 75) = -(4/9)\log_2(4/9) - (5/9)\log_2(5/9) = 0.991$$
$$Gain(Decision, Humidity <> 75) = 0.940 - (5/14) \cdot 0.721 - (9/14) \cdot 0.991 = 0.940 - 0.258 - 0.637 = 0.045$$
$$SplitInfo(Decision, Humidity <> 75) = -(5/14)\log_2(5/14) - (9/14)\log_2(9/14) = 0.940$$
$$GainRatio(Decision, Humidity <> 75) = 0.047$$
The procedure should be clear by now, so we skip the intermediate calculations and list the results for the remaining thresholds directly.
$$Gain(Decision, Humidity <> 78) = 0.090, \quad GainRatio(Decision, Humidity <> 78) = 0.090$$
$$Gain(Decision, Humidity <> 80) = 0.101, \quad GainRatio(Decision, Humidity <> 80) = 0.107$$
$$Gain(Decision, Humidity <> 85) = 0.024, \quad GainRatio(Decision, Humidity <> 85) = 0.027$$
$$Gain(Decision, Humidity <> 90) = 0.010, \quad GainRatio(Decision, Humidity <> 90) = 0.016$$
$$Gain(Decision, Humidity <> 95) = 0.048, \quad GainRatio(Decision, Humidity <> 95) = 0.128$$
Since 96 is the largest humidity value in the dataset, a split there would leave one side empty, so we stop here.
As the calculations above show, the gain is maximized when the threshold equals 80 (note that the gain ratio alone would actually peak at 95; C4.5 selects the threshold of a continuous attribute by gain). This means we binarize Humidity at the value 80.
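The whole threshold sweep can be sketched in a few lines (our own helper, not a library API; exact arithmetic confirms that 80 gives the largest gain, about 0.102 versus the hand-rounded 0.101):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(labels.count(c) / total * log2(labels.count(c) / total)
                for c in set(labels))

# (humidity, decision) pairs from the sorted table above
data = [(65, 'Y'), (70, 'N'), (70, 'Y'), (70, 'Y'), (75, 'Y'), (78, 'Y'),
        (80, 'Y'), (80, 'Y'), (80, 'N'), (85, 'N'), (90, 'N'), (90, 'Y'),
        (95, 'N'), (96, 'Y')]
base = entropy([d for _, d in data])

best = None
for t in sorted({h for h, _ in data})[:-1]:  # the largest value cannot split
    left = [d for h, d in data if h <= t]
    right = [d for h, d in data if h > t]
    gain = base - len(left) / 14 * entropy(left) - len(right) / 14 * entropy(right)
    if best is None or gain > best[1]:
        best = (t, gain)

print(best[0], round(best[1], 3))  # → 80 0.102
```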
Let's summarize the gains and gain ratios we computed. The Outlook factor has both the largest gain and the largest gain ratio, which means we should place Outlook at the root node of the decision tree.
Attribute | Gain | GainRatio |
---|---|---|
Wind | 0.049 | 0.049 |
Outlook | 0.246 | 0.155 |
Humidity <> 80 | 0.101 | 0.107 |
After that, we apply the same steps as in ID3 and build the following decision tree, with Outlook at the root node. Now we need to work out a strategy for each of its branches.
Outlook = Sunny
We split Humidity into values greater than 80 and values less than or equal to 80. Strikingly, when Outlook = Sunny, a humidity above 80 always leads to the decision no, and a humidity of 80 or below always leads to yes.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | 85 | Yes | Weak | No |
2 | Sunny | 80 | Yes | Strong | No |
8 | Sunny | 72 | Yes | Weak | No |
9 | Sunny | 69 | No | Weak | Yes |
11 | Sunny | 75 | No | Strong | Yes |
Outlook = Overcast
If Outlook = Overcast, the decision is yes regardless of the other attributes.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
3 | Overcast | 83 | No | Weak | Yes |
7 | Overcast | 64 | No | Strong | Yes |
12 | Overcast | 72 | Yes | Strong | Yes |
13 | Overcast | 81 | No | Weak | Yes |
Outlook = Rain
Finally, consider the Outlook = Rain branch: when Wind = Weak the decision is yes, and when Wind = Strong the decision is no.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
4 | Rain | 70 | Yes | Weak | Yes |
5 | Rain | 68 | No | Weak | Yes |
6 | Rain | 65 | No | Strong | No |
10 | Rain | 75 | No | Weak | Yes |
14 | Rain | 71 | No | Strong | No |
The final form of the decision tree is as follows:
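The final tree can also be written out directly as nested conditions (a sketch; attribute values are passed as plain strings):

```python
def decide(outlook, humidity, wind):
    """The decision tree derived above: Outlook at the root,
    Humidity (threshold 80) under Sunny, Wind under Rain."""
    if outlook == 'Overcast':
        return 'Yes'
    if outlook == 'Sunny':
        return 'Yes' if humidity <= 80 else 'No'
    # outlook == 'Rain'
    return 'Yes' if wind == 'Weak' else 'No'

print(decide('Sunny', 85, 'Weak'))  # Day 1 → No
```

Running every row of the original table through this function reproduces all 14 decisions.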
Conclusion
In summary, the C4.5 algorithm solves most of the problems found in ID3. It uses the gain ratio instead of the plain information gain, which lets it build more general trees without falling into overfitting. It converts continuous attributes into nominal ones by choosing the gain-maximizing threshold, which lets it handle continuous data. Finally, it can either ignore or handle instances with missing values, so it copes with incomplete datasets.