Learning the C4.5 Algorithm Step by Step

Author: chen_h
WeChat & QQ: 862251340
WeChat official account: coderpai


Decision trees remain a popular topic in data science. ID3 is the best-known classical decision tree algorithm, but it has several limitations: attributes must be nominal, the dataset must not contain missing values, and the algorithm tends to overfit. Ross Quinlan, the inventor of ID3, addressed these bottlenecks and created a new algorithm called C4.5. The new algorithm builds more general models: it supports continuous attributes and can cope with missing data. Some tools, such as Weka, call this algorithm J48, which is in fact a reimplementation of C4.5 release 8.

We will build a decision tree for the dataset below, which records the factors behind a decision. If you have studied the ID3 algorithm, the dataset will look familiar; the difference is that the Temperature and Humidity columns now hold continuous values instead of nominal ones.

| Day | Outlook | Temp. | Humidity | Wind | Decision |
|-----|---------|-------|----------|------|----------|
| 1 | Sunny | 85 | 85 | Weak | No |
| 2 | Sunny | 80 | 90 | Strong | No |
| 3 | Overcast | 83 | 78 | Weak | Yes |
| 4 | Rain | 70 | 96 | Weak | Yes |
| 5 | Rain | 68 | 80 | Weak | Yes |
| 6 | Rain | 65 | 70 | Strong | No |
| 7 | Overcast | 64 | 65 | Strong | Yes |
| 8 | Sunny | 72 | 95 | Weak | No |
| 9 | Sunny | 69 | 70 | Weak | Yes |
| 10 | Rain | 75 | 80 | Weak | Yes |
| 11 | Sunny | 75 | 70 | Strong | Yes |
| 12 | Overcast | 72 | 90 | Strong | Yes |
| 13 | Overcast | 81 | 75 | Weak | Yes |
| 14 | Rain | 71 | 80 | Strong | No |

We proceed as in the ID3 example. First we compute the global entropy. The table has 14 instances: 9 with decision yes and 5 with decision no.

$$Entropy(Decision)=\sum-p(x)*log_2{p(x)} = -p(yes)*log_2{p(yes)}-p(no)*log_2{p(no)}=-\frac{9}{14}*log_2(\frac{9}{14})-\frac{5}{14}*log_2(\frac{5}{14})=0.940$$
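The global entropy can be checked with a short snippet (a sketch; the `entropy` helper is our own, not from any library):

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 9 yes vs. 5 no decisions among the 14 instances
print(round(entropy([9, 5]), 3))  # → 0.94
```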

In the ID3 algorithm we computed the information gain for each attribute. Here we compute the gain ratio instead of the raw gain.

$$GainRatio(A)=Gain(A)/SplitInfo(A)$$

$$SplitInfo(A)=-\sum\frac{|D_j|}{|D|}*log_2(\frac{|D_j|}{|D|})$$

The Wind Factor

Wind is a nominal attribute with two values, weak and strong.

$$Gain(Decision, Wind) = Entropy(Decision) - \sum\big(p(Decision|Wind) * Entropy(Decision|Wind)\big)$$

$$Gain(Decision, Wind) = Entropy(Decision) - \big[p(Decision|Wind=Weak) * Entropy(Decision|Wind=Weak)\big] - \big[p(Decision|Wind=Strong) * Entropy(Decision|Wind=Strong)\big]$$

There are 8 weak instances: 2 with decision no and 6 with decision yes.

$$Entropy(Decision|Wind=Weak) = -p(No)*log_2{p(No)} - p(Yes)*log_2{p(Yes)} = -(2/8)*log_2(2/8) - (6/8)*log_2(6/8) = 0.811$$

$$Entropy(Decision|Wind=Strong) = -(3/6)*log_2(3/6) - (3/6)*log_2(3/6) = 1$$

$$Gain(Decision, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1 = 0.940 - 0.463 - 0.428 = 0.049$$

There are 8 decisions for wind = weak and 6 decisions for wind = strong.

$$SplitInfo(Decision, Wind) = -(8/14)*log_2(8/14) - (6/14)*log_2(6/14) = 0.461 + 0.524 = 0.985$$

$$GainRatio(Decision, Wind) = Gain(Decision, Wind) / SplitInfo(Decision, Wind) = 0.049 / 0.985 = 0.049$$
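The whole Wind computation can be reproduced in a few lines (a sketch; helper names are our own):

```python
from math import log2

def entropy(counts):
    """Entropy from raw class counts, skipping empty classes."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

weak, strong = [6, 2], [3, 3]        # [yes, no] counts per wind value
gain = entropy([9, 5]) - 8/14 * entropy(weak) - 6/14 * entropy(strong)
split_info = entropy([8, 6])         # entropy of the partition sizes
print(round(gain / split_info, 3))   # → 0.049
```

(Without intermediate rounding the gain comes out as 0.048 rather than 0.049; the ratio still rounds to 0.049.)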

The Outlook Factor

Outlook is also a nominal attribute, with possible values sunny, overcast, and rain.

$$Gain(Decision, Outlook) = Entropy(Decision) - \sum\big(p(Decision|Outlook) * Entropy(Decision|Outlook)\big)$$

$$Gain(Decision, Outlook) = Entropy(Decision) - p(Decision|Outlook=Sunny) * Entropy(Decision|Outlook=Sunny) - p(Decision|Outlook=Overcast) * Entropy(Decision|Outlook=Overcast) - p(Decision|Outlook=Rain) * Entropy(Decision|Outlook=Rain)$$

There are 5 sunny instances: 3 with decision no and 2 with decision yes.

$$Entropy(Decision|Outlook=Sunny) = -p(No)*log_2{p(No)} - p(Yes)*log_2{p(Yes)} = -(3/5)*log_2(3/5) - (2/5)*log_2(2/5) = 0.441 + 0.529 = 0.970$$

$$Entropy(Decision|Outlook=Overcast) = -p(No)*log_2{p(No)} - p(Yes)*log_2{p(Yes)} = -(0/4)*log_2(0/4) - (4/4)*log_2(4/4) = 0$$

$$Entropy(Decision|Outlook=Rain) = -p(No)*log_2{p(No)} - p(Yes)*log_2{p(Yes)} = -(2/5)*log_2(2/5) - (3/5)*log_2(3/5) = 0.529 + 0.441 = 0.970$$

$$Gain(Decision, Outlook) = 0.940 - (5/14)*0.970 - (4/14)*0 - (5/14)*0.970 = 0.246$$

There are 5 sunny instances, 4 overcast instances, and 5 rain instances.

$$SplitInfo(Decision, Outlook) = -(5/14)*log_2(5/14) - (4/14)*log_2(4/14) - (5/14)*log_2(5/14) = 1.577$$

$$GainRatio(Decision, Outlook) = Gain(Decision, Outlook) / SplitInfo(Decision, Outlook) = 0.246 / 1.577 = 0.155$$
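The same recipe works for any nominal attribute, so it is worth wrapping into a reusable helper (a sketch; the function and its argument layout are our own convention):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_ratio(parent, partitions):
    """parent: global [yes, no] counts; partitions: [yes, no] per value."""
    total = sum(parent)
    gain = entropy(parent) - sum(
        sum(p) / total * entropy(p) for p in partitions)
    return gain / entropy([sum(p) for p in partitions])

# Outlook: sunny 2 yes / 3 no, overcast 4 / 0, rain 3 / 2
print(round(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))
```

(Exact arithmetic gives 0.156; the 0.155 in the text comes from rounded intermediate values.)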

The Humidity Factor

Humidity, by contrast, is a continuous attribute, so we need to convert its continuous values into labels. C4.5 performs a binary split based on a threshold, and the threshold should be the value that yields the maximum gain for this attribute. Let us focus on humidity: first, sort the instances by humidity from smallest to largest.

| Day | Humidity | Decision |
|-----|----------|----------|
| 7 | 65 | Yes |
| 6 | 70 | No |
| 9 | 70 | Yes |
| 11 | 70 | Yes |
| 13 | 75 | Yes |
| 3 | 78 | Yes |
| 5 | 80 | Yes |
| 10 | 80 | Yes |
| 14 | 80 | No |
| 1 | 85 | No |
| 2 | 90 | No |
| 12 | 90 | Yes |
| 8 | 95 | No |
| 4 | 96 | Yes |

Now we iterate over the humidity values, splitting the dataset into the instances less than or equal to the current value and the instances greater than it. At each step we compute the gain (or gain ratio); the value that maximizes the gain becomes the threshold.
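This threshold scan is easy to automate (a sketch; the data pairs are transcribed from the sorted table above):

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

# (humidity, decision) pairs from the sorted table
data = [(65, "Yes"), (70, "No"), (70, "Yes"), (70, "Yes"), (75, "Yes"),
        (78, "Yes"), (80, "Yes"), (80, "Yes"), (80, "No"), (85, "No"),
        (90, "No"), (90, "Yes"), (95, "No"), (96, "Yes")]

base = entropy([d for _, d in data])
best_t, best_gain = None, -1.0
for t in sorted({h for h, _ in data})[:-1]:  # the maximum cannot split
    left = [d for h, d in data if h <= t]
    right = [d for h, d in data if h > t]
    gain = (base - len(left) / 14 * entropy(left)
                 - len(right) / 14 * entropy(right))
    if gain > best_gain:
        best_t, best_gain = t, gain
print(best_t, round(best_gain, 3))  # → 80 0.102
```

(Exact arithmetic gives a gain of 0.102 at threshold 80; the 0.101 stated later in the text reflects rounded intermediates.)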

First assume 65 as the humidity threshold and compute the gain ratio:

$$Entropy(Decision|Humidity \le 65) = -p(No)*log_2{p(No)} - p(Yes)*log_2{p(Yes)} = -(0/1)*log_2(0/1) - (1/1)*log_2(1/1) = 0$$

$$Entropy(Decision|Humidity > 65) = -(5/13)*log_2(5/13) - (8/13)*log_2(8/13) = 0.530 + 0.431 = 0.961$$

$$Gain(Decision, Humidity <> 65) = 0.940 - (1/14)*0 - (13/14)*0.961 = 0.048$$

The <> notation above denotes the two branches of the split, humidity ≤ 65 and humidity > 65; it does not mean humidity ≠ 65.

$$SplitInfo(Decision, Humidity <> 65) = -(1/14)*log_2(1/14) - (13/14)*log_2(13/14) = 0.371$$

$$GainRatio(Decision, Humidity <> 65) = 0.126$$

Next, check 70 as the humidity threshold.

$$Entropy(Decision|Humidity \le 70) = -(1/4)*log_2(1/4) - (3/4)*log_2(3/4) = 0.811$$

$$Entropy(Decision|Humidity > 70) = -(4/10)*log_2(4/10) - (6/10)*log_2(6/10) = 0.970$$

$$Gain(Decision, Humidity <> 70) = 0.940 - (4/14)*0.811 - (10/14)*0.970 = 0.940 - 0.231 - 0.692 = 0.014$$

$$SplitInfo(Decision, Humidity <> 70) = -(4/14)*log_2(4/14) - (10/14)*log_2(10/14) = 0.863$$

$$GainRatio(Decision, Humidity <> 70) = 0.016$$

Next, check 75 as the humidity threshold.

$$Entropy(Decision|Humidity \le 75) = -(1/5)*log_2(1/5) - (4/5)*log_2(4/5) = 0.721$$

$$Entropy(Decision|Humidity > 75) = -(4/9)*log_2(4/9) - (5/9)*log_2(5/9) = 0.991$$

$$Gain(Decision, Humidity <> 75) = 0.940 - (5/14)*0.721 - (9/14)*0.991 = 0.940 - 0.258 - 0.637 = 0.045$$

$$SplitInfo(Decision, Humidity <> 75) = -(5/14)*log_2(5/14) - (9/14)*log_2(9/14) = 0.940$$

$$GainRatio(Decision, Humidity <> 75) = 0.047$$

The pattern should be clear by now, so we skip the intermediate steps and state the remaining results directly.

$$Gain(Decision, Humidity <> 78) = 0.090, \quad GainRatio(Decision, Humidity <> 78) = 0.090$$

$$Gain(Decision, Humidity <> 80) = 0.101, \quad GainRatio(Decision, Humidity <> 80) = 0.107$$

$$Gain(Decision, Humidity <> 85) = 0.024, \quad GainRatio(Decision, Humidity <> 85) = 0.027$$

$$Gain(Decision, Humidity <> 90) = 0.010, \quad GainRatio(Decision, Humidity <> 90) = 0.016$$

$$Gain(Decision, Humidity <> 95) = 0.048, \quad GainRatio(Decision, Humidity <> 95) = 0.128$$

Since 96 is the largest humidity value in the dataset, there is no need to check any higher threshold.

As computed above, the gain is maximized when the threshold equals 80, so we split the humidity attribute at 80.

Let us summarize the gains and gain ratios. Outlook has both the largest gain and the largest gain ratio, which means Outlook goes at the root of the decision tree.

| Attribute | Gain | GainRatio |
|-----------|------|-----------|
| Wind | 0.049 | 0.049 |
| Outlook | 0.246 | 0.155 |
| Humidity <> 80 | 0.101 | 0.107 |

After that, we apply the same steps as in ID3 to build the rest of the tree. Outlook sits at the root node; now we need a strategy for each of its branches.

Outlook = Sunny

We split humidity into greater than 80 and less than or equal to 80. Strikingly, when outlook = sunny and humidity > 80 the decision is always no, and when outlook = sunny and humidity ≤ 80 the decision is always yes.

| Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
|-----|---------|-------|-----------|------|----------|
| 1 | Sunny | 85 | Yes | Weak | No |
| 2 | Sunny | 80 | Yes | Strong | No |
| 8 | Sunny | 72 | Yes | Weak | No |
| 9 | Sunny | 69 | No | Weak | Yes |
| 11 | Sunny | 75 | No | Strong | Yes |

Outlook = Overcast

If outlook = overcast, the decision is always yes regardless of the other attributes.

| Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
|-----|---------|-------|-----------|------|----------|
| 3 | Overcast | 83 | No | Weak | Yes |
| 7 | Overcast | 64 | No | Strong | Yes |
| 12 | Overcast | 72 | Yes | Strong | Yes |
| 13 | Overcast | 81 | No | Weak | Yes |

Outlook = Rain

Finally, consider outlook = rain: if wind = weak the decision is yes, and if wind = strong the decision is no.

| Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
|-----|---------|-------|-----------|------|----------|
| 4 | Rain | 70 | Yes | Weak | Yes |
| 5 | Rain | 68 | No | Weak | Yes |
| 6 | Rain | 65 | No | Strong | No |
| 10 | Rain | 75 | No | Weak | Yes |
| 14 | Rain | 71 | No | Strong | No |
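The three branch tables can be collapsed into a handful of rules (a sketch of the tree derived above; the function name is our own):

```python
def predict(outlook, humidity, wind):
    """Decision rules read directly off the C4.5 tree built above."""
    if outlook == "Overcast":
        return "Yes"                               # overcast is always yes
    if outlook == "Sunny":
        return "No" if humidity > 80 else "Yes"    # split at threshold 80
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"   # weak yes, strong no
    raise ValueError("unknown outlook: " + outlook)

# Day 1 from the dataset: Sunny, humidity 85, weak wind
print(predict("Sunny", 85, "Weak"))  # → No
```

This rule set reproduces the decision column for all 14 training instances.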

The final form of the decision tree is shown below:

[Figure: the final C4.5 decision tree, with Outlook at the root, a humidity ≤ 80 / > 80 split under Sunny, and a Wind split under Rain]

Conclusion

In summary, C4.5 solves most of the problems in ID3. It uses the gain ratio instead of the raw information gain, which produces more general trees that are less prone to overfitting. It converts continuous attributes into labels by choosing the threshold that maximizes the gain, which lets it handle continuous data. In addition, it can handle datasets containing missing values.
