作者:chen_h
微信号 & QQ:862251340
微信公众号:coderpai
只有当人们可以清楚地阅读和理解其决策时,算法才是透明的。 尽管深度学习是当今机器学习的超级明星,但它是一种不透明的算法,我们不知道决策的原因。 在此,决策树算法仍然保持其受欢迎程度,因为它们可以产生透明的决策。 ID3使用信息增益,而C4.5使用增益比进行分割。 这里,CART是一种替代决策树构建算法。 它可以处理分类和回归任务。 此算法使用名为gini index的新度量标准为分类任务创建决策点。 我们将从头开始逐步提及CART决策树示例。
我们将在ID3中处理相同的数据集。 根据 Outlook,Temperature,humidity 和 wind 因子,有14个高尔夫球场决定。
Day | Outlook | Temp. | Humidity | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
3 | Overcast | Hot | High | Weak | Yes |
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
7 | Overcast | Cool | Normal | Strong | Yes |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
10 | Rain | Mild | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
12 | Overcast | Mild | High | Strong | Yes |
13 | Overcast | Hot | Normal | Weak | Yes |
14 | Rain | Mild | High | Strong | No |
基尼指数
基尼指数 (gini index) 是 CART 中分类任务的度量标准。它存储每个类别的平方概率的总和,我们可以得到如下:
G i n i = 1 – Σ ( P i ) 2 f o r i = 1 t o n u m b e r o f c l a s s e s Gini = 1 – Σ (Pi)^2 for i=1 to number of classes Gini=1–Σ(Pi)2fori=1tonumberofclasses
Outlook 因子
Outlook 是一个标签因子。它可以是 sunny,overcast 或者 rain。我们来总结一下 Outlook 的特征属性和决策。
Outlook | Yes | No | Number of instances |
---|---|---|---|
Sunny | 2 | 3 | 5 |
Overcast | 4 | 0 | 4 |
Rain | 3 | 2 | 5 |
G i n i ( O u t l o o k = S u n n y ) = 1 – ( 2 / 5 ) 2 – ( 3 / 5 ) 2 = 1 – 0.16 – 0.36 = 0.48 Gini(Outlook=Sunny) = 1 – (2/5)^2 – (3/5)^2 = 1 – 0.16 – 0.36 = 0.48 Gini(Outlook=Sunny)=1–(2/5)2–(3/5)2=1–0.16–0.36=0.48
G i n i ( O u t l o o k = O v e r c a s t ) = 1 – ( 4 / 4 ) 2 – ( 0 / 4 ) 2 = 0 Gini(Outlook=Overcast) = 1 – (4/4)^2 – (0/4)^2 = 0 Gini(Outlook=Overcast)=1–(4/4)2–(0/4)2=0
G i n i ( O u t l o o k = R a i n ) = 1 – ( 3 / 5 ) 2 – ( 2 / 5 ) 2 = 1 – 0.36 – 0.16 = 0.48 Gini(Outlook=Rain) = 1 – (3/5)^2 – (2/5)^2 = 1 – 0.36 – 0.16 = 0.48 Gini(Outlook=Rain)=1–(3/5)2–(2/5)2=1–0.36–0.16=0.48
然后,我们将计算 Outlook 特征的基尼指数的加权和。
G i n i ( O u t l o o k ) = ( 5 / 14 ) x 0.48 + ( 4 / 14 ) x 0 + ( 5 / 14 ) x 0.48 = 0.171 + 0 + 0.171 = 0.342 Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342 Gini(Outlook)=(5/14)x0.48+(4/14)x0+(5/14)x0.48=0.171+0+0.171=0.342
Temperature 因子
同样,Temperature 也是一个标签因子,它可能有3个不同的值:cool,hot 和 mild。让我们总结 Temperature 特征的决策。
Temperature | Yes | No | Number of instances |
---|---|---|---|
Hot | 2 | 2 | 4 |
Cool | 3 | 1 | 4 |
Mild | 4 | 2 | 6 |
G i n i ( T e m p = H o t ) = 1 – ( 2 / 4 ) 2 – ( 2 / 4 ) 2 = 0.5 Gini(Temp=Hot) = 1 – (2/4)^2 – (2/4)^2 = 0.5 Gini(Temp=Hot)=1–(2/4)2–(2/4)2=0.5
G i n i ( T e m p = C o o l ) = 1 – ( 3 / 4 ) 2 – ( 1 / 4 ) 2 = 1 – 0.5625 – 0.0625 = 0.375 Gini(Temp=Cool) = 1 – (3/4)^2 – (1/4)^2 = 1 – 0.5625 – 0.0625 = 0.375 Gini(Temp=Cool)=1–(3/4)2–(1/4)2=1–0.5625–0.0625=0.375
G i n i ( T e m p = M i l d ) = 1 – ( 4 / 6 ) 2 – ( 2 / 6 ) 2 = 1 – 0.444 – 0.111 = 0.445 Gini(Temp=Mild) = 1 – (4/6)^2 – (2/6)^2 = 1 – 0.444 – 0.111 = 0.445 Gini(Temp=Mild)=1–(4/6)2–(2/6)2=1–0.444–0.111=0.445
我们将计算 Temperature 特征的基尼指数的加权和。
G i n i ( T e m p ) = ( 4 / 14 ) x 0.5 + ( 4 / 14 ) x 0.375 + ( 6 / 14 ) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439 Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439 Gini(Temp)=(4/14)x0.5+(4/14)x0.375+(6/14)x0.445=0.142+0.107+0.190=0.439
Humidity 因子
humidity 因子是二元特征,它可以是 high 或者 normal。
Humidity | Yes | No | Number of instances |
---|---|---|---|
High | 3 | 4 | 7 |
Normal | 6 | 1 | 7 |
G i n i ( H u m i d i t y = H i g h ) = 1 – ( 3 / 7 ) 2 – ( 4 / 7 ) 2 = 1 – 0.183 – 0.326 = 0.489 Gini(Humidity=High) = 1 – (3/7)^2 – (4/7)^2 = 1 – 0.183 – 0.326 = 0.489 Gini(Humidity=High)=1–(3/7)2–(4/7)2=1–0.183–0.326=0.489
G i n i ( H u m i d i t y = N o r m a l ) = 1 – ( 6 / 7 ) 2 – ( 1 / 7 ) 2 = 1 – 0.734 – 0.02 = 0.244 Gini(Humidity=Normal) = 1 – (6/7)^2 – (1/7)^2 = 1 – 0.734 – 0.02 = 0.244 Gini(Humidity=Normal)=1–(6/7)2–(1/7)2=1–0.734–0.02=0.244
我们将计算 Humidity 特征的基尼指数的加权和。
G i n i ( H u m i d i t y ) = ( 7 / 14 ) ∗ 0.489 + ( 7 / 14 ) ∗ 0.244 = 0.367 Gini(Humidity) = (7/14) * 0.489 + (7/14) * 0.244 = 0.367 Gini(Humidity)=(7/14)∗0.489+(7/14)∗0.244=0.367
wind 因子
wind 也是一个二元标签,它可以是 weak 或者 strong。
Wind | Yes | No | Number of instances |
---|---|---|---|
Weak | 6 | 2 | 8 |
Strong | 3 | 3 | 6 |
G i n i ( W i n d = W e a k ) = 1 – ( 6 / 8 ) 2 – ( 2 / 8 ) 2 = 1 – 0.5625 – 0.062 = 0.375 Gini(Wind=Weak) = 1 – (6/8)^2 – (2/8)^2 = 1 – 0.5625 – 0.062 = 0.375 Gini(Wind=Weak)=1–(6/8)2–(2/8)2=1–0.5625–0.062=0.375
G i n i ( W i n d = S t r o n g ) = 1 – ( 3 / 6 ) 2 – ( 3 / 6 ) 2 = 1 – 0.25 – 0.25 = 0.5 Gini(Wind=Strong) = 1 – (3/6)^2 – (3/6)^2 = 1 – 0.25 – 0.25 = 0.5 Gini(Wind=Strong)=1–(3/6)2–(3/6)2=1–0.25–0.25=0.5
G i n i ( W i n d ) = ( 8 / 14 ) ∗ 0.375 + ( 6 / 14 ) ∗ 0.5 = 0.428 Gini(Wind) = (8/14) * 0.375 + (6/14) * 0.5 = 0.428 Gini(Wind)=(8/14)∗0.375+(6/14)∗0.5=0.428
决策树构建
我们计算每个特征的基尼指数的值,根节点是 Outlook ,因为基尼指数最低。
Feature | Gini index |
---|---|
Outlook | 0.342 |
Temperature | 0.439 |
Humidity | 0.367 |
Wind | 0.428 |
我们可以得到决策树为:
你可能从图中也注意到了,如果是 overcast ,那么决策肯定是 yes,也就是说 overcast 叶子节点已经结束分裂了。
我们将在以下步骤中将上面相同的原则应用于这些子数据集。
我们来集中注意力来看看 Outlook = sunny 的情况,我们需要分别计算 Temperature,humidity 和 wind 因子的基尼指数得分。
Day | Outlook | Temp. | Humidity | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
Outlook=sunny时,Temp 的基尼指数
Temperature | Yes | No | Number of instances |
---|---|---|---|
Hot | 0 | 2 | 2 |
Cool | 1 | 0 | 1 |
Mild | 1 | 1 | 2 |
G i n i ( O u t l o o k = S u n n y a n d T e m p . = H o t ) = 1 – ( 0 / 2 ) 2 – ( 2 / 2 ) 2 = 0 Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)^2 – (2/2)^2 = 0 Gini(Outlook=SunnyandTemp.=Hot)=1–(0/2)2–(2/2)2=0
G i n i ( O u t l o o k = S u n n y a n d T e m p . = C o o l ) = 1 – ( 1 / 1 ) 2 – ( 0 / 1 ) 2 = 0 Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)^2 – (0/1)^2 = 0 Gini(Outlook=SunnyandTemp.=Cool)=1–(1/1)2–(0/1)2=0
G i n i ( O u t l o o k = S u n n y a n d T e m p . = M i l d ) = 1 – ( 1 / 2 ) 2 – ( 1 / 2 ) 2 = 1 – 0.25 – 0.25 = 0.5 Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)^2 – (1/2)^2 = 1 – 0.25 – 0.25 = 0.5 Gini(Outlook=SunnyandTemp.=Mild)=1–(1/2)2–(1/2)2=1–0.25–0.25=0.5
G i n i ( O u t l o o k = S u n n y a n d T e m p . ) = ( 2 / 5 ) ∗ 0 + ( 1 / 5 ) ∗ 0 + ( 2 / 5 ) ∗ 0.5 = 0.2 Gini(Outlook=Sunny and Temp.) = (2/5)*0 + (1/5)*0 + (2/5)*0.5 = 0.2 Gini(Outlook=SunnyandTemp.)=(2/5)∗0+(1/5)∗0+(2/5)∗0.5=0.2
outlook=sunny 时,humidity 的基尼指数
Humidity | Yes | No | Number of instances |
---|---|---|---|
High | 0 | 3 | 3 |
Normal | 2 | 0 | 2 |
G i n i ( O u t l o o k = S u n n y a n d H u m i d i t y = H i g h ) = 1 – ( 0 / 3 ) 2 – ( 3 / 3 ) 2 = 0 Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)^2 – (3/3)^2 = 0 Gini(Outlook=SunnyandHumidity=High)=1–(0/3)2–(3/3)2=0
G i n i ( O u t l o o k = S u n n y a n d H u m i d i t y = N o r m a l ) = 1 – ( 2 / 2 ) 2 – ( 0 / 2 ) 2 = 0 Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)^2 – (0/2)^2 = 0 Gini(Outlook=SunnyandHumidity=Normal)=1–(2/2)2–(0/2)2=0
G i n i ( O u t l o o k = S u n n y a n d H u m i d i t y ) = ( 3 / 5 ) ∗ 0 + ( 2 / 5 ) ∗ 0 = 0 Gini(Outlook=Sunny and Humidity) = (3/5)*0 + (2/5)*0 = 0 Gini(Outlook=SunnyandHumidity)=(3/5)∗0+(2/5)∗0=0
outlook = sunny 时,wind 的基尼指数
Wind | Yes | No | Number of instances |
---|---|---|---|
Weak | 1 | 2 | 3 |
Strong | 1 | 1 | 2 |
G i n i ( O u t l o o k = S u n n y a n d W i n d = W e a k ) = 1 – ( 1 / 3 ) 2 – ( 2 / 3 ) 2 = 0.266 Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)^2 – (2/3)^2 = 0.266 Gini(Outlook=SunnyandWind=Weak)=1–(1/3)2–(2/3)2=0.266
G i n i ( O u t l o o k = S u n n y a n d W i n d = S t r o n g ) = 1 − ( 1 / 2 ) 2 – ( 1 / 2 ) 2 = 0.2 Gini(Outlook=Sunny and Wind=Strong) = 1- (1/2)^2 – (1/2)^2 = 0.2 Gini(Outlook=SunnyandWind=Strong)=1−(1/2)2–(1/2)2=0.2
G i n i ( O u t l o o k = S u n n y a n d W i n d ) = ( 3 / 5 ) ∗ 0.266 + ( 2 / 5 ) ∗ 0.2 = 0.466 Gini(Outlook=Sunny and Wind) = (3/5)*0.266 + (2/5)*0.2 = 0.466 Gini(Outlook=SunnyandWind)=(3/5)∗0.266+(2/5)∗0.2=0.466
确定 outlook=sunny 时的分支
我们在上面计算了 outlook = sunny 时的各个特征的基尼指数,发现 humidity 的基尼指数是最低的,所以选择 humidity 作为节点。
Feature | Gini index |
---|---|
Temperature | 0.2 |
Humidity | 0 |
Wind | 0.466 |
我们将 humidity 作为一个节点。
如图所示,对于 humidity = high 的情况,决策永远是 no。另一方面,对于 humidity=normal 的情况,决策永远是 yes,所以这个分支就结束了。
接下来,我们来看看 outlook=rain 的情况。
outlook=rain
Day | Outlook | Temp. | Humidity | Wind | Decision |
---|---|---|---|---|---|
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
10 | Rain | Mild | Normal | Weak | Yes |
14 | Rain | Mild | High | Strong | No |
当 outlook=rain 时,我们来计算 Temp, Humidity 和 wind 的基尼指数得分。
outlook=rain 时,temprature 的基尼指数
Temperature | Yes | No | Number of instances |
---|---|---|---|
Cool | 1 | 1 | 2 |
Mild | 2 | 1 | 3 |
G i n i ( O u t l o o k = R a i n a n d T e m p . = C o o l ) = 1 – ( 1 / 2 ) 2 – ( 1 / 2 ) 2 = 0.5 Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)^2 – (1/2)^2 = 0.5 Gini(Outlook=RainandTemp.=Cool)=1–(1/2)2–(1/2)2=0.5
G i n i ( O u t l o o k = R a i n a n d T e m p . = M i l d ) = 1 – ( 2 / 3 ) 2 – ( 1 / 3 ) 2 = 0.444 Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)^2 – (1/3)^2 = 0.444 Gini(Outlook=RainandTemp.=Mild)=1–(2/3)2–(1/3)2=0.444
G i n i ( O u t l o o k = R a i n a n d T e m p . ) = ( 2 / 5 ) ∗ 0.5 + ( 3 / 5 ) ∗ 0.444 = 0.466 Gini(Outlook=Rain and Temp.) = (2/5)*0.5 + (3/5)*0.444 = 0.466 Gini(Outlook=RainandTemp.)=(2/5)∗0.5+(3/5)∗0.444=0.466
outlook=rain 时,humidity 的基尼指数
Humidity | Yes | No | Number of instances |
---|---|---|---|
High | 1 | 1 | 2 |
Normal | 2 | 1 | 3 |
G i n i ( O u t l o o k = R a i n a n d H u m i d i t y = H i g h ) = 1 – ( 1 / 2 ) 2 – ( 1 / 2 ) 2 = 0.5 Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)^2 – (1/2)^2 = 0.5 Gini(Outlook=RainandHumidity=High)=1–(1/2)2–(1/2)2=0.5
G i n i ( O u t l o o k = R a i n a n d H u m i d i t y = N o r m a l ) = 1 – ( 2 / 3 ) 2 – ( 1 / 3 ) 2 = 0.444 Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)^2 – (1/3)^2 = 0.444 Gini(Outlook=RainandHumidity=Normal)=1–(2/3)2–(1/3)2=0.444
G i n i ( O u t l o o k = R a i n a n d H u m i d i t y ) = ( 2 / 5 ) ∗ 0.5 + ( 3 / 5 ) ∗ 0.444 = 0.466 Gini(Outlook=Rain and Humidity) = (2/5)*0.5 + (3/5)*0.444 = 0.466 Gini(Outlook=RainandHumidity)=(2/5)∗0.5+(3/5)∗0.444=0.466
outlook=rain 时,wind 的基尼指数
Wind | Yes | No | Number of instances |
---|---|---|---|
Weak | 3 | 0 | 3 |
Strong | 0 | 2 | 2 |
G i n i ( O u t l o o k = R a i n a n d W i n d = W e a k ) = 1 – ( 3 / 3 ) 2 – ( 0 / 3 ) 2 = 0 Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)^2 – (0/3)^2 = 0 Gini(Outlook=RainandWind=Weak)=1–(3/3)2–(0/3)2=0
G i n i ( O u t l o o k = R a i n a n d W i n d = S t r o n g ) = 1 – ( 0 / 2 ) 2 – ( 2 / 2 ) 2 = 0 Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)^2 – (2/2)^2 = 0 Gini(Outlook=RainandWind=Strong)=1–(0/2)2–(2/2)2=0
G i n i ( O u t l o o k = R a i n a n d W i n d ) = ( 3 / 5 ) ∗ 0 + ( 2 / 5 ) ∗ 0 = 0 Gini(Outlook=Rain and Wind) = (3/5)*0 + (2/5)*0 = 0 Gini(Outlook=RainandWind)=(3/5)∗0+(2/5)∗0=0
outlook=rain 时的决策过程
当 outlook=rain 时,我们发现 wind 特征的基尼指数是最小的,所以我们选择它作为我们的分节点。
Feature | Gini index |
---|---|
Temperature | 0.466 |
Humidity | 0.466 |
Wind | 0 |
接下里,我们可以重新画我们的树结构了。
如图所示,当 wind=weak 时,决策一直是 yes。当 wind=strong 时,决策一直是 no。这就意味着决策到此结束了。
因此,决策树的构建已经结束了。我们手工制作了这颗决策树。顺便说一下,你可能会发现这个跟我们在 ID3 例子中创建了完全相同的树。这并不意味着 ID3 和 CART 算法始终生成的树是相同的。我们从这三篇教程中发现,CART 比 ID3 和 C4.5 都简单,不是吗???