
# C4.5

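## Decision tree construction

1. C4.5 addresses the weakness of information gain (IG), which favors attributes with many distinct values.
2. C4.5 handles continuous variables.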

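### The weakness of IG and its fix

C4.5 selects splits by the information gain ratio (IGR) instead of the raw information gain:

${IGR = \frac{IG}{IV}}$ (where IG is the information gain and IV is the split information)

${IV = -\sum_i{p(v_i)}\log_2{p(v_i)}}$ (where ${v_i}$ is the i-th branch value of a given feature attribute, and ${p(v_i)}$ is the fraction of samples that fall into that branch)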
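The two formulas above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code from the referenced repository; the function names `entropy`, `split_info`, `info_gain`, and `gain_ratio` are hypothetical and chosen here for readability.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def split_info(feature_values):
    """IV: entropy of the branch sizes produced by the feature itself."""
    return entropy(feature_values)


def info_gain(feature_values, labels):
    """IG: label entropy minus the weighted label entropy of each branch."""
    n = len(labels)
    branches = {}
    for v, y in zip(feature_values, labels):
        branches.setdefault(v, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in branches.values())
    return entropy(labels) - conditional


def gain_ratio(feature_values, labels):
    """IGR = IG / IV; returns 0.0 when the feature has a single value (IV = 0)."""
    iv = split_info(feature_values)
    return info_gain(feature_values, labels) / iv if iv > 0 else 0.0


labels = ["y", "y", "n", "n"]
print(gain_ratio(["a", "a", "b", "b"], labels))  # two branches
print(gain_ratio(["1", "2", "3", "4"], labels))  # ID-like feature, one branch per sample
```

Both toy features separate the labels perfectly, so their IG is identical, but the ID-like feature has a larger IV and therefore a smaller IGR. That is the penalty on many-valued attributes that the ratio is meant to provide.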

## Training data set

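| Day | OutLook | Temperature | Humidity | Wind | PlayGolf |
|---|---|---|---|---|---|
| 1 | Sunny | 85 | 85 | False | No |
| 2 | Sunny | 80 | 90 | True | No |
| 3 | Overcast | 83 | 78 | False | Yes |
| 4 | Rainy | 70 | 96 | False | Yes |
| 5 | Rainy | 68 | 80 | False | Yes |
| 6 | Rainy | 65 | 70 | True | No |
| 7 | Overcast | 64 | 65 | True | Yes |
| 8 | Sunny | 72 | 95 | False | No |
| 9 | Sunny | 69 | 70 | False | Yes |
| 10 | Rainy | 75 | 80 | False | Yes |
| 11 | Sunny | 75 | 70 | True | Yes |
| 12 | Overcast | 72 | 90 | True | Yes |
| 13 | Overcast | 81 | 75 | False | Yes |
| 14 | Rainy | 71 | 80 | True | No |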

## Calculation steps

### IV(T) & IGR(T)

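Number of samples for each OutLook value, counted from the training data:

| Sunny | Overcast | Rainy |
|---|---|---|
| 5 | 4 | 5 |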

${IV(OutLook) = -\frac{5}{14}\log_2{\frac{5}{14}} - \frac{4}{14}\log_2{\frac{4}{14}} - \frac{5}{14}\log_2{\frac{5}{14}} = 1.577406}$

${IGR(OutLook) = \frac{IG}{IV} = \frac{0.24675}{1.577406} = 0.156428}$
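The numbers above can be checked with a short script. This is a minimal sketch that hard-codes the OutLook and PlayGolf columns of the training data; the variable and function names are illustrative assumptions, not taken from the original code.

```python
import math
from collections import Counter

# OutLook and PlayGolf columns, days 1 to 14, copied from the training data.
outlook = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
           "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


# IV(OutLook): entropy of the branch sizes (5 Sunny, 4 Overcast, 5 Rainy).
iv = entropy(outlook)

# IG(OutLook): label entropy minus the weighted entropy inside each branch.
n = len(play)
conditional = 0.0
for value in set(outlook):
    branch = [y for x, y in zip(outlook, play) if x == value]
    conditional += len(branch) / n * entropy(branch)
ig = entropy(play) - conditional

igr = ig / iv
print(f"IV={iv:.6f} IG={ig:.5f} IGR={igr:.6f}")
```

The printed values should agree with the hand calculation to the precision shown in the formulas.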

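### Thresholds for continuous variables

Sort the samples by Temperature and pair each value with its PlayGolf label:

[[64, Yes], [65, No], [68, Yes], [69, Yes], [70, Yes], [71, No], [72, No], [72, Yes], [75, Yes], [75, Yes], [80, No], [81, Yes], [83, Yes], [85, No]]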

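Splitting after the fifth sorted sample (Temperature ≤ 70 versus Temperature > 70) gives [Yes, No] counts of [4, 1] and [5, 4]. Here IV([a, b]) denotes the entropy of a partition with a Yes and b No samples, and the value for the split is the size-weighted sum over both partitions:

${IV(v_4) = IV([4, 1], [5, 4]) = \frac{5}{14}{IV([4, 1])} + \frac{9}{14}{IV([5, 4])}}$

${IV(v_4) = \frac{5}{14}{(-\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5})} + \frac{9}{14}{(-\frac{5}{9}\log_2\frac{5}{9} - \frac{4}{9}\log_2\frac{4}{9})} \approx 0.895}$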
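The weighted entropy of a binary split on Temperature can be reproduced as follows. This is a hedged sketch: `split_entropy` is a hypothetical helper, and using "≤ threshold" for the left partition is an assumption consistent with the [4, 1] and [5, 4] counts in the formula.

```python
import math

# (Temperature, PlayGolf) pairs, sorted by temperature as in the list above.
samples = [(64, "Yes"), (65, "No"), (68, "Yes"), (69, "Yes"), (70, "Yes"),
           (71, "No"), (72, "No"), (72, "Yes"), (75, "Yes"), (75, "Yes"),
           (80, "No"), (81, "Yes"), (83, "Yes"), (85, "No")]


def entropy_from_counts(counts):
    """Entropy (base 2) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def split_entropy(pairs, threshold):
    """Size-weighted entropy of the partitions x <= threshold and x > threshold."""
    left = [y for x, y in pairs if x <= threshold]
    right = [y for x, y in pairs if x > threshold]
    n = len(pairs)
    result = 0.0
    for part in (left, right):
        if part:  # skip an empty side, e.g. threshold at the maximum value
            counts = [part.count("Yes"), part.count("No")]
            result += len(part) / n * entropy_from_counts(counts)
    return result


# Threshold 70 puts [4 Yes, 1 No] on the left and [5 Yes, 4 No] on the right.
value_at_70 = split_entropy(samples, 70)
print(round(value_at_70, 3))

# Evaluate every distinct temperature as a candidate threshold.
for t in sorted({x for x, _ in samples}):
    print(t, round(split_entropy(samples, t), 4))
```

A lower weighted entropy means a higher information gain for that threshold, so a loop like this one is one way to compare candidate thresholds for a continuous attribute.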

# Ref

• http://blog.csdn.net/acdreamers/article/details/44664571
• *The Top Ten Algorithms in Data Mining* (《数据挖掘十大算法》)

• https://github.com/MachineLeanring/MachineLearningC4.5
