A Brief Overview of the ID3 and C4.5 Decision Tree Classification Algorithms

Latest recommended article published on 2024-04-30 13:14:31

weixin_33724659

Tags: data structures and algorithms

Original link: https://yq.aliyun.com/articles/91215


Let's begin with the ID3 decision tree.
The ID3 algorithm grows a decision tree by choosing, at each node, the attribute that yields the largest information gain. The information gain of an attribute $A$ is defined as

$$\mathrm{Gain}(A) = I(s_1, s_2, \dots, s_m) - E(A)$$

where $I$ is the information entropy of a given sample set,

$$I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
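As a minimal sketch of the entropy formula, assuming the sample set is given as a plain Python list of class labels (the function name `entropy` is chosen here for illustration and is not from the original article):

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)  # s_i for every class that actually occurs
    # p_i = s_i / |S|; classes with zero samples never appear in the Counter,
    # so log2(0) is never evaluated.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two classes with equal frequency carry exactly one bit of uncertainty.
print(entropy(["yes", "yes", "no", "no"]))  # 1.0
```

A set whose samples all share one class has entropy zero, which is the stopping condition ID3 aims for in every leaf.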

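$E(A)$ is the expected information entropy of the subsets obtained by partitioning the samples on attribute $A$, which takes the values $(a_1, a_2, \dots, a_v)$:

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \dots + s_{mj}}{s}\, I(s_{1j}, s_{2j}, \dots, s_{mj})$$

Here $s_{ij}$ is the number of samples of class $C_i$ in the subset $S_j$ where $A = a_j$, and $s = |S|$ is the total number of samples.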

Moreover, $p_i$ is the probability of a sample belonging to class $C_i$, which can be estimated as $p_i = \frac{s_i}{|S|}$, and $p_{ij}$ is the probability of a sample belonging to class $C_i$ given the attribute value $A = a_j$, i.e. $p_{ij} = \frac{s_{ij}}{|S_j|}$.
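The definitions above can be sketched in a few lines of Python. This is only an illustration under assumptions: each sample is a dict mapping attribute names to values, and the toy data set, the attribute names `outlook` and `windy`, and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """I(s_1, ..., s_m): Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def expected_entropy(rows, labels, attr):
    """E(A): entropy of each subset S_j, weighted by |S_j| / |S|."""
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[attr]].append(label)  # group labels by the value a_j
    total = len(labels)
    return sum(len(sub) / total * entropy(sub) for sub in subsets.values())


def gain(rows, labels, attr):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    return entropy(labels) - expected_entropy(rows, labels, attr)


# Hypothetical toy data: "outlook" determines the class, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

print(gain(rows, labels, "outlook"))  # 1.0
print(gain(rows, labels, "windy"))    # 0.0
```

Splitting on `outlook` leaves two pure subsets, so all of the original one bit of entropy is gained; splitting on `windy` leaves both subsets as mixed as the whole set, so nothing is gained.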
The ID3 algorithm can be summarized as follows:

1. For every attribute $A$, calculate its information gain $\mathrm{Gain}(A)$.
2. Pick the attribute with the largest $\mathrm{Gain}(A)$ as the root node or internal node.
3. Remove the chosen attribute $A$ from the candidates, and for every value $a_j$ of $A$, determine the next node to grow on the corresponding subset.
4. Repeat steps 1 to 3 until each subset contains only one label/class $C_i$.
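The steps above can be sketched as a short recursive function. This is a simplified illustration, not a reference implementation: samples are assumed to be dicts of categorical attributes, the tree is returned as nested dicts, and the majority-vote fallback used when the attributes run out is an added assumption that the steps do not spell out.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain(rows, labels, attr):
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[attr]].append(label)
    expected = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - expected


def id3(rows, labels, attrs):
    """Return a leaf label or a nested dict {attribute: {value: subtree}}."""
    # Step 4: stop when the subset contains a single class.
    if len(set(labels)) == 1:
        return labels[0]
    # Assumption: with no attributes left, fall back to the majority class.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1 and 2: pick the attribute with the largest information gain.
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    # Step 3: drop the chosen attribute and recurse on each value's subset.
    remaining = [a for a in attrs if a != best]
    groups = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        groups[row[best]][0].append(row)
        groups[row[best]][1].append(label)
    return {best: {value: id3(sub_rows, sub_labels, remaining)
                   for value, (sub_rows, sub_labels) in groups.items()}}


# Hypothetical toy data set.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

tree = id3(rows, labels, ["outlook", "windy"])
print(tree)  # {'outlook': {'sunny': 'play', 'rainy': 'stay'}}
```

Because `outlook` already separates the classes perfectly, the tree stops after one split and `windy` is never used.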

ID3 is an early machine learning algorithm based on information entropy, created in 1979. However, it has several problems:

ID3 favors attributes with many distinct values, even when such an attribute is not the best choice.
ID3 has to calculate the information entropy of every value of every attribute. It therefore often produces trees with many levels and many branches that each cover only a small fraction of the samples, so it tends to overfit the training set and classify the test set poorly.
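The first problem can be illustrated with a small experiment. In the sketch below, which uses a hypothetical toy data set, the attribute `row_id` is unique for every sample and therefore carries no predictive value for new data, yet it reaches the maximum possible information gain because every subset it produces is trivially pure.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain(rows, labels, attr):
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[attr]].append(label)
    expected = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - expected


# Hypothetical data: "row_id" is a unique identifier, "windy" is a real feature.
rows = [
    {"row_id": 1, "windy": "no"},
    {"row_id": 2, "windy": "no"},
    {"row_id": 3, "windy": "no"},
    {"row_id": 4, "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gain_id = gain(rows, labels, "row_id")
gain_windy = gain(rows, labels, "windy")

# Every row_id subset holds one sample, so E(row_id) = 0 and the gain
# equals the full entropy of the label set.
print(gain_id == entropy(labels))  # True
print(gain_id > gain_windy)        # True
```

ID3 would therefore split on `row_id` first and produce one leaf per sample, which memorizes the training data instead of generalizing from it.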

C4.5 decision tree
The C4.5 algorithm uses the Gain Ratio instead of the Gain to select attributes.

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}$$

where $\mathrm{Gain}(S, A)$ is simply $\mathrm{Gain}(A)$ from ID3, and $\mathrm{SplitInfo}(S, A)$ is defined as

$$\mathrm{SplitInfo}(S, A) = -\sum_{i=1}^{c} \frac{|s_i|}{|S|} \log_2\left(\frac{|s_i|}{|S|}\right)$$

in which $s_1$ to $s_c$ are the sample subsets obtained by dividing $S$ on the $c$ values of attribute $A$.
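A minimal sketch of the gain ratio follows, again on a hypothetical toy data set. SplitInfo is the entropy of the attribute's own value distribution, so the code reuses the entropy helper for it. Returning 0 when SplitInfo is zero (an attribute with a single value) is an assumption made here to avoid division by zero; the formula itself does not cover that case.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain(rows, labels, attr):
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[attr]].append(label)
    expected = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - expected


def split_info(rows, attr):
    """SplitInfo(S, A): entropy of the partition sizes |s_i| / |S|."""
    return entropy([row[attr] for row in rows])


def gain_ratio(rows, labels, attr):
    si = split_info(rows, attr)
    if si == 0:  # assumption: a single-valued attribute gets ratio 0
        return 0.0
    return gain(rows, labels, attr) / si


# Hypothetical data: 8 samples, "outlook" predicts the class perfectly,
# "row_id" is a unique identifier.
rows = [{"row_id": i, "outlook": "sunny" if i < 4 else "rainy"}
        for i in range(8)]
labels = ["play"] * 4 + ["stay"] * 4

# Both attributes have the same information gain ...
print(gain(rows, labels, "outlook"), gain(rows, labels, "row_id"))  # 1.0 1.0
# ... but SplitInfo penalizes the 8-way split (log2(8) = 3 bits).
print(gain_ratio(rows, labels, "outlook"))  # 1.0
print(gain_ratio(rows, labels, "row_id") < gain_ratio(rows, labels, "outlook"))  # True
```

Plain information gain cannot tell the two attributes apart here, while the gain ratio prefers the two-valued `outlook` over the many-valued identifier.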
