ID3决策树与C4.5决策树分类算法简述

Let’s begin with ID3 decision tree:
The ID3 algorithm tries to get the most information gain when grow the decision trees. The information gain is defined as

Gain(A)=I(s1,s2,,sm)E(A)

where I is the information entropy of a given sample setting,
I(s1,s2,,sm)=i=1mpilog2(pi)

E(A) is the information entropy of the subset classified by attribute A=(a1,a2,,av),
E(A)=j=1vsij+s2j++smjsI(s1,s2,,sm)

Moreover, pi is the probability of an sample belonging to class Ci, which can be estimated as pi=si|S| and pij is the probability an sample belonging to class Ci with attribute A=aj, i.e. pij+sij|Sj|.
ID3 algorithm can be simplified as follows:
  1. For every attribute A, we calculate its information gain E(A).
  2. Pick up the attribute who is of the largest E(A) as the root node or internal node.
  3. Get rid of the grown attribute A, and for every value aj of attribute A, calculate the next node to be grown.
  4. Keep steps 1~3 until each subset has only one label/class Ci.

ID3 algorithm is an old machine learning algorithm created in 1979 based on information entropy, however, there are several problems of it:

  1. ID3 prefers the attribute with more values, though it turns out not to be the optimal one.
  2. ID3 has to calculate the information entropy of every value of every attribute. Hence it always leads to many levels and branches with very little probability, as a result of which it tends to overfit classification in the test set.

C4.5 decision tree
C4,.5 algorithm makes use of Grain Ratio instead of Gain to select attributes.

GainRatio(S,A)=Gain(S,A)SplitInfo(S,A)

where Gain(S,A) is nothing more than Gain(A) in ID3, and SplitInfo(S,A) is defined as
SplitInfo(S,A)=i=1c|si||S|log2(|S||si|)

in which si to sc are the sample sets divided by c values of attribute A.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值