Let's begin with the ID3 decision tree:
The ID3 algorithm tries to obtain the largest information gain when growing the decision tree. The information gain is defined as
$$\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$
where $I$ is the information entropy of a given sample set, and $E(A)$ is the expected information entropy of the subsets obtained by splitting on attribute $A$ with values $(a_1, a_2, \ldots, a_v)$:
$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s}\, I(s_{1j}, s_{2j}, \ldots, s_{mj})$$
Moreover, $p_i$ is the probability of a sample belonging to class $C_i$, which can be estimated as $p_i = s_i / |S|$, and $p_{ij}$ is the probability of a sample belonging to class $C_i$ given attribute value $A = a_j$, i.e. $p_{ij} = s_{ij} / |S_j|$. These probabilities enter through the entropy term $I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i$.
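To make these definitions concrete, here is a minimal Python sketch of the entropy and information-gain computations; the function names and the toy arrays are illustrative choices, not part of the original text.

```python
import numpy as np

def entropy(labels):
    """I(s1, ..., sm): information entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Gain(A) = I(S) - E(A), where E(A) is the weighted entropy
    of the subsets produced by splitting on attribute A."""
    total = len(labels)
    e_a = 0.0
    for value in np.unique(feature_values):
        subset = labels[feature_values == value]
        e_a += len(subset) / total * entropy(subset)
    return entropy(labels) - e_a

# Toy example: one attribute with values a1/a2 and binary class labels.
attr = np.array(["a1", "a1", "a2", "a2", "a2"])
y    = np.array(["yes", "no", "yes", "yes", "yes"])
print(information_gain(attr, y))  # information gain of splitting on this attribute
```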
The ID3 algorithm can be summarized as follows (a short code sketch follows the list):
- For every attribute $A$, calculate its information gain $\mathrm{Gain}(A)$.
- Pick the attribute with the largest $\mathrm{Gain}(A)$ as the root node or an internal node.
- Remove the grown attribute $A$, and for every value $a_j$ of attribute $A$, determine the next node to be grown.
- Repeat steps 1~3 until each subset contains only one label/class $C_i$.
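A compact, self-contained Python sketch of these steps; the names `build_id3` and `best_attribute`, the dict-based row format, and the toy dataset are assumptions for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """I(s1, ..., sm) for a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Steps 1-2: compute Gain(A) for each attribute and return the best one."""
    def gain(attr):
        e_a = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            e_a += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - e_a
    return max(attributes, key=gain)

def build_id3(rows, labels, attributes):
    """Steps 3-4: grow the tree recursively until each subset is pure."""
    if len(set(labels)) == 1:              # only one class Ci left -> leaf
        return labels[0]
    if not attributes:                     # no attribute left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    node = {a: {}}
    remaining = [x for x in attributes if x != a]   # remove the grown attribute
    for value in set(row[a] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[a] == value]
        node[a][value] = build_id3([rows[i] for i in idx],
                                   [labels[i] for i in idx],
                                   remaining)
    return node

# Example usage with a tiny made-up dataset:
rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"}]
labels = ["play", "stay", "play"]
print(build_id3(rows, labels, ["outlook", "windy"]))
```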
ID3 is an old machine learning algorithm, created in 1979 and based on information entropy; however, it has several problems:
- ID3 prefers attributes with more values, even though such an attribute may not be the optimal one.
- ID3 has to calculate the information entropy for every value of every attribute. This often produces many levels and branches that cover very little probability mass, and as a result the tree tends to overfit and classify poorly on the test set.
C4.5 decision tree
The C4.5 algorithm makes use of the Gain Ratio instead of the Gain to select attributes:
$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}$$
where $\mathrm{Gain}(S,A)$ is nothing more than $\mathrm{Gain}(A)$ in ID3, and $\mathrm{SplitInfo}(S,A)$ is defined as
$$\mathrm{SplitInfo}(S, A) = -\sum_{i=1}^{c} \frac{|s_i|}{|S|} \log_2\!\left(\frac{|s_i|}{|S|}\right)$$
in which $s_1, \ldots, s_c$ are the sample subsets obtained by splitting $S$ on the $c$ values of attribute $A$.
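A minimal Python sketch of the Gain Ratio computation, under the same illustrative conventions as the earlier snippets (the function names are assumptions, not from the original text):

```python
import numpy as np

def entropy(labels):
    """Information entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_info(feature_values):
    """SplitInfo(S, A): entropy of the partition induced by attribute A."""
    _, counts = np.unique(feature_values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature_values, labels):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)."""
    total = len(labels)
    e_a = sum(
        (feature_values == v).sum() / total * entropy(labels[feature_values == v])
        for v in np.unique(feature_values)
    )
    gain = entropy(labels) - e_a
    si = split_info(feature_values)
    return gain / si if si > 0 else 0.0
```

Dividing by the split information penalizes attributes with many values, which addresses the first problem of ID3 noted above.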