Model 4: Decision Tree

Contents

Example

1.Build a decision tree 

(1) Purity should be maximized when splitting.

(2) Reasonably choose when to stop splitting.

2.Information entropy

3.Select classification features

Information gain

4.Split continuous features

5.Regression trees

6.Pruning

(1) Pre-pruning

(2) Post-pruning

7.More Decision Tree Algorithms

(1) ID3 decision tree

(2) C4.5 decision tree

(3) CART decision tree


Example

To introduce this model, let's take an example:

This is an example of the classification of cats and dogs.

Since its feature values are all binary, you end up with a binary tree.

Each sample entering the decision tree is assigned to a group that matches its characteristics: 

Now, we will introduce the decision tree model (using the ID3 decision tree as the main example).

1.Build a decision tree 

The following is the process of its formation.

 

At the same time, we have two requirements in the process of building a decision tree.

(1) Purity should be maximized when splitting.

(2) Reasonably choose when to stop splitting.

2.Information entropy

Information entropy is a concept used to measure the uncertainty of information. In information theory, entropy expresses the uncertainty of a random variable, that is, the average amount of information it carries. The higher the entropy, the greater the uncertainty, and vice versa.

Now let's calculate the information entropy of the given example.
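As a minimal sketch of this calculation (the cat/dog labels below are made-up stand-ins for the example's data, and the helper name `entropy` is our own):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H = -sum(p_i * log2(p_i)) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["cat", "cat", "dog", "dog"]))  # 1.0: a 50/50 split is maximally uncertain
print(entropy(["cat", "cat", "cat", "dog"]))  # ~0.811: mostly one class, lower entropy
print(entropy(["cat"] * 4))                   # -0.0: a pure group has zero entropy
```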

When the purity of a sample group is higher (i.e., the labels of its samples are as uniform as possible), the information entropy is smaller, and vice versa.

Its more rigorous definition is given in mathematical notation: for a node whose samples fall into $n$ classes with proportions $p_1, \dots, p_n$,

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$

3.Select classification features

The reduction of information entropy is called information gain.

Suppose we split the samples according to each of the three features mentioned earlier.

Question: which split works best?

The information gain is calculated as the entropy of the root node minus the weighted average entropy of its child nodes.


Information gain

The information gain of a binary classification is defined as follows:

Now we use mathematical notation to give a more rigorous definition: for a sample set $D$ split by feature $a$ into subsets $D^1, \dots, D^V$,

$$\mathrm{Gain}(D, a) = H(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} H(D^v)$$

Therefore, the decision tree classification process can be summarized as follows: at each node, compute the information gain of every candidate feature, split on the feature with the largest gain, and repeat recursively until a stopping condition is reached.
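A sketch of this selection step (the data, feature names, and helper functions below are hypothetical, chosen to echo the cat/dog example):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, feature):
    """Entropy of the parent node minus the weighted average entropy of the children."""
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[sample[feature]].append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

samples = [{"ear": "pointy", "face": "round"},
           {"ear": "pointy", "face": "round"},
           {"ear": "floppy", "face": "round"},
           {"ear": "floppy", "face": "long"}]
labels = ["cat", "cat", "dog", "dog"]

# Split on the feature with the largest information gain.
best = max(["ear", "face"], key=lambda f: information_gain(samples, labels, f))
print(best)  # "ear": it separates the classes perfectly, so its gain is highest
```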

4.Split continuous features

Weight is a continuous feature; how do we split on it?

We can sort the samples by this feature, take the midpoint between each pair of adjacent values as a candidate threshold, compute the information gain of each candidate split, and choose the one with the largest gain.
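A sketch of that threshold search, under the same assumptions as the earlier snippets (the weights below are made-up example data; `entropy` is the same helper as before):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and
    return the threshold whose split yields the largest information gain."""
    pairs = sorted(zip(values, labels))
    best_gain, best_t = -1.0, None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # equal values give no usable midpoint
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = entropy(labels) - (len(left) / len(labels) * entropy(left)
                                  + len(right) / len(labels) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

weights = [3.1, 4.0, 4.2, 8.5, 9.0]           # hypothetical animal weights
labels = ["cat", "cat", "cat", "dog", "dog"]
print(best_threshold(weights, labels))        # (6.35, ~0.971): splits the classes cleanly
```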

5.Regression trees

We can adapt the classification tree to perform regression.

First we train a decision tree:

We then assign a prediction to each group: the mean of the target values of the samples in that group.

We can also build a regression tree directly:

In this tree we no longer use information entropy as the splitting criterion. Entropy only measures the purity of class labels, which makes it suitable for building a classification tree. A regression tree must instead produce numerical predictions, so we use the sample variance as the splitting criterion, choosing the split that most reduces the variance of the target values.
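A minimal sketch of this variance criterion (the target values and the left/right grouping below are made up for illustration):

```python
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def variance_reduction(parent, left, right):
    """Variance of the parent group minus the weighted average variance of the children."""
    n = len(parent)
    return variance(parent) - (len(left) / n * variance(left)
                               + len(right) / n * variance(right))

ys = [1.0, 1.2, 0.9, 5.1, 4.8]      # target values at a node
left, right = ys[:3], ys[3:]        # a candidate split
print(variance_reduction(ys, left, right))  # ~3.68: a large reduction, so a good split
# Each leaf then predicts the mean of the target values of its samples.
```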

6.Pruning

Pruning reduces overfitting: it removes (or avoids growing) branches that fit noise in the training data rather than genuine structure, which improves the tree's generalization ability.

(1) Pre-pruning

Pre-pruning stops splitting early while the tree is being built, for example when a node's sample count falls below a threshold, a maximum depth is reached, or a candidate split no longer improves performance on a validation set.

(2) Post-pruning

Post-pruning first grows the full tree and then removes branches from the bottom up, replacing a subtree with a leaf whenever doing so does not hurt (or even improves) validation performance.
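As an illustration with scikit-learn (our choice of library, not the original's): pre-pruning corresponds to growth limits such as `max_depth` and `min_samples_leaf`, while post-pruning is available as cost-complexity pruning via `ccp_alpha`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: constrain the tree's growth up front.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
pre.fit(X_train, y_train)

# Post-pruning: grow fully, then prune subtrees via the cost-complexity parameter.
post = DecisionTreeClassifier(ccp_alpha=0.02)
post.fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))
```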

 

7.More Decision Tree Algorithms

(1) ID3 decision tree

The running example in this article is the ID3 decision tree, which is based on information entropy and chooses splits according to information gain.

(2) C4.5 decision tree

The C4.5 decision tree no longer uses information gain as the splitting criterion, but the information gain ratio:

$$\mathrm{GainRatio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}, \quad \mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

The split information $\mathrm{IV}(a)$ grows with the number of distinct values of feature $a$, so the ratio penalizes features that split the data into many small groups.
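A small sketch of the ratio (the raw gain values below are made up; `split_info` implements $\mathrm{IV}(a)$):

```python
import math
from collections import Counter

def split_info(values):
    """IV(a): entropy of the partition induced by the feature's values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(gain, values):
    iv = split_info(values)
    return gain / iv if iv > 0 else 0.0

# The same raw gain is penalized more for a feature with many distinct values.
print(gain_ratio(0.9, ["a", "b", "c", "d"]))  # 0.9 / 2.0 = 0.45
print(gain_ratio(0.9, ["a", "a", "b", "b"]))  # 0.9 / 1.0 = 0.90
```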

Compared with the ID3 algorithm, the C4.5 algorithm has the following advantages:
1. C4.5 can handle continuous-valued attributes, while ID3 can only handle discrete-valued attributes.
2. More balanced trees: C4.5 uses the information gain ratio to select attributes, which helps generate a more balanced decision tree and reduces the risk of overfitting.
3. Missing values: C4.5 can deal with missing values in the data set, while ID3 runs into problems when values are missing.
4. Pruning optimization: C4.5 prunes the decision tree after generating it, which improves generalization and reduces overfitting.

In general, C4.5 has advantages over ID3 in handling continuous-valued attributes, generating balanced decision trees, dealing with missing values, and pruning optimization.

(3) CART decision tree

CART (Classification and Regression Trees) is a specific type of decision tree that uses Gini impurity to select the best split point.

Gini impurity is an indicator of how mixed the data is. It is the probability that two randomly selected samples belong to different classes. The lower the Gini impurity, the purer the data, i.e., the more the samples are concentrated in a single class.

The CART decision tree splits the data into subsets recursively until some stopping conditions are met. At each node, it selects the split point that minimizes the Gini impurity. The final tree can be used for classification or for regression prediction.

(That is, the probability of drawing two samples whose class values differ.)

Gini impurity is defined as follows: for class proportions $p_1, \dots, p_K$ in a sample set $D$,

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2$$

The Gini index of a split on feature $a$ is defined as the weighted average impurity of the resulting subsets:

$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, \mathrm{Gini}(D^v)$$

We split on the feature that yields the minimum Gini index, which maximizes purity.
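A minimal sketch of both quantities (function names are our own; the cat/dog labels echo the running example):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: the probability that two randomly drawn samples differ in class."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(groups):
    """Weighted average impurity of the child groups produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

print(gini(["cat", "cat", "dog", "dog"]))            # 0.5
print(gini_index([["cat", "cat"], ["dog", "dog"]]))  # 0.0: a perfect split
```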

 
