Decision Tree Principles

Reference: http://www.saedsayad.com/decision_tree.htm
 

Decision Tree - Classification

Decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy). A leaf node (e.g., Play) represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

 
Algorithm
The core algorithm for building decision trees is called ID3, developed by J. R. Quinlan. It employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree.
 
Entropy

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has an entropy of one.


My understanding:

The idea: we want a function f that expresses the uncertainty (disorder) of information, and it has to meet two requirements. 1. The higher the probability of a value, the smaller f should be; for example, the more often a value appears in the Play Golf column above, the smaller f should be for that value, and in the extreme case where the column contains only the single value 'NO' (probability 1) the column has no disorder at all, so f should be 0. 2. The uncertainty produced by two independent symbols should equal the sum of their individual uncertainties, i.e. f(P1·P2) = f(P1) + f(P2).

The function f that satisfies both conditions at once is the logarithm, i.e. f(p) = -log2(p) (the minus sign keeps f non-negative, since 0 < p ≤ 1).
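A quick check that the negative logarithm meets both requirements (my own note; P1 and P2 denote probabilities of independent events):

```latex
% (1) a sure event has no uncertainty; (2) independent uncertainties add up
\[
f(1) = -\log_2 1 = 0,
\qquad
f(P_1 P_2) = -\log_2(P_1 P_2) = -\log_2 P_1 - \log_2 P_2 = f(P_1) + f(P_2)
\]
```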

 

To build a decision tree, we need to calculate two types of entropy using frequency tables as follows:
 
a) Entropy using the frequency table of one attribute:

b) Entropy using the frequency table of two attributes:
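Written out in the usual ID3 notation (a reconstruction, since the two formulas are missing from the text above):

```latex
% a) entropy of the target S over its c classes; p_i is the fraction of rows in class i
% b) entropy of the target T after splitting on attribute X; P(c) is the fraction of rows
%    taking value c of X, and E(c) is the entropy of that branch
\[
E(S) = \sum_{i=1}^{c} -\,p_i \log_2 p_i
\qquad\qquad
E(T, X) = \sum_{c \in X} P(c)\, E(c)
\]
```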

 
Information Gain

The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

In other words: information gain is the reduction in entropy after the dataset is split on a particular attribute, and constructing a decision tree is the process of finding the attribute that yields the largest information gain.

If that is still hard to grasp, Step 2 below should help.
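As a formula, using the two entropies defined above (the standard ID3 definition):

```latex
\[
\mathrm{Gain}(T, X) = E(T) - E(T, X)
\]
```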

 
Step 1: Calculate entropy of the target. 
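A minimal Python sketch of Step 1, assuming the usual 14-row Play Golf example (9 Yes and 5 No in the target column); the `entropy` helper name is my own:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: sum of -p * log2(p) over the classes."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

# Assumed target column: 9 "Yes" and 5 "No" (the classic Play Golf example).
play_golf = ["Yes"] * 9 + ["No"] * 5
print(round(entropy(play_golf), 3))  # 0.94
```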

Step 2: The dataset is then split on the different attributes. The entropy for each branch is calculated. Then it is added proportionally to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the Information Gain, or decrease in entropy.
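Continuing the sketch for Step 2, with the Outlook branch counts assumed to be Sunny 2 Yes / 3 No, Overcast 4 Yes / 0 No, Rainy 3 Yes / 2 No; `info_gain` reuses the `entropy` helper and the `play_golf` list above:

```python
def info_gain(target_entropy, branches):
    """Information gain = entropy before the split minus the
    size-weighted entropy of each branch after the split."""
    total = sum(len(b) for b in branches)
    split_entropy = sum(len(b) / total * entropy(b) for b in branches)
    return target_entropy - split_entropy

# Branches of Play Golf when splitting on Outlook (assumed counts).
outlook_branches = [
    ["Yes"] * 2 + ["No"] * 3,  # Sunny
    ["Yes"] * 4,               # Overcast
    ["Yes"] * 3 + ["No"] * 2,  # Rainy
]
print(round(info_gain(entropy(play_golf), outlook_branches), 3))  # ~0.247
```

With the same assumed dataset, the other attributes give smaller gains (roughly 0.03 to 0.15), which is why Outlook wins in Step 3.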



Step 3: Choose attribute with the largest information gain as the decision node. 

Why choose the attribute with the largest information gain as the root node? The larger Gain(A) is, the more information attribute A provides for classification. Choosing A as the root node leaves the least remaining uncertainty about the class, so the test at each non-leaf node extracts the maximum amount of information about the class of the records being tested.

(I do not yet know the deeper theory behind this; if any reader does, please let me know or start a discussion.)

Step 4a: A branch with entropy of 0 is a leaf node.

Step 4b: A branch with entropy more than 0 needs further splitting.

Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
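Putting Steps 1 to 5 together, here is a compact recursive sketch of ID3 (my own illustration, not code from the referenced page); it reuses the `entropy` and `info_gain` helpers above, takes rows as dicts mapping attribute names to values, and represents the tree as nested dicts:

```python
def id3(rows, target, attributes):
    """Build a decision tree as nested dicts: {attribute: {value: subtree_or_label}}."""
    labels = [row[target] for row in rows]
    if entropy(labels) == 0 or not attributes:
        # Step 4a: a pure branch (or one with no attributes left) becomes a leaf node
        return max(set(labels), key=labels.count)   # majority (or only) class
    base = entropy(labels)

    # Step 3: pick the attribute with the largest information gain.
    def gain(attr):
        values = set(row[attr] for row in rows)
        branches = [[r[target] for r in rows if r[attr] == v] for v in values]
        return info_gain(base, branches)
    best = max(attributes, key=gain)

    # Steps 4b and 5: split on the best attribute and recurse on each branch.
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, target, remaining)
    return tree

# Hypothetical usage, assuming rows use these column names:
# id3(rows, "Play Golf", ["Outlook", "Temperature", "Humidity", "Windy"])
```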
 

 

Decision Tree to Decision Rules

A decision tree can easily be transformed to a set of rules by mapping from the root node to the leaf nodes one by one.
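A small sketch of that mapping, using the nested-dict tree format from the `id3` sketch above; the rule text format is my own choice:

```python
def tree_to_rules(tree, conditions=()):
    """Walk every root-to-leaf path and emit one IF ... THEN ... rule per leaf."""
    if not isinstance(tree, dict):                       # leaf node: a class label
        yield "IF " + " AND ".join(conditions) + " THEN " + str(tree)
        return
    (attribute, branches), = tree.items()                # each internal node has one attribute
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + (f"{attribute} = {value}",))

# e.g. yields rules such as: IF Outlook = Overcast THEN Yes
```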

 
 

Decision Trees - Issues

 
Exercise
 
Try to invent a new algorithm to construct a decision tree from data using the Chi2 test.
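As a hint for the exercise (one possible direction, not a full answer): instead of information gain, score each candidate attribute by the chi-square statistic of its attribute-vs-target contingency table and split on the highest-scoring attribute, for example:

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_score(rows, attr, target):
    """Chi-square statistic of the attr-vs-target frequency table (higher = stronger split)."""
    attr_values = sorted(set(r[attr] for r in rows))
    classes = sorted(set(r[target] for r in rows))
    table = np.array([[sum(1 for r in rows if r[attr] == v and r[target] == c)
                       for c in classes] for v in attr_values])
    statistic, p_value, dof, expected = chi2_contingency(table)
    return statistic
```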
 