SAS Module 5 Decision Tree

Latest recommended article published on 2021-03-29 14:09:27

J.G.D

Category: SAS. Tags: data analysis, sas

Article link: https://blog.csdn.net/JGD_USC/article/details/105986652

CART: Classification And Regression Tree

For classification trees, the final prediction is the mode (most common class) of the training observations in the region
For regression trees, the final prediction is the mean of the training observations in the region
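
As a minimal sketch of these two rules (the function name `leaf_prediction` and the sample values are made up for illustration), the prediction for one region can be computed from the training responses that fall into it:

```python
from statistics import mean, mode


def leaf_prediction(responses, task):
    """Predict for one region from the training responses that fall in it."""
    if task == "classification":
        return mode(responses)  # most common class in the region
    if task == "regression":
        return mean(responses)  # average response in the region
    raise ValueError("task must be 'classification' or 'regression'")


# Hypothetical training responses for a single region
class_pred = leaf_prediction(["Yes", "No", "Yes"], "classification")
value_pred = leaf_prediction([1.0, 2.0, 3.0], "regression")
```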

Regression Tree:

We divide the predictor space into J distinct and non-overlapping regions R1, R2, ..., RJ
For each region, we find the mean of the response values for the training observations; this mean is the prediction for the region
Compute the RSS within each region and add them up; among the candidate partitions, keep the one with the smallest total RSS (we can do this in Excel)
At each step, choose the predictor Xj and the cut-point s so that splitting into {X | Xj < s} and {X | Xj >= s} gives the smallest total RSS
We take a top-down, greedy approach that is known as recursive binary splitting, so the above process is repeated within each resulting region
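
The search for a cut-point on a single predictor can be sketched as follows. This is only an illustration under stated assumptions: the helper names `rss` and `best_split`, the choice of midpoints between distinct values as candidate cut-points, and the toy data are not from the original notes.

```python
def rss(values):
    """Residual sum of squares around the mean; 0.0 for an empty region."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (s, total_rss) for the cut-point s on one predictor that
    minimizes RSS(region with x < s) + RSS(region with x >= s)."""
    best_s, best_rss = None, float("inf")
    xs = sorted(set(x))
    # Assumption: candidate cut-points are midpoints between distinct values
    for lo, hi in zip(xs, xs[1:]):
        s = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < s]
        right = [yi for xi, yi in zip(x, y) if xi >= s]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_s, best_rss = s, total
    return best_s, best_rss


# Toy data (hypothetical): the response jumps between x = 3 and x = 10
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
s, total = best_split(x, y)
```

A full tree would run this search over every predictor, keep the best (Xj, s) pair, and then repeat the search inside each of the two new regions.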

When to stop? Prune the tree

If the tree is not pruned, it is likely to overfit the training data, leading to poor test set performance
A smaller tree may have lower variance and is easier to interpret
The strategy is to grow a very large tree first, and then prune it back to a subtree
Prune the tree to the number of terminal nodes that gives the smallest estimated test MSE (mean squared error), for example from cross-validation
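
The notes do not say which criterion is used to pick the subtree. One standard formulation, shown here as an assumption rather than as the method of this module, is cost-complexity pruning: for a tuning parameter α >= 0, choose the subtree T of the large tree T0 that minimizes the training RSS plus a penalty on the number of terminal nodes |T|.

```latex
\min_{T \subseteq T_0} \; \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha \, |T|
```

Here R_m is the region of the m-th terminal node and \hat{y}_{R_m} is the mean response in that region. With α = 0 the full tree T0 is kept; larger values of α favor smaller subtrees, and α is typically chosen by cross-validated MSE.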

Classification Tree:

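Similar to a regression tree, except that it is used to predict a qualitative response
The prediction for a region is its most commonly occurring class (mode)
It still uses binary splitting, but the split criterion is the Gini index rather than RSS
Here m indexes the regions and i indexes the classes (Yes/No, 1/0, T/F). The Gini index of region m is the sum over classes of p_mi * (1 - p_mi), where p_mi is the proportion of training observations in region m that belong to class i. Finally, weight each region by its size to calculate the overall Gini index of the split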
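
A small sketch of the Gini calculation, assuming the usual definition (sum over classes of p * (1 - p), which equals 1 minus the sum of squared class proportions). The function names and the Yes/No labels below are illustrative only.

```python
from collections import Counter


def gini(labels):
    """Gini index of one region: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(regions):
    """Size-weighted average of the Gini index over the regions of a split."""
    total = sum(len(region) for region in regions)
    return sum(len(region) / total * gini(region) for region in regions)


# Hypothetical split into two regions
left = ["Yes", "Yes", "Yes", "No"]
right = ["No", "No", "No", "No"]
split_gini = weighted_gini([left, right])
```

A pure region (all one class) has a Gini index of 0, so a lower weighted value means a cleaner split.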

Advantages of Trees:

Easy to interpret
More closely mirror human decision-making process
Can be displayed graphically
Trees work for both qualitative and quantitative responses, and qualitative variables can be used without creating extra dummy variables

Disadvantages of Trees:

A single tree usually does not reach the same level of predictive accuracy as other approaches (but Bagging, Random Forests and Boosting can be used to improve it)
High variance: different partitions of the same data set may produce quite different trees
Instability: very minor changes in the data can result in significantly different trees