SAS
Module 5 Decision Tree
CART: Classification And Regression Tree
- For classification trees, final prediction is the mode of the training observations in the region
- For regression trees, final prediction is the mean of the training observations in the region
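As a minimal sketch of the two prediction rules above, the snippet below computes the prediction for a single region (leaf). The response values are hypothetical and only illustrate the mode/mean distinction.

```python
from statistics import mean, mode

# Hypothetical training responses that fell into one region (leaf).
region_classes = ["Yes", "No", "Yes", "Yes", "No"]  # qualitative response
region_values = [10.0, 12.0, 11.0, 15.0]            # quantitative response

class_prediction = mode(region_classes)  # classification tree: most common class
value_prediction = mean(region_values)   # regression tree: average response

print(class_prediction)  # Yes
print(value_prediction)  # 12.0
```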
Regression Tree:
- We divide the predictor space into J distinct, non-overlapping regions R1, R2, …, RJ
- For each region, the prediction is the mean of the response values of the training observations in that region
- For each candidate split, compute the RSS of each region (the sum of squared differences between each response and its region mean), add them up, and keep the split with the smallest total RSS (for small data sets this can be done in Excel)
- The goal is to find a good cut-point s for predictor Xj, so that splitting into {X | Xj < s} and {X | Xj >= s} gives the smallest total RSS
- We take a top-down, greedy approach known as recursive binary splitting: the same search is repeated inside each resulting region, again and again
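The cut-point search described above can be sketched for a single predictor as follows. This is an illustrative sketch, not a full tree implementation: the function names and the data are hypothetical, and using midpoints between consecutive distinct x values as candidate cut-points is an assumption (a common choice).

```python
def rss(values):
    """Residual sum of squares of values around their own mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_cut_point(x, y):
    """Return (s, total_rss) for the cut-point s that minimizes
    RSS(left) + RSS(right), where left is x < s and right is x >= s."""
    xs = sorted(set(x))
    best_s, best_total = None, float("inf")
    # Candidate cut-points: midpoints between consecutive distinct x values.
    for lo, hi in zip(xs, xs[1:]):
        s = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < s]
        right = [yi for xi, yi in zip(x, y) if xi >= s]
        total = rss(left) + rss(right)
        if total < best_total:
            best_s, best_total = s, total
    return best_s, best_total


# Hypothetical data: the response jumps between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
s, total = best_cut_point(x, y)
print(s, total)  # 3.5 2.5
```

A full regression tree would run this search over every predictor, split on the best (predictor, cut-point) pair, and then repeat the search inside each of the two new regions.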
When to stop? Prune the Tree
- If the tree is not pruned, it is likely to overfit the training data, leading to poor test set performance
- A smaller tree may have lower variance and is easier to interpret
- The strategy is to grow a very large tree first, and then prune it to a subtree
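- Prune the tree back to the number of terminal nodes (leaves) with the smallest MSE (mean squared error), measured on validation data or by cross-validation, because training MSE keeps decreasing as the tree grows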
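The idea of choosing tree size by MSE can be sketched as below. Note the simplification: instead of growing one large tree and pruning it back to a subtree, this sketch grows one-predictor trees of increasing depth and compares them on a held-out validation set. It is a stand-in for illustration only, not cost-complexity pruning, and all names and data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def rss(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def grow(x, y, depth):
    """Grow a one-predictor regression tree with at most `depth` levels of splits."""
    xs = sorted(set(x))
    if depth == 0 or len(xs) < 2:
        return {"pred": mean(y)}  # leaf: predict the mean response
    best = None
    for lo, hi in zip(xs, xs[1:]):
        s = (lo + hi) / 2
        left = [(a, b) for a, b in zip(x, y) if a < s]
        right = [(a, b) for a, b in zip(x, y) if a >= s]
        total = rss([b for _, b in left]) + rss([b for _, b in right])
        if best is None or total < best[0]:
            best = (total, s, left, right)
    _, s, left, right = best
    return {
        "cut": s,
        "left": grow([a for a, _ in left], [b for _, b in left], depth - 1),
        "right": grow([a for a, _ in right], [b for _, b in right], depth - 1),
    }


def predict(tree, xi):
    # Walk down from the root until a leaf is reached.
    while "cut" in tree:
        tree = tree["left"] if xi < tree["cut"] else tree["right"]
    return tree["pred"]


def mse(tree, x, y):
    return mean([(predict(tree, a) - b) ** 2 for a, b in zip(x, y)])


# Hypothetical training and validation data: a step function plus noise.
x_train = [1, 2, 3, 4, 5, 6, 7, 8]
y_train = [5.0, 6.5, 4.5, 6.0, 20.5, 19.0, 21.5, 20.0]
x_valid = [1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2, 8.2]
y_valid = [5.5, 5.0, 6.0, 5.5, 20.0, 20.5, 19.5, 20.0]

depths = [0, 1, 2, 3]
trees = [grow(x_train, y_train, d) for d in depths]
train_mse = [mse(t, x_train, y_train) for t in trees]
valid_mse = [mse(t, x_valid, y_valid) for t in trees]

# Keep the tree size with the smallest validation MSE.
best_depth = depths[valid_mse.index(min(valid_mse))]
print(best_depth)  # 1
```

On this toy data the training MSE never increases as the tree gets deeper, while the validation MSE is lowest for the single-split tree, which is the overfitting pattern that motivates pruning.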
Classification Tree:
- Similar to a regression tree, except that it is used to predict a qualitative response
- The prediction for a region is the most commonly occurring class (mode) among its training observations
- Still uses recursive binary splitting, but the split criterion is the Gini index instead of RSS
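- In the Gini index, m indexes the region and i indexes the class (Yes/No, 1/0, T/F): G_m = sum over i of p_mi (1 - p_mi), where p_mi is the proportion of training observations in region m that belong to class i. The final Gini index of a split is the weighted average of the region values, with each region weighted by its share of the observations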
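A small sketch of the Gini calculation, assuming the standard definition (for each region, the sum over classes of p(1 - p), then a size-weighted average across regions). The function names and class labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of one region: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def weighted_gini(regions):
    """Gini index of a split: region Gini values weighted by region size."""
    total = sum(len(r) for r in regions)
    return sum(len(r) / total * gini(r) for r in regions)


# Hypothetical split into two regions of four observations each.
left = ["Yes", "Yes", "Yes", "No"]
right = ["No", "No", "No", "No"]

print(gini(left))                    # 0.375
print(gini(right))                   # 0.0
print(weighted_gini([left, right]))  # 0.1875
```

A pure region (all observations in one class) has a Gini index of 0, so the split search prefers cut-points that give the smallest weighted value.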
Advantages of Trees:
- Easy to interpret
- More closely mirror human decision-making process
- Can be displayed graphically
- Trees work for both qualitative and quantitative responses, and they handle qualitative predictors without the need to create dummy variables
Disadvantages of Trees:
- Generally do not reach the same level of predictive accuracy as other approaches (but bagging, random forests, and boosting can improve it)
- High variance: fitting trees to different splits of the same data set may produce quite different trees
- Instability: very minor changes in the data can result in significantly different trees