1. Quality metric
The quality metric for a decision tree is the classification error:
error = (number of incorrect predictions) / (number of examples)
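As a quick sketch of this metric (the function name classification_error is mine, not from the notes):

    def classification_error(y_true, y_pred):
        # error = (number of incorrect predictions) / (number of examples)
        incorrect = sum(1 for t, p in zip(y_true, y_pred) if t != p)
        return incorrect / len(y_true)

    # 2 mistakes out of 5 examples -> 0.4
    print(classification_error([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))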
2. Greedy algorithm
Procedure
Step 1: Start with an empty tree
Step 2: Select a feature to split the data on
Explanation (a sketch follows this procedure):
For each feature, split the data on that feature
Calculate the classification error of the resulting decision stump
Choose the feature with the lowest error
For each split of the tree:
Step 3: If all data points in the node have the same y value,
or if we have already used up all the features, stop.
Step 4: Otherwise, go to step 2 and recurse on this split
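A minimal sketch of the feature selection in step 2, assuming categorical features stored as dicts; the helper names (majority_class, stump_error, best_split) are illustrative, not from the notes:

    from collections import Counter

    def majority_class(labels):
        # Most common label among the given data points.
        return Counter(labels).most_common(1)[0][0]

    def stump_error(data, labels, feature):
        # Split on `feature`, predict the majority class in each branch,
        # and return the resulting classification error.
        mistakes = 0
        for value in set(x[feature] for x in data):
            branch = [y for x, y in zip(data, labels) if x[feature] == value]
            mistakes += sum(1 for y in branch if y != majority_class(branch))
        return mistakes / len(labels)

    def best_split(data, labels, features):
        # Greedy choice: the feature whose decision stump has the lowest error.
        return min(features, key=lambda f: stump_error(data, labels, f))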
Algorithm
predict(tree_node, input):
    if tree_node is a leaf:
        return the majority class of the data points in the leaf
    else:
        next_node = child of tree_node whose feature value agrees with input
        return predict(next_node, input)
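The same recursion as runnable Python, using a nested-dict tree representation of my own choosing:

    # A leaf is {'leaf': True, 'prediction': ...}; an internal node is
    # {'leaf': False, 'feature': ..., 'children': {value: subtree, ...}}.
    def predict(tree_node, x):
        if tree_node['leaf']:
            return tree_node['prediction']   # the leaf stores its majority class
        next_node = tree_node['children'][x[tree_node['feature']]]
        return predict(next_node, x)

    tree = {'leaf': False, 'feature': 'credit',
            'children': {'good': {'leaf': True, 'prediction': 'safe'},
                         'bad':  {'leaf': True, 'prediction': 'risky'}}}
    print(predict(tree, {'credit': 'good'}))  # -> safe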
3. Threshold split
Threshold splits handle continuous inputs: we pick a threshold value for the continuous feature and split the data on it.
Procedure:
Step 1: Sort the values of the feature hj(x): {v1, v2, ..., vN}
Step 2: For i = 1, ..., N-1 (all adjacent pairs of sorted values):
consider the split ti = (vi + vi+1)/2
compute the classification error of the split
Step 3: Choose the ti with the lowest classification error
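A sketch of this search for a single continuous feature; the function names are mine, and the error computation follows section 1:

    from collections import Counter

    def majority_class(labels):
        return Counter(labels).most_common(1)[0][0]

    def best_threshold(values, labels):
        # Step 1: sort the (value, label) pairs by the feature value.
        pairs = sorted(zip(values, labels), key=lambda p: p[0])
        n = len(pairs)
        best_t, best_err = None, float('inf')
        # Step 2: candidate thresholds are midpoints of consecutive sorted values.
        for i in range(n - 1):
            t = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [y for v, y in pairs if v <= t]
            right = [y for v, y in pairs if v > t]
            # Classification error of the stump that predicts the majority on each side.
            err = (sum(1 for y in left if y != majority_class(left)) +
                   sum(1 for y in right if y != majority_class(right))) / n
            if err < best_err:
                best_t, best_err = t, err
        return best_t, best_err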
4. Overfitting
As the depth of the tree increases, overfitting can occur.
Remedies
1. Early Stopping
Stop the learning algorithm before the tree becomes too complex
For example:
- Limit the depth of the tree (it is difficult to choose a good depth value)
- Stop when the classification error stops improving (can be dangerous: in XOR-like data, no single split lowers the error even though deeper splits would)
- Stop if the number of data points in an intermediate node is too small
2. Pruning
Simplify the tree after the learning algorithm terminates
Consider a specific total cost:
Total cost = classification error + λ * (number of leaf nodes)
For example, with λ = 0.01, a tree with error 0.12 and 6 leaves has total cost 0.12 + 0.06 = 0.18.
Start at the bottom of tree T and traverse up, applying prune_split(T, M) to each decision node M
prune_split(T, M):
1. Compute the total cost of tree T using the formula above:
C(T) = Error(T) + λL(T)
2. Let Tsmaller be the tree after pruning the subtree below M
3. Compute the total cost of Tsmaller: C(Tsmaller) = Error(Tsmaller) + λL(Tsmaller)
4. If C(Tsmaller) < C(T), prune to Tsmaller
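A sketch of this bottom-up pass, reusing the dict-based nodes from the predict sketch and assuming each internal node also stores its own majority class under a 'majority' key (my convention, not from the notes):

    def predict(node, x):
        if node['leaf']:
            return node['prediction']
        return predict(node['children'][x[node['feature']]], x)

    def count_leaves(node):
        if node['leaf']:
            return 1
        return sum(count_leaves(c) for c in node['children'].values())

    def total_cost(tree, data, labels, lam):
        # C(T) = Error(T) + lambda * L(T)
        error = sum(1 for x, y in zip(data, labels) if predict(tree, x) != y) / len(labels)
        return error + lam * count_leaves(tree)

    def prune_split(tree, node, data, labels, lam):
        # Bottom-up: prune below the children first, then consider collapsing `node`.
        if node['leaf']:
            return
        for child in node['children'].values():
            prune_split(tree, child, data, labels, lam)
        cost_before = total_cost(tree, data, labels, lam)
        saved = dict(node)  # remember the split so we can undo the prune
        node.clear()
        node.update({'leaf': True, 'prediction': saved['majority']})  # assumes 'majority' is stored
        if total_cost(tree, data, labels, lam) >= cost_before:
            node.clear()    # C(Tsmaller) was not lower: restore the split
            node.update(saved)

Calling prune_split(tree, tree, data, labels, lam) applies the prune test to every decision node on the way back up.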
5. Missing data
1. Purification by skipping data points or skipping features
Cons:
1. Removing data points or features may remove important information from the data
2. Unclear when it is better to remove data points versus features
3. Does not help if data is missing at prediction time
2. Imputation
Fill in the missing values (see the sketch after the cons below):
1. Categorical feature:
Fill in the most common value of xi
2. Numerical feature:
Fill in the average or median value of xi
Cons:
May result in systematic error
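A sketch of both fill-in rules, assuming plain Python lists where None marks a missing entry (function names are mine):

    from collections import Counter
    from statistics import median

    def impute_categorical(column):
        # Fill missing entries with the most common observed value.
        observed = [v for v in column if v is not None]
        fill = Counter(observed).most_common(1)[0][0]
        return [fill if v is None else v for v in column]

    def impute_numerical(column, strategy='median'):
        # Fill missing entries with the median (or the mean) of observed values.
        observed = [v for v in column if v is not None]
        fill = median(observed) if strategy == 'median' else sum(observed) / len(observed)
        return [fill if v is None else v for v in column]

    print(impute_categorical(['a', None, 'a', 'b']))  # -> ['a', 'a', 'a', 'b']
    print(impute_numerical([1.0, None, 3.0]))         # -> [1.0, 2.0, 3.0]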
3. Adding a missing-value choice to every decision node
Use the classification error to decide which branch the unknowns go down (see the sketch below).
Cons:
Requires modifying the learning algorithm
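A sketch of how prediction changes under this approach, extending the dict-based nodes from the earlier sketches with a 'missing' branch (the key name is my own assumption):

    def predict(node, x):
        if node['leaf']:
            return node['prediction']
        value = x.get(node['feature'])            # None when the feature is missing
        branch = 'missing' if value is None else value
        return predict(node['children'][branch], x)

During training, the learning algorithm would route unknowns to whichever branch yields the lowest classification error, which is the modification the con above refers to.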