Machine Learning Review Note - Tree-based methods

Note: these notes are written in English because I do not know the Chinese terms for many of the concepts. They are my own summary; discussion is welcome.

1. Decision Tree

We mainly focus on the CART - Classification and Regression Tree

Tree-based methods segment the predictor space into a number of simple regions. Generally, the splitting procedure continues until:

  • all the samples within a given node belong to the same class
  • the maximum depth of the tree is reached 
  • a split cannot be found that improves the model 

The process of growing a decision tree can be expressed as a recursive algorithm as follows:

  1. Pick a feature such that splitting the parent node on it yields the largest information gain, and stop if the information gain is not positive 
  2. Stop if child nodes are pure or no improvement in class purity can be made
  3. Go back to step 1 for each of the two child nodes

1.1 Information Gain

The algorithm starts at the tree root and splits the data on the feature that gives the largest information gain. This procedure relies on calculating the difference between an impurity measure of a parent node D_p and the impurities of its child nodes; the information gain is high when the weighted sum of the child-node impurities is low. We can maximise the information gain at each split using 

IG(D_p,f) = I(D_p) - \sum_{j = 1}^m\frac{N_j}{N_p}I(D_j),

where I is the impurity measure, N_p is the total number of samples at the parent node, and N_j is the number of samples in the j-th child node. 
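
As an illustration (my addition, not from the original note), here is a minimal NumPy sketch of this computation, using Gini impurity as the impurity measure I and hypothetical class-count arrays:

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node from its per-class sample counts."""
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_counts, children_counts):
    """IG(D_p, f) = I(D_p) - sum_j (N_j / N_p) * I(D_j)."""
    n_parent = parent_counts.sum()
    weighted_child_impurity = sum(
        (child.sum() / n_parent) * gini(child) for child in children_counts
    )
    return gini(parent_counts) - weighted_child_impurity

# Hypothetical binary split: the parent holds 40/40 samples of two classes,
# and the split sends (30, 10) to the left child and (10, 30) to the right.
parent = np.array([40, 40])
children = [np.array([30, 10]), np.array([10, 30])]
print(information_gain(parent, children))  # 0.125 > 0, so the split is kept
```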

Remarks

The CART algorithm is greedy: it searches for the optimal split at each level without checking whether that split leads to the best possible impurity reduction further down the tree. 

1.2 Classification Error 

This is simply the fraction of the training observations in a region that do not belong to the most common class.
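
In symbols (my notation, with p(i|D) denoting the proportion of class-i samples at node D), this is

I_E(D) = 1 - \max_{i} p(i|D)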

Classification error is rarely used for computing information gain in practice, because a split can leave the classification error unchanged even when it improves class purity, so the tree stops growing too early. This is not the case for a strictly concave impurity function such as entropy or Gini. 

1.3 Entropy Impurity 

For all non-empty classes (p(i|D) ≠ 0), entropy is given by 

I_H(D) = -\sum_{i = 1}^{c} p(i|D)\log_2 p(i|D),

where p(i|D) is the proportion of the samples at node D that belong to class i.
The entropy is therefore 0 if all samples at the node belong to the same class and maximal if we have a uniform class distribution. 

1.4 Gini Impurity 

Gini impurity measures the expected probability of misclassifying a randomly chosen sample if it were labelled at random according to the class distribution at the node. It is also a measure of node 'purity': a small value indicates that a node contains predominantly observations from a single class. 
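
For reference (my addition, as the formula is not spelled out above), the Gini impurity of a node D with class proportions p(i|D) is

I_G(D) = \sum_{i = 1}^{c} p(i|D)\left(1 - p(i|D)\right) = 1 - \sum_{i = 1}^{c} p(i|D)^2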

Whether we use entropy or Gini impurity generally does not matter much, because both have a similar concave shape.

Gini tends to isolate the most frequent class in its own branch, whereas entropy tends to produce slightly more balanced trees. 
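
As a minimal scikit-learn sketch (my addition, using the built-in iris dataset purely as a placeholder), the choice of impurity measure is just the criterion hyperparameter:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow the same shallow tree with the two impurity measures; in practice
# the resulting splits and accuracy are usually very similar.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    tree.fit(X, y)
    print(criterion, tree.score(X, y))
```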

1.5 Feature importance 

Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability is the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature. For a binary tree, the Gini-based node importance is:

ni_j = w_j I_j - w_{\text{left}(j)}I_{\text{left}(j)} - w_{\text{right}(j)}I_{\text{right}(j)}

where w_j is the weighted number of samples reaching node j (the node probability) and I_j is the impurity value of node j. The importance of each feature in a single decision tree is then calculated as: 

fi_i = \frac{\sum_{j:\text{node j split on feature } i}ni_j }{\sum_{k\in \text{all nodes}}ni_k}

These can be normalised to a value between 0 and 1 by dividing by the sum of all feature importance values. 

The final feature importance, at the random forest level, is the average over all trees: the feature's importance values are summed over the trees and divided by the total number of trees. 
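
A minimal scikit-learn sketch of reading these impurity-based importances (my addition; the dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds the normalised, tree-averaged impurity importances
# described above (they sum to 1); print the five largest.
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```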

Feature importances can be misleading for high-cardinality features; and if features are highly correlated, one of them may be ranked highly while the information carried by the others is not fully captured. 

Permutation importance 

The permutation feature importance is defined as the decrease in a model score when a single feature's values are randomly shuffled. This procedure breaks the relationship between the feature and the target, so the drop in the model score indicates how much the model depends on the feature. 
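
A minimal scikit-learn sketch (my addition), using sklearn.inspection.permutation_importance on a held-out validation set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature column several times and record the drop in the score;
# a large mean drop means the model relies heavily on that feature.
result = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)
```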

1.6 Pruning 

If a decision tree is not pruned, it has a high risk of overfitting to the training data. 

1.6.1 Pre-pruning 

Pre-pruning sets a prior limit on the number of nodes or the tree depth. We could also set a minimum number of data points in each node. 

1.6.2 Post-pruning 

In general, post-pruning consists of going back through the tree once it has been created, removing branches that do not contribute significantly to the error reduction, and replacing them with leaf nodes. Two common approaches are:

  • Reduced-error pruning 
    • Greedily remove nodes based on validation set performance 
    • Generally improves performance but can be problematic for limited data size
  • Cost-complexity pruning 
    • Recursively finds the node with the weakest link 
    • Nodes are characterised by alpha, and nodes with the smallest effective alpha are pruned first 
    • The cost of a tree T is then defined as I_\alpha(T) = I(T) + \alpha N, where I is an impurity measure, such as the total misclassification rate of the terminal nodes, \alpha is a tuning parameter, and N is the number of terminal nodes 
    • Using scikit-learn, we can fit a complex tree with no prior pruning and inspect the effective alphas and the corresponding total leaf impurities at each step of the pruning process (see the sketch after this list)
    • As alpha increases, more of the tree is pruned, producing a decision tree that generalises better
    • We can select the alpha that minimises the gap between the training and validation scores. 
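
A minimal sketch of this workflow in scikit-learn (my addition; the dataset and train/validation split are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow an unpruned tree, then ask for the effective alphas along its pruning path.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and compare train/validation scores; pick the alpha
# where the two scores are close, i.e. the tree generalises better.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  train={pruned.score(X_train, y_train):.3f}  "
          f"val={pruned.score(X_val, y_val):.3f}")
```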

Remarks 

A seemingly bad split early on may be followed by a good split later; therefore, we may want to grow a large tree first and prune it back to obtain a subtree.

\alpha is the price to pay for having a tree with many nodes, so minimising the penalised cost with a larger \alpha tends to select a smaller subtree. 

2. Ensemble Averaging

A single decision tree is typically not competitive enough for supervised learning, as it suffers from high variance. Bagging, random forests, and boosting, which produce and combine multiple trees to obtain a single prediction, can improve accuracy at the expense of interpretability. 

The goal of ensemble methods is to combine different classifiers into a meta-classifier that has better generalisation performance than each individual classifier alone. 

2.1 Majority Voting

Majority voting can be done by selecting the class label that is predicted by the majority of the classifiers. Accordingly, we train m different classifiers C_{1}, C_{2}, \cdots, C_{m}, combine the predicted labels of each classifier C_{j}, and select the class label \hat{y} that receives the most votes. 

Note: 

  • Majority voting refers to binary class decisions; for multiple classes, we use plurality voting;
  • Selecting the most voted class is also called 'hard voting'.

In binary classification (-1, 1) we can write the majority vote prediction as:

\begin{aligned} \hat{y} &= \text{mode} \{ C_1(x), C_2(x), \cdots, C_m(x) \}\\ C(x) &= \text{sgn}\left[ \sum_{j = 1}^m C_j(x)\right] = \begin{cases} 1 & \text{if } \sum_j C_j(x)\geq 0 \\ -1 & \text{otherwise} \end{cases} \end{aligned}

Sometimes, instead, we want to use weighted votes:

\hat{y} = \underset{i}{\text{arg max}}\sum_{j = 1}^m w_j 1_A(C_j(x) = i)

In particular, when the classifiers return the probability of the predicted class label, this is known as 'soft voting'. Soft voting often achieves better performance than hard voting, since more weight is given to highly confident votes. 
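
A minimal scikit-learn sketch of hard versus soft voting (my addition; the base estimators and weights are arbitrary placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("nb", GaussianNB()),
]

# Hard voting: plurality vote over the predicted labels.
hard = VotingClassifier(estimators, voting="hard").fit(X, y)

# Soft voting: argmax of the (optionally weighted) averaged class probabilities.
soft = VotingClassifier(estimators, voting="soft", weights=[2, 1, 1]).fit(X, y)

print(hard.score(X, y), soft.score(X, y))
```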

2.2 Bagging, pasting

A bagging classifier is an ensemble of 'base' classifiers, each of which is fitted on a random subset of the dataset. Their predictions are then pooled or aggregated to form a final prediction. This method can: 

  •  reduce the variance of the estimator;
  • thus reduce overfitting and increase prediction accuracy. 

Algorithm 

  1. Let n be the number of bootstrap samples 
  2. For i in 1 to n Do
  3.        Draw a bootstrap sample D_i of size m 
  4.        Train a base classifier h_i on D_i 
  5.  \hat{y} = \text{mode}\{h_1(x), h_2(x), \cdots, h_n(x)\}

Ideally, we would take many separate training sets from the population, build a separate prediction model on each, and average the resulting predictions. However, we usually do not have access to multiple training sets; instead, we use bootstrapping to obtain repeated samples from a single dataset and average the predictions. 

2.2.1 Pasting 

Bagging uses sampling with replacement; if the sampling is done without replacement, the method is called pasting.

Pasting is designed to use smaller subsets from the full dataset in cases where the training dataset does not fit into memory. 
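
A minimal scikit-learn sketch contrasting the two (my addition; the base estimator is passed positionally because its keyword name differs across scikit-learn versions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: each tree sees a bootstrap sample drawn with replacement.
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=0.8,
    bootstrap=True, random_state=0,
).fit(X, y)

# Pasting: the same, but the subsets are drawn without replacement.
pasting = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=0.8,
    bootstrap=False, random_state=0,
).fit(X, y)

print(bagging.score(X, y), pasting.score(X, y))
```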

Note

  • Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows the same instance to be sampled several times for the same predictor. 
  • Averaging methods generally work best when the predictors are as independent as possible, so one way of achieving this is to use diverse classifiers.
  • Diverse classifiers increase the chance that they make different types of errors, which, in combination, will improve the overall accuracy. 
  • In practice, bagging tends to work best with complex models, so it is particularly useful for decision trees. 

2.3 Boosting 

In machine learning, boosting is an ensemble meta-algorithm primarily for reducing bias, and also variance, in supervised learning. 

Remark 

  • A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing)
  • A strong learner is a classifier that is arbitrarily well-correlated with the true classification.

 https://en.wikipedia.org/wiki/Boosting_(machine_learning) 

There are two broad categories of boosting: adaptive boosting and gradient boosting. Both rely on the same concept of boosting "weak learners" (such as decision tree stumps) into "strong learners". A decision tree stump is a decision tree that has only one level.

Boosting is an iterative process in which the training set is reweighted at each iteration based on the mistakes the weak learner made (i.e. misclassifications); the two approaches differ mainly in how the weights are updated and how the classifiers are combined.

While boosting can increase the accuracy of a base learner, such as a decision tree or linear regression, it sacrifices intelligibility and interpretability. Furthermore, it is more difficult to implement due to the higher computational demand.  

2.3.1 Adaptive Boosting 

Intuitively, the general boosting procedure for AdaBoost can be outlined as:

  • Initialise a weight vector with uniform weights 
  • Loop:
    • Apply a weak learner to the weighted training examples 
      • Instead of the original training set, we may draw bootstrap samples with the weighted probabilities
    • Increase the weights of the misclassified examples 
  • (Weighted) majority voting on trained classifiers. 

Remark

Misclassified examples gain a higher weight and correctly classified examples lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified. 
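
A minimal scikit-learn sketch (my addition); by default AdaBoostClassifier boosts decision stumps (depth-1 trees):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each round reweights the training examples so that the next stump
# concentrates on the previously misclassified ones.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```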

2.3.2 Gradient Boosting 

Gradient boosting is used for regression, classification, and other tasks. When the decision tree is the weak learner, the resulting algorithm is called gradient boosted trees, which often outperforms random forests. 

Gradient boosting is an ensemble method that combines multiple weak learners (such as decision tree stumps) into a strong learner. Similar to AdaBoost (and in contrast to bagging), gradient boosting is a sequential (rather than parallel) algorithm - it is powerful but rather expensive to train. 

In gradient boosting, we optimise a differentiable loss function (e.g. mean squared error for regression or negative log-likelihood for classification) via consecutive rounds of boosting. The output of gradient boosting is an additive model of multiple weak learners (in contrast to AdaBoost, we do not apply majority voting to the ensemble of models).
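
A minimal scikit-learn sketch (my addition; the loss name "squared_error" follows recent scikit-learn versions, and the dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Each shallow tree is fitted to the negative gradient of the squared-error loss
# with respect to the current ensemble's predictions, then added to the ensemble.
gbr = GradientBoostingRegressor(
    loss="squared_error", n_estimators=300, learning_rate=0.05,
    max_depth=2, random_state=0,
).fit(X_train, y_train)

print(gbr.score(X_val, y_val))  # R^2 on held-out data
```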

2.3.3 Extreme Gradient Boosting 

Salient features of XGBoost that make it different from other gradient boosting algorithms are summarised in the comparison below.

Comparisons for GBDT and XGBoost
  • Classifier: GBDT uses CART; XGBoost supports CART as well as linear classifiers (logistic regression and linear models)
  • Training set: GBDT uses all the data as the training set; XGBoost can use a subset of the data as the training set
  • Optimisation: GBDT uses first-order differentiation; XGBoost uses a second-order Taylor expansion (Hessian)
  • Regularisation: GBDT has none; XGBoost adds L2 regularisation, reducing the model variance and the chance of over-fitting
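
A minimal sketch using the xgboost Python package (my addition; the parameter values are arbitrary placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# reg_lambda is the L2 regularisation term from the comparison above; the trees
# are fitted using a second-order (Hessian) approximation of the loss.
model = XGBClassifier(
    n_estimators=300, max_depth=3, learning_rate=0.1,
    subsample=0.8, reg_lambda=1.0,
).fit(X_train, y_train)

print(accuracy_score(y_val, model.predict(X_val)))
```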

2.4 Random Forests

Random forests are essentially bagged tree classifiers, but they decorrelate the trees by using a random sample of features each time a split in a tree is considered. The random forest algorithm can therefore be summarised in four steps: 

  1. Draw a random bootstrap sample of size n
  2. Grow a decision tree from the bootstrap sample. At each node: 
    1. Randomly select d features without replacement (typically d is the square root of the total number of predictors)
    2. Split the node using the feature that provides the best split according to the objective function 
  3. Repeat the steps above k times
  4. Aggregate the predictions of all trees to assign the class label by majority vote. 

Notes

  • The feature subset to consider at each node is a hyperparameter that we can tune. 
  • Instead of using a majority vote, in sk-learn the RandomForestClassifier averages the probabilistic predictions. 
  • If a random forest is built using all features, then it is simply bagging. 
  • You can also bootstrap features in the  BaggingClassifier using bootstrap_features = True

By not allowing each split to consider all of the available predictors, we ensure the bagged trees look different from each other. If there is a particularly strong set of predictors in the data, then without random feature selection the bagged trees will look quite similar to each other and their predictions will be highly correlated. 

Averaging highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities. 
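
A minimal scikit-learn sketch of the four steps above (my addition; max_features="sqrt" is the per-split feature subset and n_estimators the number of trees):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each tree is grown on a bootstrap sample, and only sqrt(n_features) candidate
# features are considered at every split, which decorrelates the trees.
forest = RandomForestClassifier(
    n_estimators=500, max_features="sqrt", bootstrap=True, random_state=0
)
print(cross_val_score(forest, X, y, cv=5).mean())
```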

3. Comparisons 

3.1 Trees

Advantages: 

  • Easy to explain: trees can be displayed graphically in an interpretable manner 
  • Make few assumptions about the training data (non-parametric)
  • Inherently multiclass: can also handle multitask output
  • Can handle different types of predictors: independent of feature scaling 
  • Can handle missing values 

Disadvantages:

  • Comparatively poor generalisation performance 
  • Easy to overfit: requires pruning 
  • High variance: a small change in the data can cause a large change in the estimated tree
  • Orthogonal decision boundaries: model is affected by the rotation of the data 
  • Cannot guarantee to return the globally optimal decision tree: locally optimal decisions are made at each node. 

3.2 Forest/ExtraTrees

Advantages: 

  • Comparatively good generalisation ability 
  • Comparatively small variability in prediction accuracy when tuning  
  • Comparatively good out-of-the-box performance 

Disadvantages:

  • Expensive 

To be continued; more to be added.
