Machine Learning Review Note - Tree-based methods

Note: these notes are written in English because I do not know the Chinese terms for many of the concepts. They are my own summary; discussion is welcome.

1. Decision Tree

We mainly focus on the CART - Classification and Regression Tree

Tree-based methods segment the predictor space into a number of simple regions. Generally, the splitting procedure continues until:

  • all the samples within a given node belong to the same class
  • the maximum depth of the tree is reached 
  • a split cannot be found that improves the model 

The process of growing a decision tree can be expressed as a recursive algorithm as follows:

  1. Pick a feature such that splitting the parent node on it yields the largest information gain, and stop if the information gain is not positive 
  2. Stop if child nodes are pure or no improvement in class purity can be made
  3. Go back to step 1 for each of the two child nodes

1.1 Information Gain

The algorithm starts at the tree root and splits the data on the feature that gives the largest information gain. This procedure relies on calculating the difference between an impurity measure of a parent node D_p and the impurities of its child nodes; the information gain is high when the weighted sum of the child-node impurities is low. We can maximise the information gain at each split using 

IG(D_p,f) = I(D_p) - \sum_{j = 1}^m\frac{N_j}{N_p}I(D_j),

where I is the impurity measure, N_p is the total number of samples at the parent node, and N_j is the number of samples in the j-th child node. 
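
As an illustration (my addition, not from the original note), here is a minimal NumPy sketch of this computation, using Gini impurity as the impurity measure I and hypothetical class-count arrays:

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node from its per-class sample counts."""
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_counts, children_counts):
    """IG(D_p, f) = I(D_p) - sum_j (N_j / N_p) * I(D_j)."""
    n_parent = parent_counts.sum()
    weighted_child_impurity = sum(
        (child.sum() / n_parent) * gini(child) for child in children_counts
    )
    return gini(parent_counts) - weighted_child_impurity

# Hypothetical binary split: the parent holds 40/40 samples of two classes,
# and the split sends (30, 10) to the left child and (10, 30) to the right.
parent = np.array([40, 40])
children = [np.array([30, 10]), np.array([10, 30])]
print(information_gain(parent, children))  # 0.125 > 0, so the split is kept
```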

Remarks

The CART algorithm is greedy: it searches for the optimal split at each level without checking whether that split leads to the best possible impurity reduction further down the tree. 

1.2 Classification Error 

This is simply the fraction of the training observations in a region that do not belong to the most common class.
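
In symbols (my notation, with p(i|D) denoting the proportion of class-i samples at node D), this is

I_E(D) = 1 - \max_{i} p(i|D)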

Classification error is rarely used for computing information gain in practice, because a split can leave the classification error unchanged even when it improves class purity, so the tree stops growing too early. This is not the case for a strictly concave impurity function such as entropy or Gini. 

1.3 Entropy Impurity 

For all non-empty classes (p(i|D) ≠ 0), entropy is given by 

I_H(D) = -\sum_{i = 1}^{c} p(i|D)\log_2 p(i|D),

where p(i|D) is the proportion of the samples at node D that belong to class i.
The entropy is therefore 0 if all samples at the node belong to the same class and maximal if we have a uniform class distribution. 

1.4 Gini Impurity 

Gini impurity measures the expected probability of misclassifying a randomly chosen sample if it were labelled at random according to the class distribution at the node. It is also a measure of node 'purity': a small value indicates that a node contains predominantly observations from a single class. 
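
For reference (my addition, as the formula is not spelled out above), the Gini impurity of a node D with class proportions p(i|D) is

I_G(D) = \sum_{i = 1}^{c} p(i|D)\left(1 - p(i|D)\right) = 1 - \sum_{i = 1}^{c} p(i|D)^2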

Whether we use entropy or Gini impurity generally does not matter much, because both have a similar concave shape.

Gini tends to isolate the most frequent class in its own branch, whereas entropy tends to produce slightly more balanced trees. 
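
As a minimal scikit-learn sketch (my addition, using the built-in iris dataset purely as a placeholder), the choice of impurity measure is just the criterion hyperparameter:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow the same shallow tree with the two impurity measures; in practice
# the resulting splits and accuracy are usually very similar.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    tree.fit(X, y)
    print(criterion, tree.score(X, y))
```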

1.5 Feature importance 

Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability is the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature. For a binary tree, the Gini-based node importance is:

ni_j = w_j I_j - w_{\text{left}(j)}I_{\text{left}(j)} - w_{\text{right}(j)}I_{\text{right}(j)}

where w_j is the weighted number of samples reaching node j (the node probability) and I_j is the impurity value of node j. The importance of each feature in a single decision tree is then calculated as: 

fi_i = \frac{\sum_{j:\text{node j split on feature } i}ni_j }{\sum_{k\in \text{all nodes}}ni_k}

These can be normalised to a value between 0 and 1 by dividing by the sum of all feature importance values. 

The final feature importance, at the random forest level, is the average over all trees: the feature's importance values are summed over the trees and divided by the total number of trees. 
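
A minimal scikit-learn sketch of reading these impurity-based importances (my addition; the dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds the normalised, tree-averaged impurity importances
# described above (they sum to 1); print the five largest.
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```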

Feature importances can be misleading for high-cardinality features; and if features are highly correlated, one of them may be ranked highly while the information carried by the others is not fully captured. 

Permutation importance 

The permutation feature importance is defined as the decrease in a model score when a single feature's values are randomly shuffled. This procedure breaks the relationship between the feature and the target, so the drop in the model score indicates how much the model depends on the feature. 
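
A minimal scikit-learn sketch (my addition), using sklearn.inspection.permutation_importance on a held-out validation set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature column several times and record the drop in the score;
# a large mean drop means the model relies heavily on that feature.
result = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)
```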

1.6 Pruning 

If a decision tree is not pruned, it has a high risk of overfitting to the training data. 

1.6.1 Pre-pruning 

Pre-pruning sets a prior limit on the number of nodes or the tree depth. We could also set a minimum number of data points in each node. 

1.6.2 Post-pruning 

In general, post-pruning consists of going back through the tree once it has been created, removing branches that do not contribute significantly to the error reduction, and replacing them with leaf nodes. Two common approaches are:

  • Reduced-error pruning 
    • Greedily remove nodes based on validation set performance 
    • Generally improves performance but can be problematic for limited data size
  • Cost-complexity pruning 
    • Recursively finds the node with the weakest link 
    • Nodes are characterised by alpha, and nodes with the smallest effective alpha are pruned first 
    • The cost of a tree T is then defined as I_\alpha(T) = I(T) + \alpha N, where I is an impurity measure, such as the total misclassification rate of the terminal nodes, \alpha is a tuning parameter, and N is the number of terminal nodes 
    • Using scikit-learn, we can fit a complex tree with no prior pruning and inspect the effective alphas and the corresponding total leaf impurities at each step of the pruning process (see the sketch after this list)
    • As alpha increases, more of the tree is pruned, producing a decision tree that generalises better
    • We can select the alpha that minimises the gap between the training and validation scores. 
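
A minimal sketch of this workflow in scikit-learn (my addition; the dataset and train/validation split are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow an unpruned tree, then ask for the effective alphas along its pruning path.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and compare train/validation scores; pick the alpha
# where the two scores are close, i.e. the tree generalises better.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  train={pruned.score(X_train, y_train):.3f}  "
          f"val={pruned.score(X_val, y_val):.3f}")
```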

Remarks 

A seemingly bad split early on may be followed by a good split later; therefore, we may want to grow a large tree first and prune it back to obtain a subtree.

\alpha is the price to pay for having a tree with many nodes, so minimising the penalised cost with a larger \alpha tends to select a smaller subtree. 

2. Ensemble Averaging

A single decision tree is typically not competitive enough for supervised learning, as it suffers from high variance. Bagging, random forests, and boosting, which produce and combine multiple trees to obtain a single prediction, can improve accuracy at the expense of interpretability. 

The goal of ensemble methods is to combine different classifiers into a meta-classifier that has better generalisation performance than each individual classifier alone. 

2.1 Majority Voting

Majority voting can be done by selecting the class label that is predicted by the majority of the classifiers. Accordingly, we train m different classifiers C_{1}, C_{2}, \cdots, C_{m}, combine the predicted labels of each classifier C_{j}, and select the class label \hat{y} that receives the most votes. 

Note: 

  • Majority voting refers to binary class decisions; for multiple classes, we use plurality voting;
  • Selecting the most voted class is also called 'hard voting'.

In binary classification (-1, 1) we can write the majority vote prediction as:

\begin{aligned} \hat{y} &= \text{mode} \{ C_1(x), C_2(x), \cdots, C_m(x) \}\\ C(x) &= \text{sgn}\left[ \sum_{j = 1}^m C_j(x)\right] = \begin{cases} 1 & \text{if } \sum_j C_j(x)\geq 0 \\ -1 & \text{otherwise} \end{cases} \end{aligned}

Sometimes, instead, we want to use weighted votes:

\hat{y} = \underset{i}{\text{arg max}}\sum_{j = 1}^m w_j 1_A(C_j(x) = i)

In particular, when the classifiers return the probability of the predicted class label, this is known as 'soft voting'. Soft voting often achieves better performance than hard voting, since more weight is given to highly confident votes. 
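
A minimal scikit-learn sketch of hard versus soft voting (my addition; the base estimators and weights are arbitrary placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("nb", GaussianNB()),
]

# Hard voting: plurality vote over the predicted labels.
hard = VotingClassifier(estimators, voting="hard").fit(X, y)

# Soft voting: argmax of the (optionally weighted) averaged class probabilities.
soft = VotingClassifier(estimators, voting="soft", weights=[2, 1, 1]).fit(X, y)

print(hard.score(X, y), soft.score(X, y))
```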

2.2 Bagging, pasting

A bagging classifier is an ensemble of 'base' classifiers, each of which is fitted on a random subset of the dataset. Their predictions are then pooled or aggregated to form a final prediction. This method can: 

  •  reduce the variance of the estimator;
  • thus reduce overfitting and increase prediction accuracy. 

Algorithm 

  1. Let n be the number of bootstrap samples 
  2. For i in 1 to n Do
  3.        Draw a bootstrap sample D_i of size m 
  4.        Train a base classifier h_i on D_i 
  5.  \hat{y} = \text{mode}\{h_1(x), h_2(x), \cdots, h_n(x)\}

Ideally, we would take many separate training sets from the population, build a separate prediction model on each, and average the resulting predictions. However, we usually do not have access to multiple training sets; instead, we use bootstrapping to obtain repeated samples from a single dataset and average the predictions. 

2.2.1 Pasting 

Bagging uses sampling with replacement; if the sampling is done without replacement, the method is called pasting.

Pasting is designed to use smaller subsets from the full dataset in cases where the training dataset does not fit into memory. 
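
A minimal scikit-learn sketch contrasting the two (my addition; the base estimator is passed positionally because its keyword name differs across scikit-learn versions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: each tree sees a bootstrap sample drawn with replacement.
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=0.8,
    bootstrap=True, random_state=0,
).fit(X, y)

# Pasting: the same, but the subsets are drawn without replacement.
pasting = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=0.8,
    bootstrap=False, random_state=0,
).fit(X, y)

print(bagging.score(X, y), pasting.score(X, y))
```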

Note

  • Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows the same instance to be sampled several times for the same predictor. 
  • Averaging methods generally work best when the predictors are as independent as possible, so one way of achieving this is to use diverse classifiers.
  • Diverse classifiers increase the chance that they make different types of errors, which, in combination, will improve the overall accuracy. 
  • In practice, bagging tends to work best with complex models, so it is particularly useful for decision trees. 

2.3 Boosting 

In machine learning, boosting is an ensemble meta-algorithm primarily for reducing bias, and also variance, in supervised learning. 

Remark 

  • A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing)
  • A strong learner is a classifier that is arbitrarily well-correlated with the true classification.

 https://en.wikipedia.org/wiki/Boosting_(machine_learning) 

There are two broad categories of boosting: adaptive boosting and gradient boosting. Both rely on the same concept of boosting "weak learners" (such as decision tree stumps) into "strong learners". A decision tree stump is a decision tree that has only one level.

Boosting is an iterative process in which the training set is reweighted at each iteration based on the mistakes the weak learner made (i.e. misclassifications); the two approaches differ mainly in how the weights are updated and how the classifiers are combined.

While boosting can increase the accuracy of a base learner, such as a decision tree or linear regression, it sacrifices intelligibility and interpretability. Furthermore, it is more difficult to implement due to the higher computational demand.  

2.3.1 Adaptive Boosting 

Intuitively, the general boosting procedure for AdaBoost can be outlined as:

  • Initialise a weight vector with uniform weights 
  • Loop:
    • Apply a weak learner to the weighted training examples 
      • Instead of the original training set, we may draw bootstrap samples with the weighted probabilities
    • Increase the weights of the misclassified examples 
  • (Weighted) majority voting on trained classifiers. 

Remark

Misclassified examples gain a higher weight and correctly classified examples lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified. 
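
A minimal scikit-learn sketch (my addition); by default AdaBoostClassifier boosts decision stumps (depth-1 trees):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each round reweights the training examples so that the next stump
# concentrates on the previously misclassified ones.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```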

2.3.2 Gradient Boosting 

Gradient boosting is used for regression, classification, and other tasks. When the decision tree is the weak learner, the resulting algorithm is called gradient boosted trees, which often outperforms random forests. 

Gradient boosting is an ensemble method that combines multiple weak learners (such as decision tree stumps) into a strong learner. Similar to AdaBoost (and in contrast to bagging), gradient boosting is a sequential (rather than parallel) algorithm - it is powerful but rather expensive to train. 

In gradient boosting, we optimise a differentiable loss function (e.g. mean squared error for regression or negative log-likelihood for classification) via consecutive rounds of boosting. The output of gradient boosting is an additive model of multiple weak learners (in contrast to AdaBoost, we do not apply majority voting to the ensemble of models).
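
A minimal scikit-learn sketch (my addition; the loss name "squared_error" follows recent scikit-learn versions, and the dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Each shallow tree is fitted to the negative gradient of the squared-error loss
# with respect to the current ensemble's predictions, then added to the ensemble.
gbr = GradientBoostingRegressor(
    loss="squared_error", n_estimators=300, learning_rate=0.05,
    max_depth=2, random_state=0,
).fit(X_train, y_train)

print(gbr.score(X_val, y_val))  # R^2 on held-out data
```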

2.3.3 Extreme Gradient Boosting 

Salient features of XGBoost that make it different from other gradient boosting algorithms are summarised in the comparison below.

Comparisons for GBDT and XGBoost
  • Classifier: GBDT uses CART; XGBoost supports CART as well as linear classifiers (logistic regression and linear models)
  • Training set: GBDT uses all the data as the training set; XGBoost can use a subset of the data as the training set
  • Optimisation: GBDT uses first-order differentiation; XGBoost uses a second-order Taylor expansion (Hessian)
  • Regularisation: GBDT has none; XGBoost adds L2 regularisation, reducing the model variance and the chance of over-fitting
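
A minimal sketch using the xgboost Python package (my addition; the parameter values are arbitrary placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# reg_lambda is the L2 regularisation term from the comparison above; the trees
# are fitted using a second-order (Hessian) approximation of the loss.
model = XGBClassifier(
    n_estimators=300, max_depth=3, learning_rate=0.1,
    subsample=0.8, reg_lambda=1.0,
).fit(X_train, y_train)

print(accuracy_score(y_val, model.predict(X_val)))
```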

2.4 Random Forests

Random forests are essentially bagged tree classifiers, but they decorrelate the trees by using a random sample of features each time a split in a tree is considered. The random forest algorithm can therefore be summarised in four steps: 

  1. Draw a random bootstrap sample of size n
  2. Grow a decision tree from the bootstrap sample. At each node: 
    1. Randomly select d features without replacement (typically d is the square root of the total number of predictors)
    2. Split the node using the feature that provides the best split according to the objective function 
  3. Repeat the steps above k times
  4. Aggregate the predictions of all trees to assign the class label by majority vote. 

Notes

  • The feature subset to consider at each node is a hyperparameter that we can tune. 
  • Instead of using a majority vote, in sk-learn the RandomForestClassifier averages the probabilistic predictions. 
  • If a random forest is built using all features, then it is simply bagging. 
  • You can also bootstrap features in the  BaggingClassifier using bootstrap_features = True

By not allowing each split to consider all of the available predictors, we ensure the bagged trees look different from each other. If there is a particularly strong set of predictors in the data, then without random feature selection the bagged trees will look quite similar to each other and their predictions will be highly correlated. 

Averaging highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities. 
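
A minimal scikit-learn sketch of the four steps above (my addition; max_features="sqrt" is the per-split feature subset and n_estimators the number of trees):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each tree is grown on a bootstrap sample, and only sqrt(n_features) candidate
# features are considered at every split, which decorrelates the trees.
forest = RandomForestClassifier(
    n_estimators=500, max_features="sqrt", bootstrap=True, random_state=0
)
print(cross_val_score(forest, X, y, cv=5).mean())
```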

3. Comparisons 

3.1 Trees

Advantages: 

  • Easy to explain: trees can be displayed graphically in an interpretable manner 
  • Make few assumptions about the training data (non-parametric)
  • Inherently multiclass: can also handle multitask output
  • Can handle different types of predictors: independent of feature scaling 
  • Can handle missing values 

Disadvantages:

  • Comparatively poor generalisation performance 
  • Easy to overfit: requires pruning 
  • High variance: a small change in the data can cause a large change in the estimated tree
  • Orthogonal decision boundaries: model is affected by the rotation of the data 
  • Cannot guarantee to return the globally optimal decision tree: locally optimal decisions are made at each node. 

3.2 Forest/ExtraTrees

Advantages: 

  • Comparatively good generalisation ability 
  • Comparatively small variability in prediction accuracy when tuning  
  • Comparatively good out-of-the-box performance 

Disadvantages:

  • Expensive 

To be continued; more to be added.
