[Machine Learning Algorithms] XGBoost


Summary

  • Where XGBoost reduces the computational burden
    • When solving for a leaf's output value, it uses a second-order Taylor expansion of the loss
    • When computing the similarity score, it works directly with the gradients and Hessians

Regression

  • First, an initial prediction (a constant) is given for all points.

1. Build the tree

  • From the root, calculate the similarity score for the residuals

    • $Similarity = \frac{\left(\sum_{i=1}^{N} r_i\right)^2}{N + \lambda}$

    • $\lambda$ is a regularization parameter intended to reduce the prediction's sensitivity to individual observations. It prevents overfitting the training data.

    • When $\lambda>0$, the similarity scores are smaller, and the amount of decrease is inversely proportional to the number of residuals in the node.

    • When $\lambda>0$, it is easier to prune leaves because the values for Gain are smaller.

    • When $\lambda>0$, in some cases (e.g., when the root has only two samples and both residuals are positive) we will still prune the branch even when $\gamma=0$, because $Gain<0$. Setting $\gamma=0$ does not turn off pruning.

  • Cluster similar residuals

    • The split points can be chosen as all midpoints between two adjacent observations (try all of them and pick the best one in the end)
    • When the residuals in a node are similar, or there is just one of them, they do not cancel out and the similarity score is relatively large (the split is good)
  • Calculate the Gain of splitting the residuals into two groups

$Gain = Left_{Similarity} + Right_{Similarity} - Root_{Similarity}$

  • Use the threshold (split point) that gave the largest Gain
  • Repeat the above steps until the tree reaches its depth limit (a small sketch of the similarity and Gain computation follows this list)
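
A minimal NumPy sketch of the similarity score and Gain computation above, assuming a single numeric feature, midpoint candidate thresholds, and made-up numbers (the function names are ours, not part of XGBoost's API):

```python
import numpy as np

def similarity_score(residuals, lam=1.0):
    """Regression similarity: (sum of residuals)^2 / (N + lambda)."""
    return residuals.sum() ** 2 / (len(residuals) + lam)

def gain(left_res, right_res, lam=1.0):
    """Gain = Left similarity + Right similarity - Root similarity."""
    root_res = np.concatenate([left_res, right_res])
    return (similarity_score(left_res, lam)
            + similarity_score(right_res, lam)
            - similarity_score(root_res, lam))

# Toy data: one feature and the residuals from the current predictions.
feature = np.array([10.0, 20.0, 25.0, 35.0])
residuals = np.array([-10.5, 6.5, 7.5, -7.5])

order = np.argsort(feature)
feature, residuals = feature[order], residuals[order]
candidates = (feature[:-1] + feature[1:]) / 2          # midpoints between neighbors
gains = [gain(residuals[feature < t], residuals[feature >= t], lam=0.0)
         for t in candidates]
print(candidates[int(np.argmax(gains))], max(gains))   # best threshold and its Gain
```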

2. Prune

  • Starting from the lowest branch in the tree, calculate $Gain - \gamma$; if the difference is negative, remove the branch.
    • $\gamma$ is the tree complexity parameter.
  • If we do not remove the lowest branch, we do not remove its parent branch either; pruning stops moving up the tree.

3. Output Value

  • Calculate output value for each leaf

    • $Output\ Value = \frac{\sum_{i=1}^{N} r_i}{N + \lambda}$

    • When $\lambda>0$, it reduces the amount that an individual observation adds to the overall prediction. (In the XGBoost library, the default value of $\lambda$, i.e. reg_lambda, is 1.)

    • When $\lambda>0$, it results in smaller output values for the leaves

4. Predictions

  • Make new predictions by starting with the initial prediction.

  • Add the output of the tree scaled by a learning rate: $prediction = initial\ prediction + learning\ rate \times output\ value\ of\ tree$

    • The learning rate $\eta$ has a default value of 0.3.
  • The new residuals are smaller than before

5. Build another tree on the new, smaller residuals, make new predictions, compute new residuals, and so on (a one-split sketch of this loop follows below).
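
To illustrate steps 3–5, here is a deliberately tiny sketch in which a single fixed split stands in for each fitted tree, so only the output-value formula and the learning-rate update are shown; the numbers and the fixed threshold are made up:

```python
import numpy as np

def leaf_output(residuals, lam=1.0):
    """Regression leaf output: sum of residuals / (N + lambda)."""
    return residuals.sum() / (len(residuals) + lam)

# Toy data: one feature, one target, initial constant prediction of 0.5.
feature = np.array([10.0, 20.0, 25.0, 35.0])
y = np.array([-10.0, 7.0, 8.0, -7.0])
prediction = np.full_like(y, 0.5)
eta, lam, threshold = 0.3, 1.0, 30.0   # learning rate, regularization, fixed split

for round_ in range(10):
    residuals = y - prediction
    left = feature < threshold
    outputs = np.where(left,
                       leaf_output(residuals[left], lam),
                       leaf_output(residuals[~left], lam))
    prediction = prediction + eta * outputs      # scale the tree output by eta
    print(round_, np.abs(residuals).mean())      # track the mean absolute residual
```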

About Parameters

  • When $\lambda>0$, it results in more pruning (by shrinking the similarity scores) and in smaller output values for the leaves.

Classification

1. Make an initial prediction first, such as a probability of 0.5 for every sample

2. Calculate similarity score

$Similarity\ Score = \frac{\left(\sum_{i=1}^{N} r_i\right)^2}{\sum_{i=1}^{N} p_i(1-p_i) + \lambda}$

$p_i$ is the previously predicted probability for sample $i$.

It is possible to split with the same feature at different thresholds.

The other steps are similar to the regression case: calculate the similarity scores, calculate the Gain, prune the tree, and calculate the output values.

3. Stopping Criterion When Building a Single Tree

  • the number of levels

  • the number of residuals in each leaf, which is determined (represented) by Cover, $Cover = \sum_{i=1}^{N} p_i(1-p_i)$

    • In the regression case, $Cover = N$. By default, the minimum value for Cover (min_child_weight) is 1, so with the default value Cover has no effect on how the regression tree is grown.
    • In the classification case, if min_child_weight is 1 and the Cover calculated for a leaf is smaller than min_child_weight, the branch is removed. In small worked examples it is usually set to 0 so that low-Cover leaves are kept (see the Cover sketch after this section).

Values for $\lambda$ greater than 0 reduce the sensitivity of the tree to individual observations by pruning and combining them with other observations.
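
A small sketch of the Cover check for a classification leaf, using the formula above (the function name and numbers are ours; in the library this corresponds to the min_child_weight check):

```python
import numpy as np

def cover(prev_probs):
    """Classification Cover: sum of p_i * (1 - p_i) over the samples in the leaf."""
    prev_probs = np.asarray(prev_probs)
    return (prev_probs * (1 - prev_probs)).sum()

min_child_weight = 1.0              # the default minimum Cover
leaf_probs = [0.5, 0.5]             # previously predicted probabilities in a leaf
print(cover(leaf_probs))            # 0.5, which is below the default threshold
print(cover(leaf_probs) >= min_child_weight)   # False: this leaf would not be allowed
```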

4. Output Value

  • $Output\ Value = \frac{\sum_{i=1}^{N} r_i}{\sum_{i=1}^{N} p_i(1-p_i) + \lambda}$

  • When $\lambda>0$, it reduces the amount that a single observation adds to the new prediction.

  • There is a transformation between odds and probability here (a worked sketch of the update follows after the formulas):

  • $log\left(\frac{p}{1-p}\right) = log(odds)$

    $log(odds)\ prediction = log(odds) + \eta \times Output\ Value$

    $New\ Probability = \frac{e^{log(odds)}}{1 + e^{log(odds)}}$
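
The sketch below strings these classification formulas together for the samples that land in one leaf: compute the output value, add it to the previous log(odds) scaled by the learning rate, and convert back to a probability (names and numbers are illustrative, not XGBoost's API):

```python
import numpy as np

def leaf_output_cls(residuals, prev_probs, lam=1.0):
    """Classification leaf output: sum(r_i) / (sum(p_i * (1 - p_i)) + lambda)."""
    return residuals.sum() / ((prev_probs * (1 - prev_probs)).sum() + lam)

def prob_to_log_odds(p):
    return np.log(p / (1 - p))

def log_odds_to_prob(z):
    return np.exp(z) / (1 + np.exp(z))

y = np.array([1.0, 1.0, 0.0])            # labels of samples in the same leaf
prev_p = np.array([0.5, 0.5, 0.5])       # previously predicted probabilities
residuals = y - prev_p
eta = 0.3

output = leaf_output_cls(residuals, prev_p, lam=0.0)
new_log_odds = prob_to_log_odds(prev_p) + eta * output
print(log_odds_to_prob(new_log_odds))    # updated probabilities for these samples
```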

5. Build new trees to fit new residuals
6. Stopping Criteria for Building More Trees

  • the residuals are sufficiently small
  • the maximum number of trees is reached

Mathematical Details

1. Loss Function

  • Regression

    • $L(y_i, p_i) = \frac{1}{2}(y_i - p_i)^2$
  • Classification

  • $L(y_i, p_i) = -\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$

    The negative log-likelihood is the commonly used loss function.

2. Objective Function in building trees

  • $\left[\sum_{i=1}^{n} L(y_i, p_i)\right] + \frac{1}{2}\lambda O_{value}^2 + \gamma T$

    • $T$ is the number of terminal nodes (leaves) in the tree. $\gamma$ is a user-definable penalty meant to encourage pruning. In the following discussion $\gamma T$ is omitted, because pruning takes place after the full tree is built and plays no role in deriving optimal output values or similarity scores.
    • $\frac{1}{2}\lambda O_{value}^2$ is the regularization term; just like in ridge regression, if $\lambda>0$ we shrink $O_{value}$.
  • The goal is to find the output value ($O_{value}$) that minimizes the whole equation; $O_{value}$ is the variable being optimized.

    In the regression case,
    $\left[\sum_{i=1}^{n} L\left(y_i, p_i^{0} + O_{value}\right)\right] + \frac{1}{2}\lambda O_{value}^2$

    • The more emphasis we give the regularization penalty by increasing $\lambda$, the closer the optimal $O_{value}$ gets to 0.
    • $O_{value}$ is different for each leaf; the sums here run over the samples in the same leaf.
  • To solve for the optimal output value of a leaf, take a second-order Taylor approximation of the loss (XGBoost uses this for both regression and classification, while regular, "unextreme" Gradient Boost used it only for classification)
    $L(y_i, p_i + O_{value}) \approx L(y_i, p_i) + \left[\frac{d}{dp_i} L(y_i, p_i)\right] O_{value} + \frac{1}{2}\left[\frac{d^2}{dp_i^2} L(y_i, p_i)\right] O_{value}^2$

    • Use $g$ to represent the first derivative (gradient) of the loss function and $h$ to represent the second derivative (Hessian).

    • Rewrite the above expression in terms of $g$ and $h$, and omit $L(y_i, p_i)$, which does not depend on $O_{value}$:
      $(g_1 + g_2 + \cdots + g_n) O_{value} + \frac{1}{2}(h_1 + h_2 + \cdots + h_n + \lambda) O_{value}^2$

    • Take the derivative with respect to $O_{value}$, set it to 0, and solve for $O_{value}$ (this holds for both regression and classification):
      $O_{value} = \frac{-(g_1 + g_2 + \cdots + g_n)}{(h_1 + h_2 + \cdots + h_n + \lambda)}$
      This is the optimal output value for the leaf.

      For regression, $g_i = -(y_i - p_i)$ is the negative residual and $h_i = 1$, so
      $O_{value} = \frac{\sum_{i=1}^{N} r_i}{N + \lambda}$

  • In the classification case,
    $L(y_i, p_i) = -\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$
    Taking the derivatives with respect to the log(odds) gives $g_i = -(y_i - p_i)$ and $h_i = p_i(1 - p_i)$, which yields the classification output value shown earlier.

  • Similarity Score
    $-(g_1 + g_2 + \cdots + g_n) O_{value} - \frac{1}{2}(h_1 + h_2 + \cdots + h_n + \lambda) O_{value}^2$
    XGBoost plugs the optimal $O_{value}$ into this simplified (negated) objective to obtain the similarity score used while building trees; this is the form used in the paper:
    $Similarity\ Score = \frac{1}{2} \frac{(g_1 + g_2 + \cdots + g_n)^2}{(h_1 + h_2 + \cdots + h_n + \lambda)}$
    In the implementation, however, the similarity score actually used is 2 times that number (it is only a relative measure, so the constant factor does not matter):
    $Similarity\ Score = \frac{(g_1 + g_2 + \cdots + g_n)^2}{(h_1 + h_2 + \cdots + h_n + \lambda)}$
    This is an example of how XGBoost reduces the amount of computation: $g_i$ and $h_i$ can be calculated directly from the gradients and Hessians (see the sketch below).

  • Cover is the sum of the Hessians, $\sum_{i=1}^{N} h_i$.
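
A generic sketch of the two quantities the implementation actually evaluates, written only in terms of the gradients $g$ and Hessians $h$ (function names are ours; the regression and classification special cases fall out by plugging in the corresponding $g_i$ and $h_i$):

```python
import numpy as np

def optimal_output_value(g, h, lam=1.0):
    """O_value = -sum(g) / (sum(h) + lambda), for any twice-differentiable loss."""
    return -np.sum(g) / (np.sum(h) + lam)

def similarity_score(g, h, lam=1.0):
    """Similarity score as implemented: sum(g)^2 / (sum(h) + lambda)."""
    return np.sum(g) ** 2 / (np.sum(h) + lam)

# Regression (squared error): g_i = -(y_i - p_i) = -r_i and h_i = 1.
r = np.array([-10.5, 6.5, 7.5])
print(optimal_output_value(-r, np.ones_like(r), lam=0.0))   # equals sum(r) / (N + lambda)

# Classification (log loss on the log(odds)): g_i = -(y_i - p_i), h_i = p_i * (1 - p_i).
y, p = np.array([1.0, 0.0]), np.array([0.5, 0.5])
g, h = -(y - p), p * (1 - p)
print(optimal_output_value(g, h, lam=0.0), similarity_score(g, h, lam=0.0))
```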

Optimization

  • These techniques make XGBoost relatively efficient and fast on large training datasets.

1. Approximate Greedy Algorithm

  • Greedy algorithm: make each split decision without looking ahead to see whether it is the absolute best choice in the long term; just take the best Gain at the time.
    • XGBoost can build a tree relatively quickly this way, but if the data is large and complex, it is still not efficient.
  • Approximate greedy algorithm: instead of testing every single threshold, divide the data into quantiles and only use the quantiles as candidate thresholds to split the observations. By default, XGBoost uses about 33 quantiles (a quantile-candidate sketch follows this list).
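
A rough illustration of the quantile idea, using plain (unweighted) quantiles for now; the "about 33 quantiles" figure quoted above would mean that cut points like these are the only thresholds scored with the Gain formula:

```python
import numpy as np

feature = np.random.default_rng(0).normal(size=10_000)

n_quantiles = 33
# Interior quantile boundaries become the only candidate split thresholds.
candidates = np.quantile(feature, np.linspace(0, 1, n_quantiles + 1)[1:-1])
print(len(candidates))   # 32 candidate thresholds instead of ~10,000 midpoints
```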

2. Sketches (Parallel Learning)

  • Steps:
    • Split the dataset into small pieces and put the pieces on different computers in a network (so they can be worked on at the same time)
    • The quantile sketch algorithm combines the values from each computer to make an approximate histogram
    • The approximate histogram is then used to calculate approximate weighted quantiles (the weights are recalculated after building each tree):
      • With weighted quantiles, each observation has a corresponding weight, and the sum of the weights is the same in each quantile
      • The weights are derived from the Cover metric (the weight for each observation is the Hessian of the loss function)
        • For regression, the weights are all equal to 1
        • For classification, when we do not have a lot of confidence in the prediction (probability near 0.5), the weight is relatively large. By dividing the observations into quantiles whose weight sums are similar, observations with low-confidence predictions end up in separate bins (we get smaller quantiles where we need them; a weighted-quantile sketch follows this list)
    • The approximate greedy algorithm uses approximate quantiles
  • XGBoost only uses the Approximate Greedy Algorithm, Parallel Learning and Weighted Quantile Sketch when the training dataset is huge. When the dataset is small, XGBoost just uses a normal, everyday Greedy Algorithm.
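
A simplified stand-in for the weighted quantile idea (not the actual sketch data structure): sort the feature, use the Hessians as weights, and place cut points so that each bin holds roughly the same total weight.

```python
import numpy as np

def weighted_quantile_cuts(values, weights, n_bins):
    """Cut points such that every bin holds roughly the same total weight."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights) / weights.sum()         # cumulative weight fraction
    targets = np.linspace(0, 1, n_bins + 1)[1:-1]    # interior bin boundaries
    return values[np.searchsorted(cum, targets)]

rng = np.random.default_rng(1)
feature = rng.normal(size=1_000)
prev_p = rng.uniform(0.05, 0.95, size=1_000)   # previous predicted probabilities
weights = prev_p * (1 - prev_p)                # classification Hessians (per-sample Cover)
print(weighted_quantile_cuts(feature, weights, n_bins=10))
```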

3. Sparsity-Aware Split Finding

  • For datasets with missing values in the features
    • Split the data into two tables: one with and one without the missing feature values
    • Calculate $Gain_{Left}$ and $Gain_{Right}$ for each candidate split point, once with the residuals of the missing-value rows placed in the left leaf and once with them placed in the right leaf
    • Finally choose the threshold (and default direction) that gives the largest $Gain$ overall
    • All future observations (e.g., at test time) without a value for that feature will go down this default path (a small sketch follows this list)
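
A toy sketch of the idea (not the library's actual code): for a given threshold, score the split twice, once sending the missing-value rows left and once sending them right, and keep the better direction as the default path.

```python
import numpy as np

def similarity(res, lam=0.0):
    return res.sum() ** 2 / (len(res) + lam) if len(res) else 0.0

def best_default_direction(feature, residuals, threshold, lam=0.0):
    """Return (gain, direction) for a threshold, trying missing-left and missing-right."""
    missing = np.isnan(feature)
    left = feature < threshold                    # NaN comparisons evaluate to False
    root = similarity(residuals, lam)
    best = None
    for direction in ("left", "right"):
        go_left = (left | missing) if direction == "left" else (left & ~missing)
        g = (similarity(residuals[go_left], lam)
             + similarity(residuals[~go_left], lam) - root)
        if best is None or g > best[0]:
            best = (g, direction)
    return best

feature = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
residuals = np.array([-2.0, 4.0, -1.0, 5.0, -3.0])
print(best_default_direction(feature, residuals, threshold=4.0))
```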

4. Cache-Aware Access

  • Memory hierarchy: CPU cache (fastest), main memory (fast), hard drive (slow)
  • To run the program fast, maximize what can be done with the cache memory
  • XGBoost puts the gradients and Hessians in the cache so that it can rapidly calculate similarity scores and output values

5. Blocks for out-of-core computation

  • When the dataset is too large, some of the data must be stored on the hard drive. Because reading and writing data to the hard drive is slow, XGBoost tries to minimize these actions by compressing the data
  • By spending a little CPU time uncompressing the data, we avoid spending a lot of time accessing the hard drive
  • A database technique called Sharding is used to speed up disk access: when more than one drive is used and the CPU needs data, all the drives can read data at the same time

4 and 5 take the computer hardware into account.

6. Other Techniques

  • XGBoost can also speed things up by building each tree with only a random subset of the data (subsample) or by only looking at a random subset of features when deciding how to split the data (colsample_bytree)

Implementation Details

1. Drop unneeded data and reformat the columns

  • Drop columns in place, without making another copy of the dataframe:
df.drop(["column_to_drop"], axis=1, inplace=True)  # "column_to_drop" is a placeholder column name
  • If a column contains only a single value, do not use it for prediction.
  • If you want to plot the tree, remove the spaces in the feature values and in the column names and replace them with underscores:
df["City"].replace(" ", "_", regex=True, inplace=True)
df.columns = df.columns.str.replace(" ", "_")
df.replace(" ", "_", regex=True, inplace=True)
2. Deal with missing data
  • XGBoost has default behavior for missing data, so we only have to identify the missing values and make sure they are set to 0.
    • XGBoost uses sparse matrices: it only keeps track of the non-zero entries and does not allocate memory for the 0s (memory efficient).
3. Format the data for the XGBoost model
  • The reason for One-Hot encoding categorical variables, rather than simply replacing the categories with integers, is that XGBoost would otherwise tend to group numerically adjacent categories together.
  • One-Hot encoding is not meant for linear or logistic regression, but it is great for trees.
4. Build a preliminary XGBoost model
  • Use stratify to ensure that y has the same class distribution in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state=42, stratify=y)
  • To see how the model performs on the test set, draw the confusion matrix

  • XGBoost has a parameter, scale_pos_weight, that helps with imbalanced data: it adds a larger penalty to misclassified minority-class samples, so the trees try harder to classify the minority class correctly.

  • If the dataset is imbalanced, balance the positive and negative weights via scale_pos_weight (the ratio of negative instances to positive instances) and use AUC (an overall performance metric) for evaluation

  • Use subsample and colsample_bytree to sample only part of the rows and columns when training each tree (an end-to-end sketch follows below)
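
Putting the implementation notes together, here is a hedged end-to-end sketch. The toy DataFrame, column names, and parameter values are ours; only the xgboost and scikit-learn calls themselves (XGBClassifier, train_test_split, confusion_matrix, roc_auc_score, scale_pos_weight, subsample, colsample_bytree) are real library features.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical data: a categorical column to One-Hot encode plus a numeric one.
df = pd.DataFrame({
    "City": ["New_York", "Boston", "Boston", "New_York"] * 50,
    "Age": np.arange(200) % 60,
})
X_encoded = pd.get_dummies(df, columns=["City"], dtype=int)
y = pd.Series(([0] * 3 + [1]) * 50)              # imbalanced binary target

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, random_state=42, stratify=y)

# Weight the positive (minority) class by the negative/positive ratio.
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

model = xgb.XGBClassifier(
    objective="binary:logistic",
    eval_metric="auc",
    scale_pos_weight=scale_pos_weight,
    subsample=0.9,              # fraction of rows sampled per tree
    colsample_bytree=0.5,       # fraction of columns sampled per tree
    learning_rate=0.3,
    max_depth=6,
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print(confusion_matrix(y_test, pred))
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```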

Notes taken from Youtube channel StatQuest with Josh Starmer.
