[Machine Learning Algorithms] XGBoost


Summary

  • Where XGBoost reduces the computational burden
    • When solving for a leaf's output value, it uses a second-order Taylor expansion of the loss
    • When computing the similarity score, it works directly with the gradients and Hessians

Regression

  • First, an initial prediction (a constant) is given for all points.

1. Build the tree

  • From the root, calculate the similarity score for the residuals

    • $Similarity = \frac{\left(\sum_{i=1}^{N} r_i\right)^2}{N + \lambda}$

    • $\lambda$ is a regularization parameter intended to reduce the prediction's sensitivity to individual observations. It prevents overfitting the training data.

    • When $\lambda>0$, the similarity scores are smaller, and the amount of decrease is inversely proportional to the number of residuals in the node.

    • When $\lambda>0$, it is easier to prune leaves because the values for Gain are smaller.

    • When $\lambda>0$, in some cases (e.g., when the root has only two samples and both residuals are positive) we will still prune the branch even when $\gamma=0$, because $Gain<0$. Setting $\gamma=0$ does not turn off pruning.

  • Cluster similar residuals

    • The split points can be chosen as all midpoints between two adjacent observations (try all of them and pick the best one in the end)
    • When the residuals in a node are similar, or there is just one of them, they do not cancel out and the similarity score is relatively large (the split is good)
  • Calculate the Gain of splitting the residuals into two groups

$Gain = Left_{Similarity} + Right_{Similarity} - Root_{Similarity}$

  • Use the threshold (split point) that gave the largest Gain
  • Repeat the above steps until the tree reaches its depth limit (a small sketch of the similarity and Gain computation follows this list)
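
A minimal NumPy sketch of the similarity score and Gain computation above, assuming a single numeric feature, midpoint candidate thresholds, and made-up numbers (the function names are ours, not part of XGBoost's API):

```python
import numpy as np

def similarity_score(residuals, lam=1.0):
    """Regression similarity: (sum of residuals)^2 / (N + lambda)."""
    return residuals.sum() ** 2 / (len(residuals) + lam)

def gain(left_res, right_res, lam=1.0):
    """Gain = Left similarity + Right similarity - Root similarity."""
    root_res = np.concatenate([left_res, right_res])
    return (similarity_score(left_res, lam)
            + similarity_score(right_res, lam)
            - similarity_score(root_res, lam))

# Toy data: one feature and the residuals from the current predictions.
feature = np.array([10.0, 20.0, 25.0, 35.0])
residuals = np.array([-10.5, 6.5, 7.5, -7.5])

order = np.argsort(feature)
feature, residuals = feature[order], residuals[order]
candidates = (feature[:-1] + feature[1:]) / 2          # midpoints between neighbors
gains = [gain(residuals[feature < t], residuals[feature >= t], lam=0.0)
         for t in candidates]
print(candidates[int(np.argmax(gains))], max(gains))   # best threshold and its Gain
```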

2. Prune

  • Starting from the lowest branch in the tree, calculate $Gain - \gamma$; if the difference is negative, remove the branch.
    • $\gamma$ is the tree complexity parameter.
  • If we do not remove the lowest branch, we do not remove its parent branch either; pruning stops moving up the tree.

3. Output Value

  • Calculate output value for each leaf

    • $Output\ Value = \frac{\sum_{i=1}^{N} r_i}{N + \lambda}$

    • When $\lambda>0$, it reduces the amount that an individual observation adds to the overall prediction. (In the XGBoost library, the default value of $\lambda$, i.e. reg_lambda, is 1.)

    • When $\lambda>0$, it results in smaller output values for the leaves

4. Predictions

  • Make new predictions by starting with the initial prediction.

  • Add the output of the tree scaled by a learning rate: $prediction = initial\ prediction + learning\ rate \times output\ value\ of\ tree$

    • The learning rate $\eta$ has a default value of 0.3.
  • The new residuals are smaller than before

5. Build another tree on the new, smaller residuals, make new predictions, compute new residuals, and so on (a one-split sketch of this loop follows below).
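
To illustrate steps 3–5, here is a deliberately tiny sketch in which a single fixed split stands in for each fitted tree, so only the output-value formula and the learning-rate update are shown; the numbers and the fixed threshold are made up:

```python
import numpy as np

def leaf_output(residuals, lam=1.0):
    """Regression leaf output: sum of residuals / (N + lambda)."""
    return residuals.sum() / (len(residuals) + lam)

# Toy data: one feature, one target, initial constant prediction of 0.5.
feature = np.array([10.0, 20.0, 25.0, 35.0])
y = np.array([-10.0, 7.0, 8.0, -7.0])
prediction = np.full_like(y, 0.5)
eta, lam, threshold = 0.3, 1.0, 30.0   # learning rate, regularization, fixed split

for round_ in range(10):
    residuals = y - prediction
    left = feature < threshold
    outputs = np.where(left,
                       leaf_output(residuals[left], lam),
                       leaf_output(residuals[~left], lam))
    prediction = prediction + eta * outputs      # scale the tree output by eta
    print(round_, np.abs(residuals).mean())      # track the mean absolute residual
```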

About Parameters

  • When $\lambda>0$, it results in more pruning (by shrinking the similarity scores) and in smaller output values for the leaves.

Classification

1. Make an initial prediction first, such as a probability of 0.5 for every sample

2. Calculate similarity score

$Similarity\ Score = \frac{\left(\sum_{i=1}^{N} r_i\right)^2}{\sum_{i=1}^{N} p_i(1-p_i) + \lambda}$

$p_i$ is the previously predicted probability for sample $i$.

It is possible to split with the same feature at different thresholds.

The other steps are similar to the regression case: calculate the similarity scores, calculate the Gain, prune the tree, and calculate the output values.

3. Stopping Criterion When Building a Single Tree

  • the number of levels

  • the number of residuals in each leaf, which is determined (represented) by Cover, $Cover = \sum_{i=1}^{N} p_i(1-p_i)$

    • In the regression case, $Cover = N$. By default, the minimum value for Cover (min_child_weight) is 1, so with the default value Cover has no effect on how the regression tree is grown.
    • In the classification case, if min_child_weight is 1 and the Cover calculated for a leaf is smaller than min_child_weight, the branch is removed. In small worked examples it is usually set to 0 so that low-Cover leaves are kept (see the Cover sketch after this section).

Values for $\lambda$ greater than 0 reduce the sensitivity of the tree to individual observations by pruning and combining them with other observations.
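
A small sketch of the Cover check for a classification leaf, using the formula above (the function name and numbers are ours; in the library this corresponds to the min_child_weight check):

```python
import numpy as np

def cover(prev_probs):
    """Classification Cover: sum of p_i * (1 - p_i) over the samples in the leaf."""
    prev_probs = np.asarray(prev_probs)
    return (prev_probs * (1 - prev_probs)).sum()

min_child_weight = 1.0              # the default minimum Cover
leaf_probs = [0.5, 0.5]             # previously predicted probabilities in a leaf
print(cover(leaf_probs))            # 0.5, which is below the default threshold
print(cover(leaf_probs) >= min_child_weight)   # False: this leaf would not be allowed
```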

4. Output Value

  • $Output\ Value = \frac{\sum_{i=1}^{N} r_i}{\sum_{i=1}^{N} p_i(1-p_i) + \lambda}$

  • When $\lambda>0$, it reduces the amount that a single observation adds to the new prediction.

  • There is a transformation between odds and probability here (a worked sketch of the update follows after the formulas):

  • $log\left(\frac{p}{1-p}\right) = log(odds)$

    $log(odds)\ prediction = log(odds) + \eta \times Output\ Value$

    $New\ Probability = \frac{e^{log(odds)}}{1 + e^{log(odds)}}$
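
The sketch below strings these classification formulas together for the samples that land in one leaf: compute the output value, add it to the previous log(odds) scaled by the learning rate, and convert back to a probability (names and numbers are illustrative, not XGBoost's API):

```python
import numpy as np

def leaf_output_cls(residuals, prev_probs, lam=1.0):
    """Classification leaf output: sum(r_i) / (sum(p_i * (1 - p_i)) + lambda)."""
    return residuals.sum() / ((prev_probs * (1 - prev_probs)).sum() + lam)

def prob_to_log_odds(p):
    return np.log(p / (1 - p))

def log_odds_to_prob(z):
    return np.exp(z) / (1 + np.exp(z))

y = np.array([1.0, 1.0, 0.0])            # labels of samples in the same leaf
prev_p = np.array([0.5, 0.5, 0.5])       # previously predicted probabilities
residuals = y - prev_p
eta = 0.3

output = leaf_output_cls(residuals, prev_p, lam=0.0)
new_log_odds = prob_to_log_odds(prev_p) + eta * output
print(log_odds_to_prob(new_log_odds))    # updated probabilities for these samples
```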

5. Build new trees to fit new residuals
6. Stopping Criteria for Building More Trees

  • the residuals are sufficiently small
  • the maximum number of trees is reached

Mathematical Details

1. Loss Function

  • Regression

    • $L(y_i, p_i) = \frac{1}{2}(y_i - p_i)^2$
  • Classification

  • $L(y_i, p_i) = -\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$

    The negative log-likelihood is the commonly used loss function.

2. Objective Function in building trees

  • $\left[\sum_{i=1}^{n} L(y_i, p_i)\right] + \frac{1}{2}\lambda O_{value}^2 + \gamma T$

    • $T$ is the number of terminal nodes (leaves) in the tree. $\gamma$ is a user-definable penalty meant to encourage pruning. In the following discussion $\gamma T$ is omitted, because pruning takes place after the full tree is built and plays no role in deriving optimal output values or similarity scores.
    • $\frac{1}{2}\lambda O_{value}^2$ is the regularization term; just like in ridge regression, if $\lambda>0$ we shrink $O_{value}$.
  • The goal is to find the output value ($O_{value}$) that minimizes the whole equation; $O_{value}$ is the variable being optimized.

    In the regression case,
    $\left[\sum_{i=1}^{n} L\left(y_i, p_i^{0} + O_{value}\right)\right] + \frac{1}{2}\lambda O_{value}^2$

    • The more emphasis we give the regularization penalty by increasing $\lambda$, the closer the optimal $O_{value}$ gets to 0.
    • $O_{value}$ is different for each leaf; the sums here run over the samples in the same leaf.
  • To solve for the optimal output value of a leaf, take a second-order Taylor approximation of the loss (XGBoost uses this for both regression and classification, while regular, "unextreme" Gradient Boost used it only for classification)
    $L(y_i, p_i + O_{value}) \approx L(y_i, p_i) + \left[\frac{d}{dp_i} L(y_i, p_i)\right] O_{value} + \frac{1}{2}\left[\frac{d^2}{dp_i^2} L(y_i, p_i)\right] O_{value}^2$

    • Use $g$ to represent the first derivative (gradient) of the loss function and $h$ to represent the second derivative (Hessian).

    • Rewrite the above expression in terms of $g$ and $h$, and omit $L(y_i, p_i)$, which does not depend on $O_{value}$:
      $(g_1 + g_2 + \cdots + g_n) O_{value} + \frac{1}{2}(h_1 + h_2 + \cdots + h_n + \lambda) O_{value}^2$

    • Take the derivative with respect to $O_{value}$, set it to 0, and solve for $O_{value}$ (this holds for both regression and classification):
      $O_{value} = \frac{-(g_1 + g_2 + \cdots + g_n)}{(h_1 + h_2 + \cdots + h_n + \lambda)}$
      This is the optimal output value for the leaf.

      For regression, $g_i = -(y_i - p_i)$ is the negative residual and $h_i = 1$, so
      $O_{value} = \frac{\sum_{i=1}^{N} r_i}{N + \lambda}$

  • In the classification case,
    $L(y_i, p_i) = -\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$
    Taking the derivatives with respect to the log(odds) gives $g_i = -(y_i - p_i)$ and $h_i = p_i(1 - p_i)$, which yields the classification output value shown earlier.

  • Similarity Score
    $-(g_1 + g_2 + \cdots + g_n) O_{value} - \frac{1}{2}(h_1 + h_2 + \cdots + h_n + \lambda) O_{value}^2$
    XGBoost plugs the optimal $O_{value}$ into this simplified (negated) objective to obtain the similarity score used while building trees; this is the form used in the paper:
    $Similarity\ Score = \frac{1}{2} \frac{(g_1 + g_2 + \cdots + g_n)^2}{(h_1 + h_2 + \cdots + h_n + \lambda)}$
    In the implementation, however, the similarity score actually used is 2 times that number (it is only a relative measure, so the constant factor does not matter):
    $Similarity\ Score = \frac{(g_1 + g_2 + \cdots + g_n)^2}{(h_1 + h_2 + \cdots + h_n + \lambda)}$
    This is an example of how XGBoost reduces the amount of computation: $g_i$ and $h_i$ can be calculated directly from the gradients and Hessians (see the sketch below).

  • Cover is the sum of the Hessians, $\sum_{i=1}^{N} h_i$.
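
A generic sketch of the two quantities the implementation actually evaluates, written only in terms of the gradients $g$ and Hessians $h$ (function names are ours; the regression and classification special cases fall out by plugging in the corresponding $g_i$ and $h_i$):

```python
import numpy as np

def optimal_output_value(g, h, lam=1.0):
    """O_value = -sum(g) / (sum(h) + lambda), for any twice-differentiable loss."""
    return -np.sum(g) / (np.sum(h) + lam)

def similarity_score(g, h, lam=1.0):
    """Similarity score as implemented: sum(g)^2 / (sum(h) + lambda)."""
    return np.sum(g) ** 2 / (np.sum(h) + lam)

# Regression (squared error): g_i = -(y_i - p_i) = -r_i and h_i = 1.
r = np.array([-10.5, 6.5, 7.5])
print(optimal_output_value(-r, np.ones_like(r), lam=0.0))   # equals sum(r) / (N + lambda)

# Classification (log loss on the log(odds)): g_i = -(y_i - p_i), h_i = p_i * (1 - p_i).
y, p = np.array([1.0, 0.0]), np.array([0.5, 0.5])
g, h = -(y - p), p * (1 - p)
print(optimal_output_value(g, h, lam=0.0), similarity_score(g, h, lam=0.0))
```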

Optimization

  • These techniques make XGBoost relatively efficient and fast on large training datasets.

1. Approximate Greedy Algorithm

  • Greedy algorithm: make each split decision without looking ahead to see whether it is the absolute best choice in the long term; just take the best Gain at the time.
    • XGBoost can build a tree relatively quickly this way, but if the data is large and complex, it is still not efficient.
  • Approximate greedy algorithm: instead of testing every single threshold, divide the data into quantiles and only use the quantiles as candidate thresholds to split the observations. By default, XGBoost uses about 33 quantiles (a quantile-candidate sketch follows this list).
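
A rough illustration of the quantile idea, using plain (unweighted) quantiles for now; the "about 33 quantiles" figure quoted above would mean that cut points like these are the only thresholds scored with the Gain formula:

```python
import numpy as np

feature = np.random.default_rng(0).normal(size=10_000)

n_quantiles = 33
# Interior quantile boundaries become the only candidate split thresholds.
candidates = np.quantile(feature, np.linspace(0, 1, n_quantiles + 1)[1:-1])
print(len(candidates))   # 32 candidate thresholds instead of ~10,000 midpoints
```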

2. Sketches (Parallel Learning)

  • Steps:
    • Split the dataset into small pieces and put the pieces on different computers in a network (so they can be worked on at the same time)
    • The quantile sketch algorithm combines the values from each computer to make an approximate histogram
    • The approximate histogram is then used to calculate approximate weighted quantiles (the weights are recalculated after building each tree):
      • With weighted quantiles, each observation has a corresponding weight, and the sum of the weights is the same in each quantile
      • The weights are derived from the Cover metric (the weight for each observation is the Hessian of the loss function)
        • For regression, the weights are all equal to 1
        • For classification, when we do not have a lot of confidence in the prediction (probability near 0.5), the weight is relatively large. By dividing the observations into quantiles whose weight sums are similar, observations with low-confidence predictions end up in separate bins (we get smaller quantiles where we need them; a weighted-quantile sketch follows this list)
    • The approximate greedy algorithm uses approximate quantiles
  • XGBoost only uses the Approximate Greedy Algorithm, Parallel Learning and Weighted Quantile Sketch when the training dataset is huge. When the dataset is small, XGBoost just uses a normal, everyday Greedy Algorithm.
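
A simplified stand-in for the weighted quantile idea (not the actual sketch data structure): sort the feature, use the Hessians as weights, and place cut points so that each bin holds roughly the same total weight.

```python
import numpy as np

def weighted_quantile_cuts(values, weights, n_bins):
    """Cut points such that every bin holds roughly the same total weight."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights) / weights.sum()         # cumulative weight fraction
    targets = np.linspace(0, 1, n_bins + 1)[1:-1]    # interior bin boundaries
    return values[np.searchsorted(cum, targets)]

rng = np.random.default_rng(1)
feature = rng.normal(size=1_000)
prev_p = rng.uniform(0.05, 0.95, size=1_000)   # previous predicted probabilities
weights = prev_p * (1 - prev_p)                # classification Hessians (per-sample Cover)
print(weighted_quantile_cuts(feature, weights, n_bins=10))
```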

3. Sparsity-Aware Split Finding

  • For datasets with missing values in the features
    • Split the data into two tables: one with and one without the missing feature values
    • Calculate $Gain_{Left}$ and $Gain_{Right}$ for each candidate split point, once with the residuals of the missing-value rows placed in the left leaf and once with them placed in the right leaf
    • Finally choose the threshold (and default direction) that gives the largest $Gain$ overall
    • All future observations (e.g., at test time) without a value for that feature will go down this default path (a small sketch follows this list)
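
A toy sketch of the idea (not the library's actual code): for a given threshold, score the split twice, once sending the missing-value rows left and once sending them right, and keep the better direction as the default path.

```python
import numpy as np

def similarity(res, lam=0.0):
    return res.sum() ** 2 / (len(res) + lam) if len(res) else 0.0

def best_default_direction(feature, residuals, threshold, lam=0.0):
    """Return (gain, direction) for a threshold, trying missing-left and missing-right."""
    missing = np.isnan(feature)
    left = feature < threshold                    # NaN comparisons evaluate to False
    root = similarity(residuals, lam)
    best = None
    for direction in ("left", "right"):
        go_left = (left | missing) if direction == "left" else (left & ~missing)
        g = (similarity(residuals[go_left], lam)
             + similarity(residuals[~go_left], lam) - root)
        if best is None or g > best[0]:
            best = (g, direction)
    return best

feature = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
residuals = np.array([-2.0, 4.0, -1.0, 5.0, -3.0])
print(best_default_direction(feature, residuals, threshold=4.0))
```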

4. Cache-Aware Access

  • Memory hierarchy: CPU cache (fastest), main memory (fast), hard drive (slow)
  • To run the program fast, maximize what can be done with the cache memory
  • XGBoost puts the gradients and Hessians in the cache so that it can rapidly calculate similarity scores and output values

5. Blocks for out-of-core computation

  • When the dataset is too large, some of the data must be stored on the hard drive. Because reading and writing data to the hard drive is slow, XGBoost tries to minimize these actions by compressing the data
  • By spending a little CPU time uncompressing the data, we avoid spending a lot of time accessing the hard drive
  • A database technique called Sharding is used to speed up disk access: when more than one drive is used and the CPU needs data, all the drives can read data at the same time

4 and 5 take the computer hardware into account.

6. Other Techniques

  • XGBoost can also speed things up by building each tree with only a random subset of the data (subsample) or by only looking at a random subset of features when deciding how to split the data (colsample_bytree)

Implementation Details

1. Drop unneeded data and reformat the columns

  • Drop columns in place, without making another copy of the dataframe:
df.drop(["column_to_drop"], axis=1, inplace=True)  # "column_to_drop" is a placeholder column name
  • If a column contains only a single value, do not use it for prediction.
  • If you want to plot the tree, remove the spaces in the feature values and in the column names and replace them with underscores:
df["City"].replace(" ", "_", regex=True, inplace=True)
df.columns = df.columns.str.replace(" ", "_")
df.replace(" ", "_", regex=True, inplace=True)
2. Deal with missing data
  • XGBoost has default behavior for missing data, so we only have to identify the missing values and make sure they are set to 0.
    • XGBoost uses sparse matrices: it only keeps track of the non-zero entries and does not allocate memory for the 0s (memory efficient).
3. Format the data for the XGBoost model
  • The reason for One-Hot encoding categorical variables, rather than simply replacing the categories with integers, is that XGBoost would otherwise tend to group numerically adjacent categories together.
  • One-Hot encoding is not meant for linear or logistic regression, but it is great for trees.
4. Build a preliminary XGBoost model
  • Use stratify to ensure that y has the same class distribution in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state=42, stratify=y)
  • To see how the model performs on the test set, draw the confusion matrix

  • XGBoost has a parameter, scale_pos_weight, that helps with imbalanced data: it adds a larger penalty to misclassified minority-class samples, so the trees try harder to classify the minority class correctly.

  • If the dataset is imbalanced, balance the positive and negative weights via scale_pos_weight (the ratio of negative instances to positive instances) and use AUC (an overall performance metric) for evaluation

  • Use subsample and colsample_bytree to sample only part of the rows and columns when training each tree (an end-to-end sketch follows below)
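
Putting the implementation notes together, here is a hedged end-to-end sketch. The toy DataFrame, column names, and parameter values are ours; only the xgboost and scikit-learn calls themselves (XGBClassifier, train_test_split, confusion_matrix, roc_auc_score, scale_pos_weight, subsample, colsample_bytree) are real library features.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical data: a categorical column to One-Hot encode plus a numeric one.
df = pd.DataFrame({
    "City": ["New_York", "Boston", "Boston", "New_York"] * 50,
    "Age": np.arange(200) % 60,
})
X_encoded = pd.get_dummies(df, columns=["City"], dtype=int)
y = pd.Series(([0] * 3 + [1]) * 50)              # imbalanced binary target

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, random_state=42, stratify=y)

# Weight the positive (minority) class by the negative/positive ratio.
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

model = xgb.XGBClassifier(
    objective="binary:logistic",
    eval_metric="auc",
    scale_pos_weight=scale_pos_weight,
    subsample=0.9,              # fraction of rows sampled per tree
    colsample_bytree=0.5,       # fraction of columns sampled per tree
    learning_rate=0.3,
    max_depth=6,
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print(confusion_matrix(y_test, pred))
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```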

Notes taken from Youtube channel StatQuest with Josh Starmer.
