[Machine Learning Algorithms] Gradient Boosting

Comparison with AdaBoost

| | AdaBoost | Gradient Boost |
| --- | --- | --- |
| Initial | Build a very short tree called a stump | Make a single leaf, representing the initial guess for the target value of all samples |
| Loop | Build another stump based on the errors made by the previous stump | Build a tree based on the errors made by the previous tree (usually larger than a stump) |
| Scale | The amount of say that each stump has on the final output is based on how well it compensated for the previous errors | Scale all trees by the same amount with the learning rate |

Classification

Rough Map
(figure: rough map of the gradient boosting classification procedure)
1. Initial Prediction

  • $\log(\text{odds})$ is calculated from the training data and then converted to a probability.

  • Come up with an optimal initial prediction $\gamma = \log(\text{odds})$ that minimizes the total loss:
    $F_{0}(x)=\underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L\left(y_{i}, \gamma\right)$

  • To make the calculation easier, replace $\gamma$ with the probability $p$ and solve for $p$ first.

  • The probability is then used to compute the residuals (a small sketch of this initialization step follows this list).
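A minimal sketch of this initialization, assuming a binary 0/1 label array `y` (the variable names and the toy data are illustrative, not from the original post):

```python
import numpy as np

# Toy binary labels (1 = positive class, 0 = negative class)
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])

# Initial prediction: log(odds) of the positive class in the training data
log_odds = np.log(y.sum() / (len(y) - y.sum()))

# Convert log(odds) to a probability with the logistic function
p = np.exp(log_odds) / (1 + np.exp(log_odds))

# Pseudo-residuals for the first tree: observed - predicted probability
residuals = y - p
print(log_odds, p, residuals)
```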

2. Train a Tree

  • In practice people often set the maximum number of leaves to be between 8 and 32.

  • Loss Function

    The log(likelihood) of the observed data given the prediction (the same quantity used in logistic regression):
    $\sum_{i=1}^{N} y_{i} \times \log(p_i)+\left(1-y_{i}\right) \times \log(1-p_i)$
    The better the prediction, the larger the log(likelihood), so on its own we would try to maximize it.

    To use the log(likelihood) as a loss function that we want to minimize, we simply multiply it by -1.

    • This is a function of $p$; we convert it into a function of $\log(\text{odds})$:

    $-y_i\times \log(\text{odds})+\log\left(1+e^{\log(\text{odds})}\right)$

    • Take the derivative of the loss function with respect to the predicted $\log(\text{odds})$:

    $-y_i+\frac{e^{\log(\text{odds})}}{1+e^{\log(\text{odds})}} = -y_i+p$

    • So this loss function is differentiable.
  • Details for building a tree

    For $m$ in $1...M$:

    • Calculate residuals for each sample

      • $r_{im}=-\left[\frac{\partial L\left(y_{i}, F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F(x)=F_{m-1}(x)}=\left(y_i-\frac{e^{\log(\text{odds})}}{1+e^{\log(\text{odds})}}\right)$

      • That is, we multiply the derivative of the loss function with respect to $\log(\text{odds})$ by -1.

      • $i$ is the sample index and $m$ is the index of the tree that we are building.

      • $F_{m-1}(x)$ is the most recent predicted $\log(\text{odds})$.

      • After simplification, $y_i-p=\text{Pseudo Residual}$.

    • Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$, for $j = 1...J_m$

      • $J_m$ is the number of leaves.
    • For $j = 1...J_m$, compute the output value for each leaf:

      • $\gamma_{jm}=\underset{\gamma}{\operatorname{argmin}}\sum_{x_i\in R_{jm}}L\left(y_i, F_{m-1}\left(x_i\right)+\gamma\right)$

      • In this formula we need to solve for $\gamma$, but taking the first derivative and solving for $\gamma$ directly is hard, so we approximate the loss function with a second-order Taylor polynomial.

        If there is only one sample in the leaf:

        $L\left(y_{1}, F_{m-1}\left(x_{1}\right)+\gamma\right) \approx L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)+\frac{d}{d F()} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma+\frac{1}{2} \frac{d^{2}}{d F()^{2}} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma^{2}$

      • Take the first derivative with respect to $\gamma$:

        $\frac{d}{d \gamma} L\left(y_{1}, F_{m-1}\left(x_{1}\right)+\gamma\right) \approx \frac{d}{d F()} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)+\frac{d^{2}}{d F()^{2}} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma$

      • Set the derivative to zero and solve for $\gamma$:

        $\gamma=\frac{-\frac{d}{d F()} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)}{\frac{d^{2}}{d F()^{2}} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)}$

      • After simplification, $\gamma_{1,1}=\frac{\text{Residual}}{p \times(1-p)}$

    • Update the predictions for each sample: $F_{m}(x)=F_{m-1}(x)+\nu \sum_{j=1}^{J_{m}} \gamma_{j m} I\left(x \in R_{j m}\right)$

  • The output value of each leaf is calculated with the formula below. Since the predictions are in terms of $\log(\text{odds})$ while the residuals are in terms of probabilities, this transformation has to be made (a code sketch of one full iteration follows the formula):

$\frac{\sum \text{Residual}_{i}}{\sum\left[\text{Previous Probability}_{i} \times\left(1-\text{Previous Probability}_{i}\right)\right]}$
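A rough sketch of one such iteration, assuming `X`, `y`, and the current per-sample `log_odds` array from the initialization sketch above; the function name and the use of scikit-learn's DecisionTreeRegressor to fit the residuals are my own illustration, not the post's code:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_once(X, y, log_odds, learning_rate=0.1, max_leaves=8):
    """One gradient boosting iteration for binary classification."""
    p = 1 / (1 + np.exp(-log_odds))            # current predicted probabilities
    residuals = y - p                          # pseudo-residuals

    # Fit a small regression tree to the residuals (8-32 leaves is typical)
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
    tree.fit(X, residuals)

    # Replace each leaf's output with sum(residual) / sum(p * (1 - p))
    leaf_ids = tree.apply(X)
    leaf_values = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        leaf_values[leaf] = residuals[mask].sum() / (p[mask] * (1 - p[mask])).sum()

    # Update the log(odds) prediction, scaled by the learning rate
    new_log_odds = log_odds + learning_rate * np.array([leaf_values[l] for l in leaf_ids])
    return tree, leaf_values, new_log_odds
```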

3. Make Predictions

  • Update the $\log(\text{odds})$ with the learning rate and the output value of the leaf, convert it to a probability, compute the new residuals, and then build the next tree.

  • Compare the resulting probability with a threshold of 0.5 to decide which class the sample belongs to (a library-level usage example follows this list).
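For reference, the same procedure is available off the shelf in scikit-learn's GradientBoostingClassifier; a minimal usage sketch, where the hyperparameter values and the toy dataset are arbitrary illustrations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=100,      # number of trees M
    learning_rate=0.1,     # the shrinkage factor (nu)
    max_leaf_nodes=8,      # 8-32 leaves per tree, as suggested above
)
clf.fit(X, y)

print(clf.predict_proba(X[:3]))  # probabilities; predict() thresholds them at 0.5
print(clf.predict(X[:3]))
```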

Regression

  • Loss Function

    $\frac{1}{2}(\text{Observed}-\text{Predicted})^{2}$

    The $\frac{1}{2}$ is there to make the derivative calculation easier, as worked out just below.
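To see the effect of the $\frac{1}{2}$, differentiate the loss with respect to the prediction:

$\frac{d}{d\,\text{Predicted}}\left[\frac{1}{2}(\text{Observed}-\text{Predicted})^{2}\right]=-(\text{Observed}-\text{Predicted})$

so the negative gradient used in the boosting steps is exactly the ordinary residual.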

1. Initialization

  • The initial prediction $F_0(x)$ is the average target value in the training set:

    $F_{0}(x)=\underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L\left(y_{i}, \gamma\right)$

    Take the first derivative, set it to zero, and the solution is the average target value given the loss function above, as spelled out below.
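Spelling out that step with the squared-error loss:

$\frac{d}{d\gamma}\sum_{i=1}^{n}\frac{1}{2}\left(y_i-\gamma\right)^2=-\sum_{i=1}^{n}\left(y_i-\gamma\right)=0 \;\Rightarrow\; \gamma=\frac{1}{n}\sum_{i=1}^{n} y_i$

i.e. the optimal initial prediction is the mean of the observed targets.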

2. Build Trees

For $m=1,2...M$:

  • Use the learning rate to scale the contribution from each new tree to prevent overfitting: taking lots of small steps in the right direction results in better predictions on a test set (lower variance).

  • Compute residuals $r_{i m}=-\left[\frac{\partial L\left(y_{i}, F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F(x)=F_{m-1}(x)}$ for $i=1, \ldots, n$

    • This is just the residual $\text{Observed}-\text{Predicted}$; it is this gradient that Gradient Boost is named after.
    • The pseudo-residual $r_{im}$ is the difference between the observed and the predicted values, just under a different name. (With another loss function, for example one without the $\frac{1}{2}$, the negative gradient is only similar to the residual, which is why it is called a pseudo-residual.)
  • Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$, for $j = 1...J_m$

  • For $j = 1...J_m$, compute the output value of each leaf: $\gamma_{j m}=\underset{\gamma}{\operatorname{argmin}} \sum_{x_{i} \in R_{jm}} L\left(y_{i}, F_{m-1}\left(x_{i}\right)+\gamma\right)$

    The output value of each leaf given this loss function is the average of the residuals on that leaf.

  • Update: $F_{m}(x)=F_{m-1}(x)+\nu \sum_{j=1}^{J_{m}} \gamma_{j m} I\left(x \in R_{j m}\right)$ (a from-scratch sketch of the whole loop follows this list)
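Putting the regression steps together, a minimal from-scratch sketch; the variable names, the squared-error setup, and the use of DecisionTreeRegressor are illustrative assumptions, and `X`, `y` are assumed to be numpy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_regression(X, y, M=100, learning_rate=0.1, max_leaves=8):
    # Step 1: F_0(x) is the mean of the targets (argmin of the squared-error loss)
    F = np.full(len(y), y.mean())
    trees = []
    for m in range(M):
        residuals = y - F                              # negative gradient of 1/2 (y - F)^2
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
        tree.fit(X, residuals)                         # leaf values = mean residual per leaf
        F = F + learning_rate * tree.predict(X)        # F_m = F_{m-1} + nu * gamma
        trees.append(tree)
    return y.mean(), trees
```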

3. Make Predictions
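The final prediction for a new sample is the initial average plus the learning-rate-scaled output of every tree. Continuing the sketch above (again an illustration under the same assumptions, not the post's own code):

```python
# Assumes numpy and gradient_boost_regression from the previous sketch
def gb_predict(X_new, base, trees, learning_rate=0.1):
    # F_M(x) = F_0 + nu * sum of all tree outputs
    return base + learning_rate * sum(tree.predict(X_new) for tree in trees)

# Example usage with toy data
X = np.random.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + np.random.normal(scale=0.1, size=100)
base, trees = gradient_boost_regression(X, y)
print(gb_predict(X[:5], base, trees))
```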
