Comparison with AdaBoost
| | AdaBoost | Gradient Boost |
|---|---|---|
| Initial | builds a very short tree called a stump | makes a single leaf, representing the initial guess for the target value of all samples |
| Loop | builds another stump based on the errors made by the previous stump | builds a tree based on the errors made by the previous tree (usually larger than a stump) |
| Scale | the amount of say a stump has on the final output is based on how well it compensated for the previous errors | scales all trees by the same amount with the learning rate |
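Both are available in scikit-learn; a minimal sketch for comparison (the dataset and hyperparameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# AdaBoost: an ensemble of stumps, each with its own amount of say
ada = AdaBoostClassifier(n_estimators=100).fit(X, y)

# Gradient Boost: larger trees, all scaled by the same learning rate
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_leaf_nodes=32).fit(X, y)
```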
Classification
Rough Map
1. Initial Prediction
- $\log(\text{odds})$ calculated from the training data ➡️ probability
- Come up with an optimal initial prediction $\gamma = \log(\text{odds})$ that minimizes the total loss:
  $F_{0}(x)=\underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L\left(y_{i}, \gamma\right)$
- To make the calculation easier, just substitute $p$ for $\gamma$ and solve for $p$ first.
- The probability is used to get the residuals.
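A minimal sketch of this initialization, assuming binary labels `y` in {0, 1} (the data is illustrative):

```python
import numpy as np

y = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # illustrative binary labels

p0 = y.mean()                      # the p that minimizes the total loss
log_odds0 = np.log(p0 / (1 - p0))  # initial prediction F_0(x), shared by all samples

residuals = y - p0                 # residuals used to build the first tree
```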
2. Train a Tree
- In practice, people often set the maximum number of leaves to be between 8 and 32.
- Loss Function
  The log(likelihood) of the observed data given the prediction (also used in logistic regression):
  $\sum_{i=1}^{N} y_{i} \times \log (p_i)+\left(1-y_{i}\right) \times \log (1-p_i)$
  The better the prediction, the larger the log(likelihood), so we would try to maximize it. To use the log(likelihood) as a loss function that we want to minimize, just multiply it by -1.
  - This is a function of $p$; we convert it into a function of $\log(\text{odds})$ (per sample):
    $-y_i\times \log (\text{odds})+\log \left(1+e^{\log (\text{odds})}\right)$
  - Take the derivative of the loss function with respect to the predicted $\log(\text{odds})$:
    $-y_i+\frac{e^{\log (\text{odds})}}{1+e^{\log (\text{odds})}} = -y_i+p$
  - So this loss function is differentiable.
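A quick numerical check of that derivative (the values are illustrative):

```python
import numpy as np

def loss(y, log_odds):
    # per-sample negative log(likelihood), written in terms of log(odds)
    return -y * log_odds + np.log(1 + np.exp(log_odds))

y, log_odds = 1.0, 0.4
p = 1 / (1 + np.exp(-log_odds))

analytic = -y + p
eps = 1e-6
numeric = (loss(y, log_odds + eps) - loss(y, log_odds - eps)) / (2 * eps)
print(analytic, numeric)  # both ≈ -0.4013
```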
- Details for building a tree
  For $m$ in $1...M$:
  - Calculate the residual for each sample:
    $r_{im}=-\left[\frac{\partial L\left(y_{i}, F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F(x)=F_{m-1}(x)}=y_i-\frac{e^{\log(\text{odds})}}{1+e^{\log(\text{odds})}}$
    - This is the derivative of the loss function with respect to $\log(\text{odds})$, multiplied by -1.
    - $i$ is the sample number and $m$ is the tree that we are building.
    - $F_{m-1}(x)$ is the most recent predicted $\log(\text{odds})$.
    - After simplification, $y_i-p=\text{Pseudo Residual}$
  - Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$, for $j = 1...J_m$
    - $J_m$ is the number of leaves.
  - For $j=1...J_m$, compute the output value for each leaf:
    $\gamma_{jm}=\underset{\gamma}{\operatorname{argmin}} \sum_{x_i \in R_{jm}}L\left(y_i, F_{m-1}\left(x_i\right)+\gamma\right)$
    - In this formula we need to solve for $\gamma$, but taking the first-order derivative and solving directly is too hard, so we approximate the loss function with a second-order Taylor polynomial.
      If there is only one sample in the leaf:
      $L\left(y_{1}, F_{m-1}\left(x_{1}\right)+\gamma\right) \approx L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)+\frac{d}{d F()} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma+\frac{1}{2} \frac{d^{2}}{d F()^{2}} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma^{2}$
    - Take the first-order derivative with respect to $\gamma$:
      $\frac{d}{d \gamma} L\left(y_{1}, F_{m-1}\left(x_{1}\right)+\gamma\right) \approx \frac{d}{d F()} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)+\frac{d^{2}}{d F()^{2}} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma$
    - Set it to zero and solve for $\gamma$:
      $\gamma=\frac{-\frac{d}{d F()} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)}{\frac{d^{2}}{d F()^{2}} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)}$
    - After simplification, $\gamma_{1,1}=\frac{\text{Residual}}{p \times(1-p)}$
  - Update the prediction for each sample: $F_{m}(x)=F_{m-1}(x)+\nu \sum_{j=1}^{J_{m}} \gamma_{jm} I\left(x \in R_{jm}\right)$
    - With more than one sample in a leaf, the output value of each leaf is calculated from the following formula; since the predictions are in terms of $\log(\text{odds})$, this transformation has to be made (a full round of these steps is sketched in code below):
      $\frac{\sum \text{Residual}_{i}}{\sum\left[\text{Previous Probability}_{i} \times\left(1-\text{Previous Probability}_{i}\right)\right]}$
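A minimal from-scratch sketch of the loop above (illustrative data; scikit-learn's `DecisionTreeRegressor` stands in for the tree-fitting step, and `tree.apply` is used to recover each sample's terminal region):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # illustrative feature
y = np.array([1, 1, 1, 0, 0, 1])                          # illustrative labels
nu = 0.1                                                  # learning rate

# 1. Initial prediction: the same log(odds) for every sample
p = np.full(len(y), y.mean())
F = np.log(p / (1 - p))

for m in range(10):  # M = 10 trees
    residuals = y - p                                 # pseudo residuals
    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, residuals)
    leaf = tree.apply(X)                              # terminal region R_jm per sample
    # leaf output: sum(residuals) / sum(p * (1 - p)) over the samples in that leaf
    gamma = {j: residuals[leaf == j].sum() / (p[leaf == j] * (1 - p[leaf == j])).sum()
             for j in np.unique(leaf)}
    F = F + nu * np.array([gamma[j] for j in leaf])   # update the log(odds)
    p = 1 / (1 + np.exp(-F))                          # convert back to probabilities

pred = (p > 0.5).astype(int)  # class predictions on the training samples
```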
3. Make Predictions
- Update the $\log(\text{odds})$ with the learning rate and the output value of the leaf, convert it into a probability, and calculate the new residuals; then the next tree can be built.
- Compare the resulting probability with the threshold 0.5 to decide which class the sample belongs to.
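The final conversion in a couple of lines (the numbers are illustrative):

```python
import numpy as np

def predict_class(log_odds, threshold=0.5):
    p = 1 / (1 + np.exp(-log_odds))  # log(odds) -> probability
    return int(p > threshold)

print(predict_class(-0.3))  # p ≈ 0.43 -> class 0
print(predict_class(1.2))   # p ≈ 0.77 -> class 1
```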
Regression
- Loss Function
  $\frac{1}{2} (\text{Observed} - \text{Predicted})^{2}$
  The $\frac{1}{2}$ is there to make the derivative easier to calculate: it cancels the 2 that the power rule brings down.
1. Initialization
- The average target value in the training set:
  $F_{0}(x)=\underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L\left(y_{i}, \gamma\right)$
  Take the first-order derivative, set it to zero, and given the above loss function you get the average value.
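For completeness, the one-line derivation:

$\frac{d}{d \gamma} \sum_{i=1}^{n} \frac{1}{2}\left(y_{i}-\gamma\right)^{2}=\sum_{i=1}^{n}\left(\gamma-y_{i}\right)=0 \quad \Rightarrow \quad \gamma=\frac{1}{n} \sum_{i=1}^{n} y_{i}$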
2. Build Trees
For $m = 1, 2...M$:
- Use the learning rate to scale the contribution from the new tree to prevent overfitting: taking lots of small steps in the right direction results in better predictions on a test set (low variance).
- Compute the residuals $r_{im}=-\left[\frac{\partial L\left(y_{i}, F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F(x)=F_{m-1}(x)}$ for $i=1, \ldots, n$
  - This is the residual $\text{Observed}-\text{Predicted}$; the gradient used here is what Gradient Boost is named after.
  - The pseudo residual ($r_{im}$) is the difference between the observed and the predicted values, just said a different way. (When we use another loss function, such as one without the $\frac{1}{2}$, its gradient is only similar to the residual, and is therefore called a pseudo residual.)
- Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$, for $j = 1...J_m$
- For $j=1...J_m$, compute the output value of each leaf:
  $\gamma_{jm}=\underset{\gamma}{\operatorname{argmin}} \sum_{x_{i} \in R_{jm}} L\left(y_{i}, F_{m-1}\left(x_{i}\right)+\gamma\right)$
  The output value of each leaf given this loss function is the average of the residuals in that leaf.
- Update: $F_{m}(x)=F_{m-1}(x)+\nu \sum_{j=1}^{J_{m}} \gamma_{jm} I\left(x \in R_{jm}\right)$
3. Make Predictions
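A minimal from-scratch sketch of the whole regression procedure (illustrative data; for this squared-error loss the mean residual in each leaf is exactly what `DecisionTreeRegressor` predicts, so no separate leaf-output step is needed):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.6], [1.6], [1.5], [1.8], [1.5], [1.4]])  # illustrative feature
y = np.array([88.0, 76.0, 56.0, 73.0, 77.0, 57.0])        # illustrative target
nu = 0.1  # learning rate

# 1. Initialization: the mean minimizes the squared-error loss
F = np.full(len(y), y.mean())
trees = []

# 2. Build trees on the residuals
for m in range(20):  # M = 20 trees
    residuals = y - F                    # pseudo residuals
    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, residuals)
    trees.append(tree)
    F = F + nu * tree.predict(X)         # each leaf predicts its mean residual

# 3. Make predictions for a new sample
x_new = np.array([[1.7]])
pred = y.mean() + nu * sum(t.predict(x_new) for t in trees)
```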