Comparison with AdaBoost
| | AdaBoost | Gradient Boost |
|---|---|---|
| Initial | builds a very short tree called a stump | makes a single leaf, representing the initial guess for the target value of all samples |
| Loop | builds another stump based on the errors made by the previous stump | builds a tree based on the errors made by the previous tree (usually larger than a stump) |
| Scale | the amount of say a stump has on the final output is based on how well it compensated for the previous errors | scales all trees by the same amount with the learning rate |
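Both are available in scikit-learn; a minimal sketch for comparison (the dataset and hyperparameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# AdaBoost: an ensemble of stumps, each with its own amount of say
ada = AdaBoostClassifier(n_estimators=100).fit(X, y)

# Gradient Boost: larger trees, all scaled by the same learning rate
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_leaf_nodes=32).fit(X, y)
```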
Classification
Rough Map
1. Initial Prediction
- $\log(\text{odds})$ calculated from the training data ➡️ probability
- Come up with an optimal initial prediction $\gamma = \log(\text{odds})$ that minimizes the total loss:
  $F_{0}(x)=\underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L\left(y_{i}, \gamma\right)$
- To make the calculation easier, just substitute $p$ for $\gamma$ and solve for $p$ first.
- The probability is used to get the residuals.
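A minimal sketch of this initialization, assuming binary labels `y` in {0, 1} (the data is illustrative):

```python
import numpy as np

y = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # illustrative binary labels

p0 = y.mean()                      # the p that minimizes the total loss
log_odds0 = np.log(p0 / (1 - p0))  # initial prediction F_0(x), shared by all samples

residuals = y - p0                 # residuals used to build the first tree
```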
2. Train a Tree
- In practice, people often set the maximum number of leaves to be between 8 and 32.
- Loss Function
  The log(likelihood) of the observed data given the prediction (also used in logistic regression):
  $\sum_{i=1}^{N} y_{i} \times \log (p_i)+\left(1-y_{i}\right) \times \log (1-p_i)$
  The better the prediction, the larger the log(likelihood), so we would try to maximize it. To use the log(likelihood) as a loss function that we want to minimize, just multiply it by -1.
  - This is a function of $p$; we convert it into a function of $\log(\text{odds})$ (per sample):
    $-y_i\times \log (\text{odds})+\log \left(1+e^{\log (\text{odds})}\right)$
  - Take the derivative of the loss function with respect to the predicted $\log(\text{odds})$:
    $-y_i+\frac{e^{\log (\text{odds})}}{1+e^{\log (\text{odds})}} = -y_i+p$
  - So this loss function is differentiable.
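A quick numerical check of that derivative (the values are illustrative):

```python
import numpy as np

def loss(y, log_odds):
    # per-sample negative log(likelihood), written in terms of log(odds)
    return -y * log_odds + np.log(1 + np.exp(log_odds))

y, log_odds = 1.0, 0.4
p = 1 / (1 + np.exp(-log_odds))

analytic = -y + p
eps = 1e-6
numeric = (loss(y, log_odds + eps) - loss(y, log_odds - eps)) / (2 * eps)
print(analytic, numeric)  # both ≈ -0.4013
```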
- Details for building a tree
  For $m$ in $1...M$:
  - Calculate the residual for each sample:
    $r_{im}=-\left[\frac{\partial L\left(y_{i}, F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F(x)=F_{m-1}(x)}=y_i-\frac{e^{\log(\text{odds})}}{1+e^{\log(\text{odds})}}$
    - This is the derivative of the loss function with respect to $\log(\text{odds})$, multiplied by -1.
    - $i$ is the sample number and $m$ is the tree that we are building.
    - $F_{m-1}(x)$ is the most recent predicted $\log(\text{odds})$.
    - After simplification, $y_i-p=\text{Pseudo Residual}$
  - Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$, for $j = 1...J_m$
    - $J_m$ is the number of leaves.
  - For $j=1...J_m$, compute the output value for each leaf:
    $\gamma_{jm}=\underset{\gamma}{\operatorname{argmin}} \sum_{x_i \in R_{jm}}L\left(y_i, F_{m-1}\left(x_i\right)+\gamma\right)$
    - In this formula we need to solve for $\gamma$, but taking the first-order derivative and solving directly is too hard, so we approximate the loss function with a second-order Taylor polynomial.
      If there is only one sample in the leaf:
      $L\left(y_{1}, F_{m-1}\left(x_{1}\right)+\gamma\right) \approx L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)+\frac{d}{d F()} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma+\frac{1}{2} \frac{d^{2}}{d F()^{2}} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma^{2}$
    - Take the first-order derivative with respect to $\gamma$:
      $\frac{d}{d \gamma} L\left(y_{1}, F_{m-1}\left(x_{1}\right)+\gamma\right) \approx \frac{d}{d F()} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)+\frac{d^{2}}{d F()^{2}} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right) \gamma$
    - Set it to zero and solve for $\gamma$:
      $\gamma=\frac{-\frac{d}{d F()} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)}{\frac{d^{2}}{d F()^{2}} L\left(y_{1}, F_{m-1}\left(x_{1}\right)\right)}$
    - After simplification, $\gamma_{1,1}=\frac{\text{Residual}}{p \times(1-p)}$
  - Update the prediction for each sample: $F_{m}(x)=F_{m-1}(x)+\nu \sum_{j=1}^{J_{m}} \gamma_{jm} I\left(x \in R_{jm}\right)$
    - With more than one sample in a leaf, the output value of each leaf is calculated from the following formula; since the predictions are in terms of $\log(\text{odds})$, this transformation has to be made (a full round of these steps is sketched in code below):
      $\frac{\sum \text{Residual}_{i}}{\sum\left[\text{Previous Probability}_{i} \times\left(1-\text{Previous Probability}_{i}\right)\right]}$
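A minimal from-scratch sketch of the loop above (illustrative data; scikit-learn's `DecisionTreeRegressor` stands in for the tree-fitting step, and `tree.apply` is used to recover each sample's terminal region):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # illustrative feature
y = np.array([1, 1, 1, 0, 0, 1])                          # illustrative labels
nu = 0.1                                                  # learning rate

# 1. Initial prediction: the same log(odds) for every sample
p = np.full(len(y), y.mean())
F = np.log(p / (1 - p))

for m in range(10):  # M = 10 trees
    residuals = y - p                                 # pseudo residuals
    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, residuals)
    leaf = tree.apply(X)                              # terminal region R_jm per sample
    # leaf output: sum(residuals) / sum(p * (1 - p)) over the samples in that leaf
    gamma = {j: residuals[leaf == j].sum() / (p[leaf == j] * (1 - p[leaf == j])).sum()
             for j in np.unique(leaf)}
    F = F + nu * np.array([gamma[j] for j in leaf])   # update the log(odds)
    p = 1 / (1 + np.exp(-F))                          # convert back to probabilities

pred = (p > 0.5).astype(int)  # class predictions on the training samples
```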
3. Make Predictions
- Update the $\log(\text{odds})$ with the learning rate and the output value of the leaf, convert it into a probability, and calculate the new residuals; then the next tree can be built.
- Compare the resulting probability with the threshold 0.5 to decide which class the sample belongs to.
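The final conversion in a couple of lines (the numbers are illustrative):

```python
import numpy as np

def predict_class(log_odds, threshold=0.5):
    p = 1 / (1 + np.exp(-log_odds))  # log(odds) -> probability
    return int(p > threshold)

print(predict_class(-0.3))  # p ≈ 0.43 -> class 0
print(predict_class(1.2))   # p ≈ 0.77 -> class 1
```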
Regression
- Loss Function
  $\frac{1}{2} (\text{Observed} - \text{Predicted})^{2}$
  The $\frac{1}{2}$ is there to make the derivative easier to calculate: it cancels the 2 that the power rule brings down.
1. Initialization
- The average target value in the training set:
  $F_{0}(x)=\underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L\left(y_{i}, \gamma\right)$
  Take the first-order derivative, set it to zero, and given the above loss function you get the average value.
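For completeness, the one-line derivation:

$\frac{d}{d \gamma} \sum_{i=1}^{n} \frac{1}{2}\left(y_{i}-\gamma\right)^{2}=\sum_{i=1}^{n}\left(\gamma-y_{i}\right)=0 \quad \Rightarrow \quad \gamma=\frac{1}{n} \sum_{i=1}^{n} y_{i}$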
2. Build Trees
For $m = 1, 2...M$:
- Use the learning rate to scale the contribution from the new tree to prevent overfitting: taking lots of small steps in the right direction results in better predictions on a test set (low variance).
- Compute the residuals $r_{im}=-\left[\frac{\partial L\left(y_{i}, F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F(x)=F_{m-1}(x)}$ for $i=1, \ldots, n$
  - This is the residual $\text{Observed}-\text{Predicted}$; the gradient used here is what Gradient Boost is named after.
  - The pseudo residual ($r_{im}$) is the difference between the observed and the predicted values, just said a different way. (When we use another loss function, such as one without the $\frac{1}{2}$, its gradient is only similar to the residual, and is therefore called a pseudo residual.)
- Fit a regression tree to the $r_{im}$ values and create terminal regions $R_{jm}$, for $j = 1...J_m$
- For $j=1...J_m$, compute the output value of each leaf:
  $\gamma_{jm}=\underset{\gamma}{\operatorname{argmin}} \sum_{x_{i} \in R_{jm}} L\left(y_{i}, F_{m-1}\left(x_{i}\right)+\gamma\right)$
  The output value of each leaf given this loss function is the average of the residuals in that leaf.
- Update: $F_{m}(x)=F_{m-1}(x)+\nu \sum_{j=1}^{J_{m}} \gamma_{jm} I\left(x \in R_{jm}\right)$
3. Make Predictions
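A minimal from-scratch sketch of the whole regression procedure (illustrative data; for this squared-error loss the mean residual in each leaf is exactly what `DecisionTreeRegressor` predicts, so no separate leaf-output step is needed):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.6], [1.6], [1.5], [1.8], [1.5], [1.4]])  # illustrative feature
y = np.array([88.0, 76.0, 56.0, 73.0, 77.0, 57.0])        # illustrative target
nu = 0.1  # learning rate

# 1. Initialization: the mean minimizes the squared-error loss
F = np.full(len(y), y.mean())
trees = []

# 2. Build trees on the residuals
for m in range(20):  # M = 20 trees
    residuals = y - F                    # pseudo residuals
    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, residuals)
    trees.append(tree)
    F = F + nu * tree.predict(X)         # each leaf predicts its mean residual

# 3. Make predictions for a new sample
x_new = np.array([[1.7]])
pred = y.mean() + nu * sum(t.predict(x_new) for t in trees)
```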