Something about XGBoost

XGBoost is one of the most widely used machine learning algorithms. This post covers the main ideas behind XGBoost and my comprehension of the model.

1 Background Knowledge

To understand XGBoost, we first need to cover a few background concepts.

1.1 Boosting

Boosting refers to combining a set of weak learners into a strong learner; it belongs to the family of ensemble learning methods. Intuitively, boosting can be compared with team members cooperating to solve a problem: each new member concentrates on the mistakes the team has made so far.

1.2 Gradient Boosting Machine

A Gradient Boosting Machine (GBM) is a boosting algorithm that improves the model by following the gradient of the loss. In gradient boosting, the negative gradient is regarded as the error measurement of the previous base learners, and the error made in the last round is corrected by fitting the negative gradient in the next round of learning.
GBDT is a kind of GBM whose base learner is a decision tree, and XGBoost improves on GBDT in many aspects. Decision trees are chosen as base learners for their high interpretability and fast training and prediction; a minimal residual-fitting sketch follows below.
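To make the residual-fitting idea concrete, here is a minimal sketch of gradient boosting with squared loss, using sklearn decision trees as base learners (the data, tree depth, and learning rate are arbitrary choices for illustration, not XGBoost's actual implementation):

```python
# A minimal gradient-boosting sketch for squared loss: each round fits a small
# regression tree to the negative gradient, which for squared loss is simply
# the residual y - y_hat.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

n_rounds, learning_rate = 50, 0.1
y_hat = np.zeros_like(y)            # y_hat^0 = 0
trees = []

for t in range(n_rounds):
    negative_gradient = y - y_hat   # -dL/dy_hat for L = 1/2 (y - y_hat)^2
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, negative_gradient)
    y_hat += learning_rate * tree.predict(X)   # y_hat^t = y_hat^{t-1} + eta * f_t(x)
    trees.append(tree)

print("training MSE:", np.mean((y - y_hat) ** 2))
```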

2 Mathematical Principle

Before deriving the mathematical principle, some basic notation and conventions should be introduced.

2.1 Additive model

XGBoost (or GBDT) can be regarded as an additive model composed of K trees:

$$\hat{y}_i=\sum^K_{k=1}f_k(x_i),\quad f_k\in F$$

Usually we construct an additive model with the forward stagewise algorithm: from the first round to the last, the model learns one base function at a time and approaches the objective step by step (this procedure is exactly boosting).

$$\begin{aligned} \hat{y}_i^{0}&=0\\ \hat{y}_i^{1}&=f_1(x_i)=\hat{y}_i^{0}+f_1(x_i)\\ \hat{y}_i^{2}&=f_1(x_i)+f_2(x_i)=\hat{y}_i^{1}+f_2(x_i)\\ &\;\;\vdots\\ \hat{y}_i^{t}&=\sum_{k=1}^{t}f_k(x_i)=\hat{y}_i^{t-1}+f_t(x_i) \end{aligned}$$
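A small sketch of how an additive model produces its staged predictions (assuming a hypothetical `trees` list of already-fitted base learners, such as the ones from the previous sketch):

```python
# Staged prediction for an additive model: y_hat^t = y_hat^{t-1} + f_t(x).
# `trees` is assumed to be a list of already-fitted base learners and
# `learning_rate` the shrinkage factor.
import numpy as np

def staged_predict(trees, X, learning_rate=1.0):
    """Yield y_hat^1, y_hat^2, ..., y_hat^K for the additive model."""
    y_hat = np.zeros(X.shape[0])        # y_hat^0 = 0
    for tree in trees:
        y_hat = y_hat + learning_rate * tree.predict(X)
        yield y_hat.copy()
```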

2.2 Taylor Expansion

A sufficiently smooth function can be approximated by its second-order Taylor expansion:

$$f(x+\Delta x)\approx f(x)+f'(x)\Delta x+\frac{1}{2}f''(x)\Delta x^2$$

Using the additive-model identity stated above, the objective function at round $t$ can be written as:

$$\begin{aligned} F^t&=\sum^n_{i=1}l(y_i,\hat{y}_i^{t})+\sum_{i=1}^t\Omega (f_i)\\ &=\sum_{i=1}^n l\left(y_i,\hat{y}_i^{t-1}+f_t(x_i)\right)+\Omega(f_t)+C \end{aligned}$$

Here $l$ is the loss function, $C$ is a constant, and $\Omega$ is the regularization term. $f_t(x_i)$ is the function that we need to learn in this round.

Applying the second-order Taylor expansion (treating $f_t(x_i)$ as $\Delta x$), the objective can be written as:

$$F^t=\sum_{i=1}^n\left[l(y_i,\hat{y}_i^{t-1})+g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t)+C$$

Since $l(y_i,\hat{y}_i^{t-1})$ and $C$ do not depend on $f_t$, removing these constant terms leaves the part we actually need to optimize:

$$F^t\approx \sum_{i=1}^n\left[g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t)$$

Here $g_i=\frac{\partial l(y_i,\hat{y}^{t-1})}{\partial \hat{y}^{t-1}}$ and $h_i=\frac{\partial^2 l(y_i,\hat{y}^{t-1})}{\partial (\hat{y}^{t-1})^2}$.
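For concreteness, here is an illustrative sketch of $g_i$ and $h_i$ for two common losses (my own transcription of the standard formulas, not XGBoost source code):

```python
# Per-sample first/second derivatives of the loss w.r.t. the previous-round
# prediction y_hat^{t-1}. These are the g_i and h_i used above.
import numpy as np

def grad_hess_squared(y, y_hat_prev):
    # l = 1/2 (y_hat - y)^2  ->  g = y_hat - y,  h = 1
    g = y_hat_prev - y
    h = np.ones_like(y)
    return g, h

def grad_hess_logistic(y, y_hat_prev):
    # y in {0, 1}, y_hat_prev is a raw score; l = log-loss on sigmoid(score)
    p = 1.0 / (1.0 + np.exp(-y_hat_prev))
    g = p - y
    h = p * (1.0 - p)
    return g, h
```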

3 Algorithm of XGBoost

Now suppose there is a decision tree with $T$ leaf nodes. The tree can be described by the vector of its leaf values $w\in\mathbb{R}^T$ together with a mapping $q(x)$ from a sample to a leaf index, so the tree's prediction is $f_t(x)=w_{q(x)}$. The complexity of the tree is measured by the regularization term $\Omega=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^T w_j^2$. Define $I_j=\{\,i \mid q(x_i)=j\,\}$ as the set of all training points assigned to leaf node $j$; then the objective function can be rewritten as:

$$\begin{aligned} F^t&\approx \sum_{i=1}^n\left[g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t)\\ &=\sum_{i=1}^n\left[g_iw_{q(x_i)}+\frac{1}{2}h_iw^2_{q(x_i)}\right]+\gamma T+\frac{1}{2}\lambda\sum_{j=1}^Tw_j^2\\ &=\sum_{j=1}^T\left[\Big(\sum_{i\in I_j}g_i\Big)w_j+\frac{1}{2}\Big(\sum_{i\in I_j}h_i+\lambda\Big)w_j^2\right]+\gamma T\\ &=\sum_{j=1}^T\left[G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2\right]+\gamma T \end{aligned}$$

where $G_j=\sum_{i\in I_j}g_i$ and $H_j=\sum_{i\in I_j}h_i$.

When the tree structure is fixed (i.e. $q(x)$ is fixed), setting $\frac{\partial F^t}{\partial w_j}=0$ gives

$$w^*_j=-\frac{G_j}{H_j+\lambda}$$

Substituting $w^*_j$ back, the optimal value of the objective function is

$$F=-\frac{1}{2}\sum_{j=1}^T\frac{G_j^2}{H_j+\lambda}+\gamma T$$
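These two formulas translate directly into code; the sketch below is a literal transcription with variable names of my own choosing:

```python
# Optimal leaf weight and structure score for a fixed tree structure
# (a direct transcription of the formulas above, for illustration).
import numpy as np

def leaf_weight(G, H, lam):
    """w*_j = -G_j / (H_j + lambda)."""
    return -G / (H + lam)

def structure_score(G_per_leaf, H_per_leaf, lam, gamma):
    """F = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T."""
    G = np.asarray(G_per_leaf, dtype=float)
    H = np.asarray(H_per_leaf, dtype=float)
    return -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * len(G)
```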

Normally, a greedy strategy is used to generate every node of the decision tree, and the gain of each candidate split is the resulting reduction of the objective, $Gain=F_{\text{parent}}-F_{\text{left child}}-F_{\text{right child}}$; specifically:

$$Gain=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$$
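Putting the gain formula to work, here is a simplified exact greedy split search over a single feature (an illustrative sketch; real XGBoost uses optimized, multithreaded and approximate variants):

```python
# Exact greedy split search on one feature, scoring each threshold with the
# gain formula above. Returns the best gain and the corresponding threshold.
import numpy as np

def best_split(x, g, h, lam, gamma):
    x = np.asarray(x, dtype=float)
    g = np.asarray(g, dtype=float)
    h = np.asarray(h, dtype=float)
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_thr = 0.0, None
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:            # cannot split between equal values
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam)
                      + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2.0
    return best_gain, best_thr
```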

4 Differences between GBDT and XGBoost

  • GBDT uses CART as its base learner, while XGBoost also supports other kinds of base learners.
  • GBDT uses only the first-order derivative of the loss, while XGBoost uses a second-order Taylor expansion (both $g_i$ and $h_i$).
  • XGBoost searches for the best split point with multiple threads, which greatly speeds up training.
  • To reduce over-fitting, XGBoost introduces shrinkage (a learning rate) and column subsampling; a minimal usage sketch follows this list.
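A minimal usage sketch with the xgboost scikit-learn wrapper, showing where shrinkage, column subsampling, and multithreading appear (all hyper-parameter values here are arbitrary, for illustration only):

```python
# Minimal xgboost usage sketch: learning_rate is the shrinkage factor,
# colsample_bytree enables column subsampling, n_jobs controls threading.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + 0.1 * rng.normal(size=500)

model = xgb.XGBRegressor(
    n_estimators=200,       # number of boosting rounds (trees)
    max_depth=4,
    learning_rate=0.1,      # shrinkage
    colsample_bytree=0.8,   # column subsampling
    subsample=0.8,          # row subsampling
    reg_lambda=1.0,         # the lambda in the leaf-weight formula
    gamma=0.0,              # the gamma penalty per leaf
    n_jobs=4,               # multithreaded split finding
)
model.fit(X, y)
print("training R^2:", model.score(X, y))
```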