XGBoost Math Derivation: A Detailed, Easy-to-Follow Walkthrough

Bagging vs. Boosting:

Bagging:
Leverages unstable base learners that are weak because they overfit.

Boosting:
Leverages stable base learners that are weak because they underfit.

XGBoost

The learning process of XGBoost:

  1. How to set an objective function.
  2. Hard to solve directly, how to approximate.
  3. How to add tree structures into the objective function.
  4. Still hard to optimize, so use a greedy algorithm.

Step 1. How to set an objective function

Suppose we have trained $K$ trees. Then the prediction for the $i$-th sample is
$$\hat{y}_i = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F}$$
where $K$ is the number of trees, $f_k$ is a function in the functional space $\mathcal{F}$, and $\mathcal{F}$ is the set of all possible CARTs.

The objective function to be optimized is given by
$$\text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k)$$
The first term is the loss function and the second term controls the trees' complexity. The undefined pieces in this objective are the loss function $l$ and the model complexity $\Omega(f_k)$. The functions $f_k$ are what we need to learn, each containing the structure of a tree and its leaf scores.

We use an additive strategy to build the model: fix what we have already learned, and add one new tree at a time. This is called additive training.

Note that given $x_i$, the prediction satisfies $\hat{y}_i^{(K)} = \hat{y}_i^{(K-1)} + f_K(x_i)$, where the term $\hat{y}_i^{(K-1)}$ is known and only $f_K(x_i)$ is unknown.

Now, we can rewrite the objective function as
$$\text{obj}^{(K)} = \sum_{i=1}^n l\left(y_i, \hat{y}_i^{(K-1)} + f_K(x_i)\right) + \Omega(f_K) + \text{constant}$$
where the constant collects the complexity terms of the first $K-1$ trees, which are already fixed.

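To make additive training concrete, here is a minimal Python sketch. The `trees` list and its `.predict` method are hypothetical stand-ins for fitted regression trees, not part of XGBoost's API:

```python
import numpy as np

def predict_additive(trees, X):
    """Ensemble prediction y_hat = sum_k f_k(x), built one tree at a time."""
    y_hat = np.zeros(len(X))
    for f_k in trees:
        # Additive step: y_hat^(k) = y_hat^(k-1) + f_k(x)
        y_hat += f_k.predict(X)
    return y_hat
```
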
Step 2. Hard to solve directly, how to approximate

If the loss function is the MSE, the objective function has a nice closed form. For a general loss function, however, it is hard to optimize directly, so we need to approximate it. Here we use the second-order Taylor approximation
$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$$
Treating $\hat{y}_i^{(K-1)}$ as $x$ and $f_K(x_i)$ as $\Delta x$, we can rewrite the objective function as
$$\text{obj}^{(K)} \approx \sum_{i=1}^n \left[ l\left(y_i, \hat{y}_i^{(K-1)}\right) + \partial_{\hat{y}_i^{(K-1)}} l\left(y_i, \hat{y}_i^{(K-1)}\right) f_K(x_i) + \frac{1}{2}\, \partial^2_{\hat{y}_i^{(K-1)}} l\left(y_i, \hat{y}_i^{(K-1)}\right) f_K^2(x_i) \right] + \Omega(f_K)$$
Now define the constants $g_i$ and $h_i$ as
$$g_i = \partial_{\hat{y}_i^{(K-1)}} l\left(y_i, \hat{y}_i^{(K-1)}\right), \qquad h_i = \partial^2_{\hat{y}_i^{(K-1)}} l\left(y_i, \hat{y}_i^{(K-1)}\right)$$
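As a quick worked example (not part of the original derivation): for the squared-error loss $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$, these derivatives are
$$g_i = 2\left(\hat{y}_i^{(K-1)} - y_i\right), \qquad h_i = 2$$
and the second-order expansion is in fact exact for this loss, since the loss itself is quadratic in $\hat{y}_i$.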
Also, since the previous $K-1$ trees are fixed, the term $\sum_{i=1}^n l\left(y_i, \hat{y}_i^{(K-1)}\right)$ is a constant and can be dropped. Then we can write the objective function as
$$\text{obj}_K = \sum_{i=1}^n \left[ g_i f_K(x_i) + \frac{1}{2} h_i f_K^2(x_i) \right] + \Omega(f_K)$$
Now, all of the previous trees' information is stored in the terms $g_i$ and $h_i$; that is how the $K$-th tree is related to the previous trees. The unknowns at this point are $f_K$ and $\Omega(f_K)$.
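For instance, here is a small sketch of how $g_i$ and $h_i$ might be computed, assuming logistic loss on raw (pre-sigmoid) scores; the function name is illustrative, not an XGBoost internal:

```python
import numpy as np

def grad_hess_logistic(y, y_hat):
    """g_i and h_i for logistic loss, where y_hat is the raw score.

    l(y, y_hat) = -[y*log(p) + (1-y)*log(1-p)],  p = sigmoid(y_hat)
    =>  g = p - y,   h = p * (1 - p)
    """
    p = 1.0 / (1.0 + np.exp(-y_hat))
    g = p - y
    h = p * (1.0 - p)
    return g, h
```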

Step 3. How to add tree structures into the objective function

Redefine a tree:

  • Define $I_j = \{i \mid q(x_i) = j\}$ to be the set of indices of data points assigned to the $j$-th leaf.
  • $T$ is the number of leaves.
  • $q(x_i)$ is the leaf that the $i$-th sample is assigned to; the map $q(x)$ defines the tree structure.
  • $w_{q(x_i)}$ is the score of the assigned leaf, so $f_K(x_i) = w_{q(x_i)}$.
  • In XGBoost, we define the complexity as $\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2$.

Plugging $f_K(x_i) = w_{q(x_i)}$ and this complexity term into the objective, the revised objective function is
$$\text{obj}_K = \sum_{i=1}^n \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2 = \sum_{j=1}^T \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$
where the second equality groups the samples by the leaf they fall into.
Define the constants $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. Then the objective function is
$$\text{obj}_K = \sum_{j=1}^T \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$$

Right now, the only unknowns are the leaf scores $w_j$, and each $w_j$ appears in a single term of the sum, independent of the others. Taking the derivative with respect to $w_j$ and setting it to zero, we have
$$w_j = -\frac{G_j}{H_j + \lambda}$$
Then the objective function achieves its minimum at
$$\text{obj}_K = -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \gamma T$$
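A minimal numeric sketch of these two formulas, where `G` and `H` are per-leaf arrays of the sums $G_j$ and $H_j$, and `lam` and `gamma` stand for $\lambda$ and $\gamma$ (illustrative names, not library API):

```python
import numpy as np

def optimal_leaf_weights(G, H, lam):
    """w_j = -G_j / (H_j + lambda), computed per leaf."""
    return -G / (H + lam)

def structure_score(G, H, lam, gamma):
    """Minimum objective for a fixed tree structure with len(G) leaves."""
    T = len(G)
    return -0.5 * np.sum(G**2 / (H + lam)) + gamma * T
```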

Step 4. Still hard to optimize, so use a greedy algorithm

For any tree structure, we are now able to find the minimum value of the objective function. Ideally, we would brute-force all possible tree structures and pick the one with the minimum objective. However, this is infeasible in practice, so we use a greedy algorithm instead: grow the tree one split at a time. When we build ordinary decision trees, we use entropy to calculate the information gain. Here we still measure the gain of a split, but the formula changes to
$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
where $L$ denotes the left leaf and $R$ the right leaf produced by the split; a split is worthwhile only when the gain is positive.
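To illustrate the greedy search, here is a hedged sketch of an exact split scan over a single feature, a simplification of what the library actually does (the real implementation also handles missing values, approximate quantile splits, and multiple features):

```python
import numpy as np

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting a node into a left and a right leaf."""
    def score(G, H):
        return G**2 / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

def best_split(x, g, h, lam, gamma):
    """Scan all split points on one feature; return (best gain, threshold)."""
    order = np.argsort(x)
    g, h = g[order], h[order]
    G, H = g.sum(), h.sum()        # totals for the node being split
    G_L = H_L = 0.0                # running prefix sums for the left child
    best_gain, best_thr = -np.inf, None
    for k in range(len(x) - 1):
        G_L += g[k]
        H_L += h[k]
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if gain > best_gain:
            best_gain = gain
            best_thr = 0.5 * (x[order[k]] + x[order[k + 1]])
    return best_gain, best_thr
```

Sorting once and sweeping with prefix sums makes each candidate split an O(1) update, which is the same idea that makes the exact greedy algorithm tractable.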


Reference:
https://xgboost.readthedocs.io/en/latest/tutorials/model.html
