Recommender System Notes (8): The Principles of GBDT and XGBoost (with Detailed Mathematical Derivations)

Preface

GBDT and XGBoost are both widely used tree models and representative boosting methods. XGBoost in particular has earned a reputation as a go-to tool for Kaggle competitions.

This post derives the mathematics of GBDT and XGBoost from the perspective of additive models. The two methods are both additive models and are very similar; they differ mainly in the order of the Taylor expansion used when fitting the residual.

Additive Model

Given a dataset $T=\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where $x_i\in\mathbb{R}^n$ and $y_i$ is the label.

We want a model $\hat y=F(x)$ that fits the dataset $T$.

The idea of boosting is to start with an arbitrary classifier, then add another classifier on top of it so that the two combined perform better than either one alone, and to keep adding classifiers until a stopping criterion is met.

An additive model means that, in this boosting process, the classifiers are combined by addition.

The procedure is as follows (a minimal sketch of this loop is given after the list):

  • Initial classifier: $f_0(x)=0$ (that is, we arbitrarily pick one label as the output). Unsurprisingly, this classifier performs very poorly.
  • We choose the next classifier $h_1(x)$ so that $f_0(x)+h_1(x)$ classifies better. Concretely, $h_1(x)=\arg\min\limits_{h}\sum_{i=1}^N Loss(y_i, f_0(x_i)+h(x_i))$
    This gives a new classifier $f_1(x)=f_0(x)+h_1(x)$.
    In practice, $h_1(x)$ is often called the residual, because it measures the gap between the new classifier $f_1(x)$ and the old classifier $f_0(x)$.
  • Next, we choose the following residual $h_2(x)$ in the same way:
    $h_2(x)=\arg\min\limits_{h}\sum_{i=1}^N Loss(y_i, f_1(x_i)+h(x_i))$
    This again yields a better classifier $f_2(x)=f_1(x)+h_2(x)$.
  • Repeating this process, at step $k$ we choose the residual $h_k(x)$:
    $h_k(x)=\arg\min\limits_{h}\sum_{i=1}^N Loss(y_i, f_{k-1}(x_i)+h(x_i))$
    and the resulting classifier is $f_k(x)=f_{k-1}(x)+h_k(x)$.
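
A minimal sketch of this additive loop is shown below. Here fit_residual stands for whatever procedure solves (or approximates) the argmin at each step; the GBDT and XGBoost sections give concrete instances. The function and parameter names are illustrative and not part of any library API.

```python
from typing import Callable, List
import numpy as np

def additive_boosting(X: np.ndarray, y: np.ndarray,
                      fit_residual: Callable,
                      n_rounds: int = 10) -> List:
    """Generic additive model: start from f_0(x) = 0, then repeatedly fit a
    residual learner h_k that improves f_{k-1}, and set f_k = f_{k-1} + h_k.

    fit_residual(X, y, f) must return an object with a .predict(X) method
    approximating argmin_h sum_i Loss(y_i, f(x_i) + h(x_i))."""
    f = np.zeros(len(y))               # f_0(x) = 0
    learners = []
    for _ in range(n_rounds):
        h = fit_residual(X, y, f)      # choose the next residual h_k
        f = f + h.predict(X)           # f_k = f_{k-1} + h_k
        learners.append(h)
    return learners
```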

From the additive-model procedure above, the real computation at each iteration is how to choose the residual $h_k(x)$.

The residual $h_k(x)$ is obtained from
$$h_k(x)=\arg\min\limits_{h}\sum_{i=1}^N Loss(y_i, f_{k-1}(x_i)+h(x_i))$$

To solve this optimization problem,

  • We only need to make sure that, for each data point $(x_i, y_i)\in T$, we choose $h(x_i)$ so that $Loss(y_i, f_{k-1}(x_i)+h(x_i))$ is as small as possible.
  • We Taylor-expand the loss $Loss(y, f_{k-1}(x)+h(x))$, either to first order, $Loss(y, f_{k-1}(x)+h(x))\approx Loss(y, f_{k-1}(x))+\frac{\partial Loss(y, f_{k-1}(x))}{\partial f_{k-1}(x)}\cdot h(x)$, or to second order, $Loss(y, f_{k-1}(x)+h(x))\approx Loss(y, f_{k-1}(x))+\frac{\partial Loss(y, f_{k-1}(x))}{\partial f_{k-1}(x)}\cdot h(x)+\frac{1}{2}\frac{\partial^2 Loss(y, f_{k-1}(x))}{\partial (f_{k-1}(x))^2}\cdot h(x)^2$.
  • The first-order expansion leads to GBDT; the second-order expansion leads to XGBoost (a worked example with the squared loss follows this list).
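
As a concrete illustration (the squared loss here is my own example, not part of the original derivation), take $Loss(y, f)=\frac{1}{2}(y-f)^2$. Its first derivative with respect to the prediction $f$ is $f-y$ and its second derivative is $1$, so the second-order expansion is in fact exact:
$$Loss(y, f+h)=\frac{1}{2}(y-f-h)^2=\underbrace{\frac{1}{2}(y-f)^2}_{Loss(y, f)}+\underbrace{(f-y)}_{\text{first derivative}}\cdot h+\frac{1}{2}\cdot\underbrace{1}_{\text{second derivative}}\cdot h^2$$
For this loss the negative gradient $-(f-y)=y-f$ is just the ordinary residual, which is one reason GBDT with squared loss is often described as "fitting the residuals".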

GBDT

Expanding the loss $Loss(y, f_{k-1}(x)+h(x))$ to first order, we have
$$Loss(y, f_{k-1}(x)+h(x))\approx Loss(y, f_{k-1}(x))+\frac{\partial Loss(y, f_{k-1}(x))}{\partial f_{k-1}(x)}\cdot h(x)$$

Note that if we set $h(x)=-\frac{\partial Loss(y, f_{k-1}(x))}{\partial f_{k-1}(x)}$, then
$$\begin{aligned}
&Loss(y, f_{k-1}(x))+\frac{\partial Loss(y, f_{k-1}(x))}{\partial f_{k-1}(x)}\cdot h(x)\\
=\;&Loss(y, f_{k-1}(x))-\left(\frac{\partial Loss(y, f_{k-1}(x))}{\partial f_{k-1}(x)}\right)^2\\
\le\;&Loss(y, f_{k-1}(x))
\end{aligned}$$

This means that with $h(x)=-\frac{\partial Loss(y, f_{k-1}(x))}{\partial f_{k-1}(x)}$ we have $Loss(y, f_{k-1}(x)+h(x))\le Loss(y, f_{k-1}(x))$.

That is, the loss decreases, which is exactly what we want! By taking $h(x)=-\frac{\partial Loss(y, f_{k-1}(x))}{\partial f_{k-1}(x)}$ at every iteration, we keep the loss decreasing (at least up to the first-order approximation).

GBDT is a sum of CART trees, which means $h_k(x)$ is itself a CART tree. We now know that for every $x_i\in T$, $h_k(x_i)=-\frac{\partial Loss(y_i, f_{k-1}(x))}{\partial f_{k-1}(x)}\big|_{f_{k-1}(x)=f_{k-1}(x_i)}$, $i=1, 2, \dots, N$. From the dataset $T_k=\{(x_1, h_k(x_1)), (x_2, h_k(x_2)), \dots, (x_N, h_k(x_N))\}$ we can then fit a CART tree.
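
A minimal GBDT sketch under the following assumptions: squared loss (so the negative gradient is simply $y-f(x)$), scikit-learn's DecisionTreeRegressor as the CART learner, and an extra shrinkage factor learning_rate that is common in practice (set it to 1.0 to match the update $f_k=f_{k-1}+h_k$ above). All names here are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Minimal GBDT for squared loss: each round fits a CART regression tree
    to the negative gradient, which for squared loss is just y - f(x)."""
    f = np.zeros(len(y))               # f_0(x) = 0
    trees = []
    for _ in range(n_rounds):
        residual = y - f               # h_k(x_i) = -dLoss/df evaluated at f_{k-1}(x_i)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)          # grow a CART tree on T_k = {(x_i, h_k(x_i))}
        f += learning_rate * tree.predict(X)   # f_k = f_{k-1} + eta * h_k
        trees.append(tree)
    return trees

def gbdt_predict(trees, X, learning_rate=0.1):
    """Sum of the fitted trees (the additive model F(x))."""
    return learning_rate * sum(tree.predict(X) for tree in trees)
```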

XGBoost

Expanding the loss $Loss(y, f_{k-1}(x)+h(x))$ to second order, we have
$$Loss(y, f_{k-1}(x)+h(x))\approx Loss(y, f_{k-1}(x))+\frac{\partial Loss(y, f_{k-1}(x))}{\partial f_{k-1}(x)}\cdot h(x)+\frac{1}{2}\frac{\partial^2 Loss(y, f_{k-1}(x))}{\partial (f_{k-1}(x))^2}\cdot h(x)^2$$

This means the overall objective to minimize becomes
$$\begin{aligned}
\min\limits_{h}\;&\sum_{i=1}^N Loss(y_i, f_{k-1}(x_i)+h(x_i))\\
\approx\;&\sum_{i=1}^N \left(Loss(y_i, f_{k-1}(x_i))+\frac{\partial Loss(y_i, f_{k-1}(x))}{\partial f_{k-1}(x)}\Big|_{f_{k-1}(x_i)}\cdot h(x_i)+\frac{1}{2}\frac{\partial^2 Loss(y_i, f_{k-1}(x))}{\partial (f_{k-1}(x))^2}\Big|_{f_{k-1}(x_i)}\cdot h(x_i)^2\right)\\
=\;&\sum_{i=1}^N\left(Loss(y_i, f_{k-1}(x_i))+g(x_i, y_i)\cdot h(x_i)+\frac{1}{2}u(x_i, y_i)\cdot h(x_i)^2\right)
\end{aligned}$$

where $g(x_i, y_i)=\frac{\partial Loss(y_i, f_{k-1}(x))}{\partial f_{k-1}(x)}\Big|_{f_{k-1}(x_i)}$ and $u(x_i, y_i)=\frac{\partial^2 Loss(y_i, f_{k-1}(x))}{\partial (f_{k-1}(x))^2}\Big|_{f_{k-1}(x_i)}$.
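
For concreteness, here is a small sketch of $g$ and $u$ for two common losses: the squared loss and the logistic loss with labels in $\{0, 1\}$. These specific losses are my own examples; the derivation above works for any twice-differentiable loss.

```python
import numpy as np

def grad_hess_squared(y, f):
    """g and u for Loss(y, f) = 1/2 (y - f)^2: g = f - y, u = 1."""
    g = f - y
    u = np.ones_like(f)
    return g, u

def grad_hess_logistic(y, f):
    """g and u for the logistic loss with y in {0, 1} and raw score f:
    Loss = -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(f),
    so g = p - y and u = p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-f))
    g = p - y
    u = p * (1.0 - p)
    return g, u
```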

We said that XGBoost is also a sum of trees, so $h(x)$ here is itself a tree, namely the one that minimizes the overall loss $\sum_{i=1}^N Loss(y_i, f_{k-1}(x_i)+h(x_i))$.

Going one step further, we also account for the complexity of the tree $h(x)$: we do not want it to have too many leaf nodes, so we add the tree's complexity to the loss, which becomes
$$\sum_{i=1}^N\left(Loss(y_i, f_{k-1}(x_i))+g(x_i, y_i)\cdot h(x_i)+\frac{1}{2}u(x_i, y_i)\cdot h(x_i)^2\right)+\gamma|\text{leaves}|+\frac{1}{2}\lambda\sum_{l\in \text{leaves}}w_l^2$$

where $\text{leaves}$ is the set of leaf nodes of the tree $h(x)$, and $w_l$ is the weight (i.e. the value) at leaf node $l$.

Note that we are optimizing over the tree $h(x)$, while $Loss(y_i, f_{k-1}(x_i))$ does not depend on $h(x)$, so it can be dropped from the objective above. The objective can then be written as
$$\sum_{i=1}^N\left(g(x_i, y_i)\cdot h(x_i)+\frac{1}{2}u(x_i, y_i)\cdot h(x_i)^2\right)+\gamma|\text{leaves}|+\frac{1}{2}\lambda\sum_{l\in \text{leaves}}w_l^2$$

Here $h(x_i)$ means that $x_i$ eventually follows the tree's decision path down to some leaf node $q(x_i)$ and receives the value $w_{q(x_i)}$, i.e. $h(x_i)=w_{q(x_i)}$. The objective can therefore be rewritten as
$$\begin{aligned}
&\sum_{i=1}^N\left(g(x_i, y_i)\cdot h(x_i)+\frac{1}{2}u(x_i, y_i)\cdot h(x_i)^2\right)+\gamma|\text{leaves}|+\frac{1}{2}\lambda\sum_{l\in \text{leaves}}w_l^2\\
=\;&\sum_{i=1}^N\left(g(x_i, y_i)\cdot w_{q(x_i)}+\frac{1}{2}u(x_i, y_i)\cdot w_{q(x_i)}^2\right)+\gamma|\text{leaves}|+\frac{1}{2}\lambda\sum_{l\in \text{leaves}}w_l^2\\
=\;&\sum_{l\in \text{leaves}}\left(w_l\sum_{q(x_i)=l}g(x_i, y_i)+\frac{1}{2}w_l^2\sum_{q(x_i)=l}u(x_i, y_i)\right)+\frac{1}{2}\lambda\sum_{l\in \text{leaves}}w_l^2+\gamma|\text{leaves}|\\
=\;&\sum_{l\in \text{leaves}}\left(G_l w_l+\frac{1}{2}U_l w_l^2\right)+\frac{1}{2}\lambda\sum_{l\in \text{leaves}}w_l^2+\gamma|\text{leaves}|\\
=\;&\sum_{l\in \text{leaves}}\left(\frac{1}{2}(\lambda+U_l)w_l^2+G_l w_l\right)+\gamma|\text{leaves}|
\end{aligned}$$
where $G_l=\sum_{q(x_i)=l}g(x_i, y_i)$ and $U_l=\sum_{q(x_i)=l}u(x_i, y_i)$ are the sums of first and second derivatives over the samples falling into leaf $l$.

Minimizing the expression above with respect to each $w_l$ gives
$$w_l^*=-\frac{G_l}{\lambda+U_l}$$

Substituting this back, the objective becomes
$$-\sum_{l\in \text{leaves}}\frac{G_l^2}{2(\lambda+U_l)}+\gamma|\text{leaves}|$$

Just as the CART trees in GBDT are grown according to impurity criteria such as the Gini index, XGBoost uses the score above to decide whether to split a leaf node, whether to prune, and so on, when growing the tree (see the sketch below).
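
A minimal sketch of how this score is typically used when growing the tree: a candidate split of a leaf into left and right children is kept only if it lowers the objective. The gain formula below follows directly from the score above; the function names and interface are my own, not the xgboost library's API.

```python
import numpy as np

def leaf_weight(G, U, lam):
    """Optimal leaf weight: w_l* = -G_l / (lambda + U_l)."""
    return -G / (lam + U)

def leaf_score(G, U, lam):
    """Contribution of one leaf to the objective: -G_l^2 / (2 * (lambda + U_l))."""
    return -G * G / (2.0 * (lam + U))

def split_gain(g, u, left_mask, lam=1.0, gamma=0.0):
    """Decrease of the objective when a leaf is split into left/right children.

    g, u      : per-sample first and second derivatives at this leaf
    left_mask : boolean array marking the samples sent to the left child
    A split is worth keeping only if the returned gain is positive."""
    G, U = g.sum(), u.sum()
    G_L, U_L = g[left_mask].sum(), u[left_mask].sum()
    G_R, U_R = G - G_L, U - U_L
    # gamma is charged once because the split adds one extra leaf
    return (leaf_score(G, U, lam)
            - leaf_score(G_L, U_L, lam)
            - leaf_score(G_R, U_R, lam)
            - gamma)
```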

Summary

GBDT and XGBoost are both additive tree models, and both grow their trees by fitting residuals. The difference is that GBDT fits the residual via a first-order Taylor expansion, whereas XGBoost uses a second-order Taylor expansion and grows its trees with its own splitting criterion.
