# Principle

$\Large Obj(\Theta)=L(\Theta)+\Omega(\Theta)$

L1 regularization: $\Large \Omega(w)=\lambda||w||_1$
L2 regularization: $\Large \Omega(w)=\lambda||w||_2^2$

$\Large Obj=\sum_iL(y_i,\hat{y}_i)+\sum_{k=1}^K\Omega(f_k)$

$\Large \hat{y}_i^0=\text{constant}$

$\Large \hat{y}_i^1=\hat{y}_i^0+f_1(x_i)$

$\Large \hat{y}_i^2=\hat{y}_i^1+f_2(x_i)$

$\Large \hat{y}_i^K=\hat{y}_i^{K-1}+f_K(x_i) \tag{0}$
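The additive update in equation (0) can be sketched in a few lines of Python (the "trees" here are stand-in functions for illustration, not real xgboost trees):

```python
# Additive prediction: each tree f_k adds its output to the running score.
def predict(trees, x, base=0.0):
    y_hat = base              # \hat{y}^0: the constant initial prediction
    for f in trees:
        y_hat += f(x)         # \hat{y}^k = \hat{y}^{k-1} + f_k(x)
    return y_hat

# Two toy "trees" as plain functions:
trees = [lambda x: 0.1 * x, lambda x: -0.05 * x]
print(predict(trees, 2.0))    # 0.0 + 0.2 - 0.1 -> 0.1
```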

$Obj^K=\sum_iL(y_i,\hat{y}_i^K)+\Omega(f_K)+constant$

==>$Obj^K=\sum_iL\left(y_i,\hat{y}_i^{K-1}+f_K(x_i)\right)+\Omega(f_K)+constant$

(This idea comes mainly from the paper "Additive logistic regression: a statistical view of boosting", also by the great Friedman.)

$f(x+\Delta x)=f(x)+f'(x)\Delta x+\frac{1}{2}f''(x)\Delta x^2$

$\sum_iL\left(y_i,\hat{y}_i^{K-1}+f_K(x_i)\right)=\sum_i\left[L(y_i,\hat{y}_i^{K-1})+L'(y_i,\hat{y}_i^{K-1})f_K(x_i)+\frac{1}{2}L''(y_i,\hat{y}_i^{K-1})f_K^2(x_i)\right]$

$g_i=L'(y_i,\hat{y}_i^{K-1}) \tag{1}$
$h_i=L''(y_i,\hat{y}_i^{K-1}) \tag{2}$

Equations (1) and (2) are extremely important: they run through the entire tree-building process (splitting and computing leaf values). Equation (2) is also one of the evaluation metrics when using xgboost for feature selection.

$\sum_i\left[L(y_i,\hat{y}_i^{K-1})+g_if_K(x_i)+\frac{1}{2}h_if_K^2(x_i)\right]+\Omega(f_K)+constant$

$f(x)=\begin{cases}0.444444 & x_1<10\\ -0.4 & x_1\ge 10\end{cases}$

$\Omega(f_K)=\frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2+\gamma T$

$\sum_i\left[g_iw_{q(x_i)}+\frac{1}{2}h_iw_{q(x_i)}^2\right]+\frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2+\gamma T \tag{3}$

$\sum_{j=1}^{T}\left[\left(\sum_{i \in I_j}g_i\right)w_j+\frac{1}{2}\left(\sum_{i \in I_j}h_i+\lambda\right)w_j^2\right]+\gamma T \tag{4}$

Expanding (3) and grouping the terms by leaf index gives (4). Try working it out yourself with T=2.

$G_j=\sum_{i \in I_j}g_i \tag{5}$
$H_j=\sum_{i \in I_j}h_i \tag{6}$

$\sum_{j=1}^{T}\left[G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2\right]+\gamma T \tag{7}$

$w_j^*=-\frac{G_j}{H_j+\lambda} \tag{8}$
$Obj=-\frac{1}{2}\sum_{j=1}^T\frac{G_j^2}{H_j+\lambda}+\gamma T \tag{9}$
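Equations (8) and (9) are easy to check numerically. A minimal sketch (the function names and the sample numbers are my own, chosen only for illustration):

```python
# Per-leaf optimal weight, equation (8)
def leaf_weight(G, H, lam):
    return -G / (H + lam)

# Objective at the optimum, equation (9)
def objective(Gs, Hs, lam, gamma):
    T = len(Gs)  # number of leaves
    return -0.5 * sum(G * G / (H + lam) for G, H in zip(Gs, Hs)) + gamma * T

print(leaf_weight(2.0, 3.0, 1.0))         # -0.5
print(objective([2.0], [3.0], 1.0, 0.0))  # -0.5
```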

$Gain=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma \tag{10}$
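Equation (10) as a small Python function (a sketch; `lam` and `gamma` stand for λ and γ):

```python
def split_gain(GL, HL, GR, HR, lam, gamma):
    # Score of a leaf with gradient sum G and hessian sum H.
    def score(G, H):
        return G * G / (H + lam)
    # Gain of splitting the parent node (GL+GR, HL+HR) into two children,
    # minus the complexity cost gamma of the extra leaf.
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma
```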

1. A regularization term is added on top of the loss function.
2. The objective function is approximated with a second-order Taylor expansion.
3. The expression obtained from the derivation is used as the split criterion to build each tree.

# Summary of the xgboost algorithm flow

The flow chart of the core part of the xgboost algorithm is shown below.

(The m here appears to be d.)

# Reproducing xgboost's computation by hand

| ID | x1 | x2 | y |
|----|----|----|---|
| 1  | 1  | -5 | 0 |
| 2  | 2  | 5  | 0 |
| 3  | 3  | -2 | 1 |
| 4  | 1  | 2  | 1 |
| 5  | 2  | 0  | 1 |
| 6  | 6  | -5 | 1 |
| 7  | 7  | 5  | 1 |
| 8  | 6  | -2 | 0 |
| 9  | 7  | 2  | 0 |
| 10 | 6  | 0  | 1 |
| 11 | 8  | -5 | 1 |
| 12 | 9  | 5  | 1 |
| 13 | 10 | -2 | 0 |
| 14 | 8  | 2  | 0 |
| 15 | 9  | 0  | 1 |
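With base_score = 0.5, every sample starts with the same hessian and a gradient of ±0.5, using g = y_pred − y and h = y_pred(1 − y_pred) (derived just below). A quick check of the totals over all 15 samples:

```python
# Labels from the toy dataset above; initial prediction p = 0.5 (base_score).
y = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]
p = 0.5
g = [p - yi for yi in y]        # g_i = y_pred - y_i  -> +0.5 or -0.5
h = [p * (1 - p) for _ in y]    # h_i = y_pred * (1 - y_pred) = 0.25
print(sum(g), sum(h))           # -1.5 3.75
```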

$\large L(y_i,\hat{y}_i)=y_i\ln(1+e^{-\hat{y}_i})+(1-y_i)\ln(1+e^{\hat{y}_i})$

$\large L'(y_i,\hat{y}_i)=y_i\frac{-e^{-\hat{y}_i}}{1+e^{-\hat{y}_i}}+(1-y_i)\frac{e^{\hat{y}_i}}{1+e^{\hat{y}_i}}$
==>$\large L'(y_i,\hat{y}_i)=y_i\frac{-1}{1+e^{\hat{y}_i}}+(1-y_i)\frac{1}{1+e^{-\hat{y}_i}}$
==>$\large L'(y_i,\hat{y}_i)=y_i(y_{i,pred}-1)+(1-y_i)y_{i,pred}$
==>$\large L'(y_i,\hat{y}_i)=y_{i,pred}-y_i$, where $\large y_{i,pred}=\frac{1}{1+e^{-\hat{y}_i}}$

$\large L''(y_i,\hat{y}_i)=y_{i,pred}(1-y_{i,pred})$
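The gradient and hessian above in code (a sketch; `y_hat` is the raw score and `y_pred` its sigmoid):

```python
import math

def grad_hess(y, y_hat):
    y_pred = 1.0 / (1.0 + math.exp(-y_hat))  # sigmoid of the raw score
    g = y_pred - y                # L'(y, y_hat)
    h = y_pred * (1.0 - y_pred)   # L''(y, y_hat)
    return g, h

print(grad_hess(0, 0.0))  # (0.5, 0.25)
print(grad_hess(1, 0.0))  # (-0.5, 0.25)
```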

## Building the first tree (k=1)

$Gain=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma \tag{10}$

(Note that base_score is a sigmoid-mapped value, i.e. it can be read as a probability. I point this out because it matters later, when we build the second tree.)

$h_1=y_{1,pred}(1-y_{1,pred})=0.5\times(1-0.5)=0.25$

$G_R=\sum_{i \in I_R}g_i=(0.5+0.5+\cdots-0.5)=-1.5$

$H_R=\sum_{i \in I_R}h_i=(0.25+0.25+\cdots+0.25)=3.75$

$G_L=\sum_{i \in I_L}g_i=(0.5-0.5)=0$

$H_L=\sum_{i \in I_L}h_i=(0.25+0.25)=0.5$

$G_R=\sum_{i \in I_R}g_i=-1.5$

$H_R=\sum_{i \in I_R}h_i=3.25$

$Gain=\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}=0.0557275541796$

(Here the constant factor $\frac{1}{2}$ and the $\gamma$ term from (10) are dropped; with $\gamma=0$ they do not change which split is best.)

$G_L=\sum_{i \in I_L}g_i=0$

$H_L=\sum_{i \in I_L}h_i=1$

$G_R=\sum_{i \in I_R}g_i=-1.5$

$H_R=\sum_{i \in I_R}h_i=2.75$

$Gain=\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}=0.126315789474$

$w_j^*=-\frac{G_j}{H_j+\lambda} \tag{8}$

$w_1=-\frac{G_R}{H_R+\lambda}=-\frac{g_{13}}{h_{13}+1}=-\frac{0.5}{1+0.25}=-0.4$

$G_L=\sum_{i \in I_L}g_i=(0.5-0.5)=0$

$H_L=\sum_{i \in I_L}h_i=(0.25+0.25)=0.5$

$G_R=\sum_{i \in I_R}g_i=-2$

$H_R=\sum_{i \in I_R}h_i=3$

$Gain=\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}=0.111111111111$

$w_2=-\frac{G_R}{H_R+\lambda}=-\frac{g_1}{h_1+1}=-\frac{0.5}{1+0.25}=-0.4$

$w_3=-\frac{G_R}{H_R+\lambda}=-\frac{g_3+g_5+g_6+g_8+g_{10}+g_{11}+g_{15}}{h_3+h_5+h_6+h_8+h_{10}+h_{11}+h_{15}+1}=-\frac{-2.5}{1+1.75}=0.909$
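The three leaf weights can be re-derived from equation (8) with λ = 1 and the g_i = ±0.5, h_i = 0.25 statistics (a sketch with the per-leaf statistics typed in by hand):

```python
lam = 1.0

def leaf_w(gs, hs):
    # Equation (8): w* = -G / (H + lambda) for the samples in one leaf.
    return -sum(gs) / (sum(hs) + lam)

w1 = leaf_w([0.5], [0.25])                   # leaf containing sample 13
w2 = leaf_w([0.5], [0.25])                   # leaf containing sample 1
w3 = leaf_w([-0.5] * 6 + [0.5], [0.25] * 7)  # samples 3,5,6,8,10,11,15: G = -2.5
print(round(w1, 3), round(w2, 3), round(w3, 3))  # -0.4 -0.4 0.909
```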

## Building the second tree (k=2)

$\hat{y}_i^K=\sum_{k=0}^K f_k(x_i) \tag{13}$
$\hat{y}_i^1=f_0(x_i)+f_1(x_i) \tag{14}$

The value of $f_1(x_i)$ is the leaf value of the first tree that sample $x_i$ falls into. So what, then, is $f_0(x_i)$? This is where the earlier remark comes in: base_score is a sigmoid-mapped value (because logloss is used as the loss function, the probability is $p=\frac{1}{1+e^{-x}}$), so $f_0(x_i)$ is the raw score before the sigmoid; with base_score = 0.5 the initial raw score is $0$.

(In fact, when the number of training rounds K is large enough, this initial value has almost no effect; the official documentation notes this.)

| ID | $y_{i,pred}$ |
|----|--------------|
| 1  | 0.490001 |
| 2  | 0.494445 |
| 3  | 0.522712 |
| 4  | 0.494445 |
| 5  | 0.522712 |
| 6  | 0.522712 |
| 7  | 0.494445 |
| 8  | 0.522712 |
| 9  | 0.494445 |
| 10 | 0.522712 |
| 11 | 0.522712 |
| 12 | 0.509999 |
| 13 | 0.490001 |
| 14 | 0.494445 |
| 15 | 0.522712 |

$p_{1,pred}=\frac{1}{1+e^{(0+0.04)}}=0.490001$
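A quick check of that number (the 0.04 in the exponent presumably comes from the leaf value −0.4 scaled by a learning rate of 0.1; the learning rate is my assumption, the document only shows the resulting exponent):

```python
import math

raw = 0.0 + (-0.04)                # initial raw score 0 plus the (scaled) leaf value
p1 = 1.0 / (1.0 + math.exp(-raw))  # sigmoid back to a probability
print(round(p1, 6))                # 0.490001
```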

| ID | $g_i$ | $h_i$ |
|----|-------|-------|
| 1  | 0.490001320839 | 0.249900026415 |
| 2  | 0.494444668293 | 0.24996913829 |
| 3  | -0.477288365364 | 0.249484181652 |
| 4  | -0.505555331707 | 0.24996913829 |
| 5  | -0.477288365364 | 0.249484181652 |
| 6  | -0.477288365364 | 0.249484181652 |
| 7  | -0.505555331707 | 0.24996913829 |
| 8  | 0.522711634636 | 0.249484181652 |
| 9  | 0.494444668293 | 0.24996913829 |
| 10 | -0.477288365364 | 0.249484181652 |
| 11 | -0.477288365364 | 0.249484181652 |
| 12 | -0.490001320839 | 0.249900026415 |
| 13 | 0.490001320839 | 0.249900026415 |
| 14 | 0.494444668293 | 0.24996913829 |
| 15 | -0.477288365364 | 0.249484181652 |

## Training details: handling missing values

The idea behind xgboost's handling of missing values is simple; see the algorithm flow below:

$I_k$ is the set of samples without missing values.
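A minimal sketch of the idea (my own simplification, not xgboost's actual implementation): the gradient/hessian sums of the missing-value samples are tried on each side of the split, and the side giving the larger gain becomes the learned default direction.

```python
def score(G, H, lam=1.0):
    # Leaf score G^2 / (H + lambda), as in the gain formula.
    return G * G / (H + lam)

def best_default_direction(GL, HL, GR, HR, G_miss, H_miss, lam=1.0):
    # Try sending the missing-value samples left, then right; keep the better gain.
    gain_left = score(GL + G_miss, HL + H_miss, lam) + score(GR, HR, lam)
    gain_right = score(GL, HL, lam) + score(GR + G_miss, HR + H_miss, lam)
    return "left" if gain_left >= gain_right else "right"

print(best_default_direction(0.0, 0.5, -1.5, 2.75, -0.5, 0.25))  # right
```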

# How can xgboost be used for feature selection?

get_fscore() returns exactly this metric.

'cover' - the average coverage of the feature when it is used in trees. Roughly, this measures how many samples the feature covers.

# Differences and connections between xgboost and traditional GBDT

1. One difference between xgboost and GBDT lies in the objective function:

in GBDT, there is only the loss function, with no regularization term.
2. xgboost uses second-order derivative information, while GBDT uses only first-order derivatives.
3. The split criterion xgboost uses when building trees is derived from the objective function, while GBDT builds trees with heuristic criteria. (Personally, I think this is where xgboost shines, and the reason it goes to the trouble of a second-order Taylor expansion.)
4. xgboost handles missing values automatically, learning the default split direction for them; GBDT (the sklearn version) does not allow missing values.
5. There are various other differences in engineering implementation (not covered in this article, so I won't go into them).

1. The learning processes of xgboost and GBDT are the same: both follow the Boosting idea of first learning the first n-1 learners and then learning the n-th learner on top of them. (Boosting)