XGBoost: A Scalable Tree Boosting System
The paper is organized as follows:
1. Introduction.
2. A review of tree boosting and the authors' series of modifications.
3. Split finding algorithms.
4. System design for acceleration.
5. Related work.
6. Evaluation.
As you can see, the heart of the paper is Section 3; the rest is mostly work on optimization and speed.
Eq. (1) in the paper:

\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}

K: the number of weak learners (trees).
where \mathcal{F} = \{f(x) = w_{q(x)}\}
The paper notes:
"here q represents the structure of each tree that maps an example to the corresponding leaf index"
In other words, q(x) tells you which leaf the example x lands in.
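To make q and w concrete, here is a minimal sketch (my own illustration, not code from the paper): a tree f is the pair (q, w), where q maps an example to a leaf index and w holds one weight per leaf, and Eq. (1) sums the trees' outputs.

```python
# Illustrative sketch only: a depth-1 tree ("stump") represented as a
# structure function q plus a leaf-weight vector w.
def make_stump(feature, threshold, w):
    """q(x) = 0 if x[feature] < threshold else 1; f(x) = w[q(x)]."""
    def f(x):
        leaf = 0 if x[feature] < threshold else 1  # q(x): which leaf x reaches
        return w[leaf]                             # f(x) = w_{q(x)}
    return f

# Eq. (1): the ensemble prediction is the sum over the K trees.
trees = [make_stump(0, 0.5, [-0.2, 0.3]),
         make_stump(1, 1.0, [0.1, -0.4])]
x = [0.7, 0.2]
y_hat = sum(f(x) for f in trees)   # 0.3 + 0.1
```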
The simplified objective at step t, obtained from Eq. (3) by dropping the constant terms:

\tilde{L}^{(t)} = \sum_{i=1}^{n}\left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)
The paper continues by expanding \Omega(f_t) as follows:
--------------------------
Eq. (2) in the paper:

L(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)
where \Omega(f) = \gamma T + \frac{1}{2}\lambda \|w\|^2
Quoting the paper:
"Here l is a differentiable convex loss function that measures the difference between the prediction \hat{y}_i and the target y_i. The second term Ω penalizes the complexity of the model (i.e., the regression tree functions). The additional regularization term helps to smooth the final learnt weights to avoid over-fitting."
In other words, l is a differentiable convex function, so a minimum can be found.
--------------------------
Eq. (3) in the paper, a second-order Taylor expansion:

L^{(t)} \approx \sum_{i=1}^{n}\left[ l(y_i, \hat{y}^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)

Here n is the number of training instances.
Quoting the paper:
"Formally, let \hat{y}_i^{(t)} be the prediction of the i-th instance at the t-th iteration, we will need to add f_t to minimize the following objective."
g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})

h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})
--------------------------
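For concreteness, here are g_i and h_i worked out for two common losses (a sketch of my own, not code from the paper; the squared-error case takes l = ½(ŷ − y)²):

```python
import numpy as np

def squared_loss_grads(y, y_hat):
    """l = 1/2 (y_hat - y)^2  =>  g = y_hat - y,  h = 1."""
    return y_hat - y, np.ones_like(y_hat)

def logistic_loss_grads(y, y_hat):
    """Log loss with labels y in {0,1} and y_hat a raw score (logit):
    g = sigmoid(y_hat) - y,  h = sigmoid(y_hat) * (1 - sigmoid(y_hat))."""
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)
```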
Eq. (4) in the paper:

\tilde{L}^{(t)} = \sum_{i=1}^{n}\left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2
To push this derivation further we need a few identities (not spelled out in the paper):

\sum_{j=1}^{T}\left(\sum_{i\in I_j} g_i\right) w_j = \sum_{i=1}^{n} g_i f_t(x_i)

\frac{1}{2}\sum_{j=1}^{T}\sum_{i\in I_j} h_i w_j^2 = \frac{1}{2}\sum_{i=1}^{n} h_i f_t^2(x_i)

w_j = f_t(x_i) \text{ for every } i \in I_j

T: the number of leaves.
I_j: the set of instances that reach leaf j, i.e. I_j = \{i \mid q(x_i) = j\}.
w_j: the weight of leaf j; in the paper's words, "we use w_i to represent score on i-th leaf".

Substituting these into Eq. (4) and regrouping the sum by leaf:

\tilde{L}^{(t)} = \sum_{j=1}^{T}\left[\left(\sum_{i\in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i\in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T
--------------------------
Eq. (5) in the paper, the optimal weight of leaf j (set the derivative of the per-leaf quadratic to zero):

w_j^{*} = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i + \lambda}
--------------------------
Substituting Eq. (5) into Eq. (4) gives Eq. (6) in the paper:

\tilde{L}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^{T}\frac{\left(\sum_{i\in I_j} g_i\right)^2}{\sum_{i\in I_j} h_i + \lambda} + \gamma T
--------------------------
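Eqs. (5) and (6) are easy to check in code. Below is my own illustration (function names are mine): given the g_i, h_i of the instances grouped by leaf, compute the optimal leaf weight and the structure score used to compare tree structures.

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Eq. (5): w_j* = -sum(g_i) / (sum(h_i) + lambda) for one leaf."""
    return -g.sum() / (h.sum() + lam)

def structure_score(leaves, lam, gamma):
    """Eq. (6), where `leaves` is a list of (g, h) arrays, one pair per leaf.
    Lower is better: it is the minimized objective for this tree structure."""
    total = sum(gj.sum() ** 2 / (hj.sum() + lam) for gj, hj in leaves)
    return -0.5 * total + gamma * len(leaves)
```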
Eq. (7) then gives the criterion for scoring a candidate split:

L_{split} = \frac{1}{2}\left[\frac{\left(\sum_{i\in I_L} g_i\right)^2}{\sum_{i\in I_L} h_i + \lambda} + \frac{\left(\sum_{i\in I_R} g_i\right)^2}{\sum_{i\in I_R} h_i + \lambda} - \frac{\left(\sum_{i\in I} g_i\right)^2}{\sum_{i\in I} h_i + \lambda}\right] - \gamma
--------------------------
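Eq. (7) is exactly what the exact greedy split-finding algorithm (Section 3 of the paper) evaluates while scanning the sorted values of a feature. A minimal single-feature sketch of that scan (my own illustration, not the paper's implementation):

```python
import numpy as np

def best_split(x, g, h, lam, gamma):
    """Scan the sorted values of one feature and return (best_gain, threshold)
    according to Eq. (7). x, g, h are 1-D arrays of the same length."""
    order = np.argsort(x)
    xs, gs, hs = x[order], g[order], h[order]
    G, H = gs.sum(), hs.sum()          # totals over the parent node I
    GL = HL = 0.0                      # running sums for the left child I_L
    best_gain, best_thr = -np.inf, None
    for k in range(len(xs) - 1):
        GL += gs[k]; HL += hs[k]
        if xs[k] == xs[k + 1]:
            continue                   # no valid threshold between equal values
        GR, HR = G - GL, H - HL        # right child I_R by subtraction
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (xs[k] + xs[k + 1])
    return best_gain, best_thr
```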
The paper also uses "column subsampling": each tree is grown using only a random subset of the features. The final prediction of the XGBoost ensemble is then the sum of the individual trees' predictions.
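A toy sketch of that idea (my own illustration with made-up helper names, not the paper's implementation): each boosting round fits a stump to the current residuals using only a random subset of the columns, and the stump outputs are summed with shrinkage.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

def fit_stump(X, residual, cols):
    """Fit a depth-1 tree on the residuals, trying only the columns in `cols`
    and a few quantile thresholds; leaf values are residual means."""
    best = None
    for j in cols:
        for thr in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            left = X[:, j] < thr
            if left.all() or (~left).all():
                continue               # degenerate split, skip
            wl, wr = residual[left].mean(), residual[~left].mean()
            err = ((residual - np.where(left, wl, wr)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, thr, wl, wr)
    _, j, thr, wl, wr = best
    return lambda X: np.where(X[:, j] < thr, wl, wr)

pred = np.zeros_like(y)
for _ in range(20):
    cols = rng.choice(5, size=3, replace=False)  # column subsampling per tree
    stump = fit_stump(X, y - pred, cols)
    pred += 0.5 * stump(X)                       # shrinkage (learning rate) 0.5
```

In the actual library this corresponds to parameters such as `colsample_bytree`; here the subset is simply drawn once per round.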