XGBoost Derivation
Goal
Goal: we want to learn a model that is both accurate and simple.
The objective function can therefore be defined as:
$$\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}\right)+\sum_{k} \Omega\left(f_{k}\right), \quad f_{k} \in \mathcal{F}$$
Since each $f$ is a tree rather than a weight vector, we cannot use SGD to find the functions $f$. We can, however, find them with Additive Training (Boosting).
Additive Training (Boosting)
Start from a constant prediction and add one new function in each training round:
$$\begin{aligned} \hat{y}_{i}^{(0)}&=0 \\ \hat{y}_{i}^{(1)}&=f_{1}\left(x_{i}\right)=\hat{y}_{i}^{(0)}+f_{1}\left(x_{i}\right) \\ \hat{y}_{i}^{(2)}&=f_{1}\left(x_{i}\right)+f_{2}\left(x_{i}\right)=\hat{y}_{i}^{(1)}+f_{2}\left(x_{i}\right) \\ &\cdots \\ \hat{y}_{i}^{(t)}&=\sum_{k=1}^{t} f_{k}\left(x_{i}\right)=\hat{y}_{i}^{(t-1)}+f_{t}\left(x_{i}\right) \end{aligned}$$
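The recursion above can be sketched in a few lines; `f1` and `f2` here are hypothetical stand-ins for fitted trees, just to show how each round's output is accumulated:

```python
# Sketch of additive training: each round adds one function's output
# to the running prediction. f1 and f2 are hypothetical stand-ins
# for trees fitted in rounds 1 and 2.
def f1(x):
    return 0.5 * x

def f2(x):
    return 0.1 * x

x_i = 2.0
y_hat = 0.0                 # y_hat^(0) = 0
for f in (f1, f2):
    y_hat = y_hat + f(x_i)  # y_hat^(t) = y_hat^(t-1) + f_t(x_i)

print(y_hat)  # ≈ 1.2
```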
How to decide which function to add
Let the objective function decide!
In round $t$,
$$\hat{y}_{i}^{(t)}=\hat{y}_{i}^{(t-1)}+f_{t}\left(x_{i}\right)$$
so the objective function can be written as:
$$\begin{aligned} Obj^{(t)} &=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(t)}\right)+\sum_{i=1}^{t} \Omega\left(f_{i}\right) \\ &=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(t-1)}+f_{t}\left(x_{i}\right)\right)+\Omega\left(f_{t}\right)+\text{constant} \end{aligned}$$
Since the models of the first $t-1$ rounds are already fixed, their complexity is fixed as well, so $\sum_{k=1}^{t-1}\Omega(f_k) = \text{constant}$.
Taylor expansion of the objective function
The Taylor expansion
One-dimensional:
$$f(x+\Delta x) \simeq f(x)+f^{\prime}(x) \Delta x+\frac{1}{2} f^{\prime \prime}(x) \Delta x^{2}$$
Two-dimensional (expanding in $y$):
$$f(x, y+\Delta y) \simeq f(x,y) + \frac{\partial f(x,y)}{\partial y} \Delta y + \frac{1}{2}\frac{\partial ^2 f(x, y)}{\partial y^2}\Delta y^2$$
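A quick numeric sanity check of the one-dimensional second-order expansion, using $f(x)=x^3$ as an arbitrary test function:

```python
# Numeric check of the second-order Taylor expansion with f(x) = x**3:
# f(x + dx) ≈ f(x) + f'(x)*dx + 0.5*f''(x)*dx**2, error of order dx**3.
def f(x):  return x**3
def f1(x): return 3 * x**2   # first derivative
def f2(x): return 6 * x      # second derivative

x, dx = 1.0, 0.1
approx = f(x) + f1(x) * dx + 0.5 * f2(x) * dx**2
exact = f(x + dx)
print(exact, approx)  # 1.331 vs 1.33; the cubic term accounts for the gap
```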
Define
$$g_{i}=\partial_{\hat{y}^{(t-1)}} l\left(y_{i}, \hat{y}^{(t-1)}\right), \quad h_{i}=\partial_{\hat{y}^{(t-1)}}^{2} l\left(y_{i}, \hat{y}^{(t-1)}\right)$$
Then the objective function becomes:
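For a concrete loss these statistics have closed forms. A minimal sketch for squared-error loss $l(y,\hat{y})=(y-\hat{y})^2$, where $g_i = 2(\hat{y}-y)$ and $h_i = 2$:

```python
# For squared-error loss l(y, yhat) = (y - yhat)**2, the first and
# second derivatives with respect to the previous-round prediction are:
#   g_i = 2*(yhat - y),  h_i = 2  (constant)
def grad_hess_squared_loss(y, y_hat_prev):
    g = 2.0 * (y_hat_prev - y)   # first-order gradient
    h = 2.0                      # second-order gradient (hessian)
    return g, h

g, h = grad_hess_squared_loss(y=3.0, y_hat_prev=2.5)
print(g, h)  # -1.0 2.0
```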
$$Obj^{(t)} \simeq \sum_{i=1}^{n}\left[l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)+g_{i} f_{t}\left(x_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(x_{i}\right)\right]+\Omega\left(f_{t}\right)+\text{constant}$$
After removing the constant terms, the objective becomes:
$$\sum_{i=1}^{n}\left[g_{i} f_{t}\left(x_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(x_{i}\right)\right]+\Omega\left(f_{t}\right)$$
Defining the complexity of a tree
Represent the mapping from a sample to its leaf score as:
$$f_t(x) = w_{q(x)}, \quad q(x) \in \{1,2,\ldots,T\}$$
where $w$ holds the leaf weights and $T$ is the total number of leaves.
Define the complexity of the tree as:
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T}w_j^2$$
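This penalty is easy to evaluate directly. A minimal sketch for a hypothetical tree with three leaves:

```python
# Complexity of a tree with T leaves and leaf weights w_j:
#   Omega = gamma*T + 0.5*lambda*sum(w_j**2)
# The weights and hyperparameters below are illustrative only.
def omega(weights, gamma, lam):
    T = len(weights)
    return gamma * T + 0.5 * lam * sum(w * w for w in weights)

val = omega([2.0, 0.1, -1.0], gamma=1.0, lam=1.0)
print(val)  # ≈ 1*3 + 0.5*(4 + 0.01 + 1) = 5.505
```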
Solving the objective function
Now partition the samples by the leaf they fall into, $I_j = \left\{ i \mid q(x_i)=j \right\}$: samples in the same leaf form one group, giving $T$ groups in total.
$$\begin{aligned} Obj^{(t)} & \simeq \sum_{i=1}^{n}\left[g_{i} f_{t}\left(x_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(x_{i}\right)\right]+\Omega\left(f_{t}\right) \\ &=\sum_{i=1}^{n}\left[g_{i} w_{q\left(x_{i}\right)}+\frac{1}{2} h_{i} w_{q\left(x_{i}\right)}^{2}\right]+\gamma T+\frac{1}{2}\lambda \sum_{j=1}^{T} w_{j}^{2} \\ &=\sum_{j=1}^{T}\left[\left(\sum_{i \in I_{j}} g_{i}\right) w_{j}+\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i}+\lambda\right) w_{j}^{2}\right]+\gamma T \end{aligned}$$
Let $G_{j}=\sum_{i \in I_{j}} g_{i}$ and $H_{j}=\sum_{i \in I_{j}} h_{i}$.
Then the objective simplifies to
$$\begin{aligned} Obj^{(t)} &=\sum_{j=1}^{T}\left[\left(\sum_{i \in I_{j}} g_{i}\right) w_{j}+\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i}+\lambda\right) w_{j}^{2}\right]+\gamma T \\ &=\sum_{j=1}^{T}\left[G_{j} w_{j}+\frac{1}{2}\left(H_{j}+\lambda\right) w_{j}^{2}\right]+\gamma T \end{aligned}$$
For each $w_j$ this is a quadratic function, minimized at
$$w_j^* = - \frac{G_j}{2 \times \frac{1}{2}(H_j+\lambda)} = -\frac{G_j}{H_j + \lambda}$$
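A quick numeric check that the closed-form minimizer matches a brute-force search over the per-leaf quadratic, for illustrative values of $G_j$, $H_j$, $\lambda$:

```python
# Per-leaf objective: G*w + 0.5*(H + lam)*w**2, a quadratic in w.
# A grid search should land on the closed-form minimizer -G/(H + lam).
G, H, lam = 4.0, 3.0, 1.0

def leaf_obj(w):
    return G * w + 0.5 * (H + lam) * w * w

w_star = -G / (H + lam)   # closed-form minimizer: -1.0
best = min((leaf_obj(w / 100.0), w / 100.0) for w in range(-300, 300))
print(w_star, best[1])    # both -1.0; minimum value is -G**2/(2*(H+lam))
```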
at which the objective attains its minimum value:
$$\begin{aligned} Obj^{(t)} &= \sum_{j=1}^T\left[-\frac{G_j ^ 2}{4 \cdot\frac{1}{2} (H_j+\lambda)}\right] + \gamma T \\ &= -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \end{aligned}$$
Growing the tree
- Start from the root node (all samples in one node) at depth 0
- For each leaf node, try splitting it into two leaves; the change in the objective is:
$$Gain=\frac{1}{2}\left[\frac{G_{L}^{2}}{H_{L}+\lambda}+\frac{G_{R}^{2}}{H_{R}+\lambda}-\frac{\left(G_{L}+G_{R}\right)^{2}}{H_{L}+H_{R}+\lambda}\right]-\gamma$$
- Keep splitting until the split condition is no longer satisfied
How to find the best split
- For each feature, sort its values
- Try each feature value as a split point
- Across all features and all values, pick the split with the largest gain
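The steps above can be sketched for a single feature: sort by feature value, sweep the sorted order while accumulating $G_L$ and $H_L$, and score each candidate threshold with the gain formula. A minimal sketch with toy gradients, not the real XGBoost implementation:

```python
# Exact greedy split search for one feature: sort samples by feature
# value, sweep left-to-right accumulating G_L/H_L, and score each
# candidate threshold with Gain = 0.5*[GL^2/(HL+lam) + GR^2/(HR+lam)
# - (GL+GR)^2/(HL+HR+lam)] - gamma.
def best_split(x, g, h, lam=1.0, gamma=0.0):
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    GL = HL = 0.0
    best_gain, best_thresh = 0.0, None
    for rank, i in enumerate(order[:-1]):
        GL += g[i]; HL += h[i]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            nxt = order[rank + 1]   # threshold at the midpoint
            best_gain, best_thresh = gain, 0.5 * (x[i] + x[nxt])
    return best_gain, best_thresh

# toy data: gradients separate cleanly between x = 2 and x = 3
gain, thresh = best_split(x=[1, 2, 3, 4], g=[-1, -1, 1, 1], h=[1, 1, 1, 1])
print(gain, thresh)  # best threshold is 2.5
```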
Pruning and regularization
- The gain must not be negative; this balances training loss against tree complexity:
$$Gain=\frac{1}{2}\left[\frac{G_{L}^{2}}{H_{L}+\lambda}+\frac{G_{R}^{2}}{H_{R}+\lambda}-\frac{\left(G_{L}+G_{R}\right)^{2}}{H_{L}+H_{R}+\lambda}\right]-\gamma$$
- Early stopping: stop growing when the best split has negative gain (though that split might have enabled useful later splits)
- Alternatively, grow to a maximum depth, then prune away all leaves whose gain is negative
XGBoost algorithm steps
- In each round, start a new empty tree $f_t(x)$
- Compute the first- and second-order gradient statistics of every sample:
$$g_{i}=\partial_{\hat{y}^{(t-1)}} l\left(y_{i}, \hat{y}^{(t-1)}\right), \quad h_{i}=\partial_{\hat{y}^{(t-1)}}^{2} l\left(y_{i}, \hat{y}^{(t-1)}\right)$$
- Compute the gain of each candidate split (every feature, every value):
$$Gain=\frac{1}{2}\left[\frac{G_{L}^{2}}{H_{L}+\lambda}+\frac{G_{R}^{2}}{H_{R}+\lambda}-\frac{\left(G_{L}+G_{R}\right)^{2}}{H_{L}+H_{R}+\lambda}\right]-\gamma$$
- Keep growing the tree until the split condition is no longer satisfied
- Add this round's tree $f_t(x)$ to the model:
$$\hat{y}_i^{(t)}=\hat{y}_i^{(t-1)}+\epsilon f_{t}\left(x_{i}\right)$$
$\epsilon$ is called the step size (shrinkage): in each round we deliberately do not optimize fully, leaving room for later rounds, which helps prevent overfitting.
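The steps above can be assembled into one boosting round. A minimal sketch for squared-error loss with depth-1 trees (stumps), using illustrative data; this is a toy, not the real XGBoost implementation:

```python
# One boosting round on squared-error loss: compute g_i/h_i, pick the
# best stump split by gain, set each leaf weight to -G_j/(H_j + lam),
# and update predictions with step size eps.
def boost_round(x, y, y_hat, lam=1.0, eps=0.3):
    g = [2.0 * (p - t) for p, t in zip(y_hat, y)]   # squared-loss gradients
    h = [2.0] * len(y)                              # squared-loss hessians

    # exact greedy search over thresholds (midpoints of sorted x)
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    best, GL, HL = None, 0.0, 0.0
    for rank, i in enumerate(order[:-1]):
        GL += g[i]; HL += h[i]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2/(HL+lam) + GR**2/(HR+lam) - G**2/(H+lam))
        if best is None or gain > best[0]:
            thresh = 0.5 * (x[i] + x[order[rank + 1]])
            best = (gain, thresh, -GL/(HL+lam), -GR/(HR+lam))

    _, thresh, w_left, w_right = best
    f_t = lambda xi: w_left if xi <= thresh else w_right
    return [p + eps * f_t(xi) for p, xi in zip(y_hat, x)]

y_hat = boost_round(x=[1.0, 2.0, 3.0, 4.0], y=[0.0, 0.0, 1.0, 1.0],
                    y_hat=[0.5, 0.5, 0.5, 0.5])
print(y_hat)  # predictions move from 0.5 toward the targets
```

Repeating `boost_round` for several rounds, each time feeding back the updated predictions, reproduces the additive-training loop from the start of the derivation.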