1 GBDT
1.1 Taylor expansion
The Taylor formula describes the values of a function near a point using information about the function at that point.
First-order Taylor expansion: $f(x)\approx f(x_0)+f'(x_0)(x-x_0)$
Second-order Taylor expansion: $f(x)\approx f(x_0)+f'(x_0)(x-x_0)+f''(x_0)\frac{(x-x_0)^2}{2}$
If $\Delta x = x^t-x^{t-1}$,
then the second-order Taylor expansion of $f(x^t)$ at $x^{t-1}$ is: $f(x^t)=f(x^{t-1}+\Delta x)\approx f(x^{t-1})+f'(x^{t-1})\Delta x+f''(x^{t-1})\frac{\Delta x^2}{2}$
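As a quick numerical check (a minimal sketch, using $f(x)=e^x$ purely for illustration), the second-order expansion around $x_0$ approximates $f(x_0+\Delta x)$ with an error of order $\Delta x^3$:

```python
import numpy as np

# Second-order Taylor approximation of f(x) = exp(x) around x0.
# For exp, f = f' = f'', which keeps the sketch short.
def taylor2(f, f1, f2, x0, dx):
    return f(x0) + f1(x0) * dx + f2(x0) * dx ** 2 / 2

x0, dx = 1.0, 0.1
approx = taylor2(np.exp, np.exp, np.exp, x0, dx)
exact = np.exp(x0 + dx)
print(abs(exact - approx))  # error is on the order of dx^3
```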
1.2 Gradient descent
In machine learning we usually need to minimize a loss function $L(\theta)$, where $\theta$ denotes the model parameters. Gradient descent is typically used to handle this unconstrained optimization problem iteratively, so the parameter update is:
$$\theta^{t}=\theta^{t-1}+\Delta\theta$$
Expanding $L(\theta^{t})$ to first order around $\theta^{t-1}$:
$$L(\theta^{t})\approx L(\theta^{t-1})+L'(\theta^{t-1})\Delta\theta$$
We want every update to achieve $L(\theta^{t})<L(\theta^{t-1})$; taking $\Delta\theta=-\alpha L'(\theta^{t-1})$ with a learning rate $\alpha>0$ makes the first-order term $-\alpha L'(\theta^{t-1})^2\le 0$, so the parameter update becomes:
$$\theta^{t}=\theta^{t-1}-\alpha L'(\theta^{t-1})$$
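A minimal sketch of this update rule, assuming the toy loss $L(\theta)=(\theta-3)^2$ (chosen only for illustration):

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose derivative is 2*(theta - 3).
theta = 0.0
alpha = 0.1  # learning rate
for _ in range(100):
    grad = 2 * (theta - 3)
    theta = theta - alpha * grad  # theta^t = theta^{t-1} - alpha * L'(theta^{t-1})
print(theta)  # approaches the minimizer 3
```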
1.3 From gradient descent to gradient boosting
When the object being optimized lives in parameter space, we get the gradient-descent update of Section 1.2. When the object being optimized is instead a function, each iteration updates the function as follows, where $f_t(x)$ is the function increment at iteration $t$:
$$f^t(x)=f^{t-1}(x)+f_t(x)$$
The function increment again fits the negative gradient; writing the first derivative as $g$:
$$f_t(x)=-\alpha_t g_t(x)$$
The final function equals the sum of the increments over all iterations:
$$F(x)=\sum_{t=0}^{T}f_t(x)$$
1.4 The Gradient Boosting Tree algorithm
GBDT defines the following additive model, where $x$ is an input sample, $w$ denotes the parameters of the regression trees, and $\alpha$ the weight of each tree:
$$F(x;w)=\sum_{t=0}^T\alpha_{t}h_t(x;w_t)=\sum_{t=0}^Tf_t(x;w_t)$$
From the update formula above, each tree is fit, on top of the previous model's output, to the gradient direction at that output.
The optimal model is found by minimizing the loss function:
$$F^*=\arg\min_F\sum_{i=0}^N L(y_i,F(x_i;w))$$
The overall algorithm is:
1. Initialize $f_0$
2. for t = 1 to T do:
 2.1 Compute the response $\widetilde{y_i}=-\left[\frac{\partial L(y_i,F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{t-1}(x)},\; i=1,2,\dots,N$
 2.2 Learn the t-th tree: $w^*=\arg\min_w\sum_{i=1}^N(\widetilde{y_i}-h_t(x_i;w))^2$
 2.3 Line search for the step size: $\rho^*=\arg\min_{\rho}\sum_{i=1}^N L(y_i,F_{t-1}(x_i)+\rho h_t(x_i;w^*))$
 2.4 Update the model: $F_t=F_{t-1}+\rho^* h_t(x;w^*)$
3. Output $F_T$
Step 2 builds $T$ trees in total. Step 2.1 computes the negative gradient at the previous model's output (MSE can be used as the loss function), and step 2.2 learns the current tree so that it fits this negative gradient as closely as possible. Below we briefly work out what negative gradient each tree actually fits when MSE is chosen as the loss. The loss function is defined as:
$$L(y_i,F_{t-1}(x_i))=\frac{1}{2}(y_i-F_{t-1}(x_i))^2$$
Taking the negative partial derivative of the loss with respect to the previous model's output gives:
$$-\left[\frac{\partial L(y_i,F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{t-1}(x)}=y_i-F_{t-1}(x_i)$$
So with MSE as the loss function, the negative gradient each tree fits is simply the difference between the label and the previous model's output.
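This identity is easy to confirm numerically; a small sketch comparing a finite-difference gradient of the MSE loss with the residual (illustrative values only):

```python
# With L = 0.5 * (y - F)^2, the negative gradient -dL/dF equals the residual y - F.
y, F = 2.0, 0.5
loss = lambda pred: 0.5 * (y - pred) ** 2
eps = 1e-6
num_grad = (loss(F + eps) - loss(F - eps)) / (2 * eps)  # central difference
print(-num_grad, y - F)  # both are ~1.5
```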
2 XGBoost
2.1 Newton's method
First, recall the second-order form of the Taylor expansion:
Second-order Taylor expansion: $f(x)\approx f(x_0)+f'(x_0)(x-x_0)+f''(x_0)\frac{(x-x_0)^2}{2}$
If $\Delta x = x^t-x^{t-1}$,
then the second-order Taylor expansion of $f(x^t)$ at $x^{t-1}$ is: $f(x^t)=f(x^{t-1}+\Delta x)\approx f(x^{t-1})+f'(x^{t-1})\Delta x+f''(x^{t-1})\frac{\Delta x^2}{2}$
Accordingly, expanding $L(\theta^t)$ to second order around $\theta^{t-1}$ gives:
$$L(\theta^t)\approx L(\theta^{t-1})+L'(\theta^{t-1})\Delta\theta+L''(\theta^{t-1})\frac{\Delta\theta^2}{2}$$
For convenience in the derivation that follows, write the first derivative at $\theta^{t-1}$ as $g$ and the second derivative as $h$, so:
$$L(\theta^t)\approx L(\theta^{t-1})+g\Delta\theta+h\frac{\Delta\theta^2}{2}$$
Our goal is to minimize $L(\theta^t)$. Since $L(\theta^{t-1})$ is a fixed result of the previous iteration, the goal reduces to minimizing $g\Delta\theta+h\frac{\Delta\theta^2}{2}$, i.e.:
$$\frac{\partial\big(g\Delta\theta+h\frac{\Delta\theta^2}{2}\big)}{\partial\Delta\theta}=0$$
Solving this gives:
$$\Delta\theta=-\frac{g}{h}$$
So the update for the parameter $\theta^t$ is:
$$\theta^{t}=\theta^{t-1}+\Delta\theta=\theta^{t-1}-\frac{g}{h}$$
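For a quadratic loss the Newton step lands on the minimizer in a single update; a minimal sketch with the same toy loss $L(\theta)=(\theta-3)^2$ used for illustration earlier:

```python
# Newton update on L(theta) = (theta - 3)^2: g = 2*(theta - 3), h = 2.
theta = 0.0
g = 2 * (theta - 3)    # first derivative at theta
h = 2.0                # second derivative (constant for a quadratic)
theta = theta - g / h  # theta^t = theta^{t-1} - g/h
print(theta)  # 3.0, reached in one step
```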
2.2 From Newton's method to Newton boosting
Like gradient boosting, Newton boosting is an additive model in function space:
$$f^t(x)=f^{t-1}(x)+f_t(x)$$
The difference from gradient boosting lies in the function increment:
$$f_t(x)=-\frac{g_t(x)}{h_t(x)}$$
The final function again equals the sum of the increments over all iterations:
$$F(x)=\sum_{t=0}^{T}f_t(x)$$
2.3 Algorithm details
The XGBoost objective contains not only the loss but also a regularization term to prevent overfitting, i.e. pruning is performed during training. With $i$ indexing the samples and $k$ indexing the trees:
$$L(\theta)=\sum_i l(\hat{y_i}, y_i)+\sum_k\Omega(f_k)$$
The regularization term expands as follows, where $T$ is the number of leaf nodes and $w$ collects the predicted values of the leaves:
$$\Omega(f)=\gamma T+\frac{1}{2}\lambda\|w\|^2$$
After $t$ iterations, the model prediction is:
$$\hat{y_i}=F_{t-1}(x_i)+f_t(x_i)$$
The objective at iteration $t$ is then, where $F_{t-1}(x_i)$ is the prediction for input sample $x_i$ from the regression trees accumulated over the first $t-1$ iterations:
$$L^t(\theta)=\sum_{i=1}^{N}l(y_i, F_{t-1}(x_i)+f_t(x_i))+\Omega(f_t)$$
We expand the above to second order around $F_{t-1}(x_i)$, where
$$g_i=\frac{\partial l(y_i,F_{t-1}(x_i))}{\partial F_{t-1}(x_i)},\qquad h_i=\frac{\partial^2 l(y_i,F_{t-1}(x_i))}{\partial F_{t-1}(x_i)^2}$$
giving:
$$L^t(\theta)\approx\sum_{i=1}^{N}\Big[l(y_i, F_{t-1}(x_i))+g_if_t(x_i)+\frac{1}{2}h_if^2_t(x_i)\Big]+\Omega(f_t)$$
Since $l(y_i, F_{t-1}(x_i))$ is fixed by the previous iterations, we drop it as a constant term to get:
$$\widetilde{L}^t(\theta)=\sum_{i=1}^{N}\Big[g_if_t(x_i)+\frac{1}{2}h_if^2_t(x_i)\Big]+\Omega(f_t)$$
Next we move to the granularity of tree nodes and rewrite the formula in terms of them. Let $f(x)=w_{q(x)}$, the predicted value of the node $q(x)$ that input sample $x$ falls into, and let $j$ index the $j$-th leaf node of the tree:
$$\widetilde{L}^t(\theta)=\sum_{i=1}^{N}\Big[g_if_t(x_i)+\frac{1}{2}h_if^2_t(x_i)\Big]+\Omega(f_t)=\sum_{i=1}^{N}\Big[g_iw_{q(x_i)}+\frac{1}{2}h_iw^2_{q(x_i)}\Big]+\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T}w^2_j$$
Let $I_j=\{i\mid q(x_i)=j\}$ be the set of samples falling into leaf $j$. Rearranging the sums by leaf gives:
$$\widetilde{L}^t(\theta)=\sum_{j=1}^{T}\Big[\big(\sum_{i\in I_j}g_i\big)w_j+\frac{1}{2}\big(\sum_{i\in I_j}h_i+\lambda\big)w^2_j\Big]+\gamma T=\sum_{j=1}^{T}\Big[G_jw_j+\frac{1}{2}(H_j+\lambda)w^2_j\Big]+\gamma T$$
Since the goal is to minimize the objective, we take the partial derivative with respect to the predicted value $w_j$ of leaf $j$ and set it to zero, which yields the optimal leaf prediction:
$$w_j^*=-\frac{G_j}{H_j+\lambda}$$
Substituting this optimal prediction back into the objective gives the minimum loss:
$$\widetilde{L}^{t*}=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda}+\gamma T$$
We can now use this per-node minimum loss as the criterion for finding the best split point of a regression tree. The formula shows that $\frac{G_j^2}{H_j+\lambda}$ reflects a node's contribution to the loss: we want the loss as small as possible, which means this contribution term should be as large as possible. The split gain is therefore defined as:
$$Gain=\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}-\gamma$$
We enumerate the candidate values of each feature, partition the samples into Left and Right sets by each candidate value, and compute the gain under the formula above; the larger the gain, the more the loss decreases after the split.
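The formulas for the optimal leaf weight and the split gain can be sketched directly; a minimal example assuming squared-error loss, for which $g_i=F_{t-1}(x_i)-y_i$ and $h_i=1$ (all values below are illustrative only):

```python
import numpy as np

lam, gamma = 1.0, 0.0  # regularization parameters lambda and gamma
y = np.array([1.0, 1.2, 3.0, 3.5])
pred = np.zeros(4)     # output of the previous trees, F_{t-1}
g = pred - y           # first-order gradients of 0.5*(y - F)^2
h = np.ones(4)         # second-order gradients

def leaf_weight(g, h):
    # w* = -G / (H + lambda)
    return -g.sum() / (h.sum() + lam)

def split_gain(gl, hl, gr, hr):
    score = lambda G, H: G ** 2 / (H + lam)
    return (score(gl.sum(), hl.sum()) + score(gr.sum(), hr.sum())
            - score(gl.sum() + gr.sum(), hl.sum() + hr.sum()) - gamma)

# split samples {0, 1} | {2, 3}
print(leaf_weight(g[:2], h[:2]))               # shrunken mean of the left group
print(split_gain(g[:2], h[:2], g[2:], h[2:]))  # positive gain: the split reduces the loss
```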
3 GBDT implementation
3.1 CART implementation
# encoding=utf8
import numpy as np
import matplotlib.pyplot as plt
from random import randrange
from FileReader import loadDataForCART
from FileWriter import writeByJson
from collections import defaultdict, Counter
from Metric import *
DISCRETE_FLAG = 0
CONTINUOUS_FLAG = 1
class Node:
def __init__(self, fea=None, val=None, res=None, flag=None, left=None, right=None):
self.fea = fea
self.val = val
self.res = res
self.flag = flag
self.left = left
self.right = right
class CART:
def __init__(self, epsilon=1e-3, minSample=1, maxDepth=20, task='classify', **kwargs):
self.modelName = 'CART'
self.epsilon = epsilon
self.minSample = minSample
self.tree = None
self.maxDepth = maxDepth
self.task = task
        # select the impurity and result functions depending on whether the task is classification or regression
if task == 'classify':
self.impurityFunc = self.getGini
self.resCalcFunc = self.calcNodeClassifyResult
self.getNoMergeErrorFunc = self.classifyByNode
self.getMergeErrorFunc = self.classifyByMajority
elif task == 'regression':
self.impurityFunc = self.getMSE
self.resCalcFunc = self.calcNodeRegressionResult
self.getNoMergeErrorFunc = self.regressByNode
self.getMergeErrorFunc = self.regressByMajority
    # determine whether a given feature dimension of the data set is continuous or discrete
def __discreteOrContinuous(self, xData, feaUnique, beta=0.2):
        # returns 1 for a continuous feature and 0 for a discrete one
if np.shape(feaUnique)[0] > beta * (np.shape(xData)[0]):
return CONTINUOUS_FLAG
else:
return DISCRETE_FLAG
    # for continuous data, compute the candidate binary split points (midpoints of adjacent values)
def __getSplits(self, feaArray):
feaArray = np.sort(feaArray)
array1 = feaArray[:-1]
array2 = feaArray[1:]
return (array1 + array2) / 2.0
def getGini(self, yData):
c = Counter(yData)
return 1 - sum([(val / float(yData.shape[0])) ** 2 for val in c.values()])
def getMSE(self, yData):
c = len(yData)
ySquaredSum = np.sum(np.power(yData, 2), axis=0)
ySum = np.sum(yData, axis=0)
return (ySquaredSum / c) - (ySum / c) ** 2
def __getFeaGini(self, set1, set2):
num = set1.shape[0] + set2.shape[0]
return float(set1.shape[0]) / num * self.getGini(set1) + float(set2.shape[0]) / num * self.getGini(set2)
def __getFeaImpurity(self, set1, set2):
num = set1.shape[0] + set2.shape[0]
return float(set1.shape[0]) / num * self.impurityFunc(set1) + \
float(set2.shape[0]) / num * self.impurityFunc(set2)
def calcNodeClassifyResult(self, yData):
return Counter(yData).most_common(1)[0][0]
def calcNodeRegressionResult(self, yData):
return np.mean(yData)
def bestSplit(self, splitSets, xData, yData):
preImpurity = self.impurityFunc(yData)
        # subdataInds maps each candidate split (feature index, split value, flag) to its sample indices
subdataInds = defaultdict(list)
        # collect, for every candidate split, the indices of the samples satisfying it
for split in splitSets:
for index, sample in enumerate(xData):
if split[2] == 1:
if sample[split[0]] <= split[1]:
subdataInds[split].append(index)
elif split[2] == 0:
if sample[split[0]] == split[1]:
subdataInds[split].append(index)
minPurity = preImpurity
bestSplit = None
bestSet = None
flag = None
for split, index in subdataInds.items():
set1 = yData[index]
set2Ind = list(set(range(yData.shape[0])) - set(index))
set2 = yData[set2Ind]
if set1.shape[0] < 1 or set2.shape[0] < 1:
continue
newImpurity = self.__getFeaImpurity(set1, set2)
if newImpurity < minPurity:
minPurity = newImpurity
bestSplit = split
bestSet = (index, set2Ind)
flag = split[2]
if abs(minPurity - preImpurity) < self.epsilon:
bestSplit = None
        # else:
        #     TODO compute feature importance as the impurity decrease:
        #     importance[bestSplit[0]] = preImpurity - minPurity
return bestSplit, bestSet, minPurity, flag
def build(self, splitSets, xData, yData, currentDepth):
        # if the node has fewer samples than the threshold (or max depth is reached), return a leaf result
if (yData.shape[0] < self.minSample) or (currentDepth == self.maxDepth):
return Node(res=self.resCalcFunc(yData))
        # find the best split point within the current sample set
bestSplit, bestSet, minImpurity, flag = self.bestSplit(splitSets, xData, yData)
if bestSplit is None:
return Node(res=self.resCalcFunc(yData))
else:
splitSets.remove(bestSplit)
left = self.build(splitSets, xData=xData[bestSet[0]], yData=yData[bestSet[0]],
currentDepth=currentDepth + 1)
right = self.build(splitSets, xData=xData[bestSet[1]], yData=yData[bestSet[1]],
currentDepth=currentDepth + 1)
return Node(fea=bestSplit[0], val=bestSplit[1], flag=flag, right=right, left=left)
    # pre-order traversal of the trained tree
def preorderTraversal(self, currentNode, dictWrite, depth, verbose=False):
if currentNode is None:
return
if verbose:
print("---" * depth, "[depth:{} feature:{}, val:{}, res:{}]".
format(depth, currentNode.fea, currentNode.val, currentNode.res))
dictWrite['fea'] = currentNode.fea
dictWrite['val'] = currentNode.val
dictWrite['res'] = currentNode.res
dictWrite['flag'] = currentNode.flag
if currentNode.left is not None:
dictWrite['left'] = {}
self.preorderTraversal(currentNode.left, dictWrite['left'], depth + 1)
else:
dictWrite['left'] = None
if currentNode.right is not None:
dictWrite['right'] = {}
self.preorderTraversal(currentNode.right, dictWrite['right'], depth + 1)
else:
dictWrite['right'] = None
def fit(self, xData, yData, randomFeature=False, nFeatures=None, mustReservedInd=None):
"""
Fit the input data and train the model
Parameters
----------
xData : numpy.ndarray
Feature set of input samples
yData : numpy.ndarray
Label set of input samples
randomFeature : bool, optional
            Whether to use a random subset of the features (random feature jitter)
nFeatures : int, optional
            Number of randomly sampled features, used when randomFeature is True. If nFeatures is None,
            it defaults to the square root of the total feature count.
mustReservedInd : int, optional
            A feature index that must be included in the training features, used when randomFeature is True.
"""
if not isinstance(xData, np.ndarray):
print("xData must be type of numpy.ndarray")
return
if not isinstance(yData, np.ndarray):
print("yData must be type of numpy.ndarray")
return
splitsSet = []
features = set()
        # optionally sample a random feature subset (as in random forests or XGBoost column subsampling)
if randomFeature:
if nFeatures is None:
nFeatures = np.round(np.sqrt(np.shape(xData)[1]))
while len(features) < nFeatures:
ind = randrange(1, np.shape(xData)[1])
if ind not in features:
features.add(ind)
            # the first feature indicates which link the current sample belongs to, so it is always kept
if mustReservedInd is not None:
if isinstance(mustReservedInd, int) is False:
print("mustReservedInd must be of type int or None")
return
features.add(mustReservedInd)
else:
features = set(range(np.shape(xData)[1]))
for feaInd in features:
            # collect all distinct values of this feature dimension
feaUnique = np.unique(xData[:, feaInd])
            # adaptively decide whether the current feature is continuous
flag = self.__discreteOrContinuous(xData, feaUnique)
if feaUnique.shape[0] < 2:
continue
elif feaUnique.shape[0] == 2:
if flag == CONTINUOUS_FLAG:
midVal = (feaUnique[0] + feaUnique[1]) / 2.0
splitsSet.append((feaInd, midVal, flag))
elif flag == DISCRETE_FLAG:
splitsSet.append((feaInd, feaUnique[0], flag))
else:
if flag == 1:
feaUnique = self.__getSplits(feaUnique)
for val in feaUnique:
splitsSet.append((feaInd, val, flag))
self.tree = self.build(splitsSet, xData, yData, 0)
        # perform post-pruning
dataSet = np.hstack((xData, yData.reshape(-1, 1)))
self.prune(dataSet, self.tree)
return self
def findRes(self, data, tree):
if tree.res is not None:
return tree.res
else:
if tree.flag == CONTINUOUS_FLAG:
if data[tree.fea] <= tree.val:
return self.findRes(data, tree.left)
elif data[tree.fea] > tree.val:
return self.findRes(data, tree.right)
elif tree.flag == DISCRETE_FLAG:
if data[tree.fea] == tree.val:
return self.findRes(data, tree.left)
else:
return self.findRes(data, tree.right)
def predict(self, xData):
return self.findRes(xData, self.tree)
    # ====================================== post-pruning ======================================
def __binSplit(self, dataSet, node):
fea = node.fea
val = node.val
flag = node.flag
if flag == DISCRETE_FLAG:
leftSet = dataSet[np.nonzero(dataSet[:, fea] == val)[0], :]
rightSet = dataSet[np.nonzero(dataSet[:, fea] != val)[0], :]
elif flag == CONTINUOUS_FLAG:
leftSet = dataSet[np.nonzero(dataSet[:, fea] <= val)[0], :]
rightSet = dataSet[np.nonzero(dataSet[:, fea] > val)[0], :]
return leftSet, rightSet
def classifyByNode(self, dataSet, node):
errorCount = 0
for data in dataSet:
res = self.findRes(data, node)
if res != data[-1]:
errorCount += 1
return errorCount
def regressByNode(self, dataSet, node):
sumError = 0
dataSize = dataSet.shape[0]
for data in dataSet:
res = self.findRes(data, node)
sumError += np.abs(res - data[-1])
return sumError / dataSize
def classifyByMajority(self, dataSet):
errorCount = 0
majLabel = Counter(dataSet[:, -1]).most_common(1)[0][0]
for data in dataSet:
if majLabel != data[-1]:
errorCount += 1
return errorCount, majLabel
def regressByMajority(self, dataSet):
sumError = 0
dataSize = dataSet.shape[0]
meanValue = np.mean(dataSet[:, -1])
for data in dataSet:
sumError += np.abs(data[-1] - meanValue)
return sumError / dataSize, meanValue
    # the last column of every sample in testSet is its label
def prune(self, testSet, node, verbose=False):
if verbose:
print("pruning now=====>")
if len(testSet) == 0:
return
lSet = None
rSet = None
if (node.left and node.left.res is None) or (node.right and node.right.res is None):
lSet, rSet = self.__binSplit(testSet, node)
if node.left and node.left.res is None:
self.prune(lSet, node.left)
if node.right and node.right.res is None:
self.prune(rSet, node.right)
if (node.left and node.left.res is not None) and (node.right and node.right.res is not None):
errorNoMerge = self.getNoMergeErrorFunc(testSet, node)
errorMerge, label = self.getMergeErrorFunc(testSet)
if errorMerge <= errorNoMerge:
node.res = label
node.left = None
node.right = None
node.fea = None
node.val = None
node.flag = None
3.2 Gradient boosting tree implementation
import numpy as np
from CART import CART
class GradientBoostTree:
def __init__(self, nTrees=30, epsilon=1e-3, alpha=0.09, minSample=1, maxDepth=20, earlyStopping=False, theta=0.5, **kwargs):
"""
Parameters
----------
nTrees: int
the number of trees for building GBDT
epsilon: float
the minimum acceptable change of impurity
alpha: float
the learning rate for each gradient boosting
minSample: int
the minimum number of samples in tree-node for building child nodes
maxDepth: int
the maximum depth for each tree
earlyStopping: bool
            if earlyStopping is True, GBDT training may stop early once the average
            residual of the predictions falls below the threshold theta
theta: float
            the threshold for stopping training early; only used when earlyStopping is True
"""
self.nTrees = nTrees
self.epsilon = epsilon
self.alpha = alpha
self.minSample = minSample
self.maxDepth = maxDepth
self.earlyStopping = earlyStopping
self.theta = theta
self.trees = []
self.xData = None
self.yData = None
self.mustReservedInd = None
def fit(self, xData, yData, mustReservedInd=None):
self.xData = xData
self.yData = yData
self.mustReservedInd = mustReservedInd
preds = np.zeros(yData.shape)
        # iterate over the T trees
for t in range(self.nTrees):
            # residuals between the labels and the current ensemble prediction, i.e. yData - preds
residuals = yData - preds
            if self.earlyStopping and np.abs(residuals).mean() < self.theta:
                print("current training average residual is {}, theta is {}, tree number is {}, stopping now"
                      .format(np.abs(residuals).mean(), self.theta, len(self.trees)))
return
            # fit a regression tree to the residuals
tree = CART(self.epsilon, self.minSample, self.maxDepth, 'regression')
tree.fit(xData, residuals, mustReservedInd=self.mustReservedInd)
self.trees.append(tree)
            # accumulate each sample's predicted residual into the running predictions
for idx, data in enumerate(self.xData):
preds[idx] += self.alpha * tree.predict(data)
def predict(self, xData):
if len(self.trees) <= 0:
print("you must fit data set firstly!")
return None
pred = 0
for tree in self.trees:
pred += self.alpha * tree.predict(xData)
return pred
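To show the training loop end to end without the file-reading dependencies above, here is a self-contained boosting sketch that replaces the CART learner with a depth-1 regression stump. This is an independent toy under illustrative data, not the classes defined in this section:

```python
import numpy as np

# Minimal gradient boosting with regression stumps: each round fits the
# residuals (the negative gradient under MSE) with a single threshold split.
def fit_stump(x, r):
    best = (np.inf, None, r.mean(), r.mean())
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[0]:
            best = (err, s, left.mean(), right.mean())
    return best[1:]  # (threshold, left value, right value)

def stump_predict(stump, x):
    s, lv, rv = stump
    return np.where(x <= s, lv, rv) if s is not None else np.full_like(x, lv)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.1, 2.9, 3.0, 3.2])
alpha, preds, stumps = 0.5, np.zeros_like(y), []
for _ in range(50):
    residuals = y - preds                # negative gradient under MSE
    stump = fit_stump(x, residuals)
    stumps.append(stump)
    preds += alpha * stump_predict(stump, x)
print(np.abs(y - preds).max())  # training residuals shrink toward 0
```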