Using Gradient Boosted Trees
The GBDT algorithm
GBDT procedure
Input: training dataset $D=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}$
1. Initialize $f_{0}(x)=0$
2. For $m=1,2, \ldots, M$ (steps 3 to 5 are repeated in every iteration):
3. For each sample $\left(x_{i}, y_{i}\right)$, compute the residual $r_{m, i}=y_{i}-f_{m-1}\left(x_{i}\right), \quad i=1,2, \ldots, N$
4. Fit a decision tree (regression tree) to $\left\{\left(x_{i}, r_{m, i}\right)\right\}_{i=1,2, \ldots, N}$, obtaining $T\left(x ; \Theta_{m}\right)$
5. Update $f_{m}(x)=f_{m-1}(x)+T\left(x ; \Theta_{m}\right)$
6. After all iterations, the boosted tree is $f_{M}(x)=\sum_{m=1}^{M} T\left(x ; \Theta_{m}\right)$
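The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses scikit-learn's `DecisionTreeRegressor` as the regression tree, and the helper names `fit_boosting_tree`, `predict_boosting_tree` and `training_mse`, the toy data, and the choice of depth-1 trees are all made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosting_tree(X, y, n_rounds=5, max_depth=1):
    """Fit trees to successive residuals, starting from f_0(x) = 0."""
    trees = []
    f = np.zeros(len(y))                        # step 1: f_0(x) = 0
    for _ in range(n_rounds):                   # step 2: m = 1..M
        residual = y - f                        # step 3: r_{m,i} = y_i - f_{m-1}(x_i)
        t = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)  # step 4
        f = f + t.predict(X)                    # step 5: f_m = f_{m-1} + T(x; Theta_m)
        trees.append(t)
    return trees

def predict_boosting_tree(trees, X):
    # step 6: f_M(x) is the sum of all fitted trees
    return sum(t.predict(X) for t in trees)

# Toy data (hypothetical, chosen only for illustration)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1.0, 1.2, 3.0, 3.1, 5.0, 5.2])

def training_mse(n_rounds):
    trees = fit_boosting_tree(X, y, n_rounds=n_rounds)
    return float(np.mean((y - predict_boosting_tree(trees, X)) ** 2))

print(training_mse(1), training_mse(20))
```

With more rounds the training error shrinks, because every new tree is fitted to whatever the previous trees failed to explain.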
Negative gradient and residuals
GBDT stands for Gradient Boosting Decision Tree, which can be read as gradient boosting + decision trees. Friedman proposed an approximation based on steepest descent: fit each base learner to the negative gradient of the loss function:
$$-\left[\frac{\partial L\left(y_{i}, F\left(\mathbf{x}_{i}\right)\right)}{\partial F\left(\mathbf{x}_{i}\right)}\right]_{F(\mathbf{x})=F_{t-1}(\mathbf{x})}$$
To see what this approximation means, consider the squared loss.
To make differentiation convenient, multiply the loss by 1/2:
$$L\left(y_{i}, F\left(\mathbf{x}_{i}\right)\right)=\frac{1}{2}\left(y_{i}-F\left(\mathbf{x}_{i}\right)\right)^{2}$$
Differentiating with respect to $F\left(\mathbf{x}_{i}\right)$ gives:
$$\frac{\partial L\left(y_{i}, F\left(\mathbf{x}_{i}\right)\right)}{\partial F\left(\mathbf{x}_{i}\right)}=F\left(\mathbf{x}_{i}\right)-y_{i}$$
The residual is the negative of this gradient, that is:
$$r_{t i}=y_{i}-F_{t-1}\left(\mathbf{x}_{i}\right)=-\left[\frac{\partial L\left(y_{i}, F\left(\mathbf{x}_{i}\right)\right)}{\partial F\left(\mathbf{x}_{i}\right)}\right]_{F(\mathbf{x})=F_{t-1}(\mathbf{x})}$$
For the squared loss the negative gradient is exactly the residual, so GBDT fits the negative gradient in place of the residual.
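The identity between the residual and the negative gradient can be checked numerically. The sketch below is a minimal illustration, assuming the halved squared loss defined above; the target values are borrowed from the worked example later in this post, and the constant current prediction of 20 (their mean) is a hypothetical stand-in for $F_{t-1}(\mathbf{x}_i)$.

```python
import numpy as np

def squared_loss(y, f):
    # L(y, F) = 1/2 * (y - F)^2
    return 0.5 * (y - f) ** 2

y = np.array([14.0, 16.0, 24.0, 26.0])
f_prev = np.full_like(y, 20.0)   # assumed current predictions F_{t-1}(x_i)

# Central finite difference approximates dL/dF at F = F_{t-1}
eps = 1e-6
grad = (squared_loss(y, f_prev + eps) - squared_loss(y, f_prev - eps)) / (2 * eps)

negative_gradient = -grad
residual = y - f_prev

print(negative_gradient)
print(residual)
```

Both printed arrays agree up to floating-point error, which is the statement $r_{ti} = -\partial L / \partial F$ for this loss.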
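GBDT procedure (regression)
GBDT is gradient boosting applied to decision trees (CART). A CART regression tree partitions the input space into K disjoint regions $R_{1}, \ldots, R_{K}$ and assigns each region an output $c_{k}$. With $I(\cdot)$ the indicator function, this is written as:
$$f(\mathbf{X})=\sum_{k=1}^{K} c_{k} I\left(\mathbf{X} \in R_{k}\right)$$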
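To make the piecewise-constant form concrete, the following sketch fits a depth-1 tree with scikit-learn and rebuilds its predictions by hand as one constant per region. The data is made up for illustration, and reading $c_k$ as the mean target in each leaf assumes the default squared-error criterion.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D data with two clearly separated groups
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 2.0, 3.0, 10.0, 12.0, 14.0])

cart = DecisionTreeRegressor(max_depth=1).fit(X, y)

# apply() returns the leaf index of each sample, i.e. which region R_k it falls in
leaf_ids = cart.apply(X)

# c_k: mean target of the training samples in region R_k
c = {k: y[leaf_ids == k].mean() for k in np.unique(leaf_ids)}

# f(X) = sum_k c_k * I(X in R_k), evaluated sample by sample
manual = np.array([c[k] for k in leaf_ids])

print(c)
print(manual)
print(cart.predict(X))
```

The hand-built values coincide with `cart.predict(X)`: the tree is nothing more than a lookup from region to constant.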
Input: training dataset
$$D=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}$$
1. Initialize:
$$F_{0}(\mathbf{x})=\arg \min _{h_{0}} \sum_{i=1}^{N} L\left(y_{i}, h_{0}\left(\mathbf{x}_{i}\right)\right)=\arg \min _{c} \sum_{i=1}^{N} L\left(y_{i}, c\right)$$
2. For t = 1 to T do
2.1 Compute the negative gradient
$$\tilde{y}_{i}=-\left[\frac{\partial L\left(y_{i}, F\left(\mathbf{x}_{i}\right)\right)}{\partial F\left(\mathbf{x}_{i}\right)}\right]_{F(\mathbf{x})=F_{t-1}(\mathbf{x})}, \quad i=1,2, \cdots, N$$
2.2 Fit a regression tree to the residuals $\tilde{y}_{i}$, which gives the leaf regions $R_{t k}$ of the t-th tree:
$$h_{t}(\mathbf{x})=\sum_{k=1}^{K} c_{t k} I\left(\mathbf{x} \in R_{t k}\right)$$
2.3 Update
$$F_{t}(\mathbf{x})=F_{t-1}(\mathbf{x})+h_{t}(\mathbf{x})=F_{t-1}(\mathbf{x})+\sum_{k=1}^{K} c_{t k} I\left(\mathbf{x} \in R_{t k}\right)$$
3. Obtain the additive model: $F(\mathbf{x})=F_{0}(\mathbf{x})+\sum_{t=1}^{T} h_{t}(\mathbf{x})$
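The regression procedure can be sketched as follows. This is a minimal illustration, not how scikit-learn implements it: it assumes the halved squared loss (so the constant minimizer is the mean and the negative gradient is `y - F`), uses no learning rate because the formulas above contain none, and the names `gbdt_fit`, `gbdt_predict` and `neg_gradient` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=10, max_depth=3, neg_gradient=lambda y, F: y - F):
    """Gradient boosting for regression; the default neg_gradient is for 1/2*(y-F)^2."""
    f0 = float(y.mean())             # step 1: argmin_c sum 1/2*(y_i - c)^2 is the mean
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):         # step 2
        y_tilde = neg_gradient(y, F)                                    # step 2.1
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, y_tilde)  # step 2.2
        F = F + h.predict(X)                                            # step 2.3
        trees.append(h)
    return f0, trees

def gbdt_predict(f0, trees, X):
    # step 3: F(x) = F_0(x) + sum_t h_t(x)
    return f0 + sum(h.predict(X) for h in trees)

# Data taken from the worked example later in this post
X = np.array([[800, 3], [1200, 1], [1800, 4], [2500, 2]])
y = np.array([14.0, 16.0, 24.0, 26.0])

f0, trees = gbdt_fit(X, y)
print(f0)
print(gbdt_predict(f0, trees, X))
```

On four samples a depth-3 tree can fit its targets exactly, so without shrinkage the training predictions reproduce `y` after the first tree. A learning rate smaller than 1 slows this down deliberately.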
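GBDT procedure (classification)
For classification, GBDT still uses CART regression trees. Softmax maps the tree outputs to probabilities, and the following trees are fitted to the residuals of those probabilities.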
1. First train one regression tree per class; with three classes, train three trees. For example, if sample $x_{i}$ belongs to class 2, the inputs to the three trees are $(x_{i}, 0)$, $(x_{i}, 1)$ and $(x_{i}, 0)$. This is the typical one-vs-rest (OvR) scheme for multiclass training, and each tree is trained with the usual CART procedure. For sample $x_{i}$ this yields three predictions $F_{1}(x_{i})$, $F_{2}(x_{i})$ and $F_{3}(x_{i})$. As in multiclass logistic regression, softmax turns them into probabilities. For class 1:
$$p_{1}\left(\mathbf{x}_{i}\right)=\exp \left(F_{1}\left(\mathbf{x}_{i}\right)\right) / \sum_{l=1}^{3} \exp \left(F_{l}\left(\mathbf{x}_{i}\right)\right)$$
2. Compute the residual for each class separately. For class 1: $\tilde{y}_{i 1}=0-p_{1}\left(\mathbf{x}_{i}\right)$; for class 2: $\tilde{y}_{i 2}=1-p_{2}\left(\mathbf{x}_{i}\right)$; for class 3: $\tilde{y}_{i 3}=0-p_{3}\left(\mathbf{x}_{i}\right)$
3. Start the second round of training. The input for class 1 is $\left(\mathbf{x}_{i}, \tilde{y}_{i 1}\right)$, for class 2 it is $\left(\mathbf{x}_{i}, \tilde{y}_{i 2}\right)$, and for class 3 it is $\left(\mathbf{x}_{i}, \tilde{y}_{i 3}\right)$. Train three more trees on these inputs.
4. Repeat steps 2 and 3 until M rounds have been completed; this gives the final model. At prediction time, the class with the highest probability is the predicted class.
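The four classification steps can be sketched like this. It follows the description above literally (first round on 0/1 labels, later rounds on probability residuals, no learning rate), which is a simplification and should not be taken as what `GradientBoostingClassifier` does internally. The helper names and the toy data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def softmax(F):
    z = F - F.max(axis=1, keepdims=True)      # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def raw_scores(trees, X):
    # One column per class: F_1(x), ..., F_K(x) contributed by one round of trees
    return np.column_stack([t.predict(X) for t in trees])

def fit_gbdt_classifier(X, y, n_classes, n_rounds=5, max_depth=2):
    onehot = np.eye(n_classes)[y]             # 0/1 targets, one column per class
    F = np.zeros((len(y), n_classes))
    rounds = []
    target = onehot                           # step 1: first round fits the 0/1 labels
    for _ in range(n_rounds):
        trees = [DecisionTreeRegressor(max_depth=max_depth).fit(X, target[:, k])
                 for k in range(n_classes)]   # one regression tree per class
        F = F + raw_scores(trees, X)
        rounds.append(trees)
        target = onehot - softmax(F)          # step 2: residuals y - p for the next round
    return rounds

def predict_proba(rounds, X):
    F = sum(raw_scores(trees, X) for trees in rounds)
    return softmax(F)

# Toy data (hypothetical): three well-separated classes on a line
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0], [20.0], [21.0], [22.0]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

rounds = fit_gbdt_classifier(X, y, n_classes=3)
proba = predict_proba(rounds, X)
print(proba.argmax(axis=1))                   # predicted class = highest probability
```

Each round adds three trees, one per class, so M rounds on K classes produce M × K trees in total.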
A worked example of how GBDT works
import numpy as np
import matplotlib.pyplot as plt
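# Regression is the limiting case of classification:
# once the number of classes grows large enough, classification becomes regression
from sklearn.ensemble import GradientBoostingClassifier,GradientBoostingRegressor
from sklearn import tree
# X (features): shopping amount and time spent online
# y (target): 14 (first year of senior high), 16 (third year of senior high), 24 (university graduate), 26 (two years into work)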
X = np.array([[800,3],[1200,1],[1800,4],[2500,2]])
y = np.array([14,16,24,26])
gbdt = GradientBoostingRegressor(n_estimators=10)
gbdt.fit(X,y)
gbdt.predict(X)
#array([16.09207064, 17.39471376, 22.60528624, 23.90792936])
The first decision tree: the initial prediction is the mean of y (20), which gives the residuals [-6, -4, 4, 6].
plt.rcParams["font.sans-serif"] = ["Heiti TC"]
plt.figure(figsize=(9,6))
_ = tree.plot_tree(gbdt[0,0],filled=True,feature_names=["消费","上网"])
# Compute the friedman_mse values shown in the tree nodes (root node, then the left child)
((y-y.mean())**2).mean()
#26.0
((y[:2]-y[:2].mean())**2).mean()
#1.0
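The values (-6, -4, 6, 4) are the differences between 14, 16, 26, 24 and the mean 20, i.e. the residuals.
The smaller the residuals, the more accurate the prediction.
The second decision tree: gradient boosting shrinks the residuals.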
plt.rcParams["font.sans-serif"] = ["Heiti TC"]
plt.figure(figsize=(9,6))
_ = tree.plot_tree(gbdt[1,0],filled=True,feature_names=["消费","上网"])
gbdt1 = np.array([-6,-4,6,4])
# Gradient boosting step: each residual shrinks by a factor of learning_rate = 0.1
gbdt2 = gbdt1 - gbdt1*0.1
#array([-5.4, -3.6, 5.4, 3.6])
The third decision tree
plt.rcParams["font.sans-serif"] = ["Heiti TC"]
plt.figure(figsize=(9,6))
# Index 2 selects the third tree (index 0 would plot the first tree again)
_ = tree.plot_tree(gbdt[2,0],filled=True,feature_names=["消费","上网"])
gbdt1 = np.array([-5.4,-3.6,5.4,3.6])
# Gradient boosting step, learning_rate = 0.1
gbdt1 - gbdt1 *0.1
#array([-4.86, -3.24, 4.86, 3.24])
The last tree
plt.rcParams["font.sans-serif"] = ["Heiti TC"]
plt.figure(figsize=(9,6))
_ = tree.plot_tree(gbdt[-1,0],filled=True,feature_names=["消费","上网"])
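# Leaf values of the last tree, read from the plot (learning_rate = 0.1)
# Stored under a new name so that the fitted model `gbdt` is not overwritten
last_tree_values = np.array([-2.325,-1.55,1.55,2.325])
# Gradient boosting step with learning rate 0.1
residual = last_tree_values - last_tree_values*0.1
residual
#array([-2.0925, -1.395 , 1.395 , 2.0925])
y - residual
#array([16.0925, 17.395 , 22.605 , 23.9075])
gbdt.predict(X)
#array([16.09207064, 17.39471376, 22.60528624, 23.90792936])
The final prediction of the algorithm is obtained from the residuals left after the last tree.
The values returned by predict agree with the hand calculation up to the rounding of the leaf values read from the plot.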
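The tree-by-tree arithmetic above can be replayed in one loop without reading rounded values from the plots. The sketch below uses only NumPy and rests on one assumption, stated again in the code: with four samples, every tree reproduces its residual targets exactly, so each round simply moves the prediction 10% of the way towards y.

```python
import numpy as np

y = np.array([14.0, 16.0, 24.0, 26.0])
learning_rate = 0.1
n_estimators = 10

prediction = np.full_like(y, y.mean())     # initial constant prediction: the mean, 20
for m in range(n_estimators):
    residual = y - prediction              # targets that tree m is fitted to
    # Assumption: tree m fits these residuals exactly (4 samples, small tree)
    prediction = prediction + learning_rate * residual

print(prediction)
```

After ten rounds the remaining residual is $0.9^{10}$ of the initial one, and the printed values match the `gbdt.predict(X)` output quoted above to the displayed precision.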