Gradient Boosting Trees: Principles and Source Code Implementation

1 GBDT

1.1 Taylor Expansion

The Taylor formula describes the values of a function in the neighborhood of a point using the function's derivatives at that point.

First-order Taylor expansion: $f(x)\approx f(x_0)+f'(x_0)(x-x_0)$

Second-order Taylor expansion: $f(x)\approx f(x_0)+f'(x_0)(x-x_0)+f''(x_0)\frac{(x-x_0)^2}{2}$

Let $\Delta{x}=x^t-x^{t-1}$.

Then the second-order Taylor expansion of $f(x^t)$ at $x^{t-1}$ is: $f(x^t)=f(x^{t-1}+\Delta{x})\approx f(x^{t-1})+f'(x^{t-1})\Delta{x}+f''(x^{t-1})\frac{\Delta{x}^2}{2}$
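
As a quick numeric check of the approximation (an illustrative example, with $f(x)=e^x$ chosen arbitrarily):

import numpy as np

# Second-order Taylor check for f(x) = exp(x), where f'(x) = f''(x) = exp(x).
x_prev = 1.0     # x^{t-1}
dx = 0.1         # Delta x
exact = np.exp(x_prev + dx)
taylor2 = np.exp(x_prev) + np.exp(x_prev) * dx + np.exp(x_prev) * dx ** 2 / 2
print(exact, taylor2)   # ~3.0042 vs ~3.0037 -- the two agree closely for small Delta x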

1.2 Gradient Descent

In machine learning we usually need to minimize a loss function $L(\theta)$, where $\theta$ denotes the model parameters. Gradient descent is typically used to handle this unconstrained optimization problem iteratively, so the parameter update is:
$\theta^{t}=\theta^{t-1}+\Delta\theta$
Expanding $L(\theta^{t})$ at $\theta^{t-1}$ to first order gives:
$L(\theta^{t})\approx L(\theta^{t-1})+L'(\theta^{t-1})\Delta\theta$
We want each iteration to achieve $L(\theta^{t})<L(\theta^{t-1})$, which is obtained by taking $\Delta\theta=-\alpha L'(\theta^{t-1})$, a step against the gradient scaled by the learning rate $\alpha$, so the parameter update becomes:
$\theta^{t}=\theta^{t-1}-\alpha L'(\theta^{t-1})$
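
As an illustration of this update rule, a minimal sketch in plain Python (the quadratic loss here is an arbitrary example):

# Gradient descent on L(theta) = (theta - 3)^2, an illustrative quadratic loss.
def loss_grad(theta):
    return 2.0 * (theta - 3.0)      # L'(theta)

theta = 0.0     # initial parameter
alpha = 0.1     # learning rate
for _ in range(100):
    theta = theta - alpha * loss_grad(theta)   # theta^t = theta^{t-1} - alpha * L'(theta^{t-1})

print(theta)    # approaches the minimizer theta = 3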

1.3 From Gradient Descent to Gradient Boosting

When the object being optimized lives in parameter space, we get the gradient-descent update of Section 1.2. When the object being optimized is instead a function, the update at each iteration takes the following form, where $f_t(x)$ is the function increment of the $t$-th iteration:
$f^t(x)=f^{t-1}(x)+f_t(x)$
The increment again fits the negative gradient of the loss with respect to the current function; writing the first derivative as $g$:
$f_t(x)=-\alpha_t g_t(x)$
The final function is the accumulated sum of the increments over all iterations:
$F(x)=\sum_{t=0}^{T}f_t(x)$

1.4 The Gradient Boosting Tree Algorithm

GBDT defines the following additive model, where $x$ is an input sample, $w$ denotes the parameters of the regression trees, and $\alpha$ is the weight of each tree:
$F(x;w)=\sum_{t=0}^T\alpha_{t}h_t(x;w_t)=\sum_{t=0}^Tf_t(x;w_t)$
From this update formula we can see that each tree is built on top of the output of the preceding trees and fits the negative gradient of the loss evaluated at that output.

The optimal model is obtained by minimizing the loss function:
$F^*=\arg\min_F\sum_{i=0}^N L(y_i,F(x_i;w))$
The overall algorithm is as follows:

1. Initialize $f_0$

2. For t = 1 to T do:

2.1 Compute the response $\widetilde{y_i}=-\left[\frac{\partial{L(y_i,F(x_i))}}{\partial{F(x_i)}}\right]_{F(x)=F_{t-1}(x)},\ i=1,2,...,N$

2.2 Learn the $t$-th tree: $w^*=\arg\min_w\sum_{i=1}^N(\widetilde{y_i}-h_t(x_i;w))^2$

2.3 Find the step size by line search: $\rho^*=\arg\min_{\rho}\sum_{i=1}^N L(y_i,F_{t-1}(x_i)+\rho h_t(x_i;w^*))$

2.4 Update the model: $F_t=F_{t-1}+\rho^*h_t(x;w^*)$

3. Output $F_T$

In step 2 a total of T trees are generated. Step 2.1 computes the negative gradient of the loss with respect to the output of the trees built so far, and MSE can be used as the loss function. Step 2.2 learns the current tree so that it fits this negative gradient as closely as possible. To spell out what the negative gradient fit by each tree actually is when MSE is chosen as the loss, the loss function is defined as:
$L(y_i,F_{t-1}(x_i))=\frac{1}{2}(y_i-F_{t-1}(x_i))^2$
Differentiating the loss with respect to the previous output gives:
$-\left[\frac{\partial{L(y_i,F(x_i))}}{\partial{F(x_i)}}\right]_{F(x)=F_{t-1}(x)}=y_i-F_{t-1}(x_i)$
So with MSE as the loss function, the negative gradient that each tree fits is simply the residual between the label and the output of the previously accumulated trees.
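
Putting steps 2.1-2.4 together for the squared-error loss, the following is a minimal sketch of the boosting loop. It uses scikit-learn's DecisionTreeRegressor as the base learner $h_t$ and a fixed shrinkage step instead of a line search; the toy data is made up for illustration, and Section 3 below gives the full hand-written implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in base learner h_t

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)   # made-up regression data

T, rho = 50, 0.1                 # number of trees and a fixed shrinkage step
F = np.zeros_like(y)             # F_0 = 0
trees = []
for t in range(T):
    residual = y - F             # negative gradient of 1/2*(y - F)^2 w.r.t. F (step 2.1)
    h = DecisionTreeRegressor(max_depth=3).fit(X, residual)  # step 2.2
    F += rho * h.predict(X)      # steps 2.3-2.4 with a constant step size
    trees.append(h)

print(np.mean((y - F) ** 2))     # training MSE shrinks as trees are added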

2 XGBoost

2.1 Newton's Method

First, recall the second-order form of the Taylor expansion:

Second-order Taylor expansion: $f(x)\approx f(x_0)+f'(x_0)(x-x_0)+f''(x_0)\frac{(x-x_0)^2}{2}$

Let $\Delta{x}=x^t-x^{t-1}$.

Then the second-order Taylor expansion of $f(x^t)$ at $x^{t-1}$ is: $f(x^t)=f(x^{t-1}+\Delta{x})\approx f(x^{t-1})+f'(x^{t-1})\Delta{x}+f''(x^{t-1})\frac{\Delta{x}^2}{2}$

Therefore, expanding $L(\theta^t)$ at $\theta^{t-1}$ to second order gives:
$L(\theta^t)\approx L(\theta^{t-1})+L'(\theta^{t-1})\Delta{\theta}+L''(\theta^{t-1})\frac{\Delta\theta^2}{2}$
For convenience in the later derivation, the first derivative at $\theta^{t-1}$ is written as $g$ and the second derivative as $h$, so:
$L(\theta^t)\approx L(\theta^{t-1})+g\Delta{\theta}+h\frac{\Delta\theta^2}{2}$
Our goal is to minimize $L(\theta^t)$. Since $L(\theta^{t-1})$ is a fixed value from the previous iteration, this reduces to minimizing $g\Delta{\theta}+h\frac{\Delta\theta^2}{2}$, i.e.:
$\frac{\partial(g\Delta{\theta}+h\frac{\Delta\theta^2}{2})}{\partial\Delta\theta}=0$
Solving this equation gives:
$\Delta\theta=-\frac{g}{h}$
So the update of the parameter $\theta^t$ is:
$\theta^{t}=\theta^{t-1}+\Delta\theta=\theta^{t-1}-\frac{g}{h}$
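
As a one-dimensional illustration of this Newton update (the function below is an arbitrary example, not from the original text):

# Newton's method on L(theta) = theta^4 - 3*theta^2 + 2 (illustrative choice).
def grad(theta):    # g = L'(theta)
    return 4 * theta ** 3 - 6 * theta

def hess(theta):    # h = L''(theta)
    return 12 * theta ** 2 - 6

theta = 2.0
for _ in range(20):
    theta = theta - grad(theta) / hess(theta)   # theta^t = theta^{t-1} - g / h

print(theta)    # converges to the stationary point sqrt(3/2) ~ 1.2247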

2.2 From Newton's Method to Newton Boosting

As with gradient boosting, Newton boosting in function space is also an additive model:
$f^t(x)=f^{t-1}(x)+f_t(x)$
The difference from gradient boosting lies in the function increment:
$f_t(x)=-\frac{g_t(x)}{h_t(x)}$
The final function is again the accumulated sum of the increments over all iterations:
$F(x)=\sum_{t=0}^{T}f_t(x)$
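
To contrast the Newton increment $-g_t(x)/h_t(x)$ with the gradient increment $-\alpha_t g_t(x)$, here is a small per-sample sketch assuming the binary log loss with $p=\mathrm{sigmoid}(F)$ (an illustrative choice of loss; the standard derivatives are $g=p-y$ and $h=p(1-p)$):

import numpy as np

# Binary log loss with p = sigmoid(F): g = p - y, h = p * (1 - p).
y = np.array([1.0, 0.0, 1.0, 1.0])
F_prev = np.zeros_like(y)            # current ensemble output F^{t-1}(x)
p = 1.0 / (1.0 + np.exp(-F_prev))    # predicted probabilities

g = p - y                            # first derivative of the loss w.r.t. F
h = p * (1.0 - p)                    # second derivative of the loss w.r.t. F

newton_increment = -g / h            # f_t(x) = -g_t(x) / h_t(x)
gradient_increment = -0.1 * g        # gradient boosting with learning rate alpha = 0.1
print(newton_increment, gradient_increment)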

2.3 Algorithm Principles

Besides the loss term, XGBoost's objective function adds a regularization term to prevent overfitting, i.e. pruning is built into training. The objective is as follows, where $i$ indexes the samples and $k$ indexes the trees:
$L(\theta)=\sum_i l(\hat{y_i}, y_i)+\sum_k\Omega(f_k)$
The regularization term expands to the following, where $T$ is the number of leaf nodes and $w$ denotes the prediction values of the leaves:
$\Omega(f)=\gamma{T}+\frac{1}{2}\lambda{||w||^2}$
After $t$ iterations, the model prediction is:
$\hat{y_i}=F_{t-1}(x_i)+f_t(x_i)$
The objective at iteration $t$ is then the following, where $F_{t-1}(x_i)$ is the prediction for input sample $x_i$ accumulated over the first $t-1$ regression trees:
$L^t(\theta)=\sum_{i=1}^{N}l(y_i, F_{t-1}(x_i)+f_t(x_i))+\Omega(f_t)$
We expand this to second order around $F_{t-1}(x_i)$, where $g_i=\frac{\partial{l(y_i,F_{t-1}(x_i))}}{\partial{F_{t-1}(x_i)}}$ and $h_i=\frac{\partial^2{l(y_i,F_{t-1}(x_i))}}{\partial{F_{t-1}(x_i)^2}}$:
$L^t(\theta)\approx\sum_{i=1}^{N}[l(y_i, F_{t-1}(x_i))+g_if_t(x_i)+\frac{1}{2}h_if^2_t(x_i)]+\Omega(f_t)$
Since $l(y_i, F_{t-1}(x_i))$ is a fixed value from previous iterations, it is a constant and can be dropped:
$\widetilde{L}^t(\theta)=\sum_{i=1}^{N}[g_if_t(x_i)+\frac{1}{2}h_if^2_t(x_i)]+\Omega(f_t)$
Next we move to the granularity of tree nodes and rewrite the objective in terms of leaves. Let $f(x)=w_{q(x)}$ denote the prediction value of the leaf $q(x)$ that input sample $x$ falls into, and let $j$ index the leaves of the tree:
$\widetilde{L}^t(\theta)=\sum_{i=1}^{N}[g_if_t(x_i)+\frac{1}{2}h_if^2_t(x_i)]+\Omega(f_t)=\sum_{i=1}^{N}[g_iw_{q(x_i)}+\frac{1}{2}h_iw^2_{q(x_i)}]+\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T}{w^2_j}$
Let $I_j=\{i\mid q(x_i)=j\}$ be the set of samples falling into leaf $j$; the objective can then be rearranged as:
$\widetilde{L}^t(\theta)=\sum_{j=1}^{T}[(\sum_{i\in{I_j}}g_i)w_j+\frac{1}{2}(\sum_{i\in{I_j}}h_i+\lambda)w^2_j]+\gamma T=\sum_{j=1}^{T}[G_jw_j+\frac{1}{2}(H_j+\lambda)w^2_j]+\gamma T$
Since the goal is to minimize the objective, we take the derivative with respect to the prediction value of leaf $j$ and set it to zero, giving the optimal leaf value:
$w_j^*=-\frac{G_j}{H_j+\lambda}$
Substituting this optimal value back into the objective yields the minimum loss:
$\widetilde{L}^{t*}=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda}+\gamma T$
This per-leaf minimum loss can now be used as the criterion for finding the best split of the regression tree. From the formula above, $\frac{G_j^2}{H_j+\lambda}$ reflects a leaf's contribution to reducing the loss; since we want the loss to be as small as possible, we want this contribution term to be as large as possible. We can therefore define the gain of a split as:
$Gain=\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}-\gamma$
We enumerate the candidate values of each feature, partition the samples into a Left set and a Right set according to each candidate value, and compute the gain above for every candidate; the larger the gain, the more the loss decreases after the split.
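
The following is a small self-contained sketch (not the actual XGBoost source) of how the optimal leaf value $w^*=-G/(H+\lambda)$ and the split gain are computed from per-sample $g_i$ and $h_i$, using the squared-error loss and made-up data:

import numpy as np

# Per-sample gradient/hessian for l = 1/2*(y - y_hat)^2: g_i = y_hat - y, h_i = 1.
y = np.array([1.0, 1.2, 3.0, 3.1])
y_hat = np.zeros_like(y)          # predictions of the ensemble so far (F_{t-1})
g = y_hat - y
h = np.ones_like(y)
lam, gamma = 1.0, 0.0             # regularization parameters lambda and gamma

def leaf_weight(idx):
    # w* = -G_j / (H_j + lambda)
    return -g[idx].sum() / (h[idx].sum() + lam)

def gain(left_idx, right_idx):
    def score(idx):
        return g[idx].sum() ** 2 / (h[idx].sum() + lam)
    all_idx = np.concatenate([left_idx, right_idx])
    return score(left_idx) + score(right_idx) - score(all_idx) - gamma

left, right = np.array([0, 1]), np.array([2, 3])
print(leaf_weight(left), leaf_weight(right), gain(left, right))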

3 GBDT Code Implementation

3.1 CART Tree Implementation

# encoding=utf8
import numpy as np
import matplotlib.pyplot as plt
from random import randrange
from FileReader import loadDataForCART
from FileWriter import writeByJson
from collections import defaultdict, Counter
from Metric import *

DISCRETE_FLAG = 0
CONTINUOUS_FLAG = 1


class Node:
    def __init__(self, fea=None, val=None, res=None, flag=None, left=None, right=None):
        self.fea = fea
        self.val = val
        self.res = res
        self.flag = flag
        self.left = left
        self.right = right


class CART:
    def __init__(self, epsilon=1e-3, minSample=1, maxDepth=20, task='classify', **kwargs):
        self.modelName = 'CART'
        self.epsilon = epsilon
        self.minSample = minSample
        self.tree = None
        self.maxDepth = maxDepth
        self.task = task
        # Choose the impurity and result functions according to whether the task is classification or regression
        if task == 'classify':
            self.impurityFunc = self.getGini
            self.resCalcFunc = self.calcNodeClassifyResult
            self.getNoMergeErrorFunc = self.classifyByNode
            self.getMergeErrorFunc = self.classifyByMajority
        elif task == 'regression':
            self.impurityFunc = self.getMSE
            self.resCalcFunc = self.calcNodeRegressionResult
            self.getNoMergeErrorFunc = self.regressByNode
            self.getMergeErrorFunc = self.regressByMajority

    # Determine whether a given feature dimension of the data set is continuous or discrete
    def __discreteOrContinuous(self, xData, feaUnique, beta=0.2):
        # Return 1 (CONTINUOUS_FLAG) for a continuous feature, 0 (DISCRETE_FLAG) for a discrete one
        if np.shape(feaUnique)[0] > beta * (np.shape(xData)[0]):
            return CONTINUOUS_FLAG
        else:
            return DISCRETE_FLAG

    # Generate candidate binary split points for a continuous feature
    def __getSplits(self, feaArray):
        feaArray = np.sort(feaArray)
        array1 = feaArray[:-1]
        array2 = feaArray[1:]
        return (array1 + array2) / 2.0

    def getGini(self, yData):
        c = Counter(yData)
        return 1 - sum([(val / float(yData.shape[0])) ** 2 for val in c.values()])

    def getMSE(self, yData):
        c = len(yData)
        ySquaredSum = np.sum(np.power(yData, 2), axis=0)
        ySum = np.sum(yData, axis=0)
        return (ySquaredSum / c) - (ySum / c) ** 2

    def __getFeaGini(self, set1, set2):
        num = set1.shape[0] + set2.shape[0]
        return float(set1.shape[0]) / num * self.getGini(set1) + float(set2.shape[0]) / num * self.getGini(set2)

    def __getFeaImpurity(self, set1, set2):
        num = set1.shape[0] + set2.shape[0]
        return float(set1.shape[0]) / num * self.impurityFunc(set1) + \
               float(set2.shape[0]) / num * self.impurityFunc(set2)

    def calcNodeClassifyResult(self, yData):
        return Counter(yData).most_common(1)[0][0]

    def calcNodeRegressionResult(self, yData):
        return np.mean(yData)

    def bestSplit(self, splitSets, xData, yData):
        preImpurity = self.impurityFunc(yData)
        # subdataInds maps each split (feature index, split value, flag) to the list of sample indices it covers
        subdataInds = defaultdict(list)
        # Collect, for every candidate split, the indices of the samples that satisfy it
        for split in splitSets:
            for index, sample in enumerate(xData):
                if split[2] == 1:
                    if sample[split[0]] <= split[1]:
                        subdataInds[split].append(index)
                elif split[2] == 0:
                    if sample[split[0]] == split[1]:
                        subdataInds[split].append(index)

        minPurity = preImpurity
        bestSplit = None
        bestSet = None
        flag = None

        for split, index in subdataInds.items():
            set1 = yData[index]
            set2Ind = list(set(range(yData.shape[0])) - set(index))
            set2 = yData[set2Ind]
            if set1.shape[0] < 1 or set2.shape[0] < 1:
                continue
            newImpurity = self.__getFeaImpurity(set1, set2)
            if newImpurity < minPurity:
                minPurity = newImpurity
                bestSplit = split
                bestSet = (index, set2Ind)
                flag = split[2]
        if abs(minPurity - preImpurity) < self.epsilon:
            bestSplit = None
        # else:
        # TODO compute feature importance: requires the current total impurity, impurity_all - impurity_new
        # importance[bestSplit[0]] = preGini - minGini
        return bestSplit, bestSet, minPurity, flag

    def build(self, splitSets, xData, yData, currentDepth):
        # If the sample count is below the threshold or the maximum depth is reached, return a leaf with the node result
        if (yData.shape[0] < self.minSample) or (currentDepth == self.maxDepth):
            return Node(res=self.resCalcFunc(yData))
        # Find the best split point for the current sample set
        bestSplit, bestSet, minImpurity, flag = self.bestSplit(splitSets, xData, yData)

        if bestSplit is None:
            return Node(res=self.resCalcFunc(yData))
        else:
            splitSets.remove(bestSplit)
            left = self.build(splitSets, xData=xData[bestSet[0]], yData=yData[bestSet[0]],
                              currentDepth=currentDepth + 1)
            right = self.build(splitSets, xData=xData[bestSet[1]], yData=yData[bestSet[1]],
                               currentDepth=currentDepth + 1)
            return Node(fea=bestSplit[0], val=bestSplit[1], flag=flag, right=right, left=left)

    # Pre-order traversal of the trained tree
    def preorderTraversal(self, currentNode, dictWrite, depth, verbose=False):
        if currentNode is None:
            return
        if verbose:
            print("---" * depth, "[depth:{} feature:{}, val:{}, res:{}]".
                  format(depth, currentNode.fea, currentNode.val, currentNode.res))

        dictWrite['fea'] = currentNode.fea
        dictWrite['val'] = currentNode.val
        dictWrite['res'] = currentNode.res
        dictWrite['flag'] = currentNode.flag

        if currentNode.left is not None:
            dictWrite['left'] = {}
            self.preorderTraversal(currentNode.left, dictWrite['left'], depth + 1, verbose)
        else:
            dictWrite['left'] = None

        if currentNode.right is not None:
            dictWrite['right'] = {}
            self.preorderTraversal(currentNode.right, dictWrite['right'], depth + 1, verbose)
        else:
            dictWrite['right'] = None

    def fit(self, xData, yData, randomFeature=False, nFeatures=None, mustReservedInd=None):
        """
        Fit the input data and train the model

        Parameters
        ----------
        xData : numpy.ndarray
            Feature set of input samples
        yData : numpy.ndarray
            Label set of input samples
        randomFeature : bool, optional
            Whether random feature sub-sampling is used
        nFeatures : int, optional
            Number of randomly sampled features, used when randomFeature is True. If nFeatures is None,
            it defaults to round(sqrt(number of features)).
        mustReservedInd : int, optional
            A feature index that must always be kept in the candidate feature set, used when randomFeature is True.
        """
        if not isinstance(xData, np.ndarray):
            print("xData must be type of numpy.ndarray")
            return
        if not isinstance(yData, np.ndarray):
            print("yData must be type of numpy.ndarray")
            return

        splitsSet = []
        features = set()
        # If random feature sub-sampling is used (as in random forests or XGBoost)
        if randomFeature:
            if nFeatures is None:
                nFeatures = np.round(np.sqrt(np.shape(xData)[1]))
            while len(features) < nFeatures:
                ind = randrange(1, np.shape(xData)[1])
                if ind not in features:
                    features.add(ind)
            # Because the first feature indicates which link the current sample belongs to, it must be included by default
            if mustReservedInd is not None:
                if isinstance(mustReservedInd, int) is False:
                    print("mustReservedInd must be of type int or None")
                    return
                features.add(mustReservedInd)
        else:
            features = set(range(np.shape(xData)[1]))

        for feaInd in features:
            # Extract all distinct values of this feature dimension
            feaUnique = np.unique(xData[:, feaInd])
            # Adaptively decide whether the current feature is continuous
            flag = self.__discreteOrContinuous(xData, feaUnique)
            if feaUnique.shape[0] < 2:
                continue
            elif feaUnique.shape[0] == 2:
                if flag == CONTINUOUS_FLAG:
                    midVal = (feaUnique[0] + feaUnique[1]) / 2.0
                    splitsSet.append((feaInd, midVal, flag))
                elif flag == DISCRETE_FLAG:
                    splitsSet.append((feaInd, feaUnique[0], flag))
            else:
                if flag == 1:
                    feaUnique = self.__getSplits(feaUnique)
                for val in feaUnique:
                    splitsSet.append((feaInd, val, flag))
        self.tree = self.build(splitsSet, xData, yData, 0)

        # Post-pruning
        dataSet = np.hstack((xData, yData.reshape(-1, 1)))
        self.prune(dataSet, self.tree)

        return self

    def findRes(self, data, tree):
        if tree.res is not None:
            return tree.res
        else:
            if tree.flag == CONTINUOUS_FLAG:
                if data[tree.fea] <= tree.val:
                    return self.findRes(data, tree.left)
                elif data[tree.fea] > tree.val:
                    return self.findRes(data, tree.right)
            elif tree.flag == DISCRETE_FLAG:
                if data[tree.fea] == tree.val:
                    return self.findRes(data, tree.left)
                else:
                    return self.findRes(data, tree.right)

    def predict(self, xData):
        return self.findRes(xData, self.tree)

    # ====================================== Post-pruning ======================================
    def __binSplit(self, dataSet, node):
        fea = node.fea
        val = node.val
        flag = node.flag
        if flag == DISCRETE_FLAG:
            leftSet = dataSet[np.nonzero(dataSet[:, fea] == val)[0], :]
            rightSet = dataSet[np.nonzero(dataSet[:, fea] != val)[0], :]
        elif flag == CONTINUOUS_FLAG:
            leftSet = dataSet[np.nonzero(dataSet[:, fea] <= val)[0], :]
            rightSet = dataSet[np.nonzero(dataSet[:, fea] > val)[0], :]
        return leftSet, rightSet

    def classifyByNode(self, dataSet, node):
        errorCount = 0
        for data in dataSet:
            res = self.findRes(data, node)
            if res != data[-1]:
                errorCount += 1
        return errorCount

    def regressByNode(self, dataSet, node):
        sumError = 0
        dataSize = dataSet.shape[0]
        for data in dataSet:
            res = self.findRes(data, node)
            sumError += np.abs(res - data[-1])
        return sumError / dataSize

    def classifyByMajority(self, dataSet):
        errorCount = 0
        majLabel = Counter(dataSet[:, -1]).most_common(1)[0][0]
        for data in dataSet:
            if majLabel != data[-1]:
                errorCount += 1
        return errorCount, majLabel

    def regressByMajority(self, dataSet):
        sumError = 0
        dataSize = dataSet.shape[0]
        meanValue = np.mean(dataSet[:, -1])
        for data in dataSet:
            sumError += np.abs(data[-1] - meanValue)
        return sumError / dataSize, meanValue

    # The last element of every sample in testSet is its label
    def prune(self, testSet, node, verbose=False):
        if verbose:
            print("pruning now=====>")
        if len(testSet) == 0:
            return
        lSet = None
        rSet = None

        if (node.left and node.left.res is None) or (node.right and node.right.res is None):
            lSet, rSet = self.__binSplit(testSet, node)
        if node.left and node.left.res is None:
            self.prune(lSet, node.left)
        if node.right and node.right.res is None:
            self.prune(rSet, node.right)
        if (node.left and node.left.res is not None) and (node.right and node.right.res is not None):
            errorNoMerge = self.getNoMergeErrorFunc(testSet, node)
            errorMerge, label = self.getMergeErrorFunc(testSet)
            if errorMerge <= errorNoMerge:
                node.res = label
                node.left = None
                node.right = None
                node.fea = None
                node.val = None
                node.flag = None

3.2 Gradient Boosting Tree Implementation

import numpy as np
from CART import CART

class GradientBoostTree:
    def __init__(self, nTrees=30, epsilon=1e-3, alpha=0.09, minSample=1, maxDepth=20, earlyStopping=False, theta=0.5, **kwargs):
        """
        Parameters
        ----------
        nTrees: int
            the number of trees for building GBDT
        epsilon: float
            the minimum acceptable change of impurity
        alpha: float
            the learning rate for each gradient boosting
        minSample: int
            the minimum number of samples in tree-node for building child nodes
        maxDepth: int
            the maximum depth for each tree
        earlyStopping: bool
            if earlyStopping is True, the GBDT building process may stop early once the mean absolute residual of
            the predictions drops below the threshold
        theta: float
            the threshold for stopping training early, which is only used when earlyStopping is True
        """
        self.nTrees = nTrees
        self.epsilon = epsilon
        self.alpha = alpha
        self.minSample = minSample
        self.maxDepth = maxDepth
        self.earlyStopping = earlyStopping
        self.theta = theta
        self.trees = []

        self.xData = None
        self.yData = None
        self.mustReservedInd = None

    def fit(self, xData, yData, mustReservedInd=None):
        self.xData = xData
        self.yData = yData
        self.mustReservedInd = mustReservedInd

        preds = np.zeros(yData.shape)
        # Build the T trees one boosting round at a time
        for t in range(self.nTrees):
            # Compute the residuals between the labels and the current predictions, i.e. yData - F_{t-1}
            residuals = yData - preds
            if self.earlyStopping and np.abs(residuals).mean() < self.theta:
                print("current training average residual is {}, theta is {}, tree number is {}, stopping now"
                      .format(residuals.mean(), self.theta, len(self.trees)))
                return

            # Fit a regression tree to the residuals
            tree = CART(self.epsilon, self.minSample, self.maxDepth, 'regression')
            tree.fit(xData, residuals, mustReservedInd=self.mustReservedInd)
            self.trees.append(tree)
            # For every sample, accumulate the shrunken prediction of the new tree into the running predictions
            for idx, data in enumerate(self.xData):
                preds[idx] += self.alpha * tree.predict(data)

    def predict(self, xData):
        if len(self.trees) <= 0:
            print("you must fit data set firstly!")
            return None
        pred = 0
        for tree in self.trees:
            pred += self.alpha * tree.predict(xData)
        return pred
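
A minimal usage sketch of GradientBoostTree on made-up data (assuming the CART class from Section 3.1 is importable, e.g. from CART.py in the same directory):

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 100)   # made-up regression target

gbdt = GradientBoostTree(nTrees=30, alpha=0.1, minSample=2, maxDepth=4)
gbdt.fit(X, y)

# predict() takes one sample at a time, mirroring the CART implementation.
print(gbdt.predict(np.array([1.0, 0.5])))   # should move toward 1^2 + 0.5 = 1.5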

