GBDT (Gradient Boosting Decision Trees)

The Forward Stagewise Algorithm Framework

The forward stagewise algorithm is an algorithmic framework proposed as a generalization of the idea behind AdaBoost.

  • Object of study

    • The additive ensemble model
      $$f(x)=\sum_{m=1}^{M} \beta_{m} b(x; \gamma_{m})$$
      where $b(x;\gamma_m)$ is a base learner, $\gamma_m$ are its parameters, and $\beta_m$ is its weight.
    • Loss function: given training data and a loss function $L(y, f(x))$, learning $f(x)$ amounts to solving
      $$\min_{\beta_m, \gamma_m} \sum_{i=1}^{N} L\Big(y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m)\Big)$$
  • Basic idea: instead of solving for all parameters $\beta_m$ and $\gamma_m$ (m = 1, ..., M) simultaneously, solve for each pair $(\beta_m, \gamma_m)$ one stage at a time. (The result is not necessarily globally optimal.)

  • Procedure (a minimal code sketch of this loop follows the list)
    Given a dataset $T=\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$ with $x_i \in \mathcal{X} \subseteq \mathbf{R}^n$ and $y_i \in \mathcal{Y}=\{+1,-1\}$, a loss function $L(y,f(x))$, and a family of base functions $\{b(x;\gamma)\}$, output the additive model $f(x)$.
    Step 1: initialize $f_0(x)=0$.
    Step 2: iteratively minimize the loss. For m = 1, 2, ..., M:
    $$(\beta_m,\gamma_m)=\arg\min_{\beta,\gamma} \sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i)+\beta b(x_i;\gamma)\big)$$
    $$f_m(x)=f_{m-1}(x)+\beta_m b(x;\gamma_m)$$
    Step 3: the final additive model is
    $$f(x)=f_M(x)=\sum_{m=1}^{M} \beta_m b(x;\gamma_m)$$
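
A minimal sketch of the forward stagewise loop for squared loss, with depth-1 regression trees as base functions; the closed-form step for $\beta_m$ and the function name forward_stagewise_fit are illustrative assumptions, not part of the original text.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise_fit(X, y, M=50):
    """Greedily fit f(x) = sum_m beta_m * b(x; gamma_m) under squared loss."""
    boosters, betas = [], []
    f = np.zeros_like(y, dtype=float)               # Step 1: f_0(x) = 0
    for m in range(M):                              # Step 2: one stage at a time
        residual = y - f                            # what the next base learner should explain
        b = DecisionTreeRegressor(max_depth=1).fit(X, residual)       # choose gamma_m
        pred = b.predict(X)
        beta = pred @ residual / (pred @ pred + 1e-12)  # closed-form beta_m for squared loss
        f += beta * pred                            # f_m = f_{m-1} + beta_m * b(x; gamma_m)
        boosters.append(b)
        betas.append(beta)
    return boosters, betas                          # Step 3: the final additive model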

Gradient Boosting Decision Trees

Gradient Boosting Decision Tree (GBDT) is an ensemble model that, within the forward stagewise framework, uses decision trees as base learners and fits each new tree by gradient descent on the loss.

Regression

Basic Concepts

Consider the per-sample optimization problem
$$w_i=\arg\min_{w} L\big(y_i, F_{m-1}(X_i)+w\big)$$
For a differentiable loss, taking one gradient-descent step at $w_i=0$ already yields a $w_i^*$ with a smaller loss. Write the residual as $r_i = y_i-F_{m-1}(X_i)$ and consider the loss function $L(y,\hat y)$:

  • When $L(y,\hat y) = \sqrt{|y-\hat y|}$:
    $$w^*_i = 0-\left.\frac{\partial L}{\partial w}\right|_{w=0} = -\left.\frac{\partial \sqrt{|r_i-w|}}{\partial w}\right|_{w=0} = \frac{1}{2\sqrt{|r_i|}}\,\mathrm{sign}(r_i)$$
    Note: when $r_i=0$, the previous round already matches the target, so simply set $w^*_i=0$.
  • When $L(y,\hat y) = (y-\hat y)^2$:
    $$w^*_i = 0-\left.\frac{\partial L}{\partial w}\right|_{w=0} = -\left.\frac{\partial (r_i-w)^2}{\partial w}\right|_{w=0} = 2r_i$$
    To mitigate overfitting, a learning-rate parameter $\eta$ is introduced to control the step size of each round: once the m-th tree $h^*_m$ has been fitted to $\mathbf{w}^*$, the output of the current round is $\hat{y}_i=F_{m-1}(X_i)+\eta h^*_m(X_i)$.

When the loss is the absolute error, $F_0$ is the median of y; when the loss is the squared error, $F_0$ is the mean of y. (A small numeric sketch of these quantities follows.)
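
A quick sanity check of the squared-loss case in NumPy; the toy data and variable names are illustrative assumptions.

import numpy as np

y = np.array([3.0, -1.0, 2.0, 8.0])
F0 = y.mean()              # squared loss -> initialize F_0 with the mean
# (for absolute loss one would use np.median(y) instead)
r = y - F0                 # residuals after the constant model F_0
w_star = 2 * r             # one gradient step from w = 0 under (y - y_hat)^2
print(F0, r, w_star)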

Code

from sklearn.tree import DecisionTreeRegressor as DT
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np

class GBDTRegressor:

    def __init__(self, max_depth=4, n_estimator=100, lr=0.1):
        self.max_depth = max_depth
        self.n_estimator = n_estimator
        self.lr = lr
        self.booster = []
        self.best_round = None

    def record_score(self, y_train, y_val, train_predict, val_predict, i):
        mse_val = mean_squared_error(y_val, val_predict)
        if (i+1)%10==0:
            mse_train = mean_squared_error(y_train, train_predict)
            print("第%d轮\t训练集: %.4f\t"
                "验证集: %.4f"%(i+1, mse_train, mse_val))
        return mse_val

    def fit(self, X, y):
        # Split the data into a training set and a validation set
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.25, random_state=0)
        train_predict, val_predict = 0, 0
        # Squared loss: the first tree fits the constant mean of y
        next_fit_val = np.full(X_train.shape[0], np.mean(y_train))
        # Bookkeeping for early stopping
        last_val_score = np.inf
        for i in range(self.n_estimator):
            cur_booster = DT(max_depth=self.max_depth)
            cur_booster.fit(X_train, next_fit_val)
            train_predict += cur_booster.predict(X_train) * self.lr
            val_predict += cur_booster.predict(X_val) * self.lr
            # Next round fits the residuals (negative gradient of the squared loss)
            next_fit_val = y_train - train_predict
            self.booster.append(cur_booster)
            cur_val_score = self.record_score(
                y_train, y_val, train_predict, val_predict, i)
            if cur_val_score > last_val_score:
                self.best_round = i
                # Stop at the best validation round to avoid overfitting
                print("\nTraining finished! Best round: %d" % (i+1))
                break
            last_val_score = cur_val_score
        if self.best_round is None:
            # Early stopping never triggered: use all fitted trees
            self.best_round = self.n_estimator

    def predict(self, X):
        cur_predict = 0
        # Stop at the round with the best validation score to avoid overfitting
        for i in range(self.best_round):
            cur_predict += self.lr * self.booster[i].predict(X)
        return cur_predict
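
A minimal usage sketch for the regressor above, assuming the class is defined in the same session (the already imported make_regression is used to create synthetic data):

X, y = make_regression(n_samples=1000, n_features=8, noise=10, random_state=0)
model = GBDTRegressor(max_depth=4, n_estimator=200, lr=0.1)
model.fit(X, y)
print("MSE on the full data:", mean_squared_error(y, model.predict(X)))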

Classification

Basic Concepts

GBDT handles classification problems with regression trees.

  • Use Softmax to turn the raw scores into class probabilities: the probability that sample i belongs to class k is $P = \frac{e^{F_{ki}}}{\sum_{c=1}^K e^{F_{ci}}}$
  • One-hot encode the class labels.
  • The loss is the cross-entropy $L(\mathbf{y}_i,\mathbf{F}_i)=-\sum_{c=1}^K y_{ci}\log \frac{e^{F_{ci}}}{\sum_{\tilde{c}=1}^K e^{F_{\tilde{c}i}}}$
  • The learning target is
    $$\begin{aligned} \mathbf{h}_i^{*(m)} &= \mathbf{F}_i^{*(m)} - \mathbf{F}_i^{(m-1)} = -\left.\frac{\partial L}{\partial \mathbf{F}_i}\right|_{\mathbf{F}_i=\mathbf{F}_i^{(m-1)}} \\ &= \Big[y_{1i} - \frac{e^{F^{(m-1)}_{1i}}}{\sum_{c=1}^K e^{F^{(m-1)}_{ci}}},\;\ldots,\; y_{Ki} - \frac{e^{F^{(m-1)}_{Ki}}}{\sum_{c=1}^K e^{F^{(m-1)}_{ci}}}\Big] \end{aligned}$$
  • After introducing the learning rate: $\mathbf{F}^{*(m)}_i=\mathbf{F}_i^{(m-1)}+\eta\,\mathbf{h}_i^{*(m)}$ (a short NumPy sketch of this negative gradient follows this list).
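
A short NumPy sketch of the multiclass negative gradient $\mathbf{y}_i - \mathrm{softmax}(\mathbf{F}_i)$; the toy scores and 3-class setting are illustrative assumptions.

import numpy as np

F = np.array([0.5, -0.2, 1.3])      # raw scores F_{1i}, F_{2i}, F_{3i}
y = np.array([0.0, 0.0, 1.0])       # one-hot label: sample belongs to class 3
p = np.exp(F) / np.exp(F).sum()     # softmax probabilities
h_star = y - p                      # target fitted by the next round of trees
print(p, h_star)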

Algorithm Simplification

Because the K class probabilities sum to 1, the K fits per round can be reduced to K-1. This is especially useful when the number of classes is small, in particular for binary classification.

  • $K\geq 3$

    • Loss function: $L(F_{1i},\ldots,F_{(K-1)i})= y_{Ki}\log \big[1+\sum_{c=1}^{K-1}e^{F_{ci}}\big] -\sum_{c=1}^{K-1} y_{ci}\log \frac{e^{F_{ci}}}{1+\sum_{c=1}^{K-1}e^{F_{ci}}}$
    • Negative gradient:
      $$-\left.\frac{\partial L}{\partial F_{ki}}\right|_{\mathbf{F}_i=\mathbf{F}_i^{(m-1)}} = \begin{cases} -\dfrac{e^{F^{(m-1)}_{ki}}}{1+\sum_{c=1}^{K-1} e^{F^{(m-1)}_{ci}}} & y_{Ki}=1 \\[2ex] y_{ki} - \dfrac{e^{F^{(m-1)}_{ki}}}{1+\sum_{c=1}^{K-1} e^{F^{(m-1)}_{ci}}} & y_{Ki}=0 \end{cases}$$
    • Initial values: $\Big[\frac{e^{F^{(0)}_{1i}}}{1+\sum_{c=1}^{K-1}e^{F^{(0)}_{ci}}},\ldots,\frac{e^{F^{(0)}_{(K-1)i}}}{1+\sum_{c=1}^{K-1}e^{F^{(0)}_{ci}}},\frac{1}{1+\sum_{c=1}^{K-1}e^{F^{(0)}_{ci}}}\Big] = [p_1,\ldots,p_{K-1},p_K]$
  • $K=2$

    • Loss function: $L(F_i) = - y_i\log \frac{e^{F_i}}{1+e^{F_i}} - (1-y_i)\log \frac{1}{1+e^{F_i}}$
    • Negative gradient: $-\left.\frac{\partial L}{\partial F_{i}}\right|_{F_i=F^{(m-1)}_i}=y_i-\frac{e^{F^{(m-1)}_i}}{1+e^{F^{(m-1)}_i}}$
    • Initial value: $\Big[\frac{1}{1+e^{F^{(0)}_i}},\frac{e^{F^{(0)}_i}}{1+e^{F^{(0)}_i}}\Big]=[p_0,p_1]$
      Example: if the positive-class proportion in a binary dataset is 10% (i.e. $p_1=\frac{1}{10}$), then $\frac{e^{F^{(0)}}}{1+e^{F^{(0)}}}=\frac{1}{10}$, so $F^{(0)}=-\ln 9$ (checked numerically in the sketch below).
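
A tiny numeric check of the binary initialization: $F^{(0)}$ is the log-odds of the positive class (the 10% example above is assumed).

import numpy as np

p1 = 0.10                           # positive-class proportion
F0 = np.log(p1 / (1 - p1))          # log-odds = -ln 9 ≈ -2.197
print(F0, 1 / (1 + np.exp(-F0)))    # sigmoid(F0) recovers p1 = 0.1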

Code

from sklearn.tree import DecisionTreeRegressor as DT
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import numpy as np

class GBDTClassifier:

    def __init__(self, max_depth=4, n_estimator=100, lr=0.1):
        self.max_depth = max_depth
        self.n_estimator = n_estimator
        self.lr = lr
        self.booster = []

        self.best_round = None

    def record_score(self, y_train, y_val, train_predict, val_predict, i):
        train_predict = np.exp(train_predict) / (1 + np.exp(train_predict))
        val_predict = np.exp(val_predict) / (1 + np.exp(val_predict))
        auc_val = roc_auc_score(y_val, val_predict)
        if (i+1)%10==0:
            auc_train = roc_auc_score(y_train, train_predict)
            print("第%d轮\t训练集: %.4f\t"
                "验证集: %.4f"%(i+1, auc_train, auc_val))
        return auc_val

    def fit(self, X, y):
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.25, random_state=0)
        train_predict, val_predict = 0, 0
        # Initialize with the log-odds of the positive class
        fit_val = np.log(y_train.mean() / (1 - y_train.mean()))
        next_fit_val = np.full(X_train.shape[0], fit_val)
        last_val_score = -np.inf
        for i in range(self.n_estimator):
            cur_booster = DT(max_depth=self.max_depth)
            cur_booster.fit(X_train, next_fit_val)
            train_predict += cur_booster.predict(X_train) * self.lr
            val_predict += cur_booster.predict(X_val) * self.lr
            # Negative gradient of the binary cross-entropy: y - sigmoid(F)
            next_fit_val = y_train - np.exp(
                train_predict) / (1 + np.exp(train_predict))
            self.booster.append(cur_booster)
            cur_val_score = self.record_score(
                y_train, y_val, train_predict, val_predict, i)
            if cur_val_score < last_val_score:
                self.best_round = i
                # Stop at the best validation round to avoid overfitting
                print("\nTraining finished! Best round: %d" % (i+1))
                break
            last_val_score = cur_val_score
        if self.best_round is None:
            # Early stopping never triggered: use all fitted trees
            self.best_round = self.n_estimator

    def predict(self, X):
        cur_predict = 0
        for i in range(self.best_round):
            cur_predict += self.lr * self.booster[i].predict(X)
        return np.exp(cur_predict) / (1 + np.exp(cur_predict))
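
A minimal usage sketch for the classifier above, assuming the class is defined in the same session (the already imported make_classification is used to create an imbalanced synthetic dataset):

X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0)
model = GBDTClassifier(max_depth=4, n_estimator=200, lr=0.1)
model.fit(X, y)
print("AUC on the full data:", roc_auc_score(y, model.predict(X)))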

XGBoost

Comparison of GBDT and XGBoost:

  • Loss function
    • GBDT: $G(h_m) = \sum_{i=1}^N L\big(y_i, F_{m-1}(X_i)+h_m(X_i)\big)$
    • XGBoost: $L^{(m)}(F^{(m)}_i) = \gamma T+\frac{1}{2}\lambda \sum_{j=1}^T w_j^2+\sum_{i=1}^N L\big(y_i, F^{(m)}_i\big)$, which adds two regularization terms that control tree growth and the size of the leaf values
  • Optimization
    • GBDT: relies on the first derivative of the loss
    • XGBoost: approximates the loss with a quadratic (second-order) expansion and solves for the target values, which adapts better to different loss functions
  • Split criterion
    • GBDT: information gain
    • XGBoost: approximate loss reduction
      $$G= \frac{1}{2}\Big[\frac{(\sum_{i\in I_L}p_i)^2}{\sum_{i\in I_L}q_i+\lambda}+\frac{(\sum_{i\in I_R}p_i)^2}{\sum_{i\in I_R}q_i+\lambda}-\frac{(\sum_{i\in I}p_i)^2}{\sum_{i\in I}q_i+\lambda}\Big] -\gamma$$

where $p_i=\left.\frac{\partial L}{\partial h_i}\right|_{h_i=0}$ and $q_i=\left.\frac{\partial^2 L}{\partial h^2_i}\right|_{h_i=0}$.

Note: to guarantee that the quadratic approximation opens upward (i.e. $q_i>0$), the loss function should have a second derivative that is strictly positive on the whole domain, or at least in a neighbourhood of $y_i$, such as the squared loss. The root absolute error $\sqrt{|y-\hat{y}|}$ and the squared log error $\frac{1}{2}\big[\log\big(\frac{y+1}{\hat{y}+1}\big)\big]^2$ do not satisfy this requirement, while the Pseudo-Huber error $\delta^2\big(\sqrt{1+(\frac{y-\hat{y}}{\delta})^2}-1\big)$ does.
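
A minimal sketch of the split gain defined in the table above under squared loss, where $p_i = -2r_i$ and $q_i = 2$; the function name split_gain and the toy data are illustrative assumptions.

import numpy as np

def split_gain(p, q, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting node I into I_L (left_mask) and I_R (its complement)."""
    def score(mask):
        return p[mask].sum() ** 2 / (q[mask].sum() + lam)
    right_mask = ~left_mask
    full_mask = np.ones_like(left_mask, dtype=bool)
    return 0.5 * (score(left_mask) + score(right_mask) - score(full_mask)) - gamma

# Toy example: residuals of the current model on 6 samples.
residual = np.array([2.0, 1.5, 1.8, -0.5, -0.7, -0.4])
p = -2 * residual                   # first derivative of (r - h)^2 at h = 0
q = np.full_like(residual, 2.0)     # second derivative of (r - h)^2
left = np.array([True, True, True, False, False, False])
print(split_gain(p, q, left))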

LightGBM

Building on the second-order approximation used by XGBoost, LightGBM introduces two new algorithms: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).

Gradient-based One-Side Sampling (GOSS)

  • Purpose: shrink the set of samples used to evaluate splits, exploiting the fact that samples with small gradient magnitudes are already well fitted.
  • Idea: keep the samples with large gradient magnitudes and subsample only those with small gradient magnitudes.
  • Method: sort the samples by gradient magnitude, keep the top $a\%$, then sample $b\%$ of the total from the remaining $(1-a)$ fraction (here $b\%$ is a percentage of the full sample size), and compute the information gain (a sampling sketch follows this list):
    $$\widetilde{Gain}(F,d) = \frac{1}{N}\Big[\frac{(\sum_{i\in A_L}g_i+\frac{1-a}{b}\sum_{i\in B_L}g_i)^2}{N_L}+\frac{(\sum_{i\in A_R}g_i+\frac{1-a}{b}\sum_{i\in B_R}g_i)^2}{N_R}\Big]$$
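
A minimal sketch of the GOSS sampling step (not the full split search); the function name goss_sample and the toy gradients are illustrative assumptions.

import numpy as np

def goss_sample(g, a=0.2, b=0.1, rng=np.random.default_rng(0)):
    """Keep the top-a fraction by |g|, subsample b of the total from the rest."""
    n = len(g)
    order = np.argsort(-np.abs(g))                        # sort by |g|, descending
    top = order[: int(a * n)]                             # set A: large-gradient samples
    rest = order[int(a * n):]
    sampled = rng.choice(rest, size=int(b * n), replace=False)   # set B
    weights = np.ones(n)
    weights[sampled] = (1 - a) / b                        # compensate the down-sampling
    return np.concatenate([top, sampled]), weights

g = np.random.default_rng(1).normal(size=100)             # per-sample gradients
used, w = goss_sample(g)
print(len(used), w[used][:5])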

Exclusive Feature Bundling (EFB)

  • Purpose: bundle mutually exclusive features to reduce the number of sparse features.
  • Mutually exclusive features: a set of features in which no two features take non-zero values on the same sample.
  • Equivalent problem: graph coloring.
  • Approximately exclusive: if some sample makes two features non-zero at the same time, count one conflict between them; if the total number of conflicts between features does not exceed a given maximum conflict count K, the features are still treated as exclusive, which better matches real data (a simplified greedy-bundling sketch follows this list).
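
A simplified greedy-bundling sketch based on pairwise conflict counts (real EFB additionally orders features and merges their value ranges); the function name greedy_bundles and the max_conflict threshold are illustrative assumptions.

import numpy as np

def greedy_bundles(X, max_conflict=5):
    """X: (n_samples, n_features) array. Returns lists of bundled feature indices."""
    nonzero = X != 0
    bundles = []                                 # each bundle is a list of feature ids
    for f in range(X.shape[1]):
        placed = False
        for bundle in bundles:
            # Conflicts of feature f with the features already in this bundle
            conflicts = sum(int((nonzero[:, f] & nonzero[:, b]).sum()) for b in bundle)
            if conflicts <= max_conflict:
                bundle.append(f)
                placed = True
                break
        if not placed:
            bundles.append([f])
    return bundles

X = np.random.default_rng(0).binomial(1, 0.05, size=(200, 10)).astype(float)
print(greedy_bundles(X))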

[Reference]:
DataWhale Ensemble Learning
