Boosting: GBDT Source Code Analysis

Log loss function

  In logistic regression, the parameters are estimated by maximum likelihood: we look for the parameters that maximize the probability of the observed samples. For classification the labels of the training set are known, so if every sample were predicted perfectly we would have $\hat{y}=y$. That is:

  • when $y=0$, $P(Y=1\mid X)$ should be as close to 0 as possible
  • when $y=1$, $P(Y=1\mid X)$ should be as close to 1 as possible

The objective is therefore to make $\prod_{i=1}^{N}\left[P(Y=1\mid X=x_i)\right]^{y_i}\left[1-P(Y=1\mid X=x_i)\right]^{1-y_i}$ as large as possible.
Taking the log turns the product into a sum, and the loss function is defined as:
$$Loss = -\sum_{i=1}^{N}\left\{ y_i \log P(Y=1\mid X=x_i) + (1-y_i)\log\left[1-P(Y=1\mid X=x_i)\right]\right\}$$

Assume $y\in\{0,1\}$. For a sample $(x_i, y_i)$, writing $p_i = P(Y=1\mid x_i)$, the log loss is defined as:
$$L\left(y_i, p_i\right) = -\left\{y_i \log p_i + (1-y_i)\log(1-p_i)\right\}$$

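As a quick numeric sanity check, here is a minimal sketch (my own addition) that evaluates this log loss with plain numpy on a couple of hand-made labels and probabilities:

import numpy as np

def log_loss_manual(y, p):
    # L = -(y*log(p) + (1-y)*log(1-p)), averaged over the samples for readability
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

print(log_loss_manual([0, 1], [0.375, 0.375]))  # toy values, not from the post
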
In GBDT each round fits the negative gradient. At prediction time the raw (continuous) scores of the trees are accumulated, and the final score is mapped into (0,1) by the sigmoid function.

$$
\begin{aligned}
f_m(x) &= f_{m-1}(x) + \alpha_m G_m(x) = \sum_{t=1}^{m}\alpha_t G_t(x)\\
P(Y=1\mid X=x_i) &= \mathrm{sigmoid}(f_m(x_i)) = p(x_i) = \frac{1}{1+e^{-f_m(x_i)}}
\end{aligned}
$$

Substituting $f_m(x)$ into the loss:

$$
\begin{aligned}
-Loss(y_i, f_m(x_i)) &= y_i \log\left(\frac{1}{1+e^{-f_m(x_i)}}\right) + (1-y_i)\log\left(\frac{e^{-f_m(x_i)}}{1+e^{-f_m(x_i)}}\right)\\
&= -y_i\log\left(1+e^{-f_m(x_i)}\right) + (1-y_i)\left\{\log\left(e^{-f_m(x_i)}\right) - \log\left(1+e^{-f_m(x_i)}\right)\right\}\\
&= -y_i\log\left(1+e^{-f_m(x_i)}\right) + \log\left(e^{-f_m(x_i)}\right) - \log\left(1+e^{-f_m(x_i)}\right) - y_i\log\left(e^{-f_m(x_i)}\right) + y_i\log\left(1+e^{-f_m(x_i)}\right)\\
&= y_i f_m(x_i) - \log\left(1+e^{f_m(x_i)}\right)
\end{aligned}
$$
so:
$$Loss(y_i, f_m(x_i)) = -\left\{y_i f_m(x_i) - \log\left(1+e^{f_m(x_i)}\right)\right\} = -y_i f_m(x_i) + \log\left(1+e^{f_m(x_i)}\right)$$

Negative gradient

Differentiating the loss with respect to $f$:

$$\frac{\partial L}{\partial f} = -y + \frac{e^f}{1+e^f} = -y + \frac{1}{1+e^{-f}}$$

For the m-th iteration we substitute $f = f_{m-1}$, which gives the gradient at every sample point; the negative gradient that the next tree fits is therefore $y - \frac{1}{1+e^{-f_{m-1}}}$.
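
A minimal sketch of this computation (my own illustration, mirroring the formula above): given the labels y and the raw scores f from the previous round, the negative gradient is simply y - sigmoid(f):

import numpy as np
from scipy.special import expit  # expit(x) = 1 / (1 + exp(-x))

y = np.array([0., 0., 1., 1.])                    # toy labels
f_prev = np.array([-0.51, -0.51, -0.51, -0.51])   # raw scores from the previous round

negative_gradient = y - expit(f_prev)             # what the next tree is fit on
print(negative_gradient)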

Sklearn source code

Intro

  • sklearn version: 0.22
  • The original file is _gb.py; to make testing and dumping intermediate results easier it was copied to _gb_Source.py
  • The iris dataset is used to test and verify the code

We will focus on the following parts:

  • Parameter initialization and the check logic
  • The initial value used before the iterations start
  • What each boosting iteration actually does
  • Model storage and prediction
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.ensemble._gb_Source import GradientBoostingClassifier
from sklearn.utils.validation import check_array, column_or_1d
from sklearn.tree._tree import DTYPE, DOUBLE
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree._tree import TREE_LEAF
from scipy.sparse import csc_matrix
from scipy.sparse import csr_matrix
from scipy.sparse import issparse
from scipy.special import expit
from sklearn.utils.multiclass import check_classification_targets
from sklearn.utils import check_random_state
from sklearn.utils import check_array
from sklearn.utils import column_or_1d
from sklearn.utils import check_consistent_length
from sklearn.utils import deprecated
from sklearn.utils.fixes import logsumexp
from sklearn.utils.stats import _weighted_percentile
from sklearn.utils.validation import check_is_fitted
from sklearn.utils.multiclass import check_classification_targets
from sklearn.exceptions import NotFittedError
from sklearn.dummy import DummyClassifier
import math
from sklearn import tree
iris_data = datasets.load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df["target"] = iris_data.target  # 0 1 2
iris_df["target_name"] = iris_df.target.replace({0: "setosa", 1: "versicolor", 2: "virginica"})
iris_df = iris_df.query("target_name in ('virginica', 'versicolor')")
iris_df = iris_df.iloc[0:80]
print(iris_df.target_name.value_counts())
iris_df.head()
versicolor    50
virginica     30
Name: target_name, dtype: int64
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target target_name
50                7.0               3.2                4.7               1.4       1  versicolor
51                6.4               3.2                4.5               1.5       1  versicolor
52                6.9               3.1                4.9               1.5       1  versicolor
53                5.5               2.3                4.0               1.3       1  versicolor
54                6.5               2.8                4.6               1.5       1  versicolor

Parameter initialization and checks

Calling the fit method of the BaseGradientBoosting class initializes part of the parameters.
The main points to look at during initialization:

  • check_array converts X into an array, and column_or_1d converts y into a 1-d array
  • check_consistent_length verifies that X and y have the same number of samples (a small example of these helpers follows this list)
  • self._validate_y encodes y, mapping the labels to ordered values such as 0, 1, 2
  • self.classes_ and self.n_classes_ hold the distinct class values and the number of classes

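A small illustration of these check helpers (my own toy example, not part of the original walkthrough):

import numpy as np
from sklearn.utils import check_array, column_or_1d, check_consistent_length

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([[0], [1], [1]])        # deliberately a column vector

X = check_array(X)                   # validated 2-d array
y = column_or_1d(y, warn=True)       # flattened to shape (3,)
check_consistent_length(X, y)        # raises ValueError if lengths differ
print(X.shape, y.shape)              # (3, 2) (3,)
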
Let's look at the _validate_y function:

    def _validate_y(self, y, sample_weight):
        check_classification_targets(y)  # check that y holds classification targets
        # np.unique returns the sorted distinct values and maps the original labels to 0, 1, 2, ...
        self.classes_, y = np.unique(y, return_inverse=True)
        # np.bincount counts the (weighted) occurrences of each encoded class;
        # count_nonzero then counts how many classes actually have non-zero weight
        n_trim_classes = np.count_nonzero(np.bincount(y, sample_weight))
        if n_trim_classes < 2:
            raise ValueError("y contains %d class after sample_weight "
                             "trimmed classes with zero weights, while a "
                             "minimum of 2 classes are required."
                             % n_trim_classes)
        self.n_classes_ = len(self.classes_)
        return y

This uses a few numpy APIs; see the docs for details. In short, _validate_y maps labels like ['a','b','c'] to [0,1,2]: self.classes_ holds the sorted original label values, and y is replaced by the encoded values. Also note that for binary classification, fit calls the _validate_y overridden in GradientBoostingClassifier rather than the one in BaseGradientBoosting.
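
For reference, the np.unique call with return_inverse=True behaves like this (toy example of mine):

import numpy as np

labels = np.array(['virginica', 'versicolor', 'versicolor', 'virginica'])
classes_, y_encoded = np.unique(labels, return_inverse=True)
print(classes_)    # ['versicolor' 'virginica']  (sorted unique values)
print(y_encoded)   # [1 0 0 1]                   (labels mapped to indices into classes_)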

gbm = GradientBoostingClassifier(random_state=10,n_estimators=5)
gbm.fit(X=iris_df.iloc[:, 0:4], y=iris_df.target_name)
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=5,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=10, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
gbm.n_classes_
2
gbm.classes_
array(['versicolor', 'virginica'], dtype=object)

fit encodes y; predict decodes it back.

gbm._validate_y(iris_df.target_name,sample_weight=None)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

Determining the initial value for the iterations

gbm.loss
'deviance'
        if (self.loss not in self._SUPPORTED_LOSS
                or self.loss not in _gb_losses.LOSS_FUNCTIONS):
            raise ValueError("Loss '{0:s}' not supported. ".format(self.loss))
        # loss_class is the class itself (not yet an instance)
        if self.loss == 'deviance':
            loss_class = (_gb_losses.MultinomialDeviance
                          if len(self.classes_) > 2
                          else _gb_losses.BinomialDeviance)
        else:
            loss_class = _gb_losses.LOSS_FUNCTIONS[self.loss]

        if self.loss in ('huber', 'quantile'):
            self.loss_ = loss_class(self.n_classes_, self.alpha)
        else:
            self.loss_ = loss_class(self.n_classes_)  # pass in the number of classes to get a loss instance

First, let's see how the loss is set up:

  • _check_params performs part of the parameter initialization and validation
  • self.loss comes in as a constructor argument, 'deviance' by default; it is first checked against the supported losses
  • The number of entries in self.classes_ decides between multiclass and binary classification
  • For binary classification loss_class ends up being the BinomialDeviance class
  • self.loss_ is an instance of that class; for binary classification the argument 2 is passed in, but the constructor then forces the value to 1
  • In the parent class LossFunction the attribute K therefore becomes 1, i.e. self.loss_.K = 1; this K is crucial for the dimensions of the arrays created later
class BinomialDeviance(ClassificationLossFunction):
    # note: n_classes is forced to 1 here, i.e. K becomes 1
    def __init__(self, n_classes):
        if n_classes != 2:
            raise ValueError("{0:s} requires 2 classes; got {1:d} class(es)"
                             .format(self.__class__.__name__, n_classes))
        # we only need to fit one tree for binary clf.
        super().__init__(n_classes=1)

接下来看初始值

    def _init_state(self):
        """Initialize model state and allocate model state data structures. """

        self.init_ = self.init
        # init_ provides the predictions before the first boosting round, i.e. the starting y_hat;
        # for classification the default (init=None) uses an estimator that predicts the class priors
        if self.init_ is None:
            self.init_ = self.loss_.init_estimator()

        self.estimators_ = np.empty((self.n_estimators, self.loss_.K),
                                    dtype=np.object)
        self.train_score_ = np.zeros((self.n_estimators,), dtype=np.float64)
        # do oob?
        if self.subsample < 1.0:
            self.oob_improvement_ = np.zeros((self.n_estimators),
                                             dtype=np.float64)

_init_state initializes init_ and estimators_. Note that for binary classification an empty object array of shape (self.n_estimators, 1) is created to store the regression tree of each iteration.

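A quick check on the model fitted above (my own addition): with n_estimators=5 and K=1, estimators_ is a (5, 1) object array and each entry is a fitted regression tree.

print(gbm.estimators_.shape)        # (5, 1)
print(type(gbm.estimators_[0][0]))  # DecisionTreeRegressor
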
The initial value is controlled by the init parameter:

estimator or ‘zero’, default=None
An estimator object that is used to compute the initial predictions. init has to provide fit and predict_proba. If ‘zero’, the initial raw predictions are set to zero. By default, a DummyEstimator predicting the classes priors is used.
  • _init_state sets self.init_ = self.init, copying the constructor argument into init_
  • If self.init_ is None, then self.init_ = self.loss_.init_estimator()
  • self.loss_.init_estimator(), i.e. BinomialDeviance's init_estimator method, returns DummyClassifier(strategy='prior')
  • That completes self._init_state()
  • If self.init_ == 'zero', raw_predictions is an (n, 1) array of zeros
  • Otherwise self.init_.fit(X, y, sample_weight=sample_weight) is called, i.e. DummyClassifier.fit
  • Then raw_predictions = self.loss_.get_init_raw_predictions(X, self.init_); this part is slightly involved, so let's look at the code
    def get_init_raw_predictions(self, X, estimator):
        probas = estimator.predict_proba(X)
        # prior probability of the positive class
        proba_pos_class = probas[:, 1]
        # eps is the smallest positive float32
        eps = np.finfo(np.float32).eps
        # clip proba_pos_class into [eps, 1 - eps]
        proba_pos_class = np.clip(proba_pos_class, eps, 1 - eps)
        # apply the inverse of the sigmoid; at prediction time the sigmoid maps it back to a probability
        # log(x / (1 - x)) is the inverse of the sigmoid (expit) function
        raw_predictions = np.log(proba_pos_class / (1 - proba_pos_class))
        # reshape to (n, 1) so it matches raw_predictions[:, k] in _update_terminal_regions;
        # for binary classification K = 1, so k is always 0
        return raw_predictions.reshape(-1, 1).astype(np.float64)
prior_model = DummyClassifier(strategy='prior')
prior_model.fit(X=iris_df.iloc[:, 0:4], y=iris_df.target_name)
DummyClassifier(constant=None, random_state=None, strategy='prior')
prior_model.predict_proba(X=iris_df.iloc[:, 0:4])[0:5]
array([[0.625, 0.375],
       [0.625, 0.375],
       [0.625, 0.375],
       [0.625, 0.375],
       [0.625, 0.375]])

Simply put, if the positive:negative ratio is 3:5, every sample's p(y=1) is predicted as 0.375, and raw_predictions is then obtained through the inverse of the sigmoid. The key thing to understand is that raw_predictions is just a continuous score, the running sum that is accumulated step by step during boosting; only at the end is it mapped into (0,1) by the sigmoid. The numpy functions used above are worth checking in the docs.
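
Concretely, with a positive-class prior of 0.375 the initial raw prediction is the log-odds:
$$\log\frac{0.375}{1-0.375} = \log(0.6) \approx -0.5108$$
which is exactly the value verified below.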

Let's verify this logic:

from sklearn.ensemble._gb_losses import BinomialDeviance
gbm_loss = BinomialDeviance(2)
gbm_loss.K
1

As you can see, K is forced to 1 regardless of the constructor argument.

raw_predictions = gbm_loss.get_init_raw_predictions(X=iris_df.iloc[:, 0:4],estimator=prior_model)
raw_predictions.shape
(80, 1)
raw_predictions[0:5]
array([[-0.51082562],
       [-0.51082562],
       [-0.51082562],
       [-0.51082562],
       [-0.51082562]])

Now verify the sigmoid transform:

from scipy.special import expit
print(expit(-0.51082562),1/(1+np.exp(0.51082562)))
0.37500000088265406 0.37500000088265406

It matches the prior probability predicted by the DummyClassifier.

The boosting iterations

n_stages = self._fit_stages(
            X, y, raw_predictions, sample_weight, self._rng, X_val, y_val,
            sample_weight_val, begin_at_stage, monitor, X_idx_sorted)
  • self._fit_stages itself has little extra logic: it loops self.n_estimators times and calls self._fit_stage; what matters is how the fitted values and the model are updated in each round
  • i.e.: raw_predictions = self._fit_stage(i, X, y, raw_predictions, sample_weight, sample_mask, random_state, X_idx_sorted, X_csc, X_csr)

Each iteration consists of the following main steps (a compact sketch follows this list):

  • Compute the negative gradient from the previous round's raw_predictions
  • Fit a regression tree to the negative gradient and update the predicted value of each leaf node (and hence the tree)
  • Update raw_predictions and store the tree in self.estimators_

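Before stepping through the real code, here is a compact sketch of what one such iteration does under the defaults (binary classification, no subsampling). It is my own illustration, and fit_one_stage is my own name, not sklearn's; sklearn's _fit_stage additionally rewrites each leaf value with a Newton step, which is only hinted at in a comment here.

from scipy.special import expit
from sklearn.tree import DecisionTreeRegressor

def fit_one_stage(X, y, raw_predictions, learning_rate=0.1, max_depth=3):
    # 1) negative gradient computed from the previous round's raw predictions
    residual = y - expit(raw_predictions.ravel())
    # 2) fit a regression tree to the negative gradient
    tree = DecisionTreeRegressor(criterion='friedman_mse', max_depth=max_depth)
    tree.fit(X, residual)
    #    (sklearn then replaces each leaf value with
    #     sum(residual) / sum((y - residual) * (1 - y + residual)) over the leaf)
    # 3) additive update of the raw predictions
    raw_predictions = raw_predictions + learning_rate * tree.predict(X).reshape(-1, 1)
    return tree, raw_predictions
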
Let's walk through a single model update.

Train a regression tree where the input X is the feature matrix and y is the negative gradient.

# first encode the string class labels into 0/1
y = gbm._validate_y(iris_df.target_name,sample_weight=None)
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
raw_predictions[0:5]
array([[-0.51082562],
       [-0.51082562],
       [-0.51082562],
       [-0.51082562],
       [-0.51082562]])
residual = gbm.loss_.negative_gradient(y,raw_predictions)
residual
array([-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
       -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
       -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
       -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
       -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
       -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
       -0.375, -0.375,  0.625,  0.625,  0.625,  0.625,  0.625,  0.625,
        0.625,  0.625,  0.625,  0.625,  0.625,  0.625,  0.625,  0.625,
        0.625,  0.625,  0.625,  0.625,  0.625,  0.625,  0.625,  0.625,
        0.625,  0.625,  0.625,  0.625,  0.625,  0.625,  0.625,  0.625])
    def negative_gradient(self, y, raw_predictions, **kargs):
        """Compute the residual (= negative gradient).

        Parameters
        ----------
        y : 1d array, shape (n_samples,)
            True labels.

        raw_predictions : 2d array, shape (n_samples, K)
            The raw_predictions (i.e. values from the tree leaves) of the
            tree ensemble at iteration ``i - 1``.
        """
        # expit(x) = 1/(1+exp(-x))
        return y - expit(raw_predictions.ravel())

The formula for the negative gradient was derived earlier, and the code above is clear: y is the true label (0 or 1) and raw_predictions is the previous round's prediction. Let's verify:

$$
\begin{aligned}
\frac{\partial L}{\partial f} &= -y + \frac{e^f}{1+e^f} = -y + \frac{1}{1+e^{-f}}\\
&= -0 + \frac{1}{1+e^{-(-0.51082562)}} = \frac{1}{1+e^{0.51082562}} = 0.3750
\end{aligned}
$$

1/(1+math.exp(0.51082562))
0.37500000088265406

Fitting the regression tree & updating the leaf nodes

The hyperparameters here may differ from those used inside the source; that doesn't matter, we only care about the core logic.

dtree = DecisionTreeRegressor(criterion="friedman_mse")
dtree.fit(iris_df.iloc[:, 0:4], residual)
DecisionTreeRegressor(ccp_alpha=0.0, criterion='friedman_mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')
from IPython.display import Image 
import pydotplus 
dot_data = tree.export_graphviz(dtree, out_file=None, 
                         feature_names=iris_data.feature_names,  
                         class_names=iris_data.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png()) 

[Figure: the regression tree fitted to the first-round residuals]

The formula for updating the leaf predictions is intuitive, but the code involves quite a few tree internals, so we'll work through it roughly.

    def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                                residual, raw_predictions, sample_weight):
        """Make a single Newton-Raphson step.

        our node estimate is given by:

            sum(w * (y - prob)) / sum(w * prob * (1 - prob))

        we take advantage that: y - prob = residual
        """
        terminal_region = np.where(terminal_regions == leaf)[0]
        residual = residual.take(terminal_region, axis=0)
        y = y.take(terminal_region, axis=0)
        sample_weight = sample_weight.take(terminal_region, axis=0)

        numerator = np.sum(sample_weight * residual)
        denominator = np.sum(sample_weight *
                             (y - residual) * (1 - y + residual))

        # prevents overflow and division by zero
        # this sets the value of the leaf node
        if abs(denominator) < 1e-150:
            tree.value[leaf, 0, 0] = 0.0
        else:
            tree.value[leaf, 0, 0] = numerator / denominator
dtree.apply(iris_df.iloc[:,0:4])
array([ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
        2,  2,  2, 14,  2,  2,  2,  2,  2,  2, 11,  2,  2,  2,  2,  2,  6,
        2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2, 16,
       16, 16, 16, 16, 16, 10, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
       16,  4, 16, 16, 16, 16, 16, 16, 15, 16, 16,  7], dtype=int64)
dtree.tree_.children_left
array([ 1,  2, -1,  4, -1,  6, -1, -1,  9, 10, -1, -1, 13, 14, -1, -1, -1],
      dtype=int64)

Nodes 3, 5, 7, ... (counting from 1) are leaf nodes. Let's check whether the prediction of the 3rd node equals the mean of the fitted targets.
Note: "the 3rd node" counts from 1, while the node ids returned by apply count from 0, so the 3rd node has id 2.

np.where(dtree.apply(iris_df.iloc[:,0:4]) == 2)[0]
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 34, 35, 36,
       37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], dtype=int64)
for i in [2, 4, 6, 7]:
    print(
        "leaf num - %s:" % i, "(check value %s)" %
        residual[np.where(dtree.apply(iris_df.iloc[:, 0:4]) == i)[0]].mean(),
        "(real value%s)" % dtree.tree_.value[i])
leaf num - 2: (check value -0.37499999999999994) (real value[[-0.375]])
leaf num - 4: (check value 0.625) (real value[[0.625]])
leaf num - 6: (check value -0.37499999999999994) (real value[[-0.375]])
leaf num - 7: (check value 0.625) (real value[[0.625]])
residual_sample = residual.take(np.where(dtree.apply(iris_df.iloc[:,0:4]) == 2)[0])
y_sample = gbm._validate_y(iris_df.target_name,sample_weight=None).take(np.where(dtree.apply(iris_df.iloc[:,0:4]) == 2)[0], axis=0)
numerator = np.sum(residual_sample)
denominator = np.sum((y_sample - residual_sample) * (1 - y_sample + residual_sample))
numerator / denominator
-1.5999999999999999
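
This matches a direct hand calculation: every sample in this leaf has $y=0$ and residual $r=-0.375$, so each sample contributes $-0.375$ to the numerator and $(y-r)(1-y+r)=0.375\times0.625=0.234375$ to the denominator, giving $-0.375/0.234375=-1.6$.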

In a standard CART regression tree the prediction at a leaf is the mean of the training targets falling into that leaf (optimal under squared error). Binary GBDT, however, does not simply try to match the negative gradient as closely as possible; it aims directly at the original objective, minimizing the log loss:
$$c_{tj} = \underset{c}{\arg\min} \sum_{x_i \in R_{tj}} L\left(y_i, f_{t-1}(x_i) + c\right)$$
This is hard to optimize exactly, so the following approximation (a single Newton-Raphson step) is used instead:
$$c_{tj} = \frac{\sum_{x_i \in R_{tj}} r_{ti}}{\sum_{x_i \in R_{tj}} (y_i - r_{ti})\left(1 - y_i + r_{ti}\right)}$$

Now the code:

# update the prediction of a single leaf node
    def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                                residual, raw_predictions, sample_weight):
        """Make a single Newton-Raphson step.

        our node estimate is given by:

            sum(w * (y - prob)) / sum(w * prob * (1 - prob))

        we take advantage that: y - prob = residual
        """
        # indices of the samples that fall into this leaf
        terminal_region = np.where(terminal_regions == leaf)[0]
        # take the corresponding negative gradients and labels
        residual = residual.take(terminal_region, axis=0)
        y = y.take(terminal_region, axis=0)
        # sample weights; ignore them for now
        sample_weight = sample_weight.take(terminal_region, axis=0)
        # sum of the negative gradients
        numerator = np.sum(sample_weight * residual)
        denominator = np.sum(sample_weight *
                             (y - residual) * (1 - y + residual))

        # prevents overflow and division by zero
        # this sets the value of the leaf node
        if abs(denominator) < 1e-150:
            tree.value[leaf, 0, 0] = 0.0
        else:
            tree.value[leaf, 0, 0] = numerator / denominator

tree is a mutable object, so its values are modified in place and the change propagates across the functions it is passed to (lists and arrays behave similarly); the residuals are later updated in the same in-place way.
Let's verify the computation above.

First plot the tree of the first estimator, which makes it easy to read off the leaf values.

from IPython.display import Image 
import pydotplus 
dot_data = tree.export_graphviz(gbm.estimators_[0][0], out_file=None, 
                         feature_names=iris_data.feature_names,  
                         class_names=iris_data.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png()) 

[Figure: the tree of the first estimator, gbm.estimators_[0][0]]

gbm.estimators_[0][0].apply(iris_df.iloc[:, 0:4])
array([ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
        2,  2,  2, 11,  2,  2,  2,  2,  2,  2,  9,  2,  2,  2,  2,  2,  4,
        2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2, 12,
       12, 12, 12, 12, 12,  8, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
       12,  4, 12, 12, 12, 12, 12, 12, 11, 12, 12,  5], dtype=int64)
gbm.estimators_[0][0].tree_.children_left
array([ 1,  2, -1,  4, -1, -1,  7,  8, -1, -1, 11, -1, -1], dtype=int64)

By modifying the source, each round's residuals and raw predictions are saved on the estimator (self.residual_list and self.raw_predictions_list).

Compute the output value of leaf node 3 (id 2):

leaf2_sample_index = np.where(gbm.estimators_[0][0].apply(iris_df.iloc[:, 0:4])==2)[0]
leaf2_sample_index
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 34, 35, 36,
       37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], dtype=int64)
y_sample = gbm._validate_y(iris_df.target_name,sample_weight=None).take(leaf2_sample_index, axis=0)
residual_sample = gbm.residual_list[0][0].take(leaf2_sample_index, axis=0)
numerator = np.sum(residual_sample)
denominator = np.sum((y_sample - residual_sample) * (1 - y_sample + residual_sample))
numerator / denominator
-1.5999999999999999
gbm.estimators_[0][0].tree_.value[2, 0, 0]
-1.5999999999999999

Updating raw_predictions

The fit above produced a new regression tree. raw_predictions is then updated as follows:

        raw_predictions[:, k] += \
            learning_rate * tree.value[:, 0, 0].take(terminal_regions, axis=0)
  • For binary classification K = 1, so there is only one column and k is always 0; ignore the multiclass case for now
  • raw_predictions is updated by adding the new tree's leaf predictions, scaled by the learning rate, to the previous raw_predictions
  • This is exactly the additive-model logic (a quick numeric check follows this list)
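
A quick numeric check with the values seen so far: the first sample's initial raw prediction is $-0.5108$, the first tree's leaf value for that sample is $-1.6$, and learning_rate is 0.1, so after the first stage its raw prediction becomes $-0.5108 + 0.1\times(-1.6) = -0.6708$; the $-0.16$ contribution is exactly the first entry of pre_list in the prediction section below.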

The second round

  • Recompute the negative gradient from the updated raw_predictions and y
  • Train a new regression tree on (X, residual)
  • Keep updating raw_predictions

Stopping condition

Iteration stops once the preset n_estimators is reached.
The trees built during the iterations are stored in self.estimators_.

Model prediction

raw_prediction

Prediction works in two steps:

  • The N trees each produce a raw prediction
  • The N results are multiplied by learning_rate, summed, and added to the initial value; the sigmoid then converts the total into a probability in (0,1)

$$F_M(x) = F_0(x) + \sum_{m=1}^{M}\sum_{j=1}^{J_m} c_{m,j}\, I\left(x \in R_{m,j}\right)$$

    # unless init is specified, this returns the prior-based initial prediction (the class priors)
    def _raw_predict_init(self, X):
        """Check input and compute raw predictions of the init estimtor."""
        self._check_initialized()
        X = self.estimators_[0, 0]._validate_X_predict(X, check_input=True)
        if X.shape[1] != self.n_features_:
            raise ValueError("X.shape[1] should be {0:d}, not {1:d}.".format(
                self.n_features_, X.shape[1]))
        if self.init_ == 'zero':
            raw_predictions = np.zeros(shape=(X.shape[0], self.loss_.K),
                                       dtype=np.float64)
        else:
            raw_predictions = self.loss_.get_init_raw_predictions(
                X, self.init_).astype(np.float64)
        return raw_predictions
    # per the docstring this returns the sum of the trees' raw predictions; predict_stages is compiled (Cython), so its Python source is not shown
    def _raw_predict(self, X):
        """Return the sum of the trees raw predictions (+ init estimator)."""
        # initial prediction
        raw_predictions = self._raw_predict_init(X)
        # accumulate over all boosting stages to get the final raw_predictions
        predict_stages(self.estimators_, X, self.learning_rate,
                       raw_predictions)
        return raw_predictions

The core routine lives in a compiled extension (a .pyd/Cython file), so its source is not directly visible. Let's verify the logic above by hand.

The initial value follows the same logic as in the initialization section: gbm.init_ (a DummyClassifier) first produces the prior probability, and the inverse sigmoid turns it into the raw_prediction.

gbm.loss_.get_init_raw_predictions(iris_df.iloc[0:2,0:4],gbm.init_)
array([[-0.51082562],
       [-0.51082562]])

Next, get each tree's prediction multiplied by learning_rate:

pre_list = []
for i in gbm.estimators_:
    pre_list.append(i[0].predict(iris_df.iloc[0:1,0:4])[0]*gbm.learning_rate)
pre_list
[-0.16,
 -0.1511286273379727,
 -0.14395717809370576,
 -0.13806361187211877,
 -0.13315505338282463]
gbm.loss_.get_init_raw_predictions(iris_df.iloc[0:1,0:4],gbm.init_)[0][0]+sum(pre_list)
-1.2371300944526125

predict_proba

The sigmoid function converts the raw prediction into a probability.

    def predict_proba(self, X):
        raw_predictions = self.decision_function(X)
        try:
            # sigmoid (expit) converts the raw predictions into probabilities
            return self.loss_._raw_prediction_to_proba(raw_predictions)
        except NotFittedError:
            raise
        except AttributeError:
            raise AttributeError('loss=%r does not support predict_proba' %
                                 self.loss)
gbm.predict_proba(iris_df.iloc[0:1,0:4])
array([[0.77506407, 0.22493593]])
expit(gbm.loss_.get_init_raw_predictions(iris_df.iloc[0:1,0:4],gbm.init_)[0][0]+sum(pre_list))
0.2249359293642033

This matches predict_proba exactly, so the logic checks out.

predict

predict additionally decodes the labels: during initialization gbm._validate_y encoded the label values, e.g. in this case 'versicolor' and 'virginica' were mapped to 0 and 1.

    def predict(self, X):
        # some of this is not pure Python; in short, it accumulates the per-stage predictions into one raw score
        raw_predictions = self.decision_function(X)
        # find the index of the largest probability
        encoded_labels = \
            self.loss_._raw_prediction_to_decision(raw_predictions)
        # encoded_labels holds values like 0/1; the return maps them back to the original labels, e.g. (good, bad) or (+1, -1)
        return self.classes_.take(encoded_labels, axis=0)
-------------------------------------------------------------------------------------
    def _raw_prediction_to_proba(self, raw_predictions):
        proba = np.ones((raw_predictions.shape[0], 2), dtype=np.float64)  # allocate an (n, 2) array of ones to hold the probabilities
        proba[:, 1] = expit(raw_predictions.ravel())  # sigmoid maps the raw values into (0, 1)
        proba[:, 0] -= proba[:, 1]
        return proba

    def _raw_prediction_to_decision(self, raw_predictions):
        # convert to probabilities
        proba = self._raw_prediction_to_proba(raw_predictions)
        return np.argmax(proba, axis=1)  # returns the index of the larger probability (0 or 1); predict maps it to a class label
  • decision_function produces the raw_prediction
  • _raw_prediction_to_decision picks the index with the larger probability, i.e. 0 or 1
  • predict maps that index back to the original class label (see the short check after this list)
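
A minimal check of the decode step on the model fitted above (my own snippet):

proba = gbm.predict_proba(iris_df.iloc[0:1, 0:4])  # array([[0.775..., 0.224...]])
encoded = np.argmax(proba, axis=1)                  # [0] -> the first class wins
print(gbm.classes_.take(encoded, axis=0))           # ['versicolor']
print(gbm.predict(iris_df.iloc[0:1, 0:4]))          # same result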

Summary

Model training

  • Initialize the 0-th weak learner $F_0$; depending on the init parameter this is usually a DummyClassifier, which gives the initial raw_prediction
  • Train n_estimators estimators:
    • 1) Compute the negative gradient (residual) from raw_prediction and y
    • 2) Fit a CART regression tree on (features X, negative gradient residual)
    • 3) Update the output value of each leaf node
    • 4) Update raw_prediction: raw_prediction += learning_rate * leaf_output, i.e. add the learning rate times the tree's prediction
    • Repeat the four steps above for the next iteration
  • Stop once the specified number of estimators is reached

Model prediction

  • First get the initial value, i.e. the prediction of the 0-th weak learner above
  • Then multiply the predictions of the n_estimators trees by the learning rate
  • Add the two together and pass the sum through the sigmoid to get the probability


This took almost a month. I'm still hopelessly lazy; neither smart nor diligent, yet still looking for shortcuts. A big mistake!

                                2021-05-18, Jiulonghu, Jiangning District, Nanjing
