Log Loss Function
In logistic regression (LR), the parameters are fitted by maximum likelihood: we look for parameters under which the observed samples are most probable. For a classification problem the training labels are known, and if every sample were predicted perfectly we would have $\hat{y}=y$, i.e.:
- when $y=0$, $P(Y=1|X)$ should be as close to 0 as possible
- when $y=1$, $P(Y=1|X)$ should be as close to 1 as possible

The objective is therefore:
$$\prod_{i=1}^{N}\left[P\left(Y=1|X=x_{i}\right)\right]^{y_{i}}\left[1-P\left(Y=1|X=x_{i}\right)\right]^{1-y_{i}}$$

which should be as large as possible.
Taking the log turns the product into a sum, and the loss function is defined as:
$$Loss = -\sum_{i=1}^{N}\left\{y_{i}\log P\left(Y=1|X=x_{i}\right)+\left(1-y_{i}\right)\log\left[1-P\left(Y=1|X=x_{i}\right)\right]\right\}$$
Assume $y\in\{0,1\}$. For a single sample $(x_i, y_i)$, the log loss can be written as:
$$L\left(y_{i}, p\left(Y=1|x_{i}\right)\right)=-\left\{y_{i} \log p_{i}+\left(1-y_{i}\right) \log \left(1-p_{i}\right)\right\}$$
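As a quick illustration (mine, not part of the sklearn source), a few lines of numpy that evaluate this per-sample log loss:

import numpy as np

def log_loss_per_sample(y, p):
    # -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1 to avoid log(0)
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    eps = np.finfo(np.float64).eps
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# a confident correct prediction costs little, a confident wrong one costs a lot
print(log_loss_per_sample([1, 1, 0], [0.9, 0.1, 0.1]))  # roughly [0.105, 2.303, 0.105]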
In GBDT, every round fits the negative gradient. At prediction time, the continuous per-tree values are first accumulated, and the sum is finally mapped into (0, 1) by the sigmoid function.
$$\begin{aligned} &f_m(x) = f_{m-1}(x) + \alpha_m G_m(x) = \sum_{t=1}^{m} G_t(x)\\ &P(Y=1|X=x_i)=\operatorname{sigmoid}(f_m(x_i))=p(x_i)=\frac{1}{1+e^{-f_m(x_i)}} \end{aligned}$$
Substituting $f_m(x)$ into the loss:
$$\begin{aligned} -Loss(y_i,f_m(x_i))&=y_{i} \log \left(\frac{1}{1+e^{-f_{m}(x_{i})}}\right)+\left(1-y_{i}\right) \log \left(\frac{e^{-f_{m}(x_{i})}}{1+e^{-f_{m}(x_{i})}}\right)\\ &=-y_{i} \log \left(1+e^{-f_{m}(x_{i})}\right)+\left(1-y_{i}\right)\left\{\log \left(e^{-f_{m}(x_{i})}\right)-\log \left(1+e^{-f_{m}(x_{i})}\right)\right\}\\ &=-y_{i}\log \left(1+e^{-f_{m}(x_{i})}\right)+\log \left(e^{-f_{m}(x_{i})}\right)-\log \left(1+e^{-f_{m}(x_{i})}\right)-y_{i} \log \left(e^{-f_{m}(x_{i})}\right)+y_{i} \log \left(1+e^{-f_{m}(x_{i})}\right) \\ &= y_{i} f_{m}(x_{i})-\log \left(1+e^{f_{m}(x_{i})}\right)\\ \text{so: } Loss(y_i,f_m(x_i))&=-\left\{ y_{i} f_{m}(x_{i})-\log \left(1+e^{f_{m}(x_{i})}\right)\right\}\\ &=- y_{i} f_{m}(x_{i})+\log \left(1+e^{f_{m}(x_{i})}\right) \end{aligned}$$
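A small numerical check (again just an illustration, not from the sklearn source) that the derived form $-y_i f_m(x_i)+\log(1+e^{f_m(x_i)})$ agrees with the original log loss when $p_i=\operatorname{sigmoid}(f_m(x_i))$:

import numpy as np
from scipy.special import expit  # expit(x) = 1/(1+exp(-x))

y, f = 1.0, -0.7                     # an arbitrary label and raw score f_m(x_i)
p = expit(f)                         # p_i = sigmoid(f)
loss_from_p = -(y * np.log(p) + (1 - y) * np.log(1 - p))
loss_from_f = -y * f + np.log(1 + np.exp(f))
print(loss_from_p, loss_from_f)      # both about 1.103, the two forms match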
Negative Gradient
Differentiating the loss with respect to $f$ gives:
$$\begin{aligned} \frac{\delta L}{\delta f} &=-y+\frac{e^{f}}{1+e^{f}} \\ &=-y+\frac{1}{1+e^{-f}} \end{aligned}$$
so the negative gradient is $y-\frac{1}{1+e^{-f}}$. For the $m$-th iteration, $f=f_{m-1}$, and plugging it in gives the negative gradient at every sample point.
Sklearn Source Code
Intro
- sklearn version: 0.22
- The original file is _gb.py; to make testing and dumping intermediate results easier, it was copied to _gb_Source.py
- The iris dataset is used to test and verify the code

We focus on the following parts:
- Parameter initialization and checks
- The initial values used for the iterations
- What each boosting round actually does
- Model storage and prediction
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.ensemble._gb_Source import GradientBoostingClassifier
from sklearn.utils.validation import check_array, column_or_1d
from sklearn.tree._tree import DTYPE, DOUBLE
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree._tree import TREE_LEAF
from scipy.sparse import csc_matrix
from scipy.sparse import csr_matrix
from scipy.sparse import issparse
from scipy.special import expit
from sklearn.utils.multiclass import check_classification_targets
from sklearn.utils import check_random_state
from sklearn.utils import check_consistent_length
from sklearn.utils import deprecated
from sklearn.utils.fixes import logsumexp
from sklearn.utils.stats import _weighted_percentile
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError
from sklearn.dummy import DummyClassifier
import math
from sklearn import tree
iris_data = datasets.load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df["target"] = iris_data.target # 0 1 2
iris_df["target_name"] = iris_df.target.replace({0: "setosa", 1: "versicolor", 2: "virginica"})
iris_df = iris_df.query("target_name in ('virginica', 'versicolor')")
iris_df = iris_df.iloc[0:80]
print(iris_df.target_name.value_counts())
iris_df.head()
versicolor 50
virginica 30
Name: target_name, dtype: int64
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | target_name |
|---|---|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | 1 | versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | 1 | versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | 1 | versicolor |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 | 1 | versicolor |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | 1 | versicolor |
Parameter Initialization and Checks
When the fit method of the BaseGradientBoosting class is called, part of the parameter initialization is done there.
Two points to note in the initialization:
- check_array converts X and y to arrays, and column_or_1d converts y into a 1-d array
- check_consistent_length verifies that X and y have the same number of samples

self._validate_y encodes y, turning the labels into ordered integer values such as 0, 1, 2. self.classes_ and self.n_classes_ hold the enumeration of class values and the number of classes, respectively. Let's look at the _validate_y function:
def _validate_y(self, y, sample_weight):
    check_classification_targets(y)  # check that y is categorical; from the code it seems to judge the type via the number of unique values of y
    # enumerate the unique values and map them to 0, 1, 2, ...; the original values are sorted first
    self.classes_, y = np.unique(y, return_inverse=True)
    # np.bincount counts how often each value occurs; since y has already been encoded above, this is simply the per-class count
    # then count the non-zero entries; in theory this should never fail here...
    n_trim_classes = np.count_nonzero(np.bincount(y, sample_weight))
    if n_trim_classes < 2:
        raise ValueError("y contains %d class after sample_weight "
                         "trimmed classes with zero weights, while a "
                         "minimum of 2 classes are required."
                         % n_trim_classes)
    self.n_classes_ = len(self.classes_)
    return y
This involves a few numpy APIs; see the docs for details. In short, _validate_y maps ['a','b','c'] to [0,1,2]: self.classes_ keeps the sorted original labels, and y is replaced by the mapped integer values. Also note that for binary classification, fit calls the _validate_y overridden in GradientBoostingClassifier, not the one in BaseGradientBoosting.
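For reference, here is a tiny standalone illustration (mine, not from the sklearn source) of the np.unique call that does the encoding:

import numpy as np

labels = np.array(["virginica", "versicolor", "versicolor", "virginica"])
classes_, y_encoded = np.unique(labels, return_inverse=True)
print(classes_)    # ['versicolor' 'virginica']  (sorted original labels)
print(y_encoded)   # [1 0 0 1]  (each label replaced by its index in classes_)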
gbm = GradientBoostingClassifier(random_state=10,n_estimators=5)
gbm.fit(X=iris_df.iloc[:, 0:4], y=iris_df.target_name)
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=5,
n_iter_no_change=None, presort='deprecated',
random_state=10, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
gbm.n_classes_
2
gbm.classes_
array(['versicolor', 'virginica'], dtype=object)
The encoding of y happens during fit; the decoding happens at prediction time.
gbm._validate_y(iris_df.target_name,sample_weight=None)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
Determining the Initial Value of the Iteration
gbm.loss
'deviance'
if (self.loss not in self._SUPPORTED_LOSS
        or self.loss not in _gb_losses.LOSS_FUNCTIONS):
    raise ValueError("Loss '{0:s}' not supported. ".format(self.loss))

# loss_class is simply a reference to the loss class itself
if self.loss == 'deviance':
    loss_class = (_gb_losses.MultinomialDeviance
                  if len(self.classes_) > 2
                  else _gb_losses.BinomialDeviance)
else:
    loss_class = _gb_losses.LOSS_FUNCTIONS[self.loss]

if self.loss in ('huber', 'quantile'):
    self.loss_ = loss_class(self.n_classes_, self.alpha)
else:
    self.loss_ = loss_class(self.n_classes_)  # pass in the number of classes to obtain an instance of the loss class
First, how the loss is set up. _check_params initializes and checks part of the parameters. self.loss is passed in as a parameter and defaults to 'deviance'.
- It first verifies that the given loss value is supported
- Based on the number of entries in self.classes_, it decides between multiclass and binary classification
- loss_class is essentially the BinomialDeviance class itself, and self.loss_ is an instance of it; for binary classification 2 is passed in, but the constructor forces the argument to 1
- In the superclass LossFunction, the attribute K therefore becomes 1, i.e. self.loss_.K = 1; this K is crucial later for the dimensions of the generated arrays
class BinomialDeviance(ClassificationLossFunction):
    # note that n_classes is forced to 1 here, i.e. K becomes 1
    def __init__(self, n_classes):
        if n_classes != 2:
            raise ValueError("{0:s} requires 2 classes; got {1:d} class(es)"
                             .format(self.__class__.__name__, n_classes))
        # we only need to fit one tree for binary clf.
        super().__init__(n_classes=1)
Next, the initial value.
def _init_state(self):
    """Initialize model state and allocate model state data structures. """
    self.init_ = self.init
    # this determines the first-round prediction, i.e. the initial y_hat;
    # for classification the default is based on the prior (the majority class proportion)
    # init defaults to None, in which case a DummyEstimator computing the prior is used
    if self.init_ is None:
        self.init_ = self.loss_.init_estimator()

    self.estimators_ = np.empty((self.n_estimators, self.loss_.K),
                                dtype=np.object)
    self.train_score_ = np.zeros((self.n_estimators,), dtype=np.float64)
    # do oob?
    if self.subsample < 1.0:
        self.oob_improvement_ = np.zeros((self.n_estimators),
                                         dtype=np.float64)
_init_state initializes the loss and estimators_. Note that for binary classification, an empty array of shape (self.n_estimators, 1) is created to hold the regression trees produced during the iterations.
The initial value is specified by init:

estimator or 'zero', default=None
An estimator object that is used to compute the initial predictions. init has to provide fit and predict_proba. If 'zero', the initial raw predictions are set to zero. By default, a DummyEstimator predicting the classes priors is used.

_init_state initializes the parameters: self.init_ = self.init assigns the init value to init_.
- If self.init_ is None, then self.init_ = self.loss_.init_estimator()
- Calling self.loss_.init_estimator(), i.e. BinomialDeviance's init_estimator method, returns DummyClassifier(strategy='prior')
- That completes the self._init_state() step
- If self.init_ == 'zero', raw_predictions is an (n, 1) array whose elements are all 0
- Otherwise, self.init_.fit(X, y, sample_weight=sample_weight) is called, i.e. DummyClassifier.fit
- Then raw_predictions = self.loss_.get_init_raw_predictions(X, self.init_); this part is slightly more involved, so let's look at the code
def get_init_raw_predictions(self, X, estimator):
    probas = estimator.predict_proba(X)
    # prior probability of the positive class
    proba_pos_class = probas[:, 1]
    # eps is the smallest positive float32 value
    eps = np.finfo(np.float32).eps
    # clip proba_pos_class into [eps, 1 - eps]
    proba_pos_class = np.clip(proba_pos_class, eps, 1 - eps)
    # inverse of the sigmoid; the probability is recovered later by applying the sigmoid again
    # log(x / (1 - x)) is the inverse of the sigmoid (expit) function
    raw_predictions = np.log(proba_pos_class / (1 - proba_pos_class))
    # after the reshape the shape becomes (n, 1), which lines up with raw_predictions[:, k]
    # in _update_terminal_regions; for binary classification there is a single column (K = 1)
    return raw_predictions.reshape(-1, 1).astype(np.float64)
prior_model = DummyClassifier(strategy='prior')
prior_model.fit(X=iris_df.iloc[:, 0:4], y=iris_df.target_name)
DummyClassifier(constant=None, random_state=None, strategy='prior')
prior_model.predict_proba(X=iris_df.iloc[:, 0:4])[0:5]
array([[0.625, 0.375],
[0.625, 0.375],
[0.625, 0.375],
[0.625, 0.375],
[0.625, 0.375]])
Simply put, if the ratio of positive to negative samples is 3:5, then p(y=1) is predicted as 0.375 for every sample, and raw_predictions is computed through the inverse of the sigmoid. The key point is that raw_predictions is just a continuous score, the running sum that is accumulated step by step during the iterations, and it is mapped into (0, 1) by the sigmoid only at the end. The numpy functions used above can be looked up in the docs.
Let's verify the logic:
from sklearn.ensemble._gb_losses import BinomialDeviance
gbm_loss = BinomialDeviance(2)
gbm_loss.K
1
As expected, K is forced to 1, regardless of the constructor argument.
raw_predictions = gbm_loss.get_init_raw_predictions(X=iris_df.iloc[:, 0:4],estimator=prior_model)
raw_predictions.shape
(80, 1)
raw_predictions[0:5]
array([[-0.51082562],
[-0.51082562],
[-0.51082562],
[-0.51082562],
[-0.51082562]])
Now verify the sigmoid transformation:
from scipy.special import expit
print(expit(-0.51082562),1/(1+np.exp(0.51082562)))
0.37500000088265406 0.37500000088265406
This matches the prior probability predicted by DummyClassifier.
The Iteration Process
n_stages = self._fit_stages(
X, y, raw_predictions, sample_weight, self._rng, X_val, y_val,
sample_weight_val, begin_at_stage, monitor, X_idx_sorted)
The main logic of self._fit_stages is straightforward: a for loop over self.n_estimators rounds, each calling self._fit_stage. What matters is which values each round trains on and how the model is updated, i.e.:
raw_predictions = self._fit_stage(i, X, y, raw_predictions, sample_weight, sample_mask, random_state, X_idx_sorted, X_csc, X_csr)
For each iteration, the main steps are:
- Compute the negative gradient from the previous round's raw_predictions
- Fit a regression tree to the negative gradient, update the leaf predictions, and update the tree structure
- Update raw_predictions and store the tree in self.estimators_
Let's walk through a single model-update round.
The regression tree is trained with X = the feature data and y = the negative gradient.
# first convert the string labels into 0/1
y = gbm._validate_y(iris_df.target_name,sample_weight=None)
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
raw_predictions[0:5]
array([[-0.51082562],
[-0.51082562],
[-0.51082562],
[-0.51082562],
[-0.51082562]])
residual = gbm.loss_.negative_gradient(y,raw_predictions)
residual
array([-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625,
0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625,
0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625,
0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625])
def negative_gradient(self, y, raw_predictions, **kargs):
    """Compute the residual (= negative gradient).

    Parameters
    ----------
    y : 1d array, shape (n_samples,)
        True labels.

    raw_predictions : 2d array, shape (n_samples, K)
        The raw_predictions (i.e. values from the tree leaves) of the
        tree ensemble at iteration ``i - 1``.
    """
    # expit(x) = 1/(1+exp(-x))
    return y - expit(raw_predictions.ravel())
The formula for the negative gradient was derived earlier, and the code above is clear: y is the true label in {0, 1} and raw_predictions is the previous round's prediction. Let's verify it; for the first sample (y = 0, f = -0.51082562), the derivative of the loss is
$$\begin{aligned} \frac{\delta L}{\delta f} &=-y+\frac{e^{f}}{1+e^{f}} =-y+\frac{1}{1+e^{-f}} \\ &=-0+1/\left(1+e^{-(-0.51082562)}\right) =1/\left(1+e^{0.51082562}\right) =0.3750 \end{aligned}$$
and the residual is its negative, -0.375, matching the output above.
1/(1+math.exp(0.51082562))
0.37500000088265406
Fitting the Regression Tree & Updating the Leaf Nodes
The parameters may differ from those used in the source; that doesn't matter, we only care about the main idea.
dtree = DecisionTreeRegressor(criterion="friedman_mse")
dtree.fit(iris_df.iloc[:, 0:4], residual)
DecisionTreeRegressor(ccp_alpha=0.0, criterion='friedman_mse', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
from IPython.display import Image
import pydotplus
dot_data = tree.export_graphviz(dtree, out_file=None,
feature_names=iris_data.feature_names,
class_names=iris_data.target_names,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
The formula for updating the leaf predictions is quite intuitive, but the code involves many tree methods that are less familiar; let's work through it as best we can.
def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                            residual, raw_predictions, sample_weight):
    """Make a single Newton-Raphson step.

    our node estimate is given by:

        sum(w * (y - prob)) / sum(w * prob * (1 - prob))

    we take advantage that: y - prob = residual
    """
    terminal_region = np.where(terminal_regions == leaf)[0]
    residual = residual.take(terminal_region, axis=0)
    y = y.take(terminal_region, axis=0)
    sample_weight = sample_weight.take(terminal_region, axis=0)

    numerator = np.sum(sample_weight * residual)
    denominator = np.sum(sample_weight *
                         (y - residual) * (1 - y + residual))

    # prevents overflow and division by zero
    # this decides the leaf's output value
    if abs(denominator) < 1e-150:
        tree.value[leaf, 0, 0] = 0.0
    else:
        tree.value[leaf, 0, 0] = numerator / denominator
dtree.apply(iris_df.iloc[:,0:4])
array([ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 14, 2, 2, 2, 2, 2, 2, 11, 2, 2, 2, 2, 2, 6,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 16,
16, 16, 16, 16, 16, 10, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16, 4, 16, 16, 16, 16, 16, 16, 15, 16, 16, 7], dtype=int64)
dtree.tree_.children_left
array([ 1, 2, -1, 4, -1, 6, -1, -1, 9, 10, -1, -1, 13, 14, -1, -1, -1],
dtype=int64)
Nodes 3, 5, 7, ... are leaf nodes. Let's check whether the predicted value of the 3rd node equals the mean of the fitted target (the residuals).
Note: "the 3rd node" counts from 1, while the apply method returns 0-based indices, so it corresponds to node id 2 below.
np.where(dtree.apply(iris_df.iloc[:,0:4]) == 2)[0]
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], dtype=int64)
for i in [2, 4, 6, 7]:
    print(
        "leaf num - %s:" % i, "(check value %s)" %
        residual[np.where(dtree.apply(iris_df.iloc[:, 0:4]) == i)[0]].mean(),
        "(real value%s)" % dtree.tree_.value[i])
leaf num - 2: (check value -0.37499999999999994) (real value[[-0.375]])
leaf num - 4: (check value 0.625) (real value[[0.625]])
leaf num - 6: (check value -0.37499999999999994) (real value[[-0.375]])
leaf num - 7: (check value 0.625) (real value[[0.625]])
resudial_sample = residual.take(np.where(dtree.apply(iris_df.iloc[:,0:4]) == 2)[0])
y_sample = gbm._validate_y(iris_df.target_name,sample_weight=None).take(np.where(dtree.apply(iris_df.iloc[:,0:4]) == 2)[0], axis=0)
numerator = np.sum(resudial_sample)
denominator = np.sum((y_sample - resudial_sample) * (1 - y_sample + resudial_sample))
numerator / denominator
-1.5999999999999999
In an ordinary CART regression tree, the prediction (output value) of a leaf is the mean of the target values of the training samples that fall into that leaf (optimal under squared error). Binary GBDT, however, does not aim at matching the negative gradient as closely as possible; it aims directly at the original objective, minimizing the log loss. That is:
$$c_{tj}=\underbrace{\arg \min }_{c} \sum_{x_{i} \in R_{tj}} L\left(y_{i}, f_{t-1}\left(x_{i}\right)+c\right)$$
This is difficult to optimize exactly, so the following approximation is generally used instead:
$$c_{tj}=\frac{\sum_{x_{i} \in R_{tj}} r_{ti}}{\sum_{x_{i} \in R_{tj}}\left(y_{i}-r_{ti}\right)\left(1-y_{i}+r_{ti}\right)}$$
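This "approximation" is in fact a single Newton-Raphson step, which the sklearn docstring below also states. As a short supplement: writing $p_i$ for the probability predicted in the previous round, the residual is $r_{ti}=y_i-p_i$, and the first and second derivatives of the log loss with respect to the raw score are
$$g_i=p_i-y_i=-r_{ti}, \qquad h_i=p_i\left(1-p_i\right)=\left(y_{i}-r_{ti}\right)\left(1-y_{i}+r_{ti}\right)$$
so one Newton step over the leaf gives
$$c_{tj}=-\frac{\sum_{x_i\in R_{tj}} g_i}{\sum_{x_i\in R_{tj}} h_i}=\frac{\sum_{x_{i} \in R_{tj}} r_{ti}}{\sum_{x_{i} \in R_{tj}}\left(y_{i}-r_{ti}\right)\left(1-y_{i}+r_{ti}\right)}$$
which is exactly the numerator and denominator that appear in the code.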
Let's look at the code:
# update the predicted value of a single leaf node
def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                            residual, raw_predictions, sample_weight):
    """Make a single Newton-Raphson step.

    our node estimate is given by:

        sum(w * (y - prob)) / sum(w * prob * (1 - prob))

    we take advantage that: y - prob = residual
    """
    # find the indices of the samples belonging to this leaf
    terminal_region = np.where(terminal_regions == leaf)[0]
    # take the corresponding negative gradients and label values
    residual = residual.take(terminal_region, axis=0)
    y = y.take(terminal_region, axis=0)
    # sample weights; ignore them for now
    sample_weight = sample_weight.take(terminal_region, axis=0)
    # sum of the negative gradients
    numerator = np.sum(sample_weight * residual)
    denominator = np.sum(sample_weight *
                         (y - residual) * (1 - y + residual))

    # prevents overflow and division by zero
    # this decides the leaf's output value
    if abs(denominator) < 1e-150:
        tree.value[leaf, 0, 0] = 0.0
    else:
        tree.value[leaf, 0, 0] = numerator / denominator
tree is a mutable object, so its values are modified in place and the changes are visible across the functions it is passed to; list and array behave the same way. The residuals are later updated in place through the same mechanism.
Now let's verify the computation above.
First look at the tree plot of the first estimator, which shows the leaf values at a glance:
from IPython.display import Image
import pydotplus
dot_data = tree.export_graphviz(gbm.estimators_[0][0], out_file=None,
feature_names=iris_data.feature_names,
class_names=iris_data.target_names,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
gbm.estimators_[0][0].apply(iris_df.iloc[:, 0:4])
array([ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 11, 2, 2, 2, 2, 2, 2, 9, 2, 2, 2, 2, 2, 4,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12,
12, 12, 12, 12, 12, 8, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 4, 12, 12, 12, 12, 12, 12, 11, 12, 12, 5], dtype=int64)
gbm.estimators_[0][0].tree_.children_left
array([ 1, 2, -1, 4, -1, -1, 7, 8, -1, -1, 11, -1, -1], dtype=int64)
The source was modified so that the intermediate values of each round are saved on the estimator (the residuals are read from self.residual_list below).
Compute the output value corresponding to leaf node [3] (node id 2 in 0-based indexing):
leaf2_sample_index = np.where(gbm.estimators_[0][0].apply(iris_df.iloc[:, 0:4])==2)[0]
leaf2_sample_index
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], dtype=int64)
y_sample = gbm._validate_y(iris_df.target_name,sample_weight=None).take(leaf2_sample_index, axis=0)
resudial_sample = gbm.residual_list[0][0].take(leaf2_sample_index, axis=0)
numerator = np.sum(resudial_sample)
denominator = np.sum((y_sample - resudial_sample) * (1 - y_sample + resudial_sample))
numerator / denominator
-1.5999999999999999
gbm.estimators_[0][0].tree_.value[2, 0, 0]
-1.5999999999999999
Updating raw_predictions
The regression-tree fit above produces a new tree; raw_predictions is then updated with
raw_predictions[:, k] += \
    learning_rate * tree.value[:, 0, 0].take(terminal_regions, axis=0)
- For binary classification K = 1, so there is a single column (k = 0); multiclass is not covered here
- raw_predictions is updated by adding the new tree's predictions, scaled by the learning rate, to its previous values (see the sketch below)
- This is exactly the additive-model logic
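To make the update concrete, here is a rough sketch (my illustration, ignoring sample weights) of what that line does in the binary case, reusing the dtree fitted above. Note that in the real _fit_stage the leaf values have already been overwritten by _update_terminal_regions before this addition, whereas our dtree still holds the plain residual means:

# leaf id that each training sample falls into
terminal_regions = dtree.apply(iris_df.iloc[:, 0:4])
# add each sample's leaf output, scaled by the learning rate, to the running score
raw_predictions[:, 0] += gbm.learning_rate * dtree.tree_.value[:, 0, 0].take(terminal_regions, axis=0)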
The Second Round
- The updated raw_predictions and y are used to compute the new negative gradient
- (X, residual) is used to train a new regression tree
- raw_predictions keeps being updated in the same way

Stopping Condition
The iteration stops once the preset n_estimators rounds have been run.
The trees produced during the iterations are stored in self.estimators_.
Model Prediction
raw_prediction
Prediction happens in two steps:
- The N trees produce N results, the raw_prediction
- The N results are multiplied by learning_rate, summed, and added to the initial value; the sigmoid then converts the total into a probability in (0, 1)
$$F_{M}(x)=F_{0}(x)+\sum_{m=1}^{M} \sum_{j=1}^{J_{m}} c_{m, j} I\left(x \in R_{m, j}\right)$$
# unless specified otherwise, this returns the prior, i.e. based on the majority class proportion
def _raw_predict_init(self, X):
    """Check input and compute raw predictions of the init estimtor."""
    self._check_initialized()
    X = self.estimators_[0, 0]._validate_X_predict(X, check_input=True)
    if X.shape[1] != self.n_features_:
        raise ValueError("X.shape[1] should be {0:d}, not {1:d}.".format(
            self.n_features_, X.shape[1]))
    if self.init_ == 'zero':
        raw_predictions = np.zeros(shape=(X.shape[0], self.loss_.K),
                                   dtype=np.float64)
    else:
        raw_predictions = self.loss_.get_init_raw_predictions(
            X, self.init_).astype(np.float64)
    return raw_predictions
# as the docstring says, this returns the sum of the trees' raw predictions;
# the heavy lifting is done in predict_stages, which is compiled, so the source is hidden here
def _raw_predict(self, X):
    """Return the sum of the trees raw predictions (+ init estimator)."""
    # initial prediction
    raw_predictions = self._raw_predict_init(X)
    # run through the stages to accumulate the final raw_predictions
    predict_stages(self.estimators_, X, self.learning_rate,
                   raw_predictions)
    return raw_predictions
The core predict_stages routine lives in a compiled extension, so its source is not directly visible; let's simply verify the logic described above.
The initial value follows the same logic as in the initialization section: gbm.init_ is the DummyClassifier; we first get the predicted probability, then apply the inverse of the sigmoid to obtain the raw_prediction.
gbm.loss_.get_init_raw_predictions(iris_df.iloc[0:2,0:4],gbm.init_)
array([[-0.51082562],
[-0.51082562]])
Next, collect each tree's prediction multiplied by learning_rate:
pre_list = []
for i in gbm.estimators_:
    pre_list.append(i[0].predict(iris_df.iloc[0:1,0:4])[0]*gbm.learning_rate)
pre_list
[-0.16,
-0.1511286273379727,
-0.14395717809370576,
-0.13806361187211877,
-0.13315505338282463]
gbm.loss_.get_init_raw_predictions(iris_df.iloc[0:1,0:4],gbm.init_)[0][0]+sum(pre_list)
-1.2371300944526125
predict_proba
The sigmoid turns the raw predictions into probabilities.
def predict_proba(self, X):
    raw_predictions = self.decision_function(X)
    try:
        # applies the sigmoid (expit) to map the raw predictions to probabilities
        return self.loss_._raw_prediction_to_proba(raw_predictions)
    except NotFittedError:
        raise
    except AttributeError:
        raise AttributeError('loss=%r does not support predict_proba' %
                             self.loss)
gbm.predict_proba(iris_df.iloc[0:1,0:4])
array([[0.77506407, 0.22493593]])
expit(gbm.loss_.get_init_raw_predictions(iris_df.iloc[0:1,0:4],gbm.init_)[0][0]+sum(pre_list))
0.2249359293642033
This matches the direct prediction; the logic is verified.
predict
predict includes a label-decoding step: during initialization, gbm._validate_y encoded the labels; in this case 'versicolor' and 'virginica' were mapped to 0 and 1.
def predict(self, X):
    # some of this is not pure Python; in short, the predictions of all models are accumulated into one result
    raw_predictions = self.decision_function(X)
    # find the index of the largest probability
    encoded_labels = \
        self.loss_._raw_prediction_to_decision(raw_predictions)
    # encoded_labels holds values like [0, 1]; return maps them back to the original y values, e.g. (good, bad) or (+1, -1)
    return self.classes_.take(encoded_labels, axis=0)

-------------------------------------------------------------------------------------

def _raw_prediction_to_proba(self, raw_predictions):
    proba = np.ones((raw_predictions.shape[0], 2), dtype=np.float64)  # N rows of [1, 1] to hold the probabilities
    proba[:, 1] = expit(raw_predictions.ravel())  # the sigmoid maps the raw values into (0, 1)
    proba[:, 0] -= proba[:, 1]
    return proba

def _raw_prediction_to_decision(self, raw_predictions):
    # convert to probabilities
    proba = self._raw_prediction_to_proba(raw_predictions)
    return np.argmax(proba, axis=1)  # returns the index of the larger probability (0 or 1); predict then maps that index to the class label
- decision_function returns the raw_prediction
- _raw_prediction_to_decision takes the index of the larger probability, which is 0 or 1
- predict maps that index back to the corresponding original class label
Summary
Model training
- Initialize the 0th weak learner $F_0$; depending on the init argument this is typically a DummyClassifier, which yields the initial raw_prediction
- Train n_estimators trees, where each round:
  - 1) computes the negative gradient (residual) from raw_prediction and y
  - 2) trains a CART regression tree on (features X, negative gradient residual)
  - 3) updates the output value of each leaf node
  - 4) updates raw_prediction as raw_prediction + learning_rate * leaf_output, i.e. adds the learning rate times the tree's prediction
- The four steps above are repeated for each iteration
- Once the specified number of trees is reached, the iteration stops
Model prediction
- First obtain the initial value, i.e. the prediction made by the 0th weak learner above
- Then multiply the predictions of the n_estimators trees by the learning rate
- Add the two together, and pass the sum through the sigmoid to get the probability (a compact end-to-end sketch follows below)
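Putting the whole summary together, here is a compact from-scratch sketch of the training and prediction logic above (my illustration, not the sklearn implementation; no subsampling or sample weights; fit_gbdt_binary and predict_proba_gbdt are hypothetical helper names):

import numpy as np
from scipy.special import expit
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt_binary(X, y, n_estimators=5, learning_rate=0.1, max_depth=3):
    # F_0: prior log-odds of the positive class, the same idea as
    # DummyClassifier(strategy='prior') followed by the inverse sigmoid
    p0 = np.clip(y.mean(), 1e-15, 1 - 1e-15)
    raw = np.full(X.shape[0], np.log(p0 / (1 - p0)))
    f0 = raw[0]
    trees = []
    for _ in range(n_estimators):
        residual = y - expit(raw)                          # 1) negative gradient
        tree = DecisionTreeRegressor(max_depth=max_depth, criterion="friedman_mse")
        tree.fit(X, residual)                              # 2) CART regression tree on (X, residual)
        leaves = tree.apply(X)
        for leaf in np.unique(leaves):                     # 3) one Newton step per leaf
            idx = leaves == leaf
            num = residual[idx].sum()
            den = ((y[idx] - residual[idx]) * (1 - y[idx] + residual[idx])).sum()
            tree.tree_.value[leaf, 0, 0] = 0.0 if abs(den) < 1e-150 else num / den
        raw += learning_rate * tree.tree_.value[:, 0, 0].take(leaves)   # 4) additive update
        trees.append(tree)
    return f0, trees

def predict_proba_gbdt(f0, trees, learning_rate, X):
    raw = np.full(X.shape[0], f0)
    for tree in trees:
        raw += learning_rate * tree.predict(X)             # accumulate raw scores
    return expit(raw)                                      # sigmoid at the very end

# usage on the two-class iris subset from above; this should track gbm.predict_proba closely,
# up to tie-breaking differences in the tree splits
X_arr = iris_df.iloc[:, 0:4].values
y_arr = (iris_df.target_name == "virginica").values.astype(float)
f0, trees = fit_gbdt_binary(X_arr, y_arr)
print(predict_proba_gbdt(f0, trees, 0.1, X_arr)[:3])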
Ref
[1] https://www.6aiq.com/article/1583859185789
[2] https://liangyaorong.github.io/blog/2017/scikit-learn%E4%B8%AD%E7%9A%84GBDT%E5%AE%9E%E7%8E%B0/#2.6
[3] https://blog.csdn.net/qq_22238533/article/details/79192579
[4] https://www.cnblogs.com/pinard/p/6140514.html#!comments
This dragged on for almost a month. I'm still too lazy, hopelessly lazy. Neither clever nor diligent, and still looking for shortcuts. A big mistake!
2021-05-18, Jiulonghu, Jiangning District, Nanjing