GBDT Principles and Sklearn Source Code Analysis: Classification

Abstract:

Following the previous article, which covered GBDT for regression, this article looks at GBDT for classification. As you will see, the GBDT procedure is practically identical for regression and classification. The biggest differences, caused by the different loss function, are the initialization and the leaf node values.

Main text:

The basic principles of gradient boosting were already covered in the previous article, so let's get straight to the point.
Below is the GBDT algorithm for the classification task, with logloss as the chosen loss function:
$$L(y_i, F_m(x_i)) = -\{y_i \log p_i + (1-y_i)\log(1-p_i)\}$$
where $p_i = \dfrac{1}{1+e^{-F_m(x_i)}}$.


Here is a quick derivation of the commonly used simplified form of logloss:
$$L(y_i, F_m(x_i)) = -\{y_i \log p_i + (1-y_i)\log(1-p_i)\}$$
(leave the leading minus sign aside for now)
Substituting $p_i$:
$$y_i \log\!\left(\frac{1}{1+e^{-F_m(x_i)}}\right) + (1-y_i)\log\!\left(\frac{e^{-F_m(x_i)}}{1+e^{-F_m(x_i)}}\right)$$
$$= -y_i\log\!\left(1+e^{-F_m(x_i)}\right) + (1-y_i)\left\{\log\!\left(e^{-F_m(x_i)}\right) - \log\!\left(1+e^{-F_m(x_i)}\right)\right\}$$
$$= -y_i\log\!\left(1+e^{-F_m(x_i)}\right) + \log\!\left(e^{-F_m(x_i)}\right) - \log\!\left(1+e^{-F_m(x_i)}\right) - y_i\log\!\left(e^{-F_m(x_i)}\right) + y_i\log\!\left(1+e^{-F_m(x_i)}\right)$$
$$= y_i F_m(x_i) - \log\!\left(1+e^{F_m(x_i)}\right)$$
Adding the minus sign back gives:
$$L(y_i, F_m(x_i)) = -\{y_i \log p_i + (1-y_i)\log(1-p_i)\} = -\left\{y_i F_m(x_i) - \log\!\left(1+e^{F_m(x_i)}\right)\right\}$$
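
As a quick numerical sanity check of this simplification (a small sketch, not part of the original derivation), the two forms of the loss can be compared directly:

```python
import numpy as np

def logloss_prob_form(y, F):
    """-{y*log(p) + (1-y)*log(1-p)} with p = sigmoid(F)."""
    p = 1.0 / (1.0 + np.exp(-F))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def logloss_simplified(y, F):
    """-{y*F - log(1 + exp(F))}, the simplified form derived above."""
    return -(y * F - np.log(1.0 + np.exp(F)))

y = np.array([0, 1, 1, 0])
F = np.array([-0.4, 0.3, 2.0, -1.5])
print(np.allclose(logloss_prob_form(y, F), logloss_simplified(y, F)))  # True
```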


Algorithm 3: BinomialDeviance_TreeBoost

$F_0(x) = \log\!\left(\dfrac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} (1-y_i)}\right)$
For $m = 1$ to $M$ do:
    $\tilde{y}_i = -\left[\dfrac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)} = y_i - \dfrac{1}{1+e^{-F_{m-1}(x_i)}}$
    $\{R_{jm}\}_{j=1}^{J} = J\text{-terminal node tree}\!\left(\{\tilde{y}_i, x_i\}_{i=1}^{N}\right)$
    $\gamma_{jm} = \dfrac{\sum_{x_i \in R_{jm}} \tilde{y}_i}{\sum_{x_i \in R_{jm}} (y_i - \tilde{y}_i)(1 - y_i + \tilde{y}_i)}$
    $F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm})$

Algorithm 3 is the GBDT procedure for classification when the loss function is logloss.
As you can see, it is the same as for regression; there is no special handling.
(In fact, in the sklearn source, although the regression model is defined as GradientBoostingRegressor() and the classification model as GradientBoostingClassifier(), the two are separated only for user convenience. Both inherit from BaseGradientBoosting(), and the whole flow of Algorithm 3 is carried out in BaseGradientBoosting(); GradientBoostingRegressor() and GradientBoostingClassifier() merely handle some estimator parameter configuration.)
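
To make the flow of Algorithm 3 concrete, here is a minimal NumPy/sklearn sketch of it (my own illustration, not sklearn's implementation; it fits the J-terminal-node tree with DecisionTreeRegressor and assumes binary labels in {0, 1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt_logloss(X, y, n_trees=2, max_depth=1, lr=0.1):
    """Minimal GBDT for binary classification with logloss (Algorithm 3)."""
    # Initialization: prior log-odds, F0 = log(pos / neg)
    F0 = np.log(y.sum() / (len(y) - y.sum()))
    F = np.full(len(y), F0)
    trees, leaf_values = [], []

    for m in range(n_trees):
        # Negative gradient (residual) = y - sigmoid(F_{m-1})
        p = 1.0 / (1.0 + np.exp(-F))
        residual = y - p

        # Fit a regression tree to the residuals (MSE split criterion)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        leaves = tree.apply(X)

        # One Newton step per leaf:
        # gamma = sum(residual) / sum((y - residual) * (1 - y + residual))
        gamma = {}
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            num = residual[mask].sum()
            den = ((y[mask] - residual[mask]) * (1 - y[mask] + residual[mask])).sum()
            gamma[leaf] = num / den

        # Update F with shrinkage (learning rate)
        F = F + lr * np.array([gamma[l] for l in leaves])
        trees.append(tree)
        leaf_values.append(gamma)

    return F0, trees, leaf_values, F

# The toy dataset used in the worked example below
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1])
F0, trees, leaf_values, F = fit_gbdt_logloss(X, y)
print(F)   # per-sample scores F_2(x_i)
```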

Example

Let's again use a simple dataset to walk through the GBDT process.

| $x_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| $y_i$ | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |

Parameter configuration (a quick sklearn sketch with this setup follows the list):
1. logloss as the loss function
2. MSE as the split criterion
3. tree depth of 1
4. learning rate of 0.1
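
The sketch below shows roughly how this configuration maps onto sklearn's estimator on the toy data; note that loss='deviance' is the spelling used in older sklearn versions (newer versions call it 'log_loss'):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1])

# logloss, MSE splits, depth-1 trees (stumps), learning rate 0.1, 2 boosting rounds
clf = GradientBoostingClassifier(loss='deviance', n_estimators=2,
                                 max_depth=1, learning_rate=0.1)
clf.fit(X, y)
print(clf.predict_proba(X)[:, 1])  # probability of the positive class
```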


Step 1 of Algorithm 3: initialization.
$$F_0(x) = \log\!\left(\frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} (1-y_i)}\right) = \log\!\left(\frac{4}{6}\right) = -0.4054$$


Fitting the first tree ($m=1$)
Compute the negative gradients:
$$\tilde{y}_i = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)} = y_i - \frac{1}{1+e^{-F_{m-1}(x_i)}} = y_i - \frac{1}{1+e^{-F_0(x_i)}}$$

For example, for the first sample ($i=1$):
$$\tilde{y}_1 = 0 - \frac{1}{1+e^{0.4054}} = -0.400$$
Likewise, the other samples give:

| $x_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| $\tilde{y}_i$ | -0.4 | -0.4 | -0.4 | 0.6 | 0.6 | -0.4 | -0.4 | -0.4 | 0.6 | 0.6 |
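
The initialization and the residuals in this table can be reproduced with a few lines of NumPy (a small sketch, assuming labels in {0, 1}):

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1])
F0 = np.log(y.sum() / (1 - y).sum())   # log(4/6) = -0.4054
p = 1.0 / (1.0 + np.exp(-F0))          # 0.4
residual = y - p                       # -0.4 for y=0, 0.6 for y=1
print(F0, residual)
```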

Next, we fit a tree with $\tilde{y}_i$ as the target.
The tree-fitting procedure was covered in detail in the previous article, so it is not repeated here. The fitted tree looks as follows:
[Figure: the fitted depth-1 regression tree, which splits at $x_i \le 8$]

The leaf node regions of the fitted tree are:
$R_{11}$: $x_i \le 8$; $\quad R_{21}$: $x_i > 8$
Now compute the leaf node values $\gamma_{jm}$
using the formula $\gamma_{jm} = \dfrac{\sum_{x_i \in R_{jm}} \tilde{y}_i}{\sum_{x_i \in R_{jm}} (y_i - \tilde{y}_i)(1 - y_i + \tilde{y}_i)}$.
For region $R_{11}$:
$$\sum_{x_i \in R_{11}} \tilde{y}_i = \tilde{y}_1 + \tilde{y}_2 + \cdots + \tilde{y}_8 = -1.2$$
$$\sum_{x_i \in R_{11}} (y_i - \tilde{y}_i)(1 - y_i + \tilde{y}_i) = \sum_{i=1}^{8} (y_i - \tilde{y}_i)(1 - y_i + \tilde{y}_i) = 8 \times 0.4 \times 0.6 = 1.92$$

For region $R_{21}$:
$$\sum_{x_i \in R_{21}} \tilde{y}_i = \tilde{y}_9 + \tilde{y}_{10} = 1.2$$
$$\sum_{x_i \in R_{21}} (y_i - \tilde{y}_i)(1 - y_i + \tilde{y}_i) = (y_9 - \tilde{y}_9)(1 - y_9 + \tilde{y}_9) + (y_{10} - \tilde{y}_{10})(1 - y_{10} + \tilde{y}_{10}) = 0.48$$

So the two leaf values are:
$$\gamma_{11} = \frac{-1.2}{1.92} = -0.625, \qquad \gamma_{21} = \frac{1.2}{0.48} = 2.5$$
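
These two leaf values can be checked numerically (region membership hard-coded from the split $x_i \le 8$ found by the tree):

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1])
residual = np.array([-0.4, -0.4, -0.4, 0.6, 0.6, -0.4, -0.4, -0.4, 0.6, 0.6])
in_R11 = np.arange(1, 11) <= 8   # the split found by the depth-1 tree

def leaf_value(y, r):
    # Newton step: sum(residual) / sum((y - residual) * (1 - y + residual))
    return r.sum() / ((y - r) * (1 - y + r)).sum()

print(leaf_value(y[in_R11], residual[in_R11]))    # -0.625
print(leaf_value(y[~in_R11], residual[~in_R11]))  # 2.5
```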

Finally, $F_1(x)$ is updated via $F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J}\gamma_{jm} I(x \in R_{jm})$. Note that shrinkage is applied here as well, i.e. the update is multiplied by a learning rate $\eta$:
$$F_m(x) = F_{m-1}(x) + \eta \sum_{j=1}^{J}\gamma_{jm} I(x \in R_{jm})$$

Taking $x_1$ as an example:
$$F_1(x_1) = F_0(x_1) + 0.1 \times (-0.625) = -0.4054 - 0.0625 = -0.4679$$
The remaining values, after computing them all, are listed below for reference:

| $x_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| $F_1(x_i)$ | -0.46796511 | -0.46796511 | -0.46796511 | -0.46796511 | -0.46796511 | -0.46796511 | -0.46796511 | -0.46796511 | -0.15546511 | -0.15546511 |
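
The shrinkage update itself is one line (a quick check of the table values):

```python
import numpy as np

F0 = np.log(4 / 6)
gamma = np.array([-0.625] * 8 + [2.5] * 2)  # leaf value assigned to each sample
F1 = F0 + 0.1 * gamma                       # shrinkage with learning rate 0.1
print(F1)  # -0.46796511 for x1..x8, -0.15546511 for x9, x10
```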

At this point the first tree has been trained. Once again, the training process is essentially no different from regression.


Now, briefly, fitting the second tree ($m=2$).

Compute the negative gradients:
For example, for $x_1$:
$$\tilde{y}_1 = y_1 - \frac{1}{1+e^{-F_1(x_1)}} = 0 - 0.38509 = -0.38509$$
The others follow in the same way:

| $x_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| $\tilde{y}_i$ | -0.38509799 | -0.38509799 | -0.38509799 | 0.61490201 | 0.61490201 | -0.38509799 | -0.38509799 | -0.38509799 | 0.53878818 | 0.53878818 |

As before, a regression tree is then fitted with the new $\tilde{y}_i$ as the target, followed by computing the leaf regions and leaf values.


Prediction

With only 2 trees, prediction works the same way as in the regression case: sum the initial score and the (shrunken) outputs of the two trees.
[Figure: prediction as the sum of $F_0$ and the two trees' leaf outputs]

Compared with regression, in classification the final accumulated result $F_m(x)$ has to be converted into a probability ($F_m(x)$ can be thought of as a score). Specifically:
When logloss is the loss function, $p_i = \frac{1}{1+e^{-F_m(x_i)}}$.
When exponential loss is the loss function, $p_i = \frac{1}{1+e^{-2F_m(x_i)}}$.
Here $p_i$ is the probability of the positive class.

To make this more concrete with the example above: after fitting the second tree, $F_2(x)$ is:

| $x_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| $F_2(x_i)$ | -0.52501722 | -0.52501722 | -0.52501722 | -0.52501722 | -0.52501722 | -0.52501722 | -0.52501722 | -0.52501722 | 0.06135501 | 0.06135501 |

The corresponding probability values are then:

| $x_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| $p_i$ | 0.37167979 | 0.37167979 | 0.37167979 | 0.37167979 | 0.37167979 | 0.37167979 | 0.37167979 | 0.37167979 | 0.51533394 | 0.51533394 |

(The probabilities in the table are for the positive class, i.e. the probability that $y_i = 1$.)
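
The score-to-probability conversion in these tables can be checked with scipy's expit (a small sketch):

```python
import numpy as np
from scipy.special import expit  # expit(x) = 1 / (1 + exp(-x))

F2 = np.array([-0.52501722] * 8 + [0.06135501] * 2)
p = expit(F2)                    # probability of the positive class
print(p)  # ~0.3717 for x1..x8, ~0.5153 for x9, x10
```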

A brief analysis of the Sklearn source code

A note up front: I may expand this source code analysis later when I have time; for now, let's take a quick look at the core code of GBDT classification.


When the loss function is logloss, it corresponds to loss='deviance' in sklearn.
Computing the negative gradient, initialization, updating the leaf nodes, and converting scores to probabilities are all handled in a class named BinomialDeviance().

class BinomialDeviance(ClassificationLossFunction):
    """Binomial deviance loss function for binary classification.

    Binary classification is a special case; here, we only need to
    fit one tree instead of ``n_classes`` trees.
    """
    def __init__(self, n_classes):
        if n_classes != 2:
            raise ValueError("{0:s} requires 2 classes.".format(
                self.__class__.__name__))
        # we only need to fit one tree for binary clf.
        super(BinomialDeviance, self).__init__(1)

    def init_estimator(self):
        return LogOddsEstimator()

    def __call__(self, y, pred, sample_weight=None):
        """Compute the deviance (= 2 * negative log-likelihood). """
        # logaddexp(0, v) == log(1.0 + exp(v))
        pred = pred.ravel()
        if sample_weight is None:
            return -2.0 * np.mean((y * pred) - np.logaddexp(0.0, pred))
        else:
            return (-2.0 / sample_weight.sum() *
                    np.sum(sample_weight * ((y * pred) - np.logaddexp(0.0, pred))))

    def negative_gradient(self, y, pred, **kargs):
        """Compute the residual (= negative gradient). """
        return y - expit(pred.ravel())

    def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                                residual, pred, sample_weight):
        """Make a single Newton-Raphson step.

        our node estimate is given by:

            sum(w * (y - prob)) / sum(w * prob * (1 - prob))

        we take advantage that: y - prob = residual
        """
        terminal_region = np.where(terminal_regions == leaf)[0]
        residual = residual.take(terminal_region, axis=0)
        y = y.take(terminal_region, axis=0)
        sample_weight = sample_weight.take(terminal_region, axis=0)

        numerator = np.sum(sample_weight * residual)
        denominator = np.sum(sample_weight * (y - residual) * (1 - y + residual))
        # prevents overflow and division by zero
        if abs(denominator) < 1e-150:
            tree.value[leaf, 0, 0] = 0.0
        else:
            tree.value[leaf, 0, 0] = numerator / denominator

    def _score_to_proba(self, score):
        proba = np.ones((score.shape[0], 2), dtype=np.float64)
        proba[:, 1] = expit(score.ravel())
        proba[:, 0] -= proba[:, 1]
        return proba

    def _score_to_decision(self, score):
        proba = self._score_to_proba(score)
        return np.argmax(proba, axis=1)

The following code computes the negative gradient. Note that the function expit is exactly $\frac{1}{1+e^{-x}}$.
In the code, y_pred or pred represents $F_{m-1}(x)$.

    def negative_gradient(self, y, pred, **kargs):
        """Compute the residual (= negative gradient). """
        return y - expit(pred.ravel())

Updating the leaf nodes: the key is the computation of numerator and denominator.
Also note that residual in the code is the negative gradient.

    def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                                residual, pred, sample_weight):
        """Make a single Newton-Raphson step.

        our node estimate is given by:

            sum(w * (y - prob)) / sum(w * prob * (1 - prob))

        we take advantage that: y - prob = residual
        """
        terminal_region = np.where(terminal_regions == leaf)[0]
        residual = residual.take(terminal_region, axis=0)
        y = y.take(terminal_region, axis=0)
        sample_weight = sample_weight.take(terminal_region, axis=0)

        numerator = np.sum(sample_weight * residual)
        denominator = np.sum(sample_weight * (y - residual) * (1 - y + residual))
        # prevents overflow and division by zero
        if abs(denominator) < 1e-150:
            tree.value[leaf, 0, 0] = 0.0
        else:
            tree.value[leaf, 0, 0] = numerator / denominator

The class used for initialization:

class LogOddsEstimator(object):
    """An estimator predicting the log odds ratio."""
    scale = 1.0

    def fit(self, X, y, sample_weight=None):
        # pre-cond: pos, neg are encoded as 1, 0
        if sample_weight is None:
            pos = np.sum(y)
            neg = y.shape[0] - pos
        else:
            pos = np.sum(sample_weight * y)
            neg = np.sum(sample_weight * (1 - y))

        if neg == 0 or pos == 0:
            raise ValueError('y contains non binary labels.')
        self.prior = self.scale * np.log(pos / neg)

    def predict(self, X):
        check_is_fitted(self, 'prior')

        y = np.empty((X.shape[0], 1), dtype=np.float64)
        y.fill(self.prior)
        return y

Among its methods, fit below performs the initialization. Notice the factor self.scale: sklearn provides two loss functions for classification, logloss and exponential loss, and their initializations differ only in this coefficient, which is 1.0 for the former and 0.5 for the latter.

    def fit(self, X, y, sample_weight=None):
        # pre-cond: pos, neg are encoded as 1, 0
        if sample_weight is None:
            pos = np.sum(y)
            neg = y.shape[0] - pos
        else:
            pos = np.sum(sample_weight * y)
            neg = np.sum(sample_weight * (1 - y))

        if neg == 0 or pos == 0:
            raise ValueError('y contains non binary labels.')
        self.prior = self.scale * np.log(pos / neg)
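
On the toy data above, the two priors would come out as follows (same formula, only the scale coefficient differs):

```python
import numpy as np

pos, neg = 4.0, 6.0
prior_deviance = 1.0 * np.log(pos / neg)     # logloss:     scale = 1.0 -> -0.4054
prior_exponential = 0.5 * np.log(pos / neg)  # exponential: scale = 0.5 -> -0.2027
print(prior_deviance, prior_exponential)
```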

Finally, the conversion to probabilities. One detail: the probability of the positive class is stored in the second column (counting from 1).

    def _score_to_proba(self, score):
        proba = np.ones((score.shape[0], 2), dtype=np.float64)
        proba[:, 1] = expit(score.ravel())
        proba[:, 0] -= proba[:, 1]
        return proba
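
As a small illustration of that column ordering (using expit directly rather than calling the private method):

```python
import numpy as np
from scipy.special import expit

score = np.array([-0.525, 0.061])
proba = np.ones((score.shape[0], 2))
proba[:, 1] = expit(score)    # column 1 (second column): P(y = 1)
proba[:, 0] -= proba[:, 1]    # column 0 (first column):  P(y = 0)
print(proba)
```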

Summary

That covers GBDT for both regression and classification. What may be lacking is depth in the source code discussion; I have not had time to go further recently, so I will analyze the code in more depth later and add it here.

Multi-class classification needs to be discussed separately; see the dedicated article for details.

References

http://docplayer.net/21448572-Generalized-boosted-models-a-guide-to-the-gbm-package.html (derivations for various loss functions)
http://xueshu.baidu.com/s?wd=paperuri%3A%28ab7165108163edc94b30781e51819e0c%29&filter=sc_long_sign&sc_ks_para=q%3DGreedy%20function%20approximation%3A%20A%20gradient%20boosting%20machine.&sc_us=13783016239361402484&tn=SE_baiduxueshu_c1gjeupa&ie=utf-8 (Friedman, "Greedy Function Approximation: A Gradient Boosting Machine" — the well-known paper this article mainly draws on)
