Log Loss Function
In logistic regression (LR), the parameters are fitted by maximum likelihood: we look for parameters under which the observed samples are most probable. For a classification problem the training labels are known, and if every sample were predicted perfectly we would have $\hat{y}=y$, i.e.:
- when $y=0$, $P(Y=1|X)$ should be as close to 0 as possible
- when $y=1$, $P(Y=1|X)$ should be as close to 1 as possible

The objective is therefore:
$$\prod_{i=1}^{N}\left[P\left(Y=1|X=x_{i}\right)\right]^{y_{i}}\left[1-P\left(Y=1|X=x_{i}\right)\right]^{1-y_{i}}$$

which should be as large as possible.
Taking the log turns the product into a sum, and the loss function is defined as:
$$Loss = -\sum_{i=1}^{N}\left\{y_{i}\log P\left(Y=1|X=x_{i}\right)+\left(1-y_{i}\right)\log\left[1-P\left(Y=1|X=x_{i}\right)\right]\right\}$$
Assume $y\in\{0,1\}$. For a single sample $(x_i, y_i)$, the log loss can be written as:
$$L\left(y_{i}, p\left(Y=1|x_{i}\right)\right)=-\left\{y_{i} \log p_{i}+\left(1-y_{i}\right) \log \left(1-p_{i}\right)\right\}$$
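As a quick illustration (mine, not part of the sklearn source), a few lines of numpy that evaluate this per-sample log loss:

import numpy as np

def log_loss_per_sample(y, p):
    # -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1 to avoid log(0)
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    eps = np.finfo(np.float64).eps
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# a confident correct prediction costs little, a confident wrong one costs a lot
print(log_loss_per_sample([1, 1, 0], [0.9, 0.1, 0.1]))  # roughly [0.105, 2.303, 0.105]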
In GBDT, every round fits the negative gradient. At prediction time, the continuous per-tree values are first accumulated, and the sum is finally mapped into (0, 1) by the sigmoid function.
$$\begin{aligned} &f_m(x) = f_{m-1}(x) + \alpha_m G_m(x) = \sum_{t=1}^{m} G_t(x)\\ &P(Y=1|X=x_i)=\operatorname{sigmoid}(f_m(x_i))=p(x_i)=\frac{1}{1+e^{-f_m(x_i)}} \end{aligned}$$
Substituting $f_m(x)$ into the loss:
$$\begin{aligned} -Loss(y_i,f_m(x_i))&=y_{i} \log \left(\frac{1}{1+e^{-f_{m}(x_{i})}}\right)+\left(1-y_{i}\right) \log \left(\frac{e^{-f_{m}(x_{i})}}{1+e^{-f_{m}(x_{i})}}\right)\\ &=-y_{i} \log \left(1+e^{-f_{m}(x_{i})}\right)+\left(1-y_{i}\right)\left\{\log \left(e^{-f_{m}(x_{i})}\right)-\log \left(1+e^{-f_{m}(x_{i})}\right)\right\}\\ &=-y_{i}\log \left(1+e^{-f_{m}(x_{i})}\right)+\log \left(e^{-f_{m}(x_{i})}\right)-\log \left(1+e^{-f_{m}(x_{i})}\right)-y_{i} \log \left(e^{-f_{m}(x_{i})}\right)+y_{i} \log \left(1+e^{-f_{m}(x_{i})}\right) \\ &= y_{i} f_{m}(x_{i})-\log \left(1+e^{f_{m}(x_{i})}\right)\\ \text{so: } Loss(y_i,f_m(x_i))&=-\left\{ y_{i} f_{m}(x_{i})-\log \left(1+e^{f_{m}(x_{i})}\right)\right\}\\ &=- y_{i} f_{m}(x_{i})+\log \left(1+e^{f_{m}(x_{i})}\right) \end{aligned}$$
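A small numerical check (again just an illustration, not from the sklearn source) that the derived form $-y_i f_m(x_i)+\log(1+e^{f_m(x_i)})$ agrees with the original log loss when $p_i=\operatorname{sigmoid}(f_m(x_i))$:

import numpy as np
from scipy.special import expit  # expit(x) = 1/(1+exp(-x))

y, f = 1.0, -0.7                     # an arbitrary label and raw score f_m(x_i)
p = expit(f)                         # p_i = sigmoid(f)
loss_from_p = -(y * np.log(p) + (1 - y) * np.log(1 - p))
loss_from_f = -y * f + np.log(1 + np.exp(f))
print(loss_from_p, loss_from_f)      # both about 1.103, the two forms match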
Negative Gradient
Differentiating the loss with respect to $f$ gives:
$$\begin{aligned} \frac{\delta L}{\delta f} &=-y+\frac{e^{f}}{1+e^{f}} \\ &=-y+\frac{1}{1+e^{-f}} \end{aligned}$$
so the negative gradient is $y-\frac{1}{1+e^{-f}}$. For the $m$-th iteration, $f=f_{m-1}$, and plugging it in gives the negative gradient at every sample point.
Sklearn Source Code
Intro
- sklearn version: 0.22
- The original file is _gb.py; to make testing and dumping intermediate results easier, it was copied to _gb_Source.py
- The iris dataset is used to test and verify the code

We focus on the following parts:
- Parameter initialization and checks
- The initial values used for the iterations
- What each boosting round actually does
- Model storage and prediction
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.ensemble._gb_Source import GradientBoostingClassifier
from sklearn.utils.validation import check_array, column_or_1d
from sklearn.tree._tree import DTYPE, DOUBLE
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree._tree import TREE_LEAF
from scipy.sparse import csc_matrix
from scipy.sparse import csr_matrix
from scipy.sparse import issparse
from scipy.special import expit
from sklearn.utils.multiclass import check_classification_targets
from sklearn.utils import check_random_state
from sklearn.utils import check_consistent_length
from sklearn.utils import deprecated
from sklearn.utils.fixes import logsumexp
from sklearn.utils.stats import _weighted_percentile
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError
from sklearn.dummy import DummyClassifier
import math
from sklearn import tree
iris_data = datasets.load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df["target"] = iris_data.target # 0 1 2
iris_df["target_name"] = iris_df.target.replace({0: "setosa", 1: "versicolor", 2: "virginica"})
iris_df = iris_df.query("target_name in ('virginica', 'versicolor')")
iris_df = iris_df.iloc[0:80]
print(iris_df.target_name.value_counts())
iris_df.head()
versicolor 50
virginica 30
Name: target_name, dtype: int64
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | target_name |
|---|---|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | 1 | versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | 1 | versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | 1 | versicolor |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 | 1 | versicolor |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | 1 | versicolor |
Parameter Initialization and Checks
When the fit method of the BaseGradientBoosting class is called, part of the parameter initialization is done there.
Two points to note in the initialization:
- check_array converts X and y to arrays, and column_or_1d converts y into a 1-d array
- check_consistent_length verifies that X and y have the same number of samples

self._validate_y encodes y, turning the labels into ordered integer values such as 0, 1, 2. self.classes_ and self.n_classes_ hold the enumeration of class values and the number of classes, respectively. Let's look at the _validate_y function:
def _validate_y(self, y, sample_weight):
    check_classification_targets(y)  # check that y is categorical; from the code it seems to judge the type via the number of unique values of y
    # enumerate the unique values and map them to 0, 1, 2, ...; the original values are sorted first
    self.classes_, y = np.unique(y, return_inverse=True)
    # np.bincount counts how often each value occurs; since y has already been encoded above, this is simply the per-class count
    # then count the non-zero entries; in theory this should never fail here...
    n_trim_classes = np.count_nonzero(np.bincount(y, sample_weight))
    if n_trim_classes < 2:
        raise ValueError("y contains %d class after sample_weight "
                         "trimmed classes with zero weights, while a "
                         "minimum of 2 classes are required."
                         % n_trim_classes)
    self.n_classes_ = len(self.classes_)
    return y
This involves a few numpy APIs; see the docs for details. In short, _validate_y maps ['a','b','c'] to [0,1,2]: self.classes_ keeps the sorted original labels, and y is replaced by the mapped integer values. Also note that for binary classification, fit calls the _validate_y overridden in GradientBoostingClassifier, not the one in BaseGradientBoosting.
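For reference, here is a tiny standalone illustration (mine, not from the sklearn source) of the np.unique call that does the encoding:

import numpy as np

labels = np.array(["virginica", "versicolor", "versicolor", "virginica"])
classes_, y_encoded = np.unique(labels, return_inverse=True)
print(classes_)    # ['versicolor' 'virginica']  (sorted original labels)
print(y_encoded)   # [1 0 0 1]  (each label replaced by its index in classes_)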
gbm = GradientBoostingClassifier(random_state=10,n_estimators=5)
gbm.fit(X=iris_df.iloc[:, 0:4], y=iris_df.target_name)
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=5,
n_iter_no_change=None, presort='deprecated',
random_state=10, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
gbm.n_classes_
2
gbm.classes_
array(['versicolor', 'virginica'], dtype=object)
The encoding of y happens during fit; the decoding happens at prediction time.
gbm._validate_y(iris_df.target_name,sample_weight=None)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
Determining the Initial Value of the Iteration
gbm.loss
'deviance'
if (self.loss not in self._SUPPORTED_LOSS
        or self.loss not in _gb_losses.LOSS_FUNCTIONS):
    raise ValueError("Loss '{0:s}' not supported. ".format(self.loss))

# loss_class is simply a reference to the loss class itself
if self.loss == 'deviance':
    loss_class = (_gb_losses.MultinomialDeviance
                  if len(self.classes_) > 2
                  else _gb_losses.BinomialDeviance)
else:
    loss_class = _gb_losses.LOSS_FUNCTIONS[self.loss]

if self.loss in ('huber', 'quantile'):
    self.loss_ = loss_class(self.n_classes_, self.alpha)
else:
    self.loss_ = loss_class(self.n_classes_)  # pass in the number of classes to obtain an instance of the loss class
First, how the loss is set up. _check_params initializes and checks part of the parameters. self.loss is passed in as a parameter and defaults to 'deviance'.
- It first verifies that the given loss value is supported
- Based on the number of entries in self.classes_, it decides between multiclass and binary classification
- loss_class is essentially the BinomialDeviance class itself, and self.loss_ is an instance of it; for binary classification 2 is passed in, but the constructor forces the argument to 1
- In the superclass LossFunction, the attribute K therefore becomes 1, i.e. self.loss_.K = 1; this K is crucial later for the dimensions of the generated arrays
class BinomialDeviance(ClassificationLossFunction):
    # note that n_classes is forced to 1 here, i.e. K becomes 1
    def __init__(self, n_classes):
        if n_classes != 2:
            raise ValueError("{0:s} requires 2 classes; got {1:d} class(es)"
                             .format(self.__class__.__name__, n_classes))
        # we only need to fit one tree for binary clf.
        super().__init__(n_classes=1)
Next, the initial value.
def _init_state(self):
    """Initialize model state and allocate model state data structures. """
    self.init_ = self.init
    # this determines the first-round prediction, i.e. the initial y_hat;
    # for classification the default is based on the prior (the majority class proportion)
    # init defaults to None, in which case a DummyEstimator computing the prior is used
    if self.init_ is None:
        self.init_ = self.loss_.init_estimator()

    self.estimators_ = np.empty((self.n_estimators, self.loss_.K),
                                dtype=np.object)
    self.train_score_ = np.zeros((self.n_estimators,), dtype=np.float64)
    # do oob?
    if self.subsample < 1.0:
        self.oob_improvement_ = np.zeros((self.n_estimators),
                                         dtype=np.float64)
_init_state initializes the loss and estimators_. Note that for binary classification, an empty array of shape (self.n_estimators, 1) is created to hold the regression trees produced during the iterations.
The initial value is specified by init:

estimator or 'zero', default=None
An estimator object that is used to compute the initial predictions. init has to provide fit and predict_proba. If 'zero', the initial raw predictions are set to zero. By default, a DummyEstimator predicting the classes priors is used.

_init_state initializes the parameters: self.init_ = self.init assigns the init value to init_.
- If self.init_ is None, then self.init_ = self.loss_.init_estimator()
- Calling self.loss_.init_estimator(), i.e. BinomialDeviance's init_estimator method, returns DummyClassifier(strategy='prior')
- That completes the self._init_state() step
- If self.init_ == 'zero', raw_predictions is an (n, 1) array whose elements are all 0
- Otherwise, self.init_.fit(X, y, sample_weight=sample_weight) is called, i.e. DummyClassifier.fit
- Then raw_predictions = self.loss_.get_init_raw_predictions(X, self.init_); this part is slightly more involved, so let's look at the code
def get_init_raw_predictions(self, X, estimator):
    probas = estimator.predict_proba(X)
    # prior probability of the positive class
    proba_pos_class = probas[:, 1]
    # eps is the smallest positive float32 value
    eps = np.finfo(np.float32).eps
    # clip proba_pos_class into [eps, 1 - eps]
    proba_pos_class = np.clip(proba_pos_class, eps, 1 - eps)
    # inverse of the sigmoid; the probability is recovered later by applying the sigmoid again
    # log(x / (1 - x)) is the inverse of the sigmoid (expit) function
    raw_predictions = np.log(proba_pos_class / (1 - proba_pos_class))
    # after the reshape the shape becomes (n, 1), which lines up with raw_predictions[:, k]
    # in _update_terminal_regions; for binary classification there is a single column (K = 1)
    return raw_predictions.reshape(-1, 1).astype(np.float64)
prior_model = DummyClassifier(strategy='prior')
prior_model.fit(X=iris_df.iloc[:, 0:4], y=iris_df.target_name)
DummyClassifier(constant=None, random_state=None, strategy='prior')
prior_model.predict_proba(X=iris_df.iloc[:, 0:4])[0:5]
array([[0.625, 0.375],
[0.625, 0.375],
[0.625, 0.375],
[0.625, 0.375],
[0.625, 0.375]])
Simply put, if the ratio of positive to negative samples is 3:5, then p(y=1) is predicted as 0.375 for every sample, and raw_predictions is computed through the inverse of the sigmoid. The key point is that raw_predictions is just a continuous score, the running sum that is accumulated step by step during the iterations, and it is mapped into (0, 1) by the sigmoid only at the end. The numpy functions used above can be looked up in the docs.
Let's verify the logic:
from sklearn.ensemble._gb_losses import BinomialDeviance
gbm_loss = BinomialDeviance(2)
gbm_loss.K
1
As expected, K is forced to 1, regardless of the constructor argument.
raw_predictions = gbm_loss.get_init_raw_predictions(X=iris_df.iloc[:, 0:4],estimator=prior_model)
raw_predictions.shape
(80, 1)
raw_predictions[0:5]
array([[-0.51082562],
[-0.51082562],
[-0.51082562],
[-0.51082562],
[-0.51082562]])
Now verify the sigmoid transformation:
from scipy.special import expit
print(expit(-0.51082562),1/(1+np.exp(0.51082562)))
0.37500000088265406 0.37500000088265406
This matches the prior probability predicted by DummyClassifier.
The Iteration Process
n_stages = self._fit_stages(
X, y, raw_predictions, sample_weight, self._rng, X_val, y_val,
sample_weight_val, begin_at_stage, monitor, X_idx_sorted)
The main logic of self._fit_stages is straightforward: a for loop over self.n_estimators rounds, each calling self._fit_stage. What matters is which values each round trains on and how the model is updated, i.e.:
raw_predictions = self._fit_stage(i, X, y, raw_predictions, sample_weight, sample_mask, random_state, X_idx_sorted, X_csc, X_csr)
For each iteration, the main steps are:
- Compute the negative gradient from the previous round's raw_predictions
- Fit a regression tree to the negative gradient, update the leaf predictions, and update the tree structure
- Update raw_predictions and store the tree in self.estimators_
Let's walk through a single model-update round.
The regression tree is trained with X = the feature data and y = the negative gradient.
# first convert the string labels into 0/1
y = gbm._validate_y(iris_df.target_name,sample_weight=None)
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
raw_predictions[0:5]
array([[-0.51082562],
[-0.51082562],
[-0.51082562],
[-0.51082562],
[-0.51082562]])
residual = gbm.loss_.negative_gradient(y,raw_predictions)
residual
array([-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375, -0.375,
-0.375, -0.375, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625,
0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625,
0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625,
0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625, 0.625])
def negative_gradient(self, y, raw_predictions, **kargs):
    """Compute the residual (= negative gradient).

    Parameters
    ----------
    y : 1d array, shape (n_samples,)
        True labels.

    raw_predictions : 2d array, shape (n_samples, K)
        The raw_predictions (i.e. values from the tree leaves) of the
        tree ensemble at iteration ``i - 1``.
    """
    # expit(x) = 1/(1+exp(-x))
    return y - expit(raw_predictions.ravel())
The formula for the negative gradient was derived earlier, and the code above is clear: y is the true label in {0, 1} and raw_predictions is the previous round's prediction. Let's verify it; for the first sample (y = 0, f = -0.51082562), the derivative of the loss is
$$\begin{aligned} \frac{\delta L}{\delta f} &=-y+\frac{e^{f}}{1+e^{f}} =-y+\frac{1}{1+e^{-f}} \\ &=-0+1/\left(1+e^{-(-0.51082562)}\right) =1/\left(1+e^{0.51082562}\right) =0.3750 \end{aligned}$$
and the residual is its negative, -0.375, matching the output above.
1/(1+math.exp(0.51082562))
0.37500000088265406
Fitting the Regression Tree & Updating the Leaf Nodes
The parameters may differ from those used in the source; that doesn't matter, we only care about the main idea.
dtree = DecisionTreeRegressor(criterion="friedman_mse")
dtree.fit(iris_df.iloc[:, 0:4], residual)
DecisionTreeRegressor(ccp_alpha=0.0, criterion='friedman_mse', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
from IPython.display import Image
import pydotplus
dot_data = tree.export_graphviz(dtree, out_file=None,
feature_names=iris_data.feature_names,
class_names=iris_data.target_names,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
The formula for updating the leaf predictions is quite intuitive, but the code involves many tree methods that are less familiar; let's work through it as best we can.
def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                            residual, raw_predictions, sample_weight):
    """Make a single Newton-Raphson step.

    our node estimate is given by:

        sum(w * (y - prob)) / sum(w * prob * (1 - prob))

    we take advantage that: y - prob = residual
    """
    terminal_region = np.where(terminal_regions == leaf)[0]
    residual = residual.take(terminal_region, axis=0)
    y = y.take(terminal_region, axis=0)
    sample_weight = sample_weight.take(terminal_region, axis=0)

    numerator = np.sum(sample_weight * residual)
    denominator = np.sum(sample_weight *
                         (y - residual) * (1 - y + residual))

    # prevents overflow and division by zero
    # this decides the leaf's output value
    if abs(denominator) < 1e-150:
        tree.value[leaf, 0, 0] = 0.0
    else:
        tree.value[leaf, 0, 0] = numerator / denominator
dtree.apply(iris_df.iloc[:,0:4])
array([ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 14, 2, 2, 2, 2, 2, 2, 11, 2, 2, 2, 2, 2, 6,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 16,
16, 16, 16, 16, 16, 10, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16, 4, 16, 16, 16, 16, 16, 16, 15, 16, 16, 7], dtype=int64)
dtree.tree_.children_left
array([ 1, 2, -1, 4, -1, 6, -1, -1, 9, 10, -1, -1, 13, 14, -1, -1, -1],
dtype=int64)
Nodes 3, 5, 7, ... are leaf nodes. Let's check whether the predicted value of the 3rd node equals the mean of the fitted target (the residuals).
Note: "the 3rd node" counts from 1, while the apply method returns 0-based indices, so it corresponds to node id 2 below.
np.where(dtree.apply(iris_df.iloc[:,0:4]) == 2)[0]
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], dtype=int64)
for i in [2, 4, 6, 7]:
    print(
        "leaf num - %s:" % i, "(check value %s)" %
        residual[np.where(dtree.apply(iris_df.iloc[:, 0:4]) == i)[0]].mean(),
        "(real value%s)" % dtree.tree_.value[i])
leaf num - 2: (check value -0.37499999999999994) (real value[[-0.375]])
leaf num - 4: (check value 0.625) (real value[[0.625]])
leaf num - 6: (check value -0.37499999999999994) (real value[[-0.375]])
leaf num - 7: (check value 0.625) (real value[[0.625]])
resudial_sample = residual.take(np.where(dtree.apply(iris_df.iloc[:,0:4]) == 2)[0])
y_sample = gbm._validate_y(iris_df.target_name,sample_weight=None).take(np.where(dtree.apply(iris_df.iloc[:,0:4]) == 2)[0], axis=0)
numerator = np.sum(resudial_sample)
denominator = np.sum((y_sample - resudial_sample) * (1 - y_sample + resudial_sample))
numerator / denominator
-1.5999999999999999
In an ordinary CART regression tree, the prediction (output value) of a leaf is the mean of the target values of the training samples that fall into that leaf (optimal under squared error). Binary GBDT, however, does not aim at matching the negative gradient as closely as possible; it aims directly at the original objective, minimizing the log loss. That is:
$$c_{tj}=\underbrace{\arg \min }_{c} \sum_{x_{i} \in R_{tj}} L\left(y_{i}, f_{t-1}\left(x_{i}\right)+c\right)$$
This is difficult to optimize exactly, so the following approximation is generally used instead:
$$c_{tj}=\frac{\sum_{x_{i} \in R_{tj}} r_{ti}}{\sum_{x_{i} \in R_{tj}}\left(y_{i}-r_{ti}\right)\left(1-y_{i}+r_{ti}\right)}$$
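This "approximation" is in fact a single Newton-Raphson step, which the sklearn docstring below also states. As a short supplement: writing $p_i$ for the probability predicted in the previous round, the residual is $r_{ti}=y_i-p_i$, and the first and second derivatives of the log loss with respect to the raw score are
$$g_i=p_i-y_i=-r_{ti}, \qquad h_i=p_i\left(1-p_i\right)=\left(y_{i}-r_{ti}\right)\left(1-y_{i}+r_{ti}\right)$$
so one Newton step over the leaf gives
$$c_{tj}=-\frac{\sum_{x_i\in R_{tj}} g_i}{\sum_{x_i\in R_{tj}} h_i}=\frac{\sum_{x_{i} \in R_{tj}} r_{ti}}{\sum_{x_{i} \in R_{tj}}\left(y_{i}-r_{ti}\right)\left(1-y_{i}+r_{ti}\right)}$$
which is exactly the numerator and denominator that appear in the code.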
Let's look at the code:
# update the predicted value of a single leaf node
def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                            residual, raw_predictions, sample_weight):
    """Make a single Newton-Raphson step.

    our node estimate is given by:

        sum(w * (y - prob)) / sum(w * prob * (1 - prob))

    we take advantage that: y - prob = residual
    """
    # find the indices of the samples belonging to this leaf
    terminal_region = np.where(terminal_regions == leaf)[0]
    # take the corresponding negative gradients and label values
    residual = residual.take(terminal_region, axis=0)
    y = y.take(terminal_region, axis=0)
    # sample weights; ignore them for now
    sample_weight = sample_weight.take(terminal_region, axis=0)
    # sum of the negative gradients
    numerator = np.sum(sample_weight * residual)
    denominator = np.sum(sample_weight *
                         (y - residual) * (1 - y + residual))

    # prevents overflow and division by zero
    # this decides the leaf's output value
    if abs(denominator) < 1e-150:
        tree.value[leaf, 0, 0] = 0.0
    else:
        tree.value[leaf, 0, 0] = numerator / denominator
tree is a mutable object, so its values are modified in place and the changes are visible across the functions it is passed to; list and array behave the same way. The residuals are later updated in place through the same mechanism.
Now let's verify the computation above.
First look at the tree plot of the first estimator, which shows the leaf values at a glance:
from IPython.display import Image
import pydotplus
dot_data = tree.export_graphviz(gbm.estimators_[0][0], out_file=None,
feature_names=iris_data.feature_names,
class_names=iris_data.target_names,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
gbm.estimators_[0][0].apply(iris_df.iloc[:, 0:4])
array([ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 11, 2, 2, 2, 2, 2, 2, 9, 2, 2, 2, 2, 2, 4,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12,
12, 12, 12, 12, 12, 8, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 4, 12, 12, 12, 12, 12, 12, 11, 12, 12, 5], dtype=int64)
gbm.estimators_[0][0].tree_.children_left
array([ 1, 2, -1, 4, -1, -1, 7, 8, -1, -1, 11, -1, -1], dtype=int64)
The source was modified so that the intermediate values of each round are saved on the estimator (the residuals are read from self.residual_list below).
Compute the output value corresponding to leaf node [3] (node id 2 in 0-based indexing):
leaf2_sample_index = np.where(gbm.estimators_[0][0].apply(iris_df.iloc[:, 0:4])==2)[0]
leaf2_sample_index
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], dtype=int64)
y_sample = gbm._validate_y(iris_df.target_name,sample_weight=None).take(leaf2_sample_index, axis=0)
resudial_sample = gbm.residual_list[0][0].take(leaf2_sample_index, axis=0)
numerator = np.sum(resudial_sample)
denominator = np.sum((y_sample - resudial_sample) * (1 - y_sample + resudial_sample))
numerator / denominator
-1.5999999999999999
gbm.estimators_[0][0].tree_.value[2, 0, 0]
-1.5999999999999999
Updating raw_predictions
The regression-tree fit above produces a new tree; raw_predictions is then updated with
raw_predictions[:, k] += \
    learning_rate * tree.value[:, 0, 0].take(terminal_regions, axis=0)
- For binary classification K = 1, so there is a single column (k = 0); multiclass is not covered here
- raw_predictions is updated by adding the new tree's predictions, scaled by the learning rate, to its previous values (see the sketch below)
- This is exactly the additive-model logic
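To make the update concrete, here is a rough sketch (my illustration, ignoring sample weights) of what that line does in the binary case, reusing the dtree fitted above. Note that in the real _fit_stage the leaf values have already been overwritten by _update_terminal_regions before this addition, whereas our dtree still holds the plain residual means:

# leaf id that each training sample falls into
terminal_regions = dtree.apply(iris_df.iloc[:, 0:4])
# add each sample's leaf output, scaled by the learning rate, to the running score
raw_predictions[:, 0] += gbm.learning_rate * dtree.tree_.value[:, 0, 0].take(terminal_regions, axis=0)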
The Second Round
- The updated raw_predictions and y are used to compute the new negative gradient
- (X, residual) is used to train a new regression tree
- raw_predictions keeps being updated in the same way

Stopping Condition
The iteration stops once the preset n_estimators rounds have been run.
The trees produced during the iterations are stored in self.estimators_.
Model Prediction
raw_prediction
Prediction happens in two steps:
- The N trees produce N results, the raw_prediction
- The N results are multiplied by learning_rate, summed, and added to the initial value; the sigmoid then converts the total into a probability in (0, 1)
$$F_{M}(x)=F_{0}(x)+\sum_{m=1}^{M} \sum_{j=1}^{J_{m}} c_{m, j} I\left(x \in R_{m, j}\right)$$
# unless specified otherwise, this returns the prior, i.e. based on the majority class proportion
def _raw_predict_init(self, X):
    """Check input and compute raw predictions of the init estimtor."""
    self._check_initialized()
    X = self.estimators_[0, 0]._validate_X_predict(X, check_input=True)
    if X.shape[1] != self.n_features_:
        raise ValueError("X.shape[1] should be {0:d}, not {1:d}.".format(
            self.n_features_, X.shape[1]))
    if self.init_ == 'zero':
        raw_predictions = np.zeros(shape=(X.shape[0], self.loss_.K),
                                   dtype=np.float64)
    else:
        raw_predictions = self.loss_.get_init_raw_predictions(
            X, self.init_).astype(np.float64)
    return raw_predictions
# as the docstring says, this returns the sum of the trees' raw predictions;
# the heavy lifting is done in predict_stages, which is compiled, so the source is hidden here
def _raw_predict(self, X):
    """Return the sum of the trees raw predictions (+ init estimator)."""
    # initial prediction
    raw_predictions = self._raw_predict_init(X)
    # run through the stages to accumulate the final raw_predictions
    predict_stages(self.estimators_, X, self.learning_rate,
                   raw_predictions)
    return raw_predictions
The core predict_stages routine lives in a compiled extension, so its source is not directly visible; let's simply verify the logic described above.
The initial value follows the same logic as in the initialization section: gbm.init_ is the DummyClassifier; we first get the predicted probability, then apply the inverse of the sigmoid to obtain the raw_prediction.
gbm.loss_.get_init_raw_predictions(iris_df.iloc[0:2,0:4],gbm.init_)
array([[-0.51082562],
[-0.51082562]])
Next, collect each tree's prediction multiplied by learning_rate:
pre_list = []
for i in gbm.estimators_:
    pre_list.append(i[0].predict(iris_df.iloc[0:1,0:4])[0]*gbm.learning_rate)
pre_list
[-0.16,
-0.1511286273379727,
-0.14395717809370576,
-0.13806361187211877,
-0.13315505338282463]
gbm.loss_.get_init_raw_predictions(iris_df.iloc[0:1,0:4],gbm.init_)[0][0]+sum(pre_list)
-1.2371300944526125
predict_proba
The sigmoid turns the raw predictions into probabilities.
def predict_proba(self, X):
    raw_predictions = self.decision_function(X)
    try:
        # applies the sigmoid (expit) to map the raw predictions to probabilities
        return self.loss_._raw_prediction_to_proba(raw_predictions)
    except NotFittedError:
        raise
    except AttributeError:
        raise AttributeError('loss=%r does not support predict_proba' %
                             self.loss)
gbm.predict_proba(iris_df.iloc[0:1,0:4])
array([[0.77506407, 0.22493593]])
expit(gbm.loss_.get_init_raw_predictions(iris_df.iloc[0:1,0:4],gbm.init_)[0][0]+sum(pre_list))
0.2249359293642033
This matches the direct prediction; the logic is verified.
predict
predict includes a label-decoding step: during initialization, gbm._validate_y encoded the labels; in this case 'versicolor' and 'virginica' were mapped to 0 and 1.
def predict(self, X):
    # some of this is not pure Python; in short, the predictions of all models are accumulated into one result
    raw_predictions = self.decision_function(X)
    # find the index of the largest probability
    encoded_labels = \
        self.loss_._raw_prediction_to_decision(raw_predictions)
    # encoded_labels holds values like [0, 1]; return maps them back to the original y values, e.g. (good, bad) or (+1, -1)
    return self.classes_.take(encoded_labels, axis=0)

-------------------------------------------------------------------------------------

def _raw_prediction_to_proba(self, raw_predictions):
    proba = np.ones((raw_predictions.shape[0], 2), dtype=np.float64)  # N rows of [1, 1] to hold the probabilities
    proba[:, 1] = expit(raw_predictions.ravel())  # the sigmoid maps the raw values into (0, 1)
    proba[:, 0] -= proba[:, 1]
    return proba

def _raw_prediction_to_decision(self, raw_predictions):
    # convert to probabilities
    proba = self._raw_prediction_to_proba(raw_predictions)
    return np.argmax(proba, axis=1)  # returns the index of the larger probability (0 or 1); predict then maps that index to the class label
- decision_function returns the raw_prediction
- _raw_prediction_to_decision takes the index of the larger probability, which is 0 or 1
- predict maps that index back to the corresponding original class label
Summary
Model training
- Initialize the 0th weak learner $F_0$; depending on the init argument this is typically a DummyClassifier, which yields the initial raw_prediction
- Train n_estimators trees, where each round:
  - 1) computes the negative gradient (residual) from raw_prediction and y
  - 2) trains a CART regression tree on (features X, negative gradient residual)
  - 3) updates the output value of each leaf node
  - 4) updates raw_prediction as raw_prediction + learning_rate * leaf_output, i.e. adds the learning rate times the tree's prediction
- The four steps above are repeated for each iteration
- Once the specified number of trees is reached, the iteration stops
Model prediction
- First obtain the initial value, i.e. the prediction made by the 0th weak learner above
- Then multiply the predictions of the n_estimators trees by the learning rate
- Add the two together, and pass the sum through the sigmoid to get the probability (a compact end-to-end sketch follows below)
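Putting the whole summary together, here is a compact from-scratch sketch of the training and prediction logic above (my illustration, not the sklearn implementation; no subsampling or sample weights; fit_gbdt_binary and predict_proba_gbdt are hypothetical helper names):

import numpy as np
from scipy.special import expit
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt_binary(X, y, n_estimators=5, learning_rate=0.1, max_depth=3):
    # F_0: prior log-odds of the positive class, the same idea as
    # DummyClassifier(strategy='prior') followed by the inverse sigmoid
    p0 = np.clip(y.mean(), 1e-15, 1 - 1e-15)
    raw = np.full(X.shape[0], np.log(p0 / (1 - p0)))
    f0 = raw[0]
    trees = []
    for _ in range(n_estimators):
        residual = y - expit(raw)                          # 1) negative gradient
        tree = DecisionTreeRegressor(max_depth=max_depth, criterion="friedman_mse")
        tree.fit(X, residual)                              # 2) CART regression tree on (X, residual)
        leaves = tree.apply(X)
        for leaf in np.unique(leaves):                     # 3) one Newton step per leaf
            idx = leaves == leaf
            num = residual[idx].sum()
            den = ((y[idx] - residual[idx]) * (1 - y[idx] + residual[idx])).sum()
            tree.tree_.value[leaf, 0, 0] = 0.0 if abs(den) < 1e-150 else num / den
        raw += learning_rate * tree.tree_.value[:, 0, 0].take(leaves)   # 4) additive update
        trees.append(tree)
    return f0, trees

def predict_proba_gbdt(f0, trees, learning_rate, X):
    raw = np.full(X.shape[0], f0)
    for tree in trees:
        raw += learning_rate * tree.predict(X)             # accumulate raw scores
    return expit(raw)                                      # sigmoid at the very end

# usage on the two-class iris subset from above; this should track gbm.predict_proba closely,
# up to tie-breaking differences in the tree splits
X_arr = iris_df.iloc[:, 0:4].values
y_arr = (iris_df.target_name == "virginica").values.astype(float)
f0, trees = fit_gbdt_binary(X_arr, y_arr)
print(predict_proba_gbdt(f0, trees, 0.1, X_arr)[:3])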
Ref
[1] https://www.6aiq.com/article/1583859185789
[2] https://liangyaorong.github.io/blog/2017/scikit-learn%E4%B8%AD%E7%9A%84GBDT%E5%AE%9E%E7%8E%B0/#2.6
[3] https://blog.csdn.net/qq_22238533/article/details/79192579
[4] https://www.cnblogs.com/pinard/p/6140514.html#!comments
This dragged on for almost a month. I'm still too lazy, hopelessly lazy. Neither clever nor diligent, and still looking for shortcuts. A big mistake!
2021-05-18, Jiulonghu, Jiangning District, Nanjing