GitHub repo: DataScicence (stars welcome)
Ensemble Learning 3: Boosting, Principles and a Case Study
Ensemble Learning 2: Bagging, Principles and a Case Study
Ensemble Learning 1: Voting, Principles and a Case Study
Ensemble Learning 4: Forward Stagewise Algorithm and GBDT, Principles and a Case Study
GBDT Theory
Only a rough sketch of the derivation is given here; for the full treatment, see Chapter 8 of Li Hang's Statistical Learning Methods.
CART Regression Trees
Tree models are usually used for classification, but how can a tree model handle regression?
By choosing a different splitting criterion: use the sum of squared errors in place of conditional entropy.
With this splitting criterion, a decision tree can be grown roughly as follows (a code sketch follows the list):
- Input the training set $X_{n\times m}$: n records with m attributes
- Choose the optimal splitting variable j and split point s:
- For each attribute j, find the best split point s; splitting there divides the space into two regions R1 and R2, each of which predicts the mean of the y values that fall into it, giving a loss $L_j$
- Pick the j, and its corresponding s, with the smallest loss
- Repeat the procedure on R1 and R2 until the target decision tree is obtained
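A minimal sketch of the split search described above, assuming X and y are NumPy arrays (best_split is a hypothetical helper, not a library function; a real CART implementation adds stopping rules and recursion):

import numpy as np

def best_split(X, y):
    # exhaustively search for the (feature j, threshold s) pair that
    # minimizes the total squared error of the two resulting regions
    best_j, best_s, best_loss = None, None, np.inf
    n, m = X.shape
    for j in range(m):                    # candidate splitting variable j
        for s in np.unique(X[:, j]):      # candidate split point s
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # each region predicts the mean of the targets falling into it
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s, best_loss      # (j, s, L_j)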
Forward Stagewise Algorithm
$$f_0(x)=0$$
$$f_m(x) = f_{m-1}(x)+T(x;\theta_m)$$
$$f_M(x) = \sum_{m=1}^{M}T(x;\theta_m)$$
At step m, solve for $\theta_m$:
$$\hat\theta_m = \arg\min_{\theta_m}\sum_{i=1}^{N}L\big(y_i,\, f_{m-1}(x_i)+T(x_i;\theta_m)\big)$$
BDT (Boosting Tree)
In BDT, each $T(x;\theta_m)$ is a CART regression tree. The procedure is roughly (a code sketch follows the list):
- Initialize $f_0(x) = 0$
- for m in [1, …, M]:
- Compute $\theta_m$
- $f_m(x) = f_{m-1}(x)+T(x;\theta_m)$
- Obtain the final model $f_M(x)$
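A minimal sketch of this loop (my own illustration, not the original author's code), using sklearn's DecisionTreeRegressor as the CART base learner; with squared error loss, computing $\theta_m$ reduces to fitting the current residuals:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bdt(X, y, M=100, max_depth=3):
    # boosting tree for regression: each CART tree fits the residuals
    # y - f_{m-1}(x) left over by the current additive model
    trees, f = [], np.zeros(len(y))   # f_0(x) = 0
    for m in range(M):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, y - f)            # fit the current residuals
        f += tree.predict(X)          # f_m = f_{m-1} + T(x; theta_m)
        trees.append(tree)
    return trees

def predict_bdt(trees, X):
    # the final model is the sum of all fitted trees
    return sum(tree.predict(X) for tree in trees)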
GBDT (Gradient Boosting Decision Tree)
In BDT the loss function is the squared error, so the CART tree $T_m$ at each forward stagewise step fits the residuals of the current model $f_{m-1}(x)$:
$$L\big(y,\, f_{m-1}(x)+T(x;\theta_m)\big) = \big[y - f_{m-1}(x) - T(x;\theta_m)\big]^2 = \big[r - T(x;\theta_m)\big]^2,\quad r = y - f_{m-1}(x)$$
To generalize the algorithm and make optimizing arbitrary loss functions easier, define:
$$r_i = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$$
So the new CART tree at each step fits the negative gradient of the loss function, which is why the method is called Gradient Boosting Decision Tree (GBDT).
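For squared error the negative gradient is exactly the residual, so GBDT reduces to the BDT sketch above; for other losses only the fitted target changes. Below is a hedged sketch (again my illustration, not the author's code) for the absolute loss L(y, f) = |y - f|, whose negative gradient is sign(y - f):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt_lad(X, y, M=100, lr=0.1, max_depth=3):
    # gradient boosting with absolute loss: each tree fits the negative
    # gradient r_i = sign(y_i - f_{m-1}(x_i)) instead of the raw residual
    # (production GBDT also re-optimizes leaf values per loss; omitted here)
    trees, f = [], np.zeros(len(y))       # f_0(x) = 0, as in the text
    for m in range(M):
        r = np.sign(y - f)                # r_i = -dL/df evaluated at f_{m-1}
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        f += lr * tree.predict(X)         # shrinkage (learning rate)
        trees.append(tree)
    return trees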
GBDT Case Study
Loading the Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, ensemble
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# load the diabetes regression dataset bundled with sklearn
data = datasets.load_diabetes()
X, y = data.data, data.target
X = pd.DataFrame(X, columns=data.feature_names)
X.head()
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |
X.shape
(442, 10)
Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=1)
Model Training
Common parameters of GradientBoostingRegressor:
- loss : {'ls', 'lad', 'huber', 'quantile'}
- The loss function. 'ls': least squares; 'lad': least absolute deviation; 'huber': a combination of 'ls' and 'lad'; 'quantile': quantile regression
- learning_rate : float, optional (default=0.1)
- Learning rate (shrinkage applied to each tree's contribution)
- n_estimators : int (default=100)
- Number of weak regressors (boosting stages)
- max_depth : integer, optional (default=3)
- Maximum tree depth of each weak regressor
- min_samples_split : int, float, optional (default=2)
- Minimum number of samples a parent node must contain to be split
- min_samples_leaf : int, float, optional (default=1)
- Minimum number of samples required in each child node
- subsample : float, optional (default=1.0)
- Fraction of samples used to train each weak regressor
- Smaller values lower the variance but may increase the bias
- max_features : int, float, string or None, optional (default=None)
- Maximum number of features considered at each split
Model attributes:
- feature_importances_ : array, shape = [n_features]
- Feature importances
- oob_improvement_ : array, shape = [n_estimators]
- Improvement in the loss at each iteration (only available when subsample < 1.0)
- train_score_ : array, shape = [n_estimators]
- Value of the loss on the training set at each iteration
- estimators_ : ndarray of DecisionTreeRegressor, shape = [n_estimators, 1]
- The tree fitted at each iteration
Model methods:
- staged_predict(X)
- Returns the prediction for X at each boosting iteration
params = {'n_estimators': 500,
          'max_depth': 4,
          'min_samples_split': 5,
          'learning_rate': 0.01,
          'loss': 'ls'}  # note: 'ls' is called 'squared_error' in sklearn >= 1.0
reg = ensemble.GradientBoostingRegressor(**params)
reg.fit(X_train, y_train)
mse = mean_squared_error(y_test, reg.predict(X_test))
print("The mean squared error (MSE) on test set: {:.4f}".format(mse))
The mean squared error (MSE) on test set: 3753.6048
The MSE on the test set is 3753.6048.
Iterations vs. Training Results
As the number of iterations grows, the error on the training set keeps decreasing, while the error on the test set levels off and even trends slightly upward.
- GBDT is fairly robust: increasing the number of iterations does not cause the test error to rise sharply
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)
# reg.loss_ evaluates the model's loss; it was removed in recent sklearn
# versions, where mean_squared_error(y_test, y_pred) can be used instead
for i, y_pred in enumerate(reg.staged_predict(X_test)):
    test_score[i] = reg.loss_(y_test, y_pred)
fig = plt.figure(figsize=(6, 6))
plt.subplot(1, 1, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, reg.train_score_, 'b-',
label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-',
label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')
fig.tight_layout()
plt.show()
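As a quick follow-up (my addition, not part of the original run), the iteration with the lowest test deviance can be read directly off the test_score curve computed above:

best_iter = int(np.argmin(test_score)) + 1
print("Lowest test deviance {:.4f} at iteration {}".format(test_score.min(), best_iter))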
Feature Importance
sklearn.inspection.permutation_importance(estimator, X, y, *, scoring=None, n_repeats=5, n_jobs=None, random_state=None, sample_weight=None) evaluates the importance of the model's input features
- estimator: the fitted model
- X, y: the supervised data
- scoring: scoring function
- n_repeats=5: number of times to permute each feature
- sample_weight: sample weights
Returns:
result : Bunch (dict-like)
- importances_mean : ndarray, shape (n_features,)
- Mean of feature importance over n_repeats
- importances_std : ndarray, shape (n_features,)
- Standard deviation over n_repeats
- importances : ndarray, shape (n_features, n_repeats)
- Raw permutation importance scores
feature_importance = reg.feature_importances_
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
fig = plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, np.array(data.feature_names)[sorted_idx])
plt.title('Feature Importance (MDI)')
result = permutation_importance(reg, X_test, y_test, n_repeats=10,
random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()
plt.subplot(1, 2, 2)
plt.boxplot(result.importances[sorted_idx].T,
vert=False, labels=np.array(data.feature_names)[sorted_idx])
plt.title("Permutation Importance (test set)")
fig.tight_layout()
plt.show()
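As a quick numeric view (my addition), the same permutation results can also be printed as a sorted table using the fields returned above:

for name, mean, std in sorted(zip(data.feature_names,
                                  result.importances_mean,
                                  result.importances_std),
                              key=lambda t: -t[1]):
    print("{:>4}: {:.4f} +/- {:.4f}".format(name, mean, std))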
Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

def Tuning(cv_params, other_params, x_train_array, y_train_):
    # grid-search over cv_params while holding other_params fixed
    model2 = ensemble.GradientBoostingRegressor(**other_params)
    optimized_GBM = GridSearchCV(estimator=model2,
                                 param_grid=cv_params,
                                 scoring='neg_mean_squared_error',
                                 cv=5,
                                 n_jobs=-1)
    optimized_GBM.fit(x_train_array, y_train_)
    evalute_result = optimized_GBM.cv_results_['mean_test_score']
    print('CV score for each candidate: {0}'.format(evalute_result))
    print('Best parameters: {0}'.format(optimized_GBM.best_params_))
    print('Best model score: {0}'.format(optimized_GBM.best_score_))
    return optimized_GBM
n_estimators
other_params = {'n_estimators': 500,
'max_depth': 4,
'min_samples_split': 5,
'learning_rate': 0.01,
'loss': 'ls'}
cv_params = {
'n_estimators':np.arange(100,1000,20)
}
opt = Tuning(cv_params,other_params,X_train,y_train)
CV score for each candidate: [-3638.72054129 -3508.14247212 -3439.1105988 -3411.26966851
-3379.19731255 -3360.20487765 -3352.48712657 -3338.42691658
-3336.43127856 -3348.41175323 -3358.69080437 -3363.01002418
-3364.25552115 -3368.95643225 -3371.23084952 -3366.64895522
-3368.50938829 -3364.62376276 -3364.70394263 -3366.93076129
-3371.23803066 -3359.48064698 -3369.62505326 -3369.78598905
-3365.66586979 -3370.87512061 -3369.83259861 -3376.26242639
-3380.41740687 -3382.51960697 -3386.68937081 -3386.81829059
-3384.7963568 -3392.2026194 -3392.36595787 -3388.17242236
-3392.12277562 -3398.91747749 -3398.48180057 -3403.58209999
-3402.30716946 -3414.45002626 -3407.77365138 -3411.9467736
-3415.43530731]
Best parameters: {'n_estimators': 260}
Best model score: -3336.431278559966
plt.plot(np.arange(100, 1000, 20), -opt.cv_results_['mean_test_score'])
plt.xlabel('n_estimators')
plt.ylabel('MSE')
(Plot: cross-validated MSE as a function of n_estimators)
max_depth, min_samples_split
other_params = {'n_estimators': 260,
'max_depth': 4,
'min_samples_split': 5,
'learning_rate': 0.01,
'loss': 'ls'}
cv_params = {
    'max_depth': np.arange(1, 10, 1),
    # min_samples_split must be >= 2, so the value 1 below is invalid
    # and produces the nan scores visible in the output
    'min_samples_split': np.arange(1, 10, 1)
}
opt = Tuning(cv_params,other_params,X_train,y_train)
CV score for each candidate: [ nan -3453.55583329 -3453.55583329 -3453.55583329
-3453.55583329 -3453.55583329 -3453.55583329 -3453.55583329
-3453.55583329 nan -3237.90479366 -3237.67289372
-3237.45142037 -3237.68103247 -3237.67289372 -3237.68103247
-3236.78414973 -3235.4499951 nan -3250.74529652
-3254.45251943 -3253.21801558 -3243.64809922 -3243.30183639
-3235.14340157 -3244.82045334 -3242.35242613 nan
-3355.95471162 -3360.62349942 -3338.32275136 -3339.72429801
-3334.6454409 -3347.79331607 -3354.73331653 -3336.45159072
nan -3509.94657776 -3520.39063496 -3550.73975953
-3500.9324746 -3504.3208005 -3507.2927551 -3495.2493565
-3520.18842566 nan -3788.9065026 -3704.52349726
-3771.5302907 -3718.43154173 -3684.04212671 -3638.52076809
-3613.72731613 -3626.09529817 nan -4138.08200942
-4064.96237157 -4059.46113609 -3808.23138872 -3772.45499181
-3767.67837104 -3664.89014985 -3671.63515663 nan
-4292.00667166 -4022.22606624 -4217.50200251 -3899.58259811
-3888.33607955 -3871.88324196 -3738.93376758 -3763.93186532
nan -4822.12744874 -4229.64371762 -4235.21717947
-3990.75377828 -3959.27233923 -3947.54141534 -3800.0615933
-3786.6664412 ]
Best parameters: {'max_depth': 3, 'min_samples_split': 7}
Best model score: -3235.143401566548
import seaborn as sns
# rows of the reshaped 9x9 grid correspond to max_depth, columns to min_samples_split
tem = pd.DataFrame(opt.cv_results_['mean_test_score'].reshape((9, 9)),
                   index=np.arange(1, 10), columns=np.arange(1, 10))
ax = sns.heatmap(tem, cmap="YlGnBu")
ax.set_xlabel('min_samples_split')  # the original code had these two axis labels swapped
ax.set_ylabel('max_depth')
(Plot: heatmap of cross-validated score over max_depth and min_samples_split)
The darker the color, the better the model performs.
learning_rate
other_params = {'n_estimators': 260,
'max_depth': 3,
'min_samples_split': 7,
'learning_rate': 0.01,
'loss': 'ls'}
cv_params = {
'learning_rate':np.arange(0.001,0.15,0.002)
}
opt = Tuning(cv_params,other_params,X_train,y_train)
CV score for each candidate: [-5090.71944225 -3845.05212113 -3417.5273061 -3265.14970145
-3241.40913551 -3251.13260325 -3268.19604547 -3272.649416
-3263.96975898 -3280.53948994 -3286.26207704 -3286.26587173
-3269.90531332 -3298.79342194 -3288.53832222 -3307.99225882
-3288.13691548 -3332.82450247 -3322.07393745 -3376.15111667
-3350.48082774 -3384.51386583 -3380.08816873 -3374.37578837
-3467.71684347 -3411.99228121 -3410.86145503 -3449.86503891
-3435.57844748 -3425.47660987 -3449.68465379 -3470.64282452
-3498.73422247 -3486.79052737 -3515.93874366 -3495.58656961
-3414.83130408 -3504.96138971 -3499.34007735 -3500.92962273
-3471.87179811 -3523.44306855 -3470.4962579 -3555.47838841
-3488.43021746 -3485.1058102 -3503.1991377 -3504.9920395
-3645.26374353 -3586.15554933 -3586.37086011 -3562.26475834
-3656.24829656 -3596.39360247 -3669.45092429 -3603.37561538
-3696.70074299 -3577.8236446 -3688.14665007 -3592.38420614
-3614.35427872 -3604.95552426 -3590.33626323 -3607.23228692
-3522.61196885 -3729.88575432 -3737.59523716 -3840.98280702
-3722.47608075 -3648.92769528 -3733.36891804 -3686.00970047
-3678.30293362 -3722.33284573 -3749.85363686]
Best parameters: {'learning_rate': 0.009000000000000001}
Best model score: -3241.409135508117
plt.plot(np.arange(0.001, 0.15, 0.002), -opt.cv_results_['mean_test_score'])
plt.xlabel('learning_rate')  # fixed typo: the original said 'leaning_rate'
plt.ylabel('MSE')
(Plot: cross-validated MSE as a function of learning_rate)
Final Results
mse = mean_squared_error(y_test, opt.best_estimator_.predict(X_test))
print("The mean squared error (MSE) on test set: {:.4f}".format(mse))
The mean squared error (MSE) on test set: 3396.9532
Test error: the MSE on the test set improved from 3753.6048 to 3396.9532.