Using XGBoost

咕叽咕叽小菜鸟

Article link: https://blog.csdn.net/u010366748/article/details/111083706

1. Introduction to XGBoost principles
2. XGBoost parameters
3. Usage examples
Full code
References

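The complete code used in this post is on my GitHub: https://github.com/qingyujean/Magic-NLPer. Likes and stars are much appreciated.

Articles in the ensemble learning series:

A summary of ensemble learning principles (AdaBoost & lightGBM demo)
A summary of gradient boosted decision tree (GBDT) principles
Using XGBoost
A summary of random forest principles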

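1. Introduction to XGBoost principles

XGBoost is still GBDT at its core, but it pushes the speed and efficiency of the algorithm to the extreme, hence the name X(Extreme)GBoost.

XGBoost improves on GBDT mainly in the following respects:

Optimizations to the algorithm itself
- Choice of weak learner: besides tree models, linear models and others are also supported.
- Loss function: L1 and L2 regularization terms are added to guard against overfitting.
- A second-order Taylor expansion of the loss function around the current model's prediction is used to approximate the objective, and each step fits a new weak learner to this approximation. Because both the first and second derivatives of the loss are used, convergence is faster.
Optimizations to runtime efficiency: within the construction of each weak learner, e.g. a decision tree, the search for the best split is parallelized (the weak learners themselves are still built one after another).
Optimizations to robustness: missing feature values are handled natively.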

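For a more detailed introduction to the principles, see Liu Jianping's (刘建平) blog post: XGBoost算法原理小结 (A summary of XGBoost algorithm principles).

Why does XGBoost use a Taylor expansion, and what is the advantage?
XGBoost uses both the first and second derivatives of the loss; the second derivative lets each step move faster and more accurately than plain gradient descent. Expanding the loss to second order in the prediction means that split finding and leaf optimization need only each sample's first and second derivative values, without fixing the concrete form of the loss function. In effect this separates the choice of loss function from the optimization of the model and the selection of its parameters. This decoupling widens XGBoost's applicability: the loss function can be chosen as needed (any loss that is twice differentiable with respect to the prediction), so the same machinery serves both classification and regression.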
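
To make the decoupling concrete, the standard second-order approximation can be sketched as follows. The symbols $g_i$, $h_i$, $w_j$, $T$, $I_j$, $G_j$ and $H_j$ are notation introduced here for illustration (following the usual presentation of XGBoost), and the L1 term is left out for brevity.

```latex
\begin{aligned}
\mathcal{L}^{(t)} &\approx \sum_{i=1}^{n}\Big[\, l\big(y_i,\hat{y}_i^{(t-1)}\big)
      + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \Big] + \Omega(f_t), \\
g_i &= \frac{\partial\, l\big(y_i,\hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2 l\big(y_i,\hat{y}_i^{(t-1)}\big)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^2}, \qquad
\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2, \\
w_j^{*} &= -\frac{G_j}{H_j+\lambda}, \qquad
G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i .
\end{aligned}
```

Here $f_t$ is the tree added in round $t$, $T$ is its number of leaves, $I_j$ is the set of samples falling into leaf $j$, and $w_j^{*}$ is the optimal weight of that leaf. The loss enters only through $g_i$ and $h_i$, which is why the tree-building step does not need to know which loss was chosen.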

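2. XGBoost parameters

First, three important pages of the official XGBoost documentation:

(1) XGBoost Parameters, the parameter reference: https://xgboost.readthedocs.io/en/latest/parameter.html
(2) Notes on Parameter Tuning, the tuning guide: https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html
(3) Awesome XGBoost, a collection of excellent use cases, tutorials and examples: https://github.com/dmlc/xgboost/tree/master/demo

A brief note on XGBoost Parameters: there are three kinds of parameters, namely general parameters, booster (base model) parameters (for example for the Tree Booster or the Linear Booster), and learning task parameters. Some commonly used ones are described below: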

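General Parameters
- booster: the base model; the default is gbtree
- verbosity: how much is printed: 0 (silent, no output), 1 (warning), 2 (info), 3 (debug)
- nthread: number of threads
Parameters for Tree Booster (other boosters, such as gblinear, have their own parameters)
- eta: learning rate, range [0, 1], default 0.3
- gamma: a threshold, the minimum loss reduction required to split a leaf node further. The larger gamma is, the more conservative the algorithm becomes and the less eagerly it splits, which helps control overfitting. Range [0, $\infty$], default 0.
- max_depth: maximum tree depth, range [0, $\infty$], default 6. Smaller values help control overfitting.
- min_child_weight: minimum sum of instance weights (Hessian) required in a child node; a split that would leave a child below this value is not made. Range [0, $\infty$], default 1.
- subsample: fraction of the full training set sampled for each tree, default 1
- colsample_bytree: fraction of columns considered when building each tree
- colsample_bylevel: fraction of columns considered at each depth level
- colsample_bynode: fraction of columns considered at each node split
- lambda: L2 regularization strength
- alpha: L1 regularization strength
- scale_pos_weight: weight of positive examples relative to negative ones, used to balance the classes when the data is imbalanced. The default is 1, i.e. positive and negative samples carry the same weight.
Learning Task Parameters
- objective: the objective function
- eval_metric: evaluation metric (on the validation set), e.g. "rmse": root mean squared error, "mae": mean absolute error, "error": binary classification error rate, and so on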
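
To illustrate how gamma and lambda act during tree building, here is a minimal NumPy sketch of the usual split-gain formula. It is not XGBoost's actual implementation: the helper names `leaf_score` and `split_gain` and the toy data are assumptions made for this example, and the L1 term (alpha) is ignored.

```python
import numpy as np

def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)

def split_gain(grad, hess, left_mask, reg_lambda=1.0, gamma=0.0):
    """Loss reduction obtained by splitting one leaf into two children.

    A split is only worthwhile when the returned value is positive,
    so a larger gamma rejects more candidate splits.
    """
    g_l, h_l = grad[left_mask].sum(), hess[left_mask].sum()
    g_r, h_r = grad[~left_mask].sum(), hess[~left_mask].sum()
    children = leaf_score(g_l, h_l, reg_lambda) + leaf_score(g_r, h_r, reg_lambda)
    parent = leaf_score(g_l + g_r, h_l + h_r, reg_lambda)
    return 0.5 * (children - parent) - gamma

# Toy example with squared-error loss, where grad = pred - label and hess = 1.
labels = np.array([0.0, 0.0, 1.0, 1.0])
pred = np.full(4, 0.5)
grad = pred - labels
hess = np.ones(4)
left_mask = np.array([True, True, False, False])  # hypothetical candidate split

gain_no_penalty = split_gain(grad, hess, left_mask, reg_lambda=1.0, gamma=0.0)
gain_with_gamma = split_gain(grad, hess, left_mask, reg_lambda=1.0, gamma=1.0)
print(gain_no_penalty, gain_with_gamma)
```

In this toy case the same candidate split has a positive gain with gamma=0 and a negative gain with gamma=1, so it would be kept in the first case and rejected in the second. Increasing reg_lambda shrinks every leaf score and therefore also lowers the gain.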

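[Note]:

xgboost has two Python APIs: the native API and the sklearn-style API. They differ mainly in parameter naming and in how datasets are passed; for example, the native API needs the dataset wrapped in the DMatrix data structure.
For example, the number of estimators, i.e. the number of boosting iterations or equivalently the number of residual trees, goes by several names: num_boost_round in the native API (xgb.train), n_estimators in the sklearn-style API, and num_round in the command-line version. All of them mean the same thing. (num_iterations is the LightGBM name for this setting, not an XGBoost parameter.)
The code examples below demonstrate both APIs.

Python native API docs: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training

Python sklearn API docs: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn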

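A brief note on XGBoost Parameter Tuning:

Control Overfitting: parameters that control overfitting
- From the tree-structure side, limit the complexity of the trees: max_depth, min_child_weight, gamma
- From the sampling side, add randomness: subsample, colsample_bytree
- The learning rate eta can also be reduced, but remember to increase the number of training rounds num_round at the same time
Handle Imbalanced Dataset: dealing with imbalanced datasets
- scale_pos_weight: the ratio of negative to positive samples is a typical value; positive and negative samples then contribute differently to the loss
- max_delta_step: setting it to a finite value (e.g. 1) helps convergence when the classes are imbalanced and well-calibrated predicted probabilities matter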
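3. Usage examples

Installing xgboost: pip3 install xgboost

A simple XGBoost training demo

First specify a set of parameters param and train a simple demo model. The data is in libsvm format, which suits sparsely stored data. For example, in 0 1:1 9:1 19:1 21:1 24:1 … the first column is the target (here target=0) and the remaining entries are index:value pairs.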
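
To show how such a line is read, here is a minimal pure-Python sketch that parses one libsvm-format line. The helper `parse_libsvm_line` is hypothetical and written only for illustration; xgb.DMatrix has its own reader, and this sketch does not handle optional extras such as comments or query ids.

```python
def parse_libsvm_line(line):
    """Parse one libsvm-format line into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])          # first column is the target
    features = {}
    for token in tokens[1:]:          # remaining columns are index:value pairs
        index, value = token.split(':')
        features[int(index)] = float(value)
    return label, features

label, features = parse_libsvm_line('0 1:1 9:1 19:1 21:1 24:1')
print(label, features)
```

Only the non-zero features are listed on each line, which is what makes the format compact for sparse data: any index that does not appear is implicitly zero (or missing).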

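import numpy as np
import xgboost as xgb

cwd = '/home/xijian/pycharm_projects/Magic-NLPer/MachineLearning/'
data_dir = cwd+'XGBoostUsage/data/'
# Load the datasets
# xgb.DMatrix() can read libsvm-format data directly
# (newer XGBoost versions may require '?format=libsvm' appended to the path)
dtrain = xgb.DMatrix(data_dir + 'agaricus.txt.train')
dtest = xgb.DMatrix(data_dir + 'agaricus.txt.test')

# Model parameters
param = {'max_depth':2,  # tree depth
         'eta': 1,  # learning rate
         'verbosity': 0,  # silent
         'objective': 'binary:logistic'}

watch_list = [(dtest, 'eval'), (dtrain, 'train')]
number_round = 10 # run 10 rounds (10 trees)
# Train
model = xgb.train(param, dtrain, num_boost_round=number_round, evals=watch_list)

The metric on the test and training sets is printed after each round; here it is the error rate (from XGBoost 1.3 on, the default metric for binary:logistic is logloss, so add 'eval_metric': 'error' to param to get the error rate):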

[0]	eval-error:0.04283	train-error:0.04652
[1]	eval-error:0.02173	train-error:0.02226
[2]	eval-error:0.00621	train-error:0.00706
[3]	eval-error:0.01800	train-error:0.01520
[4]	eval-error:0.00621	train-error:0.00706
[5]	eval-error:0.00000	train-error:0.00123
[6]	eval-error:0.00000	train-error:0.00123
[7]	eval-error:0.00000	train-error:0.00123
[8]	eval-error:0.00000	train-error:0.00123
[9]	eval-error:0.00000	train-error:0.00000

Predicting with the trained model:

# Predict
pred = model.predict(dtest) # returns a numpy.ndarray of probabilities
print(type(pred), pred.dtype, pred.shape) # (1611,)

# Ground truth
labels = dtest.get_label()
print(type(labels), labels.dtype, labels.shape)
print(labels[:10])

# Count the misclassified samples, thresholding the probability at 0.5
error_num = sum(1 for i in range(len(pred)) if int(pred[i]>0.5)!=labels[i])
print(error_num)

Output:

<class 'numpy.ndarray'> float32 (1611,)
[0. 1. 0. 0. 0. 0. 1. 0. 1. 0.]
0

Using cross-validation

number_round = 5 # number of boosting rounds
# nfold=5: 5-fold cross-validation
xgb.cv(param, dtrain, number_round, nfold=5, metrics={'error'}, seed=3)

Output:

[Figure: cross-validation error output]

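Advanced: adjusting sample weights

The fpreproc argument can be used for preprocessing, e.g. computing the ratio of negative to positive samples and setting the scale_pos_weight parameter

# Check the ratio of negative to positive examples, then adjust the weight accordingly
def preproc(dtrain, dtest, param):
    labels = dtrain.get_label()
    ratio = float(np.sum(labels==0))/np.sum(labels==1)
    param['scale_pos_weight'] = ratio
    return (dtrain, dtest, param)

xgb.cv(param, dtrain, number_round, nfold=5, metrics={'auc'}, seed=3, fpreproc=preproc)
# The closer the auc is to 1, the better

Output:

[Figure: cross-validation auc output]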

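Advanced: custom objective function (loss function)

Objective function: measures how close the predictions are to the labels; larger is better, optimized by gradient ascent

Loss function: measures how far the predictions are from the labels; smaller is better, optimized by gradient descent

To use a custom objective function in xgboost, you have to implement its first and second derivatives yourself

[Note]:
In the logistic regression chapter, the gradient was X.T.dot(h_x-y); why is only (h_x-y) written below?

Because the p-y.label here is only the "common part" of that gradient. XGBoost asks for the derivative of the loss with respect to each sample's prediction (the raw score), which is p-y.label. The X.T factor comes from differentiating the prediction with respect to the weights of a linear model; it differs per sample, so it cannot be computed in advance. Hence p-y.label gives the coefficient part of the gradient, and it is multiplied by each sample's own data only when the samples are actually used.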
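
As a sanity check on these derivatives, the following NumPy sketch compares the analytic first and second derivatives of the log loss with respect to the raw score against central finite differences. The helper names and the sample values are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logloss(z, y):
    """Per-sample negative log-likelihood as a function of the raw score z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([-2.0, -0.5, 0.3, 1.5])   # raw scores (margins), illustrative values
y = np.array([0.0, 1.0, 1.0, 0.0])     # labels

p = sigmoid(z)
grad = p - y            # analytic first derivative w.r.t. z
hess = p * (1 - p)      # analytic second derivative w.r.t. z

# Central finite differences as an independent check.
eps = 1e-4
num_grad = (logloss(z + eps, y) - logloss(z - eps, y)) / (2 * eps)
num_hess = (logloss(z + eps, y) - 2 * logloss(z, y) + logloss(z - eps, y)) / eps ** 2

print(np.max(np.abs(grad - num_grad)), np.max(np.abs(hess - num_hess)))
```

Both printed differences should be close to zero, which confirms that p - y and p(1 - p) are the per-sample derivatives with respect to the raw score, with no feature matrix involved.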

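# Custom objective function: log-likelihood, also used with cross-validation below
# Must return the first and second derivatives
def logregobj(pred, dtrain):
    labels = dtrain.get_label()
    pred = 1. / (1+np.exp(-pred)) # sigmoid: raw score -> probability
    grad = pred - labels # first derivative
    hess = pred*(1-pred) # second derivative (Hessian); sigmoid derivative: g'(x)=g(x)(1-g(x))
    return grad, hess

# Custom evaluation function: with a custom objective, pred is the raw score, not a probability
def evalerror(pred, dtrain):
    labels = dtrain.get_label()
    error_num = float(sum(labels!=(pred>0.))) # sigmoid g(z)>0.5 requires z>0.
    return 'error', error_num/len(labels)

# Model parameters
param = {'max_depth':2,  # tree depth
         'eta': 1,
         'verbosity': 0,}
watch_list = [(dtest, 'eval'), (dtrain, 'train')]
number_round = 5 # number of boosting rounds

# Train with the custom objective function
model = xgb.train(param, dtrain,
                  num_boost_round=number_round,
                  evals=watch_list,
                  obj=logregobj, # objective function
                  feval=evalerror) # evaluation function (feval is deprecated in favor of custom_metric in newer XGBoost versions)

Output (param sets no built-in objective, so XGBoost also reports rmse, the default metric of its default regression objective; it is computed on the raw scores and can be ignored here):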

[0]	eval-rmse:1.59229	train-rmse:1.59597	eval-error:0.04283	train-error:0.04652
[1]	eval-rmse:2.40519	train-rmse:2.40977	eval-error:0.02173	train-error:0.02226
[2]	eval-rmse:2.88253	train-rmse:2.87459	eval-error:0.00621	train-error:0.00706
[3]	eval-rmse:3.62808	train-rmse:3.63621	eval-error:0.01800	train-error:0.01520
[4]	eval-rmse:3.80794	train-rmse:3.83893	eval-error:0.00621	train-error:0.00706

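# 5-fold cross-validation
xgb.cv(param, dtrain, number_round, nfold=5, seed=3, obj=logregobj, feval=evalerror)

[Figure: cross-validation error output]

Predicting with the first n trees

number_round = 5 means 5 rounds were run, so 5 trees were produced

But predictions can be made using only the first few trees

# iteration_range=(0, n) uses the first n rounds (available since XGBoost 1.4);
# older versions used ntree_limit=n, which recent releases no longer accept
pred1 = model.predict(dtest, iteration_range=(0, 1))
print(evalerror(pred1, dtest))

Output:

('error', 0.04283054003724395)

pred2 = model.predict(dtest, iteration_range=(0, 2))
print(evalerror(pred2, dtest))

Output:

('error', 0.021725636250775917)

pred3 = model.predict(dtest, iteration_range=(0, 3))
print(evalerror(pred3, dtest))

Output: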

('error', 0.006207324643078833)

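Plotting feature importance

%matplotlib inline
from xgboost import plot_importance
from matplotlib import pyplot as plt

plot_importance(model, max_num_features=10)

[Figure: feature importance plot]

Using XGBoost together with sklearn

We use datasets bundled with sklearn, such as handwritten digits (a 10-class problem), iris (a 3-class problem), and Boston house prices (a regression problem).

sklearn's KFold, train_test_split and so on can all be combined with xgboost. The xgboost usage below mainly relies on the sklearn-style API, e.g. xxClassifier, xxRegressor, fit, predict, etc.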

import pickle
import xgboost as xgb
import numpy as np

from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, mean_squared_error
# load_boston was removed in scikit-learn 1.2; the Boston examples below need an older version
from sklearn.datasets import load_iris, load_digits, load_boston

# 10-class problem
# Model with XGBoost, evaluate with sklearn; here a confusion matrix is used
# Load the data (handwritten digits)
digits = load_digits()
print(digits.keys())
y = digits['target']
X = digits['data']
print(X.shape) # (1797, 64)
print(y.shape)

Output:

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
(1797, 64)
(1797,)

# K-fold splitter
kf = KFold(n_splits=2, shuffle=True, random_state=1234) # 2 folds
for train_index, test_index in kf.split(X):
    # No params are set on the model here; all defaults are used
    xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index])
    pred = xgb_model.predict(X[test_index])
    ground_truth = y[test_index]
    print(confusion_matrix(ground_truth, pred))
    print()

Output:

[[ 78   0   0   0   0   0   0   0   1   0]
 [  0  90   0   0   0   1   0   0   0   2]
 [  0   1  82   0   0   0   3   0   0   0]
 [  0   0   1  89   0   0   0   1   0   3]
 [  2   0   0   0 101   0   1   2   1   0]
 [  0   0   0   1   0  96   2   0   0   3]
 [  0   3   0   0   1   0  83   0   1   0]
 [  0   0   0   0   1   0   0  86   0   1]
 [  0   6   1   3   0   1   0   0  71   0]
 [  0   0   0   0   1   1   0   5   2  71]]

[[97  0  0  0  0  1  0  1  0  0]
 [ 0 86  0  1  0  0  1  0  0  1]
 [ 0  0 90  1  0  0  0  0  0  0]
 [ 0  1  0 86  0  1  0  0  1  0]
 [ 0  1  0  0 72  0  0  0  0  1]
 [ 1  0  0  0  0 72  0  0  2  5]
 [ 1  0  0  0  0  1 91  0  0  0]
 [ 0  0  0  0  0  0  0 90  1  0]
 [ 1  2  0  0  0  1  0  2 85  1]
 [ 0  6  0  1  0  1  0  0  1 91]]

# 3-class problem (iris)
iris = load_iris()
y_iris = iris['target']
X_iris = iris['data']

kf = KFold(n_splits=2, shuffle=True, random_state=1234) # 2 folds
for train_index, test_index in kf.split(X_iris):
    xgb_model = xgb.XGBClassifier().fit(X_iris[train_index], y_iris[train_index])
    pred = xgb_model.predict(X_iris[test_index])
    ground_truth = y_iris[test_index]
    print(confusion_matrix(ground_truth, pred))
    print()

Output:

[[25  0  0]
 [ 0 24  1]
 [ 0  0 25]]

[[25  0  0]
 [ 0 23  2]
 [ 0  1 24]]

# Regression problem (Boston house price prediction)
boston = load_boston()
# print(type(boston))
X_boston = boston['data']
y_boston = boston['target']

kf = KFold(n_splits=2, shuffle=True, random_state=1234) # 2 folds
for train_index, test_index in kf.split(X_boston):
    # Use the regressor here: XGBRegressor
    xgb_model = xgb.XGBRegressor().fit(X_boston[train_index], y_boston[train_index])
    pred = xgb_model.predict(X_boston[test_index])
    ground_truth = y_boston[test_index]
    # This is a regression problem, so the metric is switched to mse
    print('mse:', mean_squared_error(ground_truth, pred))
    print()

Output:

mse: 11.431804163616869

mse: 15.365480950058584

Hyperparameter optimization: grid search

# Regression problem (Boston house price prediction)
boston = load_boston()
X_boston = boston['data']
y_boston = boston['target']
xgb_model = xgb.XGBRegressor()

# Parameter grid
param_dict = {'max_depth': [2,4,6], # maximum tree depth
              'n_estimators': [50, 100, 200]} # number of trees

rgs = GridSearchCV(xgb_model, param_dict)
rgs.fit(X_boston, y_boston)

Output:

GridSearchCV(estimator=XGBRegressor(base_score=None, booster=None,
                                    colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, gamma=None,
                                    gpu_id=None, importance_type='gain',
                                    interaction_constraints=None,
                                    learning_rate=None, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=nan, monotone_constraints=None,
                                    n_estimators=100, n_jobs=None,
                                    num_parallel_tree=None, random_state=None,
                                    reg_alpha=None, reg_lambda=None,
                                    scale_pos_weight=None, subsample=None,
                                    tree_method=None, validate_parameters=None,
                                    verbosity=None),
             param_grid={'max_depth': [2, 4, 6],
                         'n_estimators': [50, 100, 200]})

Print the best parameters: