XGBoost in Practice


Example code files:
Otto Group Product Classification Challenge
(Classify products into the correct category)

Competition page: https://www.kaggle.com/c/otto-group-product-classification-challenge/data

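I. Feature Engineering

1. Data analysis

Toolkits for exploratory data analysis: pandas, matplotlib / seaborn
Read the training data, inspect a small sample, and check the data size and data types

– Labels, meaning of each feature, feature types, etc.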
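
A minimal sketch of this first look with pandas. The inline CSV is a made-up stand-in that assumes the Otto layout (an id column, count-valued feat_* columns, and a target column); with the real data the read would point at train.csv instead.

```python
import io
import pandas as pd

# Stand-in for train.csv: a few made-up rows in the assumed Otto layout
# (id, feat_1 ... feat_n count features, target class label).
csv_text = """id,feat_1,feat_2,feat_3,target
1,1,0,0,Class_1
2,0,0,3,Class_2
3,0,5,0,Class_2
4,2,0,1,Class_3
"""
# With the real data this would be: train = pd.read_csv("train.csv")
train = pd.read_csv(io.StringIO(csv_text))

print(train.head())      # a small sample of rows
print(train.shape)       # (number of rows, number of columns)
train.info()             # column dtypes and non-null counts
print(train.describe())  # summary statistics of the numeric columns
```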

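Analyze the distribution of each feature column

– Histograms
– Include the label column (for classification, this shows whether the classes are balanced)
– Detect outliers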
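
The checks above could be sketched as follows. The data is synthetic, and the 1.5 × IQR rule in the hypothetical helper iqr_outliers is only one common outlier heuristic, not a method prescribed by the original notes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the training table (assumption, not the real data):
# two count-valued features and an imbalanced 3-class label.
train = pd.DataFrame({
    "feat_1": rng.poisson(1.0, size=200),
    "feat_2": rng.poisson(3.0, size=200),
    "target": rng.choice(["Class_1", "Class_2", "Class_3"], size=200, p=[0.6, 0.3, 0.1]),
})

# Label distribution: relative class frequencies show whether classes are balanced.
class_share = train["target"].value_counts(normalize=True)
print(class_share)

# Text version of a histogram: how often each value of feat_1 occurs.
# (train.hist() would plot binned histograms with matplotlib.)
print(train["feat_1"].value_counts().sort_index())

def iqr_outliers(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - k * iqr) | (col > q3 + k * iqr)

for name in ["feat_1", "feat_2"]:
    print(name, "outliers:", int(iqr_outliers(train[name]).sum()))
```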

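Analyze the correlation between each pair of feature columns

– Whether features carry redundant information
– Whether a feature is linearly correlated with the label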
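
A short sketch of both checks using pandas' Pearson correlation. The data is synthetic, the 0.9 redundancy threshold is an arbitrary choice, and a binary label is used for simplicity (a multi-class label such as Otto's would first need a numeric encoding, e.g. one indicator column per class).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Synthetic data (assumption): feat_b is nearly a copy of feat_a,
# feat_c is independent noise, and the label depends on feat_a.
feat_a = rng.normal(size=n)
df = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_a + 0.1 * rng.normal(size=n),
    "feat_c": rng.normal(size=n),
    "label": (feat_a > 0).astype(int),
})

# Pearson correlation between every pair of columns.
corr = df.corr()
print(corr.round(2))

# Feature pairs with |r| above a threshold are candidates for redundancy.
features = ["feat_a", "feat_b", "feat_c"]
redundant = [
    (a, b)
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print("highly correlated feature pairs:", redundant)

# Linear correlation of each feature with the label.
print(corr["label"].drop("label").sort_values(ascending=False))
```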

Feature engineering

II. Evaluation Metrics


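Scikit-Learn: sklearn.metrics

The metrics module also provides functions that evaluate prediction error for other purposes
– The evaluation functions for classification tasks are listed in the table; for other tasks see:
https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics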
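
As an illustration of one function from sklearn.metrics: the Otto competition is scored with multi-class log loss, which matches the objective and scores used later in these notes. The sketch below computes it with log_loss and checks it against the definition by hand; the probabilities are made up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Toy example: 4 samples, 3 classes (values are made up for illustration).
y_true = np.array([0, 1, 2, 1])
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],
])

# Multi-class log loss: mean negative log-probability assigned to the true class.
loss = log_loss(y_true, y_prob, labels=[0, 1, 2])

# The same quantity computed by hand, as a sanity check.
manual = -np.mean(np.log(y_prob[np.arange(len(y_true)), y_true]))

# Accuracy only looks at the arg-max class and ignores confidence.
acc = accuracy_score(y_true, y_prob.argmax(axis=1))
print(f"log loss = {loss:.4f}, manual = {manual:.4f}, accuracy = {acc:.2f}")
```

Accuracy is perfect here, yet the log loss is not zero: the metric penalizes correct predictions that are made with low confidence, such as the last row.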

[Table image not preserved: classification evaluation functions in sklearn.metrics]

III. Parameter Tuning


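General parameters: these usually do not need tuning; the defaults are fine
Learning task parameters: determined by the task; once chosen they usually need no further tuning
Booster parameters: parameters of the weak learners; they need careful tuning because they affect model performance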

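General parameters

• booster: type of weak learner
– Either gbtree (tree model) or gblinear (linear model)
– Default is gbtree (tree models are nonlinear and can handle more complex tasks)

• silent: whether to enable silent mode
– 1: silent mode on, no messages are printed
– Default 0: prints some intermediate information that helps us follow the state of the model
– Newer XGBoost releases replace silent with the verbosity parameter

• nthread: number of threads
– Default -1, meaning all CPU cores of the system are used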

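Learning task parameters

• objective: loss function
– Supports classification / regression / ranking

• eval_metric: evaluation function

• seed: random number seed
– Default 0
– Fixing the seed makes runs that involve randomness reproducible; the seed can also be varied as part of tuning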

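Booster parameters

These are the parameters of the weak learner. Although two boosters are available, only gbtree is covered here

• learning_rate (shrinkage step size) vs. n_estimators (number of trees)
– A smaller learning rate usually requires more weak learners
– A small learning rate (η < 0.1) with a large number of weak learners n_estimators is usually recommended
– One option is to set a small learning rate and then choose n_estimators by cross-validation
• Row (subsample) and column (colsample_bytree, colsample_bylevel) subsampling ratios
– All default to 1, i.e. no subsampling, all data is used
– Random subsampling often works better than the deterministic procedure on the full data, and it is faster
– Suggested values: 0.3 - 0.8
• Maximum tree depth: max_depth
– The larger max_depth is, the more complex the model, and the more specific, local patterns it learns from the samples
– Tune with cross-validation; the default is 6, suggested range 3-10
• min_child_weight: minimum sum of instance weights in a child node
– If a split would leave a child node whose sum of instance weights is below min_child_weight, that split is not made and the node stays a leaf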
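
The three parameter groups could be written down as a plain configuration like the one below. It is only a sketch of a starting point: the group variable names are hypothetical, num_class = 9 assumes the nine Otto product categories, and the booster values are the defaults quoted above. In the native API the number of trees is passed separately as num_boost_round rather than in this dictionary.

```python
# Hypothetical starting configuration, grouped the way the text groups parameters.
general_params = {
    "booster": "gbtree",  # tree-based weak learners (the alternative is "gblinear")
}
objective_params = {
    "objective": "multi:softprob",  # multi-class, outputs per-class probabilities
    "num_class": 9,                 # assumption: Otto has 9 product categories
    "eval_metric": "mlogloss",      # multi-class log loss
    "seed": 0,
}
booster_params = {
    "learning_rate": 0.1,      # shrinkage step size
    "max_depth": 6,            # default depth quoted in the text
    "min_child_weight": 1,
    "subsample": 1.0,          # row sampling: 1 means no subsampling
    "colsample_bytree": 1.0,   # column sampling per tree
    "colsample_bylevel": 1.0,  # column sampling per tree level
}

# Merged dictionary in the form the native xgboost API expects.
params = {**general_params, **objective_params, **booster_params}
print(params)
```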

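The three most important parameters are the number of trees, the tree depth, and the learning rate

Suggested tuning strategy:

– Try the default configuration first

– If the model overfits, lower the learning rate

– If the model underfits, raise the learning rate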

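Suggestions from frequent XGBoost use:

– n_estimators and learning_rate: fix n_estimators at 100 (not large, because the trees are deep and each tree is fairly complex), then tune learning_rate

– Tree depth max_depth: start at 6, then increase step by step

– min_child_weight: 1/sqrt(rare_events), where rare_events is the number of rare events

– Column sampling colsample_bytree / colsample_bylevel: grid search within [0.3, 0.5]

– Row sampling subsample: fix at 1

– gamma: fix at 0.0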
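
A small worked example of the min_child_weight rule of thumb, together with the other suggestions written as fixed values and a search grid. The helper name, the example event counts, and the 0.1 grid step are illustrative assumptions.

```python
import math

def suggested_min_child_weight(rare_events: int) -> float:
    """Rule of thumb quoted above: 1 / sqrt(rare_events)."""
    if rare_events <= 0:
        raise ValueError("rare_events must be positive")
    return 1.0 / math.sqrt(rare_events)

# Hypothetical counts of rare events, for illustration only.
for n in (25, 100, 400):
    print(n, suggested_min_child_weight(n))

# The remaining suggestions as fixed values plus a small search grid.
fixed = {"n_estimators": 100, "max_depth": 6, "subsample": 1.0, "gamma": 0.0}
grid = {
    "colsample_bytree": [0.3, 0.4, 0.5],
    "colsample_bylevel": [0.3, 0.4, 0.5],
}
```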

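General approach to parameter tuning:

• Choose a relatively high learning rate and find the ideal number of trees for it
– Start from the toolkit default learning rate of 0.1
– XGBoost's own cv function runs cross-validation at each boosting iteration and returns the ideal number of trees. Because cross-validation is slow, both interfaces can be imported: the native xgboost module (use cv to choose the number of trees) and XGBClassifier, xgboost's sklearn wrapper (use GridSearchCV to tune the other parameters)
• For the given learning rate and number of trees, tune the tree parameters
(max_depth, min_child_weight, gamma, subsample, colsample_bytree, colsample_bylevel)
• Tune xgboost's regularization parameters (lambda, alpha)
• Lower the learning rate and settle on the final parameters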

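Example

1. Use the default parameters, with learning_rate = 0.1 (fairly large), and look for a suitable range of n_estimators
– Set n_estimators to 1000 with early_stopping_rounds = 50
– The cv function stops at n_estimators = 699
– The cv test error is 0.482744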

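2.1 Tuning max_depth and min_child_weight
– These two parameters have a large effect on the result
– First do a coarse search over a wide range (step 2), then fine-tune over a narrow range
– max_depth = range(3,10,2)
– min_child_weight = range(1,6,2)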

2.2 Fine-tuning max_depth and min_child_weight
– Fine-tune around max_depth=7 and min_child_weight=5
– max_depth = [6,7,8]
– min_child_weight = [4,5,6]

2.3 After setting max_depth=6 and min_child_weight=4, tune n_estimators again
– Set n_estimators to 1000 with early_stopping_rounds = 50
– The cv function stops at n_estimators = 645

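3. Tuning gamma
– The default (0) usually performs reasonably well; tune it if computing resources allow (omitted)

4. Row and column sampling: subsample and colsample_bytree
– These two parameters subsample the samples and the features respectively, which makes the model more robust
– First search within [0.3, 1.0], with a step of 0.1 for the coarse search
– Use a step of 0.05 for fine-tuning (omitted)

Note: colsample_bylevel in the range 0.7-0.8 usually gives good results. It can be tuned further if computing resources allow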

5. Tuning the regularization parameters: reg_alpha (L1) and reg_lambda (L2)
– reg_alpha = [0.1, 1] # L1, default = 0
– reg_lambda = [0.5, 1, 2] # L2, default = 1
– Best: -0.458950 using {'reg_alpha': 1, 'reg_lambda': 0.5}

Within this grid, a smaller reg_lambda and a larger reg_alpha both appear to help
If computing resources allow, increase reg_alpha and decrease reg_lambda further

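6. Lower the learning rate and adjust the number of trees
– Call xgboost's cv function
• 0.1: converges at 669 trees, cv test error 0.469456
• 0.01: not yet converged at 2000 trees, cv test error 0.501644
• 0.05: converges at 1386 trees, cv test error 0.465813

    from xgboost import XGBClassifier

    # Final model: tuned parameters with the lowered learning rate
    xgb6 = XGBClassifier(
        learning_rate=0.05,
        n_estimators=2000,
        max_depth=6,
        min_child_weight=4,
        subsample=0.7,
        colsample_bytree=0.6,
        colsample_bylevel=0.7,
        reg_alpha=1,
        reg_lambda=0.5,
        objective='multi:softprob',
        random_state=3,  # the original used seed=3; newer releases name it random_state
    )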

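IV. Parallel Processing in XGBoost

V. XGBoost Summary

• XGBoost is a nonparametric model for supervised learning
– Objective function (loss function plus regularization term)
– Parameters (the split feature and threshold at each branch of every tree)
– Optimization: gradient descent
• Parameter tuning
– Important parameters that determine model complexity: learning_rate, n_estimators, max_depth, min_child_weight, gamma, reg_alpha, reg_lambda
– The random sampling parameters also affect how well the model generalizes: subsample, colsample_bytree, colsample_bylevel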

• XGBoost official documentation: https://xgboost.readthedocs.io/en/latest/
– Python API: http://xgboost.readthedocs.io/en/latest/python/python_api.html
• Github: https://github.com/dmlc/xgboost
– Many useful resources: https://github.com/dmlc/xgboost/blob/master/demo/README.md
– GPU acceleration:
https://github.com/dmlc/xgboost/blob/master/plugin/updater_gpu/README.md
• How XGBoost works: XGBoost: A Scalable Tree Boosting System
– https://arxiv.org/abs/1603.02754

• XGBoost parameter tuning:
– https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuningxgboost-with-codes-python/
– Chinese version: http://blog.csdn.net/u010657489/article/details/51952785
• Owen Zhang, Winning Data Science Competitions
– https://www.slideshare.net/OwenZhang2/tips-for-data-sciencecompetitions?from_action=save
• XGBoost User Group:
– https://groups.google.com/forum/#!forum/xgboost-user/