XGBoost Parameters Explained

weijian001

Article link: https://blog.csdn.net/wj1066/article/details/78796897

This article is based on Complete Guide to Parameter Tuning in XGBoost (with codes in Python). It is largely a translation of that guide, with a few additions of my own.

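Advantages of XGBoost

Regularization
- The standard GBM implementation has no regularization.
- XGBoost is also known as a "regularized boosting" technique.
Parallel processing
- XGBoost implements parallel processing.
- Boosting is sequential, so in principle the trees cannot be built in parallel. XGBoost instead parallelizes at the feature level, not at the tree level.
- XGBoost also supports running on Hadoop.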
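High flexibility
- XGBoost lets users define custom optimization objectives and evaluation criteria.
Handling missing values
- XGBoost has its own built-in routine for handling missing values.
Tree pruning
- GBM stops splitting a node as soon as it encounters a negative loss reduction, which is a greedy strategy (pre-pruning).
- XGBoost keeps splitting until max_depth is reached and only then prunes backward, removing the splits that bring no positive gain (post-pruning).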
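Built-in cross-validation
- XGBoost can run cross-validation at every iteration of the boosting process, so the optimal number of boosting iterations can be determined in a single run.
- With GBM, a grid search is needed to find the optimal number.
Continuing training on an existing model
- This is very useful in certain scenarios.
- The sklearn implementation of GBM also has this feature.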
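To make the pre-pruning versus post-pruning difference concrete, here is a toy sketch in plain Python. It is not XGBoost code: the tree is reduced to a single chain of candidate splits, and the `net_gains` values (loss reduction minus whatever threshold applies) are made-up numbers. The function names `pre_prune` and `post_prune` are hypothetical.

```python
def pre_prune(net_gains):
    """Greedy stop: keep splits until the first one without positive net gain."""
    kept = []
    for gain in net_gains:
        if gain <= 0:
            break  # stop growing as soon as a split does not pay off
        kept.append(gain)
    return kept


def post_prune(net_gains, max_depth):
    """Grow up to max_depth, then prune from the deepest split upward.

    A split is removed only if it has no positive net gain and nothing
    is left below it; a weak split above a useful one survives.
    """
    grown = list(net_gains[:max_depth])
    while grown and grown[-1] <= 0:
        grown.pop()
    return grown


# A split with net gain -2 followed by a split with net gain +10.
chain = [-2, 10]
print(pre_prune(chain))                # [] : growth stops at the -2 split
print(post_prune(chain, max_depth=2))  # [-2, 10] : both splits are kept
```

With the chain `[5, -1, -3]` both strategies end up keeping only the first split, so the two approaches differ only when a useful split hides behind a weak one.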

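XGBoost Parameters

General Parameters
1. booster [default=gbtree]
  - gbtree: tree-based models
  - gblinear: linear models
2. silent [default=0]
  - 0 prints running messages and 1 suppresses them. Keeping it at 0 is recommended, since the messages help in understanding the model. (Newer XGBoost releases replace silent with verbosity.)
3. nthread [defaults to the maximum number of available threads if not set]
  - Controls parallelism. To run on all cores, leave it unset and let the system decide.
4. num_pbuffer: size of the prediction buffer, set automatically by the system
5. num_feature: feature dimension used in boosting, set automatically by the system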
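Booster Parameters (there are two kinds of booster; the list below covers the tree booster, and item 13 notes the linear booster)
1. eta [default = 0.3]
  - Similar to the learning rate in GBM.
  - Range: [0, 1], typically set to 0.01 to 0.2
2. gamma [default=0, alias: min_split_loss]
  - The minimum loss reduction required to split a leaf node further. The larger the value, the more conservative the model and the less likely it is to overfit.
  - Range: [0, ∞)
3. max_depth [default = 6]
  - Maximum depth of a tree. The larger the value, the more complex the model and the more likely it is to overfit. 0 means no limit.
  - Range: [0, ∞)
4. min_child_weight [default=1]
  - Minimum sum of instance weights required in a child. If a split would produce a leaf whose sum of instance weights is below this value, the node is not split further.
  - In a linear regression task, this reduces to the minimum number of instances required in each node.
  - The larger the value, the more conservative the model.
5. max_delta_step [default=0]
  - Maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, it means there is no constraint.
  - Rarely used, but in logistic regression with extremely imbalanced classes, tuning it can help.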
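6. subsample [default = 1]
  - Row subsampling: the fraction of training instances sampled for each tree. Range: (0, 1]
7. colsample_bytree [default = 1]
  - Column subsampling: the fraction of features sampled when building each tree. Range: (0, 1]
8. colsample_bylevel [default = 1]
  - Column subsampling ratio for each split, in each level. Rarely used. Range: (0, 1]
9. alpha [default = 0]
  - L1 regularization on the weights
10. lambda [default = 1]
  - L2 regularization on the weights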
11. tree_method [default = 'auto'], see the XGBoost paper for details
  - Despite the name, this does not choose how trees are built in general; it chooses the algorithm used to find node splits:
  - 'auto': Use a heuristic to choose the faster one.
    - For small to medium datasets, exact greedy will be used.
    - For very large datasets, the approximate algorithm will be chosen.
    - Because the old behavior was to always use exact greedy on a single machine, the user gets a message when the approximate algorithm is chosen.
  - 'exact': Exact greedy algorithm.
  - 'approx': Approximate greedy algorithm using sketching and histogram.
  - 'hist': Fast histogram optimized approximate greedy algorithm. It uses some performance improvements such as bins caching.
  - 'gpu_exact': GPU implementation of exact algorithm.
  - 'gpu_hist': GPU implementation of hist algorithm.
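12. scale_pos_weight [default = 1]
  - Controls the balance between the weights of positive and negative samples. Very useful when the classes are imbalanced.
  - A typical value: sum(negative cases) / sum(positive cases)
13. [Linear Booster] has lambda, alpha, and lambda_bias (L2 regularization on the bias; there is no L1 regularization on the bias because it is not important).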
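To see how gamma, lambda, and min_child_weight interact, the following is a minimal sketch of the split gain formula from the XGBoost paper, written in plain Python rather than taken from the library. G and H denote the sums of first and second derivatives of the loss over the samples in a node. The function names and the numbers are made up for illustration, and alpha (L1) is left out to keep the sketch short.

```python
def leaf_score(G, H, reg_lambda):
    """Structure score of a single leaf: G^2 / (H + lambda)."""
    return G * G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split, minus the gamma penalty for the extra leaf."""
    children = (leaf_score(G_left, H_left, reg_lambda)
                + leaf_score(G_right, H_right, reg_lambda))
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    return 0.5 * (children - parent) - gamma


def split_allowed(H_left, H_right, min_child_weight=1.0):
    """Both children must carry at least min_child_weight of hessian mass."""
    return H_left >= min_child_weight and H_right >= min_child_weight


# Hypothetical gradient statistics for one candidate split.
stats = dict(G_left=-4.0, H_left=4.0, G_right=6.0, H_right=6.0)

print(split_gain(**stats))                   # positive: the split is worth making
print(split_gain(**stats, gamma=5.0))        # negative: gamma makes it a pruning candidate
print(split_gain(**stats, reg_lambda=10.0))  # smaller: a larger lambda shrinks the gain
print(split_allowed(4.0, 6.0, min_child_weight=5.0))  # False: the left child is too light
```

Raising gamma, lambda, or min_child_weight each makes it harder for a split to survive, which is why all three are described as making the model more conservative.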
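Learning Task Parameters
1. objective [default=reg:linear] defines the learning objective. (Newer XGBoost releases rename reg:linear to reg:squarederror.)
  - Commonly used: "reg:linear", "reg:logistic", "binary:logistic"
  - A custom objective function can be supplied; it must provide the first and second derivatives (gradient and hessian).
2. base_score: the initial prediction score of all instances; it almost never needs to be changed.
3. eval_metric [default depends on objective]
  - Multiple evaluation metrics can be passed. In Python, take care to pass the parameters as a list of pairs rather than a map (dict), so that a later eval_metric does not overwrite an earlier one.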
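As an illustration of what "first and second derivatives" means for a custom objective, here is a minimal sketch for the logistic loss in plain Python. It deliberately does not import XGBoost: the function name `logistic_grad_hess` and the sample values are hypothetical, and the inputs are assumed to be raw margin scores (before the sigmoid) and 0/1 labels.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logistic_grad_hess(raw_preds, labels):
    """Per-sample gradient and hessian of the logistic loss w.r.t. the raw margin."""
    grads, hesses = [], []
    for margin, y in zip(raw_preds, labels):
        p = sigmoid(margin)            # predicted probability
        grads.append(p - y)            # first derivative
        hesses.append(p * (1.0 - p))   # second derivative
    return grads, hesses


grads, hesses = logistic_grad_hess([0.0, 2.0, -1.0], [1, 1, 0])
print(grads)
print(hesses)
```

In the XGBoost Python package a custom objective is, as far as I know, a callable that receives the predictions and the training DMatrix and returns the gradient and hessian arrays; it is passed through the obj argument of xgb.train. The sketch above only shows the math that such a callable would contain.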
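The list-versus-map advice is easy to verify without XGBoost at all. The sketch below builds a parameter list as pairs, using a made-up label vector to compute scale_pos_weight as sum(negative cases) / sum(positive cases), and then shows what happens when the same pairs are turned into a dict. All values are illustrative.

```python
# Hypothetical, heavily imbalanced labels: 8 negatives and 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
n_pos = sum(1 for y in labels if y == 1)
n_neg = len(labels) - n_pos
scale_pos_weight = n_neg / n_pos  # 8 / 2 = 4.0

# Parameters as a list of (name, value) pairs: duplicate names are allowed.
params_as_pairs = [
    ("objective", "binary:logistic"),
    ("eta", 0.1),
    ("max_depth", 6),
    ("scale_pos_weight", scale_pos_weight),
    ("eval_metric", "auc"),
    ("eval_metric", "logloss"),
]

# The same parameters as a dict: the second eval_metric overwrites the first.
params_as_dict = dict(params_as_pairs)

print([v for k, v in params_as_pairs if k == "eval_metric"])  # ['auc', 'logloss']
print(params_as_dict["eval_metric"])                          # logloss
```

The list of pairs keeps both metrics, while the dict silently drops "auc". Recent versions of the Python package also appear to accept a list of metric names under a single eval_metric key, which avoids the problem while still using a dict.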