XGBoost Parameter Tuning: XGBoost in Practice and Parameters Explained

Latest recommended article published on 2024-09-07 09:42:54

xxxibb

Tags: XGBoost parameter tuning

Article link: https://blog.csdn.net/weixin_42190030/article/details/113490273

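This article details the advantages of XGBoost, such as regularization and parallel processing, and examines its parameter settings in depth, including the meaning and role of key parameters such as eta, gamma, max_depth, and min_child_weight. It also shares how to determine max_depth and min_child_weight through grid search, along with points to watch during tuning, such as strategies for preventing overfitting and adjusting the regularization parameters.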

Summary generated automatically by CSDN

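Advantages of XGBoost

Regularization
Parallel processing (split finding within each tree is parallelized across features; the trees themselves are still built one after another)
Flexibility: supports custom objective (loss) functions and evaluation metrics, provided the objective is twice differentiable
Handling of missing values
Pruning, which makes overfitting less likely
Built-in cross-validation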
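
The "twice differentiable" requirement exists because each boosting round needs the first and second derivative of the loss for every instance. As a minimal sketch, assuming binary log loss as the objective (the function names here are illustrative, not part of the XGBoost API), the two quantities look like this:

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_hess(raw_scores, labels):
    """Per-instance gradient and Hessian of log loss w.r.t. the raw score."""
    grads, hess = [], []
    for z, y in zip(raw_scores, labels):
        p = sigmoid(z)
        grads.append(p - y)          # first derivative
        hess.append(p * (1.0 - p))   # second derivative, always positive
    return grads, hess


grads, hess = logistic_grad_hess([0.0, 2.0, -1.0], [1, 0, 1])
print(grads)
print(hess)
```

A custom objective passed to XGBoost returns this same pair (gradient, Hessian), which is why a loss without a usable second derivative cannot be plugged in directly.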

Parameter settings

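params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 10,               # number of classes; required together with multi:softmax
    'gamma': 0.1,                  # minimum loss reduction needed to keep a split (controls pruning); larger is more conservative, typically around 0.1 or 0.2
    'max_depth': 12,               # maximum tree depth; larger values overfit more easily
    'lambda': 2,                   # L2 regularization on leaf weights; larger values make overfitting less likely
    'subsample': 0.7,              # fraction of training rows sampled for each tree
    'colsample_bytree': 0.7,       # fraction of columns sampled when building each tree
    'min_child_weight': 3,         # minimum sum of instance weights required in a child node
    'silent': 1,                   # 1 suppresses runtime messages; 0 is usually the better choice (newer releases replace this with 'verbosity')
    'eta': 0.007,                  # step-size shrinkage, analogous to a learning rate
    'seed': 1000,                  # random seed
    'nthread': 4,                  # number of CPU threads
}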

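booster: the booster to use; the default is gbtree, and gblinear is the alternative
silent: 0 prints runtime messages, 1 runs in silent mode
nthread: number of threads to run with
num_pbuffer: size of the prediction buffer, normally the number of training instances; set automatically, no need to set it by hand
num_feature: number of features; set automatically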

##############################################################################

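eta [default=0.3]: step-size shrinkage applied to each update to prevent overfitting
gamma [default=0]: minimum loss reduction required to split a node further; larger values are more conservative
max_depth [default=6]: maximum depth of a tree
min_child_weight [default=1]: minimum sum of instance weights required in a child node; if a split would produce a child below this value, splitting stops
max_delta_step [default=0]: maximum step allowed for each tree's weight estimate. It is usually left at 0, which means no constraint. A positive value makes the update more conservative; in logistic regression with heavily imbalanced classes, setting it above 0 can help
subsample [default=1]: fraction of the full training set sampled to train each tree; values below 1 help prevent overfitting
colsample_bytree [default=1]: fraction of features sampled when building each tree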
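
To make the roles of gamma and min_child_weight concrete, here is a small sketch of the split gain as given in the XGBoost paper, with L1 regularization left out for simplicity. G and H are the sums of gradients and Hessians in the left and right child; the function names are illustrative, and the library's internal scaling may differ from this textbook form.

```python
def leaf_score(G, H, lam):
    """Structure score of a leaf holding gradient sum G and Hessian sum H."""
    return G * G / (H + lam)


def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Loss reduction of a split, minus the gamma penalty for the extra leaf."""
    gain = leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
    gain -= leaf_score(GL + GR, HL + HR, lam)
    return 0.5 * gain - gamma


def split_allowed(GL, HL, GR, HR, lam=1.0, gamma=0.0, min_child_weight=1.0):
    """A split is kept only if both children are heavy enough and the gain is positive."""
    if HL < min_child_weight or HR < min_child_weight:
        return False
    return split_gain(GL, HL, GR, HR, lam, gamma) > 0.0


print(split_gain(-4.0, 3.0, 5.0, 4.0))                            # 4.4375
print(split_allowed(-4.0, 3.0, 5.0, 4.0, gamma=5.0))              # False: gain below gamma
print(split_allowed(-4.0, 3.0, 5.0, 4.0, min_child_weight=3.5))   # False: left child too light
```

Raising gamma or min_child_weight therefore rejects more candidate splits, which is why both are described as making the model more conservative.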

#################################################################################

lambda: L2 regularization penalty coefficient
alpha: L1 regularization penalty coefficient
lambda_bias: L2 regularization on the bias term
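
The effect of the two penalties on a tree leaf can be sketched with the closed-form optimal leaf weight: lambda shrinks the weight toward zero, while alpha soft-thresholds the gradient sum so that weak leaves become exactly zero. This is an illustration under the assumption that max_delta_step is 0; the function name is hypothetical.

```python
def leaf_weight(G, H, lam=1.0, alpha=0.0):
    """Optimal leaf weight for gradient sum G and Hessian sum H with L1/L2 penalties."""
    if G > alpha:
        shrunk = G - alpha
    elif G < -alpha:
        shrunk = G + alpha
    else:
        shrunk = 0.0  # L1 penalty dominates: the leaf contributes nothing
    return -shrunk / (H + lam)


print(leaf_weight(-4.0, 3.0, lam=1.0))             # 1.0
print(leaf_weight(-4.0, 3.0, lam=7.0))             # 0.4, stronger L2 gives a smaller weight
print(leaf_weight(-4.0, 3.0, lam=1.0, alpha=1.0))  # 0.75
print(leaf_weight(-4.0, 3.0, lam=1.0, alpha=5.0))  # zero weight
```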

#################################################################################

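objective [ default=reg:linear ]
Defines the learning task and the corresponding learning objective. The available objectives are:
- “reg:linear”: linear regression (renamed “reg:squarederror” in later releases).
- “reg:logistic”: logistic regression.
- “binary:logistic”: logistic regression for binary classification; the output is a probability.
- “binary:logitraw”: logistic regression for binary classification; the output is the raw score wTx before the logistic transformation.
- “count:poisson”: Poisson regression for count data; the output is the mean of the Poisson distribution. In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
- “multi:softmax”: makes XGBoost use the softmax objective for multi-class classification; num_class (the number of classes) must also be set.
- “multi:softprob”: same as softmax, but the output is a vector of length ndata * nclass, which can be reshaped into a matrix with ndata rows and nclass columns. Each row gives the probability that the sample belongs to each class.
- “rank:pairwise”: sets XGBoost to do a ranking task by minimizing the pairwise loss.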
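
The relationship between “multi:softprob” and “multi:softmax” described above can be sketched in plain Python: reshape the flat probability vector into ndata rows of nclass values, then take the arg-max of each row to get the class label. The helper names are illustrative, and recent XGBoost releases may already return a two-dimensional array, in which case the reshape step is unnecessary.

```python
def reshape_softprob(flat, nclass):
    """Turn a flat vector of length ndata * nclass into ndata rows of nclass probabilities."""
    if len(flat) % nclass != 0:
        raise ValueError("vector length must be a multiple of nclass")
    return [flat[i:i + nclass] for i in range(0, len(flat), nclass)]


def argmax_rows(matrix):
    """Index of the largest probability in each row, i.e. the predicted class."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


flat = [0.1, 0.7, 0.2, 0.6, 0.3, 0.1]   # two samples, three classes
probs = reshape_softprob(flat, 3)
labels = argmax_rows(probs)
print(probs)    # [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
print(labels)   # [1, 0]
```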

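base_score [ default=0.5 ]
- The initial prediction score of all instances, i.e. the global bias.
- With a sufficient number of iterations, changing this value has little effect.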

eval_metric [ default according to objective ]
- The evaluation metric for validation data. Each objective has a default metric (rmse for regression, error for classification, mean average precision for ranking).
- Users can add multiple evaluation metrics. Python users should pass the parameters as a list of pairs rather than a map, so that a later ‘eval_metric’ entry does not overwrite an earlier one.
- The available choices are:
  - “rmse”: root mean square error
  - “logloss”: negative log-likelihood
  - “error”: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
  - “merror”: Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
  - “mlogloss”: Multiclass logloss
  - “auc”: Area under the curve for ranking evaluation.
  - “ndcg”: Normalized Discounted Cumulative Gain
  - “map”: Mean average precision
  - “ndcg@n”, “map@n”: n can be assigned as an integer to cut off the top positions in the lists for evaluation.
  - “ndcg-”, “map-”, “ndcg@n-”, “map@n-”: In XGBoost, NDCG and MAP will evaluate the score of a list without any positive samples as 1. By adding “-” to the evaluation metric, XGBoost will evaluate these scores as 0 to be consistent under some conditions.
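
As a quick check of the definitions above, the following sketch computes “error”, “logloss”, and “rmse” by hand on a few made-up predictions. The function names and the sample values are illustrative only; they are not taken from the XGBoost source.

```python
import math


def error_rate(probs, labels, threshold=0.5):
    """#(wrong cases) / #(all cases); predictions above the threshold count as positive."""
    wrong = sum(1 for p, y in zip(probs, labels) if (1 if p > threshold else 0) != y)
    return wrong / len(labels)


def logloss(probs, labels, eps=1e-15):
    """Mean negative log-likelihood for binary labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def rmse(preds, targets):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


probs = [0.9, 0.4, 0.6, 0.2]
labels = [1, 1, 0, 0]
print(error_rate(probs, labels))   # 0.5: the second and third samples are misclassified
print(logloss(probs, labels))
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```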
seed [ default=0 ]
- Random number seed. The default is 0.

Zhang Huayan: The Most Detailed XGBoost Practical Guide Ever (zhuanlan.zhihu.com)

Parameter tuning

Fix the boosting parameters, starting from preset initial values for the other parameters

max_depth = 5
min_child_weight = 1
gamma = 0
subsample, colsample_bytree = 0.8
scale_pos_weight = 1
Use cv to determine n_estimators

Use grid search to determine max_depth and min_child_weight
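
The mechanics of this step can be sketched with a small, library-free grid search: enumerate every combination, score each one, and keep the best. The scoring function below is a hypothetical stand-in so the example runs on its own; in practice it would return a cross-validated score for the candidate parameters. The grid ranges are also assumptions.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        candidate = dict(zip(names, values))
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a cross-validated score (higher is better)."""
    return -((params["max_depth"] - 5) ** 2) - (params["min_child_weight"] - 3) ** 2


grid = {
    "max_depth": range(3, 10, 2),         # 3, 5, 7, 9
    "min_child_weight": range(1, 6, 2),   # 1, 3, 5
}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

A common refinement is to start with a coarse grid like this one and then search again in steps of 1 around the best values found.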

Tune gamma

Tune subsample and colsample_bytree

Tune the regularization parameters

Lower the learning rate

Dukey: [Repost] Complete Guide to XGBoost Parameter Tuning (with Python Code) (zhuanlan.zhihu.com)