[Beginner's Guide] XGBoost Parameters Explained: the most important part of using XGBoost is configuring its parameters


Before running XGBoost, we must set three types of parameters: general parameters, booster parameters, and task parameters:

General parameters:

General parameters control which booster is used in the boosting process. The commonly used boosters are the tree model (gbtree) and the linear model (gblinear).
booster [default=gbtree]
gbtree and gblinear

silent [default=0]
0 prints run-time messages; 1 runs in silent mode

Booster parameters: these depend on which booster is chosen.
In general, gbtree performs better than gblinear.
Task parameters: these control the learning scenario; for example, regression tasks use different parameters than ranking tasks.
Besides the above, there are also parameters used only in the command-line (console) version.
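To make the three groups concrete, here is a minimal sketch using the Python package (the toy data and every value are made up for illustration; the Python API is an assumption on my part, since this post only names the parameter groups):

```python
import numpy as np
import xgboost as xgb

# Made-up toy data: 100 samples, 5 features, binary labels.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)

params = {
    # General parameters: choose the booster and verbosity.
    'booster': 'gbtree',
    'silent': 1,          # as described in this post; newer releases use 'verbosity' instead
    'nthread': 4,
    # Booster parameters: control the individual trees.
    'eta': 0.3,
    'max_depth': 6,
    # Task parameters: define the learning objective.
    'objective': 'binary:logistic',
}

# num_boost_round corresponds to num_round in the console version.
bst = xgb.train(params, dtrain, num_boost_round=10)
```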
Parameters in R Package

In the R package, you can use . (dot) to replace the underscore in parameter names; for example, you can use max.depth in place of max_depth. The underscore form of the parameters is also valid in R.

General Parameters

booster [default=gbtree]
Two models are available: gbtree and gblinear. gbtree performs boosting with tree-based models, while gblinear performs boosting with linear models. Default: gbtree.
silent [default=0]
0 prints run-time messages; 1 runs silently without printing them. Default: 0.
nthread [default to maximum number of threads available if not set]
The number of threads XGBoost uses at run time. Defaults to the maximum number of threads available on the current system.
num_pbuffer [set automatically by xgboost, no need to be set by user]
Size of the prediction buffer, normally set to the number of training instances. The buffer is used to save the prediction results of the last boosting step.
num_feature [set automatically by xgboost, no need to be set by user]
The feature dimension used in boosting, set to the number of features. XGBoost sets this automatically; there is no need to set it manually.

Booster Parameters

eta [default=0.3]
Step-size shrinkage used in updates to prevent overfitting. After each boosting step, the weights of new features can be obtained directly; eta shrinks these feature weights to make the boosting process more conservative. Default: 0.3. (A sketch showing typical settings for this and the other tree-booster parameters follows the alpha entry below.)
Range: [0,1]
gamma [default=0] (a parameter you may or may not need to tune)
The minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm.
Range: [0,∞]
max_depth [default=6]
Maximum depth of a tree. Default: 6. This parameter is usually tuned, and the appropriate value differs from dataset to dataset.
Range: [1,∞]
min_child_weight [default=1]
The minimum sum of instance weights required in a child. If a tree-growing step produces a leaf node whose sum of instance weights is less than min_child_weight, that partition is abandoned. In linear regression mode, this simply corresponds to the minimum number of instances required in each node. The larger the value, the more conservative the algorithm.
Range: [0,∞]
max_delta_step [default=0]
If the value is 0, there is no constraint; a positive value makes each update step more conservative. This parameter usually does not need to be set, but it is useful in logistic regression when the training set is extremely class-imbalanced; setting it to a value in 1-10 can help control the update.
Range: [0,∞]
subsample [default=1]
The fraction of the full training set used to build each tree. Setting it to 0.5 means XGBoost randomly draws 50% of the training instances to grow each tree, which helps prevent overfitting.
Range: (0,1]
colsample_bylevel [default=1]
The fraction of columns sampled for each split, at each level of the tree. This parameter is rarely used, because subsample and colsample_bytree can play the same role.
Range: (0,1]
colsample_bytree [default=1]
The fraction of features sampled when constructing each tree. Default: 1.
Range: (0,1]
lambda [default=1, alias: reg_lambda]
L2 regularization term on weights.
alpha [default=0, alias: reg_alpha]
L1 regularization term on weights.
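As a rough illustration of how the parameters above fit together, here is a hedged sketch of a "more conservative than the defaults" configuration; the values are illustrative only, not tuned recommendations, and the toy data is made up:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',
    'eta': 0.1,               # smaller shrinkage; usually needs more rounds
    'gamma': 1.0,             # demand at least this loss reduction per split
    'max_depth': 4,           # shallower trees
    'min_child_weight': 5,    # larger minimum leaf weight
    'max_delta_step': 0,      # 0 = unconstrained (the default)
    'subsample': 0.8,         # sample 80% of rows per tree
    'colsample_bytree': 0.8,  # sample 80% of columns per tree
    'lambda': 1.0,            # L2 penalty on leaf weights
    'alpha': 0.1,             # L1 penalty on leaf weights
}
bst = xgb.train(params, dtrain, num_boost_round=200)
```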
tree_method, string [default=’auto’]
The tree construction algorithm used in XGBoost(see description in the reference paper)
Distributed and external memory version only support approximate algorithm.
Choices: {‘auto’, ‘exact’, ‘approx’}
‘auto’: Use a heuristic to choose the faster one.
For small and medium datasets, exact greedy will be used.
For very large datasets, the approximate algorithm will be chosen.
Because the old behavior was to always use exact greedy on a single machine, the user will get a message when the approximate algorithm is chosen, to make this choice explicit.
‘exact’: Exact greedy algorithm.
‘approx’: Approximate greedy algorithm using sketching and histogram.
sketch_eps, [default=0.03]
This is only used for approximate greedy algorithm.
This roughly translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy.
Usually the user does not have to tune this, but consider setting it to a lower number for more accurate enumeration.
Range: (0, 1)
scale_pos_weight, [default=1]
Controls the balance of positive and negative weights. When the classes are highly imbalanced, setting this parameter to a suitable positive value can make the algorithm converge faster.
A typical value to consider: sum(negative cases) / sum(positive cases). See the Higgs Kaggle competition demo for examples.
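A minimal sketch of that heuristic (the label array is made up; only numpy is assumed):

```python
import numpy as np

# Made-up, heavily imbalanced 0/1 labels: 1% positives.
y = np.array([0] * 990 + [1] * 10)

scale = (y == 0).sum() / (y == 1).sum()   # sum(negative) / sum(positive)
print(scale)                              # 99.0
# Then pass it along with the other parameters:
# params['scale_pos_weight'] = scale
```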
updater, [default=’grow_colmaker,prune’]
A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, it can also be set explicitly by a user. The following updater plugins exist:
‘grow_colmaker’: non-distributed column-based construction of trees.
‘distcol’: distributed tree construction with column-based data splitting mode.
‘grow_histmaker’: distributed tree construction with row-based data splitting based on global proposal of histogram counting.
‘grow_local_histmaker’: based on local histogram counting.
‘grow_skmaker’: uses the approximate sketching algorithm.
‘sync’: synchronizes trees in all distributed nodes.
‘refresh’: refreshes tree’s statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
‘prune’: prunes the splits where loss < min_split_loss (or gamma).
In a distributed setting, the implicit updater sequence value would be adjusted as follows:
‘grow_histmaker,prune’ when dsplit=’row’ (or default) and prob_buffer_row == 1 (or default); or when data has multiple sparse pages
‘grow_histmaker,refresh,prune’ when dsplit=’row’ and prob_buffer_row < 1
‘distcol’ when dsplit=’col’
refresh_leaf, [default=1]
This is a parameter of the ‘refresh’ updater plugin. When this flag is true, tree leaves as well as tree nodes’ stats are updated. When it is false, only node stats are updated.
process_type, [default=’default’]
A type of boosting process to run.
Choices: {‘default’, ‘update’}
‘default’: the normal boosting process which creates new trees.
‘update’: starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, a specified sequence of updater plugins is run for that tree, and a modified tree is added to the new model. The new model would have either the same or a smaller number of trees, depending on the number of boosting iterations performed. Currently, the following built-in updater plugins can be meaningfully used with this process type: ‘refresh’, ‘prune’. With ‘update’, one cannot use updater plugins that create new trees.
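A sketch of the ‘update’ process type, assuming the Python API: the trees of an already trained model are refreshed on new data without growing any new trees (all data and round counts here are made up):

```python
import numpy as np
import xgboost as xgb

# Train an initial model with 5 trees on made-up data.
X, y = np.random.rand(100, 5), np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=5)

# Refresh the statistics and leaf values of those 5 trees on new data.
X2, y2 = np.random.rand(100, 5), np.random.randint(0, 2, size=100)
dnew = xgb.DMatrix(X2, label=y2)
refreshed = xgb.train(
    {'objective': 'binary:logistic',
     'process_type': 'update',
     'updater': 'refresh',
     'refresh_leaf': 1},
    dnew,
    num_boost_round=5,   # must not exceed the number of existing trees
    xgb_model=bst,       # start from the existing model; no new trees are created
)
```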

Task Parameters

objective [ default=reg:linear ]
Defines the learning task and the corresponding learning objective. The available objective functions are:
“reg:linear” – linear regression.
“reg:logistic” – logistic regression.
“binary:logistic” – logistic regression for binary classification; the output is a probability.
“binary:logitraw” – logistic regression for binary classification; the output is the raw score w^T x before the logistic transformation.
“count:poisson” – Poisson regression for count data; the output is the mean of the Poisson distribution.
In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
“multi:softmax” – multiclass classification using the softmax objective; you must also set num_class (the number of classes).
“multi:softprob” – same as softmax, but outputs a vector of ndata * nclass entries that can be reshaped into an ndata-by-nclass matrix; each row holds the predicted probability of that sample belonging to each class (see the sketch after this list).
“rank:pairwise” – set XGBoost to do ranking tasks by minimizing the pairwise loss.
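To illustrate the multi:softmax / multi:softprob distinction, here is a hedged sketch on made-up 3-class data. The reshape mirrors the description above; recent releases of the Python package already return a 2-D array, in which case the reshape is a no-op:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(150, 4)
y = np.random.randint(0, 3, size=150)   # 3 classes
dtrain = xgb.DMatrix(X, label=y)

params = {'objective': 'multi:softprob', 'num_class': 3}
bst = xgb.train(params, dtrain, num_boost_round=10)

# Flat ndata * nclass vector -> ndata rows, nclass columns of probabilities.
prob = np.asarray(bst.predict(dtrain)).reshape(-1, 3)
pred_label = prob.argmax(axis=1)        # what multi:softmax would output directly
```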
base_score [ default=0.5 ]
The initial prediction score of all instances (global bias).
eval_metric [ default according to objective ]
Evaluation metrics for validation data. A default metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking).
Users can add multiple evaluation metrics. Python users must pass the parameters as a list of (key, value) pairs rather than a map, because entries in a list do not overwrite an earlier ‘eval_metric’ (see the sketch after the metric list below).
The choices are listed below:
“rmse”: root mean square error
“logloss”: negative log-likelihood
“error”: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
“merror”: Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
“mlogloss”: Multiclass logloss
“auc”: Area under the curve for ranking evaluation.
“ndcg”:Normalized Discounted Cumulative Gain
“map”:Mean average precision
“ndcg@n”,”map@n”: n can be assigned as an integer to cut off the top positions in the lists for evaluation.
“ndcg-“,”map-“,”ndcg@n-“,”map@n-“: In XGBoost, NDCG and MAP will evaluate the score of a list without any positive samples as 1. By adding “-” in the evaluation metric XGBoost will evaluate these score as 0 to be consistent under some conditions.
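A sketch of the list-of-pairs pattern mentioned above for Python users (the data is made up; both ‘eval_metric’ entries survive because params is a list, not a dict):

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(100, 5), np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)

# A list of pairs keeps both 'eval_metric' entries; a dict would keep only one.
params = [('objective', 'binary:logistic'),
          ('eval_metric', 'logloss'),
          ('eval_metric', 'auc')]
bst = xgb.train(params, dtrain, num_boost_round=10,
                evals=[(dtrain, 'train')])
```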
seed [ default=0 ]
Random number seed. Default: 0.

Console Parameters

The following parameters are only used in the console version of XGBoost.

  • use_buffer [ default=1 ]
  • Whether to create a binary buffer file for the input data; the buffer can speed up computation. Default: 1.
  • num_round
  • The number of boosting iterations.
  • data
  • The path of the training data.
  • test:data
  • The path of the test data.
  • save_period [default=0]
  • Save the model after every save_period iterations. For example, save_period=10 means XGBoost saves an intermediate model every 10 iterations; setting it to 0 means only the final model is saved.
  • task [default=train] options: train, pred, eval, dump
  • train: train the model
  • pred: make predictions on the test data
  • eval: evaluate statistics specified by eval[name]=filename
  • dump: dump the learned model into text format
  • model_in [default=NULL]
  • The path of the input model, used by the pred, eval, and dump tasks. If it is specified for training, XGBoost will continue training from the input model.
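For reference, a console run is usually driven by a configuration file plus optional key=value overrides on the command line. The following sketch is modeled on the official binary-classification demo; every file name here is made up:

```
# train.conf (hypothetical file name)
booster = gbtree
objective = binary:logistic
eta = 0.3
max_depth = 6
num_round = 10
save_period = 0
data = "train.libsvm"
eval[test] = "test.libsvm"
```

Something like `xgboost train.conf` then trains the model, and `xgboost train.conf task=pred model_in=0010.model test:data=test.libsvm` makes predictions; any key=value given on the command line overrides the value in the configuration file.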
Source: original article by CSDN blogger chenXin@Gauss, licensed under CC 4.0 BY-SA: https://blog.csdn.net/lc574260570/article/details/81606857