LightGBM and XGBoost Parameter Tuning: Tuning Experience and Parameter Reference

只要开始永远不晚
Last modified: 2023-01-19 17:21:00
First published: 2021-09-07 13:48:41
Article link: https://blog.csdn.net/haohaizijhz/article/details/120156351
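References for comparison

Adapted from: "LightGBM parameter introduction" (LightGBM参数介绍), blog of 一路前行1, CSDN
Tuning approach

"LightGBM tuning methods (concrete steps)", Byron_NG, 博客园 (cnblogs)
Introduction to LightGBM

"Understanding LightGBM in depth", 知乎 (Zhihu)
XGBoost parameter reference

XGBoost parameters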
https://xgboost.readthedocs.io/en/latest/parameter.html
1 Global parameters

The following parameters can be set in the global scope using xgboost.config_context() (Python) or xgb.set.config() (R).
verbosity: Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), and 3 (debug).
use_rmm: Whether to use the RAPIDS Memory Manager (RMM) to allocate GPU memory. This option only applies when XGBoost is built (compiled) with the RMM plugin enabled. Valid values are true and false.
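
The global configuration described above can be exercised from Python roughly as follows. This is a minimal sketch that assumes a reasonably recent xgboost package is installed; the import guard and the `observed` dict are illustrative conveniences, not part of XGBoost.

```python
# Global configuration sketch. The import guard only lets the snippet run
# (and do nothing) on machines where xgboost is not installed.
try:
    import xgboost as xgb
except ImportError:
    xgb = None

observed = {}  # hypothetical helper dict used to record what we see
if xgb is not None:
    # Set verbosity for the whole process (2 = info).
    xgb.set_config(verbosity=2)
    observed["global"] = xgb.get_config()["verbosity"]

    # Temporarily silence messages inside this block only.
    with xgb.config_context(verbosity=0):
        observed["inside"] = xgb.get_config()["verbosity"]

    # On exit, the previous global value is restored.
    observed["after"] = xgb.get_config()["verbosity"]

print(observed)
```

The context manager is the safer choice in shared code, since it does not leave a changed setting behind.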


2 General parameters

General parameters relate to which booster is used for boosting, commonly a tree or a linear model.
booster [default= gbtree ]
Which booster to use. Can be gbtree, gblinear or dart; gbtree and dart use tree based models while gblinear uses linear functions.
verbosity [default=1]
Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug). Sometimes XGBoost tries to change configurations based on heuristics, which is displayed as a warning message. If there’s unexpected behaviour, please try to increase the value of verbosity.
validate_parameters [default to false, except for Python, R and CLI interface]
When set to True, XGBoost will perform validation of input parameters to check whether a parameter is used or not. The feature is still experimental and is expected to have some false positives, so it is best left off for now.
nthread [default to maximum number of threads available if not set]
Number of parallel threads used to run XGBoost. When choosing it, please keep thread contention and hyperthreading in mind.
disable_default_eval_metric [default= false]
Flag to disable default metric. Set to 1 or true to disable.
num_feature [set automatically by XGBoost, no need to be set by user]
Feature dimension used in boosting, set to maximum dimension of the feature
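
As a rough illustration, the general parameters above are usually collected in a plain dict of the kind passed as the first argument to xgboost.train. The specific values here (for example nthread of 4) are assumptions chosen for the example, and `VALID_BOOSTERS` is a hypothetical helper.

```python
# General parameters as a plain dict (values are illustrative).
general_params = {
    "booster": "gbtree",                   # "gbtree", "gblinear" or "dart"
    "verbosity": 1,                        # 0 silent, 1 warning, 2 info, 3 debug
    "nthread": 4,                          # defaults to all available threads if omitted
    "disable_default_eval_metric": False,  # keep the default metric
}
# num_feature is deliberately absent: XGBoost sets it automatically.

# Hypothetical sanity check against the booster names listed in the text.
VALID_BOOSTERS = {"gbtree", "gblinear", "dart"}
if general_params["booster"] not in VALID_BOOSTERS:
    raise ValueError("unknown booster: " + general_params["booster"])

print(general_params)
```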


3 Booster parameters

Booster parameters depend on which booster you have chosen; this section mainly covers the parameters of the tree booster.
eta [default=0.3, alias: learning_rate]

Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
range: [0,1]
gamma [default=0, alias: min_split_loss]

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
range: [0,∞]
max_depth [default=6]

Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree. The exact tree method requires a non-zero value.
range: [0,∞]
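
To make the roles of gamma and eta concrete, here is a small pure-Python sketch. It uses the split gain formula from the XGBoost paper (the library's internal scaling may differ), and the gradient and hessian sums are made-up numbers; `leaf_score` and `split_gain` are hypothetical helper names.

```python
def leaf_score(g_sum, h_sum, reg_lambda=1.0):
    """Structure score G^2 / (H + lambda) of a node."""
    return g_sum ** 2 / (h_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split minus the gamma penalty (paper formula)."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Illustrative gradient/hessian sums for the two candidate children.
gain_no_penalty = split_gain(-4.0, 3.0, 4.0, 3.0, gamma=0.0)
gain_with_gamma = split_gain(-4.0, 3.0, 4.0, 3.0, gamma=5.0)

# A split is only worth making when the penalised gain is positive.
print(gain_no_penalty, gain_no_penalty > 0)
print(gain_with_gamma, gain_with_gamma > 0)

# eta shrinks each new tree's contribution before it is added to the model.
eta = 0.3
raw_leaf_weight = 2.0
shrunk_update = eta * raw_leaf_weight
print(shrunk_update)
```

With these numbers the same split is accepted when gamma is 0 and rejected when gamma is 5, which is the sense in which a larger gamma makes the algorithm more conservative.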

min_child_weight [default=1]

Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression task, this simply corresponds to minimum number of instances needed to be in each node. The larger min_child_weight is, the more conservative the algorithm will be.
range: [0,∞]
max_delta_step [default=0]

Maximum delta step we allow each leaf output to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update.
range: [0,∞]
subsample [default=1]

Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost will randomly sample half of the training data prior to growing trees, which helps prevent overfitting. Subsampling occurs once in every boosting iteration.
range: (0,1]
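
The remark that min_child_weight corresponds to an instance count for linear regression follows from the hessian of each loss. The sketch below assumes the squared error loss 0.5 * (pred - label)^2 and the logistic loss on a raw margin; the helper names are hypothetical.

```python
import math


def hessian_squared_error(pred, label):
    """Second derivative of 0.5 * (pred - label)^2 with respect to pred."""
    return 1.0


def hessian_logistic(margin, label):
    """Second derivative of the logistic loss with respect to the margin."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p * (1.0 - p)


def child_allowed(hessians, min_child_weight=1.0):
    """A child is kept only if its hessian sum reaches min_child_weight."""
    return sum(hessians) >= min_child_weight


# Three instances under squared error: the hessian sum is just the count.
regression_hessians = [hessian_squared_error(0.0, 1.0) for _ in range(3)]

# Three instances under logistic loss at margin 0 (p = 0.5): 0.25 each.
logistic_hessians = [hessian_logistic(0.0, 1) for _ in range(3)]

print(sum(regression_hessians), child_allowed(regression_hessians))
print(sum(logistic_hessians), child_allowed(logistic_hessians))
```

For classification the same min_child_weight therefore demands more instances per child, and even more when predictions are already confident, because p * (1 - p) shrinks toward 0.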

sampling_method [default= uniform]

The method to use to sample the training instances.
uniform: each training instance has an equal probability of being selected. Typically set subsample >= 0.5 for good results.
gradient_based: the selection probability for each training instance is proportional to the regularized absolute value of gradients (more specifically, sqrt(g^2 + λh^2)). subsample may be set to as low as 0.1 without loss of model accuracy. Note that this sampling method is only supported when tree_method is set to gpu_hist; other tree methods only support uniform sampling.

colsample_bytree, colsample_bylevel, colsample_bynode [default=1]

This is a family of parameters for subsampling of columns.
All colsample_by* parameters have a range of (0, 1], the default value of 1, and specify the fraction of columns to be subsampled.
colsample_bytree is the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.
colsample_bylevel is the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.
colsample_bynode is the subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.
colsample_by* parameters work cumulatively. For instance, the combination {'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5} with 64 features will leave 8 features to choose from at each split.

Using the Python or the R package, one can set the feature_weights for DMatrix to define the probability of each feature being selected when using column sampling. There’s a similar parameter for fit method in sklearn interface.
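
The cumulative effect of the colsample_by* family can be checked with a few lines of arithmetic. This sketch assumes each stage keeps the integer part of ratio times the remaining columns, with at least one column kept; the function name is hypothetical.

```python
def features_per_split(n_features, bytree=1.0, bylevel=1.0, bynode=1.0):
    """Number of candidate features left at each split after all three stages."""
    remaining = n_features
    # Each stage samples from the columns that survived the previous stage.
    for ratio in (bytree, bylevel, bynode):
        remaining = max(1, int(remaining * ratio))
    return remaining


# The example from the text: 64 features, 0.5 at every stage.
left = features_per_split(64, bytree=0.5, bylevel=0.5, bynode=0.5)
print(left)
```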

lambda [default=1, alias: reg_lambda]
L2 regularization term on weights. Increasing this value will make model more conservative.

alpha [default=0, alias: reg_alpha]
L1 regularization term on weights. Increasing this value will make model more conservative.
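
A short sketch of how lambda and alpha act on a single leaf value, assuming the usual closed form in which alpha soft-thresholds the gradient sum and lambda is added to the hessian sum. The numbers are illustrative and `leaf_weight` is a hypothetical helper.

```python
def leaf_weight(g_sum, h_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value under L2 (lambda) and L1 (alpha) regularization."""
    # L1: shrink the gradient sum toward zero by alpha (soft thresholding).
    if g_sum > reg_alpha:
        g = g_sum - reg_alpha
    elif g_sum < -reg_alpha:
        g = g_sum + reg_alpha
    else:
        g = 0.0
    # L2: a larger lambda enlarges the denominator and shrinks the weight.
    return -g / (h_sum + reg_lambda)


base = leaf_weight(-6.0, 2.0)                      # defaults: lambda=1, alpha=0
more_l2 = leaf_weight(-6.0, 2.0, reg_lambda=4.0)   # stronger L2
more_l1 = leaf_weight(-6.0, 2.0, reg_alpha=3.0)    # some L1
zeroed = leaf_weight(-6.0, 2.0, reg_alpha=10.0)    # L1 large enough to zero the leaf

print(base, more_l2, more_l1, zeroed)
```

Both terms pull leaf values toward zero, which is why increasing either makes the model more conservative.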

tree_method string [default= auto]

The tree construction algorithm used in XGBoost. See description in the reference paper and Tree Methods.
XGBoost supports approx, hist and gpu_hist for distributed training. Experimental support for external memory is available for approx and gpu_hist.
Choices: auto, exact, approx, hist, gpu_hist. Each is a combination of commonly used updaters. For other updaters like refresh, set the parameter updater directly.
auto: Use heuristic to choose the fastest method.
For small datasets, exact greedy (exact) will be used.
For larger datasets, the approximate algorithm (approx) will be chosen. It’s recommended to try hist and gpu_hist for higher performance with large datasets. gpu_hist has support for external memory.
Because the old behavior was to always use exact greedy on a single machine, the user will get a message when the approximate algorithm is chosen, to notify them of this choice.
exact: Exact greedy algorithm. Enumerates all split candidates.
approx: Approximate greedy algorithm using quantile sketch and gradient histogram.
hist: Faster histogram optimized approximate greedy algorithm.
gpu_hist: GPU implementation of hist algorithm.

sketch_eps [default=0.03]

Only used for updater=grow_local_histmaker.
This roughly translates into O(1 / sketch_eps) number of bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy.
Usually the user does not have to tune this, but consider setting it to a lower number for more accurate enumeration of split candidates.
range: (0, 1)
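
As an order-of-magnitude illustration only, the relation between sketch_eps and the number of bins can be sketched as below. The O(1 / sketch_eps) statement hides a constant factor, so the exact counts are an assumption, and `approx_bins` is a hypothetical helper.

```python
def approx_bins(sketch_eps):
    """Rough bin count implied by sketch_eps, ignoring constant factors."""
    if not 0.0 < sketch_eps < 1.0:
        raise ValueError("sketch_eps must lie in (0, 1)")
    return round(1.0 / sketch_eps)


default_bins = approx_bins(0.03)  # the default value
finer_bins = approx_bins(0.01)    # lower eps, more split candidates
print(default_bins, finer_bins)
```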

scale_pos_weight [default=1]

Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances). See Parameters Tuning for more discussion. Also, see Higgs Kaggle competition demo for examples: R, py1, py2, py3.
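
The typical value mentioned above can be computed directly from the labels. This is a minimal sketch assuming binary labels encoded as 0 and 1; the function name and the toy label list are illustrative.

```python
def suggested_scale_pos_weight(labels):
    """sum(negative instances) / sum(positive instances) for 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive instances in labels")
    return negatives / positives


# Toy data: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
weight = suggested_scale_pos_weight(labels)
print(weight)

# The result would then go into the parameter dict, for example:
params = {"scale_pos_weight": weight}
```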

updater

A comma separated string defining the sequence of tree updaters to run, providing a modular way to construct and to modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, it could be also set explicitly by a user. The following updaters exist:
grow_colmaker: non-distributed column-based construction of trees.
grow_histmaker: distributed tree construction with row-based data splitting based on global proposal of histogram counting.
grow_local_histmaker: based on local histogram counting.
grow_quantile_histmaker: Grow tree using quantized histogram.
grow_gpu_hist: Grow tree with GPU.
sync: synchronizes trees in all distributed nodes.
refresh: refreshes tree’s statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
prune: prunes the splits where loss < min_split_loss (or gamma) and nodes that have depth greater than max_depth.

refresh_leaf [default=1]

This is a parameter of the refresh updater. When this flag is 1, tree leaves as well as tree nodes’ stats are updated. When it is 0, only node stats are updated.
process_type [default= default]

A type of boosting process to run.
Choices: default, update
default: The normal boosting process which creates new trees.
update: Starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, a specified sequence of updaters is run for that tree, and a modified tree is added to the new model. The new model would have either the same or smaller number of trees, depending on the number of boosting iterations performed. Currently, the following built-in updaters could be meaningfully used with this process type: refresh, prune. With process_type=update, one cannot use updaters that create new trees.
grow_policy [default= depthwise]

Controls the way new nodes are added to the tree.
Currently supported only if tree_method is set to hist, approx or gpu_hist.
Choices: depthwise, lossguide
depthwise: split at nodes closest to the root.
lossguide: split at nodes with highest loss change.
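
The two constraints stated above (process_type=update only works with updaters that do not create trees, and a non-default grow_policy needs hist, approx or gpu_hist) can be captured in a small check. `check_params` and the two example dicts are hypothetical and only encode the rules as worded in the text; XGBoost performs its own validation.

```python
# Updaters the text names as meaningful with process_type=update.
UPDATE_ONLY_UPDATERS = {"refresh", "prune"}
# Tree methods for which grow_policy is supported, per the text.
GROW_POLICY_TREE_METHODS = {"hist", "approx", "gpu_hist"}


def check_params(params):
    """Return a list of problems found in a parameter dict."""
    problems = []
    if params.get("process_type", "default") == "update":
        updaters = [u.strip() for u in params.get("updater", "").split(",") if u.strip()]
        if not updaters or any(u not in UPDATE_ONLY_UPDATERS for u in updaters):
            problems.append("process_type=update needs updater from refresh/prune")
    if (params.get("grow_policy") == "lossguide"
            and params.get("tree_method") not in GROW_POLICY_TREE_METHODS):
        problems.append("grow_policy=lossguide needs hist, approx or gpu_hist")
    return problems


# Refresh leaf values and node stats of an existing model on new data.
refresh_params = {"process_type": "update", "updater": "refresh", "refresh_leaf": 1}

# Grow trees leaf-wise, splitting where the loss change is highest.
lossguide_params = {"tree_method": "hist", "grow_policy": "lossguide"}

print(check_params(refresh_params), check_params(lossguide_params))
```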


max_leaves [default=0]

Maximum number of nodes to be added. Not used by exact tree method.
max_bin [default=256]

Only used if tree_method is set to hist, approx or gpu_hist.
Maximum number of discrete bins to bucket continuous features.
Increasing this number improves the optimality of splits at the cost of higher computation time.
predictor [default= auto]

The type of predictor algorithm to use. Provides the same results but allows the use of GPU or CPU.
auto: Configure predictor based on heuristics.
cpu_predictor: Multicore CPU prediction algorithm.
gpu_predictor: Prediction using GPU. Used when tree_method is gpu_hist. When predictor is set to default value auto, the gpu_hist tree method is able to provide GPU based prediction without copying training data to GPU memory. If gpu_predictor is explicitly specified, then all data is copied into GPU, only recommended for performing prediction tasks.
num_parallel_tree [default=1]

Number of parallel trees constructed during each iteration. This option is used to support boosted random forest.
monotone_constraints

Constraint of variable monotonicity. See Monotonic Constraints for more information.
interaction_constraints

Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See Feature Interaction Constraints for more information.
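
One simplified way to read the nested-list form of interaction_constraints: a feature may be used below a set of earlier split features only if some group contains all of them. The sketch below encodes just that reading; `may_interact` is a hypothetical helper and does not reproduce how XGBoost treats features missing from every group.

```python
def may_interact(path_features, candidate, constraints):
    """True if `candidate` can be split on after `path_features` in one branch."""
    needed = set(path_features) | {candidate}
    # Allowed when a single constraint group covers every feature involved.
    return any(needed <= set(group) for group in constraints)


# The example from the text: features 0 and 1 form one group, 2 to 4 another.
constraints = [[0, 1], [2, 3, 4]]

same_group = may_interact([0], 1, constraints)
cross_group = may_interact([0], 2, constraints)
deep_same_group = may_interact([2, 3], 4, constraints)
print(same_group, cross_group, deep_same_group)

# The same nested list is what goes into the parameter dict.
params = {"interaction_constraints": constraints}
```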

Additional parameters for the hist, gpu_hist and approx tree methods

single_precision_histogram [default= false]

Use single precision to build histograms instead of double precision.
max_cat_to_onehot

New in version 1.6.
Note: support for this parameter is experimental.
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Only relevant for regression and binary classification. Also, the exact tree method is not supported.
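
The threshold rule for max_cat_to_onehot amounts to a single comparison. The sketch below restates it in Python; the function name and the threshold of 4 used in the example are assumptions for illustration.

```python
def categorical_split_strategy(n_categories, max_cat_to_onehot):
    """Pick the split style for a categorical feature, as described above."""
    # Fewer categories than the threshold: one-hot based split.
    if n_categories < max_cat_to_onehot:
        return "one-hot"
    # Otherwise the categories are partitioned into the two children.
    return "partition"


threshold = 4  # illustrative value
small = categorical_split_strategy(3, threshold)
large = categorical_split_strategy(4, threshold)
print(small, large)
```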


4 Learning task parameters
These specify the learning task and the corresponding learning objective, for example: objective, base_score, eval_metric, seed, seed_per_iteration.
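
As a rough sketch, learning task parameters sit in the same dict as the booster parameters. The choice of binary:logistic with auc and the seed value are assumptions for a binary classification example; other tasks would use other objectives and metrics.

```python
# Learning task parameters for a binary classification example.
task_params = {
    "objective": "binary:logistic",  # learning objective
    "base_score": 0.5,               # initial prediction score of all instances
    "eval_metric": "auc",            # metric reported on evaluation data
    "seed": 42,                      # random seed (illustrative value)
}

# Booster parameters from the earlier sections (illustrative values).
booster_params = {"eta": 0.1, "max_depth": 6, "subsample": 0.8}

# Merge both groups into the single dict that training code would receive.
params = {**booster_params, **task_params}
print(params)
```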