[07] Parameters

LuCh1Monster

Category: LightGBM

Article link: https://blog.csdn.net/LuCh1Monster/article/details/106725823

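Table of contents

1. Parameter format
2. Core parameters
3. Learning control parameters
4. IO parameters
5. Objective parameters
6. Metric parameters
7. Network parameters
8. GPU parameters
9. Model conversion parameters
10. Others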

This page lists all LightGBM parameters.

External link: Laurae++ Interactive Documentation

Updated 2017/8/4

The defaults of the following parameters have changed:

min_data_in_leaf = 100 => 20
min_sum_hessian_in_leaf = 10 => 1e-3
num_leaves = 127 => 31
num_iterations = 10 => 100

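1. Parameter format

Parameters are written as key1=value1 key2=value2 ..., and can be set both in a config file and on the command line. On the command line, there must be no spaces before or after =. In a config file, each line may contain only one parameter, and # starts a comment.

If a parameter appears both on the command line and in the config file, LightGBM uses the value from the command line.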
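The precedence rule can be sketched in a few lines of Python. This is only an illustration of the format described here, not LightGBM's own parser; the helper names (`parse_config_text`, `parse_cli_args`, `merge_params`) are hypothetical, and stripping spaces around = inside the config file is an assumption.

```python
def parse_config_text(text):
    """Parse config-file text: one key=value per line, '#' starts a comment."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {raw_line!r}")
        # Assumption: spaces around '=' are tolerated inside a config file.
        params[key.strip()] = value.strip()
    return params


def parse_cli_args(args):
    """Parse command-line tokens of the form key=value (no spaces around '=')."""
    params = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got: {arg!r}")
        params[key] = value
    return params


def merge_params(config_params, cli_params):
    """Command-line values override config-file values for the same key."""
    merged = dict(config_params)
    merged.update(cli_params)
    return merged


config_text = """
# training settings
task = train
num_leaves = 31
learning_rate = 0.1
"""

file_params = parse_config_text(config_text)
cli_params = parse_cli_args(["num_leaves=63", "data=train.txt"])
final_params = merge_params(file_params, cli_params)
print(final_params)
```

Here num_leaves appears in both places, so the merged result keeps the command-line value 63.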

2. Core parameters

config: default ""
- type: string
- alias: config_file (path to the config file)
task: default train
- type: enum
- options: train, predict, convert_model
  - train: alias=training, for training
  - predict: alias=prediction, test, for prediction
  - convert_model: converts a model file into if-else format; see the model conversion parameters for more information
application: default regression
- type: enum
- alias: objective, app
- options: regression, regression_l1, huber, fair, poisson, quantile, quantile_l2, binary, multiclass, multiclassova, xentropy, xentlambda, lambdarank
  - regression applications
    - regression_l2: L2 loss, alias: regression, mean_squared_error, mse
    - regression_l1: L1 loss, alias: mean_absolute_error, mae
    - huber: Huber loss
    - fair: Fair loss
    - poisson: Poisson regression
    - quantile: Quantile regression
    - quantile_l2: like quantile, but uses L2 loss instead
  - binary: binary log loss classification
  - multi-class classification applications
    - multiclass: softmax objective function; num_class should be set as well
    - multiclassova: One-vs-All binary objective function; num_class should be set as well
  - cross-entropy applications
    - xentropy: objective function for cross-entropy (with optional linear weights), alias=cross_entropy
    - xentlambda: alternative parameterization of cross-entropy, alias=cross_entropy_lambda
    - the label can be any value in the interval [0, 1]
  - lambdarank: lambdarank application
    - in lambdarank tasks the label should be of int type, and a larger number means higher relevance (e.g. 0: bad, 1: fair, 2: good, 3: perfect)
    - label_gain: can be used to set the gain (weight) of each int label
boosting: default gbdt
- type: enum
- options: gbdt, rf, dart, goss
- alias: boost, boosting_type
- gbdt: traditional Gradient Boosting Decision Tree
- rf: Random Forest
- dart: Dropouts meet Multiple Additive Regression Trees
- goss: Gradient-based One-Side Sampling
data: default ""
- type: string
- alias: train, train_data
- training data: LightGBM trains from this data
valid: default ""
- type: multi-string
- alias: test, valid_data, test_data
- validation/test data: LightGBM outputs metrics for these data
- multiple validation data sets are supported, separated by ,
num_iterations: default 100
- type: int
- alias: num_iteration, num_tree, num_trees, num_round, num_rounds, num_boost_round
- number of boosting iterations/trees
- Note: for the Python/R packages this parameter is ignored; use the num_boost_round (Python) or nrounds (R) input argument of train and cv instead
- Note: internally, LightGBM constructs num_class * num_iterations trees for multiclass problems
learning_rate: default 0.1
- type: double
- alias: shrinkage_rate (the shrinkage rate)
- in dart, it also affects the normalization weights of dropped trees
num_leaves: default 31
- type: int
- alias: num_leaf (number of leaves in one tree)
tree_learner: default serial
- type: enum
- alias: tree
- options: serial, feature, data, voting
  - serial: single-machine tree learner
  - feature: feature parallel tree learner, alias: feature_parallel
  - data: data parallel tree learner, alias: data_parallel
  - voting: voting parallel tree learner, alias: voting_parallel
  - see the Parallel Learning Guide for more details
num_threads: default OpenMP_default
- type: int
- alias: num_thread, nthread
- number of threads for LightGBM
- for the best speed, set this to the number of real CPU cores, not the number of threads (most CPUs use hyper-threading to provide 2 threads per CPU core)
- do not set it too large when your data set is small (for example, do not use 64 threads for a data set with 10,000 rows)
- be aware that a task manager or any similar CPU monitoring tool might report cores as not fully utilized. This is normal.
- for parallel learning, do not use all CPU cores, since this hurts network performance
device: default cpu
- options: cpu, gpu
- chooses the device for tree learning; you can use gpu to get faster learning
- Note: a smaller max_bin (e.g. 63) is recommended for better speed
- Note: see the Installation Guide to build the GPU version
max_depth: default -1
- type: int
- limits the max depth of the tree model. This is used to deal with overfitting when the data set is small. Trees still grow leaf-wise
- < 0 means no limit
min_data_in_leaf: default 20
- type: int
- alias: min_data_per_leaf, min_data, min_child_samples
- minimum number of data in one leaf. Can be used to deal with overfitting.
min_sum_hessian_in_leaf: default 1e-3
- type: double
- alias: min_sum_hessian_per_leaf, min_sum_hessian, min_hessian, min_child_weight
- minimum sum of hessians in one leaf. Like min_data_in_leaf, it can be used to deal with overfitting.
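Since most core parameters accept several aliases, it can help to normalize a parameter dictionary before reasoning about it. The sketch below is a hypothetical helper, not part of LightGBM: the alias table is copied from the entries above, `normalize` and `total_trees` are made-up names, and treating multiclassova the same way as multiclass when counting trees is an assumption.

```python
# Alias table taken from the parameter list above (only a few entries shown).
ALIASES = {
    "application": ["objective", "app"],
    "num_iterations": ["num_iteration", "num_tree", "num_trees",
                       "num_round", "num_rounds", "num_boost_round"],
    "learning_rate": ["shrinkage_rate"],
    "num_leaves": ["num_leaf"],
    "num_threads": ["num_thread", "nthread"],
}
ALIAS_TO_CANONICAL = {
    alias: name for name, aliases in ALIASES.items() for alias in aliases
}


def normalize(params):
    """Rewrite alias keys to their canonical parameter names."""
    normalized = {}
    for key, value in params.items():
        canonical = ALIAS_TO_CANONICAL.get(key, key)
        if canonical in normalized:
            raise ValueError(f"{canonical!r} was given more than once")
        normalized[canonical] = value
    return normalized


def total_trees(params):
    """Number of trees built, using the num_class * num_iterations note."""
    p = normalize(params)
    num_iterations = int(p.get("num_iterations", 100))  # documented default
    num_class = int(p.get("num_class", 1))               # documented default
    # Assumption: both multiclass objectives build one tree per class.
    if p.get("application") in ("multiclass", "multiclassova"):
        return num_class * num_iterations
    return num_iterations


print(normalize({"nthread": 4, "num_round": 50}))
print(total_trees({"objective": "multiclass", "num_class": 3, "num_trees": 10}))
```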

3. Learning control parameters

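feature_fraction: default 1.0
- type: double
- 0.0 < feature_fraction <= 1.0
- alias: sub_feature, colsample_bytree
- if feature_fraction is smaller than 1.0, LightGBM randomly selects a subset of features on each iteration. For example, if set to 0.8, 80% of the features are selected before training each tree
- can be used to speed up training
- can be used to deal with overfitting
feature_fraction_seed: default 2
- type: int
- random seed for feature_fraction
bagging_fraction: default 1.0
- type: double
- 0.0 < bagging_fraction <= 1.0
- alias: sub_row, subsample
- like feature_fraction, but randomly selects part of the data without resampling
- can be used to speed up training
- can be used to deal with overfitting
- Note: to enable bagging, bagging_freq should be set to a non-zero value as well
bagging_freq: default 0
- type: int
- alias: subsample_freq
- frequency of bagging. 0 disables bagging; k means bagging is performed every k iterations
- Note: to enable bagging, bagging_fraction should be set appropriately as well
bagging_seed: default 3, random seed for bagging
- type: int
- alias: bagging_fraction_seed
early_stopping_round: default 0
- type: int
- alias: early_stopping_rounds, early_stopping
- training stops if one metric of one validation set has not improved in the last early_stopping_round rounds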
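lambda_l1: default 0, L1 regularization
- type: double
- alias: reg_alpha
lambda_l2: default 0, L2 regularization
- type: double
- alias: reg_lambda
min_split_gain: default 0, the minimal gain required to perform a split
- type: double
- alias: min_gain_to_split
drop_rate: default 0.1
- type: double
- only used in dart
skip_drop: default 0.5
- type: double
- only used in dart, probability of skipping a drop
max_drop: default 50
- type: int
- only used in dart, max number of trees dropped in one iteration
- <=0 means no limit
uniform_drop: default false
- type: bool
- only used in dart; set this to true to use uniform drop
xgboost_dart_mode: default false
- type: bool
- only used in dart; set this to true to use the xgboost dart mode
drop_seed: default 4
- type: int
- only used in dart, random seed for choosing the models to drop
top_rate: default 0.2
- type: double
- only used in goss, the retain ratio of large-gradient data
other_rate: default 0.1
- type: double
- only used in goss, the retain ratio of small-gradient data
min_data_per_group: default 100
- type: int
- minimum number of data per categorical group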
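max_cat_threshold: default 32
- type: int
- used for categorical features
- limits the max number of threshold points for categorical features
cat_smooth: default 10
- type: double
- used for categorical features
- reduces the effect of noise in categorical features, especially for categories with few data
cat_l2: default 10
- type: double
- L2 regularization in categorical splits
max_cat_to_onehot: default 4
- type: int
- when the number of categories of a feature is smaller than or equal to max_cat_to_onehot, the one-vs-other split algorithm is used
top_k: default 20
- type: int
- alias: topk
- used in voting parallel
- a larger value gives more accurate results but slows down training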
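The two bagging notes above depend on each other: bagging only has an effect when bagging_freq is non-zero and bagging_fraction is below 1.0. The following is a minimal sketch of that rule with hypothetical helper names; the exact iteration indexing (re-sampling on 0-based iterations divisible by bagging_freq) is an assumption, not something stated here.

```python
def bagging_enabled(params):
    """Return True when the two bagging settings together activate bagging."""
    fraction = float(params.get("bagging_fraction", 1.0))  # documented default
    freq = int(params.get("bagging_freq", 0))              # documented default
    if not 0.0 < fraction <= 1.0:
        raise ValueError("bagging_fraction must be in (0.0, 1.0]")
    return freq > 0 and fraction < 1.0


def resampling_iterations(num_iterations, bagging_freq):
    """List the iterations on which the data subset would be re-drawn.

    Assumption: iterations are 0-based and a new subset is drawn whenever
    the iteration index is divisible by bagging_freq.
    """
    if bagging_freq <= 0:
        return []
    return [i for i in range(num_iterations) if i % bagging_freq == 0]


print(bagging_enabled({"bagging_fraction": 0.8}))                     # freq is 0
print(bagging_enabled({"bagging_fraction": 0.8, "bagging_freq": 5}))
print(resampling_iterations(12, 5))
```

Setting only bagging_fraction=0.8 leaves bagging disabled, which is the usual pitfall the notes warn about.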

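4. IO parameters

max_bin: default 255
- type: int
- max number of bins that feature values are bucketed into. A small number of bins may reduce training accuracy but may improve generalization (helps against overfitting)
- LightGBM automatically compresses memory according to max_bin. For example, with max_bin=255 LightGBM uses uint8_t for feature values
min_data_in_bin: default 3
- type: int
- minimum number of data inside one bin; use this to avoid one-data-one-bin (which may overfit)
data_random_seed: default 1
- type: int
- random seed for data partitioning in parallel learning (not including feature parallel)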
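output_model: default LightGBM_model.txt
- type: string
- alias: model_output, model_out
- file name of the model output by training
input_model: default ""
- type: string
- alias: model_input, model_in
- file name of the input model
- for the prediction task, this model is used to predict on the data
- for the train task, training continues from this model
output_result: default LightGBM_predict_result.txt
- type: string
- alias: predict_result, prediction_result
- file name of the prediction result in the prediction task
model_format: default text
- type: multi-enum
- options: text, proto
- format used to save and load the model
- text: use a text string
- proto: use the protocol buffer binary format
- you can save in multiple formats by joining them with ,, for example text,proto. In that case, model_format is appended as a suffix to output_model
- Note: loading with multiple formats is not supported
- Note: to use this parameter you need a build with protobuf support
pre_partition: default false
- type: bool
- alias: is_pre_partition
- used for parallel learning (not including feature parallel)
- true if the training data are pre-partitioned and different machines use different partitions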
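is_sparse: default true
- type: bool
- alias: is_enable_sparse, enable_sparse
- enables/disables sparse optimization; set to false to disable sparse optimization
two_round: default false
- type: bool
- alias: two_round_loading, use_two_round_loading
- by default, LightGBM maps the data file to memory and loads features from memory. This gives faster data loading, but memory may run out when the data file is very big
- set this to true if the data file is too big to fit in memory
save_binary: default false
- type: bool
- alias: is_save_binary, is_save_binary_file
- if set to true, LightGBM saves the data set (including validation data) to a binary file, which speeds up data loading next time
verbosity: default 1
- type: int
- alias: verbose
- <0: Fatal; =0: Error (Warn); >0: Info
header: default false
- type: bool
- alias: has_header
- set this to true if the input data has a header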
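label: default ""
- type: string
- alias: label_column
- specifies the label column
- use a number for an index, e.g. label=0 means column_0 is the label
- add the prefix name: for a column name, e.g. label=name:is_click
weight: default ""
- type: string
- alias: weight_column
- specifies the weight column
- use a number for an index, e.g. weight=0 means column_0 is the weight
- add the prefix name: for a column name, e.g. weight=name:weight
- Note: the index starts from 0. When the passed type is an index, the label column is not counted; e.g. when the label is column_0 and the weight is column_1, the correct parameter is weight=0
query: default ""
- type: string
- alias: query_column, group, group_column
- specifies the query/group ID column
- use a number for an index, e.g. query=0 means column_0 is the query ID
- add the prefix name: for a column name, e.g. query=name:query_id
- Note: the data should be grouped by query_id. The index starts from 0. When the passed type is an index, the label column is not counted; e.g. when the label is column_0 and the query ID is column_1, the correct parameter is query=0
ignore_column: default ""
- type: string
- alias: ignore_feature, blacklist
- specifies columns to ignore in training
- use numbers for indices, e.g. ignore_column=0,1,2 means column_0, column_1 and column_2 are ignored
- add the prefix name: for column names, e.g. ignore_column=name:c1,c2,c3 means c1, c2 and c3 are ignored
- Note: works only when data is loaded directly from a file
- Note: the index starts from 0 and does not count the label column
categorical_feature: default ""
- type: string
- alias: categorical_column, cat_feature, cat_column
- specifies the categorical features
- use numbers for indices, e.g. categorical_feature=0,1,2 means column_0, column_1 and column_2 are categorical features
- add the prefix name: for column names, e.g. categorical_feature=name:c1,c2,c3 means c1, c2 and c3 are categorical features
- Note: only categorical features of int type are supported. The index starts from 0 and does not count the label column
- Note: negative values are treated as missing values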
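predict_raw_score: default false
- type: bool
- alias: raw_score, is_predict_raw_score
- only used in the prediction task
- set to true to predict only the raw scores
- set to false to predict the transformed scores
predict_leaf_index: default false
- type: bool
- alias: leaf_index, is_predict_leaf_index
- only used in the prediction task
- set to true to predict with the leaf index of every tree
predict_contrib: default false
- type: bool
- alias: contrib, is_predict_contrib
- set to true to estimate SHAP values, which represent how each feature contributes to each prediction. Produces (number of features + 1) values, where the last value is the expected value of the model output over the training data
bin_construct_sample_cnt: default 200000
- type: int
- alias: subsample_for_bin
- number of data sampled to construct the histogram bins
- a larger value gives better training results but increases data loading time
- set this to a larger value if the data is very sparse
num_iteration_predict: default -1
- type: int
- only used in the prediction task
- specifies how many trained iterations are used in prediction
- <=0 means no limit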
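pred_early_stop: default false
- type: bool
- if true, early stopping is used to speed up prediction; this may affect accuracy
pred_early_stop_freq: default 10
- type: int
- the frequency of checking early stopping during prediction
pred_early_stop_margin: default 10.0
- type: double
- the margin threshold for early stopping during prediction
use_missing: default true
- type: bool
- set to false to disable the special handling of missing values
zero_as_missing: default false
- type: bool
- set to true to treat all zeros as missing values (including the values not shown in libsvm/sparse matrices)
- set to false to use na to represent missing values
init_score_file: default ""
- type: string
- path to the training initial score file; "" means train_data_file + .init is used (if it exists)
valid_init_score_file: default ""
- type: multi-string
- path to the validation initial score file; "" means valid_data_file + .init is used (if it exists)
- separate by , for multiple validation data sets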
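The index rule for weight, query, ignore_column and categorical_feature (indices start at 0 and skip the label column) is easy to get wrong. A small sketch of the conversion follows; `param_index` is a hypothetical helper, and its behaviour for a label that is not in column 0 is an interpretation of "the label column is not counted", not something the text above spells out.

```python
def param_index(file_column, label_column=0):
    """Convert a 0-based column position in the data file to the index
    expected by weight=, query=, ignore_column= or categorical_feature=.

    Assumption: columns after the label shift down by one, columns before
    the label keep their position.
    """
    if file_column == label_column:
        raise ValueError("the label column itself has no feature index")
    if file_column > label_column:
        return file_column - 1
    return file_column


# Label in column 0, weight in column 1 -> pass weight=0 (the documented case).
print(f"weight={param_index(1, label_column=0)}")
# Label in column 0, categorical features in file columns 3 and 5.
print("categorical_feature=" + ",".join(str(param_index(c)) for c in (3, 5)))
```

Using the name: prefix instead of indices avoids this offset entirely when the file has a header.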

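5. Objective parameters

sigmoid: default 1.0
- type: double
- parameter of the sigmoid function; used in binary classification and lambdarank
alpha: default 0.9
- type: double
- parameter of Huber loss and Quantile regression; used in the regression task
fair_c: default 1.0
- type: double
- parameter of Fair loss; used in the regression task
gaussian_eta: default 1.0
- type: double
- parameter that controls the width of the Gaussian function; used in the regression_l1 and huber losses
poisson_max_delta_step: default 0.7
- type: double
- parameter of Poisson regression, used to safeguard optimization
scale_pos_weight: default 1.0
- type: double
- weight of the positive class in the binary classification task
boost_from_average: default true
- type: bool
- only used in the regression task
- adjusts the initial score to the mean of the labels for faster convergence
is_unbalance: default false
- type: bool
- alias: unbalanced_sets
- used in binary classification
- set this to true if the training data are unbalanced
max_position: default 20
- type: int
- used in lambdarank
- NDCG is optimized at this position
label_gain: default 0,1,3,7,15,31,63,...
- type: multi-double
- used in lambdarank
- the gain for each label. For example, with the default label gains, the gain of label 2 is 3
- separated by ,
num_class: default 1
- type: int
- alias: num_classes
- only used in multiclass classification
reg_sqrt: default false
- type: bool
- fits sqrt(label) instead, and the prediction result is automatically converted back with pow2(prediction)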
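The default label_gain sequence 0,1,3,7,15,31,63,... follows the pattern 2^label - 1. The sketch below regenerates it and formats a custom gain list; both function names are hypothetical helpers for illustration.

```python
def default_label_gain(max_label):
    """Gains for labels 0..max_label following the 2**label - 1 pattern."""
    return [(1 << label) - 1 for label in range(max_label + 1)]


def format_label_gain(gains):
    """Format a gain list as the comma-separated value of label_gain."""
    return ",".join(str(g) for g in gains)


gains = default_label_gain(6)
print(gains)                               # [0, 1, 3, 7, 15, 31, 63]
print("label_gain=" + format_label_gain(gains))
print(gains[2])                            # gain of label 2 is 3
```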

6. Metric parameters

metric: default l2 (regression), binary_logloss (binary classification), ndcg (lambdarank)
- type: multi-enum
- options: l1, l2, ndcg, auc, binary_logloss, binary_error, …
  - l1: absolute loss, alias: mean_absolute_error, mae
  - l2: square loss, alias: mean_squared_error, mse
  - l2_root: root square loss, alias: root_mean_squared_error, rmse
  - quantile: Quantile regression
  - huber: Huber loss
  - fair: Fair loss
  - poisson: Poisson regression
  - ndcg: NDCG
  - map: MAP
  - auc: AUC
  - binary_logloss: log loss
  - binary_error: for one sample, 0 for a correct classification and 1 for a wrong classification
  - multi_logloss: log loss for multi-class classification
  - multi_error: error rate for multi-class classification
  - xentropy: cross-entropy (with optional linear weights), alias: cross_entropy
  - xentlambda: "intensity-weighted" cross-entropy, alias: cross_entropy_lambda
  - kldiv: Kullback-Leibler divergence, alias: kullback_leibler
  - multiple metrics are supported, separated by ,
metric_freq: default 1
- type: int
- frequency of metric output
train_metric: default false
- type: bool
- alias: training_metric, is_training_metric
- set this to true if you need the metric result on the training data
ndcg_at: default 1,2,3,4,5
- type: multi-int
- alias: ndcg_eval_at, eval_at
- NDCG evaluation positions, separated by ,
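To make the ndcg and ndcg_at entries concrete, here is a sketch of NDCG at a position k using the textbook formulation (gain divided by log2 of the rank plus one). It is not LightGBM's implementation; the function names are hypothetical, the gain table reuses the default label_gain values, and returning 1.0 for a query whose labels are all zero is an assumption.

```python
import math


def dcg_at_k(labels, k, gains):
    """Discounted cumulative gain of the first k labels in ranked order."""
    return sum(
        gains[label] / math.log2(position + 2)  # position 0 gets log2(2) = 1
        for position, label in enumerate(labels[:k])
    )


def ndcg_at_k(labels_in_predicted_order, k, gains):
    """NDCG@k: DCG of the predicted order divided by DCG of the ideal order."""
    ideal_order = sorted(labels_in_predicted_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal_order, k, gains)
    if ideal_dcg == 0:
        return 1.0  # assumption: a query with no relevant rows scores 1.0
    return dcg_at_k(labels_in_predicted_order, k, gains) / ideal_dcg


gains = [0, 1, 3, 7]  # default label_gain for labels 0..3
for k in (1, 2, 3):   # the first three default ndcg_at positions
    print(k, ndcg_at_k([2, 3, 0, 1], k, gains))
```

A ranking that already lists rows by descending label yields 1.0 at every position; any other order yields a smaller value.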

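7. Network parameters

The following parameters are used for parallel learning, and only for the base (socket) version.

num_machines: default 1
- type: int
- alias: num_machine
- used for parallel learning: the number of machines in the parallel learning application
- needs to be set in both the socket and mpi versions
local_listen_port: default 12400
- type: int
- alias: local_port
- TCP listen port of the local machine
- you should allow this port in the firewall settings before training
time_out: default 120
- type: int
- socket time-out in minutes
machine_list_file: default ""
- type: string
- alias: mlist
- file that lists the machines of this parallel learning application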
- each line contains one IP and one port for one machine. The format is ip port, separated by a space
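A machine list file can be checked before a parallel run is launched. The sketch below assumes each line holds an IP and a port separated by whitespace; `parse_machine_list` is a hypothetical helper and the addresses are made-up examples.

```python
def parse_machine_list(text):
    """Parse machine-list text into (ip, port) tuples, one machine per line."""
    machines = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 'ip port'")
        ip, port = fields[0], int(fields[1])
        if not 0 < port < 65536:
            raise ValueError(f"line {line_number}: invalid port {port}")
        machines.append((ip, port))
    return machines


# Hypothetical two-machine setup using the default local_listen_port.
machine_list_text = "192.168.0.1 12400\n192.168.0.2 12400\n"
machines = parse_machine_list(machine_list_text)
print(machines)
print(f"num_machines={len(machines)}")  # should match the num_machines setting
```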

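8. GPU parameters

gpu_platform_id: default -1, which means the system-wide default platform
- type: int
- OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform.
gpu_device_id: default -1, which means the default device in the selected platform
- type: int
- OpenCL device ID in the specified platform. Each GPU in the selected platform has a unique device ID
gpu_use_dp: default false
- type: bool
- set to true to use double precision math on the GPU (single precision is used by default)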

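9. Model conversion parameters

This feature is only supported in the command line version.

convert_model_language: default ""
- type: string
- only cpp is supported
- if convert_model_language is set while task is train, the model is also converted
convert_model: default gbdt_prediction.cpp
- type: string
- output file name of the converted model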

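10. Others

10.1 Continued training with input scores

LightGBM supports continued training with initial scores. It uses an additional file to store these initial scores, like the following: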

0.5
-0.1
0.9
...

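This means the initial score of the first data row is 0.5, the second is -0.1, and so on. The initial score file corresponds to the data file line by line, with one score per line.

If the data file is named train.txt, the initial score file should be named train.txt.init and placed in the same folder as the data file. In that case, LightGBM loads the initial score file automatically if it exists.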
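The naming and one-score-per-row conventions can be sketched as follows, assuming the side file is named by appending .init to the data file name. Both helper names are hypothetical, and the row-count check is an illustration rather than a description of what LightGBM does on a mismatch.

```python
def init_score_path(data_path):
    """Name of the initial score file that sits next to the data file."""
    return data_path + ".init"


def parse_init_scores(text, num_rows):
    """Read one score per line and check it lines up with the data rows."""
    scores = [float(line) for line in text.splitlines() if line.strip()]
    if len(scores) != num_rows:
        raise ValueError(
            f"expected {num_rows} scores, found {len(scores)}"
        )
    return scores


print(init_score_path("train.txt"))
print(parse_init_scores("0.5\n-0.1\n0.9\n", num_rows=3))
```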

10.2 Weight data

LightGBM supports weighted training. It uses an additional file to store the weight data, like the following:

1.0
0.5
0.8
...

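This means the weight of the first data row is 1.0, the second is 0.5, and so on. The weight file corresponds to the data file line by line, with one weight per line.

If the data file is named train.txt, the weight file should be named train.txt.weight and placed in the same folder as the data file. In that case, LightGBM loads the weight file automatically if it exists.

Update: the weight column can now be specified in the data file. See the parameters above.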

10.3 Query data

LambdaRank training needs query information for the training data. LightGBM uses an additional file to store the query data, like the following:

27
18
67
...

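This means the first 27 rows belong to one query, the next 18 rows belong to another, and so on. Note: the data should be ordered by query.

If the data file is named train.txt, the query file should be named train.txt.query and placed in the same folder as the data file. In that case, LightGBM loads the query file automatically if it exists.

Update: the query/group ID column can now be specified in the data file. See the parameters above.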
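The per-query row counts translate directly into row ranges of the data file. The sketch below shows that bookkeeping; `query_boundaries` is a hypothetical helper, and the half-open (start, end) ranges are a presentation choice.

```python
def query_boundaries(counts):
    """Turn per-query row counts into half-open (start, end) row ranges."""
    boundaries = []
    start = 0
    for count in counts:
        if count <= 0:
            raise ValueError("each query must contain at least one row")
        boundaries.append((start, start + count))
        start += count
    return boundaries


counts = [int(line) for line in "27\n18\n67\n".splitlines()]
print(query_boundaries(counts))  # [(0, 27), (27, 45), (45, 112)]
print(sum(counts))               # total rows the data file should contain
```

The sum of the counts should equal the number of rows in the data file, which is a cheap sanity check before training.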
