Py: lightgbm: a detailed guide to the introduction, installation, and usage of lightgbm

Author: 一个处女座的程序猿

Last modified: 2023-03-02 23:06:00

Categories: Python_Libraries, ML. Tags: decision tree, machine learning, algorithm

First published: 2019-01-14 16:54:13

Article link: https://blog.csdn.net/qq_41185868/article/details/86480147

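Introduction to lightgbm

1. Advantages of LightGBM

2. Performance comparison between lightgbm and xgboost

2. Introduction to the LGBMClassifier() and LGBMRegressor() functions

ML: lightgbm: a detailed guide to the LGBMClassifier()/LGBMRegressor() functions, with worked examples and parameter tuning tips

3. ML: lightgbm: Boston housing price regression with lightgbm in two forms (native API [feature importance / loading a model for online inference], and native API combined with the sklearn API [grid search tuning / prediction with the best parameters]), implementation code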

Introduction to lightgbm

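LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is fast, distributed, and high-performance, and can be used for ranking, classification, regression, and many other machine learning tasks.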
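GBDT is a popular machine learning algorithm, but when the feature dimension is high or the data size is large, its efficiency and scalability are unsatisfactory. lightgbm introduces GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) to address this. At the same accuracy, lightgbm is reported to train up to about 20 times faster than conventional GBDT.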
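To make the GOSS idea mentioned above more concrete, here is a minimal NumPy sketch of the sampling step. It assumes the usual description of GOSS: keep the instances with the largest absolute gradients, randomly sample from the rest, and up-weight the sampled small-gradient instances by (1 - a) / b. The function name `goss_sample` and the default rates are illustrative assumptions, not LightGBM's internal code.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Illustrative GOSS-style sampling (hypothetical helper, not LightGBM internals).

    Returns the selected row indices and the weight applied to each selected row.
    """
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)      # rows kept because their gradients are large
    n_other = int(other_rate * n)  # rows sampled at random from the remainder

    # Sort rows by absolute gradient, largest first.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]
    rest_idx = order[n_top:]

    # Randomly sample from the small-gradient rows without replacement.
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(idx))
    # Amplify the sampled small-gradient rows so their total contribution
    # approximates that of all small-gradient rows.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return idx, weights

gradients = np.arange(100, dtype=float) - 50.0
idx, weights = goss_sample(gradients)
print(len(idx), weights[0], weights[-1])
```

With these rates, 30 of the 100 rows are used to estimate split gains, which is where the training-time saving would come from.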
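In competitions, XGBoost is known to be very popular. It is an excellent boosting framework, but in use its training takes a long time and its memory footprint is relatively large. In January 2017, Microsoft open-sourced a new boosting tool on GitHub: LightGBM. Without reducing accuracy, it is about 10 times faster and uses about 3 times less memory. It is based on decision tree algorithms and splits leaf nodes with a best-first, leaf-wise strategy, whereas other boosting algorithms generally grow trees depth-wise or level-wise rather than leaf-wise. As a result, when growing to the same number of leaves, the leaf-wise algorithm reduces more loss than the level-wise algorithm, which can lead to higher accuracy than other existing boosting algorithms reach. Its speed is also striking, which is where the "Light" in the name comes from.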
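The leaf-wise (best-first) strategy described above can be sketched with a toy regression tree on one feature. This is a simplified illustration under stated assumptions: squared-error loss, exhaustive threshold search, and hypothetical helper names (`sse`, `best_split`, `grow_leaf_wise`). It is not how LightGBM is implemented, only the growth order it uses: always split the leaf whose best split reduces the loss the most.

```python
import heapq

def sse(ys):
    """Sum of squared errors of ys around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(points):
    """Best threshold split of (x, y) points sorted by x: (gain, left, right)."""
    base = sse([y for _, y in points])
    best = (0.0, None, None)
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical feature values
        left, right = points[:i], points[i:]
        gain = base - sse([y for _, y in left]) - sse([y for _, y in right])
        if gain > best[0]:
            best = (gain, left, right)
    return best

def grow_leaf_wise(points, num_leaves):
    """Grow a tree leaf-wise until it has num_leaves leaves (or no split helps)."""
    heap = []    # max-heap on gain (stored negated); holds every current leaf
    counter = 0  # tie-breaker so the heap never compares the point lists

    def push(leaf):
        nonlocal counter
        gain, left, right = best_split(leaf)
        heapq.heappush(heap, (-gain, counter, leaf, left, right))
        counter += 1

    push(sorted(points))
    while len(heap) < num_leaves:
        entry = heapq.heappop(heap)  # leaf with the largest achievable gain
        _, _, leaf, left, right = entry
        if left is None:             # no leaf can be improved any further
            heapq.heappush(heap, entry)
            break
        push(left)
        push(right)
    return [entry[2] for entry in heap]

points = [(x, 10.0 * (x >= 5) + (x % 2)) for x in range(10)]
leaves = grow_leaf_wise(points, num_leaves=4)
print(len(leaves), sum(sse([y for _, y in leaf]) for leaf in leaves))
```

A level-wise learner would instead split every leaf at the current depth before going deeper, regardless of how much each split helps.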

LightGBM Chinese documentation: http://lightgbm.apachecn.org/#/

lightgbm on GitHub: microsoft/LightGBM, a fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
lightgbm on PyPI: lightgbm

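1. Advantages of LightGBM

Speed and memory usage optimization: reduced cost of computing split gain; further speed-up through histogram subtraction; reduced memory usage; reduced communication cost for parallel learning
Sparse optimization
Accuracy optimization: leaf-wise (best-first) tree growth strategy; optimal split for categorical feature values
Network communication optimization
Parallel learning optimization: feature parallel; data parallel; voting parallel
GPU support, able to handle large-scale data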
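The histogram subtraction trick listed above can be illustrated with a small NumPy sketch. It assumes features have already been bucketed into integer bins, and uses a hypothetical helper `gradient_histogram`. The point is that once the parent's histogram and one child's histogram are known, the other child's histogram follows by subtraction, without another pass over that child's rows.

```python
import numpy as np

def gradient_histogram(bin_idx, grad, n_bins):
    """Sum of gradients per feature bin (hypothetical helper for illustration)."""
    return np.bincount(bin_idx, weights=grad, minlength=n_bins)

rng = np.random.default_rng(0)
n_rows, n_bins = 1000, 8
bin_idx = rng.integers(0, n_bins, size=n_rows)  # pre-binned feature values
grad = rng.normal(size=n_rows)                  # per-row gradients
go_left = rng.random(n_rows) < 0.3              # rows routed to the left child

parent_hist = gradient_histogram(bin_idx, grad, n_bins)
left_hist = gradient_histogram(bin_idx[go_left], grad[go_left], n_bins)

# The right child's histogram obtained by subtraction ...
right_by_subtraction = parent_hist - left_hist
# ... matches the histogram built directly from the right child's rows.
right_direct = gradient_histogram(bin_idx[~go_left], grad[~go_left], n_bins)

print(np.allclose(right_by_subtraction, right_direct))
```

In practice the histogram would be built for the smaller child and the larger one derived by subtraction, so the saving grows with the imbalance of the split.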

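2. Performance comparison between lightgbm and xgboost

(1) Efficiency comparison

To compare efficiency, we run only the training task, without any test or metric output, and we do not count IO time. The table below compares the elapsed time: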

Data	xgboost	xgboost_hist	LightGBM
Higgs	3794.34 s	551.898 s	238.505513 s
Yahoo LTR	674.322 s	265.302 s	150.18644 s
MS LTR	1251.27 s	385.201 s	215.320316 s
Expo	1607.35 s	588.253 s	138.504179 s
Allstate	2867.22 s	1355.71 s	348.084475 s

We find that LightGBM is faster than xgboost on all datasets.
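As a quick worked check of the table above, the speedup on each dataset can be computed directly from the reported times. The numbers below are copied from the table; the variable names are illustrative.

```python
# Training times in seconds, copied from the efficiency table above.
times = {
    "Higgs": {"xgboost": 3794.34, "xgboost_hist": 551.898, "LightGBM": 238.505513},
    "Yahoo LTR": {"xgboost": 674.322, "xgboost_hist": 265.302, "LightGBM": 150.18644},
    "MS LTR": {"xgboost": 1251.27, "xgboost_hist": 385.201, "LightGBM": 215.320316},
    "Expo": {"xgboost": 1607.35, "xgboost_hist": 588.253, "LightGBM": 138.504179},
    "Allstate": {"xgboost": 2867.22, "xgboost_hist": 1355.71, "LightGBM": 348.084475},
}

# Speedup of LightGBM relative to each xgboost variant.
speedups = {
    name: {
        "vs_xgboost": row["xgboost"] / row["LightGBM"],
        "vs_xgboost_hist": row["xgboost_hist"] / row["LightGBM"],
    }
    for name, row in times.items()
}

for name, s in speedups.items():
    print(f"{name}: {s['vs_xgboost']:.1f}x vs xgboost, "
          f"{s['vs_xgboost_hist']:.1f}x vs xgboost_hist")
```

The gap is much smaller against xgboost_hist than against the exact xgboost method, which is worth keeping in mind when reading headline speedup figures.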

(2) Accuracy comparison

To compare accuracy fairly, we use the accuracy on the test portion of each dataset.

Data	Metric	xgboost	xgboost_hist	LightGBM
Higgs	AUC	0.839593	0.845605	0.845154
Yahoo LTR	NDCG@1	0.719748	0.720223	0.732466
Yahoo LTR	NDCG@3	0.717813	0.721519	0.738048
Yahoo LTR	NDCG@5	0.737849	0.739904	0.756548
Yahoo LTR	NDCG@10	0.78089	0.783013	0.796818
MS LTR	NDCG@1	0.483956	0.488649	0.524255
MS LTR	NDCG@3	0.467951	0.473184	0.505327
MS LTR	NDCG@5	0.472476	0.477438	0.510007
MS LTR	NDCG@10	0.492429	0.496967	0.527371
Expo	AUC	0.756713	0.777777	0.777543
Allstate

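(3) Memory consumption comparison

We monitor RES (resident memory) while running the training task, and set two_round=true in LightGBM to reduce peak memory usage. This setting increases data loading time but does not affect training speed or accuracy.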

Data	xgboost	xgboost_hist	LightGBM
Higgs	4.853GB	3.784GB	0.868GB
Yahoo LTR	1.907GB	1.468GB	0.831GB
MS LTR	5.469GB	3.654GB	0.886GB
Expo	1.553GB	1.393GB	0.543GB
Allstate	6.237GB	4.990GB

Installing lightgbm

pip install lightgbm

Using lightgbm

1. Basic function usage

lightgbm.Dataset

class lightgbm.Dataset(data, label=None, max_bin=None, reference=None, weight=None, group=None, init_score=None, silent=False, feature_name='auto', categorical_feature='auto', params=None, free_raw_data=True)

Parameters:

data (string, numpy array or scipy.sparse) – Data source of Dataset. If string, it represents the path to txt file.
label (list, numpy 1-D array or None, optional (default=None)) – Label of the data.
max_bin (int or None, optional (default=None)) – Max number of discrete bins for features. If None, default value from parameters of CLI-version will be used.
reference (Dataset or None, optional (default=None)) – If this is Dataset for validation, training data should be used as reference.
weight (list, numpy 1-D array or None, optional (default=None)) – Weight for each instance.
group (list, numpy 1-D array or None, optional (default=None)) – Group/query size for Dataset.
init_score (list, numpy 1-D array or None, optional (default=None)) – Init score for Dataset.
silent (bool, optional (default=False)) – Whether to print messages during construction.
feature_name (list of strings or 'auto', optional (default="auto")) – Feature names. If 'auto' and data is pandas DataFrame, data columns names are used.
categorical_feature (list of strings or int, or 'auto', optional (default="auto")) – Categorical features. If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify feature_name as well). If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
params (dict or None, optional (default=None)) – Other parameters.
free_raw_data (bool, optional (default=True)) – If True, raw data is freed after constructing inner Dataset.