LightGBM 二元分类、多类分类、 Python的回归和分类器应用

最新推荐文章于 2025-03-25 10:53:36 发布

python机器学习建模

最新推荐文章于 2025-03-25 10:53:36 发布

阅读量8.2k

点赞数 6

分类专栏： python风控模型 python生物信息学文章标签：回归 python 多分类 lightgbm 分类器

本文链接：https://blog.csdn.net/fulk6667g78o8/article/details/121893789

版权

python风控模型同时被 2 个专栏收录

199 篇文章

订阅专栏

python生物信息学

72 篇文章

订阅专栏

LightGBM是一个梯度提升框架，它使用基于树的学习算法。与其他提升算法相比，它被设计为分布式且高效。可以用于比较的模型是 XGBoost，它也是一种提升方法，与其他算法相比，它的表现非常出色。

然而XGBoost是数据集的好算法升超过10000行ESS，对于大型数据集，所以不推荐。而LightGBM可以处理大量数据，占用内存少，具有并行和GPU 学习，准确率好，训练速度和效率更快。那么是什么让 LightGBM 成为一个更好的模型，对于一个模型来说，它使叶子明智，而其他算法则是明智地增长。

LightGBM 树叶生长

其他算法级别增长

LightGBM的安装：

如果您使用的是 Anaconda：

conda install -c conda-forge lightgbm

LightGBM 支持的操作类型：

回归
二元分类
多类分类
交叉熵
兰姆酒

在本文中，我将向您展示如何执行_二元分类_、多类分类_和_回归。在开始之前，我想提醒您，我们将使用的数据集是玩具数据集（记录较少），因此它们在 LightGBM中容易过度拟合。为了避免过度拟合，我们可以使用max_depth值。您可能会怀疑 max_depth 用于按级别增长，请放心，即使指定了 max_depth，树也会按叶方向增长。

#importing libraries
import numpy as np
from collections import Counter
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer,load_boston,load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import mean_squared_error,roc_auc_score,precision_score
pd.options.display.max_columns = 999

1.使用乳腺癌数据集的二元分类

#loading the breast cancer dataset
X=load_breast_cancer()
df=pd.DataFrame(X.data,columns=X.feature_names)
Y=X.target 
#scaling the features using Standard Scaler
sc=StandardScaler()
sc.fit(df)
X=pd.DataFrame(sc.fit_transform(df))
#train_test_split 
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=0)
#converting the dataset into proper LGB format 
d_train=lgb.Dataset(X_train, label=y_train)
#Specifying the parameter
params={}
params['learning_rate']=0.03
params['boosting_type']='gbdt' #GradientBoostingDecisionTree
params['objective']='binary' #Binary target feature
params['metric']='binary_logloss' #metric for binary classification
params['max_depth']=10
#train the model 
clf=lgb.train(params,d_train,100) #train the model on 100 epocs
#prediction on the test set
y_pred=clf.predict(X_test)

我们已经创建、训练和测试了模型。现在我们将使用来自 sklearn 的roc_auc_score指标来检查模型的执行情况。存储在 y_pred 中的预测看起来像这样[0.04558262, 0.89328757, 0.97349586, 0.97226278, 0.950874]所以我们需要将它们转换成正确的格式。

if>=0.5 ---> 1
else ---->0
#rounding the values
y_pred=y_pred.round(0)
#converting from float to integer
y_pred=y_pred.astype(int)
#roc_auc_score metric
roc_auc_score(y_pred,y_test)
#0.9672167056074766

2.使用Wine数据集进行多类分类

#loading the dataset
X1=load_wine()
df_1=pd.DataFrame(X1.data,columns=X1.feature_names)
Y_1=X1.target
#Scaling using the Standard Scaler
sc_1=StandardScaler()
sc_1.fit(df_1)
X_1=pd.DataFrame(sc_1.fit_transform(df_1))
#train-test-split
X_train,X_test,y_train,y_test=train_test_split(X_1,Y_1,test_size=0.3,random_state=0)
#Converting the dataset in proper LGB format
d_train=lgb.Dataset(X_train, label=y_train)
#setting up the parameters
params={}
params['learning_rate']=0.03
params['boosting_type']='gbdt' #GradientBoostingDecisionTree
params['objective']='multiclass' #Multi-class target feature
params['metric']='multi_logloss' #metric for multi-class
params['max_depth']=10
params['num_class']=3 #no.of unique values in the target class not inclusive of the end value
#training the model
clf=lgb.train(params,d_train,100)  #training the model on 100 epocs
#prediction on the test dataset
y_pred_1=clf.predict(X_test)
#printing the predictions
y_pred_1
[0.95819859, 0.02205037, 0.01975104],
[0.05465546, 0.09575231, 0.84959223],
[0.20955298, 0.69498737, 0.09545964],
[0.95852959, 0.02096561, 0.02050481],
[0.04243184, 0.92053949, 0.03702867],....

在多类问题中，模型产生num_class(3)概率，如上面的输出所示。我们可以使用numpy.argmax()方法输出具有最合理结果的类。

#argmax() method 
y_pred_1 = [np.argmax(line) for line in y_pred_1]
#printing the predictions
[0,2,1,0,1,0,0,2,...]
#using precision score for error metrics
precision_score(y_pred_1,y_test,average=None).mean()
# 0.9545454545454546

3.使用波士顿数据集进行回归

#loading the Boston Dataset
X=load_boston()
df=pd.DataFrame(X.data,columns=X.feature_names)
Y=X.target
#Scaling using the Standard Scaler
sc=StandardScaler()
sc.fit(df)
X=pd.DataFrame(sc.fit_transform(df))
#train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=0)
#Converting the data into proper LGB Dataset Format
d_train=lgb.Dataset(X_train, label=y_train)
#Declaring the parameters
params={}
params['learning_rate']=0.03
params['boosting_type']='gbdt' #GradientBoostingDecisionTree
params['objective']='regression'#regression task
params['n_estimators']=100
params['max_depth']=10
#model creation and training
clf=lgb.train(params,d_train,100)
#model prediction on X_test
y_pred=clf.predict(X_test)
#using RMSE error metric
mean_squared_error(y_pred,y_test)
# 0.9672167056074766

我已经介绍了使用 LightGBM 创建模型的基础知识。我建议您练习使用相同的数据集，调整参数并尝试改进模型预测。我还建议您熟悉模型的不同参数，以便在使用 API 时可以更自由地移动。一旦您了解 LightGBM，我向您保证，这将成为您执行任何任务的首选算法，因为它快速、轻便且准确无误。

QQ学习群：1026993837 领资料
LightGBM 二元分类、多类分类、使用 Python 的回归就为大家介绍到这里，欢迎学习《Python数据分析与机器学习项目实战》bye！