A record of a thorough LightGBM tuning workflow from 2024

# Reference: 魔镜杯风控算法实战(三)调参篇
https://zhuanlan.zhihu.com/p/104376738

However, some of the parameters in the original article have since been deprecated in recent LightGBM releases, and the cv result key auc-mean has been renamed to valid auc-mean! The code therefore needs a few small adjustments before it will run, so I am writing up this experience for reference.

1) First, tune:
num_boost_round (n_estimators)

Use lgb's built-in cv to find it, pinning learning_rate at 0.1 for now so that the later parameter sweeps run faster.

import time
import lightgbm as lgb
from lightgbm import log_evaluation, early_stopping
start = time.time()

lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)

base_params = {'boosting_type': 'gbdt',
               'learning_rate': 0.1,
               'num_leaves': 31,
               'max_depth': -1,
               'bagging_fraction': 0.7,
               'feature_fraction': 0.7,
               'lambda_l1': 0,
               'lambda_l2': 0,
               'min_data_in_leaf': 20,
               'min_sum_hessian_in_leaf': 0.001,
               'metric': 'auc'}

callbacks = [log_evaluation(period=100), early_stopping(stopping_rounds=30)]

cv_result = lgb.cv(train_set=lgb_train,
                   num_boost_round=1000,
                   nfold=5,
                   stratified=True,
                   shuffle=True,
                   params=base_params,
                   metrics='auc',
                   callbacks=callbacks,
                   seed=0)
end = time.time()
print('Number of iterations: {}'.format(len(cv_result['valid auc-mean'])))
print('Cross-validated AUC: {}'.format(max(cv_result['valid auc-mean'])))
print('Runtime: {} seconds'.format(round(end - start, 0)))

Result:
Number of iterations: 444  Cross-validated AUC: 0.7941171177211479  Runtime: 3.0 seconds

Note that I changed the lookup to cv_result['valid auc-mean']: the result key auc-mean has been renamed to valid auc-mean.
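
If you are not sure which key your installed version returns, printing the result dict's keys is a quick way to check before indexing into it (a minimal sanity check, independent of version):

# List the metric keys returned by lgb.cv on the installed version, so the
# right key ('auc-mean' on old versions, 'valid auc-mean' on recent ones)
# can be confirmed before indexing into the result.
print(list(cv_result.keys()))
# e.g. ['valid auc-mean', 'valid auc-stdv']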

2) Next:

The steps below pair LightGBM with GridSearchCV, which requires the sklearn interface (LGBMClassifier).
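
A side note before the grids: LGBMClassifier accepts the native parameter names (bagging_fraction, lambda_l1, ...) as extra keyword arguments, but its own constructor arguments are the sklearn-style aliases, which is why the "... will be ignored" warnings show up in the training log later on. A minimal sketch of the same baseline written with only the sklearn-native names (the alias mapping below is standard LightGBM behavior):

# Same baseline model expressed through the sklearn-native argument names,
# which avoids the alias warnings the native names trigger in the wrapper:
#   bagging_fraction        -> subsample (only active when subsample_freq > 0)
#   feature_fraction        -> colsample_bytree
#   lambda_l1 / lambda_l2   -> reg_alpha / reg_lambda
#   min_data_in_leaf        -> min_child_samples
#   min_sum_hessian_in_leaf -> min_child_weight
model_native = lgb.LGBMClassifier(
    learning_rate=0.1,
    n_estimators=444,
    max_depth=-1,
    subsample=0.7,
    subsample_freq=1,
    colsample_bytree=0.7,
    reg_alpha=0,
    reg_lambda=0,
    min_child_samples=20,
    min_child_weight=0.001)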

  • num_leaves

The strategy here is a coarse search first, then a finer one; if you are feeling lazy you can go straight to the fine grid in one pass, but the code will take longer to run.

from sklearn.model_selection import GridSearchCV

start = time.time()
params1 = {'num_leaves': list(range(50, 150, 5))}
model_lgb1 = lgb.LGBMClassifier(
    learning_rate=0.1,
    n_estimators=444,
    max_depth=-1,
    bagging_fraction=0.7,
    feature_fraction=0.7,
    lambda_l1=0,
    lambda_l2=0,
    min_data_in_leaf=20,
    min_sum_hessian_in_leaf=0.001)
grid_search1 = GridSearchCV(estimator=model_lgb1, cv=5, param_grid=params1, n_jobs=-1, scoring='roc_auc')
grid_search1.fit(X_train, y_train)
end = time.time()
print('Best parameters: {}'.format(grid_search1.best_params_))
print('Best score: {}'.format(grid_search1.best_score_))
print('Runtime: {} seconds'.format(round(end - start, 0)))

Best parameters: {'num_leaves': 55}
Best score: 0.7928556904381894
Runtime: 38.0 seconds
start = time.time()
params2 = {'num_leaves': list(range(30, 60, 2))}
model_lgb2 = lgb.LGBMClassifier(
    learning_rate=0.1,
    n_estimators=444,
    max_depth=-1,
    bagging_fraction=0.7,
    feature_fraction=0.7,
    lambda_l1=0,
    lambda_l2=0,
    min_data_in_leaf=20,
    min_sum_hessian_in_leaf=0.001)
grid_search2 = GridSearchCV(estimator=model_lgb2, cv=5, param_grid=params2, n_jobs=-1, scoring='roc_auc')
grid_search2.fit(X_train, y_train)
end = time.time()
print('Best parameters: {}'.format(grid_search2.best_params_))
print('Best score: {}'.format(grid_search2.best_score_))
print('Runtime: {} seconds'.format(round(end - start, 0)))

Best parameters: {'num_leaves': 38}
Best score: 0.7930652393271933
Runtime: 16.0 seconds
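
Rather than reading off best_params_ alone, it can be worth plotting the whole curve from cv_results_ to see whether num_leaves=38 sits on a flat plateau or a sharp peak (a small sketch; assumes matplotlib is available):

# Plot mean CV AUC against each candidate num_leaves from the fine grid.
import matplotlib.pyplot as plt

leaves = [p['num_leaves'] for p in grid_search2.cv_results_['params']]
scores = grid_search2.cv_results_['mean_test_score']
plt.plot(leaves, scores, marker='o')
plt.xlabel('num_leaves')
plt.ylabel('mean CV AUC')
plt.show()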

3) Tune: bagging_fraction, feature_fraction

start = time.time()
params3 = {'bagging_fraction': [i/10 for i in range(4, 11)], 'feature_fraction': [i/10 for i in range(4, 11)]}
model_lgb3 = lgb.LGBMClassifier(
    learning_rate=0.1,
    n_estimators=444,
    num_leaves=38,
    max_depth=-1,
    lambda_l1=0,
    lambda_l2=0,
    min_data_in_leaf=20,
    min_sum_hessian_in_leaf=0.001)
grid_search3 = GridSearchCV(estimator=model_lgb3, cv=5, param_grid=params3, n_jobs=-1, scoring='roc_auc')
grid_search3.fit(X_train, y_train)
end = time.time()
print('Best parameters: {}'.format(grid_search3.best_params_))
print('Best score: {}'.format(grid_search3.best_score_))
print('Runtime: {} seconds'.format(round(end - start, 0)))

Best parameters: {'bagging_fraction': 0.4, 'feature_fraction': 0.4}
Best score: 0.7930652393271933
Runtime: 55.0 seconds
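
One caveat here: in LightGBM, bagging_fraction only takes effect when bagging_freq is set to a positive value, and the grid above leaves it at its default of 0, so rows were never actually subsampled (and with a single input feature, as the training log further down shows, feature_fraction cannot drop any columns either; that would explain why the best score is identical to step 2's). A sketch of the same grid with bagging actually enabled; bagging_freq=1 is an illustrative choice, not a value from the original article:

# Re-run the fraction grid with bagging enabled via bagging_freq.
params3b = {'bagging_fraction': [i/10 for i in range(4, 11)],
            'feature_fraction': [i/10 for i in range(4, 11)]}
model_lgb3b = lgb.LGBMClassifier(
    learning_rate=0.1,
    n_estimators=444,
    num_leaves=38,
    max_depth=-1,
    bagging_freq=1,  # required for bagging_fraction to have any effect
    min_data_in_leaf=20,
    min_sum_hessian_in_leaf=0.001)
grid_search3b = GridSearchCV(estimator=model_lgb3b, cv=5, param_grid=params3b,
                             n_jobs=-1, scoring='roc_auc')
grid_search3b.fit(X_train, y_train)
print(grid_search3b.best_params_, grid_search3b.best_score_)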

4) Tune: lambda_l1, lambda_l2

start = time.time()
params4 = {'lambda_l1': [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5], 'lambda_l2': [0, 0.001, 0.01, 0.03, 0.08, 0.3, 0.5]}
model_lgb4 = lgb.LGBMClassifier(
    learning_rate=0.1,
    n_estimators=444,
    num_leaves=38,
    max_depth=-1,
    bagging_fraction=0.4,
    feature_fraction=0.4,
    min_data_in_leaf=20,
    min_sum_hessian_in_leaf=0.001)
grid_search4 = GridSearchCV(estimator=model_lgb4, cv=5, param_grid=params4, n_jobs=-1, scoring='roc_auc')
grid_search4.fit(X_train, y_train)
end = time.time()
print('Best parameters: {}'.format(grid_search4.best_params_))
print('Best score: {}'.format(grid_search4.best_score_))
print('Runtime: {} seconds'.format(round(end - start, 0)))

Best parameters: {'lambda_l1': 0.3, 'lambda_l2': 0.08}
Best score: 0.793414505029456
Runtime: 47.0 seconds

Note that the original article goes on to tune min_data_in_leaf and min_sum_hessian_in_leaf as well, but in recent LightGBM these two are native aliases of the sklearn wrapper's own min_child_samples and min_child_weight arguments, so tuning them through LGBMClassifier just triggers alias conflicts; I skipped them.
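
If you do still want to tune those two leaf constraints, the clash disappears when you grid over the sklearn-native aliases instead: min_child_samples corresponds to min_data_in_leaf, and min_child_weight to min_sum_hessian_in_leaf. A sketch with illustrative candidate values (the grids below are my own, not from the original article):

# Tune the leaf-size constraints through the sklearn-native aliases, which
# the wrapper owns directly, so no "will be ignored" alias warnings appear.
params5 = {'min_child_samples': [10, 20, 30, 50],
           'min_child_weight': [0.001, 0.01, 0.1]}
model_lgb5 = lgb.LGBMClassifier(
    learning_rate=0.1,
    n_estimators=444,
    num_leaves=38,
    max_depth=-1,
    reg_alpha=0.3,    # sklearn-native alias for lambda_l1
    reg_lambda=0.08)  # sklearn-native alias for lambda_l2
grid_search5 = GridSearchCV(estimator=model_lgb5, cv=5, param_grid=params5,
                            n_jobs=-1, scoring='roc_auc')
grid_search5.fit(X_train, y_train)
print(grid_search5.best_params_, grid_search5.best_score_)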

5) With the parameters above settled, set a much smaller learning_rate of 0.005 and determine the final num_boost_round.

start = time.time()
final_params = {'boosting_type': 'gbdt',
                'learning_rate': 0.005,
                'num_leaves': 38,
                'max_depth': -1,
                'bagging_fraction': 0.4,
                'feature_fraction': 0.4,
                'lambda_l1': 0.3,
                'lambda_l2': 0.08,
                'metric': 'auc'}

callbacks = [log_evaluation(period=100), early_stopping(stopping_rounds=30)]
cv_result = lgb.cv(train_set=lgb_train,
                   num_boost_round=444,
                   nfold=5,
                   stratified=True,
                   shuffle=True,
                   params=final_params,
                   metrics='auc',
                   callbacks=callbacks,
                   seed=0)
end = time.time()
print('Maximum number of iterations: {}'.format(len(cv_result['valid auc-mean'])))
print('Cross-validated AUC: {}'.format(max(cv_result['valid auc-mean'])))
print('Runtime: {} seconds'.format(round(end - start, 0)))

Maximum number of iterations: 444
Cross-validated AUC: 0.7607449893610381
Runtime: 2.0 seconds
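
Note that the iteration count came back equal to the 444-round cap, which means early stopping never actually fired: at learning_rate=0.005 the model was still improving when the budget ran out, hence the lower AUC. To let early stopping find the real optimum, the cv can be rerun with a much larger ceiling (5000 below is just an arbitrary generous cap, my own choice):

# Re-run the final cv with a ceiling well above what lr=0.005 could need,
# letting early_stopping(30) pick the iteration count instead of the cap.
cv_result2 = lgb.cv(train_set=lgb_train,
                    num_boost_round=5000,  # generous cap; early stopping decides
                    nfold=5,
                    stratified=True,
                    shuffle=True,
                    params=final_params,
                    metrics='auc',
                    callbacks=callbacks,
                    seed=0)
print('Best iteration: {}'.format(len(cv_result2['valid auc-mean'])))
print('Cross-validated AUC: {}'.format(max(cv_result2['valid auc-mean'])))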

6) Final modeling: all of the tuned parameters are now in hand, so plug them into the model and visualize the AUC directly to see how the optimized result looks.

import matplotlib.pyplot as plt
from sklearn import metrics

final_model = lgb.LGBMClassifier(
    learning_rate=0.005,
    n_estimators=500,
    num_leaves=38,
    max_depth=-1,
    bagging_fraction=0.4,
    feature_fraction=0.4,
    lambda_l1=0.3,
    lambda_l2=0.08)
final_model.fit(X_train, y_train)
final_pre = final_model.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = metrics.roc_curve(y_test, final_pre)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize=(6, 6))
plt.title('Validation ROC')
plt.plot(fpr, tpr, 'b', label='Val AUC = %0.3f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
[LightGBM] [Warning] feature_fraction is set=0.4, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.4
[LightGBM] [Warning] lambda_l1 is set=0.3, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.3
[LightGBM] [Warning] lambda_l2 is set=0.08, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0.08
[LightGBM] [Warning] bagging_fraction is set=0.4, subsample=1.0 will be ignored. Current value: bagging_fraction=0.4
[LightGBM] [Info] Number of positive: 2715, number of negative: 2676
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000100 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 254
[LightGBM] [Info] Number of data points in the train set: 5391, number of used features: 1
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.503617 -> initscore=0.014469
[LightGBM] [Info] Start training from score 0.014469

7) Evaluation: classification report and confusion matrix:

# For a more detailed classification report (precision, recall, F1 score, etc.)
from sklearn.metrics import classification_report

y_pred = final_model.predict(X_test)

print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.69      0.67      0.68      1175
           1       0.67      0.69      0.68      1136

    accuracy                           0.68      2311
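
For the confusion matrix itself, sklearn's confusion_matrix takes the same hard predictions:

# Confusion matrix on the same predictions; rows are true labels (0, 1),
# columns are predicted labels.
from sklearn.metrics import confusion_matrix

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))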

That completes the whole workflow. My closing impression: counting the time spent looking things up, this round of coding took close to an hour, yet the model barely improved over the default parameters.

For comparison, here is the model built with default LightGBM parameters:

# No grid search: default LGBM
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from lightgbm import LGBMClassifier

# data_total is the raw DataFrame:
# data_total['ABS'] is the feature column, data_total['财务舞弊'] (financial fraud) is the target

# Convert to numpy arrays
X = np.array(data_total['ABS']).reshape(-1, 1)
y = np.array(data_total['财务舞弊'])

# Oversample the minority class
ros = RandomOverSampler(sampling_strategy='auto', random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=25)

# Initialize the LightGBM classifier with default parameters
clf = LGBMClassifier(random_state=0)

# Train the model
clf.fit(X_train, y_train)

# Evaluate on the test set
# Note: LGBMClassifier.score returns accuracy, which may not be the best choice for imbalanced data
test_score = clf.score(X_test, y_test)
print("Test score (accuracy) with default parameters:", test_score)

# For other metrics (e.g. ROC AUC), use sklearn's metrics module
from sklearn.metrics import roc_auc_score
y_pred_proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
test_auc = roc_auc_score(y_test, y_pred_proba)
print("Test AUC with default parameters:", test_auc)
