LightGBM Parameter Tuning

1. Overview


1. How to use the lgb.cv function

(1) Arguments

  • params: parameters for the booster (the base learner).

  • train_set: the training set, as an lgb.Dataset.

  • nfold: number of folds for cross-validation.

  • metrics (aliases: metric, metric_types): the evaluation metric(s). default = ""
    The metrics to be evaluated on the evaluation sets.

  • num_boost_round: maximum number of boosting iterations.

  • early_stopping_rounds: number of rounds for early stopping.

  • verbose_eval: report progress once every n iterations.

  • stratified: default True; whether to use stratified sampling (recommended).

  • shuffle: default True; whether to shuffle the data (not recommended here).

  • seed: equivalent to random_state.

(2) Keys to set in params

  • objective: the type of task. Regression: regression; binary classification: binary; multi-class classification: multiclass; ranking; and so on.

  • boosting: default gbdt; alternatives: rf, dart.

  • n_jobs: number of threads (see num_threads below, for which it is an alias).

  • learning_rate: the learning rate, default 0.1.

  • num_leaves
    Maximum number of leaves in one tree. default = 31, type = int, aliases: num_leaf, max_leaves, max_leaf, max_leaf_nodes; constraint: 1 < num_leaves <= 131072.

  • max_depth: default = -1, type = int
    Limits the maximum depth of the tree model. This is used to deal with over-fitting when #data is small. The tree still grows leaf-wise.
    <= 0 means no limit.

  • subsample: default = 1.0, type = double
    bagging_fraction; aliases: sub_row, subsample, bagging; constraint: 0.0 < bagging_fraction <= 1.0
    Like feature_fraction, but this randomly selects part of the data without resampling.
    Can be used to speed up training.
    Can be used to deal with over-fitting.
    Note: to enable bagging, bagging_freq should also be set to a non-zero value.

  • colsample_bytree
    feature_fraction, default = 1.0, type = double, aliases: sub_feature, colsample_bytree; constraint: 0.0 < feature_fraction <= 1.0
    If feature_fraction is smaller than 1.0, LightGBM will randomly select a subset of features on each iteration (tree). For example, if set to 0.8, LightGBM will select 80% of the features before training each tree.
    Can be used to speed up training.
    Can be used to deal with over-fitting.

  • num_threads: default = 0, type = int, aliases: num_thread, nthread, nthreads, n_jobs
    Used only in the train, prediction and refit tasks, or the corresponding functions of language-specific packages.
    The number of threads for LightGBM.
    0 means the default number of threads in OpenMP.
    For best speed, set this to the number of real CPU cores, not the number of threads (most CPUs use hyper-threading to generate 2 threads per CPU core).
    Do not set it too large if your dataset is small (for instance, do not use 64 threads for a dataset with 10,000 rows).
    Be aware that a task manager or any similar CPU monitoring tool might report that cores are not fully utilized; this is normal.
    For distributed learning, do not use all CPU cores, because this will cause poor network communication performance.
    Note: do not change this during training, especially when running multiple jobs simultaneously through external packages, otherwise it may cause undesirable errors.
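Putting the two lists together, here is a minimal sketch of a cv call (train_set is a placeholder for a prepared lgb.Dataset, and the parameter values are illustrative; note that LightGBM >= 4 passes early stopping and logging as callbacks instead of the early_stopping_rounds / verbose_eval arguments that appear in the older code later in this post):

import lightgbm as lgb

# Booster parameters drawn from the list above (illustrative values).
params = {
    'objective': 'binary',
    'boosting': 'gbdt',
    'metric': 'auc',
    'learning_rate': 0.1,
    'num_leaves': 31,
}

# train_set is assumed to be a prepared lgb.Dataset(X_train, y_train).
cv_results = lgb.cv(
    params, train_set,
    num_boost_round=1000,
    nfold=5, stratified=True, shuffle=True, seed=0,
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(50)],  # LightGBM >= 4 style
)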

2. Tuning with GridSearchCV

The tuning process for LightGBM is similar to that of RF, GBDT and the like. The basic workflow is as follows:

  • 1. First choose a relatively high learning rate, around 0.1, to speed up convergence. This is necessary for efficient tuning.

  • 2. Tune the basic decision-tree parameters.

  • 3. Tune the regularization parameters.

  • 4. Finally, lower the learning rate to squeeze out the last bit of accuracy.

Step 1: learning rate and number of iterations

We first fix the learning rate at a relatively high value, here **learning_rate = 0.1**. Next we settle the type of estimator (boosting/boost/boosting_type); the default choice is usually gbdt.

The number of iterations, i.e. the number of residual trees, goes by the parameter names n_estimators/num_iterations/num_round/num_boost_round. We can set this parameter to a large value first and then read the optimal number of iterations off the cv results, as shown in the code below.

Before that, we must give the other important parameters initial values. The initial values themselves matter little; they are only there so that the remaining parameters can be determined. The initial values are given below:

The following parameters are set according to the requirements of the specific project:

'boosting_type'/'boosting': 'gbdt'
'objective': 'binary'
'metric': 'auc'

Here are the initial values I chose:

'max_depth': 5     # The dataset is not very large, so a moderate value is chosen; anything in 4-10 would work.
'num_leaves': 30   # LightGBM grows trees leaf-wise; the official advice is to keep this below 2^max_depth.
'subsample'/'bagging_fraction': 0.8           # row (data) sampling
'colsample_bytree'/'feature_fraction': 0.8   # feature sampling
 

Now we use LightGBM's cv function to determine it:

import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation has been removed

canceData = load_breast_cancer()
X = canceData.data
y = canceData.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'nthread': 4,
    'learning_rate': 0.1,
    'num_leaves': 30,
    'max_depth': 5,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}

data_train = lgb.Dataset(X_train, y_train)
# Note: in LightGBM >= 4, early_stopping_rounds has been replaced by
# callbacks=[lgb.early_stopping(50)].
cv_results = lgb.cv(params, data_train, num_boost_round=1000, nfold=5, stratified=False,
                    shuffle=True, metrics='auc', early_stopping_rounds=50, seed=0)
print('best n_estimators:', len(cv_results['auc-mean']))
print('best cv score:', pd.Series(cv_results['auc-mean']).max())

Step 2: determine max_depth and num_leaves

These are the most important parameters for improving accuracy. Here we bring in sklearn's GridSearchCV() to do the search.

from sklearn.model_selection import GridSearchCV  # sklearn.grid_search has been removed

params_test1 = {'max_depth': range(3, 8, 1), 'num_leaves': range(5, 100, 5)}

gsearch1 = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc',
                                 learning_rate=0.1, n_estimators=188,  # n_estimators from step 1
                                 bagging_fraction=0.8, feature_fraction=0.8),
    param_grid=params_test1, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch1.fit(X_train, y_train)
print(gsearch1.best_params_, gsearch1.best_score_)  # grid_scores_ was replaced by cv_results_

Step 3: determine min_data_in_leaf and max_bin

  • min_data_in_leaf: default = 20, type = int; the minimal number of data in one leaf. Can be used to deal with over-fitting.

  • max_bin: default = 255, type = int; the maximum number of bins that feature values are bucketed into. A smaller number of bins may reduce training accuracy but can help against over-fitting.

  • min_data_in_bin: default = 3, type = int, constraint: min_data_in_bin > 0
    Minimal number of data inside one bin.
    Use this to avoid one-data-one-bin (potential over-fitting).
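A minimal sketch for this step, following the same GridSearchCV pattern as step 2 (the search ranges are illustrative, gsearch1.best_params_ carries over the max_depth / num_leaves found in step 2, and it is assumed that the LightGBM sklearn wrapper forwards min_data_in_leaf / max_bin as parameter aliases, just as it does for bagging_fraction above):

params_test2 = {'max_bin': range(5, 256, 10), 'min_data_in_leaf': range(1, 102, 10)}

gsearch2 = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary',
                                 learning_rate=0.1, n_estimators=188,
                                 bagging_fraction=0.8, feature_fraction=0.8,
                                 **gsearch1.best_params_),
    param_grid=params_test2, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch2.fit(X_train, y_train)
print(gsearch2.best_params_, gsearch2.best_score_)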

Step 4: determine feature_fraction, bagging_fraction and bagging_freq
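Again the same pattern; a sketch with illustrative ranges (remember that bagging only takes effect when bagging_freq is non-zero):

params_test3 = {
    'feature_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
    'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
    'bagging_freq': range(0, 50, 5),
}
gsearch3 = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary',
                                 learning_rate=0.1, n_estimators=188,
                                 **gsearch1.best_params_, **gsearch2.best_params_),
    param_grid=params_test3, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch3.fit(X_train, y_train)
print(gsearch3.best_params_, gsearch3.best_score_)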

Step 5: determine lambda_l1 and lambda_l2

  • lambda_l1: default = 0.0, type = double, aliases: reg_alpha, l1_regularization; constraint: lambda_l1 >= 0.0
    L1 regularization.
  • lambda_l2: default = 0.0, type = double, aliases: reg_lambda, lambda, l2_regularization; constraint: lambda_l2 >= 0.0
    L2 regularization.
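A sketch for the regularization search (the grids mirror the values used in section 3 below; the estimator again carries over the best values from the earlier sketches):

params_test4 = {
    'lambda_l1': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
    'lambda_l2': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.4, 0.6, 0.7, 0.9, 1.0],
}
gsearch4 = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary',
                                 learning_rate=0.1, n_estimators=188,
                                 **gsearch1.best_params_, **gsearch2.best_params_,
                                 **gsearch3.best_params_),
    param_grid=params_test4, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch4.fit(X_train, y_train)
print(gsearch4.best_params_, gsearch4.best_score_)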

Step 6: determine min_split_gain
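One last single-parameter sketch in the same style (range illustrative):

params_test5 = {'min_split_gain': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
gsearch5 = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary',
                                 learning_rate=0.1, n_estimators=188,
                                 **gsearch1.best_params_, **gsearch2.best_params_,
                                 **gsearch3.best_params_, **gsearch4.best_params_),
    param_grid=params_test5, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch5.fit(X_train, y_train)
print(gsearch5.best_params_, gsearch5.best_score_)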

Step 7: lower the learning rate, increase the number of iterations, and validate the model
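A minimal sketch of this final step, collecting the best values found in the sketches for steps 2-6 and retraining with a lower learning rate and more trees:

from sklearn.metrics import roc_auc_score

best = {**gsearch1.best_params_, **gsearch2.best_params_, **gsearch3.best_params_,
        **gsearch4.best_params_, **gsearch5.best_params_}
final_model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary',
                                 learning_rate=0.01, n_estimators=1000, **best)
final_model.fit(X_train, y_train)
# Use predicted probabilities, not hard labels, for AUC
print('auc:', roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1]))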

3. Tuning with LightGBM's cv function

This approach saves effort: once the code is written, it searches for the optimum automatically. It still requires tuning experience, though, since setting good parameter ranges takes some skill. The code is given directly below.

import pandas as pd
import lightgbm as lgb
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation has been removed
 
canceData=load_breast_cancer()
X=canceData.data
y=canceData.target
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0,test_size=0.2)
 
### Convert the data
print('Converting data')
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,free_raw_data=False)
 
### Set the initial parameters (cross-validation arguments not included)
print('Setting parameters')
params = {
          'boosting_type': 'gbdt',
          'objective': 'binary',
          'metric': 'auc',
          'nthread':4,
          'learning_rate':0.1
          }
 
### Cross-validation (tuning)
print('Cross-validation')
max_auc = 0.0
best_params = {}

# Accuracy
print("Tuning step 1: improve accuracy")
for num_leaves in range(5,100,5):
    for max_depth in range(3,8,1):
        params['num_leaves'] = num_leaves
        params['max_depth'] = max_depth
 
        cv_results = lgb.cv(
                            params,
                            lgb_train,
                            seed=1,
                            nfold=5,
                            metrics=['auc'],
                            early_stopping_rounds=10,
                            verbose_eval=True
                            )
            
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
            
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['num_leaves'] = num_leaves
            best_params['max_depth'] = max_depth
if 'num_leaves' in best_params and 'max_depth' in best_params:
    params['num_leaves'] = best_params['num_leaves']
    params['max_depth'] = best_params['max_depth']
 
# Over-fitting
print("Tuning step 2: reduce over-fitting")
for max_bin in range(5,256,10):
    for min_data_in_leaf in range(1,102,10):
        params['max_bin'] = max_bin
        params['min_data_in_leaf'] = min_data_in_leaf

        cv_results = lgb.cv(
                            params,
                            lgb_train,
                            seed=1,
                            nfold=5,
                            metrics=['auc'],
                            early_stopping_rounds=10,
                            verbose_eval=True
                            )

        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()

        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['max_bin'] = max_bin
            best_params['min_data_in_leaf'] = min_data_in_leaf
if 'max_bin' in best_params and 'min_data_in_leaf' in best_params:
    params['min_data_in_leaf'] = best_params['min_data_in_leaf']
    params['max_bin'] = best_params['max_bin']
 
print("Tuning step 3: reduce over-fitting")
for feature_fraction in [0.6,0.7,0.8,0.9,1.0]:
    for bagging_fraction in [0.6,0.7,0.8,0.9,1.0]:
        for bagging_freq in range(0,50,5):
            params['feature_fraction'] = feature_fraction
            params['bagging_fraction'] = bagging_fraction
            params['bagging_freq'] = bagging_freq
            
            cv_results = lgb.cv(
                                params,
                                lgb_train,
                                seed=1,
                                nfold=5,
                                metrics=['auc'],
                                early_stopping_rounds=10,
                                verbose_eval=True
                                )
                    
            mean_auc = pd.Series(cv_results['auc-mean']).max()
            boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
 
            if mean_auc >= max_auc:
                max_auc=mean_auc
                best_params['feature_fraction'] = feature_fraction
                best_params['bagging_fraction'] = bagging_fraction
                best_params['bagging_freq'] = bagging_freq
 
if all(k in best_params for k in ('feature_fraction', 'bagging_fraction', 'bagging_freq')):
    params['feature_fraction'] = best_params['feature_fraction']
    params['bagging_fraction'] = best_params['bagging_fraction']
    params['bagging_freq'] = best_params['bagging_freq']
 
 
print("Tuning step 4: reduce over-fitting")
for lambda_l1 in [1e-5,1e-3,1e-1,0.0,0.1,0.3,0.5,0.7,0.9,1.0]:
    for lambda_l2 in [1e-5,1e-3,1e-1,0.0,0.1,0.4,0.6,0.7,0.9,1.0]:
        params['lambda_l1'] = lambda_l1
        params['lambda_l2'] = lambda_l2
        cv_results = lgb.cv(
                            params,
                            lgb_train,
                            seed=1,
                            nfold=5,
                            metrics=['auc'],
                            early_stopping_rounds=10,
                            verbose_eval=True
                            )
                
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
 
        if mean_auc >= max_auc:
            max_auc=mean_auc
            best_params['lambda_l1'] = lambda_l1
            best_params['lambda_l2'] = lambda_l2
if 'lambda_l1' in best_params and 'lambda_l2' in best_params:
    params['lambda_l1'] = best_params['lambda_l1']
    params['lambda_l2'] = best_params['lambda_l2']
 
print("Tuning step 5: reduce over-fitting, part 2")
for min_split_gain in [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]:
    params['min_split_gain'] = min_split_gain
    
    cv_results = lgb.cv(
                        params,
                        lgb_train,
                        seed=1,
                        nfold=5,
                        metrics=['auc'],
                        early_stopping_rounds=10,
                        verbose_eval=True
                        )
            
    mean_auc = pd.Series(cv_results['auc-mean']).max()
    boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
 
    if mean_auc >= max_auc:
        max_auc=mean_auc
        
        best_params['min_split_gain'] = min_split_gain
if 'min_split_gain' in best_params:
    params['min_split_gain'] = best_params['min_split_gain']
 
print(best_params)

The results are as follows:

{'bagging_fraction': 0.7,
 'bagging_freq': 30,
 'feature_fraction': 0.8,
 'lambda_l1': 0.1,
 'lambda_l2': 0.0,
 'max_bin': 255,
 'max_depth': 4,
 'min_data_in_leaf': 81,
 'min_split_gain': 0.1,
 'num_leaves': 10}

We now plug the tuned parameters into the model:

model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc',
                           learning_rate=0.01, n_estimators=1000,
                           max_depth=4, num_leaves=10, max_bin=255, min_data_in_leaf=81,
                           bagging_fraction=0.7, bagging_freq=30, feature_fraction=0.8,
                           lambda_l1=0.1, lambda_l2=0, min_split_gain=0.1)
model.fit(X_train, y_train)
y_pre = model.predict(X_test)
print("acc:", metrics.accuracy_score(y_test, y_pre))
# AUC should be computed from predicted probabilities rather than hard labels
print("auc:", metrics.roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
