I. Installation
xgboost:
Wheel downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost
Find the wheel matching your Python version and platform, then install it with pip,
e.g. pip install xgboost-0.81-cp27-cp27m-win_amd64.whl
lightgbm:
Can be installed directly with pip install lightgbm
II. Introduction
Both xgboost and lightgbm ship a native API as well as an sklearn-style interface; either one gets the modeling job done, so they are not covered in detail here. The two libraries are called in almost exactly the same way and take almost exactly the same parameters, differing only in a handful of parameter names, so once you can use one, the other follows by analogy. The underlying theory is likewise not repeated here; see: XGBoost、LightGBM的详细对比介绍 (a detailed comparison of XGBoost and LightGBM).
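As a quick illustration of the two interfaces, here is a minimal sketch on synthetic data (xgboost shown; lightgbm mirrors it with lgb.Dataset/lgb.train on the native side and LGBMClassifier on the sklearn side):
import numpy as np
import xgboost as xgb
from xgboost import XGBClassifier

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

# native API: wrap the data in a DMatrix and call xgb.train
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'eta': 0.1, 'max_depth': 5}
bst = xgb.train(params, dtrain, num_boost_round=100)
prob_native = bst.predict(xgb.DMatrix(X))   # predicted probabilities

# sklearn-style interface: same model, fit/predict_proba conventions
clf = XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=5)
clf.fit(X, y)
prob_sklearn = clf.predict_proba(X)[:, 1]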
III. Parameter Tuning
The walkthrough below uses xgboost; apart from a few parameter names, lightgbm is tuned in much the same way, as sketched next.
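For reference, a rough lightgbm equivalent of the initial xgboost parameters used in Step 0, as an assumption about typical usage of the LGBMClassifier wrapper (note the renamed knobs: gamma becomes min_split_gain, and nthread/seed become n_jobs/random_state):
from lightgbm import LGBMClassifier

# sketch of an equivalent lightgbm starting point (names per the sklearn wrapper)
lgb_clf = LGBMClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    num_leaves=31,        # lightgbm-specific: trees are grown leaf-wise
    min_child_weight=1,
    min_split_gain=0,     # xgboost's gamma
    subsample=0.8,
    colsample_bytree=0.8,
    n_jobs=4,             # xgboost's nthread
    random_state=112,     # xgboost's seed
)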
Steps:
Step 0: Set the initial parameters
Start from a reasonable set of initial parameters:
best_params = {
    'learning_rate': 0.1,
    'n_estimators': 1000,
    'max_depth': 5,
    'min_child_weight': 1,
    'gamma': 0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'nthread': 4,
    'scale_pos_weight': 1,
    'seed': 112
}
Step 1: Find an initial number of boosting rounds
Set a relatively large learning rate (0.1) and use xgboost's built-in cv to determine a suitable number of iterations (i.e. the number of trees).
Define a helper function to search for the iteration count:
import time
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn import metrics
from xgboost import XGBClassifier

def xgb_fit(alg, X, y, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    """Fit an xgboost model and report its performance; X and y are pandas objects."""
    if useTrainCV:
        # run xgboost's cross-validation with early stopping to pick the tree count
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(X.values, label=y.values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc',
                          early_stopping_rounds=early_stopping_rounds, verbose_eval=True)
        # print(cvresult)
        alg.set_params(n_estimators=cvresult.shape[0])
    # Fit the algorithm on the data
    alg.fit(X, y, eval_metric='auc')
    # Predict on the training set:
    dtrain_predictions = alg.predict(X)
    dtrain_predprob = alg.predict_proba(X)[:, 1]
    # Print the model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(y.values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(y.values, dtrain_predprob))
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
    plt.show()
# older versions of xgboost's sklearn wrapper have no feature_importances_, but get_fscore() provides the same information
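In more recent xgboost releases the sklearn wrapper does expose feature_importances_ directly, so the same plot can be drawn without get_fscore(); a small alternative, assuming alg has already been fit:
# feature_importances_ is aligned with the training columns (newer xgboost versions)
feat_imp = pd.Series(alg.feature_importances_, index=X.columns).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')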
Find the optimal number of iterations:
xgb_0 = XGBClassifier(**best_params)
st = time.time()
xgb_fit(xgb_0, train_X, train_y)
et = time.time()
print("Best number of iterations: {}, elapsed {:.2f}s".format(xgb_0.n_estimators, et - st))
best_params['n_estimators'] = xgb_0.n_estimators
Step 2: Tune the parameters
Tune the parameters group by group in order of importance with GridSearchCV, doing a coarse search over a wide range first and then a fine search over a narrow one. Order of importance (initial value; search range):
- max_depth (5; 3~10) and min_child_weight (1; 1~10)
- gamma (0; 1e-5 ~ 1 ~ 100)
- subsample (0.8; 0.5~1.0) and colsample_bytree (0.8; 0.5~1.0)
- reg_alpha (0; 1e-5 ~ 1 ~ 100) and reg_lambda (1; 0.5, 1, 5, 10); a sketch of this pass follows step 2.3
2.0 Define a helper function for the grid search:
from sklearn.model_selection import GridSearchCV

def xgb_grid_fit(X, y, basic_params, param_grid, scoring='roc_auc', n_jobs=-1, iid=False):
    """Run a grid search over the given hyperparameters for an xgb model."""
    st = time.time()
    grid = GridSearchCV(XGBClassifier(**basic_params), param_grid=param_grid,
                        scoring=scoring, n_jobs=n_jobs, iid=iid)
    grid.fit(X, y)
    et = time.time()
    print("Score: {:.4f}, params {}, elapsed {:.2f}s".format(grid.best_score_, grid.best_params_, et - st))
    return grid.best_params_
2.1 max_depth (5; 3~10) and min_child_weight (1; 1~10)
params_rough = {
    'max_depth': [3, 5, 7, 9],
    'min_child_weight': [2, 4, 6, 8]
}
params_tiny = {
    'max_depth': [2, 3, 4],
    'min_child_weight': [7, 8, 9]
}
# coarse pass first, then rerun with the narrow grid around the coarse optimum
this_params = xgb_grid_fit(train_X, train_y, best_params, params_rough)
this_params = xgb_grid_fit(train_X, train_y, best_params, params_tiny)
best_params.update(**this_params)
2.2 gamma (0; 1e-5 ~ 1 ~ 100)
params_tiny = {
    'gamma': [2 * i / 10.0 + 1.6 for i in range(5)],  # [1.6, 1.8, 2.0, 2.2, 2.4]
}
this_params = xgb_grid_fit(train_X, train_y, best_params, params_tiny)
# Result: score 0.6833, params {'gamma': 2.0}, elapsed 435.08s
best_params.update(**this_params)
2.3 subsample (0.8; 0.5~1.0) and colsample_bytree (0.8; 0.5~1.0)
params_rough = {
    'subsample': [i / 10.0 + 0.6 for i in range(5)],
    'colsample_bytree': [i / 10.0 + 0.6 for i in range(5)],
}
this_params = xgb_grid_fit(train_X, train_y, best_params, params_rough)
# Coarse result: score 0.6835, params {'colsample_bytree': 0.7, 'subsample': 0.8}, elapsed 1007.92s
# Fine result: score 0.6837, params {'colsample_bytree': 0.85, 'subsample': 0.65}, elapsed 387.17s
best_params.update(**this_params)
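The importance list in Step 2 also names reg_alpha and reg_lambda; the same coarse-then-fine pattern applies to them. A sketch using the ranges from that list (the grids here are illustrative, not results from the original run):
# 2.4 reg_alpha (0; 1e-5 ~ 1 ~ 100) and reg_lambda (1; 0.5, 1, 5, 10)
params_rough = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100],
    'reg_lambda': [0.5, 1, 5, 10],
}
this_params = xgb_grid_fit(train_X, train_y, best_params, params_rough)
best_params.update(**this_params)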
Step 3: Re-find the optimal number of iterations
Same procedure as Step 1; alternatively, lower the learning rate (to 0.01, say) and use cv to find the matching number of iterations, at the cost of a considerably longer run.
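A sketch of this final pass, reusing xgb_fit from Step 1 (the 0.01 learning rate follows the text; the 5000 cap is an assumption, just a bound generous enough for early stopping to trim):
# lower the learning rate and let cv pick the matching tree count
best_params['learning_rate'] = 0.01
best_params['n_estimators'] = 5000    # upper bound; early stopping cuts it back

xgb_final = XGBClassifier(**best_params)
xgb_fit(xgb_final, train_X, train_y)  # resets n_estimators via cv + early stopping
best_params['n_estimators'] = xgb_final.n_estimators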