I. Installation
xgboost:
Wheel downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost
Find the wheel matching your Python version and platform, then install it with pip,
e.g. pip install xgboost-0.81-cp27-cp27m-win_amd64.whl
lightgbm:
Can be installed directly with pip install lightgbm
II. Introduction
Both xgboost and lightgbm ship a native API as well as an sklearn-style interface; either one gets the modeling job done, so they are not covered in detail here. The two libraries are called in almost exactly the same way and take almost exactly the same parameters, differing only in a handful of parameter names, so once you can use one, the other follows by analogy. The underlying theory is likewise not repeated here; see: XGBoost、LightGBM的详细对比介绍 (a detailed comparison of XGBoost and LightGBM).
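As a quick illustration of the two interfaces, here is a minimal sketch on synthetic data (xgboost shown; lightgbm mirrors it with lgb.Dataset/lgb.train on the native side and LGBMClassifier on the sklearn side):
import numpy as np
import xgboost as xgb
from xgboost import XGBClassifier

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

# native API: wrap the data in a DMatrix and call xgb.train
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'eta': 0.1, 'max_depth': 5}
bst = xgb.train(params, dtrain, num_boost_round=100)
prob_native = bst.predict(xgb.DMatrix(X))   # predicted probabilities

# sklearn-style interface: same model, fit/predict_proba conventions
clf = XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=5)
clf.fit(X, y)
prob_sklearn = clf.predict_proba(X)[:, 1]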
III. Parameter Tuning
The walkthrough below uses xgboost; apart from a few parameter names, lightgbm is tuned in much the same way, as sketched next.
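For reference, a rough lightgbm equivalent of the initial xgboost parameters used in Step 0, as an assumption about typical usage of the LGBMClassifier wrapper (note the renamed knobs: gamma becomes min_split_gain, and nthread/seed become n_jobs/random_state):
from lightgbm import LGBMClassifier

# sketch of an equivalent lightgbm starting point (names per the sklearn wrapper)
lgb_clf = LGBMClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    num_leaves=31,        # lightgbm-specific: trees are grown leaf-wise
    min_child_weight=1,
    min_split_gain=0,     # xgboost's gamma
    subsample=0.8,
    colsample_bytree=0.8,
    n_jobs=4,             # xgboost's nthread
    random_state=112,     # xgboost's seed
)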
Steps:
Step 0: Set the initial parameters
Start from a reasonable set of initial parameters:
best_params = {
    'learning_rate': 0.1,
    'n_estimators': 1000,
    'max_depth': 5,
    'min_child_weight': 1,
    'gamma': 0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'nthread': 4,
    'scale_pos_weight': 1,
    'seed': 112
}
Step 1: Find an initial number of boosting rounds
Set a relatively large learning rate (0.1) and use xgboost's built-in cv to determine a suitable number of iterations (i.e. the number of trees).
Define a helper function to search for the iteration count:
import time
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn import metrics
from xgboost import XGBClassifier

def xgb_fit(alg, X, y, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    """Fit an xgboost model and report its performance; X and y are pandas objects."""
    if useTrainCV:
        # run xgboost's cross-validation with early stopping to pick the tree count
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(X.values, label=y.values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc',
                          early_stopping_rounds=early_stopping_rounds, verbose_eval=True)
        # print(cvresult)
        alg.set_params(n_estimators=cvresult.shape[0])
    # Fit the algorithm on the data
    alg.fit(X, y, eval_metric='auc')
    # Predict on the training set:
    dtrain_predictions = alg.predict(X)
    dtrain_predprob = alg.predict_proba(X)[:, 1]
    # Print the model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(y.values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(y.values, dtrain_predprob))
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
    plt.show()
# older versions of xgboost's sklearn wrapper have no feature_importances_, but get_fscore() provides the same information
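In more recent xgboost releases the sklearn wrapper does expose feature_importances_ directly, so the same plot can be drawn without get_fscore(); a small alternative, assuming alg has already been fit:
# feature_importances_ is aligned with the training columns (newer xgboost versions)
feat_imp = pd.Series(alg.feature_importances_, index=X.columns).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')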
Find the optimal number of iterations:
xgb_0 = XGBClassifier(**best_params)
st = time.time()
xgb_fit(xgb_0, train_X, train_y)
et = time.time()
print("Best number of iterations: {}, elapsed {:.2f}s".format(xgb_0.n_estimators, et - st))
best_params['n_estimators'] = xgb_0.n_estimators
Step 2: Tune the parameters
Tune the parameters group by group in order of importance with GridSearchCV, doing a coarse search over a wide range first and then a fine search over a narrow one. Order of importance (initial value; search range):
- max_depth (5; 3~10) and min_child_weight (1; 1~10)
- gamma (0; 1e-5 ~ 1 ~ 100)
- subsample (0.8; 0.5~1.0) and colsample_bytree (0.8; 0.5~1.0)
- reg_alpha (0; 1e-5 ~ 1 ~ 100) and reg_lambda (1; 0.5, 1, 5, 10); a sketch of this pass follows step 2.3
2.0 Define a helper function for the grid search:
from sklearn.model_selection import GridSearchCV

def xgb_grid_fit(X, y, basic_params, param_grid, scoring='roc_auc', n_jobs=-1, iid=False):
    """Run a grid search over the given hyperparameters for an xgb model."""
    st = time.time()
    grid = GridSearchCV(XGBClassifier(**basic_params), param_grid=param_grid,
                        scoring=scoring, n_jobs=n_jobs, iid=iid)
    grid.fit(X, y)
    et = time.time()
    print("Score: {:.4f}, params {}, elapsed {:.2f}s".format(grid.best_score_, grid.best_params_, et - st))
    return grid.best_params_
2.1 max_depth (5; 3~10) and min_child_weight (1; 1~10)
params_rough = {
    'max_depth': [3, 5, 7, 9],
    'min_child_weight': [2, 4, 6, 8]
}
params_tiny = {
    'max_depth': [2, 3, 4],
    'min_child_weight': [7, 8, 9]
}
# coarse pass first, then rerun with the narrow grid around the coarse optimum
this_params = xgb_grid_fit(train_X, train_y, best_params, params_rough)
this_params = xgb_grid_fit(train_X, train_y, best_params, params_tiny)
best_params.update(**this_params)
2.2 gamma (0; 1e-5 ~ 1 ~ 100)
params_tiny = {
    'gamma': [2 * i / 10.0 + 1.6 for i in range(5)],  # [1.6, 1.8, 2.0, 2.2, 2.4]
}
this_params = xgb_grid_fit(train_X, train_y, best_params, params_tiny)
# Result: score 0.6833, params {'gamma': 2.0}, elapsed 435.08s
best_params.update(**this_params)
2.3 subsample (0.8; 0.5~1.0) and colsample_bytree (0.8; 0.5~1.0)
params_rough = {
    'subsample': [i / 10.0 + 0.6 for i in range(5)],
    'colsample_bytree': [i / 10.0 + 0.6 for i in range(5)],
}
this_params = xgb_grid_fit(train_X, train_y, best_params, params_rough)
# Coarse result: score 0.6835, params {'colsample_bytree': 0.7, 'subsample': 0.8}, elapsed 1007.92s
# Fine result: score 0.6837, params {'colsample_bytree': 0.85, 'subsample': 0.65}, elapsed 387.17s
best_params.update(**this_params)
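The importance list in Step 2 also names reg_alpha and reg_lambda; the same coarse-then-fine pattern applies to them. A sketch using the ranges from that list (the grids here are illustrative, not results from the original run):
# 2.4 reg_alpha (0; 1e-5 ~ 1 ~ 100) and reg_lambda (1; 0.5, 1, 5, 10)
params_rough = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100],
    'reg_lambda': [0.5, 1, 5, 10],
}
this_params = xgb_grid_fit(train_X, train_y, best_params, params_rough)
best_params.update(**this_params)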
Step 3: Re-find the optimal number of iterations
Same procedure as Step 1; alternatively, lower the learning rate (to 0.01, say) and use cv to find the matching number of iterations, at the cost of a considerably longer run.
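A sketch of this final pass, reusing xgb_fit from Step 1 (the 0.01 learning rate follows the text; the 5000 cap is an assumption, just a bound generous enough for early stopping to trim):
# lower the learning rate and let cv pick the matching tree count
best_params['learning_rate'] = 0.01
best_params['n_estimators'] = 5000    # upper bound; early stopping cuts it back

xgb_final = XGBClassifier(**best_params)
xgb_fit(xgb_final, train_X, train_y)  # resets n_estimators via cv + early stopping
best_params['n_estimators'] = xgb_final.n_estimators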