XGBoost Parameter Tuning in Practice

Rover Ramble

Article link: https://blog.csdn.net/rover2002/article/details/105514715

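XGBoost tuning notes

Reference: complete-guide-parameter-tuning-xgboost

github code: Parameter_Tuning_XGBoost_with_Example

Chinese translation

Taking binary classification as the example, the main idea is to tune the parameters with grid search.

Data source:

The data here is taken from the Data Hackathon3.x: http://datahack.analyticsvidhya.com/contest/data-hackathon-3x
If you cannot find a suitable binary classification dataset, you can use the one provided by from sklearn.datasets import make_hastie_10_2.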
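
As a stand-in when the hackathon data is not available, the sketch below loads make_hastie_10_2 into a DataFrame. The column names f0 to f9, the split sizes, and the reuse of 'Disbursed' as the label name are assumptions made here so that the rest of the post reads the same; make_hastie_10_2 returns labels in {-1, +1}, so they are mapped to {0, 1} for binary:logistic.

```python
import pandas as pd
from sklearn.datasets import make_hastie_10_2

# Synthetic binary classification data: 10 Gaussian features, labels in {-1, +1}
X, y = make_hastie_10_2(n_samples=12000, random_state=27)

# Hypothetical column names; 'Disbursed' mirrors the label used in this post
predictors = ["f%d" % i for i in range(X.shape[1])]
target = "Disbursed"

data = pd.DataFrame(X, columns=predictors)
data[target] = (y > 0).astype(int)  # map {-1, +1} to {0, 1}

# Simple hold-out split (sizes chosen arbitrarily for illustration)
train = data.iloc[:10000]
test = data.iloc[10000:]
print(train.shape, test.shape)
```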

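Feature engineering

Count the missing values per column and fill them with the median:

data.apply(lambda x: sum(x.isnull()))
data['Amount_Applied'] = data['Amount_Applied'].fillna(data['Amount_Applied'].median())
Drop features with too many distinct values, such as City. Check the count with len(data['City'].unique()).
Numeric features can be discretized by binning with pd.cut.
Encode categorical features such as Device_Type and Gender with LabelEncoder, then one-hot encode them with get_dummies.
Apply any other feature transformations, drops, and combinations as needed.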
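
The steps above can be sketched on a tiny made-up frame. The values, the extra Age column, and the bin edges are assumptions for illustration only; the column names Amount_Applied, City, and Gender are borrowed from the text.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that borrows column names from the text above
data = pd.DataFrame({
    "Amount_Applied": [1000.0, np.nan, 3000.0, 5000.0],
    "City": ["A", "B", "C", "D"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [22, 35, 47, 61],  # hypothetical numeric feature
})

# Count missing values per column
print(data.isnull().sum())

# Fill missing values with the column median
data["Amount_Applied"] = data["Amount_Applied"].fillna(data["Amount_Applied"].median())

# Check the cardinality of City, then drop it
print(data["City"].nunique())
data = data.drop(columns=["City"])

# Bin a numeric feature; the bin edges here are arbitrary
data["Age_Bin"] = pd.cut(data["Age"], bins=[0, 30, 50, 100],
                         labels=["young", "middle", "senior"])

# One-hot encode the categorical columns
data = pd.get_dummies(data, columns=["Gender", "Age_Bin"])
print(data.columns.tolist())
```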

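Note that I import XGBoost in two forms:

xgb - the xgboost package itself. Its "cv" function is used below.
XGBClassifier - the sklearn wrapper for xgboost. It lets us use Grid Search and parallel processing in the same way as with GBM.

Before going further, let's define a function that helps us build XGBoost models and run cross-validation. The good news is that you can use the function below as it is, and reuse it later for your own models.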

Define a function for modeling and cross-validation

This function will do the following:

fit the model
determine training accuracy
determine training AUC
determine testing AUC
update n_estimators with cv function of xgboost package
plot Feature Importance

Note: replace xgb1.booster() with xgb1.get_booster().

import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn import metrics

# `target` is the module-level name of the label column ('Disbursed' in this dataset).
# test_results = pd.read_csv('test_results.csv')
def modelfit(alg, dtrain, dtest, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:
        # Use xgb.cv with early stopping to choose the number of boosting rounds
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        xgtest = xgb.DMatrix(dtest[predictors].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data.
    # Recent xgboost releases no longer accept eval_metric in fit(); set it on the estimator instead.
    alg.set_params(eval_metric='auc')
    alg.fit(dtrain[predictors], dtrain[target])

    # Predict on the training set
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]

    # Print the model report
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain[target], dtrain_predprob))

#     Predict on testing data:
#     dtest['predprob'] = alg.predict_proba(dtest[predictors])[:,1]
#     results = test_results.merge(dtest[['ID','predprob']], on='ID')
#     print('AUC Score (Test): %f' % metrics.roc_auc_score(results['Disbursed'], results['predprob']))

    # Plot feature importance (number of times each feature is used in a split)
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances', figsize=(14, 4))
    plt.ylabel('Feature Importance Score')

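XGBoost feature importance

Note: older releases of the xgboost sklearn wrapper did not expose a feature_importances_ attribute, but the booster's get_fscore() function serves the same purpose. Recent releases also provide feature_importances_ on XGBClassifier.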

predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
        learning_rate =0.1,
        n_estimators=1000,
        max_depth=5,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective= 'binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27)
modelfit(xgb1, train, test, predictors)

Calling modelfit() as above plots the feature importances as a bar chart.
[Figure: feature importance bar chart, image not included]

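Tuning the parameters step by step

step1: Load the data and set the initial parameters

max_depth = 5: maximum depth of a tree (default 6). Values between 3 and 10 are a good starting range.
min_child_weight = 1: minimum sum of instance weights required in a leaf node (default 1). For a highly imbalanced classification problem, some leaf nodes contain small groups, so choose a small value.
gamma = 0: minimum loss reduction required to split a leaf node further (default 0). A small starting value such as 0.1 to 0.2 also works. This parameter is tuned later as well.
subsample, colsample_bytree = 0.8: row sampling and column sampling ratios. Typical values lie between 0.5 and 0.9.
scale_pos_weight = 1: default 1. Use a larger positive value when the classes are highly imbalanced.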
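
For imbalanced data, the XGBoost parameter documentation suggests the ratio of negative to positive instances as a typical value to consider for scale_pos_weight. A minimal sketch of that heuristic follows; the helper name and the toy labels are made up for illustration.

```python
def suggested_scale_pos_weight(labels):
    """Return negatives / positives for a sequence of 0/1 labels."""
    positives = sum(1 for v in labels if v == 1)
    negatives = sum(1 for v in labels if v == 0)
    if positives == 0:
        raise ValueError("no positive samples")
    return negatives / positives

# Toy labels: 95 negatives and 5 positives
labels = [0] * 95 + [1] * 5
print(suggested_scale_pos_weight(labels))  # 19.0
```

This is only a starting point; the value can be refined with the same grid search procedure used for the other parameters.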

step2: Grid search

Imports

import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import cross_validate,GridSearchCV
from sklearn import metrics

Keep the parameters that have already been tuned fixed, and search the candidates in param_testN to obtain best_params_.

param_test3 = {
    'subsample':[i/10.0 for i in range(6,10)],
    'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=150, max_depth=5,
                                        min_child_weight=10, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                        objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
                       # the iid argument was removed from GridSearchCV in recent scikit-learn releases
                       param_grid = param_test3, scoring='roc_auc',n_jobs=4, cv=5)
gsearch3.fit(train[predictors], train[target])
gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_

The current best values can be read from the output:

  ......
  
  'mean_test_score': array([0.83749862, 0.83905418, 0.83884314, 0.83902026, 0.83814523,
         0.83788037, 0.83866718, 0.84048344, 0.83756248, 0.83808899,
         0.84111194, 0.83955118, 0.83635295, 0.83623314, 0.83893753,
         0.83788233]),
  'std_test_score': array([0.00867607, 0.00580994, 0.0078614 , 0.00751303, 0.00582818,
         0.00723515, 0.00681313, 0.00620886, 0.00815349, 0.00798908,
         0.00638452, 0.00579401, 0.00875388, 0.00561627, 0.00631244,
         0.00675911]),
  'rank_test_score': array([14,  4,  7,  5,  9, 12,  8,  2, 13, 10,  1,  3, 15, 16,  6, 11])},
 {'colsample_bytree': 0.8, 'subsample': 0.8},
 0.8411119400279029)

The results show:

colsample_bytree=0.8 and subsample=0.8 give the best score.
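
To see how a position in mean_test_score maps back to a parameter combination, the sketch below rebuilds the 16 candidates with itertools. It assumes the candidates are enumerated with keys sorted by name and the last key varying fastest, which is the ordering scikit-learn's grid is understood to use; the scores are copied from the output above.

```python
from itertools import product

param_test3 = {
    "subsample": [i / 10.0 for i in range(6, 10)],
    "colsample_bytree": [i / 10.0 for i in range(6, 10)],
}

# mean_test_score values copied from the output above
mean_test_score = [
    0.83749862, 0.83905418, 0.83884314, 0.83902026, 0.83814523,
    0.83788037, 0.83866718, 0.84048344, 0.83756248, 0.83808899,
    0.84111194, 0.83955118, 0.83635295, 0.83623314, 0.83893753,
    0.83788233,
]

# Assumed ordering: keys sorted by name, last key varying fastest
keys = sorted(param_test3)
candidates = [dict(zip(keys, values))
              for values in product(*(param_test3[k] for k in keys))]

# Index of the highest mean score, and the combination it belongs to
best_index = max(range(len(mean_test_score)), key=mean_test_score.__getitem__)
print(best_index, candidates[best_index])
```

In practice gsearch3.cv_results_['params'] holds the candidate list directly, so the manual reconstruction is only needed when reading a pasted output like the one above.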

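step3: Continue with the remaining parameters, finding the best value of each in turn

Main parameters tuned

n_estimators (int): Number of gradient boosted trees. Equivalent to number of boosting rounds. In other words, the total number of iterations, which is also the number of decision trees.
max_depth and min_child_weight have a large influence on the final result.

max_depth: maximum depth of a tree. The larger the value, the more complex the tree. It is used to control overfitting; typical values are 3 to 10, and the default is 6.
min_child_weight: minimum sum of instance weights required in a leaf node. It is used to avoid overfitting. A larger value keeps the model from learning patterns specific to a few local samples, but a value that is too high leads to underfitting.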
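
The XGBoost documentation describes min_child_weight as the minimum sum of instance weight (hessian) needed in a child. For binary:logistic with unit sample weights, each sample contributes p * (1 - p), where p is its current predicted probability. The sketch below illustrates that check under those assumptions; the function names and the probabilities are made up.

```python
def hessian_sum(probabilities):
    """Sum of logistic-loss hessians p * (1 - p) over the samples in a node."""
    return sum(p * (1.0 - p) for p in probabilities)

def child_allowed(probabilities, min_child_weight):
    """A child is kept only if its hessian sum reaches min_child_weight."""
    return hessian_sum(probabilities) >= min_child_weight

uncertain_node = [0.5] * 8    # each sample contributes 0.25
confident_node = [0.99] * 8   # each sample contributes about 0.0099

print(hessian_sum(uncertain_node), child_allowed(uncertain_node, 1))
print(hessian_sum(confident_node), child_allowed(confident_node, 1))
```

Nodes whose samples are already predicted confidently carry little hessian mass, which is why a high min_child_weight stops the tree from carving out small, nearly pure groups.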

gamma (default=0, alias: min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree. A node is split only if the split reduces the loss function by at least gamma; gamma therefore sets the minimum loss reduction required for a split.
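
The role of gamma can be seen in the split gain formula from the XGBoost documentation ("Introduction to Boosted Trees"), where the gain of a split is computed from the gradient and hessian sums of the two children and gamma is subtracted at the end. The sketch below is a plain-Python illustration with made-up numbers, not the library's implementation.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a split from gradient sums (g) and hessian sums (h) of the two children."""
    def score(g, h):
        # Structure score of a leaf with L2 regularization reg_lambda
        return g * g / (h + reg_lambda)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Same candidate split evaluated with and without a gamma penalty
gain_without_penalty = split_gain(-4.0, 2.0, 3.0, 2.0, gamma=0.0)
gain_with_penalty = split_gain(-4.0, 2.0, 3.0, 2.0, gamma=5.0)
print(gain_without_penalty, gain_with_penalty)
print("split kept:", gain_with_penalty > 0)
```

With gamma=0 the split is worth making; with gamma=5 the same split has negative gain and would be pruned.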
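subsample and colsample_bytree. subsample is the sampling ratio for training rows: if it is set to 0.5, XGBoost randomly selects half of the samples as the training set. colsample_bytree is the column sampling ratio (that is, the feature sampling ratio) applied when constructing each tree.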
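
As an illustration of the idea only (this is not how xgboost implements sampling internally), the sketch below draws a fresh set of row and column indices for each tree. The helper name, the seed, and the sizes are made up.

```python
import random

def sample_indices(n_items, rate, rng):
    """Pick round(n_items * rate) distinct indices without replacement."""
    k = max(1, round(n_items * rate))
    return sorted(rng.sample(range(n_items), k))

rng = random.Random(27)
n_rows, n_cols = 10, 5

# One draw per tree: rows follow subsample, columns follow colsample_bytree
for tree in range(3):
    rows = sample_indices(n_rows, 0.5, rng)
    cols = sample_indices(n_cols, 0.8, rng)
    print(tree, rows, cols)
```

Because each tree sees a different subset, the trees are less correlated, which is the reason these two ratios act as regularizers.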
Regularization parameters