XGBoost in Practice: Hands-On Use and Tuning

Author: 法相

Article link: https://blog.csdn.net/weixin_38569817/article/details/76354004

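First, neither Python nor Anaconda ships with xgboost. Installing xgboost on Windows is straightforward; an earlier article gives the download address and detailed installation steps.

You can check whether the installation succeeded by running import xgboost in Python.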
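
The check can also be scripted. The sketch below is only an illustration: the helper name is_installed is made up here, and it relies on the standard-library importlib.util.find_spec, which looks a module up without importing it.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Report the packages this article relies on.
for name in ("xgboost", "sklearn", "numpy"):
    status = "installed" if is_installed(name) else "missing"
    print("{0}: {1}".format(name, status))
```

If xgboost is reported as missing, the import in the code below will fail with ImportError.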

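Now we move on to using xgboost in practice and checking its results. In the earlier article "Decision Tree Model Tuning" (《决策树模型调优》), we used PCA, feature selection, and simple feature combinations to raise the decision tree model's accuracy in predicting ads from an initial 93.3% to 95.6%. Can it be improved further? The answer is yes. Let's see what xgboost can do.

We do not paste the data-processing code here; the data is the same as that used for the decision tree model's predictions.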

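import numpy as np
from xgboost.sklearn import XGBClassifier
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in sklearn.model_selection
from sklearn.model_selection import GridSearchCV

# X and y are the features and labels prepared in the earlier decision tree article
params = {
    'learning_rate':0.1,
    'n_estimators':500,
    'max_depth':5,
    'min_child_weight':1,
    'gamma':0,
    'subsample':0.8,
    'colsample_bytree':0.8,
    'objective':'binary:logistic',
    # When the classes are highly imbalanced, setting this to a positive value can help the algorithm converge faster
    'scale_pos_weight':1
}
clf = XGBClassifier(**params)
grid_params = {
    'learning_rate':np.linspace(0.01,0.2,20)  # best value 0.01, Accuracy: 96.4%
}
grid = GridSearchCV(clf,grid_params)
grid.fit(X,y)
print(grid.best_params_)
# best_score_ is the mean cross-validated score of the best candidate
print("Accuracy:{0:.1f}%".format(100*grid.best_score_))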
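
The comment on scale_pos_weight mentions class imbalance without saying which value to try. A common starting point, suggested in the XGBoost parameter documentation, is the ratio of negative to positive examples. The sketch below computes that ratio for 0/1 labels; the function name and the toy labels are invented for illustration and are not the article's data.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) examples in a list of labels."""
    positives = sum(1 for label in labels if label == 1)
    negatives = sum(1 for label in labels if label == 0)
    if positives == 0:
        raise ValueError("labels contain no positive examples")
    return negatives / positives


# Made-up labels: 90 negatives and 10 positives.
toy_labels = [0] * 90 + [1] * 10
print(suggested_scale_pos_weight(toy_labels))  # 9.0
```

With roughly balanced classes the ratio is close to 1, which matches the value used in params above.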

By tuning learning_rate alone, we raised accuracy from 95.6% to 96.4%. Next, we fix learning_rate=0.01 and tune n_estimators.

params = {
    'learning_rate':0.01,
    'n_estimators':500,
    'max_depth':5,
    'min_child_weight':1,
    'gamma':0,
    'subsample':0.8,
    'colsample_bytree':0.8,
    'objective':'binary:logistic',
    # When the classes are highly imbalanced, setting this to a positive value can help the algorithm converge faster
    'scale_pos_weight':1
}
clf = XGBClassifier(**params)
grid_params = {
    # 'learning_rate':np.linspace(0.01,0.2,20),  # best value 0.01, Accuracy: 96.4%
     'n_estimators':list(range(100,601,100)),  # best value 500, Accuracy: 96.4%
}
# Fit GridSearchCV(clf, grid_params) on X, y and print the result as in the first block
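
Why n_estimators is tuned right after learning_rate can be seen from a deliberately idealized model. Assume, purely for illustration, that every boosting round removes a fraction learning_rate of whatever error remains, so the remaining fraction after n rounds is (1 - learning_rate) ** n. Real trees do not behave this cleanly, so the numbers are only a rough intuition, and the function name is made up.

```python
import math


def rounds_to_reach(learning_rate, tolerance):
    """Smallest n with (1 - learning_rate) ** n <= tolerance (idealized model)."""
    if not 0.0 < learning_rate < 1.0:
        raise ValueError("learning_rate must be strictly between 0 and 1")
    if not 0.0 < tolerance < 1.0:
        raise ValueError("tolerance must be strictly between 0 and 1")
    return math.ceil(math.log(tolerance) / math.log(1.0 - learning_rate))


# Rounds needed to shrink the remaining error to 1% of its starting value.
for eta in (0.1, 0.01):
    print(eta, rounds_to_reach(eta, tolerance=0.01))
# 0.1 44
# 0.01 459
```

Under this assumption a tenfold smaller learning rate needs roughly ten times as many rounds, which is why a small learning_rate is usually paired with a large n_estimators.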

Accuracy did not improve this time, so we continue tuning other parameters.

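params = {
    'learning_rate':0.01,
    'n_estimators':500,
    'max_depth':5,
    'min_child_weight':1,
    'gamma':0,
    'subsample':0.8,
    'colsample_bytree':0.8,
    'objective':'binary:logistic',
    # When the classes are highly imbalanced, setting this to a positive value can help the algorithm converge faster
    'scale_pos_weight':1
}
clf = XGBClassifier(**params)
grid_params = {
    # 'learning_rate':np.linspace(0.01,0.2,20),  # best value 0.01, Accuracy: 96.4%
    #  'n_estimators':list(range(100,601,100)),  # best value 500, Accuracy: 96.4%
    # Tune these two parameters first because they strongly affect the final result.
    # Start with a coarse search over a wide range, then fine-tune over a narrow one.
    'max_depth':list(range(3,15,1)),
    # list(1,6,2) raises TypeError; range() is needed to generate 1, 3, 5
    'min_child_weight':list(range(1,6,2)),     # best values {'max_depth':12,'min_child_weight':1}, Accuracy: 96.5%
}
# Fit GridSearchCV(clf, grid_params) on X, y and print the result as in the first block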
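
Searching two parameters at once means every combination is tried, so the cost grows multiplicatively. The sketch below enumerates the grid with itertools.product, assuming min_child_weight is meant to cover 1, 3, and 5. The helper name expand_grid is made up, and the fold count of 5 is an assumption: the real number depends on the cv argument and the scikit-learn version.

```python
from itertools import product


def expand_grid(grid):
    """Yield one dict per parameter combination in the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


grid_params = {
    "max_depth": list(range(3, 15, 1)),
    "min_child_weight": list(range(1, 6, 2)),
}
candidates = list(expand_grid(grid_params))
n_folds = 5  # assumed number of cross-validation folds

print(len(candidates))            # 36
print(len(candidates) * n_folds)  # 180
print(candidates[0])              # {'max_depth': 3, 'min_child_weight': 1}
```

Each of those fits trains 500 trees here, which is one reason to tune parameters in small groups rather than all together.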

This time accuracy improved by 0.1 percentage points.
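
The coarse-then-fine strategy mentioned in the code comment can be sketched as a small helper that builds a narrower candidate list around a coarse winner. Everything here is hypothetical, including the function name and the step sizes; it only illustrates the idea and is not part of the original workflow.

```python
def refine_around(best, step, n_each_side=2, lower=None, upper=None):
    """Build a finer candidate list centred on a coarse-search winner."""
    values = [round(best + i * step, 4)
              for i in range(-n_each_side, n_each_side + 1)]
    if lower is not None:
        values = [v for v in values if v >= lower]
    if upper is not None:
        values = [v for v in values if v <= upper]
    return values


# The coarse min_child_weight grid 1, 3, 5 skipped 2; look next to the winner 1.
print(refine_around(1, 1, n_each_side=1, lower=1))   # [1, 2]

# learning_rate won at the edge of its range, so also look on both sides of 0.01.
print(refine_around(0.01, 0.0025, lower=0.001))
# [0.005, 0.0075, 0.01, 0.0125, 0.015]
```

A winner that sits on the boundary of the searched range, as learning_rate=0.01 did, is a hint that the range itself may need extending.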

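params = {
    'learning_rate':0.01,
    'n_estimators':500,
    'max_depth':12,
    'min_child_weight':1,
    'gamma':0,
    'subsample':0.8,
    'colsample_bytree':0.8,
    'objective':'binary:logistic',
    # When the classes are highly imbalanced, setting this to a positive value can help the algorithm converge faster
    'scale_pos_weight':1
}
clf = XGBClassifier(**params)
grid_params = {
    # 'learning_rate':np.linspace(0.01,0.2,20),  # best value 0.01, Accuracy: 96.4%
    #  'n_estimators':list(range(100,601,100)),  # best value 500, Accuracy: 96.4%
    # Tune these two parameters first because they strongly affect the final result.
    # Start with a coarse search over a wide range, then fine-tune over a narrow one.
    # 'max_depth':list(range(3,15,1)),
    # 'min_child_weight':list(range(1,6,2)),     # best values {'max_depth':12,'min_child_weight':1}, Accuracy: 96.5%
    # A node is split only if the split lowers the loss. gamma sets the minimum loss reduction required to split a node.
    # The larger the value, the more conservative the algorithm. A suitable value depends on the loss function, so it needs tuning.
    'gamma':[i/10.0 for i in range(0,5)],      # best value 0, Accuracy: 96.5%
}
# Fit GridSearchCV(clf, grid_params) on X, y and print the result as in the first block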
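
The role of gamma is easiest to see in the split-gain formula from the XGBoost paper and the "Introduction to Boosted Trees" tutorial: the gain of a split is half the improvement in the structure score, minus gamma. The sketch below evaluates that formula for made-up gradient and Hessian sums; it is an illustration of the formula, not the library's code, and the function name is invented.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node into left and right children.

    g_* are sums of first-order gradients, h_* sums of second-order gradients.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    improvement = 0.5 * (score(g_left, h_left)
                         + score(g_right, h_right)
                         - score(g_left + g_right, h_left + h_right))
    return improvement - gamma


# Made-up statistics for one candidate split.
print(split_gain(-4.0, 2.0, 3.0, 2.0, gamma=0.0) > 0)  # True: worth splitting
print(split_gain(-4.0, 2.0, 3.0, 2.0, gamma=5.0) > 0)  # False: gamma blocks it
```

A larger gamma therefore rejects more candidate splits, which is the sense in which it makes the model more conservative.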

Accuracy did not improve this time. Have we reached the ceiling? Let's keep trying.

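params = {
    'learning_rate':0.01,
    'n_estimators':500,
    'max_depth':12,
    'min_child_weight':1,
    'gamma':0,
    'subsample':0.8,
    'colsample_bytree':0.8,
    'objective':'binary:logistic',
    # When the classes are highly imbalanced, setting this to a positive value can help the algorithm converge faster
    'scale_pos_weight':1
}
clf = XGBClassifier(**params)
grid_params = {
    # 'learning_rate':np.linspace(0.01,0.2,20),  # best value 0.01, Accuracy: 96.4%
    #  'n_estimators':list(range(100,601,100)),  # best value 500, Accuracy: 96.4%
    # Tune these two parameters first because they strongly affect the final result.
    # Start with a coarse search over a wide range, then fine-tune over a narrow one.
    # 'max_depth':list(range(3,15,1)),
    # 'min_child_weight':list(range(1,6,2)),     # best values {'max_depth':12,'min_child_weight':1}, Accuracy: 96.5%
    # A node is split only if the split lowers the loss. gamma sets the minimum loss reduction required to split a node.
    # The larger the value, the more conservative the algorithm. A suitable value depends on the loss function, so it needs tuning.
    # 'gamma':[i/10.0 for i in range(0,5)],      # best value 0, Accuracy: 96.5%
    'subsample':[i/10.0 for i in range(6,10)],
    'colsample_bytree':[i/10.0 for i in range(6,10)],
    # best values {'subsample':0.8,'colsample_bytree':0.7}, Accuracy: 96.6%
}
# Fit GridSearchCV(clf, grid_params) on X, y and print the result as in the first block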
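
subsample and colsample_bytree control how much of the data each tree sees: a fraction of the rows and a fraction of the columns are drawn for every tree. The sketch below mimics that with the standard-library random module. It is only an illustration of the idea, since XGBoost's internal sampling differs in detail, and the function name is made up.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, rng):
    """Pick the row and column indices one tree would be trained on."""
    n_row_pick = max(1, int(round(n_rows * subsample)))
    n_col_pick = max(1, int(round(n_cols * colsample_bytree)))
    rows = sorted(rng.sample(range(n_rows), n_row_pick))
    cols = sorted(rng.sample(range(n_cols), n_col_pick))
    return rows, cols


rng = random.Random(0)  # fixed seed so the draw is repeatable
rows, cols = sample_rows_and_columns(10, 10, 0.8, 0.7, rng)
print(len(rows), len(cols))  # 8 7
```

Because every tree sees a different slice of the data, the trees are less correlated with each other, which tends to reduce overfitting.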

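Accuracy improved by another 0.1 percentage points. Next we tune the two regularization parameters: reg_alpha, the L1 term, and reg_lambda, the L2 term.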

params = {
    'learning_rate':0.01,
    'n_estimators':500,
    'max_depth':12,
    'min_child_weight':1,
    'gamma':0,
    'subsample':0.8,
    'colsample_bytree':0.7,
    'objective':'binary:logistic',
    # When the classes are highly imbalanced, setting this to a positive value can help the algorithm converge faster
    'scale_pos_weight':1
}
clf = XGBClassifier(**params)
grid_params = {
    # 'learning_rate':np.linspace(0.01,0.2,20),  # best value 0.01, Accuracy: 96.4%
    #  'n_estimators':list(range(100,601,100)),  # best value 500, Accuracy: 96.4%
    # Tune these two parameters first because they strongly affect the final result.
    # Start with a coarse search over a wide range, then fine-tune over a narrow one.
    # 'max_depth':list(range(3,15,1)),
    # 'min_child_weight':list(range(1,6,2)),     # best values {'max_depth':12,'min_child_weight':1}, Accuracy: 96.5%
    # A node is split only if the split lowers the loss. gamma sets the minimum loss reduction required to split a node.
    # The larger the value, the more conservative the algorithm. A suitable value depends on the loss function, so it needs tuning.
    # 'gamma':[i/10.0 for i in range(0,5)],      # best value 0, Accuracy: 96.5%
    # 'subsample':[i/10.0 for i in range(6,10)],
    # 'colsample_bytree':[i/10.0 for i in range(6,10)],
    # best values {'subsample':0.8,'colsample_bytree':0.7}, Accuracy: 96.6%
    'reg_alpha':np.linspace(0,0.05,5),
    'reg_lambda':np.linspace(0,0.05,5)   # best values {'reg_alpha':0,'reg_lambda':0.0125}, Accuracy: 96.6%
    # The gains from tuning have become very small. A qualitative improvement needs other techniques,
    # such as feature engineering, model ensembling, and stacking.
}
# Fit GridSearchCV(clf, grid_params) on X, y and print the result as in the first block
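
How reg_alpha and reg_lambda act on the model can be illustrated with the optimal leaf weight. With gradient sum G and Hessian sum H in a leaf, the L2 term enlarges the denominator and the L1 term soft-thresholds the numerator. The sketch below follows that formula as I understand it from the XGBoost paper and source; the function name and the numbers are made up for illustration.

```python
def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Optimal leaf weight under L1 (reg_alpha) and L2 (reg_lambda) penalties."""
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        return 0.0  # L1 penalty zeroes out leaves with a weak gradient signal
    return -shrunk / (hess_sum + reg_lambda)


print(leaf_weight(-4.0, 3.0, reg_alpha=0.0, reg_lambda=1.0))  # 1.0
print(leaf_weight(-4.0, 3.0, reg_alpha=1.0, reg_lambda=1.0))  # 0.75
print(leaf_weight(-0.5, 3.0, reg_alpha=1.0, reg_lambda=1.0))  # 0.0
```

Both penalties pull leaf weights toward zero, so on a search range as small as 0 to 0.05 it is not surprising that they barely changed the accuracy.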