XGBoost Parameter Tuning: A Concise Guide

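On further testing, however, validating with various metrics (RMSE, for example) gave results identical to those of the best model, not to those of the model from the last round. Digging into the source code explained why: when predict is called without an explicit ntree_limit, XGBoost (in the release examined here) falls back to best_ntree_limit, so prediction still uses the best model.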
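The relationship between the last round and the best round can be sketched in plain Python. This is only an illustration of the early-stopping bookkeeping, not XGBoost's own code: the function name `best_round`, the RMSE values, and the patience of 3 rounds are all made up for the example, and it assumes one tree per boosting round so that the number of trees used for prediction is `best_iteration + 1`.

```python
def best_round(eval_history, early_stopping_rounds):
    """Return (best_iteration, last_iteration) for a metric where lower is better."""
    best_iteration = 0
    best_score = float("inf")
    last_iteration = -1
    for i, score in enumerate(eval_history):
        last_iteration = i
        if score < best_score:
            best_score, best_iteration = score, i
        elif i - best_iteration >= early_stopping_rounds:
            break  # no improvement for `early_stopping_rounds` rounds: stop training
    return best_iteration, last_iteration


# Hypothetical validation RMSE per boosting round
rmse_per_round = [0.90, 0.70, 0.62, 0.60, 0.61, 0.63, 0.64, 0.66]
best_iteration, last_iteration = best_round(rmse_per_round, early_stopping_rounds=3)

# Assuming one tree per round, prediction limited to the best model uses this many trees
trees_used_for_prediction = best_iteration + 1
print(best_iteration, last_iteration, trees_used_for_prediction)
```

Training stops a few rounds after the best one, which is why the last-round model and the best model differ, and why a default tree limit tied to the best round reproduces the best model's metrics.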

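II. Tuning directions and goals

1. Overfitting

Directly adjust the parameters that control model complexity:
- max_depth --> lower
- min_child_weight --> raise
- gamma --> raise
Add randomness so that training is robust to noise:
- subsample --> lower
- colsample_bytree --> lower
- eta and num_round --> lower eta, raise num_round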
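The directions listed above can be written down as a before/after pair of parameter dicts. The baseline uses XGBoost's documented defaults; the adjusted values and both num_round figures are arbitrary choices made only to show which way each parameter moves, not recommended settings.

```python
# Baseline: XGBoost's documented default values for these parameters
baseline = {
    "max_depth": 6,
    "min_child_weight": 1,
    "gamma": 0.0,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
    "eta": 0.3,
}
baseline_num_round = 100  # hypothetical starting point

# Adjusted: illustrative values that move each parameter in the anti-overfitting direction
regularized = dict(baseline)
regularized.update({
    "max_depth": 4,           # lower: shallower trees
    "min_child_weight": 5,    # higher: a leaf needs more instance weight
    "gamma": 0.2,             # higher: a split needs a larger loss reduction
    "subsample": 0.8,         # lower: each tree sees a random 80% of the rows
    "colsample_bytree": 0.8,  # lower: each tree sees a random 80% of the columns
    "eta": 0.05,              # lower learning rate ...
})
regularized_num_round = 600   # ... compensated by more boosting rounds
```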

2. Training speed

Set tree_method to hist or gpu_hist to speed up computation.
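As a configuration sketch, switching the tree construction algorithm is a one-line change in the parameter dict. Only tree_method comes from the text above; the objective is an illustrative assumption. Recent XGBoost releases (2.0 and later, to the best of my knowledge) deprecate gpu_hist in favor of tree_method="hist" combined with device="cuda", so check which form the installed version expects.

```python
# Illustrative parameter dicts; only tree_method is taken from the text above
cpu_params = {"objective": "binary:logistic", "tree_method": "hist"}

# GPU variant as named in the text (the spelling used by older releases)
gpu_params = dict(cpu_params, tree_method="gpu_hist")
```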

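3. Imbalanced positive and negative samples

If you care about overall ranking quality (AUC):
- set the positive-class weight scale_pos_weight
- use AUC as the evaluation metric
If you care about predicting the right probability (the dataset cannot be re-balanced in this case):
- set max_delta_step to a value between 1 and 10 to help convergence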
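For the scale_pos_weight option above, the XGBoost documentation suggests sum(negative instances) / sum(positive instances) as a typical value. A minimal sketch of that computation follows; the helper name and the 95/5 label split are made up for the example, and the surrounding parameter dict is only illustrative.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive samples for binary labels in {0, 1}."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive samples: the ratio is undefined")
    return negatives / positives


# Hypothetical imbalanced label vector: 95 negatives, 5 positives
labels = [0] * 95 + [1] * 5
weight = scale_pos_weight(labels)

# Illustrative parameters for the AUC-oriented case
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": weight,
}
print(weight)
```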

III. Tuning methods

Common ways to search the hyperparameters of ensemble models:

1. sklearn.model_selection.GridSearchCV (grid search)

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')

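import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import GridSearchCV


# Load the training data
traindata = pd.read_csv("/traindata_4_3.txt", sep=',')
traindata = traindata.set_index('instance_id')
trainlabel = traindata['is_trade']
del traindata['is_trade']
print(traindata.shape, trainlabel.shape)


# Use XGBoost as the classifier
clf1 = xgb.XGBClassifier()

# Search ranges for the grid search; only the six main XGBoost parameters are searched.
# Note: this grid has 30*13*20*20*10*8 = 12,480,000 combinations, each fitted cv times,
# so shrink the ranges before running it for real.
# Note: XGBoost documents learning_rate (eta) as lying in [0, 1]; the upper end of this range exceeds that.
param_dist = {
    'n_estimators': range(80, 200, 4),
    'max_depth': range(2, 15, 1),
    'learning_rate': np.linspace(0.01, 2, 20),
    'subsample': np.linspace(0.7, 0.9, 20),
    'colsample_bytree': np.linspace(0.5, 0.98, 10),
    'min_child_weight': range(1, 9, 1)
}


# GridSearchCV arguments:
# clf1: the estimator to train
# param_dist: dict holding the search ranges
# scoring='neg_log_loss': evaluate with the negative log loss
# n_jobs=-1: use all CPUs (the default uses one)
# GridSearchCV tries every combination exhaustively and has no n_iter argument;
# n_iter belongs to RandomizedSearchCV.
grid = GridSearchCV(clf1, param_dist, cv=3, scoring='neg_log_loss', n_jobs=-1)

# Fit on the training set
grid.fit(traindata.values, np.ravel(trainlabel.values))
# Retrieve the best estimator
best_estimator = grid.best_estimator_
print(best_estimator)
# Print the best estimator's cross-validated score
print(grid.best_score_)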



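# Custom log-loss metric
# ==============================================================================
import numpy as np
from sklearn.metrics import make_scorer

def logloss(act, pred):
    # Clip predictions away from 0 and 1 so that log() stays finite
    epsilon = 1e-15
    act = np.asarray(act, dtype=float)
    pred = np.clip(pred, epsilon, 1 - epsilon)
    ll = np.sum(act * np.log(pred) + (1 - act) * np.log(1 - pred))
    return -ll / len(act)

# greater_is_better tells scikit-learn whether larger values of the metric are better.
# Log loss is a loss, so greater_is_better=False is the right choice; the scorer then returns the negated value.
# Note: by default the scorer receives predict() output (class labels). To score probabilities,
# make_scorer must also be told to use predict_proba (how depends on the scikit-learn version).
loss = make_scorer(logloss, greater_is_better=False)
# For comparison only: this treats the same function as a score to maximize.
score = make_scorer(logloss, greater_is_better=True)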
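To see why a loss should be registered with greater_is_better=False, it helps to evaluate log loss on a couple of hand-made predictions. The sketch below restates the same formula using only the standard library so that it runs without NumPy or scikit-learn; the label and probability values are invented for the example.

```python
import math


def logloss(act, pred, epsilon=1e-15):
    """Binary log loss with predictions clipped to [epsilon, 1 - epsilon]."""
    total = 0.0
    for a, p in zip(act, pred):
        p = min(max(p, epsilon), 1 - epsilon)  # keep log() finite
        total += a * math.log(p) + (1 - a) * math.log(1 - p)
    return -total / len(act)


labels = [1, 0, 1]
confident = logloss(labels, [0.9, 0.1, 0.8])  # probabilities close to the labels
hesitant = logloss(labels, [0.6, 0.4, 0.5])   # probabilities close to 0.5
clipped = logloss([1], [0.0])                 # a completely wrong prediction stays finite

print(confident, hesitant, clipped)
```

The better predictions give the smaller value, so a search that maximizes its score needs the sign flipped, which is what greater_is_better=False does.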

2. sklearn.model_selection.RandomizedSearchCV (random search)

sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False)

RandomizedSearchCV is used in the same way as GridSearchCV, but it replaces the exhaustive grid with random sampling from the parameter space. For a parameter that is a continuous variable, RandomizedSearchCV can treat it as a distribution and sample from it, which grid search cannot do. Its search power depends on n_iter (the larger the value, the more finely the space is explored, but the longer the search takes).
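The trade-off can be made concrete with a little arithmetic. The sketch below uses only the standard library: it counts the points in a grid with the same shape as the param_dist used in this article's examples, compares that with n_iter=300, and draws a continuous parameter directly from an interval instead of from a fixed list. The learning-rate interval and the seed are arbitrary choices for the illustration; in scikit-learn itself the continuous case is expressed by passing a distribution object with an rvs method (for example from scipy.stats) in param_distributions.

```python
import math
import random

# Number of candidate values per parameter, matching the param_dist in this article
grid_sizes = {
    "n_estimators": len(range(80, 200, 4)),
    "max_depth": len(range(2, 15, 1)),
    "learning_rate": 20,
    "subsample": 20,
    "colsample_bytree": 10,
    "min_child_weight": len(range(1, 9, 1)),
}
total_grid_points = math.prod(grid_sizes.values())

# Random search evaluates only n_iter of those candidates
n_iter = 300
fraction_explored = n_iter / total_grid_points

# A continuous parameter can be sampled from an interval rather than a fixed list
rng = random.Random(0)
sampled_learning_rates = [rng.uniform(0.01, 0.2) for _ in range(n_iter)]

print(total_grid_points, fraction_explored)
```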

import numpy as np
import pandas as pd
import xgboost as xgb
# sklearn.grid_search was removed from scikit-learn; the class lives in model_selection
from sklearn.model_selection import RandomizedSearchCV


# Load the training data
traindata = pd.read_csv("/traindata.txt", sep=',')
traindata = traindata.set_index('instance_id')
trainlabel = traindata['is_trade']
del traindata['is_trade']
print(traindata.shape, trainlabel.shape)


# Use XGBoost as the classifier
clf1 = xgb.XGBClassifier()

# Search ranges; only the six main XGBoost parameters are searched
# Note: XGBoost documents learning_rate (eta) as lying in [0, 1]; the upper end of this range exceeds that.
param_dist = {
    'n_estimators': range(80, 200, 4),
    'max_depth': range(2, 15, 1),
    'learning_rate': np.linspace(0.01, 2, 20),
    'subsample': np.linspace(0.7, 0.9, 20),
    'colsample_bytree': np.linspace(0.5, 0.98, 10),
    'min_child_weight': range(1, 9, 1)
}

# RandomizedSearchCV arguments:
# clf1: the estimator to train
# param_dist: dict holding the search ranges
# scoring='neg_log_loss': evaluate with the negative log loss
# n_iter=300: sample 300 parameter settings; larger values explore more of the space but take longer
# n_jobs=-1: use all CPUs (the default uses one)
grid = RandomizedSearchCV(clf1, param_dist, cv=3, scoring='neg_log_loss', n_iter=300, n_jobs=-1)

# Fit on the training set
grid.fit(traindata.values, np.ravel(trainlabel.values))
# Retrieve the best estimator
best_estimator = grid.best_estimator_
print(best_estimator)
# Print the best estimator's cross-validated score
print(grid.best_score_)

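3. hyperopt (Bayesian optimization)

Grid search is slow but covers the whole search space well; random search is fast but may miss important points in the search space.

hyperopt is a Python library for "distributed asynchronous algorithm configuration / hyperparameter optimization". It takes most of the tedium out of hyperparameter tuning by searching for good hyperparameters automatically. Broadly speaking, a model with hyperparameters can be viewed as a non-convex function of those hyperparameters, so hyperopt can usually reach more reasonable settings than tuning by hand. For models that are complicated to tune in particular, it tends to match or exceed manual tuning while being much faster.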

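from hyperopt import fmin, hp, tpe, Trials, STATUS_OK
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBClassifier


def hyperopt_eval_func(params, X, y):
    '''Fit the model defined by params on X and return the CV score.
    Args:
        @params: model hyperparameters
        @X: input features
        @y: ground-truth labels
    Return:
        @score: cross-validation loss (negated mean F1, so lower is better)
    '''
    # hp.quniform returns floats, so cast the integer-valued parameters
    int_feat = ['n_estimators', 'max_depth', 'min_child_weight']
    for p in int_feat:
        params[p] = int(params[p])

    clf = XGBClassifier(**params)

    # Use the cross-validation result as the objective
    shuffle = KFold(n_splits=5, shuffle=True)
    score = -1 * cross_val_score(clf, X, y, scoring='f1', cv=shuffle).mean()

    return score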

def hyperopt_binary_model(params):
    '''hyperopt objective: wraps hyperopt_eval_func and prints progress.
    Args:
        @params: hyperparameters proposed by hyperopt
    Return:
        @loss_status: loss and status
    '''
    global best_loss, count, binary_X, binary_y
    count += 1

    # 'type' only labels the model; remove it before building the classifier
    clf_type = params['type']
    del params['type']
    loss = hyperopt_eval_func(params, binary_X, binary_y)
    print(count, loss)
    if loss < best_loss:
        ss = 'count:%d new best loss: %4.3f , using %s' % (count, loss, clf_type)
        print(ss)
        best_loss = loss

    loss_status = {'loss': loss, 'status': STATUS_OK}
    return loss_status

# Renamed from get_best_model: the original defined two functions with that name,
# so the second definition shadowed this one and then called itself with the wrong arguments.
def build_best_model(best):
    '''Build the model that corresponds to the best hyperparameters found by hyperopt.
    Args:
        @best: best hyperparameters
    Return:
        @clf: xgb model
    '''
    int_feat = ['n_estimators', 'max_depth', 'min_child_weight']
    for p in int_feat:
        best[p] = int(best[p])

    # fix the random state
    best['random_state'] = 2018
    clf = XGBClassifier(**best)

    return clf


def get_best_model(X_train, y_train, predictors, max_evals_num=10):
    '''Use hyperopt to find the best xgb model.
    Args:
        @X_train: training features X
        @y_train: training target y
        @predictors: features used for prediction (not used in this function)
        @max_evals_num: number of hyperopt evaluations; more evaluations tend to give a better model but take longer
    Return:
        @clf: best model (unfitted)
    '''
    space = {
        'type': 'xgb',
        'n_estimators': hp.quniform('n_estimators', 50, 400, 50),  # 50 to 400 in steps of 50
        'max_depth': hp.quniform('max_depth', 2, 8, 1),
        # 'learning_rate': hp.loguniform('learning_rate', np.log(0.005), np.log(0.2))
        'learning_rate': hp.uniform('learning_rate', 0.01, 0.1),
        'min_child_weight': hp.quniform('min_child_weight', 2, 8, 1),
        'gamma': hp.uniform('gamma', 0, 0.2),
        'subsample': hp.uniform('subsample', 0.7, 1.0),
        'colsample_bytree': hp.uniform('colsample_bytree', 0.7, 1.0)
    }

    # hyperopt train
    global count, best_loss, binary_X, binary_y
    count = 0
    best_loss = 1000000
    binary_X = X_train
    binary_y = y_train
    trials = Trials()
    best = fmin(hyperopt_binary_model, space, algo=tpe.suggest, max_evals=max_evals_num, trials=trials)
    print('best param:{}'.format(best))
    # The objective is the negated mean F1 from cross-validation, not an MSE
    print('best cv loss (negated F1) on train:{}'.format(best_loss))

    clf = build_best_model(best)

    return clf

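Functions available for defining the parameter space:

hp.pchoice(label, p_options): returns one of the options in p_options with the given probabilities, so the options are not equally likely during the search.
hp.uniform(label, low, high): uniformly distributed between low and high.
hp.quniform(label, low, high, q): takes values round(uniform(low, high) / q) * q; suited to discrete values.
hp.loguniform(label, low, high): returns a value drawn as exp(uniform(low, high)), so that the logarithm of the returned value is uniformly distributed.
During optimization the variable is constrained to the interval [exp(low), exp(high)].
hp.randint(label, upper): returns a random integer in the half-open interval [0, upper).
hp.normal(label, mu, sigma): normally distributed with mean mu and standard deviation sigma; the returned value is unbounded.
hp.qnormal(label, mu, sigma, q)
hp.lognormal(label, mu, sigma)
hp.qlognormal(label, mu, sigma, q)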
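The formulas quoted above for hp.quniform and hp.loguniform can be imitated with the standard library to see what kind of values they produce. The two helper functions below are stand-ins written for this illustration, not hyperopt's implementation, and the seed and sample count are arbitrary; the ranges are the ones used for n_estimators and learning_rate in this article.

```python
import math
import random

rng = random.Random(42)


def quniform(low, high, q):
    """Stand-in for the quniform formula: round(uniform(low, high) / q) * q."""
    return round(rng.uniform(low, high) / q) * q


def loguniform(low, high):
    """Stand-in for the loguniform formula: exp(uniform(low, high))."""
    return math.exp(rng.uniform(low, high))


# n_estimators from 50 to 400 in steps of 50
n_estimators_samples = [quniform(50, 400, 50) for _ in range(1000)]

# learning_rate between 0.005 and 0.2, uniform on the log scale
learning_rate_samples = [
    loguniform(math.log(0.005), math.log(0.2)) for _ in range(1000)
]

print(sorted(set(n_estimators_samples)))
print(min(learning_rate_samples), max(learning_rate_samples))
```

Note that the bounds passed to loguniform are logarithms of the desired limits. hyperopt's hp.quniform returns floats, which is why integer-valued parameters such as n_estimators need an int() cast before being passed to the model.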

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from hyperopt import hp
from hyperopt.pyll.stochastic import sample

# Log-uniform search space for the learning rate
learning_rate = {'learning_rate': hp.loguniform('learning_rate', np.log(0.005), np.log(0.2))}

# Draw 10000 samples from the space to inspect its distribution
learning_rate_dist = []
for _ in range(10000):
    learning_rate_dist.append(sample(learning_rate)['learning_rate'])

plt.figure(figsize=(8, 6))
# fill replaces the deprecated shade argument (seaborn >= 0.11)
sns.kdeplot(learning_rate_dist, color='r', linewidth=2, fill=True)
plt.title('Learning Rate Distribution', size=18)
plt.xlabel('Learning Rate', size=16)
plt.ylabel('Density', size=16)

[Figure: density plot of the sampled learning-rate distribution]

# Quantized uniform search space for num_leaves (integers from 30 to 150)
num_leaves = {'num_leaves': hp.quniform('num_leaves', 30, 150, 1)}
num_leaves_dist = []
for _ in range(10000):
    num_leaves_dist.append(sample(num_leaves)['num_leaves'])

plt.figure(figsize=(8, 6))
# fill replaces the deprecated shade argument (seaborn >= 0.11)
sns.kdeplot(num_leaves_dist, linewidth=2, fill=True)
plt.title('Number of Leaves Distribution', size=18)
plt.xlabel('Number of Leaves', size=16)
plt.ylabel('Density', size=16)

[Figure: density plot of the sampled num_leaves distribution]

IV. Meaning of the XGBoost parameters

See the XGBoost documentation; the details are not repeated here.

References

https://xgboost.readthedocs.io/en/latest/parameter.html

https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html

https://zhuanlan.zhihu.com/p/95304498 [detailed parameter explanations]

https://blog.csdn.net/pearl8899/article/details/80820067

https://blog.csdn.net/han_xiaoyang/article/details/52665396

cnblogs.com/wj-1314/p/10422159.html [grid search and random search]

https://blog.csdn.net/sanjianjixiang/article/details/104528478/ [LightGBM optimization methods; very detailed on hyperopt]

https://www.jianshu.com/p/e1bda6355452 [hyperopt search spaces and search algorithms, explained in detail]

yuncy_lucky
