【ML】XGBoost native API: grid search and cross-validation

First, a look at the parameters of the native xgb training API:

def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
          maximize=False, early_stopping_rounds=None, evals_result=None,
          verbose_eval=True, xgb_model=None, callbacks=None)
  • For now we care about the following parameters:
    • params: the model parameters, described below
    • dtrain: the training data, in DMatrix format
    • num_boost_round: number of boosting rounds (trees)
    • evals: validation set(s) to watch during training
    • early_stopping_rounds: early stopping
    • evals_result: a dict that collects the training history

Imports

import numpy as np
import xgboost as xgb

from sklearn.datasets import load_breast_cancer, load_boston
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, make_scorer

Classification

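A minimal binary-classification setup that produces the log below might look like the following (the use of load_breast_cancer, objective 'binary:logistic', and the 'error' eval metric are assumptions, mirroring the regression example further down):

# data
data = load_breast_cancer()
x_data = data.data
y_data = data.target
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=999)

dtrain = xgb.DMatrix(x_train, y_train)
dtest = xgb.DMatrix(x_test, y_test)

# params
params = {
    'objective': 'binary:logistic',                  # binary classification
    'eval_metric': 'error',                          # classification error rate
    'max_depth': 2,
    'eta': 1
}
num_round = 50
watch_list = [(dtrain, 'train'), (dtest, 'test')]
evals_result = {}                                    # collects the training history

# train
bst = xgb.train(params,
                dtrain,
                num_round,
                evals=watch_list,
                early_stopping_rounds=10,
                evals_result=evals_result)

# binary:logistic outputs probabilities, so threshold at 0.5 before comparing with the labels
preds = bst.predict(dtest)
labels = dtest.get_label()
error = sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))
print(f'error={error: .2f}')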
[12] train-error:0.00000 test-error:0.06140
[13] train-error:0.00000 test-error:0.05263
Stopping. Best iteration:
[3] train-error:0.00000 test-error:0.05263
error= 0.05

Regression:

data = load_boston()
x_data = data.data
y_data = data.target                                
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=999)

dtrain = xgb.DMatrix(x_train, y_train)
dtest = xgb.DMatrix(x_test, y_test)


# params
params = {
    'objective': 'reg:squarederror',                 # regression with squared-error objective
    'max_depth': 2,
    'eta': 1
}
num_round = 50
watch_list = [(dtrain, 'train'), (dtest, 'test')]
evals_result = {}                                   # collects the training history

# train
bst = xgb.train(params, 
                dtrain, 
                num_round, 
                evals=watch_list, 
                early_stopping_rounds=10, 
                evals_result=evals_result)

# for regression, predict returns the predicted values directly, so they can be compared straight away
preds = bst.predict(dtest)
labels = dtest.get_label()

# rmse
print(np.sqrt(mean_squared_error(preds,labels)))
[16] train-rmse:2.20856 test-rmse:3.68877
[17] train-rmse:2.15026 test-rmse:3.62916
[18] train-rmse:2.09039 test-rmse:3.62334
Stopping. Best iteration:
[8] train-rmse:2.78789 test-rmse:3.56546
rmse: 3.623343
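The evals_result dict filled in by xgb.train is keyed by the names given in watch_list and then by metric, so the learning curves can be inspected directly (a small sketch):

# evals_result looks like {'train': {'rmse': [...]}, 'test': {'rmse': [...]}}
train_rmse = evals_result['train']['rmse']
test_rmse = evals_result['test']['rmse']
print(len(test_rmse), min(test_rmse))               # completed rounds and the best test rmse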

Multi-class classification

# simulated data: a 500 x 10 feature matrix with 4 classes
np.random.seed(1994)

kRows = 500
kCols = 10
kClasses = 4                    


x_data = np.random.randn(kRows, kCols)
y_data = np.random.randint(0, kClasses, size=kRows)

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=999)

dtrain = xgb.DMatrix(x_train, y_train)
dtest = xgb.DMatrix(x_test, y_test)


params = {
    'objective': 'multi:softmax',                 # multi-class classification
    'num_class': 4,                               # the number of classes must be specified
    'max_depth': 2
}


num_round = 50
watch_list = [(dtrain, 'train'), (dtest, 'test')]
evals_result = {}                                   # collects the training history

# train
bst = xgb.train(params, 
                dtrain, 
                num_round, 
                evals=watch_list, 
                early_stopping_rounds=10, 
                evals_result=evals_result)

# with multi:softmax, predict returns the class labels directly, so they can be compared as-is
preds = bst.predict(dtest)
labels = dtest.get_label()

error = sum([1 for i in range(len(preds)) if preds[i] != labels[i]]) / float(len(preds))

print(f'error={error: .2f}')
[11] train-merror:0.38000 test-merror:0.74000
[12] train-merror:0.36500 test-merror:0.74000
[13] train-merror:0.35500 test-merror:0.75000
Stopping. Best iteration:
[3] train-merror:0.47750 test-merror:0.67000
error= 0.70
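With 'multi:softmax' the prediction is the class label itself. If class probabilities are needed, 'multi:softprob' can be used instead; predict then returns an (n_samples, num_class) array (a quick sketch, reusing the data above):

params_prob = dict(params, objective='multi:softprob')
bst_prob = xgb.train(params_prob, dtrain, num_round, evals=watch_list, early_stopping_rounds=10)
probs = bst_prob.predict(dtest)                     # shape (n_samples, num_class)
pred_labels = np.argmax(probs, axis=1)              # back to class labels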

Since early_stopping_rounds was specified during training, prediction can be restricted to the best number of boosting rounds:

preds = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)
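Note that best_iteration is a 0-based index, which is why ntree_limit should use best_ntree_limit; newer xgboost releases (roughly 1.4 and later, an assumption about the installed version) deprecate ntree_limit in favour of iteration_range:

preds = bst.predict(dtest, iteration_range=(0, bst.best_iteration + 1))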

Parameter tuning

The commonly tuned model parameters are:

  • num_boost_round: number of boosting rounds, i.e. how many trees are trained; larger values overfit more easily; defaults to 10 in xgb.train (see the signature above)
  • max_depth: maximum depth of a single tree; larger values overfit more easily; default 6
  • min_child_weight: minimum sum of instance weight needed in a child node; if a candidate split would create a child below this value, the split is abandoned; for squared-error regression this is roughly a minimum number of samples per leaf; larger values are more conservative; default 1
  • subsample: fraction of the training samples used to grow each tree; larger values overfit more easily; default 1
  • colsample_bytree: fraction of the features used when building each tree; larger values overfit more easily; default 1
  • gamma: minimum loss reduction required to split a node; larger values are more conservative; default 0
  • lambda: L2 regularisation coefficient on the leaf scores w; larger values are more conservative; default 1
  • eta: learning rate (shrinkage), i.e. the weight applied to each new tree; smaller values are more conservative but need more boosting rounds; default 0.3
  • There are also configuration parameters that are not usually tuned, such as booster (the type of weak learner) and verbosity (logging level)
  • More details in the official docs: https://xgboost.readthedocs.io/en/latest/parameter.html#

A typical tuning procedure:

1. Fix num_boost_round = 100 first

2. Tune max_depth, min_child_weight, subsample, colsample_bytree

3. Then tune gamma and lambda

4. Finally tune eta and num_boost_round together

At each step, train with the parameters found so far, check how the model behaves on the validation set, and use that to narrow the search range for the next step.

The native xgboost API does not plug into sklearn's grid search directly, so there are two options: write your own loop around the native cv function, or wrap the native API in a small custom class that GridSearchCV can drive. The first option is sketched below; the rest of this post uses the second.
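A hand-rolled loop over xgb.cv for a single parameter might look roughly like this (the candidate values are placeholders, and the Boston split from the regression example is assumed):

# sketch: tune max_depth with the native cv function
dtrain = xgb.DMatrix(x_train, y_train)
for max_depth in [4, 6, 8]:
    cv_params = {'objective': 'reg:squarederror', 'max_depth': max_depth}
    cv_res = xgb.cv(cv_params, dtrain, num_boost_round=100, nfold=5, early_stopping_rounds=10)
    # cv_res has one row per round: train-rmse-mean, train-rmse-std, test-rmse-mean, test-rmse-std
    print(max_depth, cv_res['test-rmse-mean'].iloc[-1])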

# custom estimator wrapping the native API so it can be used with sklearn's GridSearchCV
class MyXGBoostEstimator:
    def __init__(self, **params):
        self.params = params
        # number of boosting rounds; kept inside params so that sklearn's clone() preserves it
        self.num_boost_round = self.params.get('num_boost_round', 100)
        # defaults to regression; pass a different objective for classification
        self.params.setdefault('objective', 'reg:squarederror')
        self.params.setdefault('seed', 0)
        self.params.setdefault('verbosity', 0)      # quiet logging (the old 'silent' flag is deprecated)

    def fit(self, x_train, y_train):
        dtrain = xgb.DMatrix(x_train, y_train)
        self.bst = xgb.train(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round)
        return self
        
    def predict(self, x_pred):
        dpred = xgb.DMatrix(x_pred)
        return self.bst.predict(dpred)
    
    def kfold(self, x_train, y_train, nfold=5):
        dtrain = xgb.DMatrix(x_train, y_train)
        
        cv_rounds = xgb.cv(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round,
                           nfold=nfold, early_stopping_rounds=10)
        
        return cv_rounds.iloc[-1,:]

    def get_params(self, deep=True):
        return self.params
    
    def set_params(self, **params):
        self.params.update(params)
        return self

The methods above are all required: fit and predict handle training and prediction, kfold runs the native cross-validation, and get_params/set_params let GridSearchCV clone the estimator and set different parameter combinations during the search.
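As a quick sanity check of the wrapper before wiring it into the grid search (the parameter values here are only illustrative):

model = MyXGBoostEstimator(num_boost_round=100, max_depth=6)
print(model.kfold(x_train, y_train))                # last cv row: train/test rmse mean and std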

# scoring function: RMSE (argument order follows sklearn's (y_true, y_pred) convention; RMSE is symmetric anyway)
def score_fn(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

cv_score_fn = make_scorer(score_fn, greater_is_better=False)

Run the grid search:

First, tune max_depth, min_child_weight, subsample, colsample_bytree:

param_grid = {'max_depth': [5, 6, 7],
              'min_child_weight': [1, 5, 9],
              'subsample': [0.8, 1],
              'colsample_bytree': [0.8, 1]}


model = MyXGBoostEstimator(num_boost_round=100)

grid = GridSearchCV(model, param_grid=param_grid, cv=3, scoring=cv_score_fn)

grid.fit(x_train, y_train)

print(grid.best_params_)
{'colsample_bytree': 1,
'max_depth': 7,
'min_child_weight': 5,
'subsample': 0.8}
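Because the scorer was built with greater_is_better=False, GridSearchCV stores the negated RMSE, so the cross-validated RMSE of the best combination can be recovered with:

print(-grid.best_score_)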

Next, tune gamma and lambda:

# fix the parameters tuned so far as defaults
params = {'num_boost_round': 100,
         'colsample_bytree': 1,
         'max_depth': 7,
         'min_child_weight': 5,
         'subsample': 0.8}

# whether these two are worth tuning depends on how the training loss behaved earlier:
# train once with the parameters found so far, and only tune them if overfitting is severe
param_grid = {'gamma': [0, 0.1, 0.5],
              'lambda': [1, 1.5]}

model = MyXGBoostEstimator(**params)

grid = GridSearchCV(model, param_grid=param_grid, cv=3, scoring=cv_score_fn)

grid.fit(x_train, y_train)

print(grid.best_params_)
{'gamma': 0, 'lambda': 1}
  • The tuned values here turn out to be essentially the defaults.

Finally, tune eta and num_boost_round:

params = {'colsample_bytree': 1,
         'max_depth': 7,
         'min_child_weight': 5,
         'subsample': 0.8}

etas = [0.3, 0.5, 1]
num_boost_rounds = [15, 50, 100]
nfold = 3

best_eta, best_round = 0, 0
best_score = float('inf')


for eta in etas:
    for num_boost_round in num_boost_rounds:
        params.update({'eta': eta})
        dtrain = xgb.DMatrix(x_train, y_train)
        
        cv_rounds = xgb.cv(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
                           nfold=nfold, early_stopping_rounds=10)
        
        # xgb.cv returns one row per round: train-rmse-mean, train-rmse-std, test-rmse-mean, test-rmse-std
        # use the test mean of the last completed round as the score
        score = cv_rounds['test-rmse-mean'].iloc[-1]
        
        if score < best_score:
            best_score = score
            best_eta = eta
            best_round = num_boost_round
            
            
print(f'eta: {best_eta}, num_boost_round: {best_round}')
eta: 0.3, num_boost_round: 100

With the parameters settled, train with the best combination and check the result:

params = {'colsample_bytree': 1,
         'max_depth': 7,
         'min_child_weight': 5,
         'subsample': 0.8,
         'eta': 0.3,
         'gamma': 0, 
         'lambda': 1}

dtrain = xgb.DMatrix(x_train, y_train)
dtest = xgb.DMatrix(x_test, y_test)

watch_list = [(dtrain, 'train'), (dtest, 'test')]
evals_result = {}                                   # collects the training history

# train
bst = xgb.train(params=params, dtrain=dtrain, num_boost_round=100,
                evals=watch_list, 
                early_stopping_rounds=10, 
                evals_result=evals_result)
[25] train-rmse:0.97110 test-rmse:2.83420
[26] train-rmse:0.95397 test-rmse:2.80972
[27] train-rmse:0.88680 test-rmse:2.82232
Stopping. Best iteration:
[17] train-rmse:1.30593 test-rmse:2.77281
  • Compared with the earlier regression run, the test loss drops noticeably; a quick RMSE check is sketched below.
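As a final check with the best iteration found by early stopping (whether to use ntree_limit or iteration_range depends on the installed xgboost version, as noted earlier):

preds = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)
print(np.sqrt(mean_squared_error(preds, y_test)))   # should be close to the best test-rmse in the log above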