[Notes] XGBoost Hyperparameter Tuning

板栗烧鸡加雪碧

Last modified 2024-02-13 09:05:47

Tags: notes, artificial intelligence, machine learning, deep learning, python, big data

First published 2024-02-12 18:47:49

Original link: https://www.bilibili.com/read/cv12892786/

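References:
Hyperparameter tuning in practice: a competition top-20 strategy with scikit-learn and XGBoost (NetEase)
Avoiding overfitting in XGBoost with early stopping (exercise): how to use eval_set to reduce overfitting (CSDN blog)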

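1. XGBoost evaluation metrics:

XGBoost provides a range of evaluation metrics, including but not limited to:

   - "rmse": root mean squared error
   - "mae": mean absolute error
   - "logloss": binary log loss
   - "mlogloss": multiclass log loss
   - "error": binary classification error rate
   - "auc": area under the ROC curve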
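To make the metric names above concrete, here is a minimal dependency-free sketch of what each one computes. The function names are my own, not XGBoost's API; in XGBoost you would normally just pass the metric name as eval_metric. The "error" version uses the 0.5 cutoff that XGBoost applies to predicted probabilities.

```python
import math

def rmse(y_true, y_pred):
    # Root mean squared error
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # Mean absolute error
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def logloss(y_true, p_pred, eps=1e-15):
    # Binary log loss; probabilities are clipped to avoid log(0)
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

def mlogloss(y_true, prob_rows, eps=1e-15):
    # Multiclass log loss; prob_rows[i][k] is the predicted probability of class k
    total = 0.0
    for t, row in zip(y_true, prob_rows):
        total -= math.log(max(row[t], eps))
    return total / len(y_true)

def error_rate(y_true, p_pred, threshold=0.5):
    # Fraction of wrong predictions; p > threshold counts as positive
    wrong = sum(1 for t, p in zip(y_true, p_pred) if (p > threshold) != bool(t))
    return wrong / len(y_true)

def auc(y_true, p_pred):
    # Probability that a random positive is scored above a random negative
    pos = [p for t, p in zip(y_true, p_pred) if t == 1]
    neg = [p for t, p in zip(y_true, p_pred) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(error_rate(labels, scores), auc(labels, scores))
```

The pairwise AUC above is quadratic in the number of samples, which is fine for illustration but not for large datasets.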

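2. General XGBoost parameter settings

Source: How to prevent overfitting in XGBoost (CSDN blog)

Below are commonly recommended starting ranges for three hyperparameters. Set them within these ranges first, plot the learning curves, then adjust the parameters to find the best model:

learning_rate = 0.1 or smaller; the smaller it is, the more weak learners need to be added;
tree_depth = 2 to 8 (this parameter is called max_depth in XGBoost);
subsample = 30% to 80% of the training set;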
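Why a smaller learning_rate needs more weak learners can be seen in a toy setting. The sketch below assumes an idealized booster where every weak learner fits the current residual perfectly and is then shrunk by the learning rate, so the residual shrinks by a factor of (1 - learning_rate) per round. Real trees are not perfect fits, so treat the numbers as an illustration of the trend only; both function names are hypothetical.

```python
import math

def rounds_needed(learning_rate, tolerance=0.01):
    # Rounds until the residual falls below `tolerance` times its starting value,
    # given residual_n = (1 - learning_rate) ** n * residual_0
    return math.ceil(math.log(tolerance) / math.log(1.0 - learning_rate))

def simulate(target, learning_rate, n_rounds):
    # Idealized boosting on a single constant target
    pred = 0.0
    for _ in range(n_rounds):
        residual = target - pred
        pred += learning_rate * residual  # shrunken update
    return pred

for lr in (0.3, 0.1, 0.01):
    print(lr, rounds_needed(lr))
```

In this toy model, learning_rate = 0.1 needs 44 rounds to get within 1% of the target and learning_rate = 0.01 needs 459, which is why learning_rate and n_estimators are usually tuned together.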
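Next, tuning with GridSearchCV is more convenient:

Hyperparameter combinations worth tuning:

Number and size of trees (n_estimators and max_depth).
Learning rate and number of trees (learning_rate and n_estimators).
Row and column subsampling rates (subsample, colsample_bytree and colsample_bylevel).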

→ [Personal tip] When evaluating a model, I usually use MCC (Matthews correlation coefficient) to measure its performance.

import numpy as np
from sklearn.metrics import matthews_corrcoef

def find_best_threshold_for_mcc(y_true, y_scores):
    """Scan thresholds 0.00 to 0.99 and return the one that maximizes MCC."""
    best_threshold = 0
    best_mcc = -1  # MCC ranges from -1 to 1
    for threshold in np.arange(0, 1, 0.01):
        y_pred = (y_scores >= threshold).astype(int)
        mcc = matthews_corrcoef(y_true, y_pred)
        if mcc > best_mcc:
            best_mcc = mcc
            best_threshold = threshold
    return best_threshold, best_mcc

# Use the function above to find the threshold that maximizes MCC.
# model, X_data, y_data and calc_metrics are defined elsewhere (not shown in these notes).
y_scores = model.predict_proba(X_data)[:, 1]
best_threshold_mcc, best_mcc = find_best_threshold_for_mcc(y_data, y_scores)

# Compute the performance metrics at the best threshold found
y_pred_data_mcc = (y_scores >= best_threshold_mcc).astype(int)
data_metrics_mcc = calc_metrics(y_data, y_pred_data_mcc, y_scores)
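For reference, MCC can be computed directly from the four confusion-matrix counts: MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is a plain-Python illustration with hypothetical helper names. It returns 0 when the denominator is zero, which is also what scikit-learn's matthews_corrcoef does.

```python
import math

def confusion_counts(y_true, y_scores, threshold):
    # Count TP, TN, FP, FN when scores >= threshold are predicted positive
    tp = tn = fp = fn = 0
    for t, s in zip(y_true, y_scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 0 and t == 0:
            tn += 1
        elif pred == 1 and t == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def mcc_from_counts(tp, tn, fp, fn):
    # Matthews correlation coefficient; defined as 0 when any marginal is empty
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for threshold in (0.2, 0.5, 0.9):
    print(threshold, mcc_from_counts(*confusion_counts(labels, scores, threshold)))
```

Because MCC uses all four counts, it stays informative on imbalanced data where accuracy alone can look good.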

3. Common ways to tune XGBoost

Source: Hyperparameter tuning in practice: a competition top-20 strategy with scikit-learn and XGBoost (NetEase)

# Imported libs
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import sys

train = pd.read_csv("train.csv")
X = train.drop(['DEFCON_Level', 'ID'], axis=1)
y = train['DEFCON_Level']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# For classification
# Random Search
xgb_pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
# Inside a Pipeline, parameter names need the step prefix 'classifier__';
# without it the search fails with an "invalid parameter" error.
params = {
    'classifier__min_child_weight': [1, 5, 10],
    'classifier__gamma': [0.5, 1, 1.5, 2, 5],
    'classifier__subsample': [0.6, 0.8, 1.0],
    'classifier__colsample_bytree': [0.6, 0.8, 1.0],
    'classifier__max_depth': [3, 4, 5]
}
random_search = RandomizedSearchCV(xgb_pipeline, param_distributions=params, n_iter=100,
                                   scoring='f1_weighted', n_jobs=4, verbose=3, random_state=1001)
random_search.fit(X_train, y_train)

# OR
# Grid Search
xgb_pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
gbm_param_grid = {
    'classifier__learning_rate': np.array([0.01, 0.001]),
    'classifier__n_estimators': np.array([100, 200, 300, 400]),
    'classifier__subsample': np.array([0.7, 0.8, 0.9]),
    'classifier__max_depth': np.array([10, 11, 12, 13, 14, 15, 16, 17]),
    'classifier__lambda': np.array([1]),
    'classifier__gamma': np.array([0])
    # 'classifier__colsample_bytree': np.arange(0, 1.1, .2)
}
grid_search = GridSearchCV(estimator=xgb_pipeline, param_grid=gbm_param_grid, n_jobs=-1,
                           scoring='f1_weighted', verbose=10)
grid_search.fit(X_train, y_train)

# Print out best parameters
print(random_search.best_params_)
print(grid_search.best_params_)

# Print out scores on the held-out split
print(random_search.score(X_test, y_test))
print(grid_search.score(X_test, y_test))

Random search optimization

# Random Search
xgb_pipeline = Pipeline([('scaler', StandardScaler()),
                         ('classifier', XGBClassifier())])
# Parameter names carry the pipeline step prefix 'classifier__'
params = {'classifier__min_child_weight': [1, 5, 10],
          'classifier__gamma': [0.5, 1, 1.5, 2, 5],
          'classifier__subsample': [0.6, 0.8, 1.0],
          'classifier__colsample_bytree': [0.6, 0.8, 1.0],
          'classifier__max_depth': [3, 4, 5]}
random_search = RandomizedSearchCV(xgb_pipeline, param_distributions=params, n_iter=100,
                                   scoring='f1_weighted', n_jobs=4, verbose=3, random_state=1001)
random_search.fit(X_train, y_train)

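We use XGBClassifier here; XGBRegressor works the same way. The parameters to search go in params, and you can simply add the values you want to try.

We use f1_weighted as the metric because the competition required it. The number of jobs (n_jobs) depends on whether you want to parallelize the computation (if you have multiple cores).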
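As a reminder of what the f1_weighted score measures: it is the per-class F1 averaged with weights proportional to each class's share of the true labels. The plain-Python sketch below illustrates that definition with a hypothetical function name; in practice scikit-learn's f1_score(..., average='weighted') does this for you.

```python
from collections import Counter

def f1_weighted(y_true, y_pred):
    # Per-class F1, weighted by the number of true samples of each class
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # denominator >= n_true >= 1
        score += f1 * n_true / total
    return score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
print(f1_weighted(y_true, y_pred))
```

Frequent classes dominate this score, so a rare class that is never predicted correctly (class 2 in the toy data) lowers it only in proportion to its support.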

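As mentioned, this is a random search, so not every parameter combination is tried. This saves computation time and gives a first indication of good hyperparameter values.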
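The idea can be sketched without any library: enumerate the full grid, then evaluate only a random subset of it. This is an illustration of the principle, not scikit-learn's implementation, and sample_candidates is a hypothetical helper. The value lists are the ones from the random search above, with the pipeline step prefix left out for brevity.

```python
import itertools
import random

params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5],
}

def sample_candidates(space, n_iter, seed):
    # Build every combination, then draw n_iter of them without replacement
    keys = sorted(space)
    grid = [dict(zip(keys, values))
            for values in itertools.product(*(space[k] for k in keys))]
    rng = random.Random(seed)
    return rng.sample(grid, min(n_iter, len(grid))), len(grid)

candidates, total = sample_candidates(params, n_iter=100, seed=1001)
print(f"{len(candidates)} of {total} combinations would be evaluated")
```

With these lists there are 405 combinations in total, so n_iter=100 covers roughly a quarter of the grid.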

# Grid search optimization
xgb_pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
gbm_param_grid = {
    'classifier__learning_rate': np.array([0.01, 0.001]),
    'classifier__n_estimators': np.array([100, 200, 300, 400]),
    'classifier__subsample': np.array([0.7, 0.8, 0.9]),
    'classifier__max_depth': np.array([10, 11, 12, 13, 14, 15, 16, 17]),
    'classifier__lambda': np.array([1]),
    'classifier__gamma': np.array([0])}
grid_search = GridSearchCV(estimator=xgb_pipeline, param_grid=gbm_param_grid, n_jobs=-1,
                           scoring='f1_weighted', verbose=10)
grid_search.fit(X_train, y_train)

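As above, XGBClassifier() can be swapped for XGBRegressor() (with a regression metric for scoring). We set n_jobs to -1 to use all cores for the computation. verbose is set so that the scores and the parameters being evaluated are printed during training.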
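Unlike random search, grid search fits every combination, so it helps to count the work before starting. The sketch below multiplies the list lengths of the grid above. The fold count of 5 is an assumption: it is scikit-learn's default when cv is not passed, as in the call above. The helper name is hypothetical.

```python
import math

# Same value lists as the grid above, written as plain Python lists
gbm_param_grid = {
    'classifier__learning_rate': [0.01, 0.001],
    'classifier__n_estimators': [100, 200, 300, 400],
    'classifier__subsample': [0.7, 0.8, 0.9],
    'classifier__max_depth': [10, 11, 12, 13, 14, 15, 16, 17],
    'classifier__lambda': [1],
    'classifier__gamma': [0],
}

def grid_size(grid):
    # Number of combinations = product of the number of values per parameter
    return math.prod(len(values) for values in grid.values())

n_folds = 5  # assumed: scikit-learn's default cv when none is given
n_combinations = grid_size(gbm_param_grid)
print(n_combinations, "combinations,", n_combinations * n_folds, "cross-validation fits")
```

That is 192 combinations and 960 cross-validation fits, plus one final refit on the whole training set. Each extra value in any list multiplies the total, so it is usually worth narrowing ranges with a random search first.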

# Print out best parameters
print(random_search.best_params_)
print(grid_search.best_params_)

# Print out scores on the held-out split
print(random_search.score(X_test, y_test))
print(grid_search.score(X_test, y_test))

4. Reading train and test loss (https://zhuanlan.zhihu.com/p/297751352)

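The loss on the training set and on the test set together indicate how well the model is fitting. Specifically:

train loss keeps falling and test loss keeps falling: the network is still learning;
train loss levels off while test loss keeps falling: the dataset very likely has a problem;
train loss levels off and test loss levels off: learning has hit a plateau, and the learning rate or batch size should be reduced;
train loss keeps rising and test loss keeps rising: there is a problem such as a poorly designed network structure, badly chosen training hyperparameters, or a flaw in how the dataset was cleaned.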
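The rules above can be written as a small lookup over the recent trend of each curve. This is a rough sketch with hypothetical names: the window length and the tolerance for calling a curve "flat" are arbitrary choices of mine, and a real loss curve is noisy enough that you would want to smooth it first.

```python
def trend(losses, tol=1e-3):
    # Classify a window of losses by comparing its last value with its first
    delta = losses[-1] - losses[0]
    if delta < -tol:
        return "falling"
    if delta > tol:
        return "rising"
    return "flat"

# (train trend, test trend) -> reading, following the four rules listed above
RULES = {
    ("falling", "falling"): "still learning",
    ("flat", "falling"): "suspect the dataset",
    ("flat", "flat"): "plateau: try a smaller learning rate or batch size",
    ("rising", "rising"): "check model design, hyperparameters and data cleaning",
}

def diagnose(train_loss, test_loss, window=5):
    # Look only at the most recent `window` points of each curve
    key = (trend(train_loss[-window:]), trend(test_loss[-window:]))
    return RULES.get(key, "not covered by the rules above")

print(diagnose([1.0, 0.8, 0.6, 0.5, 0.4], [1.1, 0.9, 0.8, 0.7, 0.65]))
```

A falling train loss with a rising test loss is not in the list and so falls through to the default here. That pattern is the usual sign of overfitting, which is what early stopping on an eval_set is meant to catch.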
