Tuning the Parameters of an XGBoost Classifier with Random Search

Latest recommended article published on 2024-08-04 03:12:33

红尘炼丹客

Article tags: classification, machine learning, artificial intelligence, python

Article link: https://blog.csdn.net/tingyunye/article/details/140670628

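Random search

Random search is a parameter optimization method based on random sampling: it looks for a good parameter combination within a given parameter space. Unlike grid search, which evaluates every possible combination, random search evaluates only a fixed number of randomly drawn combinations, so it can find a reasonably good solution with far less computation.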

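How random search works

1. Define the parameter space:
   - Specify the parameters to tune and their value ranges (or distributions). These are typically model hyperparameters such as the learning rate, the tree depth, or the minimum child weight.
2. Sample parameters at random:
   - Draw one set of parameters from the defined space. Each parameter can be sampled from a uniform distribution, a normal distribution, or a discrete list of values.
3. Train and evaluate the model:
   - Train the model on the training set with the sampled parameters and evaluate it on a validation set (or on the validation folds of cross-validation).
4. Score the result:
   - Measure the performance of the current combination with the chosen metric (accuracy, F1 score, and so on).
5. Update the best parameters:
   - If the current combination beats the best score recorded so far, store it as the new best.
6. Iterate:
   - Repeat steps 2 to 5 until the preset number of iterations or the time limit is reached.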
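
The six steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used by any library: `random_search`, `toy_objective`, and the `space` dictionary are hypothetical names, and the toy objective stands in for "train a model and score it on validation data".

```python
import random


def random_search(objective, space, n_iter=50, seed=0):
    """Return (best_params, best_score) after n_iter random draws."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Step 2: draw one candidate; each entry of `space` is a sampler
        params = {name: sample(rng) for name, sample in space.items()}
        # Steps 3-4: score the candidate (higher is better)
        score = objective(params)
        # Step 5: keep the candidate if it beats the best so far
        if score > best_score:
            best_params, best_score = params, score
    # Step 6: the loop ends when the iteration budget is used up
    return best_params, best_score


def toy_objective(params):
    """Hypothetical stand-in for a validation score.

    It peaks at learning_rate = 0.1 and max_depth = 6.
    """
    return (-(params["learning_rate"] - 0.1) ** 2
            - 0.01 * (params["max_depth"] - 6) ** 2)


# Step 1: the parameter space, one sampler per parameter
space = {
    "learning_rate": lambda rng: rng.uniform(0.01, 0.3),      # continuous range
    "max_depth": lambda rng: rng.choice([3, 4, 5, 6, 7, 8]),  # discrete list
}

best_params, best_score = random_search(toy_objective, space, n_iter=200, seed=24)
print(best_params, best_score)
```

With a fixed seed, a larger `n_iter` can only match or improve the best score, because the first draws are identical and later draws only add candidates.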

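Basic steps for tuning an XGBoost classifier

1. Prepare the data:
   - Load the dataset and split it into a training set and a test set.
2. Define the parameter space:
   - Define the search space for the XGBoost classifier, for example `n_estimators` (number of trees), `max_depth` (maximum tree depth), and `learning_rate` (learning rate).
3. Initialize the XGBoost classifier:
   - Create the model with the `XGBClassifier` class. It is provided by the `xgboost` package, not by Scikit-learn, but it follows the Scikit-learn estimator interface.
4. Initialize the random search object:
   - Create a `RandomizedSearchCV` object from Scikit-learn. Its settings include the model, the parameter distributions, the number of sampled combinations (`n_iter`), the number of cross-validation folds (`cv`), the scoring metric (such as `scoring='accuracy'`), and the random seed.
5. Run the random search:
   - Call the `fit` method of the search object with the training data to run the parameter search and train the models.
6. Report the best parameters:
   - After the search finishes, read the best parameter combination from the `best_params_` attribute.
7. Use the best model:
   - Build the best model from the best parameters, evaluate it on the test set, and report the evaluation metrics.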
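
Step 2 can use continuous distributions instead of fixed lists. The sketch below assumes SciPy is installed and uses `ParameterSampler` from Scikit-learn only to preview a few draws. The ranges are illustrative choices, not values recommended by the article.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import ParameterSampler

# Illustrative ranges (assumptions, not tuned values)
param_distributions = {
    "n_estimators": randint(100, 501),       # integers in [100, 500]
    "max_depth": randint(3, 11),             # integers in [3, 10]
    "learning_rate": loguniform(0.01, 0.3),  # log-uniform on [0.01, 0.3]
    "subsample": uniform(0.6, 0.4),          # uniform on [0.6, 1.0]
}

# Preview five candidate combinations without training any model
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=24))
for params in candidates:
    print(params)
```

The same dictionary can be passed as `param_distributions` to `RandomizedSearchCV`, which accepts any object with an `rvs` method as well as plain lists.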

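Advantages and disadvantages of random search

Advantages:
- More efficient search: unlike grid search, random search does not have to visit every parameter combination, so it can often find good parameters within the same time budget.
- Lower compute cost: the savings are largest when the parameter space is large.
- Works in high-dimensional spaces: random search stays practical with many parameters, whereas the cost of grid search grows exponentially with the number of parameters.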
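
To make the cost difference concrete, the short calculation below counts the model fits for the parameter lists used in this article's code (5, 8, 5, 5, 5, and 5 values) with 5-fold cross-validation and `n_iter=100`. The variable names are only for illustration.

```python
from math import prod

# Number of listed values per parameter in this article's search space
grid_sizes = {
    "n_estimators": 5,
    "max_depth": 8,
    "learning_rate": 5,
    "subsample": 5,
    "colsample_bytree": 5,
    "min_child_weight": 5,
}
cv = 5        # cross-validation folds
n_iter = 100  # combinations sampled by random search

n_combinations = prod(grid_sizes.values())
grid_fits = n_combinations * cv
random_fits = n_iter * cv
fraction = n_iter / n_combinations

print(n_combinations)     # 25000
print(grid_fits)          # 125000
print(random_fits)        # 500
print(f"{fraction:.2%}")  # 0.40%
```

Random search therefore trains 500 models where an exhaustive grid would train 125000, while covering 0.4% of the combinations.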

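Disadvantages:
- No optimality guarantee: because combinations are drawn at random, random search cannot guarantee the global optimum; it only finds a reasonably good solution.
- Dependence on randomness: the outcome depends on which combinations happen to be drawn, so results can vary between runs unless the random seed is fixed.
- May need more iterations: finding a good combination can take more iterations, especially when the parameter space is large or complex.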
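
A rough way to judge how many iterations are enough: if each draw is independent and lands in the best fraction `p` of the space with probability `p`, then `n` draws hit that region at least once with probability `1 - (1 - p)^n`. The sketch below works under that simplifying assumption (Scikit-learn samples without replacement when every parameter is a list, which this ignores), and the function names are hypothetical.

```python
import math


def hit_probability(top_fraction, n_iter):
    """Chance that at least one of n_iter independent draws is in the top fraction."""
    return 1.0 - (1.0 - top_fraction) ** n_iter


def draws_needed(top_fraction, confidence):
    """Smallest number of draws that reaches the given hit probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))


print(hit_probability(0.05, 100))  # chance of a top-5% candidate in 100 draws
print(draws_needed(0.05, 0.95))    # 59
```

Under this assumption, about 59 draws already give a 95% chance of sampling at least one candidate from the best 5% of the space, regardless of how many parameters the space has.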

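In summary, random search tunes a model by evaluating randomly sampled parameter combinations. It is an efficient and widely used optimization method, particularly suited to large datasets and high-dimensional parameter spaces.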

Code

# -*- coding: utf-8 -*-
import time
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, matthews_corrcoef

# Load the dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=24)

print("---------------------Default parameters----------------------------")
# Initialize the XGBoost classifier with default parameters
model_default = XGBClassifier(random_state=24)
# Train
model_default.fit(X_train, y_train)
# Predict
y_pred_default = model_default.predict(X_test)

# Report the evaluation metrics for the default parameters
acc_default = accuracy_score(y_test, y_pred_default)
print("Default parameters accuracy:", acc_default)

precision_default = precision_score(y_test, y_pred_default, average='weighted')
recall_default = recall_score(y_test, y_pred_default, average='weighted')
f1_default = f1_score(y_test, y_pred_default, average='weighted')
auc_default = roc_auc_score(y_test, model_default.predict_proba(X_test), multi_class='ovr')
mcc_default = matthews_corrcoef(y_test, y_pred_default)
conf_mat_default = confusion_matrix(y_test, y_pred_default)

print("Precision:", precision_default)
print("Recall:", recall_default)
print("F1 score:", f1_default)
print("AUC:", auc_default)
print("MCC:", mcc_default)
print("Confusion matrix:\n", conf_mat_default)

print("---------------------Parameter search----------------------------")
t1 = time.time()

# Define the parameter distributions (discrete lists, sampled uniformly)
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 2, 3, 4, 5],
}

# Initialize the XGBoost classifier
model = XGBClassifier(random_state=24)

# Initialize the random search object: 100 sampled combinations, 5-fold CV
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=100,
                                   cv=5, scoring='accuracy', random_state=24, verbose=2)

# Run the random search
random_search.fit(X_train, y_train)
t2 = time.time()

# Report the best parameters and the elapsed time in seconds
print("Best parameters:")
print(random_search.best_params_)
print("time:", t2-t1)

print("---------------------Best model----------------------------")
# best_estimator_ has already been refit on the full training set with the
# best parameters (RandomizedSearchCV uses refit=True by default), so it
# does not need to be trained again
model_best = random_search.best_estimator_

# Predict
y_pred_best = model_best.predict(X_test)

# Report the evaluation metrics for the best model
acc_best = accuracy_score(y_test, y_pred_best)
print("Best parameters accuracy:", acc_best)

precision_best = precision_score(y_test, y_pred_best, average='weighted')
recall_best = recall_score(y_test, y_pred_best, average='weighted')
f1_best = f1_score(y_test, y_pred_best, average='weighted')
auc_best = roc_auc_score(y_test, model_best.predict_proba(X_test), multi_class='ovr')
mcc_best = matthews_corrcoef(y_test, y_pred_best)
conf_mat_best = confusion_matrix(y_test, y_pred_best)

print("Precision:", precision_best)
print("Recall:", recall_best)
print("F1 score:", f1_best)
print("AUC:", auc_best)
print("MCC:", mcc_best)
print("Confusion matrix:\n", conf_mat_best)

Results

---------------------Default parameters----------------------------
Default parameters accuracy: 0.9555555555555556
Precision: 0.9586868686868687
Recall: 0.9555555555555556
F1 score: 0.9548880748880749
AUC: 0.9992816091954023
MCC: 0.9334862385321101
Confusion matrix:
 [[19  0  0]
 [ 1 14  1]
 [ 0  0 10]]
---------------------Parameter search----------------------------
Fitting 5 folds for each of 100 candidates, totalling 500 fits
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=8, min_child_weight=4, n_estimators=200, subsample=0.7; total time=   0.1s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=8, min_child_weight=4, n_estimators=200, subsample=0.7; total time=   0.1s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=8, min_child_weight=4, n_estimators=200, subsample=0.7; total time=   0.1s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=8, min_child_weight=4, n_estimators=200, subsample=0.7; total time=   0.1s
......
[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth=3, min_child_weight=3, n_estimators=300, subsample=0.7; total time=   0.2s
[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth=3, min_child_weight=3, n_estimators=300, subsample=0.7; total time=   0.3s
[CV] END colsample_bytree=0.7, learning_rate=0.05, max_depth=3, min_child_weight=3, n_estimators=300, subsample=0.7; total time=   0.2s
Best parameters:
{'subsample': 1.0, 'n_estimators': 500, 'min_child_weight': 1, 'max_depth': 10, 'learning_rate': 0.01, 'colsample_bytree': 0.6}
time: 173.85961079597473
---------------------Best model----------------------------
Best parameters accuracy: 0.9777777777777777
Precision: 0.9797979797979799
Recall: 0.9777777777777777
F1 score: 0.9779484553678103
AUC: 1.0
MCC: 0.9664959643957367
Confusion matrix:
 [[19  0  0]
 [ 0 15  1]
 [ 0  0 10]]
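
The weighted metrics reported for the tuned model can be checked by hand from its confusion matrix. The sketch below recomputes them in plain Python, assuming the Scikit-learn convention that rows are true classes and columns are predicted classes. The variable names are illustrative.

```python
# Confusion matrix of the tuned model (rows: true class, columns: predicted class)
conf_mat = [
    [19, 0, 0],
    [0, 15, 1],
    [0, 0, 10],
]

n_classes = len(conf_mat)
total = sum(sum(row) for row in conf_mat)
support = [sum(row) for row in conf_mat]  # true samples per class
predicted = [sum(conf_mat[i][j] for i in range(n_classes)) for j in range(n_classes)]
correct = [conf_mat[k][k] for k in range(n_classes)]

accuracy = sum(correct) / total
precision = [correct[k] / predicted[k] if predicted[k] else 0.0 for k in range(n_classes)]
recall = [correct[k] / support[k] if support[k] else 0.0 for k in range(n_classes)]
f1 = [
    2 * precision[k] * recall[k] / (precision[k] + recall[k])
    if precision[k] + recall[k] else 0.0
    for k in range(n_classes)
]


def weighted(values):
    """Average per-class values weighted by the number of true samples."""
    return sum(support[k] * values[k] for k in range(n_classes)) / total


print("accuracy:", accuracy)
print("weighted precision:", weighted(precision))
print("weighted recall:", weighted(recall))
print("weighted F1:", weighted(f1))
```

The single misclassified sample (one class-1 wine predicted as class 2) explains every figure: accuracy is 44/45, and weighted recall equals accuracy because each class's recall is weighted by its own support.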