Using the iris dataset that ships with scikit-learn as an example, this post records how to clearly log every hyperparameter combination tried during model tuning, together with the score of the model it produces.
Import the basic libraries and data
import pandas as pd
from sklearn.metrics import mean_squared_error
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # display CJK labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display minus signs correctly
import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df
Hyperparameter search
GridSearchCV
The old standby: use Scikit-Learn's GridSearchCV for hyperparameter search.
GridSearchCV can be read as two parts, GridSearch and CV, i.e. grid search and cross-validation carried out in one step.
Its signature is GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score=True) (note: this signature is from an older Scikit-learn release; fit_params and iid have since been removed).
For a detailed explanation of every GridSearchCV parameter, see:
https://blog.csdn.net/qq_41076797/article/details/102755904
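Conceptually, GridSearchCV is just a loop over the parameter grid with cross-validation inside. A minimal sketch of that idea, using Ridge and a one-parameter grid as stand-ins (not the XGBoost setup used later in this post):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge
from sklearn.model_selection import ParameterGrid, cross_val_score

X, y = load_iris(return_X_y=True)

# For every combination in the grid, run cross-validation and keep
# the combination with the best (largest) mean score.
best_score, best_params = -np.inf, None
for params in ParameterGrid({'alpha': [0.1, 1.0, 10.0]}):
    scores = cross_val_score(Ridge(**params), X, y,
                             scoring='neg_mean_squared_error', cv=5)
    if scores.mean() > best_score:
        best_score, best_params = scores.mean(), params

print(best_params, best_score)
```

GridSearchCV does the same thing, but also records every combination's scores in cv_results_ and refits the best model for you.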
Scoring in GridSearchCV
My goal is a clear record of the model and its score for every hyperparameter combination, so I can understand why the search selects the parameters it does; scoring is therefore worth a closer look.
The key point:
With scoring='neg_mean_squared_error', Scikit-learn scores models by the negative mean squared error. Grid search looks for the model that maximizes the score, while a smaller mean squared error is better. To reconcile the two, Scikit-learn negates the mean squared error, so that maximizing the negative MSE is equivalent to minimizing the MSE.
- With scoring='neg_mean_squared_error', the score is the negative mean squared error; MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)².
- With scoring='neg_root_mean_squared_error', the score is the negative root mean squared error; RMSE = √MSE, i.e. the square root of the MSE.
https://www.cnblogs.com/ethan-wen/p/17405999.html
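A quick numeric check of the sign convention (the y values below are toy numbers made up purely for this demo):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, made up to illustrate the sign convention
y_true = np.array([3.0, 2.5, 4.0, 5.0])
y_pred = np.array([2.5, 3.0, 4.0, 4.6])

mse = mean_squared_error(y_true, y_pred)   # smaller is better
neg_mse = -mse                             # what 'neg_mean_squared_error' reports
rmse = np.sqrt(mse)                        # 'neg_root_mean_squared_error' reports -rmse

print(f"MSE={mse:.4f}  negative MSE={neg_mse:.4f}  RMSE={rmse:.4f}")
```

Maximizing neg_mse over a set of models picks exactly the model with the smallest mse.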
Code example with an XGBoost model
# Create an XGBoost regression model
import xgboost as xgb
daoshu = 20  # number of trailing rows held out of training
X = iris_df.iloc[:,:4]
y = iris_df.iloc[:,-1:]
XGB_X = X[:-daoshu]
XGB_y = y[:-daoshu]
X_train, X_test, y_train, y_test = train_test_split(XGB_X, XGB_y, test_size=0.3, random_state=18)
xgb_reg = xgb.XGBRegressor()
# Parameter grid -----------------------------------------------------------
param_grid = {
'max_depth': [5,7],
'learning_rate': [0.1, 0.5],
'n_estimators': [100, 500],
# 'subsample': [0.6,0.8, 1],
# 'colsample_bytree': [0.6,0.8, 1],
# 'reg_alpha':[0,0.5,0.8,1],
# 'reg_lambda':[0,0.5,1]
}
grid_search = GridSearchCV(estimator=xgb_reg, param_grid=param_grid, scoring='neg_mean_squared_error', cv=8, verbose=1)
grid_search.fit(X_train, y_train)
# Mean -MSE (mean_test_score) for every parameter combination in the grid search:
# one row per combination, with its parameters and its mean cross-validated score
search_data = pd.DataFrame(grid_search.cv_results_['params'])
search_data['neg_mean_squared_error'] = grid_search.cv_results_['mean_test_score']
#-----------------------------------------------------------------------------------------------------------------------------------------
# Print the best parameters
best_params = grid_search.best_params_
print('best_score:', grid_search.best_score_)
print(f"Parameters found with the last {daoshu} rows held out\n")
print(f"Best parameters: {best_params}")
# Train the model with the best parameters
xgb_reg_optimized = xgb.XGBRegressor(**best_params)
xgb_reg_optimized.fit(XGB_X, XGB_y)
# Predict
y_pred_optimized = xgb_reg_optimized.predict(X)
rmse_optimized = np.sqrt(mean_squared_error(y, y_pred_optimized))
print(f"Optimized RMSE: {rmse_optimized:.4f}")  # RMSE of the final predictions
# Visualization
pre_target = pd.DataFrame(y_pred_optimized)
predata = pd.concat([iris_df, pre_target], axis=1)
plt.figure(figsize=(6, 3))
plt.plot(range(len(predata['target'])), predata['target'], c='blue')
plt.scatter(range(len(predata['target'])), predata.iloc[:, -1:], c='red', s=12)
plt.title('Predicted vs. actual values', size=20)
plt.legend(['actual target', 'predicted'])
plt.show()
predata.tail()
Results
The scoring for every hyperparameter combination is shown in the figure: 222 combinations in total, with the best_score for best_params highlighted in red. Why does the search choose this one? To repeat: with scoring='neg_mean_squared_error', grid search maximizes the score, and since a smaller mean squared error is better, Scikit-learn negates the MSE so that maximizing the negative MSE is equivalent to minimizing the MSE.
Maximized score: -0.048115 (the negative MSE) is the largest value, which corresponds to the smallest actual MSE of 0.048115. With that, every step of the selection is transparent.
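To confirm the mechanics end to end, here is a minimal self-contained sketch, using Ridge as a stand-in for XGBoost so it runs without extra dependencies: best_score_ is exactly the maximum of mean_test_score, i.e. the combination with the smallest MSE.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]},
                    scoring='neg_mean_squared_error', cv=5)
grid.fit(X, y)

# One row per combination: its parameters plus its mean cross-validated -MSE
scores = pd.DataFrame(grid.cv_results_['params'])
scores['neg_mean_squared_error'] = grid.cv_results_['mean_test_score']

print(scores)
print("best:", grid.best_params_, "best_score:", grid.best_score_)
```

The row with the largest (least negative) score in the table is precisely best_params_, which is the whole point of logging the table.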