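Python Machine Learning Basics Tutorial, Chapter 6: Model Tuning and Regularization, 6.2 Basic Concepts of Hyperparameter Tuning: Master the Core Principles, Optimize Model Performance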

Article link: https://blog.csdn.net/jrckkyy/article/details/146439131


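6.2 Basic Concepts of Hyperparameter Tuning: Master the Core Principles, Optimize Model Performance

What Is a Hyperparameter

Definition and Examples

In machine learning, hyperparameters are settings fixed before training begins. They are not learned from the training data; instead they are set by hand or chosen by a search procedure. The choice of hyperparameters has a major effect on model performance.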

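Common hyperparameters include:

Learning rate: controls the step size of each parameter update in gradient descent.
Regularization coefficient: controls model complexity to prevent overfitting.
Maximum depth of a decision tree: limits how deep the tree can grow, which also guards against overfitting.
Kernel function of a support vector machine (SVM): determines the feature space in which the SVM separates the data.
Number of hidden layers and nodes in a neural network: controls the capacity of the network.

These choices directly affect both the training process and the final performance of the model, so finding a good hyperparameter configuration is a central task in model tuning.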
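To make the distinction between hyperparameters and learned parameters concrete, here is a minimal sketch. It assumes scikit-learn and a synthetic dataset from make_classification; neither the dataset nor the value C=0.5 comes from the text above. The hyperparameter C is passed to the constructor before training, while the coefficients exist only after fit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (an assumption, not from the article)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# C (inverse regularization strength) is a hyperparameter: chosen before training
clf = LogisticRegression(C=0.5, max_iter=1000)
hyperparams = clf.get_params()
print("Hyperparameter C:", hyperparams["C"])

# coef_ and intercept_ are learned parameters: they exist only after fit()
has_coef_before = hasattr(clf, "coef_")
clf.fit(X, y)
print("Has coef_ before fit:", has_coef_before)
print("Learned coefficients:", clf.coef_)
```

Changing C and refitting yields different coefficients, which is why the value of C has to be chosen by some procedure outside of training itself.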

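Goals of Hyperparameter Tuning

Improving Model Performance

Tuning hyperparameters can noticeably improve a model's performance on held-out data, not just on the training set. For example, a suitable learning rate lets the model converge faster, and a suitable regularization coefficient keeps it from overfitting. The aim is to find the configuration that scores best on the chosen evaluation metric, measured on validation data; the test set should be reserved for the final evaluation.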
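The effect of the learning rate on convergence can be sketched with plain gradient descent on the toy function f(w) = w². The function, the starting point, and the three learning rates are illustrative assumptions chosen to show slow progress, fast convergence, and divergence.

```python
def gradient_descent(lr, steps=20, w0=5.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # the learning rate scales each update step
    return w

w_small = gradient_descent(lr=0.01)  # too small: still far from 0 after 20 steps
w_good = gradient_descent(lr=0.3)    # reasonable: very close to 0
w_large = gradient_descent(lr=1.1)   # too large: each step overshoots and |w| grows
print(abs(w_small), abs(w_good), abs(w_large))
```

Each update multiplies w by (1 - 2·lr), so the iterates shrink toward the minimum only when that factor has magnitude below 1.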

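Avoiding Overfitting

Overfitting means the model performs very well on the training set but poorly on unseen data such as the test set. It typically happens when the model is too complex for the amount of data and fits noise in the training data rather than the underlying pattern. Adjusting hyperparameters, for example reducing the maximum depth of a decision tree or increasing the regularization coefficient, effectively lowers this risk.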
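A small experiment can illustrate how tree depth relates to overfitting. The sketch below assumes scikit-learn and a noisy synthetic dataset (not data from this article). It compares training and test accuracy for several values of max_depth; typically the unrestricted tree reaches perfect training accuracy while its test accuracy lags behind.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data: flip_y assigns random labels to 10% of the samples
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for depth in [2, 5, None]:  # None lets the tree grow until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

The gap between the two numbers in each row is the quantity to watch: a large gap is the signature of overfitting.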

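Improving Generalization

Generalization is how well a model performs on data it has not seen. Hyperparameter tuning looks for a model that fits the training data well and also carries over to new data. In practice this is done with cross-validation, which scores each candidate configuration on several held-out subsets of the data rather than on a single split.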

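Hyperparameter Tuning Methods

Grid Search

Grid search exhaustively evaluates every combination of hyperparameter values in a predefined grid and keeps the best one. Its cost grows multiplicatively with the number of values per hyperparameter, and it is guaranteed to find only the best combination within the grid, not a global optimum over all possible values.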

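from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Example data (an assumption: the original does not say which dataset is used)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize the model
model = RandomForestClassifier(random_state=42)

# Run the grid search with 5-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best parameters and the mean cross-validated score they achieved
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)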
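The cost of a grid search can be estimated before running it. As a rough sketch, scikit-learn's ParameterGrid enumerates the combinations for the grid used above, and multiplying by the number of folds gives the number of cross-validation fits.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search example
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 combinations
n_folds = 5
n_fits = n_candidates * n_folds                # models trained during cross-validation
print(n_candidates, n_fits)  # → 36 180
```

With the default refit=True, GridSearchCV trains one more model on the full training set using the best combination. Adding a fourth hyperparameter with three values would triple the total.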

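Random Search

Random search samples a fixed number of hyperparameter combinations at random instead of trying them all. It is cheaper than grid search because the budget (n_iter) is set directly, but it may miss the best combination. In practice it often finds a configuration close to the best at a fraction of the cost, especially when only a few of the hyperparameters strongly affect the score.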

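from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Define the candidate values to sample from (lists are sampled uniformly)
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize the model
model = RandomForestClassifier(random_state=42)

# Run the random search: n_iter=10 combinations, each scored with 5-fold cross-validation
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)

# Print the best parameters and the mean cross-validated score they achieved
print("Best parameters:", random_search.best_params_)
print("Best score:", random_search.best_score_)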
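Random search is not limited to fixed lists of values: RandomizedSearchCV also accepts distributions with an rvs method, such as those in scipy.stats. The following is a minimal sketch under assumed settings; the synthetic dataset, the ranges, and the small n_iter and cv values are chosen only so it runs quickly.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset so the example runs quickly (illustrative assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# randint(low, high) samples integers from the half-open range [low, high)
param_dist = {
    'n_estimators': randint(10, 60),
    'max_depth': randint(2, 11),
    'min_samples_split': randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5, cv=3, scoring='accuracy', random_state=42,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

Sampling from ranges lets the search try values that a coarse grid would skip, while the total cost stays fixed at n_iter times the number of folds.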

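Bayesian Optimization

Bayesian optimization is a sequential, model-based search method. It fits a probabilistic surrogate model (often a Gaussian process) to the scores observed so far, uses it to predict which hyperparameter combinations look promising, and updates the model after each new evaluation. It usually needs fewer evaluations than grid or random search to reach a comparable score, which matters most when each model fit is expensive.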

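# BayesSearchCV comes from the scikit-optimize package (pip install scikit-optimize)
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.ensemble import RandomForestClassifier

# Define the hyperparameter search space
param_space = {
    'n_estimators': Integer(50, 200),
    'max_depth': Categorical([None, 10, 20, 30]),
    'min_samples_split': Integer(2, 10)
}

# Initialize the model
model = RandomForestClassifier(random_state=42)

# Run Bayesian optimization: 10 evaluations, each scored with 5-fold cross-validation
bayes_search = BayesSearchCV(model, param_space, n_iter=10, cv=5, scoring='accuracy', random_state=42)
bayes_search.fit(X_train, y_train)

# Print the best parameters and the mean cross-validated score they achieved
print("Best parameters:", bayes_search.best_params_)
print("Best score:", bayes_search.best_score_)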

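Cross-Validation

Cross-validation is a standard way to estimate model performance. The data is split into several subsets (folds); the model is trained on all but one fold and evaluated on the remaining one, and this is repeated so that every fold serves once as the evaluation set. Averaging the scores gives a more reliable estimate than a single train/test split. Common variants include K-fold cross-validation and leave-one-out cross-validation.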

from sklearn.model_selection import cross_val_score

# Evaluate the model with 5-fold cross-validation
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')

# Print the per-fold scores and their mean
print("Cross-validation scores:", scores)
print("Mean score:", scores.mean())
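To show what cross_val_score does internally, the sketch below writes the fold loop out by hand and compares the result with the library call. It assumes the iris dataset and a logistic regression model, both chosen only for illustration. For a classifier, an integer cv corresponds to stratified K-fold splitting without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# For classifiers, cv=5 means stratified 5-fold splitting without shuffling
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    fold_model = clone(clf)                     # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])  # train on the other four folds
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # score on the held-out fold

manual_scores = np.array(manual_scores)
auto_scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print("Manual loop:     ", manual_scores)
print("cross_val_score: ", auto_scores)
```

Both arrays should agree, since the manual loop uses the same splits and the same deterministic model.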

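Learning Curves

A learning curve is a visualization of how model performance changes with training set size. A large, persistent gap between the training score and the cross-validation score suggests overfitting, while two curves that converge at a low score suggest underfitting; either diagnosis indicates which hyperparameters to adjust.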

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")

    plt.legend(loc="best")
    return plt

# Plot the learning curve
plot_learning_curve(model, "Learning Curve (Random Forest)", X_train, y_train, cv=5)
plt.show()

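Regularization Techniques

Regularization prevents overfitting by adding a penalty term to the loss function that limits model complexity. Common techniques are L1 regularization (Lasso), which penalizes the sum of absolute coefficient values and can drive some coefficients exactly to zero, and L2 regularization (Ridge), which penalizes the sum of squared coefficients and shrinks them toward zero.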

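from sklearn.linear_model import Ridge, Lasso

# Note: Ridge and Lasso are regression models, so y_train must be a numeric target here.
# alpha is the regularization strength; larger values mean a stronger penalty.

# L2 regularization
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
print("Ridge coefficients:", ridge_model.coef_)

# L1 regularization
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X_train, y_train)
print("Lasso coefficients:", lasso_model.coef_)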
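The practical difference between the two penalties shows up in the coefficients. The sketch below assumes a synthetic regression problem in which only 3 of 20 features carry signal; the dataset and alpha=5.0 are illustrative choices, not values from this article. It counts how many coefficients each model sets exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data: only 3 of the 20 features are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=5.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# L1 can zero out coefficients entirely; L2 only shrinks them
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print("Exactly-zero coefficients: Ridge =", ridge_zeros, ", Lasso =", lasso_zeros)
```

Because Lasso removes features outright, it doubles as a feature selection method, whereas Ridge keeps every feature with a smaller weight.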

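Early Stopping

Early stopping prevents overfitting by monitoring performance on a validation set during training and stopping once that performance stops improving. This keeps the model from continuing to fit the training set after it has stopped generalizing better.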

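from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Hold out a separate evaluation set; new names avoid overwriting X_train and y_train
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# With early_stopping=True, MLPClassifier internally sets aside validation_fraction of
# the data passed to fit() and stops when that score has not improved for
# n_iter_no_change consecutive epochs
mlp_model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, early_stopping=True,
                          validation_fraction=0.2, n_iter_no_change=10, random_state=42)

# Train the model
mlp_model.fit(X_tr, y_tr)

# Score on the held-out set, which was not used for the early-stopping decision
print("Validation score:", mlp_model.score(X_val, y_val))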
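The stopping rule itself is simple enough to write out. The following is a simplified sketch of patience-based early stopping, not scikit-learn's actual implementation; MLPClassifier additionally requires each improvement to exceed the tol parameter. The function name and the list of validation scores are hypothetical.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation scores."""
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch    # patience exhausted: stop early
    return len(val_scores) - 1, best_epoch  # never triggered: ran all epochs

# Hypothetical validation accuracies, one per epoch
scores = [0.70, 0.78, 0.81, 0.80, 0.81, 0.79, 0.80, 0.82]
stop_epoch, best_epoch = early_stopping_epoch(scores, patience=3)
print(stop_epoch, best_epoch)  # → 5 2
```

Training stops at epoch 5 and never reaches the higher score at epoch 7, which shows the trade-off: a small patience saves time but can stop too soon, so the patience value is itself a hyperparameter.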

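Model Ensembles

Ensembling combines multiple models to improve predictive performance. Common approaches are bagging (training models on bootstrap samples and aggregating their predictions), boosting (training models sequentially so that each one focuses on the errors of its predecessors), and stacking (training a meta-model on the predictions of the base models). Ensembles often improve both accuracy and generalization over a single model.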

from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# X_test and y_test are assumed to be a held-out test split

# Bagging (the keyword is estimator in scikit-learn >= 1.2; it was base_estimator in older releases)
bagging_model = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=42)
bagging_model.fit(X_train, y_train)
print("Bagging score:", bagging_model.score(X_test, y_test))

# Boosting
boosting_model = AdaBoostClassifier(n_estimators=100, random_state=42)
boosting_model.fit(X_train, y_train)
print("Boosting score:", boosting_model.score(X_test, y_test))

# Stacking: base model predictions become the inputs of the meta-model
base_models = [
    ('svm', SVC()),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
]
meta_model = LogisticRegression()
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)
stacking_model.fit(X_train, y_train)
print("Stacking score:", stacking_model.score(X_test, y_test))
