Datawhale 202103 Ensemble Learning (Part 1) | (Supplement) Notes on Machine Learning Hyperparameter Tuning

o0卤化氢0o

Article link: https://blog.csdn.net/zeroice7/article/details/115038311

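Summary of Machine Learning Hyperparameter Tuning

S1: Fundamentals of hyperparameter tuning
S2: Tuning methods
S3: Tuning specific models (3.27)
- S3.1 SVM
- S3.2 Decision trees
S4: Worked tuning example
- S4.1 Tuning approach
- S4.2 Homework example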

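S1: Fundamentals of hyperparameter tuning

[Goal: understand that tuning adjusts the bias-variance trade-off within the total error]

Take a regression model with high-order terms, low-order terms and a constant term as an example. If training made the fitted function pass exactly through every data point, the model's loss on that data set would be 0.
However, a model that fits the training set this closely is tightly coupled to the training data; the test set is what tells us how well the model generalizes.
Bias-variance trade-off
The test mean squared error follows a U-shaped curve, which shows that two opposing forces are at work in the test error. It can be shown that: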
$E\left(y_{0}-\hat{f}\left(x_{0}\right)\right)^{2}=\operatorname{Var}\left(\hat{f}\left(x_{0}\right)\right)+\left[\operatorname{Bias}\left(\hat{f}\left(x_{0}\right)\right)\right]^{2}+\operatorname{Var}(\varepsilon)$
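In other words, the expected test mean squared error decomposes into the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error term $\varepsilon$.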
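
A short sketch of why the decomposition holds, under the usual assumptions (not stated in the original) that $y_0 = f(x_0) + \varepsilon$ with $E(\varepsilon)=0$ and $\varepsilon$ independent of $\hat{f}(x_0)$. Here $f$ denotes the true function and all expectations are taken over training sets and noise:

```latex
\begin{align*}
E\left(y_0-\hat{f}(x_0)\right)^2
  &= E\left(f(x_0)-\hat{f}(x_0)+\varepsilon\right)^2 \\
  &= E\left(f(x_0)-\hat{f}(x_0)\right)^2
     + 2\,E(\varepsilon)\,E\left(f(x_0)-\hat{f}(x_0)\right)
     + E(\varepsilon^2) \\
  &= E\left(f(x_0)-\hat{f}(x_0)\right)^2 + \operatorname{Var}(\varepsilon), \\[4pt]
E\left(f(x_0)-\hat{f}(x_0)\right)^2
  &= \left(f(x_0)-E\hat{f}(x_0)\right)^2
     + E\left(\hat{f}(x_0)-E\hat{f}(x_0)\right)^2 \\
  &= \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2
     + \operatorname{Var}\left(\hat{f}(x_0)\right).
\end{align*}
```

The cross term in the second expansion vanishes because $E\left(\hat{f}(x_0)-E\hat{f}(x_0)\right)=0$.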

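The curves above can be read in terms of how each function is built:

The yellow curve contains only low-order terms, for example $f(x_0) = ax_0+b$.
The green curve wiggles visibly around the yellow one and probably contains many high-order terms, for example $f(x_0)=ax_0^{n} + bx_0^{n-1} + \dots + x_0 + k$.
The variance term comes mainly from the high-order terms $x^n$, whose fitted coefficients change a lot from one training set to the next; it is most visible in the green curve.
The bias term comes from the model failing to match the underlying trend, for instance through the low-order term $x$ or the constant $k$; it is present in both the yellow and the green curve.

To bring the test mean squared error to its minimum we have to make the squared bias and the variance small at the same time. Both quantities are non-negative, so the expected test mean squared error can never fall below the variance of the error term. We therefore call $\operatorname{Var}(\varepsilon)$ the difficulty of the modeling task. Once the task is fixed this quantity cannot be changed, which is why it is also called the irreducible error.

[PS: tuning is the process of adjusting the hyperparameters so that the trained algorithm reaches a bias-variance balance.]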
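
The trade-off can be made concrete with a small simulation. The sketch below is an illustration only: the true function `sin(2πx)`, the noise level, the test point `x0` and the two polynomial degrees are all assumptions chosen for the demo, not values from the original post. It refits a low-degree and a high-degree polynomial on many noisy training sets and estimates the squared bias and the variance of the prediction at `x0`.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_f(x):
    # Assumed ground-truth function for the demo
    return np.sin(2 * np.pi * x)


x_train = np.linspace(0.0, 1.0, 30)  # fixed training inputs
x0 = 0.25                            # test point where we measure bias and variance
sigma = 0.3                          # standard deviation of the noise term
n_repeats = 500                      # number of simulated training sets

results = {}
for degree in (1, 9):
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        # Draw a fresh noisy training set and refit the polynomial
        y_train = true_f(x_train) + rng.normal(scale=sigma, size=x_train.size)
        fit = Polynomial.fit(x_train, y_train, deg=degree)
        preds[r] = fit(x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # squared bias at x0
    variance = preds.var()                      # variance of the prediction at x0
    results[degree] = (bias_sq, variance)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the straight line shows a large squared bias and a small variance, while the degree-9 polynomial shows the opposite pattern, which is the tension that hyperparameter tuning tries to balance.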

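S2: Tuning methods

The quantities in a model fall into two groups. Parameters are computed and optimized automatically by the training procedure and need no manual adjustment. Quantities that the algorithm cannot learn on its own are called hyperparameters; they have to be set according to how the algorithm works in order to get the best result. Choosing them is the basic process by which tuning addresses the bias-variance trade-off.

Model parameters:

They are required when making predictions.
Their values define the skill of the model on the problem.
They are estimated or learned from data.
They are usually not set manually by the programmer.
They are usually saved as part of the learned model.
They are the key to a machine learning algorithm and are usually derived from past training data.
Hyperparameters:
A hyperparameter is a configuration external to the model whose value cannot be estimated from data.
They are usually used to help estimate the model parameters.
They are usually specified by hand.
They can usually be set with heuristics.
They are often tuned for a given predictive modeling problem.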
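
The distinction can be seen directly in scikit-learn. The following is a minimal sketch on synthetic data; the choice of `Ridge`, the value `alpha=1.0` and the generated data are assumptions made for illustration. `alpha` is a hyperparameter passed to the constructor, while `coef_` and `intercept_` are parameters that only exist after `fit` has estimated them from the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (assumed for the demo)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hyperparameter: chosen by us before training
model = Ridge(alpha=1.0)

# Parameters: estimated from the data during fit
model.fit(X, y)

print("hyperparameter alpha:", model.get_params()["alpha"])
print("learned coefficients:", model.coef_)
print("learned intercept:", model.intercept_)
```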

Several search strategies are commonly available for choosing hyperparameters.
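
As an illustration, two strategies that scikit-learn provides are exhaustive grid search (`GridSearchCV`, which is used later in this post) and random search (`RandomizedSearchCV`). The sketch below runs both on a small built-in dataset; the dataset (`load_iris`) and the parameter ranges are assumptions picked only to keep the example fast.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: evaluates every combination in the grid (3 x 3 = 9 candidates)
grid = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)

# Random search: samples a fixed number of candidates from distributions
rand = RandomizedSearchCV(
    SVC(kernel="rbf"),
    {"C": loguniform(1e-1, 1e2), "gamma": loguniform(1e-3, 1e0)},
    n_iter=10,
    cv=5,
    random_state=0,
)
rand.fit(X, y)

print("grid search  :", grid.best_params_, round(grid.best_score_, 3))
print("random search:", rand.best_params_, round(rand.best_score_, 3))
```

Grid search is simple and reproducible but its cost grows multiplicatively with each added parameter; random search keeps the budget fixed through `n_iter`.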

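S3: Tuning specific models (3.27)

S3.1 SVM

There are two important parameters:

C is the penalty coefficient on the slack variables. The larger C is, the more heavily misclassified training points are penalized, so the decision boundary bends to accommodate more of them, including noisy points.
Gamma describes how spread out the data is. In the RBF kernel gamma can be understood as the inverse of the variance of a Gaussian: a large variance means a small gamma and a widely spread influence, a small variance means a large gamma and a concentrated influence.
Goal: we want C to be small and Gamma to be large (in the idealized picture C is 0 and Gamma is 1; note that scikit-learn requires C to be strictly positive).
[PS: the derivation will be added later]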
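
To make the role of gamma concrete, the sketch below evaluates the RBF kernel $K(a, b) = \exp(-\gamma \lVert a-b \rVert^2)$ for one pair of points at several gamma values and checks the hand-computed value against `sklearn.metrics.pairwise.rbf_kernel`. The two points and the gamma values are assumptions chosen for illustration. In terms of a Gaussian with standard deviation $\sigma$, $\gamma = 1/(2\sigma^2)$.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])  # squared Euclidean distance to a is 2

similarity = {}
for gamma in [0.01, 0.1, 1.0]:
    manual = np.exp(-gamma * np.sum((a - b) ** 2))   # kernel formula by hand
    library = rbf_kernel(a, b, gamma=gamma)[0, 0]    # same value from scikit-learn
    assert np.isclose(manual, library)
    similarity[gamma] = manual
    print(f"gamma={gamma}: K(a, b)={manual:.4f}")
```

A larger gamma makes the similarity between the same two points decay faster, so each support vector influences a smaller neighbourhood.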

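S3.2 Decision trees

A decision tree has the following important parameters:

criterion (how the impurity of a split is measured): usually the Gini index (gini) or the cross entropy (entropy), where $\hat{p}_{mk}$ is the proportion of class-$k$ samples in region $m$.
Gini index: $\sum\limits_{k=1}^{K} \hat{p}_{mk}(1-\hat{p}_{mk})$
Cross entropy: $-\sum\limits_{k=1}^{K} \hat{p}_{mk}\log\hat{p}_{mk}$
max_depth: the maximum depth of the tree.
min_samples_split: the minimum number of samples required to split an internal node.
min_samples_leaf: the minimum number of samples required at a leaf node.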
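
These four names are constructor arguments of scikit-learn's `DecisionTreeClassifier`. A minimal usage sketch follows; the dataset (`load_iris`) and the specific values are assumptions for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(
    criterion="gini",       # or "entropy"
    max_depth=3,            # limit the depth of the tree
    min_samples_split=4,    # a node needs at least 4 samples to be split
    min_samples_leaf=2,     # every leaf keeps at least 2 samples
    random_state=42,
)
tree.fit(X_train, y_train)

print("depth:", tree.get_depth())
print("test accuracy:", tree.score(X_test, y_test))
```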
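Complete steps of the decision tree classification algorithm:
a. Choose the best split feature j and the best split point s on that feature:
loop over the features j and, with j fixed, over the candidate split points s, and pick the pair (j, s) that minimizes the Gini index or the cross entropy.
b. Split the feature space at (j, s); the class assigned to each region is the most frequent class among the samples in that region.
c. Repeat steps a and b until a stopping condition is met, for example every region contains at most 5 samples.
d. The feature space is now partitioned into J distinct regions, which gives the classification tree.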
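
Step a can be sketched in a few lines of NumPy. This is a simplified illustration rather than how scikit-learn implements it: the function names `gini`, `cross_entropy` and `best_split` are hypothetical helpers, candidate split points are taken as midpoints between consecutive distinct feature values, and the two child impurities are combined as a sample-weighted average.

```python
import numpy as np


def gini(labels):
    """Gini index: sum_k p_k * (1 - p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))


def cross_entropy(labels):
    """Cross entropy: -sum_k p_k * log(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def best_split(X, y, impurity=gini):
    """Return (feature j, split point s, weighted impurity) of the best split."""
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):                       # loop over features
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:      # loop over split points
            left = X[:, j] <= s
            score = (left.sum() * impurity(y[left])
                     + (~left).sum() * impurity(y[~left])) / n
            if score < best[2]:
                best = (j, float(s), float(score))
    return best


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 10.0], [4.0, 20.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # (0, 2.5, 0.0)
```

In the toy data the first feature separates the two classes perfectly at 2.5, so the weighted impurity of that split is 0.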

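S4: Worked tuning example

S4.1 Tuning approach:

First, train a model that is as large (as flexible) as possible so that the bias is as small as possible, and make sure the model is not overfitting before moving on to the next step.
With the model at its largest, keep reducing the risk of overfitting, that is, keep reducing the variance (for an SVM, keep the slack penalty C as small as possible), while limiting the bias this introduces (for an SVM, keep Gamma as large as possible).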
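
How far gamma can be pushed has to be checked against validation data. The sketch below uses scikit-learn's `validation_curve` to compare training and cross-validation accuracy of an RBF SVM for three gamma values. The dataset (the first 500 samples of `load_digits`), the fixed `C=1.0` and the gamma values are assumptions chosen so that the example runs quickly and offline.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # small subset to keep the demo fast

gammas = np.array([1e-5, 1e-3, 1e-1])
train_scores, val_scores = validation_curve(
    SVC(kernel="rbf", C=1.0), X, y,
    param_name="gamma", param_range=gammas, cv=3,
)

# Average the scores over the cross-validation folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for g, tr, va in zip(gammas, train_mean, val_mean):
    print(f"gamma={g:g}: train={tr:.3f}, validation={va:.3f}")
```

A gap that opens between the training and the validation score as gamma grows is the sign of overfitting, so the largest usable gamma is the one at which the validation score is still high.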

S4.2 Homework example

Homework:
Run a hands-on exercise on the fetch_lfw_people dataset.

# Download the data
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)

# Plot a few faces to see what the data looks like
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
fig, ax = plt.subplots(3,5)
fig.subplots_adjust(left=0.0625, right=1.2, wspace=1)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])

['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']
(1348, 62, 47)

LFW faces demo

Step 1: build the largest model by controlling the PCA dimensionality

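# Choose the number of PCA components, aiming for a small C and a large gamma.
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

for PCA_ in [50, 100, 150, 200, 250, 300, 350, 400]:
    # Split the data into a training set and a test set so the classifier can be evaluated
    x_train, x_test, y_train, y_test = train_test_split(faces.data, faces.target, random_state=42)

    # Reduce the dimensionality with PCA and build the SVM classifier
    pca = PCA(n_components=PCA_, whiten=True, random_state=42)
    svc = SVC(kernel='rbf', class_weight='balanced')
    model = make_pipeline(pca, svc)

    # Use grid search with cross-validation to find the best parameter combination,
    # varying C (the slack penalty) and gamma (the width of the RBF kernel)
    param_grid = {'svc__C': [1, 5, 10, 20, 30], 'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
    grid = GridSearchCV(model, param_grid)

    grid.fit(x_train, y_train)
    print("PCA ", str(PCA_), " \tbest parameter -> ", grid.best_params_)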

PCA  50  	best parameter ->  {'svc__C': 10, 'svc__gamma': 0.005}
PCA  100  	best parameter ->  {'svc__C': 5, 'svc__gamma': 0.005}
PCA  150  	best parameter ->  {'svc__C': 10, 'svc__gamma': 0.001}
PCA  200  	best parameter ->  {'svc__C': 5, 'svc__gamma': 0.001}
PCA  250  	best parameter ->  {'svc__C': 5, 'svc__gamma': 0.001}
PCA  300  	best parameter ->  {'svc__C': 10, 'svc__gamma': 0.0001}
PCA  350  	best parameter ->  {'svc__C': 10, 'svc__gamma': 0.0005}
PCA  400  	best parameter ->  {'svc__C': 10, 'svc__gamma': 0.0005}

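Step 2: following the "small C, large Gamma" rule, choose 100 as the PCA dimensionality (it gives the smallest C together with the largest Gamma in the results above).

# PCA=100 is the most suitable choice
PCA_ = 100  # set explicitly; after the loop above PCA_ would still hold 400
x_train, x_test, y_train, y_test = train_test_split(faces.data, faces.target, random_state=42)

# Reduce the dimensionality with PCA and build the SVM classifier
pca = PCA(n_components=PCA_, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

# Use grid search with cross-validation to find the best parameter combination,
# varying C (the slack penalty) and gamma (the width of the RBF kernel)
param_grid = {'svc__C': [1, 5, 10, 20, 30], 'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid)

grid.fit(x_train, y_train)
print("PCA ", str(PCA_), " \tbest parameter -> ", grid.best_params_)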

Step 3: further refine the parameters C and Gamma.
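
One reason to refine the grid is that the step 1 result for PCA=100 (C=5, gamma=0.005) puts gamma on the upper edge of its search range, so the true optimum may lie beyond it. The sketch below checks this mechanically; `params_on_grid_edge` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
def params_on_grid_edge(param_grid, best_params):
    """Return {name: 'lower' | 'upper'} for best values that sit on a grid edge."""
    edges = {}
    for name, best in best_params.items():
        values = sorted(param_grid[name])
        if best == values[0]:
            edges[name] = "lower"
        elif best == values[-1]:
            edges[name] = "upper"
    return edges


# Grid and best parameters reported above for PCA=100
param_grid = {'svc__C': [1, 5, 10, 20, 30],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
best_params = {'svc__C': 5, 'svc__gamma': 0.005}

edges = params_on_grid_edge(param_grid, best_params)
print(edges)  # {'svc__gamma': 'upper'}
```

With a fitted search object the same check could be run as `params_on_grid_edge(param_grid, grid.best_params_)`. The refined grids below extend gamma up to 0.01 and sample C more densely around the previous best value.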

# Reduce to 100 PCA components and build the SVM classifier
pca = PCA(n_components=100, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

# Use grid search with cross-validation to find the best parameter combination,
# varying C (the slack penalty) and gamma (the width of the RBF kernel)
param_grid = {'svc__C': [1, 3, 5, 7, 10, 13, 15], 'svc__gamma':[0.001, 0.003, 0.005, 0.007, 0.01]}
grid = GridSearchCV(model, param_grid)

grid.fit(x_train, y_train)
print("PCA ", 100, " \tbest parameter -> ", grid.best_params_)

Result PCA  100  	best parameter ->  {'svc__C': 13, 'svc__gamma': 0.007}

# Reduce to 100 PCA components and build the SVM classifier
pca = PCA(n_components=100, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

# Use grid search with cross-validation to find the best parameter combination,
# varying C (the slack penalty) and gamma (the width of the RBF kernel)
param_grid = {'svc__C': [10, 11, 12, 13, 14, 15], 'svc__gamma':[0.005, 0.006, 0.007,0.008, 0.009, 0.01]}
grid = GridSearchCV(model, param_grid)

grid.fit(x_train, y_train)
print("PCA ", 100, " \tbest parameter -> ", grid.best_params_)

Result PCA  100  	best parameter ->  {'svc__C': 11, 'svc__gamma': 0.007}

Step 4: take C=11, Gamma=0.007 as the best parameters and plot the learning curves.

# Imports and helper function for plotting the learning curves
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate 3 plots: the test and training learning curve, the training
    samples vs fit times curve, the fit times vs score curve.

    Parameters
    ----------
    estimator : estimator instance
        An estimator instance implementing `fit` and `predict` methods which
        will be cloned for each validation.

    title : str
        Title for the chart.

    X : array-like of shape (n_samples, n_features)
        Training vector, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    y : array-like of shape (n_samples) or (n_samples, n_features)
        Target relative to ``X`` for classification or regression;
        None for unsupervised learning.

    axes : array-like of shape (3,), default=None
        Axes to use for plotting the curves.

    ylim : tuple of shape (2,), default=None
        Defines minimum and maximum y-values plotted, e.g. (ymin, ymax).

    cv : int, cross-validation generator or an iterable, default=None
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, default=None
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like of shape (n_ticks,)
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the ``dtype`` is float, it is regarded
        as a fraction of the maximum size of the training set (that is
        determined by the selected validation method), i.e. it has to be within
        (0, 1]. Otherwise it is interpreted as absolute sizes of the training
        sets. Note that for classification the number of samples usually have
        to be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes,
                       return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, 'o-')
    axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
                         fit_times_mean + fit_times_std, alpha=0.1)
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    axes[2].grid()
    axes[2].plot(fit_times_mean, test_scores_mean, 'o-')
    axes[2].fill_between(fit_times_mean, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1)
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt

# Plot the learning curves
SVM_C = 11
SVM_GAMMA = 0.007

fig, axes = plt.subplots(3, 1, figsize=(10, 15))

title = r"Learning Curves (SVM, RBF kernel, $C={}$, $\gamma={}$)".format(str(SVM_C), str(SVM_GAMMA))
# SVC is more expensive so we do a lower number of CV iterations:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
model = make_pipeline(pca, SVC(C=SVM_C, gamma=SVM_GAMMA))
plot_learning_curve(model , title, x_train, y_train, axes=axes[:], ylim=(0.3, 1.2), cv=cv, n_jobs=4)

# plt.savefig('./IMG/lfw_svm_c{}_gamma{}.jpg'.format(str(SVM_C), str(SVM_GAMMA)))
plt.show()