[Basic Concepts] (Personal Notes) Several Methods for Validating Models (Cross-Validation and Model Comparison)

Latest recommended article published on 2024-09-21 20:29:28

CristinaM

Original article: https://towardsdatascience.com/validating-your-machine-learning-model-25b4c8643fb7

Validating your Machine Learning Model

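Preface

Using appropriate validation techniques helps you understand your model and estimate its generalization performance without bias.
No single validation method suits every project, so it is important to understand what kind of data you have.
Besides the common k-Fold cross-validation, this post also covers Nested CV, LOOCV, and several techniques for model selection.

The following methods are covered: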

Train/test split
k-Fold Cross-Validation
Leave-one-out Cross-Validation
Leave-one-group-out Cross-Validation
Nested Cross-Validation
Time-series Cross-Validation
Wilcoxon signed-rank test
McNemar’s test
5x2CV paired t-test
5x2CV combined F test

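1. Splitting the data appropriately

All validation techniques are built on splitting the data when training the model.

Train/test split

The most basic approach is the train/test split. It is very simple: for example, 70% of the data goes into the training set and 30% into the test set.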

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 5 samples with 2 features each
X, y = np.arange(10).reshape((5, 2)), range(5)

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

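The advantage of this method is that we can see how the model reacts to previously unseen data.
However, what happens if one subset of our data only contains people of a certain age or income level?
This is usually called sampling bias: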

Sampling bias is systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others, resulting in a biased sample.
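
For classification labels, one common way to reduce this kind of bias is a stratified split, which keeps the class proportions the same in both subsets. The sketch below is only an illustration: the Iris dataset and the `stratify=y` argument are assumptions that do not appear in the original example, and stratifying on the label does not protect against bias in other attributes such as age or income.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris has 3 classes with 50 samples each (assumed example data)
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train class counts:", train_counts)
print("test class counts:", test_counts)
```

Without `stratify`, the class counts in the two subsets can drift apart purely by chance, especially on small datasets.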

Holdout set

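When optimizing a model's hyperparameters, overfitting can occur if the train/test split itself is used for the optimization.
This can be addressed by creating an additional holdout set. The holdout set is often 10% of the data that has not been used in any processing or validation step.
After optimizing the model on the train/test split, you can check for overfitting by validating on the holdout set.
**Tip:** If you only use a train/test split, it is advisable to compare the distributions of the training and test sets. If they differ considerably, you may run into generalization problems. Facets can be used to compare the distributions.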
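
A minimal sketch of the holdout idea, assuming the Iris dataset and two chained calls to `train_test_split` (both are illustrative choices, not part of the original post): the holdout set is split off first and left untouched, and the remainder is then divided 70/30 as before.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: set aside 10% as a holdout set that is never used for tuning
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Step 2: split the remaining data into train (70%) and test (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test), len(X_holdout))
```

The holdout set would only be scored once, after all tuning on the train/test split has finished.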

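2. Using k-Fold Cross-Validation (k-Fold CV)

To minimize sampling bias, we can approach model validation slightly differently.
What if, instead of making a single split, we made many splits and validated on all combinations of them?

This is where k-fold cross-validation comes in. It splits the data into k folds, trains on k-1 of them, and tests on the one fold that was left out. It does this for every combination and averages the results.
5-Fold Cross-Validation
Advantage: every observation is used for both training and validation, and each observation is used for validation exactly once. Typical choices are k = 5 or k = 10, since they strike a good balance between computational cost and validation accuracy.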

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

kf = KFold(n_splits=4)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Printed after the loop, so only the last split is shown
print("X_train:")
print(X_train)
print("X_test:")
print(X_test)

X_train:
[[1 2]
 [3 4]
 [1 2]]
X_test:
[[3 4]]

See also: cross_val_score cross-validation and its use for parameter selection, model selection, and feature selection
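
As a sketch of how the per-fold results are averaged in practice, `cross_val_score` runs the k-fold loop and returns one score per fold. The Iris dataset and the `LogisticRegression` model are assumptions chosen for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Iris is ordered by class, so shuffle before cutting it into folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)

# One accuracy value per fold; the mean is the k-fold CV estimate
scores = cross_val_score(model, X, y, cv=kf)
mean_score = scores.mean()
print(scores)
print(mean_score)
```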

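3. Leave-one-out Cross-Validation (LOOCV)

Theory: LOOCV - Leave-One-Out Cross-Validation
LOOCV is k-fold CV with k equal to the number of observations n: each observation is held out once while the model is trained on the remaining n-1.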

import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])

loo = LeaveOneOut()

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Printed after the loop, so only the last split is shown
print("X_train:")
print(X_train)
print("X_test:")
print(X_test)

X_train:
[[1 2]]
X_test:
[[3 4]]

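NOTE: LOOCV is computationally expensive because the model has to be trained n times. Only use it when the dataset is small or when you can afford that much computation.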
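
To make the cost in the note concrete, the sketch below runs LOOCV through `cross_val_score` on an assumed example (Iris with a `KNeighborsClassifier`, neither of which comes from the original post). The model is fitted once per observation, so 150 samples mean 150 fits, and each fold's accuracy is either 0 or 1 because the test set holds a single sample.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# One fit and one score per observation
scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print("number of fits:", len(scores))
print("LOOCV accuracy:", scores.mean())
```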

Leave-one-group-out Cross-Validation (LOGOCV)

Instead of leaving out a single observation, each split holds out all observations that belong to one group.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 1, 2])
groups = np.array([1, 1, 2, 2])  # group label of each sample

logo = LeaveOneGroupOut()

for train_index, test_index in logo.split(X, y, groups):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Printed after the loop, so only the last split is shown
print("X_train:")
print(X_train)
print("X_test:")
print(X_test)

X_train:
[[1 2]
 [3 4]]
X_test:
[[5 6]
 [7 8]]
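
The following sketch prints every split rather than only the last one, and checks that the held-out group never appears in the training data. The three group labels are hypothetical values made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 1, 2, 1, 2])
groups = np.array([1, 1, 2, 2, 3, 3])  # hypothetical group labels

logo = LeaveOneGroupOut()
n_splits = logo.get_n_splits(groups=groups)  # one split per distinct group

held_out = []
for train_index, test_index in logo.split(X, y, groups):
    train_groups = set(groups[train_index].tolist())
    test_groups = set(groups[test_index].tolist())
    # The held-out group must not leak into the training data
    assert train_groups.isdisjoint(test_groups)
    held_out.append(groups[test_index][0].item())
    print("train groups:", sorted(train_groups), "test group:", sorted(test_groups))
```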

4. Nested Cross-Validation

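When optimizing a model's hyperparameters, there is a risk of overfitting if the same k-fold CV strategy is used both to tune the model and to evaluate its performance.
Nested Cross-Validation separates the hyperparameter tuning step from the error estimation step. To do this, we nest two k-fold cross-validation loops:

The inner loop is used for hyperparameter tuning
The outer loop is used for estimating accuracy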

DEMO

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

X, y = load_iris(return_X_y=True)

# Hyperparameter grid searched by the inner loop
p_grid = {"C": [1, 10, 100],
          "gamma": [.01, .1]}

svc = SVC(kernel='rbf')

inner_cv = KFold(n_splits=2, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: hyperparameter tuning; outer loop: performance estimate
clf = GridSearchCV(estimator=svc, param_grid=p_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv).mean()
print(nested_score)

0.9800000000000001

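You are free to choose which cross-validation method to use in the inner and outer loops.
For example, if you want to split by specific groups, you can use Leave-one-group-out for both the inner and the outer loop.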
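
A rough sketch of that group-based variant, written as an explicit outer loop so that each inner search only sees the group labels of its own training portion. The group labels here are hypothetical (Iris has no natural groups), and passing `groups=` to `GridSearchCV.fit` is how this sketch forwards them to the inner splitter.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
groups = np.arange(len(y)) % 5  # hypothetical: 5 artificial groups

p_grid = {"C": [1, 10, 100], "gamma": [.01, .1]}

outer_cv = LeaveOneGroupOut()
outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups):
    # Inner loop: tune hyperparameters using only the outer training groups
    search = GridSearchCV(SVC(kernel="rbf"), param_grid=p_grid,
                          cv=LeaveOneGroupOut())
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    # Outer loop: score the tuned model on the group that was left out
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print(np.mean(outer_scores))
```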

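5. Time Series CV

What happens if you use k-Fold CV on time series data? Overfitting becomes a major concern, because the training data may contain information from the future. All training data must therefore come before the test data.
One way to validate time series data is to use k-fold CV while making sure that, in every fold, the training data precedes the test data.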

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=5)

for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Printed after the loop, so only the last split is shown
print("X_train:")
print(X_train)
print("X_test:")
print(X_test)

X_train:
[[1 2]
 [3 4]
 [1 2]
 [3 4]
 [1 2]]
X_test:
[[3 4]]

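Note: make sure the time series is sorted chronologically before training. TimeSeriesSplit is not given a time index, so it creates the splits purely from the order in which the records appear.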
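
The sketch below illustrates that note with a small, made-up DataFrame whose rows arrive out of order (the column names `timestamp` and `value` are hypothetical). The rows are sorted first, and every split is then printed so that the expanding training window is visible.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical records that are not in chronological order
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-03", "2024-01-01", "2024-01-06",
        "2024-01-02", "2024-01-05", "2024-01-04",
    ]),
    "value": [3.0, 1.0, 6.0, 2.0, 5.0, 4.0],
})

# Sort by time first: TimeSeriesSplit only looks at row order
df = df.sort_values("timestamp").reset_index(drop=True)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(df))
for train_index, test_index in splits:
    print("train:", train_index, "test:", test_index)
```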

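6. Comparing Models (very important)

When do you consider one model better than another? If one model's accuracy is only marginally higher than another's, is that sufficient reason to pick it as the best model? There are many ways to apply statistics to machine learning model selection.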

Wilcoxon signed-rank test

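When to use: the sample size is small and the data do not follow a normal distribution.

We can use this significance test to compare two machine learning models.
Using k-fold cross-validation, we can produce k accuracy scores for each model. This yields two samples, one per model.

We can then use the Wilcoxon signed-rank test to check whether the two samples differ significantly. If they do, one model is more accurate than the other.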

from scipy.stats import wilcoxon
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

X, y = load_iris(return_X_y=True)

model1 = ExtraTreesClassifier()
model2 = RandomForestClassifier()
# random_state only takes effect with shuffle=True
# (recent scikit-learn versions raise an error otherwise)
kf = KFold(n_splits=20, shuffle=True, random_state=42)

# Both models are scored on exactly the same folds
results_model1 = cross_val_score(model1, X, y, cv=kf)
results_model2 = cross_val_score(model2, X, y, cv=kf)

stat, p = wilcoxon(results_model1, results_model2, zero_method='zsplit')
print(p)

0.25065329208296216 (output of one run; the exact value varies because the classifiers are not seeded)

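The result is the p-value.
If it is below 0.05, we can reject the null hypothesis that there is no significant difference between the models.

**Note:** It is important to keep the same folds across the models, so that the samples are drawn from the same population.
This can be achieved simply by setting the same random_state in the cross-validation procedure.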

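7. McNemar's Test

McNemar's test is used to check how well the predictions of one model match those of another.
This is referred to as the homogeneity of the contingency table.
From that table we can compute χ², which in turn can be used to compute the p-value.
Again, if the p-value is below 0.05, we can reject the null hypothesis and conclude that one model performs significantly better than the other.
We can use the mlxtend library to build the table and compute the corresponding p-value.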

DEMO

import numpy as np
from mlxtend.evaluate import mcnemar_table, mcnemar


y_target = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])  # ground-truth labels


y_model1 = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,
                     0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1])  # predictions of model 1


y_model2 = np.array([0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
                     1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0])  # predictions of model 2


# 2x2 contingency table of correct/incorrect predictions
tb = mcnemar_table(y_target=y_target,
                   y_model1=y_model1,
                   y_model2=y_model2)
chi2, p = mcnemar(ary=tb, exact=True)

print('chi-squared:', chi2)
print('p-value:', p)

chi-squared: 3
p-value: 0.5078125
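
For intuition, the p-value above can be reproduced by hand with NumPy and SciPy. The sketch assumes the usual formulation of McNemar's test, in which only the discordant pairs matter: b counts the samples that model 1 gets right and model 2 gets wrong, and c counts the reverse. The exact variant is then a two-sided binomial test on b and c, and the continuity-corrected χ² shown alongside is one common approximation. The variable names are illustrative and are not part of mlxtend.

```python
import numpy as np
from scipy.stats import binom, chi2

y_target = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
y_model1 = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,
                     0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1])
y_model2 = np.array([0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
                     1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0])

m1_correct = y_model1 == y_target
m2_correct = y_model2 == y_target

# Discordant pairs: exactly one of the two models is correct
b = int(np.sum(m1_correct & ~m2_correct))  # model 1 right, model 2 wrong
c = int(np.sum(~m1_correct & m2_correct))  # model 1 wrong, model 2 right

# Exact test: two-sided binomial test on the discordant pairs
p_exact = min(2.0 * binom.cdf(min(b, c), b + c, 0.5), 1.0)

# Chi-squared approximation with continuity correction
chi2_stat = (abs(b - c) - 1) ** 2 / (b + c)
p_chi2 = chi2.sf(chi2_stat, df=1)

print("b =", b, "c =", c)
print("exact p-value:", p_exact)
print("chi-squared:", chi2_stat, "approximate p-value:", p_chi2)
```

With only nine discordant pairs, the exact test is the safer choice; the χ² approximation is usually reserved for larger counts.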

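8. 5x2CV paired t-test

The method works as follows. Suppose we have two classifiers, A and B. We randomly split the data into 50% training and 50% test. We train each model on the training data and compute the difference in accuracy between the two models on the test set, called DiffA. We then swap the training and test sets and compute the difference again, called DiffB.
This is repeated five times. For each repetition i we compute the variance of the two differences (S_i²), and the average of these variances is then used to compute the t-statistic:
$$ t = \frac{DiffA_1}{\sqrt{\frac{1}{5}\sum_{i=1}^{5} S_i^2}}, \qquad S_i^2 = (DiffA_i - m_i)^2 + (DiffB_i - m_i)^2, \quad m_i = \frac{DiffA_i + DiffB_i}{2} $$
where DiffA_1 is the accuracy difference DiffA obtained in the first iteration.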

from mlxtend.evaluate import paired_ttest_5x2cv
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

clf1 = ExtraTreeClassifier()
clf2 = DecisionTreeClassifier()

# t: t-statistic, p: two-sided p-value
t, p = paired_ttest_5x2cv(estimator1=clf1,
                          estimator2=clf2,
                          X=X, y=y,
                          random_seed=42)
print(p)
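
The procedure described in this section can also be sketched by hand. The code below follows the steps as stated (five random 50/50 splits, each used in both directions) and assumes the statistic is compared against a t-distribution with 5 degrees of freedom. The two classifiers, the stratified splitting, and the helper `accuracy_difference` are illustrative assumptions, and mlxtend's own implementation may differ in its details.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf_a = DecisionTreeClassifier(random_state=0)  # classifier A (assumed)
clf_b = KNeighborsClassifier()                  # classifier B (assumed)


def accuracy_difference(X_tr, y_tr, X_te, y_te):
    """Accuracy of A minus accuracy of B on the given test half."""
    acc_a = clf_a.fit(X_tr, y_tr).score(X_te, y_te)
    acc_b = clf_b.fit(X_tr, y_tr).score(X_te, y_te)
    return acc_a - acc_b


variances = []
first_diff = None
for i in range(5):
    # Random 50/50 split; each half serves once as train and once as test
    X1, X2, y1, y2 = train_test_split(
        X, y, test_size=0.5, random_state=i, stratify=y
    )
    diff_a = accuracy_difference(X1, y1, X2, y2)
    diff_b = accuracy_difference(X2, y2, X1, y1)
    mean_diff = (diff_a + diff_b) / 2
    variances.append((diff_a - mean_diff) ** 2 + (diff_b - mean_diff) ** 2)
    if first_diff is None:
        first_diff = diff_a  # numerator: DiffA of the first iteration

denominator = float(np.sqrt(np.mean(variances)))
# Guard against the degenerate case where all differences are identical
t_stat = first_diff / denominator if denominator > 0 else 0.0
p_value = float(2 * stats.t.sf(abs(t_stat), df=5))
print(t_stat, p_value)
```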

5x2CV combined F test

Alternatively, you can use the combined 5x2CV F-test (combined_ftest_5x2cv from mlxtend.evaluate), which was shown to be slightly more robust (Alpaydin, 1999).

from mlxtend.evaluate import combined_ftest_5x2cv
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

clf1 = ExtraTreeClassifier()
clf2 = DecisionTreeClassifier()

# f: F-statistic, p: p-value
f, p = combined_ftest_5x2cv(estimator1=clf1,
                            estimator2=clf2,
                            X=X, y=y,
                            random_seed=42)
print(p)

0.5101185529124018