[Cross-Validation] Bootstrapped 5-fold nested cross-validation

jc菜鸟教程

Last modified: 2023-12-25 11:47:25


Tags: learning, python

First published: 2023-11-23 16:39:09

Article link: https://blog.csdn.net/double_double_/article/details/134579527


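1. Nested cross-validation

(1) Traditional k-fold cross-validation

There is only one loop, which corresponds to the inner loop of the nested scheme described below.
The dataset is split into k parts (folds). Each fold is used once as the test data, while the remaining k-1 folds are used to train the model.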
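
As a minimal sketch of the single-loop procedure just described, the snippet below runs 5-fold cross-validation by hand. The synthetic data and the choice of `LogisticRegression` are assumptions made purely for illustration; they are not part of the original article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data (illustrative assumption): the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# One loop: each of the k folds serves as the test set exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kfold.split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])                      # train on k-1 folds
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # test on the held-out fold

mean_score = float(np.mean(fold_scores))
print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy:", round(mean_score, 3))
```

Note that no hyperparameters are tuned here; once tuning is added to this single loop, the reported score is no longer an independent estimate.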

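(2) Nested cross-validation

See also: Example 1

Use case: nested cross-validation is worth considering when (a) the computational cost is manageable and (b) the available hardware can support it. If both criteria are met, it gives a nearly unbiased estimate of the error, which makes it suitable for comparing the performance of different algorithms.

The procedure has two loops: an outer loop and an inner loop.
For each fold i of the outer K-fold split, a nested K-fold cross-validation is run.
The inner loop is a cross-validation that searches for the best hyperparameters of the model, for example with random search or grid search. Its purpose is to hand the best hyperparameters to the outer loop.
The outer loop supplies the training data to the inner loop and holds out part of the data to test the model selected by the inner loop.
Because the held-out data never takes part in the hyperparameter search, information leakage is avoided and the bias of the model score is kept relatively low.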
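
The two loops can be sketched compactly with scikit-learn by wrapping a `GridSearchCV` (inner loop) inside `cross_val_score` (outer loop). This is only one possible way to set it up; the estimator (`SVC`), the parameter grid, and the synthetic dataset are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic dataset (illustrative assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: hyperparameter search on the training part of each outer fold
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}  # hypothetical grid
search = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Outer loop: each outer test fold is never seen by the search that is scored on it
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)

print("Outer-fold accuracies:", nested_scores.round(3))
print("Nested CV estimate:", nested_scores.mean().round(3))
```

Each outer fold may select different hyperparameters; the nested score estimates the performance of the whole tuning procedure rather than of one fixed configuration.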

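Pseudocode:
Here, Require denotes the hyperparameters that must be supplied, and RandomSample($P_{sets}$) is a function that draws a random set of values from the hyperparameter grid.
[Figure: pseudocode for nested cross-validation]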
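
A hand-written sketch of the nested procedure with random sampling of hyperparameters might look like the following. It assumes that scikit-learn's `ParameterSampler` can stand in for the RandomSample step; the search space `param_space`, the classifier, and the dataset are hypothetical choices, not taken from the original pseudocode.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterSampler, StratifiedKFold

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
param_space = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}  # hypothetical search space

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_scores = []

for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Draw a random subset of candidate configurations (stand-in for RandomSample)
    candidates = list(ParameterSampler(param_space, n_iter=3, random_state=fold))
    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=fold)

    best_params, best_score = None, -np.inf
    for params in candidates:
        # Inner loop: score this configuration on the outer training data only
        scores = []
        for in_tr, in_val in inner_cv.split(X_tr, y_tr):
            clf = LogisticRegression(max_iter=1000, **params)
            clf.fit(X_tr[in_tr], y_tr[in_tr])
            scores.append(clf.score(X_tr[in_val], y_tr[in_val]))
        if np.mean(scores) > best_score:
            best_params, best_score = params, float(np.mean(scores))

    # Refit with the winning configuration on the full outer training set,
    # then score once on the untouched outer test fold
    final = LogisticRegression(max_iter=1000, **best_params).fit(X_tr, y_tr)
    outer_scores.append(final.score(X[test_idx], y[test_idx]))

print("Nested CV accuracy:", round(float(np.mean(outer_scores)), 3))
```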

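Ⅰ. Advantages

Hyperparameters are searched by estimating the generalization error of the base model, which yields the best parameters for the model. The outer loop then evaluates the generalization ability of the model.

Ⅱ. Disadvantages

The main cost of nested cross-validation is long training time and heavy computation, because every outer fold runs a full hyperparameter search of its own.

For comparison, using a single (non-nested) cross-validation both to tune hyperparameters and to report the error has two problems. First, it may cause information leakage. Second, because the error is estimated on the same data that was used to pick the hyperparameters, the estimate is biased: when the error of the presumably best hyperparameters is computed on the same training and test sets that selected them, the result is too optimistic. Model selection depends on both bias and variance, so a good method for estimating the true error should be both nearly unbiased and low in variance.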
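
To make the computational cost concrete, the small helper below counts how many models are trained. The function name `nested_cv_fit_count` and the assumption that the best configuration is refit once per outer fold are illustrative choices, not something stated in the original article.

```python
def nested_cv_fit_count(k_outer, k_inner, n_candidates, refit=True):
    """Number of model fits in nested CV (hypothetical helper for illustration)."""
    # Per outer fold: every candidate is trained once per inner fold,
    # plus one optional refit of the best candidate on the outer training set.
    per_outer_fold = k_inner * n_candidates + (1 if refit else 0)
    return k_outer * per_outer_fold


nested_fits = nested_cv_fit_count(k_outer=5, k_inner=5, n_candidates=10)
single_loop_fits = 5 * 10 + 1  # one 5-fold search over 10 candidates plus one refit

print(nested_fits)       # 255
print(single_loop_fits)  # 51
```

With 5 outer folds the nested scheme costs five times as much as a single search, which is why it is mainly practical when each model trains quickly.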

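2. Bootstrap

Background:
Our projects do not always have plenty of data. Often there is only one sample dataset to work with, and a lack of resources prevents us from running repeated experiments (for example, A/B tests).

Fortunately, resampling methods let us make the most of the data we do have. Bootstrapping is a resampling technique that addresses this problem.

Algorithm details:

Bootstrapping is a statistical resampling technique that creates multiple subsets of the data by sampling with replacement. By repeatedly resampling from the dataset, it lets you estimate the variability of a statistic (for example, a model performance metric).

In short: sample at random, with replacement, and repeat many times.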
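
As a small worked example of the idea, the sketch below bootstraps the mean of a single sample to estimate its standard error and a percentile confidence interval. The simulated sample and the choice of 2000 resamples are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=50)  # the one dataset we have (simulated here)

n_boot = 2000  # number of bootstrap resamples (illustrative choice)
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Draw a resample of the same size, with replacement
    resampled = rng.choice(sample, size=sample.size, replace=True)
    boot_means[b] = resampled.mean()

std_error = boot_means.std(ddof=1)                        # bootstrap standard error of the mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # 95% percentile interval

print("Sample mean:", round(float(sample.mean()), 3))
print("Bootstrap standard error:", round(float(std_error), 3))
print("95% interval:", (round(float(ci_low), 3), round(float(ci_high), 3)))
```

The same pattern applies to a model metric: replace the mean with the metric computed on each resample.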

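Bootstrapped cross-validation: in the context of cross-validation, bootstrapping means that the evaluation uses not a single set of folds but multiple sets created by resampling. For each fold of the outer cross-validation loop, the training set is generated by sampling with replacement from the entire dataset.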
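
One way to build such a split is sketched below: the training indices are drawn with replacement, and the rows that were never drawn (the "out-of-bag" rows) serve as the test set. Using the out-of-bag rows for testing is an assumption on my part; the text above only specifies how the training set is generated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000
all_idx = np.arange(n_samples)

# Training set: n_samples draws with replacement, so some rows repeat
train_idx = rng.choice(all_idx, size=n_samples, replace=True)

# Test set: rows that were never drawn (out-of-bag), so train and test cannot overlap
oob_idx = np.setdiff1d(all_idx, train_idx)

n_unique_train = np.unique(train_idx).size
print("Unique rows in training set:", n_unique_train)
print("Out-of-bag rows for testing:", oob_idx.size)
```

When the bootstrap sample has the same size as the dataset, roughly 63% of the distinct rows end up in the training set on average, leaving about 37% out-of-bag.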

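3. Bootstrapped 5-Fold Nested Cross-Validation

(1) Bootstrapped 5-Fold

In this specific case, the outer loop of the cross-validation has 5 folds. Each fold represents a different train-test split, and the training set of each fold is generated by bootstrapping.

(2) Nested Cross-Validation

Inside each outer fold there is an inner cross-validation loop. The inner loop usually runs a fixed number of folds as well (for example, 5) and is used for hyperparameter tuning or model selection. This adds another layer of robustness to the evaluation.

Python implementation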

pip install scikit-learn tensorflow

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from tensorflow import keras
from tensorflow.keras import layers

# Generate some example data (features and labels are random, so accuracy will hover around 0.5)
np.random.seed(42)
X, y = np.random.rand(1000, 10), np.random.randint(2, size=(1000,))

# Define the deep learning model using Keras; hidden_units is the hyperparameter tuned below
def create_model(hidden_units=64):
    model = keras.Sequential([
        keras.Input(shape=(10,)),
        layers.Dense(hidden_units, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Train a fresh model on (X_fit, y_fit) and return its accuracy on (X_eval, y_eval)
def fit_and_score(hidden_units, X_fit, y_fit, X_eval, y_eval):
    model = create_model(hidden_units)
    model.fit(X_fit, y_fit, epochs=10, batch_size=32, verbose=0)
    y_pred = (model.predict(X_eval, verbose=0) > 0.5).astype(int).ravel()
    return accuracy_score(y_eval, y_pred)

# Candidate hyperparameter values (illustrative; adjust to your problem)
hidden_units_grid = [16, 64]

# Define the outer cross-validation loop (5 folds)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Initialize a list to store the accuracy of each outer fold
outer_accuracy = []

##--------------------outer loop--------------------
for train_index, test_index in outer_cv.split(X, y):
    # Split the data into training and test sets for the outer fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Define the inner cross-validation loop (5 folds)
    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    mean_inner_accuracy = []

    ##--------------------inner loop--------------------
    for hidden_units in hidden_units_grid:
        # Score this candidate on every inner fold, using outer training data only
        inner_accuracy = []
        for inner_train_index, inner_val_index in inner_cv.split(X_train, y_train):
            inner_accuracy.append(fit_and_score(
                hidden_units,
                X_train[inner_train_index], y_train[inner_train_index],
                X_train[inner_val_index], y_train[inner_val_index]))
        mean_inner_accuracy.append(np.mean(inner_accuracy))

    # Choose the hyperparameter value with the highest mean inner accuracy
    best_hidden_units = hidden_units_grid[int(np.argmax(mean_inner_accuracy))]

    # Train on the entire training set of the outer fold, then evaluate on its test set
    outer_acc = fit_and_score(best_hidden_units, X_train, y_train, X_test, y_test)
    outer_accuracy.append(outer_acc)

# Print the average accuracy across all outer folds
print("Average Accuracy:", np.mean(outer_accuracy))

More complete Python code

The outer loop can be modified so that each fold trains on a bootstrap sample.

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import accuracy_score
from tensorflow import keras
from tensorflow.keras import layers

# Generate some example data (features and labels are random, so accuracy will hover around 0.5)
np.random.seed(42)
X, y = np.random.rand(1000, 10), np.random.randint(2, size=(1000,))

# Define the deep learning model using Keras; hidden_units is the hyperparameter tuned below
def create_model(hidden_units=64):
    model = keras.Sequential([
        keras.Input(shape=(10,)),
        layers.Dense(hidden_units, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Train a fresh model on (X_fit, y_fit) and return its accuracy on (X_eval, y_eval)
def fit_and_score(hidden_units, X_fit, y_fit, X_eval, y_eval):
    model = create_model(hidden_units)
    model.fit(X_fit, y_fit, epochs=10, batch_size=32, verbose=0)
    y_pred = (model.predict(X_eval, verbose=0) > 0.5).astype(int).ravel()
    return accuracy_score(y_eval, y_pred)

# Candidate hyperparameter values (illustrative; adjust to your problem)
hidden_units_grid = [16, 64]

# Define the outer loop (5 rounds) with bootstrapping
n_outer_folds = 5
n_bootstrap_samples = 100  # Size of each bootstrap sample; you can adjust it

# Initialize a list to store the accuracy of each outer fold
outer_accuracy = []

for _ in range(n_outer_folds):
    # Draw the training set of this outer fold with replacement
    bootstrap_indices = np.random.choice(len(X), size=n_bootstrap_samples, replace=True)
    # Test on the rows that were never drawn (out-of-bag), so that no row is
    # both trained on and tested on
    oob_indices = np.setdiff1d(np.arange(len(X)), bootstrap_indices)
    X_train, y_train = X[bootstrap_indices], y[bootstrap_indices]
    X_test, y_test = X[oob_indices], y[oob_indices]

    # Inner cross-validation loop (5 folds); grouping by original row index keeps
    # duplicated rows of the bootstrap sample inside the same inner fold
    inner_cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
    mean_inner_accuracy = []

    for hidden_units in hidden_units_grid:
        # Score this candidate on every inner fold, using outer training data only
        inner_accuracy = []
        for inner_train_index, inner_val_index in inner_cv.split(
                X_train, y_train, groups=bootstrap_indices):
            inner_accuracy.append(fit_and_score(
                hidden_units,
                X_train[inner_train_index], y_train[inner_train_index],
                X_train[inner_val_index], y_train[inner_val_index]))
        mean_inner_accuracy.append(np.mean(inner_accuracy))

    # Choose the hyperparameter value with the highest mean inner accuracy
    best_hidden_units = hidden_units_grid[int(np.argmax(mean_inner_accuracy))]

    # Train on the entire training set of the outer fold, then evaluate on its test set
    outer_acc = fit_and_score(best_hidden_units, X_train, y_train, X_test, y_test)
    outer_accuracy.append(outer_acc)

# Print the average accuracy across all outer folds
print("Average Accuracy:", np.mean(outer_accuracy))
