When standardizing with StandardScaler(), why use fit_transform() on the training set but transform() on the test set?

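Parts of this post are quoted from: https://sebastianraschka.com/faq/docs/scale-training-test.html
Note: the quoted passages are the original English text. The translated passages are my own rendering, based on my own calculations and understanding, so they differ from the original in places.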

Why do we need to re-use training parameters to transform test data?

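Translation: Why do we need to derive the parameters from the training set and reuse them to scale the test set? This post discusses why, when standardizing with StandardScaler(), we call fit_transform() on the training set (X_train) but transform() on the test set (X_test).
That is, code like the following: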

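from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Use fit_transform() on the training set: learn the mean and std, then scale
X_train_scaled = scaler.fit_transform(X_train)

# Use transform() on the test set: reuse the mean and std learned from X_train
X_test_scaled = scaler.transform(X_test)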
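To make the difference between the two calls concrete, here is a minimal sketch using scikit-learn and NumPy. The toy arrays are hypothetical values chosen for illustration; `mean_` and `scale_` are the attributes in which a fitted `StandardScaler` stores the per-feature mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, three training rows, two test rows.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

scaler = StandardScaler()

# fit_transform() = fit() (learn mean_ and scale_) + transform() (apply them).
X_train_scaled = scaler.fit_transform(X_train)
print(scaler.mean_, scaler.scale_)  # statistics learned from X_train only

# transform() does not re-fit: it reuses the training mean_ and scale_.
X_test_scaled = scaler.transform(X_test)

# The same result computed by hand from the stored training statistics.
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_scaled, manual))
```

Calling `fit_transform()` on the test set instead would overwrite `mean_` and `scale_` with test-set statistics, which is exactly what the rest of this post argues against.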

In practice, I’ve seen many ways for scaling a dataset prior to feeding it to a learning algorithm. Can you guess which one is “correct?”

Translation: Below are three ways to scale a dataset. Which one is correct?

Scenario 1:

scaled_dataset = (dataset - dataset_mean) / dataset_std_deviation

train, test = split(scaled_dataset)

Scenario 2:

train, test = split(dataset)

scaled_train = (train - train_mean) / train_std_deviation

scaled_test = (test - test_mean) / test_std_deviation

Scenario 3:

scaled_train = (train - train_mean) / train_std_deviation

scaled_test = (test - train_mean) / train_std_deviation
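The three scenarios can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper `standardize` is hypothetical, the "split" is simply the first three values versus the last three, and the numbers are the lengths used in this post's worked example.

```python
from statistics import fmean, pstdev

def standardize(values, mean, std):
    """Scale values with the given mean and standard deviation."""
    return [(v - mean) / std for v in values]

# Assumed data; the "split" is first three values / last three values.
dataset = [10.0, 20.0, 30.0, 5.0, 6.0, 7.0]
train, test = dataset[:3], dataset[3:]

# Scenario 1: scale with statistics of the whole dataset, then split.
all_scaled = standardize(dataset, fmean(dataset), pstdev(dataset))
s1_train, s1_test = all_scaled[:3], all_scaled[3:]

# Scenario 2: scale train and test separately, each with its own statistics.
s2_train = standardize(train, fmean(train), pstdev(train))
s2_test = standardize(test, fmean(test), pstdev(test))

# Scenario 3: scale both with the training statistics only.
s3_train = standardize(train, fmean(train), pstdev(train))
s3_test = standardize(test, fmean(train), pstdev(train))

for name, scaled_test in [("1", s1_test), ("2", s2_test), ("3", s3_test)]:
    print("Scenario", name, [round(v, 2) for v in scaled_test])
```

In Scenario 1 the test values influence how the training values are scaled, and in Scenario 2 the test values lose their position relative to the training data. Only Scenario 3 keeps the test set on the same scale as the training set without using any test-set statistics.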

That’s right, the “correct” way is Scenario 3. I agree, it may look a bit odd to use the training parameters and re-use them to scale the test dataset. (Note that in practice, if the dataset is sufficiently large, we wouldn’t notice any substantial difference between the scenarios 1-3 because we assume that the samples have all been drawn from the same distribution.)

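Translation: The correct approach is Scenario 3: derive the parameters from the training set and reuse them to scale the test set. (Note that in practice, if the dataset is large enough, Scenarios 1-3 show no substantial difference, because we assume all samples are drawn from the same distribution.)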
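The remark about large datasets can be checked with a small simulation. The sketch below assumes a normal distribution with mean 20 and standard deviation 8 (arbitrary choices) and an 80/20 split; `train_test_gap` is a hypothetical helper, and the exact numbers depend on the random seed.

```python
import random
from statistics import fmean, pstdev

def train_test_gap(n, seed=0):
    """Draw n samples from one normal distribution, split 80/20, and return
    the absolute gaps between the train and test mean and standard deviation."""
    rng = random.Random(seed)
    data = [rng.gauss(20.0, 8.0) for _ in range(n)]
    cut = n * 4 // 5
    train, test = data[:cut], data[cut:]
    return abs(fmean(train) - fmean(test)), abs(pstdev(train) - pstdev(test))

for n in (10, 1_000, 100_000):
    mean_gap, std_gap = train_test_gap(n)
    print(f"n={n:>7}: mean gap={mean_gap:.3f}, std gap={std_gap:.3f}")
```

As n grows, the train and test statistics converge toward the same values, so the three scenarios produce nearly identical scaled data. With small samples the gaps can be large, which is where the choice of scenario matters.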

Again, why Scenario 3? The reason is that we want to pretend that the test data is “new, unseen data.” We use the test dataset to get a good estimate of how our model performs on any new data.

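Translation: The reason for choosing Scenario 3 is that we treat the test set as new, unseen data. We use the test set to estimate how the model performs on new data.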

Now, in a real application, the new, unseen data could be just 1 data point that we want to classify. (How do we estimate mean and standard deviation if we have only 1 data point?) That’s an intuitive case to show why we need to keep and use the training data parameters for scaling the test set.

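Translation: In a real application, the new, unseen data may be just a single data point rather than a batch. (With only one data point, we cannot estimate a mean and standard deviation.) This is an intuitive example of why we need to keep the parameters derived from the training set and use them to scale the test set.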
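The single-data-point case is easy to reproduce. In this minimal sketch the training lengths and the new measurement are assumed values; the point is only that one sample has a standard deviation of zero, so it cannot be standardized with its own statistics.

```python
from statistics import fmean, pstdev

train = [10.0, 20.0, 30.0]   # assumed training lengths in cm
new_point = [7.0]            # one new, unseen measurement

# Statistics of a single point: the mean is the point itself and the
# population standard deviation is 0, so standardizing divides by zero.
try:
    own_scaled = (new_point[0] - fmean(new_point)) / pstdev(new_point)
except ZeroDivisionError:
    own_scaled = None
    print("cannot standardize one point with its own statistics")

# The stored training statistics work for any number of new points.
train_mean, train_std = fmean(train), pstdev(train)
scaled = (new_point[0] - train_mean) / train_std
print(round(scaled, 2))
```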

To recapitulate: If we standardize our training dataset, we need to keep the parameters (mean and standard deviation for each feature). Then, we’d use these parameters to transform our test data and any future data later on.

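Translation: To summarize: if we standardize our training set, we need to keep the parameters derived from it (the mean and standard deviation of each feature).
We then use these parameters to transform our test set and any data that arrives later.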
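What "keeping the parameters" means can be sketched with a tiny hand-written scaler. The class `SimpleStandardizer` below is hypothetical and handles a single feature; it only mimics the fit/transform split and is not how scikit-learn implements `StandardScaler`.

```python
from statistics import fmean, pstdev

class SimpleStandardizer:
    """Hypothetical one-feature scaler, not scikit-learn's implementation."""

    def fit(self, values):
        # Learn and keep the parameters from the training data.
        self.mean_ = fmean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the stored parameters; nothing is re-estimated here.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)

scaler = SimpleStandardizer()
train_scaled = scaler.fit_transform([10.0, 20.0, 30.0])  # training set
test_scaled = scaler.transform([5.0, 6.0, 7.0])          # test set
future_scaled = scaler.transform([12.0])                 # any later data point
```

Because `transform()` only reads `mean_` and `std_`, the same fitted object can be reused for the test set and for any data that shows up after training.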

Let’s imagine we have a simple training set consisting of 3 samples with 1 feature column (let’s call the feature column “length in cm”):

sample1: 10 cm -> class 2
sample2: 20 cm -> class 2
sample3: 30 cm -> class 1

Translation: Suppose our training set has only three samples and one feature (length, in cm):

Sample 1: 10 cm -> class 2
Sample 2: 20 cm -> class 2
Sample 3: 30 cm -> class 1

Given the data above, we compute the following parameters:

mean: 20
standard deviation: 8.2

Translation: From the data above, we compute the following parameters:

mean: 20
standard deviation: 8.2

If we use these parameters to standardize the same dataset, we get the following values:

sample1: -1.21 -> class 2
sample2: 0 -> class 2
sample3: 1.21 -> class 1

Translation: If we use these parameters to standardize the training set, we get the following values:

Sample 1: -1.21 -> class 2
Sample 2: 0 -> class 2
Sample 3: 1.21 -> class 1
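The arithmetic behind these numbers can be reproduced with Python's standard library. The sketch assumes the population standard deviation (dividing by n), which is what yields roughly 8.2 here; the sample standard deviation (dividing by n - 1) would give 10 instead. Computed without intermediate rounding, the standardized values are about ±1.2247, so the ±1.21 quoted above is a slightly coarser figure.

```python
from statistics import fmean, pstdev, stdev

lengths_cm = [10.0, 20.0, 30.0]

mean = fmean(lengths_cm)
pop_std = pstdev(lengths_cm)     # divides by n, matches the 8.2 in the text
sample_std = stdev(lengths_cm)   # divides by n - 1, would give 10 instead

z_scores = [(x - mean) / pop_std for x in lengths_cm]
print(mean, round(pop_std, 2), round(sample_std, 2))
print([round(z, 2) for z in z_scores])
```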

Now, let’s say our model has learned the following hypotheses: It classifies samples with a standardized length value < 0.6 as class 2 (class 1 otherwise). So far so good. Now, let’s imagine we have 3 new unlabelled data points that you want to classify.

sample4: 5 cm -> class ?
sample5: 6 cm -> class ?
sample6: 7 cm -> class ?

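Translation: Suppose the model has learned this rule: a sample whose standardized length is less than 0.6 is assigned to class 2 (otherwise class 1). Now there are 3 new data points to classify:

Sample 4: 5 cm -> class ?
Sample 5: 6 cm -> class ?
Sample 6: 7 cm -> class ?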

If we look at the unstandardized “length in cm” values in our training dataset, it is intuitive to say that all of these samples likely belong to class 2. However, if we standardize these by re-computing the standard deviation and mean from the new data, we would get values similar to those in the training set (i.e., the properties of a standard normal distribution), and our classifier would assign the “class 2” label only to samples 4 and 5, and (probably incorrectly) the “class 1” label to sample 6.

sample4: -1.21 -> class 2
sample5: 0 -> class 2
sample6: 1.21 -> class 1

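Translation: If we look at the original, unstandardized training set:

Sample 1: 10 cm -> class 2
Sample 2: 20 cm -> class 2
Sample 3: 30 cm -> class 1

then samples 4, 5, and 6 most likely all belong to class 2 (their lengths are all below the 10 cm of sample 1).

Sample 4: 5 cm -> class 2
Sample 5: 6 cm -> class 2
Sample 6: 7 cm -> class 2

The mean and standard deviation computed on the test set are:

mean: 6
standard deviation: 0.82

If we standardize samples 4, 5, and 6 with the mean and standard deviation computed on the test set, we get values similar to the standardized training set, and the classification now differs from what we expected (sample 6 is wrongly classified as class 1):

Sample 4: -1.21 -> class 2
Sample 5: 0 -> class 2
Sample 6: 1.21 -> class 1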
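This failure mode can be reproduced in a few lines. The function `classify` is a hypothetical stand-in for the learned rule (standardized length < 0.6 means class 2), and the population standard deviation is assumed, as in the figures above.

```python
from statistics import fmean, pstdev

def classify(z):
    """Decision rule from the text: standardized length < 0.6 -> class 2."""
    return 2 if z < 0.6 else 1

new_lengths_cm = [5.0, 6.0, 7.0]   # samples 4, 5, 6

# Re-estimate mean and std from the new data itself (the problematic way).
own_mean, own_std = fmean(new_lengths_cm), pstdev(new_lengths_cm)
own_z = [(x - own_mean) / own_std for x in new_lengths_cm]
own_labels = [classify(z) for z in own_z]

print(round(own_mean, 2), round(own_std, 2))
print([round(z, 2) for z in own_z], own_labels)
```

The new points end up with the same standardized values as the training points, even though their raw lengths are far smaller. The information that all three are shorter than every training sample is lost, and sample 6 lands on the class 1 side of the threshold.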

However, if we use the parameters from our “training set standardization,” we will get the following standardized values:

sample4: -18.37
sample5: -17.15
sample6: -15.92

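Translation: However, if we derive the parameters from the training set and reuse them to scale samples 4, 5, and 6, we get the following values. (My values differ from the original: the quoted figures appear to be ten times too large in magnitude, since for example (5 - 20) / 8.2 ≈ -1.83. This does not affect the argument. The classes follow the rule "a standardized length below 0.6 means class 2, otherwise class 1".)

Sample 4: -1.83 -> class 2
Sample 5: -1.71 -> class 2
Sample 6: -1.59 -> class 2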
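The same check with the training parameters looks like this. Again, `classify` is a hypothetical stand-in for the learned rule. The sketch uses the unrounded training standard deviation (about 8.165), so the first value comes out as -1.84 rather than the -1.83 obtained with the rounded 8.2.

```python
from statistics import fmean, pstdev

def classify(z):
    """Decision rule from the text: standardized length < 0.6 -> class 2."""
    return 2 if z < 0.6 else 1

train_lengths_cm = [10.0, 20.0, 30.0]   # samples 1, 2, 3
new_lengths_cm = [5.0, 6.0, 7.0]        # samples 4, 5, 6

# Reuse the mean and std learned from the training set.
train_mean, train_std = fmean(train_lengths_cm), pstdev(train_lengths_cm)
new_z = [(x - train_mean) / train_std for x in new_lengths_cm]
new_labels = [classify(z) for z in new_z]

print([round(z, 2) for z in new_z], new_labels)
```

All three new samples now sit below the standardized value of sample 1 and are assigned to class 2, which matches the intuition from the raw lengths.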

Note that these values are more negative than the value of sample1 in the original training set, which makes much more sense now!

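Translation: Note that these values are smaller than the standardized value of sample 1 in the original training set, which agrees with the classification mentioned earlier:

If we look at the original, unstandardized training set:

Sample 1: 10 cm -> class 2
Sample 2: 20 cm -> class 2
Sample 3: 30 cm -> class 1

then samples 4, 5, and 6 most likely all belong to class 2 (their lengths are all below the 10 cm of sample 1).

Sample 4: 5 cm -> class 2
Sample 5: 6 cm -> class 2
Sample 6: 7 cm -> class 2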

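Conclusion:
In summary, the reasons for deriving the parameters from the training set and reusing them to scale the test set are:

Extreme case: with only a handful of data points, the computed mean and standard deviation are not reliable estimates.
Scaling the test set with its own mean (mean1) and standard deviation (sd1) assumes the test set comes from a distribution with mean mean1 and standard deviation sd1. This amounts to fitting the preprocessing to the test set itself: whenever new test data is added, the mean and standard deviation have to be recomputed, so the two parameters track the test data ever more closely, which invites overfitting.
Scaling the test set with parameters computed on the training set (that is, assuming the training and test sets come from the same distribution) makes the trained model more general (generalization).

The conclusion above is my own summary. Corrections are welcome, and I am happy to discuss.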