sklearn: An Overview and Usage Guide for Standardization, Scaling, and Normalization in sklearn.preprocessing






Overview of Standardization, Scaling, and Normalization

1、Standardization, or mean removal and variance scaling

1.1、Scaling features to a range

1.2、Scaling sparse data

1.3、Scaling data with outliers

1.4、Scaling vs Whitening

1.5、Centering kernel matrices

2、Normalization





Overview of Standardization, Scaling, and Normalization


The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate. The behaviors of the different scalers, transformers, and normalizers on a dataset containing marginal outliers are highlighted in Compare the effect of different scalers on data with outliers.




1、Standardization, or mean removal and variance scaling

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.



For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

The function scale provides a quick and easy way to perform this operation on a single array-like dataset:


from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X_train)
print(X_scaled)
print(X_scaled.mean(axis=0))  # per-feature means are 0
print(X_scaled.std(axis=0))   # per-feature standard deviations are 1

Scaled data has zero mean and unit variance:

The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:


The scaler instance can then be used on new data to transform it the same way it did on the training set:

It is possible to disable either centering or scaling by either passing with_mean=False or with_std=False to the constructor of StandardScaler.



scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler.mean_)              # per-feature means learned from X_train
print(scaler.transform(X_train))

X_test = [[-1., 1., 0.]]
print(scaler.transform(X_test))  # the same transformation applied to new data


1.1、Scaling features to a range

An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.

The motivation to use this scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data.

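A minimal sketch of both range scalers, reusing the shape of the earlier example (the data matrix is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

# MinMaxScaler maps each feature to [0, 1] by default
min_max_scaler = MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X_train)
print(X_minmax)

# MaxAbsScaler divides each feature by its maximum absolute value,
# mapping values into [-1, 1] and leaving zero entries untouched
max_abs_scaler = MaxAbsScaler()
X_maxabs = max_abs_scaler.fit_transform(X_train)
print(X_maxabs)
```

Both scalers follow the same fit/transform API as StandardScaler, so a scaler fitted on the training set can later be applied to test data.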



1.2、Scaling sparse data

Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.

MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended way to go about this. However, scale and StandardScaler can accept scipy.sparse matrices as input, as long as with_mean=False is explicitly passed to the constructor. Otherwise a ValueError will be raised, as silently centering would break the sparsity and would often crash the execution by unintentionally allocating excessive amounts of memory. RobustScaler cannot be fitted to sparse inputs, but you can use the transform method on sparse inputs.



Note that the scalers accept both Compressed Sparse Rows and Compressed Sparse Columns format (see scipy.sparse.csr_matrix and scipy.sparse.csc_matrix). Any other sparse input will be converted to the Compressed Sparse Rows representation. To avoid unnecessary memory copies, it is recommended to choose the CSR or CSC representation upstream.

Finally, if the centered data is expected to be small enough, explicitly converting the input to an array using the toarray method of sparse matrices is another option.

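A short sketch of these rules on a toy sparse matrix (the data is illustrative):

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

X_sparse = csr_matrix([[ 1., 0.,  2.],
                       [ 0., 0., -4.],
                       [ 0., 3.,  0.]])

# MaxAbsScaler works on sparse input and keeps the zero entries intact
X_scaled = MaxAbsScaler().fit_transform(X_sparse)
print(X_scaled.toarray())

# StandardScaler accepts sparse input only with with_mean=False;
# with the default with_mean=True it refuses to silently densify
StandardScaler(with_mean=False).fit_transform(X_sparse)
try:
    StandardScaler().fit_transform(X_sparse)
except ValueError as exc:
    print("ValueError:", exc)
```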


1.3、Scaling data with outliers

If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates for the center and range of your data.

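A brief sketch with an artificial outlier, contrasting the two scalers (the data is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature whose last sample is a large outlier
X = np.array([[1.], [2.], [3.], [4.], [100.]])

# StandardScaler's mean and std are pulled toward the outlier ...
X_std = StandardScaler().fit_transform(X)
print(X_std.ravel())

# ... while RobustScaler centers on the median and scales by the
# interquartile range, so the bulk of the data keeps a sensible spread
X_robust = RobustScaler().fit_transform(X)
print(X_robust.ravel())
```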


Further discussion on the importance of centering and scaling data is available on this FAQ: Should I normalize/standardize/rescale the data?



1.4、Scaling vs Whitening

It is sometimes not enough to center and scale the features independently, since a downstream model can further make some assumption on the linear independence of the features.

To address this issue you can use sklearn.decomposition.PCA with whiten=True to further remove the linear correlation across features.


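A minimal sketch of whitening on two strongly correlated features (the synthetic data is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x = rng.normal(size=200)
# Second feature is almost a linear function of the first
X = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])

# whiten=True rescales the projected components to unit variance,
# removing the linear correlation between the output features
X_white = PCA(whiten=True).fit_transform(X)
print(np.cov(X_white, rowvar=False).round(2))  # close to the identity matrix
```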

1.5、Centering kernel matrices

If you have a kernel matrix of a kernel K that computes a dot product in a feature space defined by function ϕ, a KernelCenterer can transform the kernel matrix so that it contains inner products in the feature space defined by ϕ followed by removal of the mean in that space.

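A small sketch with a linear kernel, where the centered kernel matrix can be checked against explicitly centering the data (the data is illustrative):

```python
import numpy as np
from sklearn.preprocessing import KernelCenterer
from sklearn.metrics.pairwise import linear_kernel

X = np.array([[ 1., -2.,  2.],
              [-2.,  1.,  3.],
              [ 4.,  1., -2.]])
K = linear_kernel(X)  # kernel matrix of dot products in feature space

# KernelCenterer removes the mean in the (implicit) feature space
# without ever computing the feature map itself
K_centered = KernelCenterer().fit_transform(K)
print(K_centered)

# For a linear kernel this is equivalent to centering X explicitly
X_centered = X - X.mean(axis=0)
print(X_centered @ X_centered.T)
```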


2、Normalization

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.



The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, either using the l1 or l2 norms:


The preprocessing module further provides a utility class Normalizer that implements the same operation using the Transformer API (even though the fit method is useless in this case: the class is stateless, as this operation treats samples independently).

This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:




The normalizer instance can then be used on sample vectors as any transformer:

Note: L2 normalization is also known as spatial sign preprocessing.




X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
print(X_normalized)              # each row now has unit L2 norm

normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing: Normalizer is stateless
print(normalizer.transform(X))
print(normalizer.transform([[-1.,  1., 0.]]))


Sparse input

normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as input.

For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix) before being fed to efficient Cython routines. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.


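A quick sketch of normalize on sparse input (the data is illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

X_sparse = csr_matrix([[ 1., 0.,  2.],
                       [ 0., 0., -1.],
                       [ 0., 3.,  4.]])

# normalize handles sparse matrices directly; each row is scaled to
# unit L2 norm and the sparsity pattern is preserved
X_normalized = normalize(X_sparse, norm='l2')
print(X_normalized.toarray())
```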




