sklearn.preprocessing
https://scikit-learn.org/stable/modules/preprocessing.html
结合sklearn来学习一下数据的预处理过程:
安装 pip install -U scikit-learn
sklearn源码位置:
C:\Users\chen\AppData\Local\Programs\Python\Python37\Lib\site-packages\sklearn\preprocessing
数据的标准化处理:大多数scikit库都需要将数据进行标准化处理:
Gaussian with zero mean and unit variance 均值为0 单位方差的高斯分布数据
1. 使用StandardScaler()可以实现标准化数据 均值=0 单位方差:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]])
scaler = preprocessing.StandardScaler().fix(X_train)
scaler.transform(X_train)
scaler.mean_
scaler.scale_
源码位置:data.py
构造函数: 是否复制 是否均值化 是否单位方差化
Parameters
----------
copy : boolean, optional, default True
If False, try to avoid a copy and do inplace scaling instead.
This is not guaranteed to always work inplace; e.g. if the data is
not a NumPy array or scipy.sparse CSR matrix, a copy may still be
returned.
with_mean : boolean, True by default
If True, center the data before scaling.
This does not work (and will raise an exception) when attempted on
sparse matrices, because centering them entails building a dense
matrix which in common use cases is likely to be too large to fit in
memory.
with_std : boolean, True by default
If True, scale the data to unit variance (or equivalently,
unit standard deviation).
def __init__(self, copy=True, with_mean=True, with_std=True):
self.with_mean = with_mean
self.with_std = with_std
self.copy = copy
fit函数 求解均值以及方差:
def fit(self, X, y=None):
"""Compute the mean and std to be used for later scaling.
Parameters
----------
X : {array-like, sparse matrix}, shape [n_samples, n_features]
The data used to compute the mean and standard deviation
used for later scaling along the features axis.
y
Ignored
"""
# Reset internal state before fitting
self._reset()
return self.partial_fit(X, y)
transform函数:进行均值和方差转化 Perform standardization by centering and scaling
def transform(self, X, y='deprecated', copy=None):
"""Perform standardization by centering and scaling
Parameters
----------
X : array-like, shape [n_samples, n_features]
The data used to scale along the features axis.
y : (ignored)
.. deprecated:: 0.19
This parameter will be removed in 0.21.
copy : bool, optional (default: None)
Copy the input X or not.
"""
if not isinstance(y, string_types) or y != 'deprecated':
warnings.warn("The parameter y on transform() is "
"deprecated since 0.19 and will be removed in 0.21",
DeprecationWarning)
check_is_fitted(self, 'scale_')
copy = copy if copy is not None else self.copy
X = check_array(X, accept_sparse='csr', copy=copy, warn_on_dtype=True,
estimator=self, dtype=FLOAT_DTYPES,
force_all_finite='allow-nan')
if sparse.issparse(X):
if self.with_mean:
raise ValueError(
"Cannot center sparse matrices: pass `with_mean=False` "
"instead. See docstring for motivation and alternatives.")
if self.scale_ is not None:
inplace_column_scale(X, 1 / self.scale_)
else:
if self.with_mean:
X -= self.mean_
if self.with_std:
X /= self.scale_
return X
2.归一化数据,将特征数据归一化于某个范围 MinMaxScaler /
MaxAbsScaler
MinMaxScaler实现数据归一化[0,1]
实例:
X_train = np.array([[ 1., -1., 2.],[ 2., 0., 0.],[ 0., 1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
归一化计算公式:
构造函数 可以指定范围
Parameters
----------
feature_range : tuple (min, max), default=(0, 1)
Desired range of transformed data.
copy : boolean, optional, default True
Set to False to perform inplace row normalization and avoid a
copy (if the input is already a numpy array).
def __init__(self, feature_range=(0, 1), copy=True):
self.feature_range = feature_range
self.copy = copy
fit方法,计算特征的最大值以及最小值
def fit(self, X, y=None):
"""Compute the minimum and maximum to be used for later scaling.
Parameters
----------
X : array-like, shape [n_samples, n_features]
The data used to compute the per-feature minimum and maximum
used for later scaling along the features axis.
"""
# Reset internal state before fitting
self._reset()
return self.partial_fit(X, y)
transform根据指定范围进行范围处理
def transform(self, X):
"""Scaling features of X according to feature_range.
Parameters
----------
X : array-like, shape [n_samples, n_features]
Input data that will be transformed.
"""
check_is_fitted(self, 'scale_')
X = check_array(X, copy=self.copy, dtype=FLOAT_DTYPES,
force_all_finite="allow-nan")
X *= self.scale_
X += self.min_
return X
MaxAbsScaler数据归一化[-1,1]
实例:
X_train = np.array([[ 1., -1., 2.],[ 2., 0., 0.],[ 0., 1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
fit计算绝对值的最大值:
def fit(self, X, y=None):
"""Compute the maximum absolute value to be used for later scaling.
Parameters
----------
X : {array-like, sparse matrix}, shape [n_samples, n_features]
The data used to compute the per-feature minimum and maximum
used for later scaling along the features axis.
"""
# Reset internal state before fitting
self._reset()
return self.partial_fit(X, y)
transform将数据除以绝对值最大值,归一化到[-1,1]
def transform(self, X):
"""Scale the data
Parameters
----------
X : {array-like, sparse matrix}
The data that should be scaled.
"""
check_is_fitted(self, 'scale_')
X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
estimator=self, dtype=FLOAT_DTYPES,
force_all_finite='allow-nan')
if sparse.issparse(X):
inplace_column_scale(X, 1.0 / self.scale_)
else:
X /= self.scale_
return X
3.稀疏数据处理
4.极端数据处理
利用RobustScaler来处理极端值 利用分位数的概念来处理极端值
Scale features using statistics that are robust to outliers.
This Scaler removes the median and scales the data according to
the quantile range (defaults to IQR: Interquartile Range).
The IQR is the range between the 1st quartile (25th quantile)
and the 3rd quartile (75th quantile).
构造函数:
Parameters
----------
with_centering : boolean, True by default
If True, center the data before scaling.
This will cause ``transform`` to raise an exception when attempted on
sparse matrices, because centering them entails building a dense
matrix which in common use cases is likely to be too large to fit in
memory.
with_scaling : boolean, True by default
If True, scale the data to interquartile range.
quantile_range : tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0
Default: (25.0, 75.0) = (1st quantile, 3rd quantile) = IQR
Quantile range used to calculate ``scale_``.
.. versionadded:: 0.18
copy : boolean, optional, default is True
If False, try to avoid a copy and do inplace scaling instead.
This is not guaranteed to always work inplace; e.g. if the data is
not a NumPy array or scipy.sparse CSR matrix, a copy may still be
returned.
def __init__(self, with_centering=True, with_scaling=True,
quantile_range=(25.0, 75.0), copy=True):
self.with_centering = with_centering
self.with_scaling = with_scaling
self.quantile_range = quantile_range
self.copy = copy