sklearn (24): Preprocessing data

Sarah ฅʕ•̫͡•ʔฅ

Last modified: 2022-05-24 18:09:41

Category: Sklearn. Tags: sklearn, python, machine learning

First published: 2018-10-13 00:00:28

Article link: https://blog.csdn.net/u014765410/article/details/83034066

Note: before preprocessing data, first check whether the data contains outliers, and only then decide which preprocessing method to use.
Several preprocessing methods are described below:

Standardization, or mean removal and variance scaling

Standardization
Transforms the data of each feature so that it has mean = 0 and variance = 1.

sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
# with_mean=True: if True, center the data before scaling.
# with_std=True: if True, scale the data to unit variance.
# Example
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)

>>> X_scaled                                          
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
       
sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)  # the fitted standardization can also be applied to test data
# Example
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)

>>> scaler.mean_                                      
array([1. ..., 0. ..., 0.33...])

>>> scaler.scale_                                       
array([0.81..., 0.81..., 1.24...])

>>> scaler.transform(X_train)                           
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> X_test = [[-1., 1., 0.]]
>>> scaler.transform(X_test)                
array([[-2.44...,  1.22..., -0.26...]])

# Both of the functions above can disable centering (removing the mean) or scaling (dividing by the standard deviation) by setting with_mean=False or with_std=False.
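As a quick check of the behaviour described above, the sketch below verifies that the standardized columns have mean 0 and standard deviation 1, and shows what the with_mean and with_std switches do on their own. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Full standardization: each column ends up with mean 0 and standard deviation 1.
X_scaled = scale(X_train)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))

# Centering only: the column mean is removed, the spread is left untouched.
centered = StandardScaler(with_std=False).fit_transform(X_train)

# Scaling only: columns are divided by their standard deviation, not centered.
scaled_only = StandardScaler(with_mean=False).fit_transform(X_train)
print(centered)
print(scaled_only)
```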

Scaling features to a range
The motivations for using this kind of scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data.
Several scaling functions are described below:

# MinMaxScaler: scales each feature into a given range.
# Scaling formula (min and max are the bounds of feature_range):
#X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
#X_scaled = X_std * (max - min) + min
sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
# Example
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])
>>> X_test = np.array([[-3., -1.,  4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])

# MaxAbsScaler: divides each feature by its own maximum absolute value, so that the training values of every feature lie in [-1, 1].
sklearn.preprocessing.MaxAbsScaler(copy=True)
# Example
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> max_abs_scaler = preprocessing.MaxAbsScaler()
>>> X_train_maxabs = max_abs_scaler.fit_transform(X_train)
>>> X_train_maxabs
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_maxabs = max_abs_scaler.transform(X_test)
>>> X_test_maxabs                 
array([[-1.5, -1. ,  2. ]])
>>> max_abs_scaler.scale_         
array([2.,  1.,  2.])

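# Note: if you do not want to create an object, the functions minmax_scale and maxabs_scale give the same result as the classes above. However, they cannot reapply the preprocessing learned on the training data to a test dataset.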
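To make the MinMaxScaler formula concrete, here is a minimal sketch that applies it by hand with NumPy and compares the result with MinMaxScaler and minmax_scale. The target range (-1, 1) and the names lo and hi are chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, minmax_scale

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
lo, hi = -1.0, 1.0  # illustrative feature_range

# The formula from the text, applied column by column.
X_std = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
X_manual = X_std * (hi - lo) + lo

# The class remembers the training min/max, so it can be reused on new data.
scaler = MinMaxScaler(feature_range=(lo, hi)).fit(X_train)
X_sklearn = scaler.transform(X_train)

# The function gives the same result but keeps no fitted state.
X_func = minmax_scale(X_train, feature_range=(lo, hi))

# Test values outside the training range fall outside [lo, hi].
X_test = np.array([[-3., -1., 4.]])
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```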

Scaling sparse data
Note that sparse data must not be centered, because centering would destroy its sparse structure. Sparse data can still be scaled, though, using the following functions:

MaxAbsScaler()  # scales the data
maxabs_scale()  # function form of MaxAbsScaler()
scale() and StandardScaler()  # with with_mean=False, these can also be used for scaling

sklearn.preprocessing.RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)  # Centering is based on the median and scaling on the quantile range (by default the interquartile range). RobustScaler() is robust to outliers. According to the documentation this post follows, RobustScaler() cannot be fitted on sparse data, but its transform() method can be used to scale sparse data.

When passing sparse data to these functions, convert it to scipy.sparse.csr_matrix or scipy.sparse.csc_matrix first to avoid unnecessary memory copies and save memory.
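The following sketch illustrates the points above on a small CSR matrix: scaling keeps the matrix sparse, while a request to center it is rejected. The matrix values are made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

X_sparse = sparse.csr_matrix(np.array([[1., 0., 2.],
                                       [0., 0., -4.],
                                       [3., 0., 0.]]))

# Scaling only multiplies the stored entries, so the sparsity pattern survives.
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)
X_unit_var = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(sparse.issparse(X_maxabs), X_maxabs.nnz)

# Centering would turn the implicit zeros into non-zeros, so it is refused.
centering_refused = False
try:
    StandardScaler(with_mean=True).fit(X_sparse)
except ValueError as exc:
    centering_refused = True
    print("centering refused:", exc)
```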

Scaling data with outliers
When the data contains outliers, preprocessing it with the mean and variance does not perform well, because the mean and variance are then poor estimates of the data's center and spread. To handle data with outliers, use RobustScaler() or the robust_scale() function.
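A small sketch of why the median and interquartile range hold up better than the mean and standard deviation: one extreme value is added to an otherwise well-behaved column (the data is invented for illustration).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five ordinary values plus one extreme outlier.
x = np.array([[1.], [2.], [3.], [4.], [5.], [1000.]])

robust = RobustScaler().fit(x)
standard = StandardScaler().fit(x)

# Median and interquartile range barely notice the outlier ...
print("median:", robust.center_, "IQR:", robust.scale_)
# ... while the mean and standard deviation are dominated by it.
print("mean:", standard.mean_, "std:", standard.scale_)

# After robust scaling the ordinary points stay on a comparable scale.
print(robust.transform(x).ravel())
```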
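In some cases the preprocessing above still does not meet the requirements of certain ML algorithms, which may additionally require the features to be linearly uncorrelated. For this, sklearn.decomposition.PCA with whiten=True can be used to further remove the linear correlation between features.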
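A minimal sketch of the whitening step, assuming two strongly correlated synthetic features: after PCA with whiten=True the sample covariance matrix is the identity, i.e. the components are uncorrelated with unit variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the second feature is roughly twice the first plus noise.
rng = np.random.RandomState(0)
a = rng.normal(size=500)
X = np.column_stack([a, 2.0 * a + 0.5 * rng.normal(size=500)])

Z = PCA(whiten=True).fit_transform(X)

cov_before = np.cov(X, rowvar=False)  # large off-diagonal entries
cov_after = np.cov(Z, rowvar=False)   # identity matrix up to rounding
print(cov_before)
print(cov_after)
```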

Centering the kernel matrix

sklearn.preprocessing.KernelCenterer   #center a kernel matrix

# Example
>>> from sklearn.preprocessing import KernelCenterer
>>> from sklearn.metrics.pairwise import pairwise_kernels
>>> X = [[ 1., -2.,  2.],
...      [ -2.,  1.,  3.],
...      [ 4.,  1., -2.]]
>>> K = pairwise_kernels(X, metric='linear')
>>> K
array([[  9.,   2.,  -2.],
       [  2.,  14., -13.],
       [ -2., -13.,  21.]])
>>> transformer = KernelCenterer().fit(K)
>>> transformer
KernelCenterer()
>>> transformer.transform(K)  # equivalent to centering phi(x) and phi(z) in K(x, z) first and then computing the kernel
array([[  5.,   0.,  -5.],
       [  0.,  14., -14.],
       [ -5., -14.,  19.]])

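Suppose the kernel is K(x, z) = phi(x)^T phi(z). KernelCenterer() centers phi(x) (so that its mean is 0) without computing phi(x) explicitly. The result is the same as centering phi(x) with StandardScaler(with_std=False) and then computing the kernel.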
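This equivalence can be checked directly for the linear kernel, where phi(x) = x is available explicitly. The sketch below compares KernelCenterer with the usual centering formula K - 1K - K1 + 1K1 (1 being the n x n matrix with all entries 1/n) and with the kernel of explicitly centered features.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels
from sklearn.preprocessing import KernelCenterer, StandardScaler

X = np.array([[1., -2., 2.],
              [-2., 1., 3.],
              [4., 1., -2.]])
K = pairwise_kernels(X, metric='linear')

# 1) Centering done by scikit-learn on the kernel matrix itself.
K_centered = KernelCenterer().fit_transform(K)

# 2) The same operation written out: K - 1K - K1 + 1K1.
n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_manual = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# 3) For the linear kernel phi(x) = x, so we can center the features explicitly.
X_centered = StandardScaler(with_std=False).fit_transform(X)
K_explicit = X_centered @ X_centered.T

print(K_centered)
```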

Non-linear transformation

Mapping to a Uniform distribution
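A rank transformation maps data that follows an arbitrary distribution to a uniform distribution. It spreads out samples that are densely packed in the original data, so that the samples become evenly distributed in the space. For example, suppose the original data is one-dimensional with sample points 0, 35, 36, 37, 40, 100, 500. Samples 2, 3 and 4 (35, 36, 37) are packed closely together, but after the uniform distribution transformation all samples are evenly spaced.
The uniform distribution transformation is insensitive to outliers. However, it distorts correlations and distances within and across features.
QuantileTransformer and quantile_transform provide a non-parametric transformation based on the quantile function that maps the data to a uniform distribution between 0 and 1. With these two functions the original data can be mapped to either a uniform or a Gaussian distribution. They spread out densely packed values and are insensitive to outliers, which makes them a very powerful preprocessing scheme.
With QuantileTransformer and quantile_transform, feature values of new or unseen data that fall below or above the fitted range are mapped to the bounds of the output distribution. Keep in mind that this is a non-linear transformation, so it may distort correlations and distances within and across features. On the other hand, it makes features measured on different scales directly comparable.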
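The one-dimensional example from the text can be run as a short sketch. n_quantiles is set to the number of samples here (an illustrative choice), so each training point lands exactly on a quantile and the output is evenly spaced on [0, 1].

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# The sample points from the text: 35, 36, 37 are tightly clustered.
x = np.array([[0.], [35.], [36.], [37.], [40.], [100.], [500.]])

qt = QuantileTransformer(n_quantiles=7, output_distribution='uniform')
x_uniform = qt.fit_transform(x)
print(x_uniform.ravel())  # evenly spaced values between 0 and 1

# Unseen values outside the fitted range are mapped to the bounds 0 and 1.
x_new = qt.transform(np.array([[-10.], [1000.]]))
print(x_new.ravel())
```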
Mapping to a Gaussian distribution
Maps the original data to a Gaussian distribution. The functions that implement this transformation are QuantileTransformer, quantile_transform and PowerTransformer.
PowerTransformer provides two methods: the Yeo-Johnson transform and the Box-Cox transform. Box-Cox can only be applied to strictly positive data.

Note that neither the uniform distribution transformation nor the Gaussian distribution transformation works for every data distribution. It is therefore important to visualize the data before and after the transformation to judge whether it succeeded.

The functions mentioned for the two distribution transformations above are listed below:

sklearn.preprocessing.QuantileTransformer(n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)  # the fitted transformation can be applied to test data
# n_quantiles: number of quantiles to compute
# output_distribution: {'uniform', 'normal'}
# ignore_implicit_zeros: only applies to sparse matrices. If True, the implicit zero entries are discarded when computing the quantile statistics; if False, they are treated as zeros.
# subsample: maximum number of samples used to compute the quantile statistics
# random_state: the random state used for subsampling

sklearn.preprocessing.quantile_transform(X, axis=0, n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=False)  # function form; a transformation learned on the training data cannot be reapplied to test data

sklearn.preprocessing.PowerTransformer(method='yeo-johnson', standardize=True, copy=True)
# method: the transformation method, {'yeo-johnson', 'box-cox'}
# standardize=True: rescale the transformed output to mean 0 and unit variance
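A hedged sketch of the Gaussian mapping on synthetic log-normal data (the data and the use of scipy.stats.skew as a rough symmetry check are assumptions for illustration; a histogram before and after would serve the same purpose). It also shows Box-Cox rejecting data that is not strictly positive.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Strongly right-skewed, strictly positive synthetic data.
rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))

X_boxcox = PowerTransformer(method='box-cox').fit_transform(X)
X_yeo = PowerTransformer(method='yeo-johnson').fit_transform(X)
X_quant = QuantileTransformer(n_quantiles=100, output_distribution='normal',
                              random_state=0).fit_transform(X)

# Skewness near 0 suggests a roughly symmetric, Gaussian-like shape.
for name, data in [("raw", X), ("box-cox", X_boxcox),
                   ("yeo-johnson", X_yeo), ("quantile", X_quant)]:
    print(name, skew(data.ravel()))

# Box-Cox requires strictly positive input and refuses anything else.
boxcox_rejected = False
try:
    PowerTransformer(method='box-cox').fit(np.array([[-1.], [2.], [3.]]))
except ValueError as exc:
    boxcox_rejected = True
    print("box-cox refused:", exc)
```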

Question: which scenarios are the uniform distribution transformation and the Gaussian distribution transformation each best suited to? (Comments are welcome.)

Normalization

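Normalization scales each individual sample to unit norm (a unit vector). This is very useful if you want to use a quadratic form (such as a dot product or a kernel) to quantify the similarity between two samples. For example, given two samples x1 and x2, the transformation first makes |x1| = 1 and |x2| = 1, and then x1 · x2 quantifies their similarity. This is the same as computing the cosine of the angle between the two samples: x1 · x2 / (|x1| |x2|) = x1 · x2. The larger the angle, the smaller the cosine, and the less similar the two samples are.
This normalization is often used in text classification and clustering contexts.
Either of the following two functions implements it: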

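sklearn.preprocessing.normalize(X, norm='l2', axis=1, copy=True, return_norm=False)  # function form; no fit method
# norm: {'l1', 'l2', 'max'}, the norm used for normalization
# axis: the axis along which to normalize (1 normalizes each sample, 0 each feature)
# return_norm=True: also return the computed norms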

sklearn.preprocessing.Normalizer(norm='l2', copy=True)
# the fit method does nothing

# Normalization operates on each sample independently, so nothing needs to be learned in fit; the data can be transformed directly.
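The link between unit-norm samples and cosine similarity can be checked with a short sketch: after L2 normalization, the plain dot product between samples equals the cosine similarity of the original samples.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Function form and class form give the same row-wise unit vectors.
X_l2 = normalize(X, norm='l2')
X_l2_obj = Normalizer(norm='l2').fit_transform(X)
print(np.linalg.norm(X_l2, axis=1))  # every row has norm 1

# Cosine similarity computed from the raw samples ...
row_norms = np.linalg.norm(X, axis=1)
cosine = (X @ X.T) / np.outer(row_norms, row_norms)

# ... equals the dot product of the normalized samples.
dot_after = X_l2 @ X_l2.T
print(dot_after)
```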

Encoding categorical features

If the data contains categorical features, such as gender {female, male}, a categorical encoding function can encode them as integers. For example, feature_gender = {f, m, m, f} can be encoded as {0, 1, 1, 0}. Such an integer feature should not be fed directly to an estimator, however, because estimators tend to interpret it as an ordinal feature rather than a categorical one. A one-hot (dummy) encoding can therefore be applied as a further step, so that for each sample exactly one of the k category columns has value 1 and all the others have value 0. For example, take the sample (1, f, 2), whose columns are feature1, gender and feature3. One-hot encoding the categorical feature turns it into (1, 1, 0, 2), with columns feature1, female, male and feature3.
The functions that implement this are:

sklearn.preprocessing.OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)  # encodes categorical features as integers
# categories: {'auto': determine the categories from the data; list: pass the categories explicitly}
# dtype: data type of the output

sklearn.preprocessing.OneHotEncoder(n_values=None, categorical_features=None, categories=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')  # encodes categorical features as one-hot codes
# handle_unknown: {'error', 'ignore'}. If a category that was not seen during fit appears at transform time, either raise an error or ignore it; with 'ignore', the one-hot columns of that feature are all set to 0 for that sample (often the better choice).
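A minimal sketch of both encoders on the gender example, including what handle_unknown='ignore' does with a category that was not seen during fit. The value 'x' is a made-up unknown category; calling .toarray() on the default sparse output avoids relying on the name of the sparse/dense switch, which differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

gender = np.array([['f'], ['m'], ['m'], ['f']])

# Integer codes follow the sorted category order: 'f' -> 0, 'm' -> 1.
ordinal = OrdinalEncoder().fit(gender)
codes = ordinal.transform(gender)
print(ordinal.categories_, codes.ravel())

# One column per category; unknown categories become an all-zero row.
onehot = OneHotEncoder(handle_unknown='ignore').fit(gender)
encoded = onehot.transform(np.array([['f'], ['m'], ['x']])).toarray()
print(encoded)
```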

Discretization

Discretization functions turn features with continuous values into discrete ones.
The functions that perform discretization are listed below:

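sklearn.preprocessing.KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')  # splits continuous values into k bins
# n_bins: number of discrete intervals to split the continuous values into
# encode: {'onehot': sparse one-hot output; 'onehot-dense': dense one-hot output; 'ordinal': return the bin index directly instead of a one-hot code}
# strategy: {'uniform': all bins have the same width; 'quantile': all bins contain the same number of points; 'kmeans': the values in each bin share the same nearest center of a 1D k-means clustering}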

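sklearn.preprocessing.Binarizer(threshold=0.0, copy=True)  # splits continuous values into 2 bins: values less than or equal to threshold become 0, values above it become 1

sklearn.preprocessing.binarize(X, threshold=0.0, copy=True)  # function form of the same 2-bin split; no fit or transform methods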
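A small sketch of both discretizers on a single made-up feature. strategy='uniform' with encode='ordinal' is used so the bin edges and bin indices are easy to follow by hand.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

X = np.array([[0.], [1.], [2.], [3.], [4.], [5.]])

# Three equal-width bins over [0, 5]; the output is the bin index per value.
kbins = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
binned = kbins.fit_transform(X)
print(kbins.bin_edges_[0])
print(binned.ravel())

# Two bins split at the threshold: <= 2.5 becomes 0, > 2.5 becomes 1.
binary = Binarizer(threshold=2.5).fit_transform(X)
print(binary.ravel())
```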

Imputation of missing values

The functions for filling in missing values in the data are:

sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True)
# missing_values: how missing values are represented in the data
# strategy: {'mean'; 'median'; 'most_frequent': fill with the most frequent value, which also works for categorical features; 'constant': fill with the constant given in fill_value}

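sklearn.impute.MissingIndicator(missing_values=nan, features='missing-only', sparse='auto', error_on_new=True)  # returns an indicator matrix that marks the missing values
# features: {'missing-only': the imputer mask only covers features that contained missing values during fit; 'all': the mask covers all features}
# error_on_new=True: only applies when features='missing-only'; transform raises an error if a feature that had no missing values during fit has missing values at transform time
# sparse: {'auto': the output has the same type as the input; True: sparse matrix; False: numpy array}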
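A short sketch of the two imputation helpers on a made-up matrix with one missing entry: SimpleImputer fills it with the column mean, and MissingIndicator reports where it was.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

X = np.array([[1., 2.],
              [np.nan, 3.],
              [7., 6.]])

# The missing value in column 0 is replaced by the mean of 1 and 7.
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)
print(X_filled)

# With features='missing-only' (the default) only column 0 is reported.
indicator = MissingIndicator()
mask = indicator.fit_transform(X)
print(indicator.features_, mask.ravel())
```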

Generating polynomial features

Converts the features of the data into polynomial features. For example, for a sample X = (x1, x2), the degree-2 polynomial features are (1, x1, x2, x1², x1x2, x2²).
The function that implements this transformation is:

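sklearn.preprocessing.PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
# interaction_only: if True, only interaction terms (products of distinct features) are produced, no powers of a single feature
# include_bias=True: include the bias term (a column of ones)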
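The expansion can be seen on a single sample with x1 = 2 and x2 = 3 (values chosen so every term is distinguishable):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2., 3.]])

# Columns: 1, x1, x2, x1^2, x1*x2, x2^2
full = PolynomialFeatures(degree=2).fit_transform(X)
print(full)

# interaction_only drops the pure powers x1^2 and x2^2: 1, x1, x2, x1*x2
interactions = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X)
print(interactions)

# include_bias=False drops the leading column of ones.
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(no_bias)
```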

Custom transformers

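# FunctionTransformer() lets you use a user-defined transformation function for data cleaning.
sklearn.preprocessing.FunctionTransformer(func=None, inverse_func=None, validate=None, accept_sparse=False, pass_y='deprecated', check_inverse=True, kw_args=None, inv_kw_args=None)
# func: the callable used as the transformation function; inverse_func: the callable used for the inverse transformation. The two should be inverse functions of each other.
# check_inverse=True: check that func and inverse_func are consistent (that applying one after the other returns the original input)
# validate=True: validate the input X and convert it to a 2D NumPy array or sparse matrix before calling func; an exception is raised if the conversion is not possible
# pass_y=True: pass the target y to the inner callable as well (marked 'deprecated' in the signature above)
# accept_sparse=True: accept a sparse matrix as input
# kw_args: dictionary of additional keyword arguments passed to func
# inv_kw_args: dictionary of additional keyword arguments passed to inverse_func

# Example code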
>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p, validate=True)
>>> X = np.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])

# Note that with check_inverse=True a warning is raised when func and inverse_func are not inverses of each other; it can be turned into an error with filterwarnings:
>>> import warnings
>>> warnings.filterwarnings("error", message=".*check_inverse*.",
...                         category=UserWarning, append=False)
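As a further sketch of the parameters listed above, the example below pairs np.log1p with its inverse np.expm1 and passes an extra keyword argument through kw_args. The helper scale_by and its factor argument are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[0., 1.], [2., 3.]])

# func and inverse_func are true inverses, so check_inverse raises no warning.
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1,
                                      validate=True, check_inverse=True)
X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)
print(X_back)  # recovers the original X


def scale_by(X, factor=1.0):
    """Hypothetical user-defined transformation: multiply X by a factor."""
    return X * factor


# kw_args forwards keyword arguments to func on every call.
times_ten = FunctionTransformer(func=scale_by, kw_args={'factor': 10.0},
                                validate=True)
X_times_ten = times_ten.fit_transform(X)
print(X_times_ten)
```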

Official documentation: Preprocessing data
