This module covers data preprocessing: standardization, centering, scaling, binarization, and so on.
RobustScaler
class sklearn.preprocessing.RobustScaler(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
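This scaler subtracts the median from each feature (column) and then divides by the IQR (interquartile range).
Outliers tend to distort the sample mean and variance, so when outliers are present, scaling with the median and IQR usually gives better results.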
$v^{\prime}_i = \frac{v_i - median}{IQR}$
- $v_i$ is a sample value of the feature
- $median$ is the median of the feature's samples
- $IQR$ is the interquartile range of the feature's samples
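The formula can be checked by hand with NumPy. The following is a minimal sketch, not scikit-learn's actual implementation; it uses the same toy matrix as the RobustScaler example in this section and assumes the default 25th and 75th percentiles.

```python
import numpy as np

# Same toy matrix as the RobustScaler example: rows are samples, columns are features.
X = np.array([[1., -2., 2.],
              [-2., 1., 3.],
              [4., 1., -2.]])

median = np.median(X, axis=0)                   # per-feature median
q25, q75 = np.percentile(X, [25, 75], axis=0)   # per-feature quartiles
iqr = q75 - q25                                 # per-feature interquartile range

# Apply v' = (v - median) / IQR column by column via broadcasting.
X_scaled = (X - median) / iqr
print(X_scaled)
```

The result should agree with what `RobustScaler().fit(X).transform(X)` returns for the same data.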
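Parameters
- with_centering: bool, default=True. If True, subtract the median before scaling.
- with_scaling: bool, default=True. If True, divide by the quantile range (the IQR by default).
- quantile_range: tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0, default=(25.0, 75.0). The quantile range used to compute the scale.
- copy: bool, optional, default=True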
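To see what these parameters change, here is a small sketch that disables centering and widens the quantile range. The choice of (10.0, 90.0) is arbitrary and only for illustration; `center_` and `scale_` are the fitted attributes of RobustScaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1., -2., 2.],
              [-2., 1., 3.],
              [4., 1., -2.]])

# Scale only (no median subtraction), using the 10th to 90th percentile spread.
scaler = RobustScaler(with_centering=False, quantile_range=(10.0, 90.0)).fit(X)

print(scaler.center_)   # None, because centering is disabled
print(scaler.scale_)    # per-feature 90th minus 10th percentile

X_scaled = scaler.transform(X)   # equivalent to X / scaler.scale_ here
print(X_scaled)
```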
Example
>>> from sklearn.preprocessing import RobustScaler
>>> X = [[ 1., -2., 2.],
... [ -2., 1., 3.],
... [ 4., 1., -2.]]
>>> transformer = RobustScaler().fit(X)
>>> transformer
RobustScaler()
>>> transformer.transform(X)
array([[ 0. , -2. , 0. ],
[-1. , 0. , 0.4],
[ 1. , 0. , -1.6]])
StandardScaler
sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)
Standardizes the data by removing the mean and scaling to unit variance:
$z = \frac{x-\mu}{\sigma}$
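Parameters
- with_mean: bool, default=True
  - Whether to center the data by subtracting the mean; if False, the mean is treated as 0.
- with_std: bool, default=True
  - Whether to scale to unit variance; if False, $\sigma$ = 1.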
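The standardization formula can be reproduced directly with NumPy. This is a rough sketch on the same data as the StandardScaler example in this section; note that StandardScaler uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

data = np.array([[0, 0], [0, 0], [1, 1], [1, 1]], dtype=float)

mu = data.mean(axis=0)      # per-feature mean
sigma = data.std(axis=0)    # per-feature population standard deviation (ddof=0)

# z = (x - mu) / sigma, applied column by column via broadcasting.
z = (data - mu) / sigma
print(z)

# New data is transformed with the statistics learned from the training data.
z_new = (np.array([[2., 2.]]) - mu) / sigma
print(z_new)
```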
Example
>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
[-1. -1.]
[ 1. 1.]
[ 1. 1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]
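# Standardize SalePrice and inspect both tails of the distribution.
# df_train is assumed to be a pandas DataFrame with a 'SalePrice' column, loaded earlier.
import numpy as np
# Convert to a NumPy array first: indexing a Series with [:, np.newaxis] fails in recent pandas.
saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'].to_numpy()[:, np.newaxis])
order = saleprice_scaled[:, 0].argsort()  # sort once, reuse for both tails
low_range = saleprice_scaled[order][:10]
high_range = saleprice_scaled[order][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)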
Normalizer
class sklearn.preprocessing.Normalizer(norm='l2', *, copy=True)
Rescales each sample (row) independently so that it has unit norm.
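Parameters
- norm: 'l1', 'l2', or 'max', default='l2'
  - l1: each feature value of a sample is divided by the sum of the absolute values of that sample's features
  - l2: each feature value of a sample is divided by the square root of the sum of squares of that sample's features (the Euclidean norm)
  - max: each feature value of a sample is divided by the largest absolute feature value in that sample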
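The three norms can be computed by hand to make the difference concrete. This is a minimal NumPy sketch, not the library's implementation; it reuses the matrix from the Normalizer example in this section, and the variable names are illustrative only.

```python
import numpy as np

X = np.array([[4, 1, 2, 2],
              [1, 3, 9, 3],
              [5, 7, 5, 1]], dtype=float)

# One norm per sample (row); keepdims=True keeps shape (3, 1) for broadcasting.
l1_norm = np.abs(X).sum(axis=1, keepdims=True)
l2_norm = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
max_norm = np.abs(X).max(axis=1, keepdims=True)

X_l1 = X / l1_norm    # absolute values in each row sum to 1
X_l2 = X / l2_norm    # each row has Euclidean length 1
X_max = X / max_norm  # largest absolute value in each row is 1

print(X_l2)
```

The `X_l2` result should agree with the output of `Normalizer().transform(X)`, since 'l2' is the default norm.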
Example
>>> from sklearn.preprocessing import Normalizer
>>> X = [[4, 1, 2, 2],
... [1, 3, 9, 3],
... [5, 7, 5, 1]]
>>> transformer = Normalizer().fit(X) # fit does nothing.
>>> transformer
Normalizer()
>>> transformer.transform(X)
array([[0.8, 0.2, 0.4, 0.4],
[0.1, 0.3, 0.9, 0.3],
[0.5, 0.7, 0.5, 0.1]])