Machine Learning Preprocessing: Comparing Similar Operations in PySpark and sklearn (continuously updated)

  • sklearn: import sklearn.preprocessing
  • pyspark: import pyspark.ml.feature

MinMaxScaler: normalizes to [0, 1]

  • Principle
    $X_{\text{scaled}} = \frac{X - X.\min(axis=0)}{X.\max(axis=0) - X.\min(axis=0)} \cdot (\max - \min) + \min$
    where max and min are the bounds of feature_range. Normalization is applied independently to each column.
  • sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# Generate a toy 2-class dataset
X, y = make_blobs(n_samples=40, centers=2, random_state=50, cluster_std=2)

# Left: raw data
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.cool)

# Right: data scaled column-wise into [0, 1]
plt.subplot(122)
X_2 = MinMaxScaler().fit_transform(X)
plt.scatter(X_2[:, 0], X_2[:, 1], c=y, cmap=plt.cm.cool)
plt.show()

(Figure: scatter plot of the raw data on the left and the MinMax-scaled data on the right.)
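To confirm that fit_transform applies the column-wise formula above, here is a minimal sketch on a hypothetical toy array (not from the original post), comparing a manual computation against sklearn:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., 10.], [2., 20.], [4., 30.]])  # hypothetical toy data

# Manual column-wise min-max scaling with the default feature_range (0, 1)
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(manual, MinMaxScaler().fit_transform(X)))  # True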

  • pyspark.ml.feature.MinMaxScaler
    MinMaxScaler(min=0.0, max=1.0, inputCol=None, outputCol=None)
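    demo (a minimal sketch, since the original has no PySpark counterpart here; it assumes a live SparkSession, and the variable name spark is an assumption):

from pyspark.sql import SparkSession
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["features"])

# Unlike sklearn, the PySpark scaler is an Estimator: fit() returns a Model
scaler = MinMaxScaler(min=0.0, max=1.0, inputCol="features", outputCol="scaled")
model = scaler.fit(df)
model.transform(df).show()
# [0.0] maps to [0.0] and [2.0] maps to [1.0]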

StandardScaler

Standardization of the data

  • Standardization
    In machine learning we often deal with many kinds of data, e.g., audio signals or the pixel values of images, and these data can be high-dimensional. Standardization transforms each feature so that its values have zero mean (each feature's original mean is subtracted from every value) and unit standard deviation. This method is widely used in many machine learning algorithms.
  • sklearn.preprocessing.StandardScaler
    StandardScaler(copy=True, with_mean=True, with_std=True)
    Notes:
    If with_mean=False and with_std=False, no centering or scaling is performed: the transformer assumes the features already have mean 0 and standard deviation 1, and returns the data unchanged.
    If with_mean=True and with_std=True (the default), the actual mean μ and standard deviation σ of your data are computed and used. This is the most common approach.
    demo:
import numpy as np
import sklearn.preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# with_mean=True, with_std=True: center and scale with the data's own statistics
scaler = sklearn.preprocessing.StandardScaler(with_mean=True, with_std=True).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
print("*" * 11)

# with_mean=False, with_std=False: no statistics are computed, data is unchanged
scaler = sklearn.preprocessing.StandardScaler(with_std=False, with_mean=False).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
print("*" * 11)

# with_std=True only: scale by the standard deviation without centering
scaler = sklearn.preprocessing.StandardScaler(with_std=True, with_mean=False).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
print("*" * 11)

# with_mean=True only: center without scaling
scaler = sklearn.preprocessing.StandardScaler(with_std=False, with_mean=True).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
Output:
{'with_mean': True, 'with_std': True, 'copy': True, 'n_samples_seen_': 3, 'mean_': array([1.        , 0.        , 0.33333333]), 'var_': array([0.66666667, 0.66666667, 1.55555556]), 'scale_': array([0.81649658, 0.81649658, 1.24721913])}
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
***********
{'with_mean': False, 'with_std': False, 'copy': True, 'n_samples_seen_': 3, 'mean_': None, 'var_': None, 'scale_': None}
[[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]
***********
{'with_mean': False, 'with_std': True, 'copy': True, 'n_samples_seen_': 3, 'mean_': array([1.        , 0.        , 0.33333333]), 'var_': array([0.66666667, 0.66666667, 1.55555556]), 'scale_': array([0.81649658, 0.81649658, 1.24721913])}
[[ 1.22474487 -1.22474487  1.60356745]
 [ 2.44948974  0.          0.        ]
 [ 0.          1.22474487 -0.80178373]]
***********
{'with_mean': True, 'with_std': False, 'copy': True, 'n_samples_seen_': 3, 'mean_': array([1.        , 0.        , 0.33333333]), 'var_': None, 'scale_': None}
[[ 0.         -1.          1.66666667]
 [ 1.          0.         -0.33333333]
 [-1.          1.         -1.33333333]]
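The first transform above can be reproduced by hand. A minimal sketch; note that sklearn divides by the population standard deviation (ddof=0):

import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# StandardScaler centers with the column mean and divides by the
# population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
print(manual)  # matches the first transform output above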
  • pyspark.ml.feature.StandardScaler
    StandardScaler(withMean=False, withStd=True, inputCol=None, outputCol=None)
    When withMean=True and withStd=False, each element has its column mean subtracted. When both withMean and withStd are True, each element is centered and then divided by its column's standard deviation. When withMean=True the scaler can only process dense vectors, not sparse ones.
    demo
from pyspark.sql import SparkSession
import pyspark.ml.feature
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["a"])

# Center and scale
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=True, withMean=True)
model = standardScaler.fit(df)
model.transform(df).show()
print("*" * 22)

# Scale only
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=True, withMean=False)
model = standardScaler.fit(df)
model.transform(df).show()
print("*" * 22)

# Neither: the output equals the input
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=False, withMean=False)
model = standardScaler.fit(df)
model.transform(df).show()
print("*" * 22)

# Center only
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=False, withMean=True)
model = standardScaler.fit(df)
model.transform(df).show()

Result:
(Figure omitted: each show() call prints the scaled column for the corresponding withMean/withStd combination.)
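A difference worth knowing when comparing the two libraries: Spark's StandardScaler divides by the corrected sample standard deviation (ddof=1), whereas sklearn divides by the population standard deviation (ddof=0), so the two produce different values on the same column. A minimal sketch of the expected numbers for the two-row demo above:

import numpy as np

a = np.array([0.0, 2.0])  # the "a" column from the demo

# Spark-style: sample std, sqrt(((0-1)**2 + (2-1)**2) / (2-1)) = sqrt(2)
print((a - a.mean()) / a.std(ddof=1))  # [-0.70710678  0.70710678]

# sklearn-style: population std = 1.0
print((a - a.mean()) / a.std(ddof=0))  # [-1.  1.]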

Binarizer: binarization by threshold

Values less than or equal to the threshold are set to 0.
Values greater than the threshold are set to 1.

  • sklearn.preprocessing.Binarizer
    Note:
    sklearn's Binarizer only accepts 2D arrays (see the reshape sketch after the output below).

    demo:
import numpy as np
import sklearn.preprocessing

x = np.array([[1, 2, 3.4], [2.1, 1.3, -10]])

# fit() then transform(); Binarizer is stateless, so fit() only validates the input
transformer = sklearn.preprocessing.Binarizer(threshold=2).fit(x)
print(transformer.transform(x))

# fit_transform() as a shortcut
binarizer = sklearn.preprocessing.Binarizer(threshold=2)
print(binarizer.fit_transform([[1, 2, 3, 4], [2, 3, 4, 5]]))

Output:
[[0. 0. 1.]
 [1. 0. 0.]]
 
[[0 0 1 1]
 [0 1 1 1]]
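Because Binarizer only accepts 2D input, one-dimensional data must be reshaped into a single column first. A minimal sketch (the array here is hypothetical):

import numpy as np
from sklearn.preprocessing import Binarizer

x = np.array([1., 2.5, 0.3])   # passing this 1D array directly raises a ValueError
x_2d = x.reshape(-1, 1)        # reshape into a single-column 2D array
print(Binarizer(threshold=2).fit_transform(x_2d).ravel())  # [0. 1. 0.]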

The difference between fit, transform, and fit_transform

  1. fit: learns the statistics the transformer needs from the data (e.g., the per-feature mean and standard deviation) without changing the data

  2. transform: applies the pre-processing to the data, using one of the classes from sklearn.preprocessing and the statistics learned by fit

  3. fit_transform(): the same as calling fit() and then transform() - a shortcut (see the sketch after this list)
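The practical reason fit and transform are kept separate: fit the scaler on the training data only, then reuse the learned statistics on new data instead of refitting. A minimal sketch with hypothetical arrays:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.], [2.], [3.]])
X_test = np.array([[4.]])

scaler = StandardScaler().fit(X_train)  # statistics come from the training data only
print(scaler.transform(X_train))        # standardized with the training mean/std
print(scaler.transform(X_test))         # the same statistics are reused, not refit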

  • pyspark.ml.feature.Binarizer
    Binarizer(threshold=0.0, inputCol=None, outputCol=None)
    demo
from pyspark.sql import SparkSession
import pyspark.ml.feature

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.5,), (2.0,), (3.0,)], ['values'])
df.show()
binarizer = pyspark.ml.feature.Binarizer(threshold=2.0, inputCol='values', outputCol='features')
binarizer.transform(df).show()

"""
+------+
|values|
+------+
|   0.5|
|   2.0|
|   3.0|
+------+

+------+--------+
|values|features|
+------+--------+
|   0.5|     0.0|
|   2.0|     0.0|
|   3.0|     1.0|
+------+--------+
"""
