To simplify data processing, sklearn.pipeline provides two modes of composition: serial (chaining) and parallel.
1. Serial composition, via the Pipeline class
The steps parameter defines the processing flow as a list of ('key', 'value') tuples, where key is the name you choose for that step and value is the corresponding estimator instance. Every step except the last must implement a transform method; the last step is unrestricted and is typically the model. The resulting pipe exposes the methods of the final estimator.
In [42]: from sklearn.pipeline import Pipeline
...: from sklearn.svm import SVC
...: from sklearn.decomposition import PCA
...: pipe=Pipeline(steps=[('pca',PCA()),('svc',SVC())])
...:
...: from sklearn.datasets import load_iris
...: iris=load_iris()
...: pipe.fit(iris.data,iris.target)
...:
Out[42]:
Pipeline(steps=[('pca', PCA(copy=True, iterated_power='auto',
    n_components=None, random_state=None, svd_solver='auto', tol=0.0,
    whiten=False)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None,
    coef0=0.0, decision_function_shape=None, degree=3, gamma='auto',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False))])
Fitting yields a model that can be used directly for prediction. At predict time, the data is passed through the transformations starting from step 1, so no extra code is needed to preprocess the data being predicted. You can also call pipe.score(X, y) to get the model's accuracy on data X against labels y.
In [46]: pipe.predict(iris.data)
Out[46]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [47]: iris.target
Out[47]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
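The score method mentioned above works the same way on the fitted pipeline; a minimal sketch (the output value depends on the fit and is not in the original transcript, so it is omitted):
# Transforms iris.data through the PCA step, then returns SVC's mean accuracy
acc = pipe.score(iris.data, iris.target)
print(acc)  # a float in [0, 1]: training-set accuracy of the PCA+SVC pipeline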
The make_pipeline function is a shorthand for the Pipeline class: you pass only the estimator instance for each step, without naming them; the lowercase class name is used automatically as the step name.
In [49]: from sklearn.pipeline import make_pipeline
    ...: from sklearn.preprocessing import StandardScaler
    ...: from sklearn.naive_bayes import GaussianNB
In [50]: make_pipeline(StandardScaler(),GaussianNB())
Out[50]: Pipeline(steps=[('standardscaler', StandardScaler(copy=True,
with_mean=True, with_std=True)), ('gaussiannb', GaussianNB(priors=None))])
In [51]: p=make_pipeline(StandardScaler(),GaussianNB())
In [52]: p.steps
Out[52]:
[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
('gaussiannb', GaussianNB(priors=None))]
You can also use set_params to reset the parameters passed to any step's class; the syntax is stepname__paramname=value (step name, double underscore, parameter name).
In [59]: p.set_params(standardscaler__with_mean=False)
Out[59]: Pipeline(steps=[('standardscaler', StandardScaler(copy=True,
with_mean=False, with_std=True)), ('gaussiannb', GaussianNB(priors=None))])
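A usage note, as a minimal sketch (named_steps is standard Pipeline API; the refit on iris is illustrative): set_params only rewrites the step's configuration, so the new value takes effect the next time the pipeline is fitted.
# Refit so the with_mean=False setting is actually applied
p.fit(iris.data, iris.target)
print(p.named_steps['standardscaler'].with_mean)  # -> False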
2. Parallel composition, via FeatureUnion
FeatureUnion is likewise configured with (key, value) pairs, and its parameters can also be set via set_params. The difference is that every step is computed independently on the input, and FeatureUnion concatenates their results at the end. It returns an array and does not expose the methods of a final estimator. FeatureUnion is very handy when some features need standardization, others a log transform or one-hot encoding, producing many feature columns from which the important ones are then selected (a sketch of that workflow follows the transcript below).
In [60]: from sklearn.pipeline import FeatureUnion
In [61]: from sklearn.preprocessing import StandardScaler
In [63]: from sklearn.preprocessing import FunctionTransformer
In [64]: from numpy import log1p
In [65]: step1=('Standar',StandardScaler())
In [66]: step2=('ToLog',FunctionTransformer(log1p))
In [67]: steps=FeatureUnion(transformer_list=[step1,step2])
In [68]: steps.fit_transform(iris.data)
Out[68]:
array([[-0.90068117,  1.03205722, -1.3412724 , ...,  1.5040774 ,
         0.87546874,  0.18232156],
       [-1.14301691, -0.1249576 , -1.3412724 , ...,  1.38629436,
         0.87546874,  0.18232156],
       [-1.38535265,  0.33784833, -1.39813811, ...,  1.43508453,
         0.83290912,  0.18232156],
       ...,
       [ 0.79566902, -0.1249576 ,  0.81962435, ...,  1.38629436,
         1.82454929,  1.09861229],
       [ 0.4321654 ,  0.80065426,  0.93335575, ...,  1.48160454,
         1.85629799,  1.19392247],
       [ 0.06866179, -0.1249576 ,  0.76275864, ...,  1.38629436,
         1.80828877,  1.02961942]])
In [69]: data=steps.fit_transform(iris.data)
In [70]: data.shape  # 8 features in the end: 4 standardized + 4 log-transformed, 8 in total
Out[70]: (150, 8)
In [71]: iris.data.shape
Out[71]: (150, 4)
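To round out the "build many features, then select the important ones" workflow described above, here is a minimal sketch that nests the FeatureUnion inside a Pipeline; the SelectKBest step and k=5 are illustrative assumptions, not part of the original:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVC
from numpy import log1p

iris = load_iris()

# The union computes 8 candidate features in parallel (4 scaled + 4 log),
# SelectKBest keeps the k highest-scoring ones, and SVC fits on those.
union = FeatureUnion(transformer_list=[('Standar', StandardScaler()),
                                       ('ToLog', FunctionTransformer(log1p))])
model = Pipeline(steps=[('union', union),
                        ('select', SelectKBest(f_classif, k=5)),  # k=5 is an arbitrary choice
                        ('svc', SVC())])
model.fit(iris.data, iris.target)

# Nested parameters use the same name__param convention, chained per level:
model.set_params(select__k=4, union__Standar__with_std=False)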