To simplify data processing, sklearn.pipeline provides two modes of composition: serial (chaining) and parallel.
1. Serial composition, via the Pipeline class
The steps parameter defines the processing flow as a list of ('key', 'value') tuples, where key is the name you choose for that step and value is the corresponding estimator instance. Every step except the last must implement a transform method; the last step is unrestricted and is typically the model. The resulting pipe exposes the methods of the final estimator.
In [42]: from sklearn.pipeline import Pipeline
...: from sklearn.svm import SVC
...: from sklearn.decomposition import PCA
...: pipe=Pipeline(steps=[('pca',PCA()),('svc',SVC())])
...:
...: from sklearn.datasets import load_iris
...: iris=load_iris()
...: pipe.fit(iris.data,iris.target)
...:
Out[42]:
Pipeline(steps=[('pca', PCA(copy=True, iterated_power='auto',
    n_components=None, random_state=None, svd_solver='auto', tol=0.0,
    whiten=False)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None,
    coef0=0.0, decision_function_shape=None, degree=3, gamma='auto',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False))])
Fitting yields a model that can be used directly for prediction. At predict time, the data is passed through the transformations starting from step 1, so no extra code is needed to preprocess the data being predicted. You can also call pipe.score(X, y) to get the model's accuracy on data X against labels y.
In [46]: pipe.predict(iris.data)
Out[46]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [47]: iris.target
Out[47]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
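The score method mentioned above works the same way on the fitted pipeline; a minimal sketch (the output value depends on the fit and is not in the original transcript, so it is omitted):
# Transforms iris.data through the PCA step, then returns SVC's mean accuracy
acc = pipe.score(iris.data, iris.target)
print(acc)  # a float in [0, 1]: training-set accuracy of the PCA+SVC pipeline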
The make_pipeline function is a shorthand for the Pipeline class: you pass only the estimator instance for each step, without naming them; the lowercase class name is used automatically as the step name.
In [49]: from sklearn.pipeline import make_pipeline
    ...: from sklearn.preprocessing import StandardScaler
    ...: from sklearn.naive_bayes import GaussianNB
In [50]: make_pipeline(StandardScaler(),GaussianNB())
Out[50]: Pipeline(steps=[('standardscaler', StandardScaler(copy=True,
with_mean=True, with_std=True)), ('gaussiannb', GaussianNB(priors=None))])
In [51]: p=make_pipeline(StandardScaler(),GaussianNB())
In [52]: p.steps
Out[52]:
[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
('gaussiannb', GaussianNB(priors=None))]
You can also use set_params to reset the parameters passed to any step's class; the syntax is stepname__paramname=value (step name, double underscore, parameter name).
In [59]: p.set_params(standardscaler__with_mean=False)
Out[59]: Pipeline(steps=[('standardscaler', StandardScaler(copy=True,
with_mean=False, with_std=True)), ('gaussiannb', GaussianNB(priors=None))])
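A usage note, as a minimal sketch (named_steps is standard Pipeline API; the refit on iris is illustrative): set_params only rewrites the step's configuration, so the new value takes effect the next time the pipeline is fitted.
# Refit so the with_mean=False setting is actually applied
p.fit(iris.data, iris.target)
print(p.named_steps['standardscaler'].with_mean)  # -> False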
2. Parallel composition, via FeatureUnion
FeatureUnion is likewise configured with (key, value) pairs, and its parameters can also be set via set_params. The difference is that every step is computed independently on the input, and FeatureUnion concatenates their results at the end. It returns an array and does not expose the methods of a final estimator. FeatureUnion is very handy when some features need standardization, others a log transform or one-hot encoding, producing many feature columns from which the important ones are then selected (a sketch of that workflow follows the transcript below).
In [60]: from sklearn.pipeline import FeatureUnion
In [61]: from sklearn.preprocessing import StandardScaler
In [63]: from sklearn.preprocessing import FunctionTransformer
In [64]: from numpy import log1p
In [65]: step1=('Standar',StandardScaler())
In [66]: step2=('ToLog',FunctionTransformer(log1p))
In [67]: steps=FeatureUnion(transformer_list=[step1,step2])
In [68]: steps.fit_transform(iris.data)
Out[68]:
array([[-0.90068117,  1.03205722, -1.3412724 , ...,  1.5040774 ,
         0.87546874,  0.18232156],
       [-1.14301691, -0.1249576 , -1.3412724 , ...,  1.38629436,
         0.87546874,  0.18232156],
       [-1.38535265,  0.33784833, -1.39813811, ...,  1.43508453,
         0.83290912,  0.18232156],
       ...,
       [ 0.79566902, -0.1249576 ,  0.81962435, ...,  1.38629436,
         1.82454929,  1.09861229],
       [ 0.4321654 ,  0.80065426,  0.93335575, ...,  1.48160454,
         1.85629799,  1.19392247],
       [ 0.06866179, -0.1249576 ,  0.76275864, ...,  1.38629436,
         1.80828877,  1.02961942]])
In [69]: data=steps.fit_transform(iris.data)
In [70]: data.shape  # 8 features in the end: 4 standardized + 4 log-transformed, 8 in total
Out[70]: (150, 8)
In [71]: iris.data.shape
Out[71]: (150, 4)
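To round out the "build many features, then select the important ones" workflow described above, here is a minimal sketch that nests the FeatureUnion inside a Pipeline; the SelectKBest step and k=5 are illustrative assumptions, not part of the original:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVC
from numpy import log1p

iris = load_iris()

# The union computes 8 candidate features in parallel (4 scaled + 4 log),
# SelectKBest keeps the k highest-scoring ones, and SVC fits on those.
union = FeatureUnion(transformer_list=[('Standar', StandardScaler()),
                                       ('ToLog', FunctionTransformer(log1p))])
model = Pipeline(steps=[('union', union),
                        ('select', SelectKBest(f_classif, k=5)),  # k=5 is an arbitrary choice
                        ('svc', SVC())])
model.fit(iris.data, iris.target)

# Nested parameters use the same name__param convention, chained per level:
model.set_params(select__k=4, union__Standar__with_std=False)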