sklearn Pipeline 函数用法

没有安全感的鸵鸟

已于 2022-03-04 09:42:47 修改

阅读量2.2k

点赞数 5

文章标签： sklearn 机器学习数据挖掘

于 2022-03-04 09:39:06 首次发布

本文链接：https://blog.csdn.net/qq_39758080/article/details/123268684

版权

本文介绍了sklearn库中的Pipeline类，它允许将多个数据处理步骤和模型训练串联起来，形成一个流水线。通过示例展示了如何创建Pipeline，设置步骤中每个estimator的参数，以及结合GridSearchCV进行超参数调优。Pipeline简化了机器学习流程，避免了在不同数据集上重复相同操作的风险，并提高了代码的可读性和复用性。

摘要由CSDN通过智能技术生成

0 导入包

from sklearn.pipeline import Pipeline

1 定义

Pipeline，中文是管道。相当于将一系列的操作封装成一个函数，可以拿这个函数对其他数据进行相同的（流水线）操作。

class sklearn.pipeline.Pipeline(steps, *, memory=None, verbose=False)

Pipeline是用 (key, value) 的列表构造的，相当于[(key, value), (key, value), ……, (key, value)]的形式
key是字符串，表示这个step的名字
value是是一个estimator的object

2 用例

2.1 构造Pipeline

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> pipe = Pipeline(estimators)
>>> pipe
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])

2.2 设置estimator的参数

可以用<estimator>__<parameter>形式设置参数，第一个参数是estimator的名字，第二个参数是estimator的参数，中间是两个下划线

>>> pipe.set_params(clf__C=10)
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC(C=10))])

2.3 完整用例

>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> X, y = make_classification(random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=0)
>>> pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
>>> # The pipeline can be used as any other estimator
>>> # and avoids leaking the test set into the train set
>>> pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
>>> pipe.score(X_test, y_test)
0.88

2.4 结合GridSearchCV

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = dict(reduce_dim__n_components=[2, 5, 10],
...                   clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)