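Pipeline chains multiple estimators into one. This is useful because a typical machine learning workflow runs a fixed sequence of steps, for example feature extraction, normalization, and classification. A pipeline serves two purposes:
Convenience: you only need to call fit and predict once to fit the whole sequence of estimators
Joint parameter selection: you can grid search over the parameters of all estimators in the pipeline at once
Every estimator in a pipeline except the last must be a transformer (it must have a transform method). The last estimator may be of any type (transformer, classifier, and so on)
A Pipeline is built from a list of (key, value) tuples: the key is a string naming the step, which you can choose freely, and the value is an estimator object
① Instantiate a pipeline object with Pipeline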
In [1]: from sklearn.pipeline import Pipeline
...: from sklearn.svm import SVC
...: from sklearn.decomposition import PCA
...: estimators = [('reduce_dim',PCA()),('clf',SVC())]
...: pipe = Pipeline(estimators)
...: pipe
...:
Out[1]:
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
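The convenience point can be illustrated with a minimal sketch: one fit call trains both steps and one predict call runs the data through both. The iris dataset and n_components=2 are assumptions chosen only to keep the example small; they are not part of the original session.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Illustrative data only: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('clf', SVC())])

# A single fit call fits PCA, transforms X, then fits SVC on the result.
pipe.fit(X, y)

# A single predict call applies the fitted PCA, then the fitted SVC.
pred = pipe.predict(X[:5])
score = pipe.score(X, y)  # mean accuracy on the training data
print(pred, score)
```

Without the pipeline, the same workflow would need a separate fit_transform call on the PCA object before fitting the classifier, and a matching transform call before every prediction.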
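② Build a Pipeline object with make_pipeline
sklearn.pipeline.make_pipeline(*steps): this helper neither requires nor allows you to name the estimators. Each step is named automatically after the lowercase name of its estimator type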
In [2]: from sklearn.naive_bayes import GaussianNB
...: from sklearn.preprocessing import StandardScaler
...: from sklearn.pipeline import make_pipeline
...: make_pipeline(StandardScaler(),GaussianNB(priors=None))
...:
Out[2]: Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('gaussiannb', GaussianNB(priors=None))])
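The automatic naming can be checked directly by reading the names back from the steps attribute. The sketch below also covers the case where the same estimator type appears twice; the variable names are illustrative.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

auto_pipe = make_pipeline(StandardScaler(), GaussianNB())

# Step names are the lowercased class names.
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)

# When a type is repeated, a numeric suffix keeps the names unique.
dup_pipe = make_pipeline(StandardScaler(), StandardScaler())
dup_names = [name for name, _ in dup_pipe.steps]
print(dup_names)
```

These generated names are the prefixes to use later in the <estimator>__<parameter> syntax, for example gaussiannb__priors.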
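Ways to access the estimators in a pipeline:
① The estimators are stored as a list of tuples in the steps attribute, so a specific step can be retrieved by list index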
In [4]: pipe.steps
Out[4]:
[('reduce_dim',
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)),
('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))]
In [5]: pipe.steps[0]
Out[5]:
('reduce_dim',
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False))
② The estimators are also exposed as a dictionary through the named_steps attribute, so a specific estimator can be retrieved by its step name
In [2]: pipe.named_steps
Out[2]:
{'clf': SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
'reduce_dim': PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)}
In [3]: pipe.named_steps['reduce_dim']
Out[3]:
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
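Both access paths return the same estimator object, which is mainly useful after fitting, when you want to inspect what an intermediate step learned. A minimal sketch, assuming the iris dataset and n_components=2 purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('clf', SVC())]).fit(X, y)

# Access by position: each entry of steps is a (name, estimator) tuple.
name, first = pipe.steps[0]

# Access by name: named_steps maps the step name to the estimator.
pca = pipe.named_steps['reduce_dim']

same = first is pca                     # both paths reach the same object
ratios = pca.explained_variance_ratio_  # fitted attribute of the PCA step
print(name, same, ratios)
```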
③ Parameters of the estimators in a pipeline can be set with the <estimator>__<parameter> syntax
In [6]: pipe.set_params(clf__C=10)
Out[6]:
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
In [7]: pipe.get_params()  # get_params takes only the boolean deep flag, not a parameter name; it always returns the full dict
Out[7]:
{'clf': SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
'clf__C': 10,
'clf__cache_size': 200,
'clf__class_weight': None,
'clf__coef0': 0.0,
'clf__decision_function_shape': None,
'clf__degree': 3,
'clf__gamma': 'auto',
'clf__kernel': 'rbf',
'clf__max_iter': -1,
'clf__probability': False,
'clf__random_state': None,
'clf__shrinking': True,
'clf__tol': 0.001,
'clf__verbose': False,
'reduce_dim': PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False),
'reduce_dim__copy': True,
'reduce_dim__iterated_power': 'auto',
'reduce_dim__n_components': None,
'reduce_dim__random_state': None,
'reduce_dim__svd_solver': 'auto',
'reduce_dim__tol': 0.0,
'reduce_dim__whiten': False,
'steps': [('reduce_dim',
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)),
('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))]}
In [9]: pipe.get_params()['clf__C']
Out[9]: 10
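The same syntax can set several nested parameters in one call, and using a bare step name as the keyword replaces the whole estimator. The sketch below is illustrative; swapping in LogisticRegression is an assumption made for the example, not something the session above does.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# Set parameters of two different steps in a single call.
pipe.set_params(reduce_dim__n_components=2, clf__C=10)
params = pipe.get_params()
c_before = params['clf__C']
n_comp = params['reduce_dim__n_components']

# A bare step name replaces the estimator itself.
pipe.set_params(clf=LogisticRegression())
c_after = pipe.get_params()['clf__C']  # now the default C of LogisticRegression
print(c_before, n_comp, c_after)
```

Note that replacing a step discards the parameters that were set on the old estimator, so clf__C falls back to the default of the new one.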
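④ Tune parameters by combining the pipeline with GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Whole steps can be searched over too: None skips the reduce_dim step
param_grid = dict(reduce_dim=[None, PCA(5), PCA(10)],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)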
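The snippet above only builds the search object. A minimal end-to-end sketch of fitting it might look like the following; the iris dataset, the smaller grid (iris has only 4 features, so PCA(5) and PCA(10) would not apply), and cv=3 are all assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# Keys use the <estimator>__<parameter> syntax; 2 x 3 = 6 candidates.
param_grid = {
    'reduce_dim__n_components': [2, 3],
    'clf__C': [0.1, 10, 100],
}

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)  # refits the whole pipeline for every candidate and fold

best = grid_search.best_params_
n_candidates = len(grid_search.cv_results_['params'])
print(best, grid_search.best_score_)
```

Because the PCA step is refit inside every cross-validation fold, the dimensionality reduction never sees the held-out fold, which is the main reason to search over the pipeline rather than over the classifier alone.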
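FeatureUnion combines several transformers into a single new transformer. It takes a list of transformer objects. During fitting, each transformer is fit to the data independently; for transformation, the transformers are applied in parallel, and the sample matrices they output are concatenated side by side into one larger matrix. FeatureUnion serves the same purposes as Pipeline (convenience and joint parameter selection), and the two can be combined to build complex models.
A FeatureUnion is built from a list of (key, value) tuples: the key is a freely chosen string naming the transformation, and the value is an estimator object
① Instantiate a FeatureUnion object with FeatureUnion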
In [12]: from sklearn.pipeline import FeatureUnion
...: from sklearn.decomposition import PCA
...: from sklearn.decomposition import KernelPCA
...: estimators = [('linear_pca',PCA()),('kernel_pca',KernelPCA())]
...: combined = FeatureUnion(estimators)
...: combined
...:
Out[12]:
FeatureUnion(n_jobs=1,
transformer_list=[('linear_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
fit_inverse_transform=False, gamma=None, kernel='linear',
kernel_params=None, max_iter=None, n_components=None, n_jobs=1,
random_state=None, remove_zero_eig=False, tol=0))],
transformer_weights=None)
② Instantiate a FeatureUnion object with make_union
In [14]: from sklearn.pipeline import make_union
...: from sklearn.decomposition import PCA
...: from sklearn.decomposition import KernelPCA
...: make_union(PCA(),KernelPCA())
...:
Out[14]:
FeatureUnion(n_jobs=1,
transformer_list=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('kernelpca', KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
fit_inverse_transform=False, gamma=None, kernel='linear',
kernel_params=None, max_iter=None, n_components=None, n_jobs=1,
random_state=None, remove_zero_eig=False, tol=0))],
transformer_weights=None)
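The column-wise concatenation is easiest to see from the output shape. In this minimal sketch the iris dataset and the component counts (2 and 3) are assumptions picked so that the two blocks are easy to tell apart.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import make_union

X, _ = load_iris(return_X_y=True)  # shape (150, 4)

# Each transformer is fit on the same X independently.
union = make_union(PCA(n_components=2), KernelPCA(n_components=3))
X_combined = union.fit_transform(X)

# The outputs are stacked horizontally: 2 PCA columns, then 3 KernelPCA columns.
print(X_combined.shape)  # (150, 5)
```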
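As with Pipeline, set_params can replace an individual step, and setting a step to None removes it (newer scikit-learn releases expect the string 'drop' here instead of None)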
In [15]: combined.set_params(kernel_pca=None)
Out[15]:
FeatureUnion(n_jobs=1,
transformer_list=[('linear_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', None)],
transformer_weights=None)
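A FeatureUnion can itself be a step of a Pipeline, in which case the double-underscore syntax nests one level deeper. The following is a sketch under stated assumptions: the iris dataset, the step names features and clf, and the component counts are illustrative, and it uses the string 'drop', which recent scikit-learn releases accept for removing a transformer from a FeatureUnion.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

features = FeatureUnion([
    ('linear_pca', PCA(n_components=2)),
    ('kernel_pca', KernelPCA(n_components=2)),
])
model = Pipeline([('features', features), ('clf', SVC())])

model.fit(X, y)
n_before = model.named_steps['features'].transform(X).shape[1]  # 2 + 2 columns

# Nested name: <pipeline step>__<union step>. 'drop' disables that transformer.
model.set_params(features__kernel_pca='drop')
model.fit(X, y)
n_after = model.named_steps['features'].transform(X).shape[1]   # PCA columns only
print(n_before, n_after)
```

The same nested names work as keys in a GridSearchCV parameter grid, so a search can decide whether a branch of the union is worth keeping.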