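Many data transformation steps have to be executed in the right order. sklearn provides the Pipeline class to chain such a sequence of transformations.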
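from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# CombinedAttributesAdder (a custom transformer) and housing_num (the numerical
# columns of the housing data) are not defined in this section.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),   # fill missing values with the median
    ('attribs_adder', CombinedAttributesAdder()),    # add combined attributes
    ('std_scaler', StandardScaler()),                # standardize the features
])

housing_num_tr = num_pipeline.fit_transform(housing_num)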
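The Pipeline constructor takes a list of name/estimator pairs that defines the sequence of steps. The last step can be any estimator, but all the steps before it must be transformers (that is, they must have a fit_transform() method).
When you call the pipeline's fit() method, it calls fit_transform() on each transformer in order, passing the output of each call as the input to the next, until it reaches the final estimator, on which it calls only fit().
The pipeline exposes the same methods as its final estimator. In the example above the last step is a StandardScaler, which is a transformer, so the pipeline has transform() and fit_transform() methods.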
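The call sequence described above can be checked with a minimal sketch. The toy data and the choice of LinearRegression as the final estimator are assumptions made for illustration only; they are not part of the housing example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): y is an exact linear function of X.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 35.0], [4.0, 40.0]])
y = X @ np.array([2.0, 0.5]) + 1.0

pipe = Pipeline([
    ("std_scaler", StandardScaler()),   # transformer: fit_transform() is called
    ("lin_reg", LinearRegression()),    # final estimator: only fit() is called
])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)             # predict() exists because the last step has it

# The same steps written out by hand.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lin_reg = LinearRegression().fit(X_scaled, y)
manual_pred = lin_reg.predict(scaler.transform(X))

# Both routes should give the same predictions.
same = np.allclose(pipe_pred, manual_pred)
```

Because the final step here is a regressor, this pipeline offers predict(); a pipeline ending in a transformer would offer transform() instead.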
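The FeatureUnion class
You only need to give it a list of transformers (each of which can be an entire transformer pipeline). When its transform() method is called, it runs each transformer's transform() method on the same input (in parallel if n_jobs is set), waits for their outputs, then concatenates them and returns the result. Likewise, calling its fit() method calls each transformer's fit() method.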
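The concatenation behaviour described above can be seen in a small self-contained sketch. The random input and the pairing of StandardScaler with MinMaxScaler are assumptions chosen only to keep the example short.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up input: 6 samples, 3 features.
rng = np.random.RandomState(0)
X = rng.rand(6, 3)

union = FeatureUnion(transformer_list=[
    ("std_scaler", StandardScaler()),
    ("minmax_scaler", MinMaxScaler()),
])
X_union = union.fit_transform(X)

# Running each transformer separately and stacking the columns side by side
# should reproduce the FeatureUnion output.
X_manual = np.hstack([
    StandardScaler().fit_transform(X),
    MinMaxScaler().fit_transform(X),
])
matches = np.allclose(X_union, X_manual)
```

Each transformer receives the full input, so the result has 3 + 3 = 6 columns, in the order the transformers are listed.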
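from sklearn.base import TransformerMixin  # provides fit_transform() for free
from sklearn.preprocessing import LabelBinarizer

# Thin wrapper so that LabelBinarizer can be used as a pipeline step:
# fit() and transform() accept (x, y) but pass only x to the encoder.
class MyLabelBinarizer(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)

    def fit(self, x, y=None):
        self.encoder.fit(x)
        return self

    def transform(self, x, y=None):
        return self.encoder.transform(x)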
from sklearn.base import BaseEstimator, TransformerMixin

# Selects the given columns of a DataFrame and returns them as a NumPy array.
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.attribute_names].values
from sklearn.pipeline import FeatureUnion

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', MyLabelBinarizer()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)