Pipeline
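A Pipeline chains several estimators together in sequence, for example:
feature extraction -> dimensionality reduction -> fitting -> prediction
A Pipeline as a whole always behaves like its last estimator:
- If the last estimator is a predictor, the Pipeline is a predictor.
- If the last estimator is a transformer, the Pipeline is a transformer.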
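The two cases can be sketched as follows. The toy data and the choice of LogisticRegression as the final predictor are assumptions made for illustration only. A pipeline ending in a classifier offers predict, while a pipeline ending in a scaler offers transform but not predict.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data, made up for this example
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Last step is a predictor, so the pipeline can predict
clf_pipe = Pipeline([("normalize", MinMaxScaler()), ("clf", LogisticRegression())])
pred = clf_pipe.fit(X, y).predict(X)
print(pred.shape)

# Last step is a transformer, so the pipeline can transform
scale_pipe = Pipeline([("normalize", MinMaxScaler())])
X_scaled = scale_pipe.fit_transform(X)
print(X_scaled.ravel())

# Methods that the last step lacks are not available on the pipeline
print(hasattr(clf_pipe, "transform"), hasattr(scale_pipe, "predict"))
```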
Testing a Pipeline as a transformer:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# np.nan marks missing values (the np.NAN alias was removed in NumPy 2.0)
a = np.array([[1, 2, 3, 4, 5, 6, np.nan, 5], [3, 4, 5, 6, np.nan, 3, np.nan, 9]])
X = np.transpose(a)  # transpose to shape (8, 2): one column per feature
print(X)

# Name the SimpleImputer step "impute" and the MinMaxScaler step "normalize".
pipp = Pipeline([
    ("impute", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("normalize", MinMaxScaler()),
])
# The last step is a transformer, so pipp is a transformer too
X_pro = pipp.fit_transform(X)
print(X_pro)

# Run the two steps separately for comparison
aa = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(X)
mms = MinMaxScaler().fit_transform(aa)
print(mms)  # same result as X_pro above
Output:
[[ 1. 3.]
[ 2. 4.]
[ 3. 5.]
[ 4. 6.]
[ 5. nan]
[ 6. 3.]
[nan nan]
[ 5. 9.]]
[[0. 0. ]
[0.2 0.16666667]
[0.4 0.33333333]
[0.6 0.5 ]
[0.8 0.33333333]
[1. 0. ]
[0.54285714 0.33333333]
[0.8 1. ]]
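Because every step has a name, the fitted steps can be looked up afterwards. The sketch below rebuilds the same data and pipeline and reads the learned values through named_steps. It assumes only the fitted attributes statistics_ (SimpleImputer) and data_min_ / data_max_ (MinMaxScaler); the variable names are chosen for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Same data as in the example above
a = np.array([[1, 2, 3, 4, 5, 6, np.nan, 5], [3, 4, 5, 6, np.nan, 3, np.nan, 9]])
X = np.transpose(a)

pipe = Pipeline([
    ("impute", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("normalize", MinMaxScaler()),
])
pipe.fit(X)

# Column means that the "impute" step uses to fill missing values
col_means = pipe.named_steps["impute"].statistics_
# Per-column minimum and maximum seen by the "normalize" step after imputation
data_min = pipe.named_steps["normalize"].data_min_
data_max = pipe.named_steps["normalize"].data_max_
print(col_means, data_min, data_max)
```

With these values the second-to-last row of the output can be checked by hand: the missing entry in the first column is filled with 26/7 and then scaled to (26/7 - 1) / (6 - 1), which is about 0.5429.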
FeatureUnion
To run several estimators side by side at the same node, use FeatureUnion.
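As a minimal sketch of what "side by side" means: FeatureUnion fits each transformer on the same input and concatenates their outputs column-wise. The toy matrix and the pairing of MinMaxScaler with StandardScaler are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data, made up for this example: 3 samples, 2 features
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

union = FeatureUnion([
    ("minmax", MinMaxScaler()),
    ("standard", StandardScaler()),
])
X_union = union.fit_transform(X)

# The same result built by hand: transform separately, then stack the columns
X_manual = np.hstack([
    MinMaxScaler().fit_transform(X),
    StandardScaler().fit_transform(X),
])
print(X_union.shape)
print(np.allclose(X_union, X_manual))
```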
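Strategy:
- Categorical variables: select -> fill with the most frequent value -> one-hot encode
- Numeric variables: select -> fill with the mean -> min-max scale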
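Column selection is done by a custom transformer, DataFrameSelector. The key part is its transform method, which takes the input DataFrame X and returns the values of the columns named in attribute_names.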
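The selection step can be tried in isolation. The class below mirrors the DataFrameSelector used in this post's full example; the small DataFrame and its column names are made up for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the named DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X, y=None):
        return X[self.attribute_names].values


# Hypothetical data: two numeric columns and one text column
df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0], "c": ["x", "y"]})
selected = DataFrameSelector(["a", "b"]).fit_transform(df)
print(type(selected).__name__, selected.shape)
```

Returning a plain array rather than a DataFrame means the later steps in the pipeline no longer see column names, which is why the selector has to be the first step.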
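Next, build a pipeline full_pipe that runs two pipelines in parallel:
- categorical_pipe handles the categorical variables
  - DataFrameSelector selects the columns
  - SimpleImputer fills None with the most frequent value
  - OneHotEncoder encodes and returns a dense (non-sparse) matrix
- numeric_pipe handles the numeric variables
  - DataFrameSelector selects the columns
  - SimpleImputer fills NaN with the mean
  - MinMaxScaler (the "normalize" step) rescales the values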
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the named DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X, y=None):
        return X[self.attribute_names].values


# Build the sample data as a dict (np.nan replaces the removed np.NAN alias)
fe = {
    "height": [1.67, 1.89, np.nan, 1.66, 1.88, np.nan],
    "weight": [56, 78, 92, np.nan, 78, 92],
    "age": [26, 34, 18, 34, 25, 27],
    "love": ["apple", "origine", "piss", "loss", "good", None],
}
X = pd.DataFrame(fe)
categorical_feature = ["love"]
numeric_feature = ["height", "age", "weight"]

# Categorical branch: select -> fill None with the most frequent value -> one-hot encode
categorical_pipe = Pipeline([
    ("select", DataFrameSelector(categorical_feature)),
    ("impute", SimpleImputer(missing_values=None, strategy="most_frequent")),
    # sparse_output replaces the old sparse argument (renamed in scikit-learn 1.2)
    ("one_hot_encode", OneHotEncoder(sparse_output=False)),
])
# Numeric branch: select -> fill NaN with the mean -> scale to [0, 1]
numeric_pipe = Pipeline([
    ("select", DataFrameSelector(numeric_feature)),
    ("impute", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("normalize", MinMaxScaler()),
])
# Run both branches on X and concatenate their outputs column-wise
full_pipe = FeatureUnion(transformer_list=[
    ("numeric_pipe", numeric_pipe),
    ("categorical_pipe", categorical_pipe),
])
x_pro = full_pipe.fit_transform(X)
print(x_pro)
Output (3 scaled numeric columns followed by 5 one-hot columns):
[[0.04347826 0.5 0. 1. 0. 0.
0. 0. ]
[1. 1. 0.61111111 0. 0. 0.
1. 0. ]
[0.5 0. 1. 0. 0. 0.
0. 1. ]
[0. 1. 0.64444444 0. 0. 1.
0. 0. ]
[0.95652174 0.4375 0.61111111 0. 1. 0.
0. 0. ]
[0.5 0.5625 1. 1. 0. 0.
0. 0. ]]