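[Machine Learning] (24) Algorithm Chains and Pipelines: Using the Pipeline Class in Grid Search; the General Pipeline Interface; Selecting a Model with Grid Search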

王亿亿

Article link: https://blog.csdn.net/weixin_43931465/article/details/108070342

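Most machine learning applications involve more than a single algorithm: the data is usually scaled first, features may then be combined by hand, and features may even be learned with unsupervised machine learning.
Algorithm chain: many different processing steps and machine learning models chained together.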

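The Pipeline class merges multiple processing steps into a single scikit-learn estimator; the class itself has fit, predict, and score methods.
The most common use case for the Pipeline class is chaining a preprocessing step (such as data scaling) with a supervised model (such as a classifier).
A pipeline object is built from a list of steps. Each step is a tuple of a name (any string of your choosing) and an estimator instance. A pipeline shortens the code needed for the "preprocessing + classification" process, and the resulting estimator can be used in cross_val_score or GridSearchCV.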
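
As a minimal sketch of using a pipeline inside cross_val_score (the variable names X, y, and scores are chosen here for illustration and are not from the original post), the whole "scale, then classify" chain can be passed as one estimator:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Each step is a (name, estimator instance) tuple.
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# The pipeline behaves like a single estimator: for every fold,
# the scaler and the SVC are both fit on the training part only.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

The exact score depends on the scikit-learn version and its SVC defaults, so no expected value is given here.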

Using pipelines in grid search

from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# Load and split the data
cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Build the pipeline: a list of steps
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(x_train, y_train)
print(pipe.score(x_test, y_test)) # 0.972027972027972

# Parameter grid: (step name)__(parameter name)
param_grid = {'svm__C':[0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma':[0.001, 0.01, 0.1, 1, 10, 100]}

# Grid search
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(x_train, y_train)
print(grid.best_score_) # 0.9812311901504789
print(grid.score(x_test, y_test)) # 0.972027972027972
print(grid.best_params_) # {'svm__C': 1, 'svm__gamma': 1}

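Step names can be chosen freely, but every parameter in the parameter grid must specify which pipeline step it belongs to.
Parameter grid syntax: the step name, followed by a double underscore, followed by the parameter name.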
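
To make the naming rule concrete, the following sketch lists the parameter names a pipeline accepts and sets two of them directly. It rebuilds the same two-step pipeline as above; the variable svm_params is introduced here only for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# get_params() returns every name that a parameter grid may use;
# parameters of a step appear as <step name>__<parameter name>.
svm_params = sorted(k for k in pipe.get_params() if k.startswith("svm__"))
print(svm_params)

# set_params() accepts the same names that GridSearchCV uses internally.
pipe.set_params(svm__C=10, svm__gamma=0.1)
print(pipe.named_steps["svm"].C)  # 10
```

Checking get_params() before writing a grid is a cheap way to catch misspelled step or parameter names.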

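For each split in the cross-validation, MinMaxScaler is fit using only the training part, so no information from the test part leaks into the parameter search.
In cross-validation, the impact of information leakage depends on the nature of the preprocessing step. Using the test part to estimate the scale of the data usually has little effect, whereas using it in feature extraction or feature selection can lead to substantially different results.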
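
The effect of leakage in feature selection can be illustrated with a small experiment on pure noise. This is a sketch under assumptions not in the original post: the synthetic data, the choice of SelectPercentile with f_regression, and the names leaky_score and honest_score are all picked here for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: X and y are independent, so there is nothing real to learn.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10000))
y = rng.normal(size=100)

# Leaky: features are selected using ALL samples before cross-validation,
# so the held-out folds have already influenced which features are kept.
X_selected = SelectPercentile(f_regression, percentile=5).fit_transform(X, y)
leaky_score = cross_val_score(Ridge(), X_selected, y, cv=5).mean()

# Proper: selection sits inside the pipeline and is refit on each training fold.
pipe = Pipeline([
    ("select", SelectPercentile(f_regression, percentile=5)),
    ("ridge", Ridge()),
])
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print("selection outside CV:", leaky_score)
print("selection inside CV: ", honest_score)
```

Because the data is random, a high R^2 in the first case can only come from the leak; the pipeline version should report a score that reflects the absence of any real signal.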

The general Pipeline interface

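The Pipeline class can chain together any number of estimators. Every step except the last must have a transform method. During fitting, the pipeline calls fit and then transform on each step in turn, passing the output of one step's transform as the input to the next; the last step is only fit.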
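
The behavior described above can be approximated with two small functions. This is a simplified sketch, not the actual scikit-learn implementation; fit_steps and predict_steps are hypothetical helper names, and the LogisticRegression final step is chosen only for the demonstration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_steps(steps, X, y):
    """Fit and transform with every step but the last, then fit the last."""
    X_transformed = X
    for name, estimator in steps[:-1]:
        X_transformed = estimator.fit(X_transformed, y).transform(X_transformed)
    steps[-1][1].fit(X_transformed, y)
    return steps


def predict_steps(steps, X):
    """Transform with every step but the last, then predict with the last."""
    X_transformed = X
    for name, estimator in steps[:-1]:
        X_transformed = estimator.transform(X_transformed)
    return steps[-1][1].predict(X_transformed)


X, y = load_breast_cancer(return_X_y=True)

manual = fit_steps(
    [("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))], X, y)
pipe = Pipeline(
    [("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]).fit(X, y)

# The hand-written chain and the real Pipeline should agree on every prediction.
same = np.array_equal(predict_steps(manual, X), pipe.predict(X))
print(same)
```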

Creating pipelines conveniently with make_pipeline

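The make_pipeline function creates a pipeline and names each step automatically after its class. If several steps belong to the same class, a number is appended to the name.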

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Create the pipeline
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
print(pipe.steps)

# Access the attributes of a step
pipe.fit(cancer.data)
components = pipe.named_steps["pca"].components_ # extract the first two principal components from the "pca" step
print(components.shape) # (2, 30)

The steps attribute of a Pipeline object is a list of (name, estimator) tuples; the named_steps attribute is a dictionary-like object that maps step names to estimators.
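
A short sketch of the two attributes, using the same three-step pipeline (the variable step_names is introduced here for illustration):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())

# steps: list of (name, estimator) tuples in execution order.
# Names are the lowercased class names; duplicates get a numeric suffix.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler-1', 'pca', 'standardscaler-2']

# named_steps: maps each name to the very same estimator object.
print(pipe.named_steps["pca"] is pipe.steps[1][1])  # True
```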

Accessing attributes of a pipeline inside a grid search

One of the main reasons to use pipelines is grid search. A common task is to access some of the steps of a pipeline inside a grid search.

from sklearn.linear_model import LogisticRegression

# Build the pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Parameter grid
param_grid = {"logisticregression__C":[0.01, 0.1, 1, 10, 100]}

# Fit the grid search on the dataset
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=4)
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(x_train, y_train)

# Access attributes of the pipeline found by the grid search
print(grid.best_estimator_.named_steps["logisticregression"]) # the step itself
print(grid.best_estimator_.named_steps["logisticregression"].coef_) # the learned weights

The grid.best_estimator_ attribute holds the best model found by GridSearchCV, which here is a pipeline with two steps.

Grid-searching preprocessing steps and model parameters

Use the output of a supervised task (such as regression or classification) to tune the preprocessing parameters.

Using polynomial features of the boston dataset before ridge regression

from sklearn.datasets import load_boston
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

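# Load the dataset (note: load_boston was removed in scikit-learn 1.2, so this code needs an older version)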
boston = load_boston()
x_train, x_test, y_train, y_test = train_test_split(
    boston.data, boston.target, random_state=0)

# Build the pipeline: scale the data, compute polynomial features, ridge regression
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge())
# print(pipe.steps)

# Grid search
param_grid = {'polynomialfeatures__degree':[1, 2, 3], 
              'ridge__alpha':[0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
grid.fit(x_train, y_train)

print(grid.score(x_test, y_test)) # 0.768304546410014
print(grid.best_params_) # {'polynomialfeatures__degree': 2, 'ridge__alpha': 10}
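
To see how the preprocessing parameter interacts with the model parameter, the cross-validation score of every combination can be read from cv_results_. The sketch below is an illustration under assumptions: it swaps in the diabetes dataset (load_diabetes) because load_boston is unavailable in recent scikit-learn releases, uses a smaller grid to keep it fast, and the scores dictionary is a name introduced here.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumption: diabetes data stands in for the boston data used in the post.
X, y = load_diabetes(return_X_y=True)

pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge())
param_grid = {"polynomialfeatures__degree": [1, 2],
              "ridge__alpha": [0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5).fit(X, y)

# Map every (degree, alpha) combination to its mean cross-validation score.
scores = {}
for params, score in zip(grid.cv_results_["params"],
                         grid.cv_results_["mean_test_score"]):
    key = (params["polynomialfeatures__degree"], params["ridge__alpha"])
    scores[key] = score

for (degree, alpha), score in sorted(scores.items()):
    print(f"degree={degree} alpha={alpha}: {score:.3f}")
```

Reading the table row by row shows whether a higher polynomial degree helps only in combination with stronger regularization, which is the kind of interaction a joint grid search is meant to find.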

Using grid search to choose which model to use

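For example, compare a RandomForestClassifier with an SVC: the former needs no preprocessing, while the latter needs scaled data.
This amounts to searching over a space that is not a single regular grid. First name the steps explicitly, then define a list of grids to search, each tuning different parameters.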

Comparing RandomForestClassifier and SVC on the breast cancer dataset

from sklearn.ensemble import RandomForestClassifier

# First instantiate the pipeline with explicitly named steps
pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])

# Parameter grid; a step that should be skipped is set to None
param_grid = [
    {'classifier':[SVC()], 'preprocessing':[StandardScaler(),None],
     'classifier__gamma':[0.001, 0.01, 0.1, 1, 10, 100],
     'classifier__C':[0.001, 0.01, 0.1, 1, 10, 100]},
    {'classifier':[RandomForestClassifier(n_estimators=100)], 
     'preprocessing':[None],
     'classifier__max_features':[1, 2, 3]}]

# Fit the grid search on the dataset
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=4)
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(x_train, y_train)

print(grid.best_score_) # 0.9718194254445965
print(grid.score(x_test, y_test)) # 0.972027972027972
print(grid.best_params_)

ValueError: Invalid parameter ridge_alpha for estimator Pipeline

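This error is raised when only one underscore is typed instead of the double underscore (ridge_alpha instead of ridge__alpha).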
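
The mistake can be reproduced without running a full grid search, since set_params applies the same name check. In this sketch the variable message is introduced only to capture the error text, whose exact wording varies between scikit-learn versions.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), Ridge())

message = None
try:
    # Single underscore: "ridge_alpha" is not a parameter of the pipeline.
    pipe.set_params(ridge_alpha=10)
except ValueError as err:
    message = str(err)
    print("ValueError:", message[:80])

# Double underscore: step name "ridge", then parameter "alpha".
pipe.set_params(ridge__alpha=10)
print(pipe.named_steps["ridge"].alpha)  # 10
```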
