Productivity Booster: Interactive Visualisation of Composite Estimators and Pipelines

In my experience, the lion’s share of the work in a real-life machine learning project is data pre-processing. Training the model on appropriately pre-processed data is a prerequisite for getting accurate predictions later.

It often gets so complex, with sequential pre-processing and transformation steps, that unravelling it becomes very tedious and time-consuming. It gets even more challenging when a colleague has written part of the pre-processing logic and you need to complete the remaining part.

Scikit-Learn introduced rich, interactive visualisation of composite estimators and pipeline structure in version 0.23, released in May 2020.

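Since the interactive diagram is only available from scikit-learn 0.23 onwards, it is worth confirming the installed version before trying the examples (a quick check added here, not part of the original article):

import sklearn
print(sklearn.__version__)  # needs to be 0.23 or later for the diagram display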

In this article, I will illustrate how we can use this major new feature to improve our interpretation of complex sequential pre-processing steps and to pinpoint the erroneous step in case of any problem.

We will be using the famous “Titanic” dataset from the Seaborn package in this article. Most data collected in the real world has missing values. Here, we will use SimpleImputer and KNNImputer to fill the missing values in the sample Titanic dataset, and we need ColumnTransformer to apply a different set of transformations to different columns.

import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer ,KNNImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

First, we will load the sample Titanic dataset and select a few of the available features as independent variables. As shown in the code below, upper-case “X” is conventionally used to denote the independent variables and lower-case “y” the dependent variable.

TitanicDataset = sns.load_dataset("titanic")

X = TitanicDataset[["sex", "age", "fare", "embarked", "who", "pclass", "sibsp"]].copy()
y = TitanicDataset[["survived"]].copy()
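
Before deciding which columns need imputation, it helps to see where the missing values actually are. This quick check is an addition of mine and not part of the original article:

print(X.isna().sum())  # per-column count of missing values in the selected features
print(y.isna().sum())  # missing values in the target, if any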

As the feature “fare” does not have any missing values, we do not need an imputer for it. In the code below, the features are grouped according to the set of pre-processing and transformation steps to be applied to them.

numeric_independent_variables1 = ['fare']
numeric_independent_variables2 = ['age', 'pclass', 'sibsp']
categorical_independent_variables = ["who", "embarked", "sex"]

We have defined three different pipelines for the three groups of features. The first pipeline only involves scaling, while the second applies imputation and then scaling in sequence. For the categorical features, SimpleImputer with the most-frequent strategy is used to fill the missing values, and OneHotEncoder then converts the categorical values into numeric values.

numeric_pipeline1 = Pipeline([('scaler', StandardScaler())])
numeric_pipeline2 = Pipeline([('imputer', KNNImputer(n_neighbors=7)),
                              ('scaler', StandardScaler())])
categorical_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                                 ('onehot', OneHotEncoder(handle_unknown='error'))])

We will use a ColumnTransformer, as we are going to apply a different set of pre-processing and transformation steps to different groups of columns.

consolidated_pipeline = ColumnTransformer([('num1', numeric_pipeline1, numeric_independent_variables1),
                                           ('num2', numeric_pipeline2, numeric_independent_variables2),
                                           ('cat', categorical_pipeline, categorical_independent_variables)])

In the step above, the three pipelines are combined in a single ColumnTransformer.

Finally, we have a nested pipeline: the earlier “consolidated_pipeline”, a “pca” step and a “classifier” step are put into a new pipeline, “clf”.

clf = Pipeline([('consolidated_pipeline', consolidated_pipeline),
                ('pca', PCA(n_components=5)),
                ('classifier', RandomForestClassifier(max_depth=5))])

It is already getting quite complex to interpret the sequence of transformations, even though this pre-processing and transformer sequence is relatively simple compared to real-life projects.

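For comparison, printing the estimator object (standard Python behaviour, not something shown in the original article) gives the default text representation, which quickly becomes hard to scan for nested pipelines:

print(clf)  # the default text repr of a nested pipeline is hard to read at a glance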

Scikit-Learn introduced the set_config function in version 0.19. In version 0.23, released in May 2020, the “display” parameter was added to enable interactive visualisation of composite estimators and pipelines.

from sklearn import set_config
from sklearn.utils import estimator_html_repr
set_config(display='diagram')
display(clf)  # for Jupyter Notebook and Google Colab

Figure: pipeline and estimator visualisation (output of the code discussed in the article)

If you run the code in a Jupyter Notebook or Google Colab, the code above renders the structure of the pipelines and other composite estimators. Clicking on an individual box shows its parameters.

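If you prefer to work outside the interactive widget, the same information is available programmatically, and the plain text representation can be restored at any time. This is standard scikit-learn API, not something covered in the original article:

params = clf.get_params(deep=True)      # nested parameters of every step
print(params['classifier__max_depth'])  # e.g. the RandomForestClassifier depth
set_config(display='text')              # switch back to the default text representation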

If you are using IDLE, you can save the full structure to your local machine with the code below. As no folder path is provided, the HTML file will be saved in the same location as the main Python program.

with open('estimator.html', 'w') as f:
    f.write(estimator_html_repr(clf))  # save the diagram as an HTML file

Instead of going through all of the pre-processing and transformation code, this tree-structure visualisation lets us understand the pipeline and transformation sequence at a glance. For lengthy, complex and nested pipelines and pre-processing steps, it helps us locate the problem area quickly and is a real productivity booster.

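The diagram only describes the structure; fitting and scoring the assembled pipeline works exactly as usual. Below is a minimal sketch of that last step, with split parameters of my own choosing rather than anything from the original article:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y.values.ravel(), test_size=0.2, random_state=0)

clf.fit(X_train, y_train)         # runs imputation, scaling, encoding, PCA and the classifier
print(clf.score(X_test, y_test))  # accuracy on the held-out split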

Translated from: https://towardsdatascience.com/productivity-booster-interactive-visualisation-of-composite-estimator-and-pipeline-407ab780671a
