Spark-Pipeline

Study notes. If anything here infringes copyright, please let me know and I will remove it.

Intro

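MLlib's Pipeline API is built on top of DataFrames. It provides a standard API for machine learning algorithms, which makes it easy to combine several algorithms into a single Pipeline, or workflow.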

  • DataFrame: the ML API uses the DataFrame from Spark SQL as an ML dataset. It can hold a variety of data types; for example, different columns can store text, feature vectors, true labels, and predictions.
  • Estimator: an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model.
  • Transformer: an algorithm that transforms one DataFrame into another DataFrame. For example, an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.
  • Pipeline: chains multiple Transformers and Estimators together to specify an ML workflow.
  • Parameter: all Transformers and Estimators share a common API for specifying parameters.
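The Estimator/Transformer contract described above can be sketched in plain Python. This is only a toy model, not Spark code: a list of row dicts stands in for a DataFrame, and the class names MajorityLabelEstimator and MajorityLabelModel are hypothetical names made up for illustration.

```python
from collections import Counter

# Toy stand-in for a DataFrame: a list of row dicts (NOT Spark's DataFrame).

class MajorityLabelModel:
    """Transformer: turns one dataset into another by adding a 'prediction' column."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        # Build new rows; the input dataset is left unchanged.
        return [{**row, "prediction": self.label} for row in rows]


class MajorityLabelEstimator:
    """Estimator: fit() learns from a dataset and returns a Transformer (the model)."""

    def fit(self, rows):
        # "Training" here is just finding the most frequent label.
        counts = Counter(row["label"] for row in rows)
        most_common_label = counts.most_common(1)[0][0]
        return MajorityLabelModel(most_common_label)


training = [
    {"text": "spark is fast", "label": 1.0},
    {"text": "hadoop mapreduce", "label": 0.0},
    {"text": "spark pipeline", "label": 1.0},
]

model = MajorityLabelEstimator().fit(training)          # Estimator -> Transformer
predictions = model.transform([{"text": "spark ml"}])   # dataset -> dataset
```

The point of the sketch is the shape of the two interfaces: fit() consumes data and returns an object that has transform(), and transform() returns a new dataset with an extra column.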

Pipeline

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator.
The input DataFrame is transformed as it passes through each stage.

For Transformer stages, the transform() method is called on the DataFrame.

For Estimator stages, the fit() method is called to produce a Transformer, which becomes part of the PipelineModel (the fitted Pipeline).
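The stage-by-stage behaviour described above can be sketched as a small pure-Python loop. This is a simplified illustration under stated assumptions, not Spark's implementation: rows are plain dicts, every stage class (Lowercase, Tokenize, VocabularyEstimator, VocabularyModel) is hypothetical, and the sketch transforms the data after every stage for simplicity.

```python
class Lowercase:
    """Transformer stage: lowercases the 'text' column."""
    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]


class Tokenize:
    """Transformer stage: splits 'text' into a 'words' column."""
    def transform(self, rows):
        return [{**r, "words": r["text"].split()} for r in rows]


class VocabularyModel:
    """Transformer produced by fitting: counts each vocabulary word per row."""
    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [{**r, "features": [r["words"].count(w) for w in self.vocab]}
                for r in rows]


class VocabularyEstimator:
    """Estimator stage: learns the vocabulary from the training rows."""
    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r["words"]})
        return VocabularyModel(vocab)


class PipelineModel:
    """Fitted pipeline: every stage is now a Transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)    # Estimator: fit() yields a Transformer
            fitted.append(stage)
            rows = stage.transform(rows)   # data is transformed as it passes through
        return PipelineModel(fitted)


training = [{"text": "Spark ML", "label": 1.0}, {"text": "Hadoop", "label": 0.0}]
model = Pipeline([Lowercase(), Tokenize(), VocabularyEstimator()]).fit(training)
output = model.transform([{"text": "SPARK spark hadoop"}])
```

After fitting, the third stage of the PipelineModel is the VocabularyModel rather than the VocabularyEstimator, which mirrors how a fitted Pipeline contains only Transformers.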

Fit

[Figure: Pipeline at training time (fit)]

The figure shows the training-time usage of a Pipeline.

The top row represents a Pipeline with three stages: the first two are Transformers and the third is an Estimator.

The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames.

If the Pipeline had more Estimators after the logistic regression stage, it would call the fitted LogRegModel's transform() method on the DataFrame before passing the DataFrame to the next stage.
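To make this last point concrete, here is a minimal sketch that logs which methods run while a pipeline is fitted. It is a toy model written under the assumption that a fitted model's transform() is only needed when a later stage still has to be fitted; the Estimator and Model classes, the fit_pipeline helper, and the stage name "Next" are all hypothetical.

```python
calls = []  # records the order of fit()/transform() calls


class Model:
    """Toy fitted Transformer that only logs its transform() call."""
    def __init__(self, name):
        self.name = name

    def transform(self, rows):
        calls.append(f"{self.name}.transform")
        return rows


class Estimator:
    """Toy Estimator that logs its fit() call and returns a Model."""
    def __init__(self, name):
        self.name = name

    def fit(self, rows):
        calls.append(f"{self.name}.fit")
        return Model(self.name + "Model")


def fit_pipeline(stages, rows):
    """Fit stages in order, transforming only while a later Estimator needs data."""
    last_estimator = max(
        (i for i, s in enumerate(stages) if hasattr(s, "fit")), default=-1
    )
    fitted = []
    for i, stage in enumerate(stages):
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)
        fitted.append(stage)
        if i < last_estimator:
            rows = stage.transform(rows)  # feed the next stage transformed data
    return fitted


# Case 1: LogReg is the final stage, so its model is never asked to transform.
fit_pipeline([Estimator("LogReg")], rows=[])
single = list(calls)

# Case 2: another Estimator follows, so LogRegModel.transform() runs first.
calls.clear()
fit_pipeline([Estimator("LogReg"), Estimator("Next")], rows=[])
double = list(calls)
```

In the first case the log contains only the fit() call; in the second, the fitted model's transform() appears between the two fit() calls, because the following stage has to be trained on the transformed data.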

Transform
