Spark 2.0 Pipelines, Java Edition

License: CC 4.0 BY-SA

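This article introduces the Pipelines API concepts in Apache Spark MLlib, including the core components Transformer, Estimator, and Pipeline, and explains how these components work together on top of a DataFrame to implement a complete machine learning workflow, from data preprocessing to model training.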

Overview

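MLlib standardizes the APIs of its machine learning algorithms so that they are easier to combine within a single pipeline, or workflow. The pipeline concept is mostly inspired by the scikit-learn library.
The ML API uses the DataFrame from Spark SQL as its machine learning dataset. Different columns of a DataFrame can store text, feature vectors, true labels, and predictions.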

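Transformer: A Transformer is an algorithm that can transform one DataFrame into another DataFrame. For example, it can transform a DataFrame with features into a DataFrame with predictions.
Estimator: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator: it is trained on a training DataFrame and produces a model.
Pipeline: A Pipeline chains multiple Transformers and Estimators together to carry out a specific machine learning workflow.
Parameter: All Transformers and Estimators share a common API for setting their own parameters.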

MLlib standardizes APIs for machine learning algorithms to make it
easier to combine multiple algorithms into a single pipeline, or
workflow. This section covers the key concepts introduced by the
Pipelines API, where the pipeline concept is mostly inspired by the
scikit-learn project.

DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. This API adopts the DataFrame from Spark SQL in order to support a variety of data types.

DataFrame supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types. In addition to the types listed in the Spark SQL guide, DataFrame can use ML Vector types.

A DataFrame can be created either implicitly or explicitly from a
regular RDD. See the code examples below and the Spark SQL programming guide for examples.

Columns in a DataFrame are named. The code examples below use names such as “text,” “features,” and “label.”

Transformer: A Transformer is an algorithm which can transform one
DataFrame into another DataFrame. E.g., an ML model is a Transformer
which transforms a DataFrame with features into a DataFrame with
predictions.

Estimator: An Estimator is an algorithm which can be fit on a
DataFrame to produce a Transformer. E.g., a learning algorithm is an
Estimator which trains on a DataFrame and produces a model.
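
The Transformer and Estimator contracts described above can be illustrated with a small, dependency-free Java sketch. Note that this is a toy model and not the Spark API: the `Transformer` and `Estimator` interfaces, the `ThresholdEstimator` class, and the use of a list of maps as a stand-in for a DataFrame are all assumptions made for illustration. The point it tries to show is only that `fit` consumes training data and returns a Transformer, and that `transform` returns a new dataset with an extra column.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are hypothetical and are NOT Spark classes.
class EstimatorSketch {
    // Stand-in for a DataFrame: a list of rows, each mapping column name -> value.
    interface Transformer {
        List<Map<String, Double>> transform(List<Map<String, Double>> frame);
    }

    interface Estimator {
        Transformer fit(List<Map<String, Double>> frame);
    }

    // Learns a threshold halfway between the mean "feature" value of each class.
    static class ThresholdEstimator implements Estimator {
        @Override
        public Transformer fit(List<Map<String, Double>> training) {
            double sum0 = 0.0, sum1 = 0.0;
            int n0 = 0, n1 = 0;
            for (Map<String, Double> row : training) {
                if (row.get("label") >= 0.5) { sum1 += row.get("feature"); n1++; }
                else { sum0 += row.get("feature"); n0++; }
            }
            final double threshold = (sum0 / n0 + sum1 / n1) / 2.0;
            // The fitted model is itself a Transformer: it adds a "prediction" column.
            return frame -> {
                List<Map<String, Double>> out = new ArrayList<>();
                for (Map<String, Double> row : frame) {
                    Map<String, Double> copy = new HashMap<>(row); // input rows stay unchanged
                    copy.put("prediction", row.get("feature") > threshold ? 1.0 : 0.0);
                    out.add(copy);
                }
                return out;
            };
        }
    }

    static Map<String, Double> row(double feature) {
        Map<String, Double> r = new HashMap<>();
        r.put("feature", feature);
        return r;
    }

    static Map<String, Double> row(double feature, double label) {
        Map<String, Double> r = row(feature);
        r.put("label", label);
        return r;
    }

    static List<Double> demo() {
        List<Map<String, Double>> training = new ArrayList<>();
        training.add(row(1.0, 0.0));
        training.add(row(2.0, 0.0));
        training.add(row(8.0, 1.0));
        training.add(row(9.0, 1.0));
        Transformer model = new ThresholdEstimator().fit(training); // learned threshold: 5.0

        List<Map<String, Double>> test = new ArrayList<>();
        test.add(row(3.0));
        test.add(row(7.0));
        List<Double> predictions = new ArrayList<>();
        for (Map<String, Double> r : model.transform(test)) {
            predictions.add(r.get("prediction"));
        }
        return predictions;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [0.0, 1.0]
    }
}
```

In Spark itself the same roles are played by classes in the `org.apache.spark.ml` package operating on real DataFrames; the sketch only mirrors the shape of the contract.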

Pipeline: A Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow.

Parameter: All Transformers and Estimators now share a common API for
specifying parameters.
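
To make the chaining idea concrete, here is a minimal plain-Java sketch of how a pipeline might fit its stages in order. It is a toy model under stated assumptions, not the Spark implementation: the `Stage`, `Transformer`, and `Estimator` interfaces, the `Splitter` and `VocabularyEstimator` stages, the `fitPipeline` helper, and the list-of-maps stand-in for a DataFrame are all hypothetical names introduced for illustration. Each Estimator stage is fit on the data produced by the stages before it, and the result is a single Transformer that replays every fitted stage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: these types are hypothetical and are NOT Spark classes.
class PipelineSketch {
    interface Stage {}
    interface Transformer extends Stage {
        List<Map<String, Object>> transform(List<Map<String, Object>> frame);
    }
    interface Estimator extends Stage {
        Transformer fit(List<Map<String, Object>> frame);
    }

    @SuppressWarnings("unchecked")
    static List<String> words(Map<String, Object> row) {
        return (List<String>) row.get("words");
    }

    // Transformer: adds a "words" column by splitting the "text" column on whitespace.
    static class Splitter implements Transformer {
        @Override
        public List<Map<String, Object>> transform(List<Map<String, Object>> frame) {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map<String, Object> row : frame) {
                Map<String, Object> copy = new HashMap<>(row);
                copy.put("words", Arrays.asList(((String) row.get("text")).toLowerCase().split("\\s+")));
                out.add(copy);
            }
            return out;
        }
    }

    // Estimator: learns the vocabulary of "words"; the fitted model counts known words per row.
    static class VocabularyEstimator implements Estimator {
        @Override
        public Transformer fit(List<Map<String, Object>> frame) {
            Set<String> vocabulary = new HashSet<>();
            for (Map<String, Object> row : frame) vocabulary.addAll(words(row));
            return input -> {
                List<Map<String, Object>> out = new ArrayList<>();
                for (Map<String, Object> row : input) {
                    int known = 0;
                    for (String word : words(row)) if (vocabulary.contains(word)) known++;
                    Map<String, Object> copy = new HashMap<>(row);
                    copy.put("knownWords", known);
                    out.add(copy);
                }
                return out;
            };
        }
    }

    // Fits the stages in order and returns one Transformer that applies all fitted stages.
    static Transformer fitPipeline(List<Stage> stages, List<Map<String, Object>> training) {
        List<Transformer> fitted = new ArrayList<>();
        List<Map<String, Object>> current = training;
        for (Stage stage : stages) {
            Transformer t = (stage instanceof Estimator) ? ((Estimator) stage).fit(current) : (Transformer) stage;
            fitted.add(t);
            current = t.transform(current); // later stages see the output of earlier ones
        }
        return frame -> {
            List<Map<String, Object>> result = frame;
            for (Transformer t : fitted) result = t.transform(result);
            return result;
        };
    }

    static List<Map<String, Object>> frameOf(String... texts) {
        List<Map<String, Object>> frame = new ArrayList<>();
        for (String text : texts) {
            Map<String, Object> row = new HashMap<>();
            row.put("text", text);
            frame.add(row);
        }
        return frame;
    }

    static List<Integer> demo() {
        List<Stage> stages = Arrays.<Stage>asList(new Splitter(), new VocabularyEstimator());
        Transformer model = fitPipeline(stages, frameOf("spark ml pipeline", "hello spark"));
        List<Integer> counts = new ArrayList<>();
        for (Map<String, Object> row : model.transform(frameOf("spark pipeline rocks", "unseen text"))) {
            counts.add((Integer) row.get("knownWords"));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [2, 0]
    }
}
```

Returning a single Transformer from `fitPipeline` is a deliberate choice in this sketch: the fitted pipeline can then be applied to new data exactly like any other Transformer, which matches the Estimator-produces-Transformer relationship described earlier.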