Spark-Pipeline

wendaocp

Category: AI / BigData / Cloud. Tags: big data

Article link: https://blog.csdn.net/wendaocp/article/details/106589518

Note

Study notes. If anything here infringes copyright, please let me know and I will remove it.

Intro

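Built on top of DataFrames. MLlib provides a standard API for machine learning algorithms, which makes it easy to combine multiple algorithms into a single Pipeline, or workflow.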

DataFrame: the ML API uses the DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types, e.g. different columns storing text, feature vectors, true labels, and predictions.
Estimator: an algorithm which can be fit on a DataFrame to produce a Transformer, e.g. a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
Transformer: an algorithm which can transform one DataFrame into another DataFrame, e.g. an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
Pipeline: chains multiple Transformers and Estimators together to specify an ML workflow.
Parameter: all Transformers and Estimators share a common API for specifying parameters.
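
As a minimal sketch of how these terms map onto the API, the snippet below fits a LogisticRegression (an Estimator) on a tiny DataFrame and then uses the resulting model (a Transformer) to append prediction columns. The object name ConceptsSketch, the helper fitAndTransform, and the use of SparkSession (which assumes Spark 2.0 or later) are illustrative assumptions, not part of the original notes.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object name, used only for this illustration.
object ConceptsSketch {

  // Fits an Estimator on a toy DataFrame and applies the resulting Transformer.
  def fitAndTransform(spark: SparkSession): DataFrame = {
    // DataFrame: one column of labels, one column of feature vectors (toy values).
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, -1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Estimator: the learning algorithm; parameters go through the shared Param API.
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

    // fit() trains on the DataFrame and returns a model, which is a Transformer.
    val model: LogisticRegressionModel = lr.fit(training)

    // transform() returns a new DataFrame with prediction columns appended.
    model.transform(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("concepts-sketch").getOrCreate()
    fitAndTransform(spark).select("features", "label", "prediction").show()
    spark.stop()
  }
}
```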

Pipeline

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator.
The stages run in order, and the input DataFrame is transformed as it passes through each stage.

For Transformer stages, the transform() method is called on the DataFrame.

For Estimator stages, the fit() method is called to produce a Transformer (which becomes a part of the PipelineModel, or fitted Pipeline).

Fit

[Figure: training-time usage of a Pipeline]

The top row of the figure represents a Pipeline with three stages: the first two are Transformers and the third is an Estimator.

The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames.

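The third stage is a LogisticRegression Estimator, so the Pipeline calls its fit() method to produce a LogisticRegressionModel. If the Pipeline had more stages after it, the Pipeline would call the model’s transform() method on the DataFrame before passing the DataFrame to the next stage.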
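
To make the training-time flow concrete, here is a rough sketch that walks the three stages by hand instead of calling Pipeline.fit(). It is only an approximation of what the Pipeline does internally; the object name ManualFitSketch, the helper fitByHand, and the use of SparkSession are assumptions, and the toy rows mirror the ones used in the code section of these notes.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object ManualFitSketch {

  // Walks the three stages by hand and returns the fitted stages, all of them Transformers.
  def fitByHand(spark: SparkSession): Array[Transformer] = {
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

    // Stages 1 and 2 are Transformers: transform() is called on the DataFrame.
    val words = tokenizer.transform(training)
    val features = hashingTF.transform(words)

    // Stage 3 is an Estimator: fit() is called and yields a Transformer (the model).
    val lrModel = lr.fit(features)

    Array[Transformer](tokenizer, hashingTF, lrModel)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("manual-fit-sketch").getOrCreate()
    fitByHand(spark).foreach(stage => println(stage.getClass.getSimpleName))
    spark.stop()
  }
}
```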

Transform

[Figure: test-time usage of a PipelineModel]

The PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers.

When the PipelineModel’s transform() is called on a test dataset, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage.

Pipelines and PipelineModels help to ensure that training and test data go through identical feature processing steps.
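
The sketch below fits a three-stage Pipeline and inspects the stages of the resulting PipelineModel, to illustrate that every fitted stage is a Transformer and that unlabeled test rows pass through the same feature steps. The object name FittedStagesSketch, the helper fitPipeline, and the use of SparkSession are assumptions made for illustration; the toy rows mirror the ones in the code section.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object FittedStagesSketch {

  // Fits a three-stage Pipeline on toy data and returns the fitted PipelineModel.
  def fitPipeline(spark: SparkSession): PipelineModel = {
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

    new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("fitted-stages-sketch").getOrCreate()
    val model = fitPipeline(spark)

    // Each fitted stage is a Transformer; the last one is a model rather than an Estimator.
    model.stages.foreach(stage => println(stage.getClass.getSimpleName))

    // Unlabeled test rows go through the same tokenizer and hashingTF before prediction.
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n")
    )).toDF("id", "text")
    model.transform(test).select("id", "text", "prediction").show()

    spark.stop()
  }
}
```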

DAG

A Pipeline’s stages are specified as an ordered array.

It is possible to create non-linear Pipelines as long as the data flow graph forms a Directed Acyclic Graph.

The graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.
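
As a hedged illustration of a non-linear Pipeline, the sketch below builds a small DAG in which two HashingTF branches both read the "words" column and a VectorAssembler merges them. It then validates two stage orderings against a schema with transformSchema(), so no data or Spark session is needed. The object name DagOrderSketch, the helper isValidOrder, and the column names tfSmall and tfLarge are made up for this example.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, VectorAssembler}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import scala.util.Try

// Hypothetical object name, used only for this illustration.
object DagOrderSketch {
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

  // Two branches that both read the "words" column.
  val tfSmall = new HashingTF().setNumFeatures(16).setInputCol("words").setOutputCol("tfSmall")
  val tfLarge = new HashingTF().setNumFeatures(1024).setInputCol("words").setOutputCol("tfLarge")

  // The branches are merged by a stage that reads both branch outputs.
  val assembler = new VectorAssembler()
    .setInputCols(Array("tfSmall", "tfLarge"))
    .setOutputCol("features")

  val inputSchema = StructType(Seq(StructField("text", StringType)))

  // Validates a stage ordering against the input schema without touching any data.
  def isValidOrder(stages: Array[PipelineStage]): Boolean =
    Try(new Pipeline().setStages(stages).transformSchema(inputSchema)).isSuccess

  def main(args: Array[String]): Unit = {
    // Topological order: every stage appears after the stages that produce its inputs.
    println(isValidOrder(Array[PipelineStage](tokenizer, tfSmall, tfLarge, assembler)))

    // The assembler is listed before the columns it reads exist, so validation fails.
    println(isValidOrder(Array[PipelineStage](tokenizer, assembler, tfSmall, tfLarge)))
  }
}
```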

Runtime checking: since Pipelines can operate on DataFrames with varied types, they cannot use compile-time type checking.
Pipelines and PipelineModels instead do runtime checking before actually running the Pipeline.
This type checking is done using the DataFrame schema, a description of the data types of columns in the DataFrame.
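
A small sketch of this schema-based check, assuming only a Tokenizer stage: transformSchema() is called with a schema, not with data, and it fails when the input column has the wrong type. The object name SchemaCheckSketch and the helper outputFields are hypothetical.

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import scala.util.Try

// Hypothetical object name, used only for this illustration.
object SchemaCheckSketch {
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

  // The Tokenizer expects a string input column.
  val goodSchema = StructType(Seq(StructField("text", StringType)))
  val badSchema = StructType(Seq(StructField("text", IntegerType)))

  // Returns the output column names, or a Failure if the schema check rejects the input.
  def outputFields(schema: StructType): Try[Seq[String]] =
    Try(tokenizer.transformSchema(schema).fieldNames.toSeq)

  def main(args: Array[String]): Unit = {
    println(outputFields(goodSchema)) // succeeds, with the "words" column appended
    println(outputFields(badSchema))  // fails before any data is processed
  }
}
```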

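Unique Pipeline stages: a Pipeline’s stages should be unique instances, because each stage is identified by a unique ID. The same instance should not be inserted into a Pipeline twice, but two different instances of the same type can be, since they are created with different IDs.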
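
The following sketch shows the IDs directly through the uid field, and uses transformSchema() to check which stage lists pass validation. The object name UniqueStagesSketch, the helper validates, and the column names wordsA and wordsB are assumptions for illustration only.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import scala.util.Try

// Hypothetical object name, used only for this illustration.
object UniqueStagesSketch {
  // Two separate instances of the same type: each one is created with its own uid.
  val first = new Tokenizer().setInputCol("text").setOutputCol("wordsA")
  val second = new Tokenizer().setInputCol("text").setOutputCol("wordsB")

  val inputSchema = StructType(Seq(StructField("text", StringType)))

  // True if a Pipeline built from these stages passes schema validation.
  def validates(stages: Array[PipelineStage]): Boolean =
    Try(new Pipeline().setStages(stages).transformSchema(inputSchema)).isSuccess

  def main(args: Array[String]): Unit = {
    println(s"first.uid = ${first.uid}, second.uid = ${second.uid}")

    // Two different instances of the same type are accepted.
    println(validates(Array[PipelineStage](first, second)))

    // Reusing the same instance is expected to be rejected during validation.
    println(validates(Array[PipelineStage](first, first)))
  }
}
```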

Parameters

MLlib Estimators and Transformers use a uniform API for specifying parameters.
A Param is a named parameter with self-contained documentation.
A ParamMap is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

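Set parameters on an instance through its setter methods, e.g. logreg.setMaxIter(10).
Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will override parameters previously specified via setter methods.
Parameters belong to specific instances of Estimators and Transformers. For example, with two LogisticRegression instances lr1 and lr2, one ParamMap can hold both maxIter values: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20).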
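
The precedence rule can be checked without training anything. The sketch below uses extractParamMap(extra), which merges default values, setter values, and the extra ParamMap, with the extra values winning; fit(dataset, paramMap) applies the same precedence. The object name ParamPrecedenceSketch and the helper effectiveMaxIter are hypothetical.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

// Hypothetical object name, used only for this illustration.
object ParamPrecedenceSketch {
  // Both instances start with maxIter = 5, set through the setter.
  val lr1 = new LogisticRegression().setMaxIter(5)
  val lr2 = new LogisticRegression().setMaxIter(5)

  // One ParamMap can carry values for several instances; each Param is tied to its owner.
  val overrides: ParamMap = ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)

  // Looks up maxIter after merging defaults, setter values, and `extra` (which takes precedence).
  def effectiveMaxIter(lr: LogisticRegression, extra: ParamMap): Int =
    lr.extractParamMap(extra)(lr.maxIter)

  def main(args: Array[String]): Unit = {
    println(effectiveMaxIter(lr1, ParamMap.empty)) // setter value only
    println(effectiveMaxIter(lr1, overrides))      // ParamMap entry for lr1 overrides the setter
    println(effectiveMaxIter(lr2, overrides))      // ParamMap entry for lr2 overrides the setter
  }
}
```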

Saving and Loading Pipelines

model.write.overwrite().save("/abc/model")
val model = PipelineModel.load("/abc/model")

Code

package spark

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.{SparkConf, SparkContext}


object TestPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("test")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new SQLContext(sc)

//    train_test(sqlContext)

//    pipelineFitSave(sqlContext)


    pipelineLoadTransform(sqlContext)


  }

  def train_test(sqlContext: SQLContext): Unit = {
    // training dataset with label/features
    val training = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, -1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // logreg instance
    val lr = new LogisticRegression() // instance
    lr.setMaxIter(10).setRegParam(0.01) // set params

    // fit
    val model1 = lr.fit(training)

    println("model1's params: " + model1.parent.extractParamMap)

    // set params using ParamMap
    val paramMap = ParamMap(lr.maxIter -> 20).put(lr.maxIter, 30).put(lr.regParam -> 0.1, lr.threshold -> 0.55)
    val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")
    val paramMapCombined = paramMap ++ paramMap2 // combine the two ParamMaps

    // now, use combined paramMap to train a model
    val model2 = lr.fit(training, paramMapCombined)
    println("model2's params: "+ model2.parent.extractParamMap)


    // ===============================================================
    // ========================test=============================
    // ===============================================================

    val test = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
      (0.0, Vectors.dense(3.0, 2.0, -0.1)),
      (1.0, Vectors.dense(0.0, 2.2, -1.5))
    )).toDF("label", "features")

    model2.transform(test).select("features", "label", "myProbability", "prediction")
      .collect()
      .foreach {
        case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
          println(s"($features, $label) -> prob=$prob, prediction=$prediction")
      }
  }

The code below sets up multiple stages in a pipeline and then saves everything in a single call.

  def pipelineFitSave(sqlContext: SQLContext): Unit = {

    // data
    val training = sqlContext.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // 3 stages: tokenizer, hashingTF, lr; parameters are set directly through setters here
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
    val logisticRegression = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
    
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, logisticRegression)) // set stages in pipeline

    // fit
    val model = pipeline.fit(training) // parameters can also be passed here via a ParamMap

    // save; the advantage of saving through the pipeline is a single write to disk
    model.write.overwrite().save("/atest/spark/lrmodel") // fitted pipeline (after fit)
    pipeline.write.overwrite().save("/atest/spark/unfit-lrmodel") // unfitted pipeline

    println("over")

  }

The code below loads everything back in a single call.

  def pipelineLoadTransform(sqlContext: SQLContext): Unit = {
    val pipeline: Pipeline   = Pipeline.load("/atest/spark/unfit-lrmodel")
    val model: PipelineModel = PipelineModel.load("/atest/spark/lrmodel")
    
    val test = sqlContext.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "spark hadoop spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")

    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach {
        case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
          println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }


  }

}
