Spark-Pipeline

Learning notes. If any of this infringes copyright, please let me know and it will be removed.

Intro

Spark's ML Pipelines are built on top of DataFrames. MLlib provides a standard API for machine learning algorithms, which makes it easy to combine different algorithms into a single Pipeline, or workflow.

  • DataFrame: from Spark SQL, used as the ML dataset; it can hold a variety of data types, e.g. different columns storing text, feature vectors, true labels, and predictions.
  • Estimator: an algorithm that can be fit on a DataFrame to produce a Transformer. A learning algorithm is an Estimator that trains on a DataFrame and produces a model (see the short sketch after this list).
  • Transformer: an algorithm that can transform one DataFrame into another, e.g. an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.
  • Pipeline: chains multiple Transformers and Estimators together to specify an ML workflow.
  • Parameter: all Transformers and Estimators share a common API for specifying parameters.
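
A minimal sketch of the Estimator/Transformer contract (trainingDF and testDF are placeholders; full runnable examples follow at the end of these notes):

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()          // an Estimator
// val model  = lr.fit(trainingDF)         // fit() produces a Transformer (a LogisticRegressionModel)
// val scored = model.transform(testDF)    // transform() appends prediction columns to the DataFrame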

Pipeline

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator.
The input DataFrame is transformed as it passes through each stage.

For Transformer stages, the transform() method is called on the DataFrame.

For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline).
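
For intuition, a simplified sketch of this dispatch (illustration only, not the actual MLlib implementation): Transformer stages simply transform the DataFrame, while Estimator stages are fit and the resulting model transforms the data before it reaches the next stage.

import org.apache.spark.ml.{Estimator, PipelineStage, Transformer}
import org.apache.spark.sql.DataFrame

// rough sketch of what a Pipeline does with its stages at training time
def fitStages(stages: Seq[PipelineStage], dataset: DataFrame): Seq[Transformer] = {
  var df = dataset
  stages.map {
    case t: Transformer =>
      df = t.transform(df)                            // Transformer stage: call transform()
      t
    case e: Estimator[_] =>
      val model = e.fit(df).asInstanceOf[Transformer] // Estimator stage: fit() produces a model
      df = model.transform(df)                        // pass the transformed data onward
      model
  }
}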

Fit

[Figure: Pipeline.fit() at training time]

The figure shows the training-time usage of a Pipeline.

The top row shows three stages: the first two are Transformers and the third is an Estimator (in the code below: Tokenizer, HashingTF, and LogisticRegression).

The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames.

If the Pipeline had more stages after the LogisticRegressionModel, it would call the model’s transform() method on the DataFrame before passing it to the next stage.

Transform

[Figure: PipelineModel.transform() at test time]

After fitting, all Estimators in the original Pipeline have become Transformers in the PipelineModel.

When the PipelineModel’s transform() method is called on a test dataset, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage.

Pipelines and PipelineModels help ensure that training and test data go through identical feature processing steps.
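
One way to see that the Estimator stages have become Transformers is to inspect the fitted model's stages (a small sketch, assuming a PipelineModel named model fitted from the tokenizer/hashingTF/logistic-regression pipeline shown in the code section below):

import org.apache.spark.ml.classification.LogisticRegressionModel

// the last stage of the fitted PipelineModel is now a LogisticRegressionModel (a Transformer)
val fittedLr = model.stages.last.asInstanceOf[LogisticRegressionModel]
println(s"coefficients: ${fittedLr.coefficients}, intercept: ${fittedLr.intercept}")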

DAG

A Pipeline’s stages are specified as an ordered array.

It is possible to create non-linear Pipelines as long as the data flow graph forms a Directed Acyclic Graph.

The graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.
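
A hedged sketch of such a non-linear Pipeline: two text columns (the column names here are made up for illustration) are tokenized and hashed independently, then merged by a VectorAssembler. The data-flow graph branches and re-joins, and the stage array below is one valid topological order.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, VectorAssembler}

val titleTok = new Tokenizer().setInputCol("title").setOutputCol("titleWords")
val bodyTok  = new Tokenizer().setInputCol("body").setOutputCol("bodyWords")
val titleTF  = new HashingTF().setInputCol("titleWords").setOutputCol("titleFeatures")
val bodyTF   = new HashingTF().setInputCol("bodyWords").setOutputCol("bodyFeatures")
val assembler = new VectorAssembler()
  .setInputCols(Array("titleFeatures", "bodyFeatures"))
  .setOutputCol("features")

// stages listed in topological order: each stage's input columns exist before it runs
val dagPipeline = new Pipeline()
  .setStages(Array(titleTok, titleTF, bodyTok, bodyTF, assembler))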

Runtime checking: since Pipelines can operate on DataFrames with varied types, they cannot use compile-time type checking.
Pipelines and PipelineModels instead do runtime checking before actually running the Pipeline.
This type checking is done using the DataFrame schema, a description of the data types of columns in the DataFrame.
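
For example, the schema that runtime checking works against can be inspected with printSchema() (output roughly as sketched in the comments, assuming the training DataFrame used in the code below); a stage such as Tokenizer will fail this check if its input column is not a string column.

training.printSchema()
// root
//  |-- id: long (nullable = false)
//  |-- text: string (nullable = true)
//  |-- label: double (nullable = false)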

Unique Pipeline stages: a Pipeline’s stages should be unique instances; every stage instance is identified by a unique ID.
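
A short sketch of this uniqueness rule (the second HashingTF and its output column are hypothetical, just to show that a fresh instance gets a fresh ID):

import org.apache.spark.ml.feature.HashingTF

val tf  = new HashingTF().setInputCol("words").setOutputCol("features")
val tf2 = new HashingTF().setInputCol("words").setOutputCol("features2")
println(tf.uid == tf2.uid)                   // false: each instance has its own unique ID
// new Pipeline().setStages(Array(tf, tf))   // not allowed: the same instance appears twice
// new Pipeline().setStages(Array(tf, tf2))  // fine: two distinct instances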

Parameters

MLlib Estimators and Transformers use a uniform API for specifying parameters.
A Param is a named parameter with self-contained documentation.
A ParamMap is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

  1. Set parameters on an instance via setter methods, e.g. logreg.setMaxIter(10).
  2. Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap override parameters previously specified via setter methods.
    For example, ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20) sets maxIter on two different LogisticRegression instances lr1 and lr2 (see the short sketch below).
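
A minimal sketch of the override behaviour (assuming a training DataFrame like the one in the code section below): the value in the ParamMap passed to fit() wins over the earlier setter call.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr = new LogisticRegression().setMaxIter(10)          // set via setter
val overrides = ParamMap(lr.maxIter -> 30, lr.regParam -> 0.1)
// val model = lr.fit(training, overrides)                // trained with maxIter = 30, not 10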

Saving and Loading Pipelines

model.write.overwrite().save("/abc/model")    // persist a fitted PipelineModel
val model = PipelineModel.load("/abc/model")  // load it back later

Code

package spark

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.{SparkConf, SparkContext}


object TestPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("test")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new SQLContext(sc)

    // run one of the examples below
//    train_test(sqlContext)
//    pipelineFitSave(sqlContext)
    pipelineLoadTransform(sqlContext)
  }

  def train_test(sqlContext: SQLContext): Unit = {
    // training dataset with label/features
    val training = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, -1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // logreg instance
    val lr = new LogisticRegression() // instance
    lr.setMaxIter(10).setRegParam(0.01) // set params

    // fit
    val model1 = lr.fit(training)

    println("model1's params: " + model1.parent.extractParamMap)

    // set params using ParamMap
    val paramMap = ParamMap(lr.maxIter -> 20)
      .put(lr.maxIter, 30)                             // overrides the previous maxIter value
      .put(lr.regParam -> 0.1, lr.threshold -> 0.55)   // specify multiple params at once
    val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")
    val paramMapCombined = paramMap ++ paramMap2 // combine the two ParamMaps

    // now, use combined paramMap to train a model
    val model2 = lr.fit(training, paramMapCombined)
    println("model2's params: "+ model2.parent.extractParamMap)


    // ===============================================================
    // ========================test=============================
    // ===============================================================

    val test = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
      (0.0, Vectors.dense(3.0, 2.0, -0.1)),
      (1.0, Vectors.dense(0.0, 2.2, -1.5))
    )).toDF("label", "features")

    model2.transform(test).select("features", "label", "myProbability", "prediction")
      .collect()
      .foreach {
        case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
          println(s"($features, $label) -> prob=$prob, prediction=$prediction")
      }
  }

In the code below, several stages are set on a Pipeline, and everything is saved with a single call at the end.

  def pipelineFitSave(sqlContext: SQLContext): Unit = {

    // data
    val training = sqlContext.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // 3 stages: tokenizer, hashingTF, lr. Parameters are set directly via setters here
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
    val logisticRegression = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
    
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, logisticRegression)) // set stages in pipeline

    // fit
    val model = pipeline.fit(training) // a ParamMap could also be passed here to set parameters

    // save. The benefit of saving the whole pipeline here: only a single disk write
    model.write.overwrite().save("/atest/spark/lrmodel") // after fit
    pipeline.write.overwrite().save("/atest/spark/unfit-lrmodel") // unfit pipeline

    println("over")

  }

In the code below, the unfit Pipeline and the fitted PipelineModel are loaded back in one go.

  def pipelineLoadTransform(sqlContext: SQLContext): Unit = {
    val pipeline: Pipeline   = Pipeline.load("/atest/spark/unfit-lrmodel")
    val model: PipelineModel = PipelineModel.load("/atest/spark/lrmodel")
    
    val test = sqlContext.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "spark hadoop spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")

    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach {
        case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
          println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }
  }

}
