Note
Study notes. If anything here infringes copyright, please let me know and I will remove it.
Intro
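Built on top of DataFrames. MLlib provides a standard API for machine learning algorithms, which makes it easy to combine multiple algorithms into a single Pipeline, or workflow.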
- DataFrame: the ML API uses the DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types, e.g. different columns storing text, feature vectors, true labels and predictions.
- Estimator: an algorithm which can be fit on a DataFrame to produce a Transformer, e.g. a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
- Transformer: an algorithm which can transform one DataFrame into another DataFrame, e.g. an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
- Pipeline: a Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
- Parameter: all Transformers and Estimators share a common API for specifying parameters.
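The Estimator/Transformer relationship above can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: the object name `EstimatorVsTransformer`, the helper `run`, the app name and the local master are hypothetical, and it uses `SparkSession` rather than the `SQLContext` used later in these notes.

```scala
import org.apache.spark.ml.{Estimator, Transformer}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object name, for illustration only.
object EstimatorVsTransformer {

  // Fits a small logistic regression and returns the transformed training data.
  def run(spark: SparkSession): DataFrame = {
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, -1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // A learning algorithm is an Estimator ...
    val estimator: Estimator[LogisticRegressionModel] = new LogisticRegression().setMaxIter(5)
    // ... and fitting it on a DataFrame produces a model, which is a Transformer.
    val model: Transformer = estimator.fit(training)
    // The Transformer maps one DataFrame to another, appending prediction columns.
    model.transform(training)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for this toy dataset.
    val spark = SparkSession.builder().master("local[*]").appName("estimator-vs-transformer").getOrCreate()
    run(spark).printSchema()
    spark.stop()
  }
}
```

The explicit type annotations on `estimator` and `model` are only there to make the roles visible; in normal code they would be inferred.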
Pipeline
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator.
The input DataFrame is transformed as it passes through each stage.
For Transformer stages, the transform() method is called on the DataFrame.
For Estimator stages, the fit() method is called to produce a Transformer (which becomes a part of the PipelineModel, or fitted Pipeline).
Fit
The figure shows the training-time usage of a Pipeline.
The top row represents a Pipeline with three stages: the first two are Transformers and the third is an Estimator.
The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames.
If the Pipeline had more stages after LogRegModel (the fitted logistic regression model), it would call the model’s transform() method on the DataFrame before passing the DataFrame to the next stage.
Transform
In the fitted PipelineModel, all Estimators in the original Pipeline have become Transformers.
When the PipelineModel’s transform() is called on a test dataset, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage.
Pipelines and PipelineModels help to ensure that training and test data go through identical feature processing steps.
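One way to see that the Estimators have become Transformers is to inspect the stages of a fitted PipelineModel. The following is a rough sketch under stated assumptions: the object name `InspectFittedStages`, the helper `fitPipeline`, the toy data and the local `SparkSession` are all hypothetical choices for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only.
object InspectFittedStages {

  // Fits a three-stage Pipeline (two Transformers, one Estimator) on toy data.
  def fitPipeline(spark: SparkSession): PipelineModel = {
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(5)

    new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("inspect-stages").getOrCreate()
    val model = fitPipeline(spark)
    // Every stage of the fitted PipelineModel is a Transformer; the
    // LogisticRegression Estimator has been replaced by its fitted model.
    model.stages.foreach(stage => println(stage.getClass.getSimpleName))
    println(model.stages.last.isInstanceOf[LogisticRegressionModel])
    spark.stop()
  }
}
```

Because the same fitted stages are reused at test time, calling `model.transform` on new data applies exactly the tokenization and hashing that were used during training.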
DAG
A Pipeline’s stages are specified as an ordered array.
It is possible to create non-linear Pipelines as long as the data flow graph forms a Directed Acyclic Graph.
The graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.
Runtime checking: since Pipelines can operate on DataFrames with varied types, they cannot use compile-time type checking.
Pipelines and PipelineModels instead do runtime checking before actually running the Pipeline.
This type checking is done using the DataFrame schema, a description of the data types of columns in the DataFrame.
Unique Pipeline stages: a Pipeline’s stages should be unique instances. Each stage is identified by a unique ID, so the same instance should not be inserted into a Pipeline twice.
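The runtime schema check and the uniqueness requirement can be tried without running any job, because `transformSchema` only needs a `StructType`. This is a sketch under assumptions: the object name `SchemaCheckSketch` and the two example schemas are hypothetical, and as far as I know `fit()` performs this same check internally before it starts training.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import scala.util.Try

// Hypothetical object name, for illustration only.
object SchemaCheckSketch {
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF))

  // A schema the pipeline accepts: "text" is a string column.
  val goodSchema = StructType(Seq(StructField("text", StringType)))
  // A schema it should reject: Tokenizer expects a string input column.
  val badSchema = StructType(Seq(StructField("text", IntegerType)))

  // The same Tokenizer instance used twice, which violates stage uniqueness.
  val duplicated = new Pipeline().setStages(Array(tokenizer, tokenizer))

  def main(args: Array[String]): Unit = {
    // Derive the output schema from the input schema, stage by stage.
    println(pipeline.transformSchema(goodSchema).fieldNames.mkString(", "))
    // The type mismatch is reported at runtime, before any data is processed.
    println(Try(pipeline.transformSchema(badSchema)).isFailure)
    // Duplicate stage instances are rejected as well.
    println(Try(duplicated.transformSchema(goodSchema)).isFailure)
    // Each stage carries its own unique ID.
    println(s"${tokenizer.uid} ${hashingTF.uid}")
  }
}
```

Note that the stages are listed in topological order: `hashingTF` reads the `words` column that `tokenizer` produces, so it has to come second.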
Parameters
MLlib Estimators and Transformers use a uniform API for specifying parameters.
A Param is a named parameter with self-contained documentation.
A ParamMap is a set of (parameter, value) pairs.
There are two main ways to pass parameters to an algorithm:
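- Set parameters on an instance with its setter methods, e.g. logreg.setMaxIter(10).
- Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will override parameters previously specified via setter methods.
Parameters belong to specific instances of Estimators and Transformers. For example, if there are two LogisticRegression instances lr1 and lr2, a single ParamMap can set maxIter for both: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20).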
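The override behaviour can be checked without fitting anything. A minimal sketch, assuming the hypothetical object name `ParamMapSketch` and arbitrary parameter values; it uses `copy(paramMap)` to apply the map to `lr1`, which, to my understanding, mirrors what `fit(dataset, paramMap)` does before training.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

// Hypothetical object name, for illustration only.
object ParamMapSketch {
  // Two independent instances, both configured through setters first.
  val lr1 = new LogisticRegression().setMaxIter(5)
  val lr2 = new LogisticRegression().setMaxIter(5)

  // One ParamMap holding values for parameters of both instances.
  val paramMap = ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)

  def main(args: Array[String]): Unit = {
    // Each Param carries its own documentation.
    println(lr1.explainParam(lr1.maxIter))
    // The map keeps the two maxIter parameters apart, since each Param
    // belongs to the instance that owns it.
    println(paramMap.get(lr1.maxIter))
    println(paramMap.get(lr2.maxIter))
    // The setter value stays on the original instance ...
    println(lr1.getMaxIter)
    // ... while a copy made with the ParamMap picks up the overriding value.
    println(lr1.copy(paramMap).getMaxIter)
  }
}
```

Passing the map at `fit()` time leaves `lr1` itself unchanged, which is convenient when the same Estimator is reused with several parameter settings.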
Saving and Loading Pipelines
model.write.overwrite().save("/abc/model")
val model = PipelineModel.load("/abc/model")
Code
package spark

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.{SparkConf, SparkContext}

object TestPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("test")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new SQLContext(sc)
    // train_test(sqlContext)
    // pipelineFitSave(sqlContext)
    pipelineLoadTransform(sqlContext)
  }

  def train_test(sqlContext: SQLContext): Unit = {
    // training dataset with label/features
    val training = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, -1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")
    // logreg instance
    val lr = new LogisticRegression() // instance
    lr.setMaxIter(10).setRegParam(0.01) // set params
    // fit
    val model1 = lr.fit(training)
    println("model1's params: " + model1.parent.extractParamMap)
    // set params using ParamMap; the later put(lr.maxIter, 30) overwrites the initial maxIter -> 20
    val paramMap = ParamMap(lr.maxIter -> 20).put(lr.maxIter, 30).put(lr.regParam -> 0.1, lr.threshold -> 0.55)
    val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")
    val paramMapCombined = paramMap ++ paramMap2 // combine the two ParamMaps
    // now, use combined paramMap to train a model
    val model2 = lr.fit(training, paramMapCombined)
    println("model2's params: " + model2.parent.extractParamMap)
    // ===============================================================
    // ========================test=============================
    // ===============================================================
    val test = sqlContext.createDataFrame(Seq(
      (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
      (0.0, Vectors.dense(3.0, 2.0, -0.1)),
      (1.0, Vectors.dense(0.0, 2.2, -1.5))
    )).toDF("label", "features")
    model2.transform(test).select("features", "label", "myProbability", "prediction")
      .collect()
      .foreach {
        case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
          println(s"($features, $label) -> prob=$prob, prediction=$prediction")
      }
  }
The code below sets multiple stages in a Pipeline and saves everything with a single save at the end.
  def pipelineFitSave(sqlContext: SQLContext): Unit = {
    // data
    val training = sqlContext.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")
    // 3 stages: tokenizer, hashingTF, lr; the parameters are set directly with setters here
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
    val logisticRegression = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, logisticRegression)) // set stages in pipeline
    // fit
    val model = pipeline.fit(training) // parameters could also be passed here with a ParamMap
    // save; the benefit of saving the whole pipeline: all stages are persisted with a single save call
    model.write.overwrite().save("/atest/spark/lrmodel") // after fit
    pipeline.write.overwrite().save("/atest/spark/unfit-lrmodel") // unfit pipeline
    println("over")
  }
The code below loads everything back with a single load.
  def pipelineLoadTransform(sqlContext: SQLContext): Unit = {
    val pipeline: Pipeline = Pipeline.load("/atest/spark/unfit-lrmodel")
    val model: PipelineModel = PipelineModel.load("/atest/spark/lrmodel")
    val test = sqlContext.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "spark hadoop spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")
    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach {
        case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
          println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }
  }
}