Machine Learning Library (MLlib) Guide: ML Pipelines

prinf("Hello World")

Article link: https://blog.csdn.net/qq_39484341/article/details/105164699

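This article is based mainly on the Spark MLlib documentation, ML Pipelines - Spark 2.4.5 Documentation. Some descriptions are copied directly from it, and its examples are expanded and modified here.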

Main Concepts in Pipelines

Transformer

Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. A Transformer achieves this by implementing a transform() method, which converts one DataFrame into another, generally by appending one or more columns.
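
As a minimal sketch of this idea, the snippet below applies Tokenizer (one of MLlib's built-in Transformers, also used in Example 2) to a small made-up DataFrame. The sample sentences, the column names and the app name are illustrative assumptions, not part of the original guide.

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("TransformerSketch").getOrCreate()
import spark.implicits._

// A tiny illustrative DataFrame with a single text column (made-up data)
val sentences = Seq("spark ml pipelines", "hello world").toDF("text")

// Tokenizer is a Transformer: nothing has to be fitted before calling transform()
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tokenized = tokenizer.transform(sentences)

// The input column is kept and a new "words" column is appended
tokenized.printSchema()
tokenized.show(false)
```

The input DataFrame is not modified; transform() returns a new DataFrame that carries the extra column.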

Estimator

Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer.

Technically, an Estimator implements a fit() method, which accepts a DataFrame and produces a Model, which is a Transformer.
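
To make the fit() contract concrete, here is a small sketch using StringIndexer. It is chosen only because it trains instantly on a few rows; it does not appear elsewhere in this article, and the data and column names are made up. The sketch targets the Spark 2.4 API, where the fitted model exposes the learned mapping as labels.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("EstimatorSketch").getOrCreate()
import spark.implicits._

// Made-up categorical data: "a" appears three times, "b" twice, "c" once
val categories = Seq("a", "b", "a", "c", "a", "b").toDF("category")

// StringIndexer is an Estimator: it has to see the data before it can transform anything
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")

// fit() returns a Model, and every Model is a Transformer
val indexerModel: StringIndexerModel = indexer.fit(categories)
val asTransformer: Transformer = indexerModel

// The fitted model carries what was learned (labels ordered by descending frequency)
println(indexerModel.labels.mkString(", "))

// It can now be used like any other Transformer
val indexed = asTransformer.transform(categories)
indexed.show()
```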

Parameter

MLlib Estimators and Transformers use a uniform API for specifying parameters.

A Param is a named parameter with self-contained documentation. A ParamMap is a set of (parameter, value) pairs.

There are two main ways to pass parameters to an algorithm:

Set parameters for an instance. E.g., if lr is an instance of LogisticRegression, one could call lr.setMaxIter(10) to make lr.fit() use at most 10 iterations.
Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will override parameters previously specified via setter methods.
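
The two ways of passing parameters can be sketched without any training data. In the snippet below, copy(overrides) is used only as a stand-in to show that values from a ParamMap win over values set through setters while leaving the original instance untouched; the variable names are illustrative.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr = new LogisticRegression()

// A Param carries its own name and documentation
println(lr.maxIter.name)
println(lr.maxIter.doc)

// Way 1: set the parameter on the instance
lr.setMaxIter(10)

// Way 2: collect overrides in a ParamMap, keyed by the Param object itself
val overrides = ParamMap(lr.maxIter -> 30)

// copy(overrides) returns a new estimator in which the ParamMap values take precedence
val lrWithOverrides = lr.copy(overrides)

println(lr.getMaxIter)              // value set through the setter
println(lrWithOverrides.getMaxIter) // value taken from the ParamMap
```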

Pipeline

A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

Code Examples

Example 1: Transformer, Estimator, ParamMap

Preparing the data

Import the required packages and prepare the data.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param.ParamMap

val spark = SparkSession.builder().master("local").appName("EstimatorTransformerPipeline").getOrCreate()
import spark.implicits._

val df = Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5)),
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
).toDF("label", "features")

val Array(training, testing) = df.randomSplit(Array(0.7, 0.3), seed=1)
training.show() // inspect the training split

Result

+-----+--------------+
|label|      features|
+-----+--------------+
|  0.0|[2.0,1.0,-1.0]|
|  0.0| [2.0,1.3,1.0]|
|  0.0|[3.0,2.0,-0.1]|
|  1.0|[-1.0,1.5,1.3]|
|  1.0|[0.0,2.2,-1.5]|
+-----+--------------+

Defining the LogisticRegression (the Estimator described earlier)

// lr is an Estimator
val lr = new LogisticRegression()
println(lr.explainParams()) // print the documentation of every parameter

The output is as follows (some lines omitted):

aggregationDepth: suggested depth for treeAggregate (>= 2) (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0)
....
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0 (undefined)

The text above is only the parameter documentation. To see the current value of each parameter of the lr object, use

println(lr.extractParamMap)

The output is

{
	logreg_738fb0b848b2-aggregationDepth: 2,
	logreg_738fb0b848b2-elasticNetParam: 0.0,
	logreg_738fb0b848b2-family: auto,
	logreg_738fb0b848b2-featuresCol: features,
	logreg_738fb0b848b2-fitIntercept: true,
	logreg_738fb0b848b2-labelCol: label,
	logreg_738fb0b848b2-maxIter: 100,
	logreg_738fb0b848b2-predictionCol: prediction,
	logreg_738fb0b848b2-probabilityCol: probability,
	logreg_738fb0b848b2-rawPredictionCol: rawPrediction,
	logreg_738fb0b848b2-regParam: 0.0,
	logreg_738fb0b848b2-standardization: true,
	logreg_738fb0b848b2-threshold: 0.5,
	logreg_738fb0b848b2-tol: 1.0E-6
}

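extractParamMap returns a ParamMap object.

Setting parameters

Parameters can be set directly on the lr object through its setXxx methods. Use explainParams() first to look up the available parameters and what they do, then set them.

lr.setMaxIter(10).setRegParam(0.01)
lr.extractParamMap() // check again: the values of maxIter and regParam have changed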

Training the model

fit without a ParamMap

The lr above is an Estimator, so calling fit on it returns a Model object (a Model is a Transformer).

val model1 = lr.fit(training)

A Model has a parent member that returns the Estimator which created it. Here that Estimator is lr.

model1.parent == lr // true

fit with a ParamMap

fit also accepts a ParamMap object. First, a look at the basic usage of ParamMap.

val paramMap = ParamMap(lr.maxIter->10)
paramMap.put(lr.maxIter->30).put(lr.threshold->0.55, lr.regParam->0.02)

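A few points to note when using a ParamMap:

A ParamMap is similar to a Map: it stores key-value pairs. Note that the key is the Param object lr.maxIter, not a string.
put adds key-value pairs to a ParamMap; a later value for the same key overwrites the earlier one. Several pairs can be passed in a single call.
put calls can be chained.
Two ParamMaps can be merged with ++, as follows.

// Combine ParamMaps
val paramMap2 = ParamMap(lr.probabilityCol->"myProbability")
val paramMapCombined = paramMap ++ paramMap2

Next, paramMapCombined is passed to fit as an argument.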

val model2 = lr.fit(training, paramMapCombined)

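model2 likewise has a parent member that returns the Estimator which created it, but here the parent is not equal to lr.

model2.parent == lr // false

The reason is that when fit receives a ParamMap, it copies the original Estimator, applies the ParamMap to the copy, and builds the Model from the copy.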
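
The copy can be observed directly. The following sketch, which assumes a small made-up dataset and illustrative variable names, fits with a ParamMap and then compares the model's parent with the original estimator: the two are different objects that share a uid but hold different maxIter values.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("ParentSketch").getOrCreate()
import spark.implicits._

// Small illustrative dataset (values are arbitrary)
val data = Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
).toDF("label", "features")

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val fitted = lr.fit(data, ParamMap(lr.maxIter -> 30))

val parent = fitted.parent
println(parent eq lr)                    // reference comparison with the original estimator
println(parent.uid == lr.uid)            // the copy keeps the uid of the original
println(parent.getOrDefault(lr.maxIter)) // value the copy used for this fit
println(lr.getMaxIter)                   // the original estimator is left unchanged
```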

Prediction

val predicted = model2.transform(testing)
predicted.show()

The result is

+-----+--------------+--------------------+--------------------+----------+
|label|      features|       rawPrediction|       myProbability|prediction|
+-----+--------------+--------------------+--------------------+----------+
|  1.0| [0.0,1.1,0.1]|[0.10699661942890...|[0.52672366472839...|       0.0|
|  1.0|[0.0,1.2,-0.5]|[-0.0479529380998...|[0.48801406217663...|       0.0|
+-----+--------------+--------------------+--------------------+----------+
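
The prediction column can be traced back to myProbability and the threshold of 0.55 that was put into the ParamMap. The arithmetic sketch below uses plain Scala and assumes that a binary LogisticRegressionModel predicts 1.0 only when the probability of class 1 exceeds the threshold. The probabilities are the first entries of myProbability as printed above, so they are truncated.

```scala
// Threshold supplied through the ParamMap in this example
val threshold = 0.55

// Class-0 probabilities of the two test rows, truncated as printed in the table
val probClass0 = Seq(0.52672366472839, 0.48801406217663)

// Decision rule assumed here: predict 1.0 only if P(class 1) > threshold
def predict(p0: Double, t: Double): Double = {
  val p1 = 1.0 - p0 // in the binary case the two probabilities sum to 1
  if (p1 > t) 1.0 else 0.0
}

val withCustomThreshold = probClass0.map(predict(_, threshold))
val withDefaultThreshold = probClass0.map(predict(_, 0.5))

println(withCustomThreshold)  // List(0.0, 0.0)
println(withDefaultThreshold) // List(0.0, 1.0)
```

For the second row the probability of class 1 is about 0.512, which clears the default threshold of 0.5 but not 0.55, so the custom threshold is what turns that prediction into 0.0.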

Example 2: Pipeline

The example below uses a Pipeline to chain the three stages created in step 2: a Tokenizer, a HashingTF and a LogisticRegression.

[Figure: pipeline diagram, image unavailable]

1. Import the required packages and prepare the data

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local")
  .appName("PipelineExample")
  .getOrCreate()
import spark.implicits._

val trainning = Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
).toDF("id", "text", "label")

val testing = Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
).toDF("id", "text")

trainning.show()

Result

+---+----------------+-----+
| id|            text|label|
+---+----------------+-----+
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|
+---+----------------+-----+

2. Create the Tokenizer, HashingTF and LogisticRegression

// splits the text into words
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

// hashes the words into a term-frequency feature vector
val hashingtf = new HashingTF().setInputCol("words").setOutputCol("features")

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

3. Create the Pipeline and train the model

Use setStages to add tokenizer, hashingtf and lr to the pipeline.

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingtf, lr))

Train the model with fit, which returns a PipelineModel object.

val model = pipeline.fit(trainning)
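
To see what fit produces, the stages of the returned PipelineModel can be inspected. The sketch below rebuilds the same three stages on the same four documents; the only deliberate difference is setNumFeatures(1000), an assumption made here to keep the coefficient vector small, and the variable names are illustrative.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("PipelineStagesSketch").getOrCreate()
import spark.implicits._

val docs = Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
// numFeatures is reduced (assumption) so that the model has only 1000 coefficients
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

val fitted = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(docs)

// Transformers are carried over; the Estimator is replaced by the Model it produced
fitted.stages.foreach(stage => println(stage.getClass.getSimpleName))

// The last stage is the fitted logistic regression model
val lrModel = fitted.stages.last.asInstanceOf[LogisticRegressionModel]
println(lrModel.coefficients.size)
```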

4. Persisting the model

// save the fitted model
model.write.overwrite().save("/tmp/lr-model")
// the unfitted pipeline can be saved as well
pipeline.write.overwrite().save("/tmp/lr-pipeline")

5. Loading the model

val newModel = PipelineModel.load("/tmp/lr-model")
newModel.transform(testing).show()

Result

+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
| id|              text|               words|            features|       rawPrediction|         probability|prediction|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
|  4|       spark i j k|    [spark, i, j, k]|(262144,[19036,68...|[0.16293291377586...|[0.54064335448522...|       0.0|
|  5|             l m n|           [l, m, n]|(262144,[1303,526...|[2.64074492868040...|[0.93343826273835...|       0.0|
|  6|spark hadoop spark|[spark, hadoop, s...|(262144,[173558,1...|[-1.2126835161593...|[0.22922657813243...|       1.0|
|  7|     apache hadoop|    [apache, hadoop]|(262144,[68303,19...|[3.74294051364968...|[0.97686361395183...|       0.0|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
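
The unfitted pipeline saved in step 4 can be read back in a similar way, using Pipeline.load rather than PipelineModel.load. The following sketch is self-contained and writes to a temporary directory (a hypothetical location chosen here in place of /tmp/lr-pipeline); the restored pipeline has the same stages and uids and still has to be fitted before it can make predictions.

```scala
import java.nio.file.Files
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("PipelineLoadSketch").getOrCreate()

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Hypothetical location: a fresh temporary directory
val path = Files.createTempDirectory("lr-pipeline").toString
pipeline.write.overwrite().save(path)

// Pipeline.load restores the unfitted pipeline, including the parameters of each stage
val restored = Pipeline.load(path)
restored.getStages.foreach(stage => println(s"${stage.getClass.getSimpleName}: ${stage.uid}"))
```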

References

ML Pipelines - Spark 2.4.5 Documentation
