Machine Learning Library (MLlib) Guide: ML Pipelines
This article is based mainly on the Spark MLlib documentation, ML Pipelines - Spark 2.4.5 Documentation. Some descriptions are copied directly from it, and the examples are expanded and modified.
Main Concepts in Pipelines
Transformer
Transformer: A Transformer is an algorithm that transforms one DataFrame into another DataFrame. Technically, a Transformer implements a transform() method, which converts one DataFrame into another, generally by appending one or more columns.
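As a minimal sketch of the description above, the snippet below applies Tokenizer (one of the Transformers used later in this article) to a small DataFrame. The data and the names sentences and tokenized are made up for illustration only.

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("TransformerSketch").getOrCreate()
import spark.implicits._

// Hypothetical toy data, used only for illustration.
val sentences = Seq((0L, "spark ml pipelines"), (1L, "hello world")).toDF("id", "text")

// Tokenizer is a Transformer: it can be applied directly, without fitting.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

// transform() returns a new DataFrame; the input DataFrame is left unchanged.
val tokenized = tokenizer.transform(sentences)

println(sentences.columns.mkString(", ")) // id, text
println(tokenized.columns.mkString(", ")) // id, text, words
```

The output DataFrame keeps the original columns and appends the new words column.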
Estimator
Estimator: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. Technically, an Estimator implements a fit() method, which accepts a DataFrame and produces a Model, which is a Transformer.
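To illustrate the fit() contract, here is a small sketch that uses StringIndexer, an Estimator that is not otherwise covered in this article. The colors data and the variable names are hypothetical.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("EstimatorSketch").getOrCreate()
import spark.implicits._

// Hypothetical toy data, used only for illustration.
val colors = Seq("red", "blue", "red", "green", "red", "blue").toDF("color")

// StringIndexer is an Estimator: it must see the data to learn the label-to-index mapping.
val indexer = new StringIndexer().setInputCol("color").setOutputCol("colorIndex")

// fit() returns a Model, and every Model is a Transformer.
val indexerModel: StringIndexerModel = indexer.fit(colors)
println(indexerModel.isInstanceOf[Transformer]) // true

// The fitted model can now transform a DataFrame, appending the index column.
val indexed = indexerModel.transform(colors)
indexed.show()
```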
Parameter
MLlib Estimators and Transformers use a uniform API for specifying parameters.
A Param is a named parameter with self-contained documentation. A ParamMap is a set of (parameter, value) pairs.
There are two main ways to pass parameters to an algorithm:
- Set parameters for an instance. E.g., if lr is an instance of LogisticRegression, one could call lr.setMaxIter(10) to make lr.fit() use at most 10 iterations.
- Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will override parameters previously specified via setter methods.
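The precedence between the two approaches can be checked without training anything. The sketch below relies on extractParamMap(extra), which merges default values, values set on the instance, and the extra ParamMap, in that order of increasing priority. The names overrides and effective are illustrative.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr = new LogisticRegression()
println(lr.getMaxIter) // 100 (the default value)

// Way 1: set the parameter on the instance.
lr.setMaxIter(10)
println(lr.getMaxIter) // 10

// Way 2: supply a ParamMap; its entries take priority over setter values.
val overrides = ParamMap(lr.maxIter -> 30)
val effective = lr.extractParamMap(overrides)
println(effective(lr.maxIter)) // 30

// The instance itself is not modified by the ParamMap.
println(lr.getMaxIter) // 10
```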
Pipeline
A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
Code Examples
Example 1: Transformer, Estimator, ParamMap
Preparing the data
Import the required packages and prepare the data.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param.ParamMap
val spark = SparkSession.builder().master("local").appName("EstimatorTransformerPipeline").getOrCreate()
import spark.implicits._
val df = Seq(
(1.0, Vectors.dense(0.0, 1.1, 0.1)),
(0.0, Vectors.dense(2.0, 1.0, -1.0)),
(0.0, Vectors.dense(2.0, 1.3, 1.0)),
(1.0, Vectors.dense(0.0, 1.2, -0.5)),
(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
(0.0, Vectors.dense(3.0, 2.0, -0.1)),
(1.0, Vectors.dense(0.0, 2.2, -1.5))
).toDF("label", "features")
val Array(training, testing) = df.randomSplit(Array(0.7, 0.3), seed=1)
training.show() // inspect the training data
Output:
+-----+--------------+
|label|      features|
+-----+--------------+
|  0.0|[2.0,1.0,-1.0]|
|  0.0| [2.0,1.3,1.0]|
|  0.0|[3.0,2.0,-0.1]|
|  1.0|[-1.0,1.5,1.3]|
|  1.0|[0.0,2.2,-1.5]|
+-----+--------------+
Defining a LogisticRegression (the Estimator described earlier)
// lr is an Estimator
val lr = new LogisticRegression()
println(lr.explainParams()) // print the documentation of every parameter
The output is as follows (some lines omitted):
aggregationDepth: suggested depth for treeAggregate (>= 2) (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0)
....
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0 (undefined)
The above is only the parameter documentation. To see the actual value of each parameter on the lr object, use
println(lr.extractParamMap)
Output:
{
logreg_738fb0b848b2-aggregationDepth: 2,
logreg_738fb0b848b2-elasticNetParam: 0.0,
logreg_738fb0b848b2-family: auto,
logreg_738fb0b848b2-featuresCol: features,
logreg_738fb0b848b2-fitIntercept: true,
logreg_738fb0b848b2-labelCol: label,
logreg_738fb0b848b2-maxIter: 100,
logreg_738fb0b848b2-predictionCol: prediction,
logreg_738fb0b848b2-probabilityCol: probability,
logreg_738fb0b848b2-rawPredictionCol: rawPrediction,
logreg_738fb0b848b2-regParam: 0.0,
logreg_738fb0b848b2-standardization: true,
logreg_738fb0b848b2-threshold: 0.5,
logreg_738fb0b848b2-tol: 1.0E-6
}
.extractParamMap returns a ParamMap object.
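Setting parameters
Parameters can be set directly on the lr object with the .setXX methods. Use .explainParams() first to see each parameter and what it does, then set the ones you need.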
lr.setMaxIter(10).setRegParam(0.01)
lr.extractParamMap() // inspect the parameters again: maxIter and regParam have changed
Training the model
fit without a ParamMap
The lr object above is an Estimator, so calling fit on it returns a Model object (a Model is a Transformer).
val model1 = lr.fit(training)
A Model object has a .parent member that returns the Estimator that created it. Here, that Estimator is lr.
model1.parent == lr // true
fit with a ParamMap
fit also accepts a ParamMap object. First, here is some basic usage of ParamMap.
val paramMap = ParamMap(lr.maxIter->10)
paramMap.put(lr.maxIter->30).put(lr.threshold->0.55, lr.regParam->0.02)
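A few points to note when using a ParamMap:
- A ParamMap is similar to a Map: it stores key-value pairs. Note that the key is the Param object lr.maxIter, not a string.
- Use put to add key-value pairs to a ParamMap. A later value overwrites an earlier one for the same key, and several pairs can be added in one call.
- Calls to put can be chained.
- Use ++ to merge two ParamMaps, as follows:
// Combine ParamMaps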
val paramMap2 = ParamMap(lr.probabilityCol->"myProbability")
val paramMapCombined = paramMap ++ paramMap2
Next, pass paramMapCombined to fit:
val model2 = lr.fit(training, paramMapCombined)
Likewise, model2 has a parent member that returns the Estimator that created it, but here parent is not the lr object itself.
model2.parent == lr // false
The reason is that when fit receives a ParamMap, it first copies the original Estimator, applies the ParamMap to the copy, and then fits the Model on that copy.
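The copy behaviour can be observed directly. The following sketch is self-contained and uses its own small dataset; the names est and fitted are illustrative and stand in for lr and model2.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("ParentCopySketch").getOrCreate()
import spark.implicits._

// Hypothetical toy data, used only for illustration.
val data = Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
).toDF("label", "features")

val est = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// Fit with a ParamMap that overrides maxIter.
val fitted = est.fit(data, ParamMap(est.maxIter -> 30))

// The original Estimator keeps the value set through its setter.
println(est.getMaxIter) // 10

// The Model, and the copied Estimator that is its parent, carry the overridden value.
val parent = fitted.parent.asInstanceOf[LogisticRegression]
println(fitted.getMaxIter) // 30
println(parent.getMaxIter) // 30

// The parent is a different object, although the copy keeps the same uid.
println(parent eq est)          // false
println(parent.uid == est.uid)  // true
```

Because the override lives on the copy, the same Estimator can be reused to fit several models with different ParamMaps without the runs affecting each other.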
Prediction
val predicted = model2.transform(testing)
predicted.show()
Output:
+-----+--------------+--------------------+--------------------+----------+
|label|      features|       rawPrediction|       myProbability|prediction|
+-----+--------------+--------------------+--------------------+----------+
|  1.0| [0.0,1.1,0.1]|[0.10699661942890...|[0.52672366472839...|       0.0|
|  1.0|[0.0,1.2,-0.5]|[-0.0479529380998...|[0.48801406217663...|       0.0|
+-----+--------------+--------------------+--------------------+----------+
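Example 2: Pipeline
The following example uses a Pipeline to chain a Tokenizer, a HashingTF, and a LogisticRegression into a single workflow.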
1. Import the required packages and prepare the data
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.master("local")
.appName("PipelineExample")
.getOrCreate()
import spark.implicits._
val trainning = Seq(
(0L, "a b c d e spark", 1.0),
(1L, "b d", 0.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 0.0)
).toDF("id", "text", "label")
val testing = Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "spark hadoop spark"),
(7L, "apache hadoop")
).toDF("id", "text")
trainning.show()
Output:
+---+----------------+-----+
| id|            text|label|
+---+----------------+-----+
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|
+---+----------------+-----+
2. Create the Tokenizer, HashingTF, and LogisticRegression
// splits the text into words
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
// hashes the words into term-frequency feature vectors
val hashingtf = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
3. Create the Pipeline and train the model
Use the .setStages method to add tokenizer, hashingtf, and lr to the pipeline:
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingtf, lr))
Call fit to train the model. It returns a PipelineModel object.
val model = pipeline.fit(trainning)
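To see what a fitted pipeline contains, the sketch below rebuilds the same three stages on the same four training rows and inspects the stages of the resulting PipelineModel. The variable names (docs, tok, tf, clf, fittedPipeline) are illustrative and chosen so they do not clash with the names used in this example.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("PipelineStagesSketch").getOrCreate()
import spark.implicits._

val docs = Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
).toDF("id", "text", "label")

val tok = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("features")
val clf = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

val fittedPipeline: PipelineModel =
  new Pipeline().setStages(Array[PipelineStage](tok, tf, clf)).fit(docs)

// Transformers pass through as they are; the Estimator is replaced by its fitted Model.
fittedPipeline.stages.foreach(stage => println(stage.getClass.getSimpleName))
// Tokenizer
// HashingTF
// LogisticRegressionModel

// The last stage can be cast back to inspect the trained classifier.
val clfModel = fittedPipeline.stages.last.asInstanceOf[LogisticRegressionModel]
println(clfModel.numFeatures) // 262144, the default HashingTF dimension
```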
4. Persist the model
// save the trained model
model.write.overwrite().save("/tmp/lr-model")
// the unfitted pipeline can be saved as well
pipeline.write.overwrite().save("/tmp/lr-pipeline")
5. Load the model
val newModel = PipelineModel.load("/tmp/lr-model")
newModel.transform(testing).show()
Output:
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
| id|              text|               words|            features|       rawPrediction|         probability|prediction|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
|  4|       spark i j k|    [spark, i, j, k]|(262144,[19036,68...|[0.16293291377586...|[0.54064335448522...|       0.0|
|  5|             l m n|           [l, m, n]|(262144,[1303,526...|[2.64074492868040...|[0.93343826273835...|       0.0|
|  6|spark hadoop spark|[spark, hadoop, s...|(262144,[173558,1...|[-1.2126835161593...|[0.22922657813243...|       1.0|
|  7|     apache hadoop|    [apache, hadoop]|(262144,[68303,19...|[3.74294051364968...|[0.97686361395183...|       0.0|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
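An unfitted Pipeline can be restored in the same way as a fitted one, using Pipeline.load instead of PipelineModel.load. The sketch below is a self-contained round trip; it writes to a temporary directory (an assumption made here so that the snippet does not depend on /tmp/lr-pipeline already existing), and the names unfit and reloaded are illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("PipelineLoadSketch").getOrCreate()

// Temporary location, used only for this sketch.
val path = Files.createTempDirectory("pipeline-sketch").toString + "/unfit-pipeline"

val unfit = new Pipeline().setStages(Array[PipelineStage](
  new Tokenizer().setInputCol("text").setOutputCol("words"),
  new HashingTF().setInputCol("words").setOutputCol("features"),
  new LogisticRegression().setMaxIter(10).setRegParam(0.01)
))
unfit.write.overwrite().save(path)

// Pipeline.load restores the stages together with their parameter values.
val reloaded = Pipeline.load(path)
reloaded.getStages.foreach(stage => println(stage.getClass.getSimpleName))

val reloadedLr = reloaded.getStages.last.asInstanceOf[LogisticRegression]
println(reloadedLr.getMaxIter) // 10
```

A reloaded Pipeline is still an Estimator, so it has to be fit on training data again before it can make predictions.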