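In machine learning, "tuning" is the process of choosing, for a given dataset, the parameter settings of an algorithm (its hyperparameters) that give the best model performance.
Spark's MLlib provides two tools, CrossValidator and TrainValidationSplit, to help with model tuning.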
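Both tools generally require the following settings:
setEstimator specifies the algorithm to tune, or a Pipeline (a Pipeline is itself an Estimator);
setEstimatorParamMaps specifies the "parameter grid" of candidate parameter combinations, built with new ParamGridBuilder().addGrid(xxx,xxx).build();
setEvaluator specifies the evaluation metric used to measure how well each trained model performs on the validation set.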
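As a rough illustration of the three settings above, here is a minimal sketch that wires them into a TrainValidationSplit. The choice of LinearRegression, the grid values, and the 0.8 train ratio are assumptions made for this example only and are not taken from the original text. The snippet only builds the tuner; calling fit on it would additionally need a DataFrame with "features" and "label" columns.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// The estimator to tune (an assumed choice for illustration).
val lr = new LinearRegression().setMaxIter(10)

// Parameter grid: 2 values of regParam x 3 values of elasticNetParam
// gives 6 candidate parameter combinations (values are illustrative).
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// Wire the estimator, the candidate grid, and the evaluator together.
val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new RegressionEvaluator())
  // Assumed split: 80% of the data for training, 20% for validation.
  .setTrainRatio(0.8)
```

TrainValidationSplit evaluates each parameter combination once on a single train/validation split, so it is cheaper than CrossValidator, but the result can be less reliable when the training set is small.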
Cross-validation: CrossValidator
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row
// Prepare training data from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
(0L, "a b c d e spark", 1.0),
(1L, "b d", 0.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 0.0),
(4L, "b spark who", 1.0),
(5L, "g d a y", 0.0),
(6L, "spark fly", 1.0),
(7L, "was mapreduce", 0.0),
(8L, "e spark program", 1.0),
(9L, "a e c l", 0.0),
(10L, "spark compile", 1.0),
(11L, "hadoop software", 0.0)
)).toDF("id", "text", "label")
// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
<