[Spark 2.0] ML Tuning: Model Selection and Hyperparameter Tuning

yhao浩

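This section describes how to use the tools MLlib provides for tuning ML algorithms and Pipelines. Built-in cross-validation and other tooling let users optimize hyperparameters in algorithms and Pipelines.

Model selection (a.k.a. hyperparameter tuning)

An important task in ML is model selection: using data to find the best model or parameters for a given task. This is also called tuning. Tuning may be done for an individual Estimator such as LogisticRegression, or for an entire Pipeline that includes multiple algorithms, featurization, and other steps. Users can tune an entire Pipeline at once, rather than tuning each element of the Pipeline separately.

MLlib supports model selection using tools such as CrossValidator and TrainValidationSplit. These tools require the following components:

Estimator: the algorithm or Pipeline to tune
Set of ParamMaps: the parameters to choose from, sometimes called the "parameter grid" to search over
Evaluator: the metric that measures how well a fitted model does on held-out test data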

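At a high level, these model selection tools work as follows:

They split the input data into separate training and test datasets.
For each (training, test) pair, they iterate through the set of ParamMaps:
- For each ParamMap, they fit the Estimator using those parameters, get the fitted model, and evaluate the model's performance using the Evaluator.
They select the model produced by the best-performing set of parameters.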
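The loop above can be sketched in plain Scala, independent of Spark. This is only an illustration: `SelectionLoopSketch`, `selectBest`, `fit`, and `evaluate` are hypothetical names, not MLlib APIs, and the sketch assumes that a larger metric value is better.

```scala
// A minimal, Spark-free sketch of the model selection loop.
// All names here are hypothetical and are not part of MLlib.
object SelectionLoopSketch {
  type Params = Map[String, Double]

  /** Returns the parameter set with the highest average metric over all splits. */
  def selectBest[D, M](
      splits: Seq[(D, D)],          // (training, test) pairs
      grid: Seq[Params],            // candidate parameter sets
      fit: (D, Params) => M,        // stands in for fitting the Estimator
      evaluate: (M, D) => Double    // stands in for the Evaluator (larger is better)
  ): (Params, Double) = {
    val scored = grid.map { params =>
      // Fit and evaluate once per (training, test) pair, then average.
      val metrics = splits.map { case (train, test) =>
        evaluate(fit(train, params), test)
      }
      params -> metrics.sum / metrics.size
    }
    scored.maxBy(_._2)
  }
}
```

With a single (training, test) pair this corresponds to the train-validation case; with k pairs it corresponds to k-fold cross-validation.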

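The Evaluator can be a RegressionEvaluator for regression problems, a BinaryClassificationEvaluator for binary classification, or a MulticlassClassificationEvaluator for multiclass problems. The default metric used to choose the best ParamMap can be overridden with the evaluator's setMetricName method.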
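As a brief sketch of overriding the default metric, the snippet below can be pasted into spark-shell. The metric names ("mae", "areaUnderPR", "accuracy") are taken to be the ones accepted by the Spark 2.x evaluators; check the API docs of the version in use before relying on them.

```scala
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator, RegressionEvaluator}

// Regression: use mean absolute error instead of the default metric.
val regressionEval = new RegressionEvaluator()
  .setMetricName("mae")

// Binary classification: area under the precision-recall curve.
val binaryEval = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderPR")

// Multiclass classification: plain accuracy.
val multiclassEval = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
```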

To help construct the parameter grid, users can use the ParamGridBuilder utility.
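A minimal sketch of ParamGridBuilder, suitable for spark-shell. The estimator and the parameter values are chosen only for illustration.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.ParamGridBuilder

val lr = new LogisticRegression()

// Two values for regParam and two for elasticNetParam: 2 * 2 = 4 ParamMaps.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.0, 1.0))
  .build()

// Each element is one ParamMap, i.e. one combination to try.
paramGrid.foreach(println)
```

The builder returns the Cartesian product of all values added, so the grid grows multiplicatively with each call to addGrid.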

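Cross-Validation

CrossValidator begins by splitting the dataset into k folds, which are used as separate training and test datasets. For example, with k=3 folds, CrossValidator generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular ParamMap, CrossValidator computes the average evaluation metric over the 3 models produced by fitting the Estimator on the 3 different (training, test) dataset pairs.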
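The fold structure can be illustrated with a small Spark-free sketch over row indices. `KFoldSketch` is a hypothetical helper, not an MLlib class, and the round-robin assignment is an assumption made for readability; Spark assigns rows to folds randomly, so the proportions there are only approximate.

```scala
// A Spark-free sketch of k-fold splitting over row indices.
// KFoldSketch is a hypothetical helper, not an MLlib class.
object KFoldSketch {
  /** Returns k (training, test) pairs; fold i is the test set of pair i. */
  def splits(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    val indices = 0 until n
    (0 until k).map { fold =>
      // Rows whose index falls in this fold are held out for testing.
      val (test, train) = indices.partition(_ % k == fold)
      (train, test)
    }
  }

  def main(args: Array[String]): Unit =
    splits(n = 6, k = 3).foreach { case (train, test) =>
      println(s"train=${train.mkString(",")} test=${test.mkString(",")}")
    }
}
```

For n = 6 and k = 3, each pair holds 4 training indices and 2 test indices, and every index appears in exactly one test set.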

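After identifying the best ParamMap, CrossValidator re-fits the Estimator using that ParamMap and the entire dataset.

Example: model selection via cross-validation

The following example demonstrates using CrossValidator to select from a grid of parameters.

Note that cross-validation over a grid of parameters is expensive. In the example below, the parameter grid has 3 values for hashingTF.numFeatures and 2 values for lr.regParam, and CrossValidator uses 2 folds. This multiplies out to (3 * 2) * 2 = 12 different models being trained. In realistic settings it is common to try many more parameters and to use more folds (k=3 and k=10 are both common). In other words, using CrossValidator can be very expensive. However, it is a well-established method for choosing parameters, and it is more statistically sound than heuristic hand-tuning.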

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
    import org.apache.spark.sql.Row

    // Prepare training data from a list of (id, text, label) tuples.
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0),
      (4L, "b spark who", 1.0),
      (5L, "g d a y", 0.0),
      (6L, "spark fly", 1.0),
      (7L, "was mapreduce", 0.0),
      (8L, "e spark program", 1.0),
      (9L, "a e c l", 0.0),
      (10L, "spark compile", 1.0),
      (11L, "hadoop software", 0.0)
    )).toDF("id", "text", "label")

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
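The grid and cross-validator described for this example (3 values of hashingTF.numFeatures, 2 values of lr.regParam, 2 folds) could be set up roughly as follows in spark-shell. The pipeline stages are repeated so that the block is self-contained, and the concrete grid values are illustrative assumptions, since the text only states how many values there are.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// 3 values for hashingTF.numFeatures x 2 values for lr.regParam = 6 ParamMaps.
// The concrete values below are illustrative assumptions.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// The whole Pipeline is the Estimator, so all stages are tuned jointly.
val numFolds = 2
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds)

// Models trained during the search itself.
val modelsTrained = paramGrid.length * numFolds
println(s"$modelsTrained models will be trained during cross-validation")
```

With these settings, 6 ParamMaps times 2 folds gives the 12 fits mentioned in the text; calling cv.fit on a DataFrame with "text" and "label" columns would then run the search.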

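The complete code can be found at "examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala" in the Spark repository.

Train-Validation Split

In addition to CrossValidator, Spark also offers TrainValidationSplit for hyperparameter tuning. TrainValidationSplit evaluates each combination of parameters only once, as opposed to k times with CrossValidator. It is therefore less expensive, but its results are less reliable when the training dataset is not sufficiently large.

Unlike CrossValidator, TrainValidationSplit creates a single (training, test) dataset pair. It splits the dataset into two parts using the trainRatio parameter. For example, with trainRatio=0.75, TrainValidationSplit generates a training and test dataset pair in which 75% of the data is used for training and 25% for validation.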
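The effect of trainRatio can be sketched without Spark. `TrainRatioSketch` is a hypothetical helper, not an MLlib class. It cuts a shuffled sequence at an exact position, which is a simplification: Spark splits the data randomly, so the 75/25 proportion holds only approximately there.

```scala
import scala.util.Random

// A Spark-free sketch of a single train/validation split.
// TrainRatioSketch is a hypothetical helper, not an MLlib class.
object TrainRatioSketch {
  /** Shuffles the rows with a fixed seed and cuts them once at trainRatio. */
  def split[A](rows: Seq[A], trainRatio: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainRatio > 0.0 && trainRatio < 1.0, "trainRatio must be in (0, 1)")
    val shuffled = new Random(seed).shuffle(rows)
    val cut = math.round(rows.size * trainRatio).toInt
    // The first part is used for training, the remainder for validation.
    shuffled.splitAt(cut)
  }
}
```

For 100 rows and trainRatio = 0.75, this yields 75 training rows and 25 validation rows, and each parameter combination is evaluated once on that single pair.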

Like CrossValidator, TrainValidationSplit finally fits the Estimator using the best ParamMap and the entire dataset.

Example: model selection via train-validation split

    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

    // Prepare training and test data.
    val data = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
    val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345)

    val lr = new LinearRegression()

    // We use a ParamGridBuilder to construct a grid of parameters to search over.
    // TrainValidationSplit will try all combinations of values and determine best model using
    // the evaluator.
    val paramGrid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .addGrid(lr.fitIntercept)
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
      .build()

    // In this case the estimator is simply the linear regression.
    // A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
    val trainValidationSplit = new TrainValidationSplit()
      .setEstimator(lr)
      .setEvaluator(new RegressionEvaluator)
      .setEstimatorParamMaps(paramGrid)
      // 80% of the data will be used for training and the remaining 20% for validation.
      .setTrainRatio(0.8)

    // Run train validation split, and choose the best set of parameters.
    val model = trainValidationSplit.fit(training)

    // Make predictions on test data. model is the model with combination of parameters
    // that performed best.
    model.transform(test)
      .select("features", "label", "prediction")
      .show()