I'm trying to do hyper-parameter tuning in Scala using a grid search with cross-validation (ParamGridBuilder and CrossValidator). I create my pipeline and fit it to my dataset, and that works properly.
Then I add a paramGrid and run cross-validation. After 4 stages it gives me this error:
scala> val cvModel = cv.fit(df)
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.xx.xx.xxx, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.xx.xx.xxx, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.xx.xx.xxx, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.xx.xx.xxx, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
19/02/13 09:16:33 WARN TaskSetManager: Lost task 2.0 in stage 152.0 (TID 916, ip-10.xx.xx.xxx.ec2.internal, executor 7): java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.spark.ml.linalg.DenseVector.apply(Vectors.scala:448)
    at org.apache.spark.ml.evaluation.BinaryClassificationEvaluator$$anonfun$1.apply(BinaryClassificationEvaluator.scala:82)
    at org.apache.spark.ml.evaluation.BinaryClassificationEvaluator$$anonfun$1.apply(BinaryClassificationEvaluator.scala:81)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter..
The error goes on for another two or three paragraphs. I can't figure out why this is happening, since this is my first time coding in Scala. As far as I understand the concepts and the code given in the examples, this should work, but it doesn't.
Here's my code:
import java.util.Calendar
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature._
import org.apache.spark.sql._
import org.apache.spark.sql.functions.lit
import java.io.PrintWriter
import java.io.File
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel
import org.apache.spark.ml.{Pipeline, PipelineModel}
val spark = SparkSession.builder().getOrCreate()
val dataset = spark.sql("select * from userdb.xgb_train_data")
// Replace nulls in numeric columns with 0
val df = dataset.na.fill(0)
// Every column except the id and the label is a feature
val header = df.columns.filter(_ != "id").filter(_ != "y_val")
val assembler = new VectorAssembler().setInputCols(header).setOutputCol("features")
val booster = new XGBoostClassifier().setLabelCol("y_val")
val pipeline = new Pipeline().setStages(Array(assembler, booster))
// Fitting the pipeline on its own works
val model = pipeline.fit(df)
val evaluator = new BinaryClassificationEvaluator().setLabelCol("y_val")
val paramGrid = new ParamGridBuilder().
  addGrid(booster.maxDepth, Array(3, 8)).
  addGrid(booster.eta, Array(0.2, 0.6)).
  build()
val cv = new CrossValidator().
  setEstimator(pipeline).
  setEvaluator(evaluator).
  setEstimatorParamMaps(paramGrid).
  setNumFolds(10)
val cvModel = cv.fit(df)
// The XGBoost model is the second pipeline stage (index 1), after the VectorAssembler
val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel].stages(1).
  asInstanceOf[XGBoostClassificationModel]
bestModel.extractParamMap()
Or is there another way to do hyper-parameter tuning and test it with cross-validation? The problem occurs when the evaluator set through setEvaluator is executed. My understanding is that the shape of my features and the shape of my predictions somehow don't match. How do I make sure they do?
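One rough way to check the shape question, sketched under some assumptions: the stack trace shows BinaryClassificationEvaluator calling DenseVector.apply and failing on index 1, and the evaluator reads the `rawPrediction` column unless told otherwise, so the length of the vectors in that column is the thing to look at. The sketch below uses a tiny synthetic DataFrame as a hypothetical stand-in for `model.transform(df)`; the names `scored`, `vecSize`, `sizes` and `index1Readable` are made up for illustration and are not from the pipeline above.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Local master so the sketch can run outside a cluster (assumption for illustration);
// inside spark-shell the existing session is returned instead.
val spark = SparkSession.builder().master("local[1]").getOrCreate()

// Hypothetical stand-in for `model.transform(df)`: two rows whose rawPrediction
// vectors hold a single element each, next to the label column.
val scored = spark.createDataFrame(Seq(
  (Vectors.dense(0.7), 1.0),
  (Vectors.dense(-0.2), 0.0)
)).toDF("rawPrediction", "y_val")

// UDF returning the number of elements in a vector column.
val vecSize = udf((v: Vector) => v.size)

// Distinct vector lengths found in the column the evaluator reads by default.
val sizes = scored.
  select(vecSize(col("rawPrediction")).as("size")).
  distinct().
  collect().
  map(_.getInt(0))

// Index 1 only exists when every vector has at least two elements.
val index1Readable = sizes.forall(_ >= 2)
println(s"rawPrediction sizes: ${sizes.mkString(", ")}; index 1 readable: $index1Readable")
```

To run the same check on the real pipeline, `scored` would be replaced by `model.transform(df)`. If the reported size is 1, that would be consistent with the ArrayIndexOutOfBoundsException: 1 above, but this only diagnoses the shape and does not by itself fix the evaluator.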
P.S. I'm running this on an EMR cluster. I also tried the same thing with the algorithm changed to logistic regression, and it works fine. I'm using XGBoost v0.8 and Spark v2.2.