WARN TaskSetManager: Lost Task xxx: java.lang.ArrayIndexOutOfBoundsException: 1 - Scala...


好好说gg 戈壁风


Article link: https://blog.csdn.net/weixin_36373860/article/details/114742988


I'm trying to do hyper-parameter tuning in Scala, using a grid search with cross-validation. I create my pipeline and fit my dataset to it, and that fits properly.

Then I add a paramGrid and run cross-validation. After 4 stages it gives me this error:

scala> val cvModel = cv.fit(df)
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.xx.xx.xxx, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
19/02/13 09:16:33 WARN TaskSetManager: Lost task 2.0 in stage 152.0 (TID 916, ip-10.xx.xx.xxx.ec2.internal, executor 7): java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.spark.ml.linalg.DenseVector.apply(Vectors.scala:448)
    at org.apache.spark.ml.evaluation.BinaryClassificationEvaluator$$anonfun$1.apply(BinaryClassificationEvaluator.scala:82)
    at org.apache.spark.ml.evaluation.BinaryClassificationEvaluator$$anonfun$1.apply(BinaryClassificationEvaluator.scala:81)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter..

The error output continues for another two or three paragraphs. I can't figure out why this is happening, since this is my first time coding in Scala. As far as I understand the concept and the code given in the examples, this should work, but it doesn't.
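Reading the trace from the top: the exception is thrown by DenseVector.apply, called from BinaryClassificationEvaluator, and the "1" in the message is the index that was requested. That suggests the evaluator asked for element 1 of a vector holding fewer than two values. The plain-Scala sketch below (no Spark required) mimics that lookup. The object and method names are hypothetical and only stand in for the evaluator's behaviour.

```scala
// Plain-Scala sketch, no Spark needed. All names here are hypothetical.
object RawPredictionIndexSketch {
  // Stand-in for the evaluator's per-row lookup, which reads element 1.
  def scoreOfPositiveClass(rawPrediction: Array[Double]): Double =
    rawPrediction(1)

  def main(args: Array[String]): Unit = {
    val twoElements = Array(-0.3, 0.3) // one score per class: index 1 exists
    val oneElement  = Array(0.3)       // a single value: index 1 does not exist

    println(scoreOfPositiveClass(twoElements))

    // Wrap the failing lookup in Try so the sketch keeps running.
    val failure = scala.util.Try(scoreOfPositiveClass(oneElement))
    println(failure.isFailure)
    println(failure.failed.get.getClass.getName)
  }
}
```

If this reading is right, the thing to inspect is the length of the prediction vectors the classifier writes, rather than the shape of the features.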

Here's my code:

import java.io.{File, PrintWriter}
import java.util.Calendar

import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature._
import org.apache.spark.ml.tuning._
import org.apache.spark.sql._
import org.apache.spark.sql.functions.lit
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostClassificationModel}

val spark = SparkSession.builder().getOrCreate()

// Load the training table and replace null/NaN values in numeric columns with 0.
val dataset = spark.sql("select * from userdb.xgb_train_data")
val df = dataset.na.fill(0)

// Every column except the id and the label is used as a feature.
val header = df.columns.filter(_ != "id").filter(_ != "y_val")
val assembler = new VectorAssembler().setInputCols(header).setOutputCol("features")
val booster = new XGBoostClassifier().setLabelCol("y_val")
val pipeline = new Pipeline().setStages(Array(assembler, booster))

// Fitting the pipeline on its own works.
val model = pipeline.fit(df)

val evaluator = new BinaryClassificationEvaluator().setLabelCol("y_val")

// 2 x 2 grid over tree depth and learning rate.
val paramGrid = new ParamGridBuilder().
  addGrid(booster.maxDepth, Array(3, 8)).
  addGrid(booster.eta, Array(0.2, 0.6)).
  build()

val cv = new CrossValidator().
  setEstimator(pipeline).
  setEvaluator(evaluator).
  setEstimatorParamMaps(paramGrid).
  setNumFolds(10)

// This is the call that fails.
val cvModel = cv.fit(df)

// Stage 0 is the assembler, stage 1 is the booster.
val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel].stages(1).
  asInstanceOf[XGBoostClassificationModel]
bestModel.extractParamMap()
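To get a feel for how much work this cross-validation requests, the sketch below counts the fits in plain Scala, using the grid values and fold count from the code above. The object and value names are hypothetical. It assumes the usual CrossValidator behaviour of fitting every parameter combination once per fold and then refitting the best combination on the full dataset.

```scala
// Plain-Scala sketch, no Spark needed. All names here are hypothetical.
object GridSizeSketch {
  val maxDepths: Seq[Int] = Seq(3, 8)
  val etas: Seq[Double] = Seq(0.2, 0.6)
  val numFolds: Int = 10

  // Cartesian product of the grid values: one entry per parameter combination.
  val combinations: Seq[(Int, Double)] =
    for (depth <- maxDepths; eta <- etas) yield (depth, eta)

  // Each combination is fitted once per fold during the search.
  val fitsDuringSearch: Int = combinations.size * numFolds

  // Assumed: one more fit of the best combination on the whole dataset.
  val totalFits: Int = fitsDuringSearch + 1

  def main(args: Array[String]): Unit = {
    combinations.foreach { case (depth, eta) =>
      println(s"maxDepth=$depth eta=$eta")
    }
    println(s"fits during search: $fitsDuringSearch, including final refit: $totalFits")
  }
}
```

While debugging, a smaller fold count keeps each attempt short, since the evaluator already runs after the first fit.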

Or is there another way to do hyper-parameter tuning and test with cross-validation? The failure occurs when the evaluator set through setEvaluator is executed. My understanding is that the shape of my features and the shape of the predictions somehow don't match. But how do I make sure they do?

P.S. I'm running this on an EMR cluster. I also tried the same thing with the algorithm changed to logistic regression, and it works fine. I'm using XGBoost v0.8 and Spark v2.2.
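One way to check the shapes, as a sketch only, is to count the elements in each prediction vector before handing the DataFrame to the evaluator. The example below builds a small made-up DataFrame in a local SparkSession. The column name rawPrediction is the Spark ML default and is assumed here, and the helper distinctVectorSizes is hypothetical.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object RawPredictionSizeCheck {
  // Hypothetical helper: distinct lengths of the vectors stored in `vectorCol`.
  def distinctVectorSizes(scored: DataFrame, vectorCol: String): Array[Int] = {
    val vectorSize = udf((v: Vector) => v.size)
    scored.select(vectorSize(col(vectorCol)).as("size"))
      .distinct()
      .collect()
      .map(_.getInt(0))
      .sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("raw-prediction-size-check")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for model.transform(df): one value per vector.
    val scored = Seq(
      (Vectors.dense(0.3), 1.0),
      (Vectors.dense(-1.2), 0.0)
    ).toDF("rawPrediction", "y_val")

    // Print the distinct vector lengths found in the column.
    println(distinctVectorSizes(scored, "rawPrediction").mkString(", "))
    spark.stop()
  }
}
```

On the real data, the same helper could be applied to model.transform(df). If it reports a size of 1, a lookup of element 1 cannot succeed. Whether pointing the evaluator at a different column through setRawPredictionCol is appropriate depends on what the XGBoost version in use writes to each column, which is not verified here.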
