WARN TaskSetManager: Lost Task xxx: java.lang.ArrayIndexOutOfBoundsException: 1 - Scala

I'm trying to do hyper-parameter tuning in Scala using a parameter grid with cross-validation (grid CV). I create my pipeline, fit my dataset to it, and it fits properly.

Then I add a paramGrid and go for cross-validation; after 4 stages it gives me this error:

scala> val cvModel = cv.fit(df)

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.xx.xx.xxx, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.xx.xx.xxx, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.xx.xx.xxx, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.xx.xx.xxx, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}

19/02/13 09:16:33 WARN TaskSetManager: Lost task 2.0 in stage 152.0 (TID 916, ip-10.xx.xx.xxx.ec2.internal, executor 7): java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.spark.ml.linalg.DenseVector.apply(Vectors.scala:448)
    at org.apache.spark.ml.evaluation.BinaryClassificationEvaluator$$anonfun$1.apply(BinaryClassificationEvaluator.scala:82)
    at org.apache.spark.ml.evaluation.BinaryClassificationEvaluator$$anonfun$1.apply(BinaryClassificationEvaluator.scala:81)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter..

...and then two or three more paragraphs of errors. I'm not able to figure out why this is happening, since I'm coding in Scala for the first time; as far as I understand it, and going by the code given in the examples, this should work.
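From the trace, the failure is inside Spark's BinaryClassificationEvaluator. Going by the line numbers it references (Spark 2.2's BinaryClassificationEvaluator.scala:81-82), the evaluator pulls the positive-class score out of the raw-prediction vector by index, roughly like this paraphrased sketch (`predictions` is a placeholder name of mine, not from my code):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Paraphrase of the logic around BinaryClassificationEvaluator.scala:81-82:
// each rawPrediction vector is assumed to have at least two entries, and the
// class-1 score is read with rawPrediction(1). If the vector only has one
// element, that is exactly an ArrayIndexOutOfBoundsException: 1.
val scoreAndLabels = predictions
  .select(col("rawPrediction"), col("y_val").cast("double"))
  .rdd
  .map { case Row(rawPrediction: Vector, label: Double) =>
    (rawPrediction(1), label)
  }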

Here's my code:

import java.util.Calendar
import java.io.{File, PrintWriter}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature._
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql._
import org.apache.spark.sql.functions.lit
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostClassificationModel}

val spark = SparkSession.builder().getOrCreate()

val dataset = spark.sql("select * from userdb.xgb_train_data")
val df = dataset.na.fill(0)

// All columns except the id and the label go into the feature vector.
val header = df.columns.filter(_ != "id").filter(_ != "y_val")
val assembler = new VectorAssembler().setInputCols(header).setOutputCol("features")

val booster = new XGBoostClassifier().setLabelCol("y_val")
val pipeline = new Pipeline().setStages(Array(assembler, booster))
val model = pipeline.fit(df)

val evaluator = new BinaryClassificationEvaluator().setLabelCol("y_val")

val paramGrid = new ParamGridBuilder()
  .addGrid(booster.maxDepth, Array(3, 8))
  .addGrid(booster.eta, Array(0.2, 0.6))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(10)

val cvModel = cv.fit(df)

val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel]
  .stages(1) // the fitted booster is the second stage of the pipeline
  .asInstanceOf[XGBoostClassificationModel]

bestModel.extractParamMap()
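For reference, the fitted pipeline's output can be inspected like this (a debugging sketch; the column names are my assumption about what the model produces):

import org.apache.spark.ml.linalg.Vector

// Debugging sketch (assumption: the fitted model exposes the standard
// "rawPrediction" / "probability" / "prediction" columns).
val predictions = model.transform(df)
predictions.select("rawPrediction", "probability", "prediction").show(5, false)

// BinaryClassificationEvaluator needs rawPrediction vectors of length >= 2;
// check which lengths actually occur in the output.
val rawLengths = predictions
  .select("rawPrediction")
  .rdd
  .map(row => row.getAs[Vector](0).size)
  .distinct()
  .collect()
println(s"Distinct rawPrediction vector lengths: ${rawLengths.mkString(", ")}")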

Or is there any other way to do the hyper-parameter tuning and evaluate it with cross-validation? I'm facing the issue when the evaluator runs (the setEvaluator step). What I understand is that somehow my features shape and my prediction shape are not matching. But how do I make sure they do?
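One variant I'm considering trying (an assumption on my part, not verified against xgboost 0.8) is to make the binary objective explicit on the classifier and/or point the evaluator at the probability column, which should also be a two-element vector:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Sketch (assumptions: XGBoostClassifier in 0.8 supports setObjective, and
// the model emits a two-element "probability" vector).
val binaryBooster = new XGBoostClassifier()
  .setLabelCol("y_val")
  .setObjective("binary:logistic") // make the binary objective explicit

val probEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("y_val")
  .setRawPredictionCol("probability") // score on the probability vector instead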

P.S. I'm running this on an EMR cluster. I also tried the same thing with the algorithm changed to logistic regression, and it works fine. I'm using xgboost v0.8 and Spark v2.2.
