Spark ML VectorIndexer error: java.util.NoSuchElementException: key not found: 0.0


I used Spark's DecisionTree to train a model, with a VectorIndexer stage in the pipeline to transform the feature vectors. After the model was built, applying it to a large sample set (tens of millions of rows, while the set used to fit the model had only a few thousand) failed with the error shown at the end of this post.

I searched around but found no proper answer; eventually I tracked down the cause through my own experiments.

VectorIndexer (and, similarly, StringIndexer) is an Estimator that learns a mapping for feature values and then transforms them into indexed values.

For example, when building audience profiles, an income feature may take string values such as "1000-2000" or "3000-4000". To turn these values into a feature Vector you would otherwise have to map them to Double by hand; an indexer does this mapping for you automatically.
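A minimal sketch of what that learned mapping looks like with StringIndexer (the column names and the SparkSession value named spark are assumptions for illustration, not the original code):

import org.apache.spark.ml.feature.StringIndexer

// Hypothetical data: a SparkSession named spark is assumed to be in scope.
val incomes = spark.createDataFrame(Seq(
  (0, "1000-2000"),
  (1, "3000-4000"),
  (2, "1000-2000")
)).toDF("id", "income")

val indexer = new StringIndexer()
  .setInputCol("income")
  .setOutputCol("incomeIndex")

// fit learns the string -> Double mapping from the data it is given
val indexerModel = indexer.fit(incomes)

// "1000-2000" -> 0.0, "3000-4000" -> 1.0 (labels are ordered by frequency)
indexerModel.transform(incomes).show()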

This is exactly where the problem lies. When the model was trained, the sample size was only a few thousand. VectorIndexer.fit produces a VectorIndexerModel (the same kind of object as an LRModel or DecisionTreeModel), whose output feeds the decision tree as its input features. The pipeline stages were (StringIndexer, VectorIndexer, DecisionTreeClassifier, labelConverter): the first two stages transform the input, and the last stage converts the output back. A sketch of such a pipeline follows below.
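The sketch below follows the standard Spark ML decision-tree pattern with those four stages; the column names and the tiny trainingData set are made up for illustration and are not my original code:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.ml.linalg.Vectors

// Hypothetical training data: a string label and a numeric feature vector
// (a SparkSession named spark is assumed to be in scope).
val trainingData = spark.createDataFrame(Seq(
  ("yes", Vectors.dense(1.0, 0.0)),
  ("no",  Vectors.dense(2.0, 1.0)),
  ("yes", Vectors.dense(1.0, 1.0)),
  ("no",  Vectors.dense(2.0, 0.0))
)).toDF("label", "features")

// Input conversion: index the label and the feature vector.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(trainingData)

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)          // features with <= 10 distinct values become categorical
  .fit(trainingData)             // the value -> index maps are frozen here

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")

// Output conversion: turn the predicted index back into the original label string.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))

val model = pipeline.fit(trainingData)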

During model training, VectorIndexer.fit saw the whole sample set, call it A1 (training set plus test set), so both training and testing went through without any problem.

But when the model is actually applied to tens of millions of samples, call it A2, A2 contains feature values that never appeared in A1. When VectorIndexer tries to transform those features, it fails because the fitted model has no mapping for them.

Example: in the training set A1 the feature takes the two values "1000-2000" and "3000-4000", which the indexer maps to 0 and 1.

But A2 contains "1000-2000", "3000-4000" and also "6000-7000". The first two are mapped to 0 and 1, while the last one has no entry in the fitted model, so it cannot be mapped and the transform fails.
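The same effect is easy to reproduce with VectorIndexer itself, which is what actually throws the "key not found" error here. A minimal sketch (the data is made up; a SparkSession named spark is assumed):

import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors

// A1: the small set the indexer is fit on. The single feature only ever takes
// the values 1.0 and 2.0, so it is treated as categorical with those two categories.
val a1 = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(1.0)),
  Tuple1(Vectors.dense(2.0))
)).toDF("features")

// A2: the large set the fitted model is later applied to. It contains 0.0,
// a value the model has never seen.
val a2 = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(1.0)),
  Tuple1(Vectors.dense(0.0))
)).toDF("features")

val indexerModel = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5)
  .fit(a1)                       // category maps built from A1 only

// Transforming A2 fails at action time with
// java.util.NoSuchElementException: key not found: 0.0
indexerModel.transform(a2).show()

The stack trace below, from my actual job, shows exactly this failure path inside VectorIndexerModel.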


Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 57.0 failed 4 times, most recent failure: Lost task 0.3 in stage 57.0 (TID 5698, 172.19.100.2): java.util.NoSuchElementException: key not found: 0.0

        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:324)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:323)
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:323)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:317)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:362)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:362)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
        at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
        at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
        at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
        at cn.focusmedia.bigdata.etl.ml.DecisionTreeForSample$.trainModel(DecisionTreeForSample.scala:316)
        at cn.focusmedia.bigdata.etl.ml.DecisionTreeForSample$$anonfun$getModel$2.apply(DecisionTreeForSample.scala:88)
        at cn.focusmedia.bigdata.etl.ml.DecisionTreeForSample$$anonfun$getModel$2.apply(DecisionTreeForSample.scala:86)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
        at cn.focusmedia.bigdata.etl.ml.DecisionTreeForSample$.getModel(DecisionTreeForSample.scala:86)
        at cn.focusmedia.bigdata.etl.ml.DecisionTreeForSample$.main(DecisionTreeForSample.scala:35)
        at cn.focusmedia.bigdata.etl.ml.DecisionTreeForSample.main(DecisionTreeForSample.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.util.NoSuchElementException: key not found: 0.0
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:324)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:323)
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:323)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:317)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:362)
        at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:362)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)