Java Spark: NullPointerException when using flatMap or flatMapToPair

This post analyzes a NullPointerException encountered when using flatMap in the Java API of Spark. It explains why returning null from the function causes the error and provides the corrected code, which returns a non-null (possibly empty) iterator. A worked example shows how to avoid the exception and run count() successfully.

When using flatMap in the Java version of Spark, the job fails with a NullPointerException. The error log is as follows:

20/08/28 09:41:44 INFO DAGScheduler: ResultStage 0 (count at TestJob.java:252) failed in 3.500 s due to Job aborted due to stage failure: Task 299 in stage 0.0 failed 1 times, most recent failure: Lost task 299.0 in stage 0.0 (TID 299, localhost, executor driver): java.lang.NullPointerException
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1817)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1168)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1168)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2113)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2113)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
 
Driver stacktrace:
20/08/28 09:41:44 INFO DAGScheduler: Job 0 failed: count at TestJob.java:252, took 3.555955 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 299 in stage 0.0 failed 1 times, most recent failure: Lost task 299.0 in stage 0.0 (TID 299, localhost, executor driver): java.lang.NullPointerException
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1817)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1168)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1168)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2113)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2113)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
 
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:929)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:929)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:929)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2111)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2049)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:740)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2073)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2113)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
	at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
	at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:455)
	at org.apache.spark.api.java.AbstractJavaRDDLike.count(JavaRDDLike.scala:45)
	at com.TestJob.main(TestJob.java:252)
Caused by: java.lang.NullPointerException
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1817)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1168)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1168)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2113)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2113)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
20/08/28 09:41:44 INFO SparkContext: Invoking stop() from shutdown hook
20/08/28 09:41:44 INFO SparkUI: Stopped Spark web UI at http://DESKTOP-LSJJA44.mshome.net:4040

The problematic code:

List<Tuple2<Integer, String>> list = new ArrayList<>();
list.add(new Tuple2<Integer, String>(1, "a1"));
list.add(new Tuple2<Integer, String>(2, "a2"));
JavaPairRDD<Integer, String> rdd1 = javaSparkContext.parallelize(list).flatMapToPair(x -> {
    List<Tuple2<Integer, String>> resultList = new ArrayList<>();
    if (x._1().intValue() == 1) {
        resultList.add(new Tuple2<Integer, String>(x._1, x._2));
    }
    // The null check is redundant: resultList is never null at this point.
    if (null != resultList && !resultList.isEmpty()) {
        return resultList.iterator();
    }
    return null; // BUG: returning null here is what triggers the NullPointerException
});
log.info("rdd1 count: {}", rdd1.count());
log.info("rdd1 count: {}", rdd1.count());

Root cause:

The flatMap family of operators must always return an iterator over a collection: even when there is nothing to emit, the function has to return a non-null iterator (an empty one is fine). Spark wraps the returned Java iterator in a Scala iterator and calls hasNext() on it with no null check, so returning null produces exactly the NullPointerException seen in the JIteratorWrapper.hasNext frame of the stack trace above.
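To make the failure concrete, here is a minimal, self-contained sketch (a hypothetical demo class, not Spark's actual internals) of the same dereference. Calling hasNext() on a null iterator fails just like the JIteratorWrapper.hasNext frame in the log:

import java.util.Iterator;

public class NullIteratorDemo {
    public static void main(String[] args) {
        // Stand-in for what the buggy flatMapToPair lambda returned:
        // null instead of an (empty) iterator.
        Iterator<String> returned = null;

        // Spark's Scala wrapper calls hasNext() on the underlying Java
        // iterator without a null check; the same dereference fails here
        // with java.lang.NullPointerException.
        System.out.println(returned.hasNext());
    }
}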

Corrected code:

List<Tuple2<Integer, String>> list = new ArrayList<>();
list.add(new Tuple2<Integer, String>(1, "a1"));
list.add(new Tuple2<Integer, String>(2, "a2"));
JavaPairRDD<Integer, String> rdd1 = javaSparkContext.parallelize(list).flatMapToPair(x -> {
    List<Tuple2<Integer, String>> resultList = new ArrayList<>();
    if (x._1().intValue() == 1) {
        resultList.add(new Tuple2<Integer, String>(x._1, x._2));
    }
    // Always return the iterator: an empty list still yields a valid,
    // non-null iterator, so the empty case no longer returns null.
    return resultList.iterator();
});
log.info("rdd1 count: {}", rdd1.count());
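An equivalent defensive variant (a sketch, not from the original post; it assumes the same javaSparkContext and list as above) makes the empty case explicit with Collections.emptyIterator():

// Requires: import java.util.Collections;
JavaPairRDD<Integer, String> rdd1 = javaSparkContext.parallelize(list).flatMapToPair(x -> {
    if (x._1().intValue() == 1) {
        // Emit exactly one pair for matching elements.
        return Collections.singletonList(new Tuple2<Integer, String>(x._1(), x._2())).iterator();
    }
    // Emit nothing: an empty iterator is valid where null is not.
    return Collections.<Tuple2<Integer, String>>emptyIterator();
});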

Execution result of the corrected code:

![Execution result](https://img-blog.csdnimg.cn/8e5060c9bd4446699ed9d3b7eac6d507.png#pic_center)
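When flatMapToPair is only being used to drop elements, the same logic can also be written with filter plus mapToPair, which sidesteps the null question entirely. A minimal alternative sketch, again assuming the javaSparkContext and list from above:

JavaPairRDD<Integer, String> rdd2 = javaSparkContext.parallelize(list)
        .filter(x -> x._1().intValue() == 1)  // keep only the elements to emit
        .mapToPair(x -> new Tuple2<Integer, String>(x._1(), x._2())); // one output pair per kept input
log.info("rdd2 count: {}", rdd2.count());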
