org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times

While running a Spark job, the exception in the title was thrown partway through execution:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 460.0 failed 4 times, most recent failure: Lost task 0.3 in stage 460.0 (TID 5213, ip-172-31-4-190.ap-south-1.compute.internal, executor 195): java.lang.Exception: Could not compute split, block input-0-1598598841800 of RDD 3149 not found
    at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:50)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)

This only tells us that the job did not complete; the driver-side stack trace does not show where the actual problem occurred, so we cannot diagnose it directly and need to look at the more detailed error logs.

The execution logs of every application are kept under the Hadoop installation directory:

/opt/apps/hadoop-3.1.1/logs/userlogs/

This directory lists every application that has been executed:
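If YARN log aggregation is enabled on the cluster (an assumption — check `yarn.log-aggregation-enable` in yarn-site.xml), the same logs can also be pulled from any node with the `yarn` CLI instead of ssh-ing to each machine, for example:

```shell
# Fetch all aggregated container logs for one application
# (using the failed application ID from the listing above).
yarn logs -applicationId application_1602637680764_0039

# Or inspect the per-node log directory directly on a worker:
ls -l /opt/apps/hadoop-3.1.1/logs/userlogs/application_1602637680764_0039/
```

This is convenient when the failing executor could have run on any of many nodes.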

drwx--x---. 3 root root  52 Oct 14 16:56 application_1602637680764_0035
drwx--x---. 3 root root  52 Oct 14 16:56 application_1602637680764_0036
drwx--x---. 3 root root  52 Oct 14 17:09 application_1602637680764_0037
drwx--x---. 3 root root  52 Oct 14 17:09 application_1602637680764_0038
drwx--x---. 2 root root   6 Oct 14 17:10 application_1602637680764_0039

Enter the directory of the failed application, find the container that ran the task, go into that container's directory, and inspect its standard error output, stderr:

-rw-r--r--. 1 root root  3085 Oct 14 17:10 directory.info
-rw-r-----. 1 root root  6129 Oct 14 17:10 launch_container.sh
-rw-r--r--. 1 root root   542 Oct 14 17:10 prelaunch.err
-rw-r--r--. 1 root root   100 Oct 14 17:10 prelaunch.out
-rw-r--r--. 1 root root  1033 Oct 14 17:10 stderr
-rw-r--r--. 1 root root  4521 Oct 14 17:10 stdout
-rw-r--r--. 1 root root 63914 Oct 14 17:10 syslog
-rw-r--r--. 1 root root  2642 Oct 14 17:10 syslog.shuffle

Every machine has this set of log files. Inspecting the stderr on hadoop01 shows that the error was a failure to reach the executor on hadoop02, which indicates that executor had died:

Cannot receive any reply from hadoop02:37513 in 10000 milliseconds. 

Next, check the error log on the machine where that executor ran: go to the same path on that machine and open the logs of the same application. There we find the following error:

java.lang.OutOfMemoryError: Java heap space

Ways to resolve the out-of-memory error (the right fix depends on the specific situation):

1. The executor may have crashed because it was assigned more memory than the machine actually has. Either reduce the executor memory or add memory to the machine. Note that at runtime each executor also claims extra overhead memory, at least 384 MB, on top of its configured heap.
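The 384 MB figure is a floor, not a fixed amount: on YARN, Spark's default executor memory overhead is the larger of 384 MB and 10% of the executor memory, and the container YARN must grant is the sum of the two. A small sketch of that sizing rule:

```python
def executor_container_mb(executor_memory_mb: int,
                          overhead_factor: float = 0.10,
                          min_overhead_mb: int = 384) -> int:
    """Approximate total memory YARN must grant per executor container.

    Spark on YARN requests the executor heap plus an off-heap overhead,
    which by default is max(384 MB, 10% of executor memory).
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead

# A 2 GB executor needs a 2432 MB container (2048 + 384),
# while an 8 GB executor needs 9011 MB (8192 + 819).
print(executor_container_mb(2048))
print(executor_container_mb(8192))
```

So an `--executor-memory` that looks like it fits the machine can still push the container request past what the node can actually provide.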

2. The executor crash may well be caused by data skew; that has to be addressed based on the specific business logic and code.

3. Adjust yarn-site.xml according to the actual memory and CPU core counts of each machine:

<!-- Total physical memory YARN may use on this node; default is 8192 MB -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>

<!-- Number of virtual CPU cores YARN may use on this node; default is 8 -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>

<!-- Maximum physical memory a single container can request; default is 8192 MB -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>

If the machines do not actually have these default amounts of resources, it is best to lower these values to match the real hardware.
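Putting the pieces together, a hypothetical submission for a modest worker node might size executors so that heap plus overhead stays under `yarn.scheduler.maximum-allocation-mb` (the script name and numbers below are illustrative, not from the original job):

```shell
# Example sizing for a node with limited RAM: keep each executor's
# heap + overhead (2048 MB + 512 MB) well below the YARN per-container
# maximum, leaving headroom for the NodeManager and the OS.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 2g \
  --conf spark.executor.memoryOverhead=512m \
  my_job.py
```

If a requested container exceeds `yarn.scheduler.maximum-allocation-mb`, YARN rejects it outright; if it fits on paper but the node is physically smaller, you get exactly the kind of executor death seen above.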

 
