While running a Spark job, the following exception was thrown partway through execution:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 460.0 failed 4 times, most recent failure: Lost task 0.3 in stage 460.0 (TID 5213, ip-172-31-4-190.ap-south-1.compute.internal, executor 195): java.lang.Exception: Could not compute split, block input-0-1598598841800 of RDD 3149 not found
at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:50)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
From this we can only tell that the job did not finish; the stack trace alone does not show where the real problem is, so we have to go to the logs for more detailed error information.
Each application's execution logs live under the Hadoop installation directory:
/opt/apps/hadoop-3.1.1/logs/userlogs/
This directory lists every application that has run:
drwx--x---. 3 root root 52 Oct 14 16:56 application_1602637680764_0035
drwx--x---. 3 root root 52 Oct 14 16:56 application_1602637680764_0036
drwx--x---. 3 root root 52 Oct 14 17:09 application_1602637680764_0037
drwx--x---. 3 root root 52 Oct 14 17:09 application_1602637680764_0038
drwx--x---. 2 root root 6 Oct 14 17:10 application_1602637680764_0039
Go into the directory of the failed application, find the container that ran the task, enter it, and look at its standard error output, stderr:
-rw-r--r--. 1 root root 3085 Oct 14 17:10 directory.info
-rw-r-----. 1 root root 6129 Oct 14 17:10 launch_container.sh
-rw-r--r--. 1 root root 542 Oct 14 17:10 prelaunch.err
-rw-r--r--. 1 root root 100 Oct 14 17:10 prelaunch.out
-rw-r--r--. 1 root root 1033 Oct 14 17:10 stderr
-rw-r--r--. 1 root root 4521 Oct 14 17:10 stdout
-rw-r--r--. 1 root root 63914 Oct 14 17:10 syslog
-rw-r--r--. 1 root root 2642 Oct 14 17:10 syslog.shuffle
These log files exist on every machine. Checking stderr on hadoop01 shows that the failure happened because the driver could not reach an executor on hadoop02, which means that executor had died:
Cannot receive any reply from hadoop02:37513 in 10000 milliseconds.
Now check the error logs on the machine that hosted that executor: go to the same path, open the same application's logs, and the following error turns up:
java.lang.OutOfMemoryError: Java heap space
Ways to resolve the out-of-memory error (which one applies depends on the specific situation):
1. The executor may have died because it was assigned more memory than the machine actually has. Either reduce the executor memory (spark.executor.memory) or add memory to the machine. Keep in mind that at runtime each executor claims an extra 384 MB of memory on top of the amount you assign.
2. The executor crash may well be caused by data skew; that has to be fixed in the business logic and code.
3. Adjust yarn-site.xml according to the machine's actual memory and CPU core count:
<!-- Total physical memory YARN may use on this node; default is 8192 MB -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<!-- Number of virtual CPU cores YARN may use on this node; default is 8 -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>
<!-- Maximum physical memory a single container can request; default is 8192 MB -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
If the machine does not actually have these resources, it is best to change these parameters to match its real configuration.
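For point 1, the extra memory can be worked out directly. By default, Spark on YARN requests an overhead of max(384 MB, 10% of the executor heap) on top of spark.executor.memory (the 384 MB mentioned above is this minimum), and YARN rejects any container larger than yarn.scheduler.maximum-allocation-mb. A quick sketch of the arithmetic (the 8192 MB cap is just the default from the config above):

```python
# Spark-on-YARN default executor overhead: max(384 MB, 10% of the heap).
OVERHEAD_MIN_MB = 384
OVERHEAD_FACTOR = 0.10

def container_request_mb(executor_memory_mb):
    """Total memory YARN is asked for: executor heap plus overhead."""
    overhead = max(OVERHEAD_MIN_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    return executor_memory_mb + overhead

def fits_in_yarn(executor_memory_mb, max_allocation_mb=8192):
    # A single container must stay within yarn.scheduler.maximum-allocation-mb.
    return container_request_mb(executor_memory_mb) <= max_allocation_mb

print(container_request_mb(4096))  # 4096 + 409 = 4505
print(fits_in_yarn(8192))          # False: 8192 + 819 exceeds the 8192 cap
```

So with the default 8192 MB cap, asking for an 8 GB executor will never be granted; the request has to leave room for the overhead.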
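For point 2, one common code-level fix for a skewed aggregation is key salting: split the hot key into several artificial sub-keys, aggregate those, then strip the salt and combine the partial results. A minimal sketch of the idea in plain Python (no Spark needed; the same two-pass logic applies to an RDD or DataFrame, and the key names here are made up):

```python
import random
from collections import Counter, defaultdict

random.seed(0)

def salt(key, n_salts):
    # Spread one hot key over n_salts artificial sub-keys.
    return f"{key}_{random.randrange(n_salts)}"

# Simulate 10,000 records that all share a single hot key.
records = [("hot_key", 1)] * 10_000

# Pass 1: aggregate per salted key (in Spark this is the wide,
# formerly-skewed shuffle, now spread over 8 partitions).
partial = Counter(salt(k, 8) for k, _ in records)

# Pass 2: strip the salt and combine the 8 small partial sums.
total = defaultdict(int)
for salted_key, count in partial.items():
    original = salted_key.rsplit("_", 1)[0]
    total[original] += count

print(dict(total))  # {'hot_key': 10000}
```

The first pass replaces one overloaded partition with n_salts smaller ones; the second pass is cheap because it only combines n_salts rows per original key.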