The Spark deployments I maintain run mainly on three Hadoop clusters, plus a few smaller or private clusters, for a total of roughly 30,000 machines. The Spark versions currently in operation are Spark 2.3 and Spark 1.6. Users inevitably hit all kinds of problems, so to distill this operational experience and give other Spark users something to draw on, this series walks through how each class of problem is handled. Since the failure modes are numerous, I will split the coverage across several posts.
This post covers failures that occur in both Spark 1.6 and Spark 2.3, referred to below as common failures.
I. Common Failures
1. Cluster environment issues
1-1. A submitted Spark job stays in the ACCEPTED state for a long time
This problem is very common. The first step is to find the tracking URL in the client log, for example:
18/12/14 17:42:29 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.test
start time: 1544780544655
final status: UNDEFINED
tracking URL: http://test.test.net:8888/proxy/application_1543893582405_838478/
user: test
18/12/14 17:42:32 INFO Client: Application report for application_1543893582405_838478 (state: ACCEPTED)
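If the client output has been saved to a file, the tracking URL can be pulled out directly rather than scanned for by eye. A minimal sketch (the file name `app.log` is illustrative; here it is seeded with the sample lines from above):

```shell
# Hypothetical: the Spark client output was captured to app.log
cat > app.log <<'EOF'
18/12/14 17:42:29 INFO Client:
	 tracking URL: http://test.test.net:8888/proxy/application_1543893582405_838478/
EOF

# Extract just the URL from the "tracking URL:" line
grep -oE 'tracking URL: \S+' app.log | awk '{print $3}'
# prints: http://test.test.net:8888/proxy/application_1543893582405_838478/
```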
Here the tracking URL is http://test.test.net:8888/proxy/application_1543893582405_838478/. Opening it in a browser shows something like:
User: test
Queue: root.test
clientHost:
Name: XXXXX
Application Type: SPARK
Application Tags:
Application Priority: NORMAL (Higher Integer value indicates higher priority)
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
FinalStatus Reported by AM: Application has not completed yet.
Started: Fri Dec 14 15:50:20 +0800 2018
Elapsed: 2hrs, 3mins, 55sec
Tracking URL: ApplicationMaster
Diagnostics:
The state shown here is also ACCEPTED, and the queue is root.test.
Open http://test.test.net:8888/cluster/scheduler?openQueues=root.test and locate the root.test queue's resources; you will see something like:
Used Resources: <memory:799232, vCores:224, gCores:0>
Reserved Resources: <memory:0, vCores:0, gCores:0>
Num Active Applications: 2
Num Pending Applications: 12
Min Resources: <memory:40000, vCores:20, gCores:0>
Max Resources: <memory:800000, vCores:400, gCores:0>
Accessible Node Labels: CentOS6,CentOS7,DEFAULT_LABEL
Steady Fair Share: <memory:800000, vCores:0, gCores:0>
Instantaneous Fair Share: <memory:800000, vCores:0, gCores:0>
Focus on Max Resources versus Used Resources: memory usage (799232) is nearly at the queue's cap (800000), so the queue's resources are exhausted. There is no headroom left to allocate the new application's AM container, and the job sits in ACCEPTED (note the 12 pending applications) until resources free up.
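The headroom can be checked with quick arithmetic on the numbers from the queue page above (values hard-coded from this example; YARN reports memory in MB):

```shell
# Used vs Max for the root.test queue, taken from the scheduler page
used_mem=799232;   max_mem=800000
used_vcores=224;   max_vcores=400

echo "memory headroom: $((max_mem - used_mem)) MB"   # prints: memory headroom: 768 MB
echo "vcore headroom:  $((max_vcores - used_vcores))" # prints: vcore headroom:  176
```

768 MB of remaining memory is typically less than a single AM container request (commonly 1024 MB or more), which is why new applications queue up even though vcores are still available.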
1-2. Java 8 not installed
The job fails before a single task runs: the AM exits immediately, as the client log shows.
Container exited with a non-zero exit code 127
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.default
start time: 1546850986115
final status: FAILED
tracking URL: http://test.test.net:8888/cluster/app/application_1493705730010_45634
user: test
Moved to trash: /home/spark/cache/.sparkStaging/application_1493705730010_45634
19/01/07 16:48:07 INFO Client: Deleted staging directory hdfs://test.test.net:9000/home/spark/cache/.sparkStaging/application_1493705730010_45634
19/01/07 16:48:07 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
at org.apache.spark.scheduler.TaskSche
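The key clue is `exit code 127`: this is the shell's "command not found" status, so the container's launch script could not find the `java` binary it was told to run, which on this cluster meant Java 8 was missing on the node (or `JAVA_HOME` pointed at a nonexistent path). The 127 convention is easy to verify locally with a deliberately nonexistent command:

```shell
# A shell returns 127 when the command itself cannot be found
sh -c 'definitely_not_a_real_command_12345' 2>/dev/null
echo "exit code: $?"
# prints: exit code: 127
```

The fix is to install a JDK 8 on the affected nodes, or, assuming your cluster layout allows it, point the job at an existing JDK via the Spark-on-YARN environment settings (e.g. `spark.yarn.appMasterEnv.JAVA_HOME` and `spark.executorEnv.JAVA_HOME`).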