Environment: Ambari + HDP 2.7.3
Background: the NameNode server failed unexpectedly and was restarted.
Problem: a PySpark script that previously ran fine now fails with "Yarn application has already ended! It might have been killed or unable to launch application master".
Resolution steps:
1. Restarted YARN from Ambari; the problem persisted.
2. Restarted HDFS from Ambari; the problem persisted.
3. Restarted Spark from Ambari; the problem persisted.
4. Wrote a test script and ran Spark in local mode. It completed normally, which confirmed the problem was on the YARN side.
5. Ran Ambari's "Run Service Check" against YARN, which produced:
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 102, in checked_call
tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 150, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 303, in _call
raise ExecutionFailed(err_msg, code, out, err)
resource_management.core.exceptions.ExecutionFailed: Execution of 'yarn org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command ls -num_containers 1 -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -timeout 300000 --queue default' returned 2. 19/01/25 06:03:17 INFO distributedshell.Client: Initializing Client
19/01/25 06:03:17 INFO distributedshell.Client: Running Client
19/01/25 06:03:17 INFO client.RMProxy: Connecting to ResourceManager at lntbdnn1.lnt/10.250.10.67:8050
19/01/25 06:03:17 INFO client.AHSProxy: Connecting to Application History server at lntbddn2.lnt/10.250.10.69:10200
19/01/25 06:03:17 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=4
19/01/25 06:03:17 INFO distributedshell.Client: Got Cluster node info from ASM
19/01/25 06:03:17 INFO distributedshell.Client: Got node report from ASM for, nodeId=lntbddn1:45454, nodeAddresslntbddn1:8042, nodeRackName/default-rack, nodeNumContainers0
19/01/25 06:03:17 INFO distributedshell.Client: Got node report from ASM for, nodeId=lntbdnn1:45454, nodeAddresslntbdnn1:8042, nodeRackName/default-rack, nodeNumContainers0
19/01/25 06:03:17 INFO distributedshell.Client: Got node report from ASM for, nodeId=lntbddn3:45454, nodeAddresslntbddn3:8042, nodeRackName/default-rack, nodeNumContainers0
19/01/25 06:03:17 INFO distributedshell.Client: Got node report from ASM for, nodeId=lntbddn2:45454, nodeAddresslntbddn2:8042, nodeRackName/default-rack, nodeNumContainers0
19/01/25 06:03:17 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.0, queueMaxCapacity=1.0, queueApplicationCount=0, queueChildQueueCount=0
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=ADMINISTER_QUEUE
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=ADMINISTER_QUEUE
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=llap, userAcl=SUBMIT_APPLICATIONS
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=llap, userAcl=ADMINISTER_QUEUE
19/01/25 06:03:17 INFO distributedshell.Client: Max mem capability of resources in this cluster 98304
19/01/25 06:03:17 INFO distributedshell.Client: Max virtual cores capabililty of resources in this cluster 25
19/01/25 06:03:17 INFO distributedshell.Client: Copy App Master jar from local filesystem and add to local environment
19/01/25 06:03:18 INFO distributedshell.Client: Set the environment for the application master
19/01/25 06:03:18 INFO distributedshell.Client: Setting up app master command
19/01/25 06:03:18 INFO distributedshell.Client: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx100m
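Step 4's local-mode check can be sketched as a minimal job like the following; the script path /tmp/local_smoke.py and the job body are illustrative assumptions, not the original test script:

```shell
# Write a trivial PySpark job and run it with the local master,
# bypassing YARN entirely. If this succeeds while --master yarn
# fails, the problem is on the YARN side, not in Spark itself.
cat > /tmp/local_smoke.py <<'EOF'
from pyspark import SparkContext

sc = SparkContext("local[2]", "local-smoke-test")
print(sc.parallelize(range(100)).sum())  # 4950 if Spark runs jobs correctly
sc.stop()
EOF

spark-submit --master 'local[2]' /tmp/local_smoke.py
```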
Then I looked at the local time: my machine showed 14:20, yet the log above was stamped 06:03. Checking each server's clock revealed that the master server's time was 8 hours behind. After correcting the master server's time and re-running the script, everything worked normally; problem solved.
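The fix above suggests a generic check worth running after any unexpected reboot: compare clocks across all cluster nodes. This is a sketch; the hostnames are taken from the log above, and SSH access plus the NTP server address are assumptions:

```shell
# Print each node's clock side by side; a node that is hours off
# (here the master host was 8 hours behind) stands out immediately.
for h in lntbdnn1 lntbddn1 lntbddn2 lntbddn3; do
  printf '%-10s ' "$h"
  ssh "$h" date '+%F %T %z'
done

# To resync a drifted node against an NTP server and write the result
# to the hardware clock so the offset does not return on the next reboot:
# ssh lntbdnn1 'sudo ntpdate pool.ntp.org && sudo hwclock --systohc'
```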