Table of Contents
2. Increase mapreduce.map.memory.mb or mapreduce.reduce.memory.mb (recommended)
3. Increase yarn.nodemanager.vmem-pmem-ratio as appropriate
Error Background
In short, the job exceeded the memory configured for its map and reduce tasks, which caused it to fail. I had submitted an HQL statement to run on our big data platform and found that it errored out.
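As a preview of the fixes listed in the table of contents, the per-task container memory can be raised for a single Hive session with `set` statements before running the HQL. This is only a sketch; the 4096 MB value is illustrative, not what this cluster actually needs:

```sql
-- Raise the container memory requested for each map and reduce task.
-- 4096 MB is an illustrative value; choose one within your NodeManager limits.
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;

-- Keep the JVM heap below the container size (commonly around 80% of it),
-- otherwise the container can still be killed for exceeding its limit.
set mapreduce.map.java.opts=-Xmx3276m;
set mapreduce.reduce.java.opts=-Xmx3276m;
```

Note that option 3 in the table of contents, yarn.nodemanager.vmem-pmem-ratio, is a cluster-side setting in yarn-site.xml and cannot be changed per session.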
Locating the Error
Client-side log
INFO : converting to local hdfs://hacluster/tenant/yxs/product/resources/resources/jar/f3c06465-4af1-4756-894e-ce74ec11b9c3.jar
INFO : Added [/opt/huawei/Bigdata/tmp/hivelocaltmp/session_resources/2d0a2efc-776c-4ccc-957d-927079862ab2_resources/f3c06465-4af1-4756-894e-ce74ec11b9c3.jar] to class path
INFO : Added resources: [hdfs://hacluster/tenant/yxs/product/resources/resources/jar/f3c06465-4af1-4756-894e-ce74ec11b9c3.jar]
INFO : Number of reduce tasks not specified. Estimated from input data size: 2
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:10
INFO : Submitting tokens for job: job_1567609664100_85580
INFO : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:hacluster
INFO : Kind: HIVE_DELEGATION_TOKEN, Service: HiveServer2ImpersonationToken
INFO : The url to track the job: https://yiclouddata03-szzb:26001/proxy/application_1567609664100_85580/
INFO : Starting Job = job_1567609664100_85580, Tracking URL = https://yiclouddata03-szzb:26001/proxy/application_1567609664100_85580/
INFO : Kill Command = /opt/huawei/Bigdata/FusionInsight_HD_V100R002C80SPC203/install/FusionInsight-Hive-1.3.0/hive-1.3.0/bin/..//../hadoop/bin/hadoop job -kill job_1567609664100_85580
INFO : Hadoop job information for Stage-6: number of mappers: 10; number of reducers: 2
INFO : 2019-09-24 16:16:17,686 Stage-6 map = 0%, reduce = 0%
INFO : 2019-09-24 16:16:27,299 Stage-6 map = 20%, reduce = 0%, Cumulative CPU 10.12 sec
INFO : 2019-09-24 16:16:28,474 Stage-6 map = 30%, reduce = 0%, Cumulative CPU 30.4 sec
INFO : 2019-09-24 16:16:29,664 Stage-6 map = 70%, reduce = 0%, Cumulative CPU 83.44 sec
INFO : 2019-09-24 16:16:30,841 Stage-6 map = 90%, reduce = 0%, Cumulative CPU 115.79 sec
INFO : 2019-09-24 16:16:32,004 Stage-6 map = 91%, reduce = 0%, Cumulative CPU 134.73 sec
INFO : 2019-09-24 16:16:44,928 Stage-6 map = 92%, reduce = 0%, Cumulative CPU 223.25 sec
INFO : 2019-09-24 16:16:55,613 Stage-6 map = 93%, reduce = 0%, Cumulative CPU 284.27 sec
INFO : 2019-09-24 16:17:03,797 Stage-6 map = 94%, reduce = 0%, Cumulative CPU 313.69 sec
INFO : 2019-09-24 16:17:11,881 Stage-6 map = 90%, reduce = 0%, Cumulative CPU 115.79 sec
INFO : 2019-09-24 16:18:12,546 Stage-6 map = 90%, reduce = 0%, Cumulative CPU 115.79 sec
INFO : 2019-09-24 16:19:04,473 Stage-6 map = 91%, reduce = 0%, Cumulative CPU 185.47 sec
INFO : 2019-09-24 16:19:13,683 Stage-6 map = 92%, reduce = 0%, Cumulative CPU 223.35 sec
INFO : 2019-09-24 16:19:22,825 Stage-6 map = 93%, reduce = 0%, Cumulative CPU 281.97 sec
INFO : 2019-09-24 16:19:32,053 Stage-6 map = 94%, reduce = 0%, Cumulative CPU 314.97 sec
INFO : 2019-09-24 16:19:54,143 Stage-6 map = 95%, reduce = 0%, Cumulative CPU 377.36 sec
INFO : 2019-09-24 16:19:56,520 Stage-6 map = 90%, reduce = 0%, Cumulative CPU 115.79 sec
INFO : 2019-09-24 16:20:09,338 Stage-6 map = 91%, reduce = 0%, Cumulative CPU 181.59 sec
INFO : 2019-09-24 16:20:18,574 Stage-6 map = 92%, reduce = 0%, Cumulative CPU 217.27 sec
INFO : 2019-09-24 16:20:27,772 Stage-6 map = 93%, reduce = 0%, Cumulative CPU 266.25 sec
INFO : 2019-09-24 16:20:40,439 Stage-6 map = 94%, reduce = 0%, Cumulative CPU 305.32 sec
INFO : 2019-09-24 16:20:57,751 Stage-6 map = 90%, reduce = 0%, Cumulative CPU 115.79 sec
INFO : 2019-09-24 16:21:11,624 Stage-6 map = 91%, reduce = 0%, Cumulative CPU 183.87 sec
INFO : 2019-09-24 16:21:20,948 Stage-6 map = 92%, reduce = 0%, Cumulative CPU 219.12 sec
INFO : 2019-09-24 16:21:31,427 Stage-6 map = 93%, reduce = 0%, Cumulative CPU 282.71 sec
INFO : 2019-09-24 16:21:39,754 Stage-6 map = 94%, reduce = 0%, Cumulative CPU 317.99 sec
INFO : 2019-09-24 16:21:45,519 Stage-6 map = 100%, reduce = 100%, Cumulative CPU 115.79 sec
INFO : MapReduce Total cumulative CPU time: 1 minutes 55 seconds 790 msec
ERROR : Ended Job = job_1567609664100_85580 with errors
Task T_6260893799950704_20190924161555945_1_1 failed. Reason: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:283)
at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:379)
at com.dtwave.dipper.dubhe.node.executor.runner.impl.Hive2TaskRunner.doRun(Hive2TaskRunner.java:244)
at com.dtwave.dipper.dubhe.node.executor.runner.BasicTaskRunner.execute(BasicTaskRunner.java:100)
at com.dtwave.dipper.dubhe.node.executor.TaskExecutor.run(TaskExecutor.java:32)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Task run failed (Failed)
Reading this error probably leaves you baffled and staring blankly... questioning life, haha. "Return code 2" says almost nothing about the actual cause.
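One practical way to pull the YARN-side log shown in the next section is the yarn CLI, assuming log aggregation is enabled on the cluster; the application id comes from the tracking URL in the client log above:

```shell
# Fetch the aggregated container logs for the failed job.
yarn logs -applicationId application_1567609664100_85580
```

Alternatively, the same log is reachable through the tracking URL printed by the client.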
Application log
The client log alone doesn't reveal the real error, so we need to go into YARN and look at the application's run log, shown below:
2019-09-24 16:16:27,712 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 3
2019-09-24 16:16:27,712 INFO [ContainerLauncher #2] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e29_1567609664100_85580_01_000011 taskAttempt attempt_1567609664100_85580_m_000009_0
2019-09-24 16:16:27,713 INFO [ContainerLauncher #2] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1567609664100_85580_m_000009_0
2019-09-24 16:16:27,713 INFO [ContainerLauncher #2] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : yiclouddata04-SZZB:26009
2019-09-24 16:16:27,997 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:2 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:10 AssignedReds:0 CompletedMaps:3 CompletedReds:0 ContAlloc:10 ContRel:0 HostLocal:8 RackLocal:1
2019-09-24 16:16:28,005 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_e29_1567609664100_85580_01_000009
2019-09-24 16:16:28,006 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_e29_1567609664100_85580_01_000011
2019-09-24 16:16:28,006 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_e29_1567609664100_85580_01_000003
2019-09-24 16:16:28,006 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:125952, vCores:6>
2019-09-24 16:16:28,006 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 10
2019-09-24 16:16:28,006 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:2 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:7 AssignedReds:0 CompletedMaps:3 CompletedReds:0 ContAlloc:10 ContRel:0 HostLocal:8 RackLocal:1
2019-09-24 16:16:28,006 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1567609664100_85580_m_000008_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
2019-09-24 16:16:28,006 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemp