Recently, Tez jobs on our cluster have been failing frequently. The error log is shown below:
Error log
Map 1: 555(+41)/596 Reducer 2: 0(+0,-2)/1
15/09/23 14:50:35 INFO SessionState: Map 1: 555(+41)/596 Reducer 2: 0(+0,-2)/1
Map 1: 555(+41)/596 Reducer 2: 0(+1,-2)/1
15/09/23 14:50:37 INFO SessionState: Map 1: 555(+41)/596 Reducer 2: 0(+1,-2)/1
Map 1: 555(+41)/596 Reducer 2: 0(+1,-3)/1
15/09/23 14:50:38 INFO SessionState: Map 1: 555(+41)/596 Reducer 2: 0(+1,-3)/1
Map 1: 555(+41)/596 Reducer 2: 0(+1,-3)/1
15/09/23 14:50:41 INFO SessionState: Map 1: 555(+41)/596 Reducer 2: 0(+1,-3)/1
Map 1: 555(+0)/596 Reducer 2: 0(+0,-4)/1
15/09/23 14:50:44 INFO SessionState: Map 1: 555(+0)/596 Reducer 2: 0(+0,-4)/1
Status: Failed
15/09/23 14:50:45 ERROR SessionState: Status: Failed
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1442391298043_123239_1_01, diagnostics=[Task failed, taskId=task_1442391298043_123239_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Container container_1442391298043_123239_01_008650 finished with diagnostics set to [Container preempted internally]], TaskAttempt 1 failed, info=[Container container_1442391298043_123239_01_008771 finished with diagnostics set to [Container preempted internally]], TaskAttempt 2 failed, info=[Container container_1442391298043_123239_01_009010 finished with diagnostics set to [Container preempted internally]], TaskAttempt 3 failed, info=[Container container_1442391298043_123239_01_009723 finished with diagnostics set to [Container preempted internally]]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex vertex_1442391298043_123239_1_01 [Reducer 2] killed/failed due to:null]
15/09/23 14:50:45 ERROR SessionState: Vertex failed, vertexName=Reducer 2, vertexId=vertex_1442391298043_123239_1_01, diagnostics=[Task failed, taskId=task_1442391298043_123239_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Container container_1442391298043_123239_01_008650 finished with diagnostics set to [Container preempted internally]], TaskAttempt 1 failed, info=[Container container_1442391298043_123239_01_008771 finished with diagnostics set to [Container preempted internally]], TaskAttempt 2 failed, info=[Container container_1442391298043_123239_01_009010 finished with diagnostics set to [Container preempted internally]], TaskAttempt 3 failed, info=[Container container_1442391298043_123239_01_009723 finished with diagnostics set to [Container preempted internally]]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex vertex_1442391298043_123239_1_01 [Reducer 2] killed/failed due to:null]
Vertex killed, vertexName=Map 1, vertexId=vertex_1442391298043_123239_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex killed as other vertex failed. failedTasks:0, Vertex vertex_1442391298043_123239_1_00 [Map 1] killed/failed due to:null]
15/09/23 14:50:45 ERROR SessionState: Vertex killed, vertexName=Map 1, vertexId=vertex_1442391298043_123239_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex killed as other vertex failed. failedTasks:0, Vertex vertex_1442391298043_123239_1_00 [Map 1] killed/failed due to:null]
DAG failed due to vertex failure. failedVertices:1 killedVertices:1
Analysis:
Task task_1442391298043_123239_1_01_000000 failed 4 times, each time because its container was preempted by a higher-priority job ("Container preempted internally"). The maximum number of attempts per task defaults to 4, so once the fourth attempt was preempted the task failed for good, taking the whole DAG down with it. This problem is much more likely when the cluster is heavily loaded and preemption happens often.
Solution:
Raise the default retry limits:
tez.am.task.max.failed.attempts=10
tez.am.max.app.attempts=5
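These properties can be set cluster-wide in tez-site.xml, or per session in Hive with `set` before submitting the query. A minimal sketch of the cluster-wide form (the values 10 and 5 follow the settings above; tune them to your workload):

```xml
<!-- tez-site.xml: raise retry limits so transient preemptions do not kill the DAG -->
<property>
  <name>tez.am.task.max.failed.attempts</name>
  <value>10</value>
  <description>Max attempts per task before the vertex fails (default 4)</description>
</property>
<property>
  <name>tez.am.max.app.attempts</name>
  <value>5</value>
  <description>Max ApplicationMaster attempts for the Tez AM (default 2)</description>
</property>
```

For a one-off job, the equivalent per-session override in the Hive CLI would be `set tez.am.task.max.failed.attempts=10;` before running the query. Note this only masks preemption; if preemption is constant, the job's queue capacity or priority also needs attention.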