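The first phase after a reduce task starts is shuffle (fetching data from the map side). Each fetch can fail for reasons such as connect timeout, read timeout, or checksum error, so the reduce task keeps a counter for each map that records how many times fetching that map's output has failed. When the failure count reaches a certain threshold, the reduce task notifies the MRAppMaster that too many fetches from that map have failed, and a corresponding log message is printed.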
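The per-map counter described above can be sketched roughly as follows. This is a simplified illustration, not Hadoop's actual shuffle code: the class name FetchFailureCounter, its methods, and the threshold value of 3 are all hypothetical and chosen only to show the idea of counting failures per map and signalling once a threshold is reached.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not a Hadoop class.
class FetchFailureCounter {
    // Number of failed fetches recorded so far, keyed by map attempt ID.
    private final Map<String, Integer> failuresByMap = new HashMap<>();
    // Assumed threshold: signal once a map has failed this many times.
    private final int reportThreshold;

    FetchFailureCounter(int reportThreshold) {
        if (reportThreshold < 1) {
            throw new IllegalArgumentException("reportThreshold must be >= 1");
        }
        this.reportThreshold = reportThreshold;
    }

    // Records one failed fetch for the given map and returns true when the
    // accumulated count for that map has reached the threshold.
    boolean recordFailure(String mapId) {
        int failures = failuresByMap.merge(mapId, 1, Integer::sum);
        return failures >= reportThreshold;
    }

    // Returns the number of failures recorded for the given map (0 if none).
    int failuresFor(String mapId) {
        return failuresByMap.getOrDefault(mapId, 0);
    }

    public static void main(String[] args) {
        FetchFailureCounter counter = new FetchFailureCounter(3);
        for (int i = 1; i <= 3; i++) {
            boolean report = counter.recordFailure("map_0");
            System.out.println("failure " + i + " -> report=" + report);
        }
    }
}
```

With a threshold of 3, the first two failures for "map_0" return false and the third returns true. Counts for other map IDs are tracked independently.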
The threshold is computed as follows:
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.java
float failureRate = runningReduceTasks == 0 ? 1.0f :
    (float) fetchFailures / runningReduceTasks;
// declare faulty if fetch-failures >= max-allowed-failures
boolean isMapFaulty =
    (failureRate >= MAX_ALLOWED_FETCH_FAILURES_FRACTION);
if (fetchFailures >= MAX_FETCH_FAILURES_NOTIFICATIONS && isMapFaulty) {
  LOG.info("Too many fetch-failures for output of task attempt: " +
      mapId + " ... raising fetch failure to map");