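What happens if the JobTracker suddenly goes down while a MapReduce job is running? The JobTracker (JT) is only responsible for job scheduling, bookkeeping, and monitoring, while the actual task execution happens on the TaskTrackers, so it is entirely possible to restart the JT without losing the work already done by running tasks. What the JT has to do is back up the job execution state to files and read those files on restart in order to recover (the article "MapReduce Job Files" already summarized the backup files involved).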
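To enable Restart Recovery, set mapred.jobtracker.restart.recover to true (the default is false); this is the property name that the JobTracker code quoted in this article reads. When the JT restarts, it performs the following steps in order to restore the running state of each job: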
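1. Traverse the entries in the system directory and put the IDs of the jobs to be recovered into the set jobsToRecover. Because the JT creates a directory named after the jobId under the system directory for every running job, this traversal yields all jobs that had not finished before the JT restarted.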
JobTracker.JobTracker()
FileStatus[] systemDirData = fs.listStatus(this.systemDir);
// Check if the history is enabled .. as we cant have persistence with
// history disabled
if (conf.getBoolean("mapred.jobtracker.restart.recover", false)
    && systemDirData != null) {
  for (FileStatus status : systemDirData) {
    try {
      recoveryManager.checkAndAddJob(status);
    } catch (Throwable t) {
      LOG.warn("Failed to add the job " + status.getPath().getName(),
               t);
    }
  }
  // ... (rest of this block is omitted from the excerpt)
}
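The directory scan in step 1 can be illustrated without Hadoop. The following is a minimal sketch, assuming a local directory stands in for the system directory and that job directories are named like job_201310190623_0001. The class SystemDirScanSketch, the method findJobsToRecover, and the regular expression are hypothetical names for illustration; this is not what RecoveryManager.checkAndAddJob actually does internally.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

class SystemDirScanSketch {
    // Assumed job ID shape: "job_", a JobTracker start identifier, a sequence number.
    private static final Pattern JOB_ID = Pattern.compile("job_\\d+_\\d+");

    /** Returns the names of sub-directories of systemDir that look like job IDs. */
    static Set<String> findJobsToRecover(Path systemDir) throws IOException {
        Set<String> jobsToRecover = new TreeSet<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(systemDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                // Only directories whose name is a well-formed job ID are candidates.
                if (Files.isDirectory(entry) && JOB_ID.matcher(name).matches()) {
                    jobsToRecover.add(name);
                }
            }
        }
        return jobsToRecover;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory that mimics a system directory.
        Path systemDir = Files.createTempDirectory("mapred-system");
        Files.createDirectory(systemDir.resolve("job_201310190623_0001"));
        Files.createDirectory(systemDir.resolve("job_201310190623_0002"));
        Files.createFile(systemDir.resolve("not-a-job.txt")); // skipped by the scan
        System.out.println(findJobsToRecover(systemDir));
    }
}
```

As in the JT code above, an entry that does not qualify is simply skipped rather than aborting the whole scan.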
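2. Iterate over the set jobsToRecover. For each JobId to be recovered, rebuild the JobInProgress object, add it to the job scheduling queue, and assemble the history filename from the JobId, userId, jobName, and related information.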
JobTracker.main() --> JobTracker.offerService() --> RecoveryManager.recover()
String logFileName =
  JobHistory.JobInfo.getJobHistoryFileName(job.getJobConf(), id);
if (logFileName != null) {
  Path jobHistoryFilePath =
    JobHistory.JobInfo.getJobHistoryLogLocation(logFileName);
  JobHistory.JobInfo.recoverJobHistoryFile(job.getJobConf(), jobHistoryFilePath);
  jobHistoryFilenameMap.put(job.getJobID(), jobHistoryFilePath);
} else {
  LOG.info("No history file found for job " + id);
  idIter.remove(); // remove from recovery list
}
addJob(id, job);
3. Read and parse the history file from the history directory to restore the job's runtime data.
JobTracker.main() --> JobTracker.offerService() --> RecoveryManager.recover()
JobRecoveryListener listener = new JobRecoveryListener(pJob);
try {
  JobHistory.parseHistoryFromFS(jobHistoryFilePath.toString(),
                                listener, fs);
} catch (Throwable t) {
  LOG.info("Error reading history file of job " + pJob.getJobID()
           + ". Ignoring the error and continuing.", t);
}
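A history file consists of records such as the following:
Job JOBID="job_201310190623_0001" LAUNCH_TIME="1382181102677" TOTAL_MAPS="100" TOTAL_REDUCES="1" JOB_STATUS="PREP" .
Task TASKID="task_201310190623_0001_m_000101" TASK_TYPE="SETUP" START_TIME="1382181104661" SPLITS="" .
The leading Job or Task is the recordType, and the remaining key-value pairs carry the job-related information. The JT defines the following keys: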
public static enum Keys {
  JOBTRACKERID,
  START_TIME, FINISH_TIME, JOBID, JOBNAME, USER, JOBCONF, SUBMIT_TIME,
  LAUNCH_TIME, TOTAL_MAPS, TOTAL_REDUCES, FAILED_MAPS, FAILED_REDUCES,
  FINISHED_MAPS, FINISHED_REDUCES, JOB_STATUS, TASKID, HOSTNAME, TASK_TYPE,
  ERROR, TASK_ATTEMPT_ID, TASK_STATUS, COPY_PHASE, SORT_PHASE, REDUCE_PHASE,
  SHUFFLE_FINISHED, SORT_FINISHED, COUNTERS, SPLITS, JOB_PRIORITY, HTTP_PORT,
  TRACKER_NAME, STATE_STRING, VERSION, MAP_COUNTERS, REDUCE_COUNTERS,
  VIEW_JOB, MODIFY_JOB, JOB_QUEUE, FAIL_REASON
}
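To make the record layout concrete, here is a minimal sketch of splitting one history line into its recordType and key-value pairs. It is not Hadoop's own parser: the class HistoryRecordSketch and its regular expression are hypothetical, and the sketch assumes that values never contain an escaped double quote.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HistoryRecordSketch {
    // Matches KEY="value"; assumes values contain no escaped quotes.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    final String recordType;
    final Map<String, String> values;

    HistoryRecordSketch(String recordType, Map<String, String> values) {
        this.recordType = recordType;
        this.values = values;
    }

    /** Splits a line into the leading record type and its KEY="value" pairs. */
    static HistoryRecordSketch parse(String line) {
        String trimmed = line.trim();
        int firstSpace = trimmed.indexOf(' ');
        if (firstSpace < 0) {
            throw new IllegalArgumentException("Not a history record: " + line);
        }
        String type = trimmed.substring(0, firstSpace);
        Map<String, String> values = new LinkedHashMap<>(); // keeps the order in the line
        Matcher m = PAIR.matcher(trimmed.substring(firstSpace + 1));
        while (m.find()) {
            values.put(m.group(1), m.group(2));
        }
        return new HistoryRecordSketch(type, values);
    }

    public static void main(String[] args) {
        String line = "Job JOBID=\"job_201310190623_0001\" LAUNCH_TIME=\"1382181102677\" "
                + "TOTAL_MAPS=\"100\" TOTAL_REDUCES=\"1\" JOB_STATUS=\"PREP\" .";
        HistoryRecordSketch rec = parse(line);
        System.out.println(rec.recordType + " " + rec.values);
    }
}
```

The trailing " ." that ends each record is ignored here because it never matches the KEY="value" pattern.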
With the data read from the history file, the JT can return to the state it was in before the restart.
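The idea of rebuilding state by replaying records can be sketched as follows. This is only an illustration under stated assumptions: HistoryReplaySketch, Rec, and RecoveredJob are hypothetical names, the records are assumed to be already parsed into key-value pairs, and the real JobRecoveryListener does considerably more (it updates the JobInProgress and its tasks). The second Job record in main is made up to show a later record overriding an earlier one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HistoryReplaySketch {
    /** One parsed history record: a record type plus its key-value pairs. */
    static final class Rec {
        final String type;
        final Map<String, String> values = new LinkedHashMap<>();

        Rec(String type, String... keyValuePairs) {
            this.type = type;
            for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
                values.put(keyValuePairs[i], keyValuePairs[i + 1]);
            }
        }
    }

    /** Hypothetical recovered view of one job, rebuilt purely from records. */
    static final class RecoveredJob {
        final Map<String, String> jobFields = new LinkedHashMap<>();
        final Map<String, Map<String, String>> tasks = new LinkedHashMap<>();

        void apply(Rec rec) {
            if ("Job".equals(rec.type)) {
                // Later records overwrite earlier values for the same key.
                jobFields.putAll(rec.values);
            } else if ("Task".equals(rec.type)) {
                String taskId = rec.values.get("TASKID");
                tasks.computeIfAbsent(taskId, k -> new LinkedHashMap<>())
                     .putAll(rec.values);
            }
            // Other record types are ignored in this sketch.
        }
    }

    /** Applies the records in file order and returns the accumulated state. */
    static RecoveredJob replay(List<Rec> records) {
        RecoveredJob job = new RecoveredJob();
        for (Rec rec : records) {
            job.apply(rec);
        }
        return job;
    }

    public static void main(String[] args) {
        List<Rec> records = new ArrayList<>();
        records.add(new Rec("Job", "JOBID", "job_201310190623_0001", "JOB_STATUS", "PREP"));
        records.add(new Rec("Task", "TASKID", "task_201310190623_0001_m_000101",
                            "TASK_TYPE", "SETUP"));
        // Made-up follow-up record, present only to demonstrate overriding.
        records.add(new Rec("Job", "JOBID", "job_201310190623_0001", "JOB_STATUS", "RUNNING"));
        RecoveredJob job = replay(records);
        System.out.println(job.jobFields.get("JOB_STATUS") + ", tasks=" + job.tasks.size());
    }
}
```

Replaying in file order matters: because each record describes a change in state, applying them sequentially leaves the in-memory view at the last state the JT logged before it went down.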