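The previous post took a brief look at the TT (TaskTracker) side of Hadoop's heartbeat mechanism; this one looks at the JT (JobTracker) side.
As we know, a heartbeat is an RPC in which the TT calls the JT's heartbeat() method. Before making the call, the TT collects its own state, wraps it in a TaskTrackerStatus object, and passes it to the JT. Let's see how the JT handles a heartbeat coming from a TT.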
1.JobTracker.heartbeat():
- // Make sure heartbeat is from a tasktracker allowed by the jobtracker.
- if (!acceptTaskTracker(status)) {
- throw new DisallowedTaskTrackerException(status);
- }
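The admission check in step 1 can be pictured as a host filter. The sketch below is a simplification under an assumption: that acceptTaskTracker() boils down to "allowed by the include list and not named in the exclude list". HostFilter and accept() are hypothetical names for illustration, not Hadoop classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not a Hadoop class.
final class HostFilter {
    private final Set<String> included; // empty set means "no include list configured"
    private final Set<String> excluded;

    HostFilter(Set<String> included, Set<String> excluded) {
        this.included = included;
        this.excluded = excluded;
    }

    // A host is accepted when the include list permits it and the exclude list does not name it.
    boolean accept(String host) {
        boolean permitted = included.isEmpty() || included.contains(host);
        return permitted && !excluded.contains(host);
    }

    public static void main(String[] args) {
        HostFilter filter = new HostFilter(
                new HashSet<>(Arrays.asList("node1", "node2")),
                new HashSet<>(Arrays.asList("node2")));
        for (String host : Arrays.asList("node1", "node2", "node3")) {
            System.out.println(host + " accepted: " + filter.accept(host));
        }
    }
}
```

In the real method a rejected tracker gets a DisallowedTaskTrackerException, as the excerpt shows; the sketch only returns a boolean.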
2.JobTracker.heartbeat():
- String trackerName = status.getTrackerName();
- long now = clock.getTime();
- if (restarted) {
- faultyTrackers.markTrackerHealthy(status.getHost());
- } else {
- faultyTrackers.checkTrackerFaultTimeout(status.getHost(), now);
- }
3.JobTracker.heartbeat():
- HeartbeatResponse prevHeartbeatResponse =
- trackerToHeartbeatResponseMap.get(trackerName);
- boolean addRestartInfo = false;
- if (initialContact != true) {
- // If this isn't the 'initial contact' from the tasktracker,
- // there is something seriously wrong if the JobTracker has
- // no record of the 'previous heartbeat'; if so, ask the
- // tasktracker to re-initialize itself.
- if (prevHeartbeatResponse == null) {
- // This is the first heartbeat from the old tracker to the newly
- // started JobTracker
- if (hasRestarted()) {
- addRestartInfo = true;
- // inform the recovery manager about this tracker joining back
- recoveryManager.unMarkTracker(trackerName);
- } else {
- // Jobtracker might have restarted but no recovery is needed
- // otherwise this code should not be reached
- LOG.warn("Serious problem, cannot find record of 'previous' " +
- "heartbeat for '" + trackerName +
- "'; reinitializing the tasktracker");
- return new HeartbeatResponse(responseId,
- new TaskTrackerAction[] {new ReinitTrackerAction()});
- }
- } else {
- // It is completely safe to not process a 'duplicate' heartbeat from a
- // {@link TaskTracker} since it resends the heartbeat when rpcs are
- // lost see {@link TaskTracker.transmitHeartbeat()};
- // acknowledge it by re-sending the previous response to let the
- // {@link TaskTracker} go forward.
- if (prevHeartbeatResponse.getResponseId() != responseId) {
- LOG.info("Ignoring 'duplicate' heartbeat from '" +
- trackerName + "'; resending the previous 'lost' response");
- return prevHeartbeatResponse;
- }
- }
- }
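The duplicate check in step 3 relies on the TT echoing back the responseId it last received. A minimal sketch of that handshake, assuming a simplified response that carries only the id (HeartbeatLedger and its members are hypothetical, and the "no previous record" branch that triggers recovery or reinitialization is left out):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the JT side of the handshake; not Hadoop code.
final class HeartbeatLedger {
    // Simplified response: just the id the tracker must echo on its next heartbeat.
    static final class Response {
        final short responseId;
        Response(short responseId) { this.responseId = responseId; }
    }

    private final Map<String, Response> lastResponse = new HashMap<>();
    int processed = 0; // number of heartbeats that were actually processed

    Response heartbeat(String tracker, boolean initialContact, short responseId) {
        Response prev = lastResponse.get(tracker);
        if (!initialContact && prev != null && prev.responseId != responseId) {
            // The tracker never saw our last reply and is repeating itself:
            // resend the cached reply instead of processing the heartbeat twice.
            return prev;
        }
        processed++;
        // The cast to short makes the id wrap around after Short.MAX_VALUE.
        Response next = new Response((short) (responseId + 1));
        lastResponse.put(tracker, next);
        return next;
    }

    public static void main(String[] args) {
        HeartbeatLedger jt = new HeartbeatLedger();
        short id = jt.heartbeat("tt1", true, (short) 0).responseId;
        jt.heartbeat("tt1", false, id);                 // reply to this one gets "lost"
        Response resent = jt.heartbeat("tt1", false, id); // the tracker retries with the old id
        System.out.println("resent id: " + resent.responseId + ", processed: " + jt.processed);
    }
}
```

The retry does not bump the processed counter, which is the point of caching the previous response per tracker.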
4.JobTracker.heartbeat():
- // Process this heartbeat
- short newResponseId = (short)(responseId + 1);
- status.setLastSeen(now);
- if (!processHeartbeat(status, initialContact, now)) {
- if (prevHeartbeatResponse != null) {
- trackerToHeartbeatResponseMap.remove(trackerName);
- }
- return new HeartbeatResponse(newResponseId,
- new TaskTrackerAction[] {new ReinitTrackerAction()});
- }
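First, responseId is incremented by 1, and the current time is recorded on the status as the tracker's last-seen time. Next, let's look at the processHeartbeat() method.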
5.JobTracker.processHeartbeat():
- boolean seenBefore = updateTaskTrackerStatus(trackerName,
- trackerStatus);
6.JobTracker.processHeartbeat():
- TaskTracker taskTracker = getTaskTracker(trackerName);
- if (initialContact) {
- // If it's first contact, then clear out
- // any state hanging around
- if (seenBefore) {
- lostTaskTracker(taskTracker);
- }
- } else {
- // If not first contact, there should be some record of the tracker
- if (!seenBefore) {
- LOG.warn("Status from unknown Tracker : " + trackerName);
- updateTaskTrackerStatus(trackerName, null);
- return false;
- }
- }
7.JobTracker.processHeartbeat():
- updateTaskStatuses(trackerStatus);
- updateNodeHealthStatus(trackerStatus, timeStamp);
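The branch in step 6 depends on only two booleans, so it can be written out as a small decision function. This is a sketch of the branch structure only; ContactOutcome and ContactCheck are hypothetical names, and the real method goes on to update task and node-health state as step 7 shows.

```java
// Hypothetical names; this models only the branch structure of step 6.
enum ContactOutcome { CLEAR_STALE_STATE_THEN_PROCEED, REJECT_AND_REINIT, PROCEED }

final class ContactCheck {
    static ContactOutcome classify(boolean initialContact, boolean seenBefore) {
        if (initialContact) {
            // A tracker that says "first contact" but is already known has restarted,
            // so whatever the JT still holds for it is stale.
            return seenBefore ? ContactOutcome.CLEAR_STALE_STATE_THEN_PROCEED
                              : ContactOutcome.PROCEED;
        }
        // Not first contact: the JT should already have a record of this tracker.
        return seenBefore ? ContactOutcome.PROCEED : ContactOutcome.REJECT_AND_REINIT;
    }

    public static void main(String[] args) {
        boolean[] flags = {true, false};
        for (boolean initialContact : flags) {
            for (boolean seenBefore : flags) {
                System.out.println("initialContact=" + initialContact
                        + " seenBefore=" + seenBefore
                        + " -> " + classify(initialContact, seenBefore));
            }
        }
    }
}
```

REJECT_AND_REINIT corresponds to processHeartbeat() returning false, which makes heartbeat() answer with a ReinitTrackerAction.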
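8.JobTracker.heartbeat(): If processHeartbeat() returns false, heartbeat() returns a HeartbeatResponse that instructs the TT to reinitialize itself (the early return in step 4). Otherwise it goes on to build the response and look for tasks to hand to the TT: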
- // Initialize the response to be sent for the heartbeat
- HeartbeatResponse response = new HeartbeatResponse(newResponseId, null);
- List<TaskTrackerAction> actions = new ArrayList<TaskTrackerAction>();
- boolean isBlacklisted = faultyTrackers.isBlacklisted(status.getHost());
- // Check for new tasks to be executed on the tasktracker
- if (recoveryManager.shouldSchedule() && acceptNewTasks && !isBlacklisted) {
- TaskTrackerStatus taskTrackerStatus = getTaskTrackerStatus(trackerName);
- if (taskTrackerStatus == null) {
- LOG.warn("Unknown task tracker polling; ignoring: " + trackerName);
- } else {
- List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus);
- if (tasks == null ) {
- tasks = taskScheduler.assignTasks(taskTrackers.get(trackerName));
- }
- if (tasks != null) {
- for (Task task : tasks) {
- expireLaunchingTasks.addNewTask(task.getTaskID());
- if(LOG.isDebugEnabled()) {
- LOG.debug(trackerName + " -> LaunchTask: " + task.getTaskID());
- }
- actions.add(new LaunchTaskAction(task));
- }
- }
- }
- }
9.JobTracker.getSetupAndCleanupTasks():
- // Don't assign *any* new task in safemode
- if (isInSafeMode()) {
- return null;
- }
- int maxMapTasks = taskTracker.getMaxMapSlots();
- int maxReduceTasks = taskTracker.getMaxReduceSlots();
- int numMaps = taskTracker.countOccupiedMapSlots();
- int numReduces = taskTracker.countOccupiedReduceSlots();
- int numTaskTrackers = getClusterStatus().getTaskTrackers();
- int numUniqueHosts = getNumberOfUniqueHosts();
- for (Iterator<JobInProgress> it = jobs.values().iterator();
- it.hasNext();) {
- JobInProgress job = it.next();
- t = job.obtainJobCleanupTask(taskTracker, numTaskTrackers,
- numUniqueHosts, true);
- if (t != null) {
- return Collections.singletonList(t);
- }
- }
- for (Iterator<JobInProgress> it = jobs.values().iterator();
- it.hasNext();) {
- JobInProgress job = it.next();
- t = job.obtainTaskCleanupTask(taskTracker, true);
- if (t != null) {
- return Collections.singletonList(t);
- }
- }
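It then tries to obtain a cleanup task for a TaskAttempt (the obtainTaskCleanupTask() loop above). The last loop does the same for job setup tasks: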
- for (Iterator<JobInProgress> it = jobs.values().iterator();
- it.hasNext();) {
- JobInProgress job = it.next();
- t = job.obtainJobSetupTask(taskTracker, numTaskTrackers,
- numUniqueHosts, true);
- if (t != null) {
- return Collections.singletonList(t);
- }
- }
If this method returns null, there is no cleanup or setup task to run, and heartbeat() falls back to assigning map/reduce tasks.
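The three loops in step 9 amount to a fixed priority order: job cleanup, then task cleanup, then job setup, with the first hit returned as a one-element list. A minimal sketch of that "first non-null wins" pattern, using strings in place of Task objects (PriorityPick and firstAvailable are hypothetical names, not Hadoop APIs):

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not a Hadoop class.
final class PriorityPick {
    // Returns a singleton list holding the first non-null candidate, or null if every source is empty.
    @SafeVarargs
    static <T> List<T> firstAvailable(Supplier<T>... sources) {
        for (Supplier<T> source : sources) {
            T candidate = source.get();
            if (candidate != null) {
                return Collections.singletonList(candidate);
            }
        }
        return null; // mirrors the excerpt: null means "let the scheduler assign tasks"
    }

    public static void main(String[] args) {
        List<String> picked = PriorityPick.<String>firstAvailable(
                () -> null,            // no job-cleanup task pending
                () -> "task-cleanup",  // first hit wins
                () -> "job-setup");    // never evaluated
        System.out.println(picked);    // prints [task-cleanup]
    }
}
```

Because the sources are evaluated lazily, lower-priority lookups are skipped once a task is found, just as each loop in the excerpt returns immediately.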
10.JobTracker.heartbeat():
- if (tasks == null ) {
- tasks = taskScheduler.assignTasks(taskTrackers.get(trackerName));
- }
11.JobTracker.heartbeat():
- if (tasks != null) {
- for (Task task : tasks) {
- expireLaunchingTasks.addNewTask(task.getTaskID());
- if(LOG.isDebugEnabled()) {
- LOG.debug(trackerName + " -> LaunchTask: " + task.getTaskID());
- }
- actions.add(new LaunchTaskAction(task));
- }
- }
- // Check for tasks to be killed
- List<TaskTrackerAction> killTasksList = getTasksToKill(trackerName);
- if (killTasksList != null) {
- actions.addAll(killTasksList);
- }
- // Check for jobs to be killed/cleanedup
- List<TaskTrackerAction> killJobsList = getJobsForCleanup(trackerName);
- if (killJobsList != null) {
- actions.addAll(killJobsList);
- }
- // Check for tasks whose outputs can be saved
- List<TaskTrackerAction> commitTasksList = getTasksToSave(status);
- if (commitTasksList != null) {
- actions.addAll(commitTasksList);
- }
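Step 11 folds several sources into one action list, and the three list-returning checks may each come back null. A minimal sketch of that null-tolerant merge, with strings standing in for TaskTrackerAction objects (ActionMerge and mergeNonNull are hypothetical names):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not a Hadoop class.
final class ActionMerge {
    // Concatenates the source lists in order, skipping any that are null.
    static List<String> mergeNonNull(List<List<String>> sources) {
        List<String> merged = new ArrayList<>();
        for (List<String> source : sources) {
            if (source != null) {
                merged.addAll(source);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> launch = Arrays.asList("launch:t1");
        List<String> killTasks = null; // nothing to kill in this round
        List<String> commit = Arrays.asList("commit:t0");
        System.out.println(mergeNonNull(Arrays.asList(launch, killTasks, commit)));
        // prints [launch:t1, commit:t0]
    }
}
```

The order of the sources is preserved, matching the excerpt: launch actions first, then kills, job cleanups, and commits.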
12.JobTracker.heartbeat():
- // calculate next heartbeat interval and put in heartbeat response
- int nextInterval = getNextHeartbeatInterval();
- response.setHeartbeatInterval(nextInterval);
- response.setActions(
- actions.toArray(new TaskTrackerAction[actions.size()]));
- // check if the restart info is req
- if (addRestartInfo) {
- response.setRecoveredJobs(recoveryManager.getJobsToRecover());
- }
- // Update the trackerToHeartbeatResponseMap
- trackerToHeartbeatResponseMap.put(trackerName, response);
- // Done processing the heartbeat, now remove 'marked' tasks
- removeMarkedTasks(trackerName);
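The excerpt does not show how getNextHeartbeatInterval() arrives at its value. One plausible policy is to stretch the interval as the cluster grows, so the JT handles a bounded number of heartbeats per second, while never going below a floor. The sketch below assumes exactly that; the formula and both constants are illustrative assumptions, not values quoted from the Hadoop source, and HeartbeatInterval is a hypothetical class.

```java
// Hypothetical helper; formula and constants are assumptions for illustration.
final class HeartbeatInterval {
    static final int MIN_INTERVAL_MS = 3000;      // assumed floor
    static final int HEARTBEATS_PER_SECOND = 100; // assumed load the JT should absorb

    // Larger clusters get a longer interval; small ones are clamped to the floor.
    static int nextIntervalMs(int clusterSize) {
        int scaled = (int) (1000 * Math.ceil((double) clusterSize / HEARTBEATS_PER_SECOND));
        return Math.max(scaled, MIN_INTERVAL_MS);
    }

    public static void main(String[] args) {
        for (int size : new int[] {50, 301, 1000}) {
            System.out.println(size + " trackers -> " + nextIntervalMs(size) + " ms");
        }
    }
}
```

Under these assumed constants, 50 trackers stay at the 3000 ms floor, while 1000 trackers would be told to wait 10000 ms between heartbeats.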
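That completes the JT's heartbeat(). Many parts along the way are fairly complex and I did not dig into them; I may come back to them when time allows. If you spot any mistakes, corrections are welcome. Thanks.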