Starting the Hadoop JobTracker
The org.apache.hadoop.mapred.JobTracker class implements the JobTracker role in the Hadoop MapReduce model. It is mainly responsible for accepting jobs, initializing and scheduling them, and monitoring the TaskTrackers. In Hadoop the JobTracker runs in its own JVM, and its main() method is the entry point that starts it.
public static void main(String argv[]) throws IOException, InterruptedException {
  StringUtils.startupShutdownMessage(JobTracker.class, argv, LOG);
  try {
    if (argv.length == 0) {
      // create a JobTracker instance
      JobTracker tracker = startTracker(new JobConf());
      // let the JobTracker start offering service
      tracker.offerService();
    } else {
      if ("-dumpConfiguration".equals(argv[0]) && argv.length == 1) {
        dumpConfiguration(new PrintWriter(System.out));
      } else {
        System.out.println("usage: JobTracker [-dumpConfiguration]");
        System.exit(-1);
      }
    }
  } catch (Throwable e) {
    LOG.fatal(StringUtils.stringifyException(e));
    System.exit(-1);
  }
}
startTracker(new JobConf()) is a static method that calls the JobTracker constructor to create a JobTracker instance named result, and then performs a series of initialization steps on it: creating the TaskScheduler, starting the RPC server, starting the embedded Jetty server, checking whether the JobTracker needs to be restarted, and so on.
JobTracker.offerService() then calls the start() method of the task scheduler object taskScheduler to start the scheduler. That object is a member of JobTracker, of type TaskScheduler, which provides a set of interfaces that let the JobTracker initialize and schedule every submitted job. TaskScheduler itself is an abstract class; the concrete implementation is the class named by the configuration property "mapred.jobtracker.taskScheduler", which defaults to JobQueueTaskScheduler.

public static JobTracker startTracker(JobConf conf)
    throws IOException, InterruptedException {
  return startTracker(conf, generateNewIdentifier());
}

public static JobTracker startTracker(JobConf conf, String identifier)
    throws IOException, InterruptedException {
  DefaultMetricsSystem.initialize("JobTracker");
  JobTracker result = null;
  while (true) {
    try {
      // create the JobTracker instance
      result = new JobTracker(conf, identifier);
      // Set the TaskScheduler's manager to this JobTracker; as this shows,
      // TaskScheduler and JobTracker hold references to each other.
      result.taskScheduler.setTaskTrackerManager(result);
      .....
      return result;
    }
/**
 * Run forever
 */
public void offerService() throws InterruptedException, IOException {
  // Prepare for recovery. This is done irrespective of the status of restart
  .....
  // start the task scheduler
  taskScheduler.start();
  .....
}
Starting the Hadoop TaskScheduler
Let's look at start() in the default scheduler, JobQueueTaskScheduler:

@Override
public synchronized void start() throws IOException {
  super.start();
  // Register a JobQueueJobInProgressListener.
  // JobQueueJobInProgressListener maintains a queue of JobInProgress objects in a
  // fixed order (FIFO by default) and listens for changes to each JobInProgress
  // instance over its lifetime.
  taskTrackerManager.addJobInProgressListener(jobQueueJobInProgressListener);
  // Register an EagerTaskInitializationListener.
  // EagerTaskInitializationListener keeps watching jobInitQueue; as soon as a new
  // job is submitted (i.e. a new JobInProgress instance is added), it calls that
  // instance's initTasks() method to initialize the job.
  eagerTaskInitializationListener.setTaskTrackerManager(taskTrackerManager);
  eagerTaskInitializationListener.start();
  taskTrackerManager.addJobInProgressListener(eagerTaskInitializationListener);
}
The initialization work is carried out by EagerTaskInitializationListener's inner classes JobInitManager and InitJob:

// Used to init new jobs that have just been created
class JobInitManager implements Runnable {
  public void run() {
    JobInProgress job = null;
    while (true) {
      try {
        synchronized (jobInitQueue) {
          while (jobInitQueue.isEmpty()) {
            jobInitQueue.wait();
          }
          job = jobInitQueue.remove(0);
        }
        threadPool.execute(new InitJob(job));
      } catch (InterruptedException t) {
        LOG.info("JobInitManagerThread interrupted.");
        break;
      }
    }
    LOG.info("Shutting down thread pool");
    threadPool.shutdownNow();
  }
}

class InitJob implements Runnable {
  private JobInProgress job;

  public InitJob(JobInProgress job) {
    this.job = job;
  }

  public void run() {
    // Call initJob(JobInProgress) on the TaskTrackerManager (i.e. the JobTracker);
    // JobTracker.initJob() in turn completes the job initialization by calling
    // JobInProgress.initTasks().
    ttm.initJob(job);
  }
}
Initializing a Hadoop Job
Now let's look at how a job is initialized:

/**
 * Construct the splits, etc.  This is invoked from an async
 * thread so that split-computation doesn't block anyone.
 */
public synchronized void initTasks()
    throws IOException, KillInterruptedException, UnknownHostException {
  .....
  //
  // read input splits and create a map per a split
  //
  // Fetch the split metadata (TaskSplitMetaInfo) of the job's input from the
  // JobTracker, including the split location (splitLocation), the offset
  // (startOffset), the input data length (inputDataLength) and the hosts (locations).
  TaskSplitMetaInfo[] splits = createSplits(jobId);
  if (numMapTasks != splits.length) {
    throw new IOException("Number of maps in JobConf doesn't match number of " +
        "recieved splits for job " + jobId + "! " +
        "numMapTasks=" + numMapTasks + ", #splits=" + splits.length);
  }
  // The number of map tasks equals the number of input splits,
  // i.e. one map processes one input split.
  numMapTasks = splits.length;

  //
  // Create Map Tasks
  //
  maps = new TaskInProgress[numMapTasks];
  for (int i = 0; i < numMapTasks; ++i) {
    inputLength += splits[i].getInputDataLength();
    maps[i] = new TaskInProgress(jobId, jobFile, splits[i],
                                 jobtracker, conf, this, i, numSlotsPerMap);
  }
  LOG.info("Input size for job " + jobId + " = " + inputLength
      + ". Number of splits = " + splits.length);

  // Set localityWaitFactor before creating cache
  localityWaitFactor =
    conf.getFloat(LOCALITY_WAIT_FACTOR, DEFAULT_LOCALITY_WAIT_FACTOR);
  // Map tasks are put into nonRunningMapCache, a Map<Node, List<TaskInProgress>>;
  // that is, a map task will be assigned to the Node where its input split lives.
  // A Node here can be a datanode, a rack or a data center.
  // nonRunningMapCache is used later when the JobTracker assigns map tasks to TaskTrackers.
  if (numMapTasks > 0) {
    nonRunningMapCache = createCache(splits, maxLevel);
  }

  // set the launch time
  this.launchTime = jobtracker.getClock().getTime();

  //
  // Create reduce tasks
  //
  // nonRunningReduces is used later when the JobTracker assigns reduce tasks to TaskTrackers.
  this.reduces = new TaskInProgress[numReduceTasks];
  for (int i = 0; i < numReduceTasks; i++) {
    reduces[i] = new TaskInProgress(jobId, jobFile, numMapTasks, i,
                                    jobtracker, conf, this, numSlotsPerReduce);
    nonRunningReduces.add(reduces[i]);
  }
  .....

  // create cleanup two cleanup tips, one map and one reduce.
  // Create two cleanup TaskInProgress objects, one for the map side and one for
  // the reduce side.
  cleanup = new TaskInProgress[2];

  // cleanup map tip. This map doesn't use any splits. Just assign an empty
  // split.
  TaskSplitMetaInfo emptySplit = JobSplit.EMPTY_TASK_SPLIT;
  cleanup[0] = new TaskInProgress(jobId, jobFile, emptySplit,
          jobtracker, conf, this, numMapTasks, 1);
  cleanup[0].setJobCleanupTask();

  // cleanup reduce tip.
  cleanup[1] = new TaskInProgress(jobId, jobFile, numMapTasks,
                     numReduceTasks, jobtracker, conf, this, 1);
  cleanup[1].setJobCleanupTask();

  // create two setup tips, one map and one reduce.
  // Create two setup TaskInProgress objects, one to set up the map side and one
  // to set up the reduce side.
  setup = new TaskInProgress[2];

  // setup map tip. This map doesn't use any split. Just assign an empty
  // split.
  setup[0] = new TaskInProgress(jobId, jobFile, emptySplit,
          jobtracker, conf, this, numMapTasks + 1, 1);
  setup[0].setJobSetupTask();

  // setup reduce tip.
  setup[1] = new TaskInProgress(jobId, jobFile, numMapTasks,
                     numReduceTasks + 1, jobtracker, conf, this, 1);
  setup[1].setJobSetupTask();
  .....
  // initialization is done
  tasksInited = true;
  .....
}
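As the code above shows, the number of map tasks is simply the number of input splits. For illustration only, the following client-side sketch (with a hypothetical input path /tmp/input) uses the new API's InputFormat.getSplits() to see how many splits, and therefore how many map tasks, an input would produce; it is not part of the JobTracker code.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCountDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf);                                   // Hadoop 1.x style Job construction
    FileInputFormat.addInputPath(job, new Path("/tmp/input")); // hypothetical input path
    // Compute the splits the same way the framework does for this InputFormat;
    // each split becomes one map task (numMapTasks == splits.size()).
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    System.out.println("map tasks that would be created = " + splits.size());
  }
}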
Starting the Hadoop TaskTracker
The org.apache.hadoop.mapred.TaskTracker class implements the TaskTracker role in the MapReduce model. In Hadoop, the TaskTracker also runs in its own JVM, and its main() method is the entry point; when start-all.sh is run, that method is invoked over SSH to start the TaskTracker. TaskTracker.run() in turn calls offerService(), which at a regular interval sends a heartbeat to the JobTracker to report the TaskTracker's current state:

/**
 * Start the TaskTracker, point toward the indicated JobTracker
 */
public static void main(String argv[]) throws Exception {
  StringUtils.startupShutdownMessage(TaskTracker.class, argv, LOG);
  if (argv.length != 0) {
    System.out.println("usage: TaskTracker");
    System.exit(-1);
  }
  try {
    JobConf conf = new JobConf();
    // enable the server to track time spent waiting on locks
    ReflectionUtils.setContentionTracing
      (conf.getBoolean("tasktracker.contention.tracking", false));
    DefaultMetricsSystem.initialize("TaskTracker");
    // create the TaskTracker instance
    TaskTracker tt = new TaskTracker(conf);
    MBeans.register("TaskTracker", "TaskTrackerInfo", tt);
    // start the TaskTracker's service
    tt.run();
  } catch (Throwable e) {
    LOG.error("Can not start task tracker because " +
              StringUtils.stringifyException(e));
    System.exit(-1);
  }
}
/**
 * Main service loop.  Will stay in this loop forever.
 */
State offerService() throws Exception {
  long lastHeartbeat = 0;
  while (running && !shuttingDown) {
    try {
      long now = System.currentTimeMillis();
      long waitTime = heartbeatInterval - (now - lastHeartbeat);
      if (waitTime > 0) {
        // sleep for waitTime, or until a slot is freed up
        synchronized (finishedCount) {
          if (finishedCount.get() == 0) {
            finishedCount.wait(waitTime);
          }
          finishedCount.set(0);
        }
      }
      .....
      // Send the heartbeat and process the jobtracker's directives
      // Send the heartbeat to the JobTracker and get its reply, heartbeatResponse
      HeartbeatResponse heartbeatResponse = transmitHeartBeat(now);
      lastHeartbeat = System.currentTimeMillis();
      .....
      // Get the actions the TaskTracker has to carry out from the HeartbeatResponse
      TaskTrackerAction[] actions = heartbeatResponse.getActions();
      .....
      if (actions != null) {
        for (TaskTrackerAction action : actions) {
          // If the action launches a new task, add it to mapLauncher or reduceLauncher;
          // both are TaskLauncher objects, each holding a List<TaskInProgress> member.
          if (action instanceof LaunchTaskAction) {
            addToTaskQueue((LaunchTaskAction) action);
          } else if (action instanceof CommitTaskAction) {
            CommitTaskAction commitAction = (CommitTaskAction) action;
            if (!commitResponses.contains(commitAction.getTaskID())) {
              LOG.info("Received commit task action for " + commitAction.getTaskID());
              commitResponses.add(commitAction.getTaskID());
            }
          } else {
            tasksToCleanup.put(action);
          }
        }
      }
      .....
  return State.NORMAL;
}
The TaskTracker Sends a Heartbeat to the JobTracker
/**
 * Build and transmit the heart beat to the JobTracker
 * @param now current time
 * @return false if the tracker was unknown
 * @throws IOException
 */
HeartbeatResponse transmitHeartBeat(long now) throws IOException {
  // Send Counters in the status once every COUNTER_UPDATE_INTERVAL
  boolean sendCounters;
  if (now > (previousUpdate + COUNTER_UPDATE_INTERVAL)) {
    sendCounters = true;
    previousUpdate = now;
  } else {
    sendCounters = false;
  }

  //
  // Check if the last heartbeat got through...
  // if so then build the heartbeat information for the JobTracker;
  // else resend the previous status information.
  //
  if (status == null) {
    synchronized (this) {
      status = new TaskTrackerStatus(taskTrackerName, localHostname,
                                     httpPort,
                                     cloneAndResetRunningTaskStatuses(sendCounters),
                                     failures,
                                     maxMapSlots,
                                     maxReduceSlots);
    }
  } else {
    LOG.info("Resending 'status' to '" + jobTrackAddr.getHostName() +
             "' with reponseId '" + heartbeatResponseId);
  }

  //
  // Check if we should ask for a new Task
  //
  boolean askForNewTask;
  long localMinSpaceStart;
  // The TaskTracker asks the JobTracker to assign it a new task when
  // 1) the number of map tasks it is currently running is below its maximum, or
  // 2) the number of reduce tasks it is currently running is below its maximum.
  synchronized (this) {
    askForNewTask =
      ((status.countOccupiedMapSlots() < maxMapSlots ||
        status.countOccupiedReduceSlots() < maxReduceSlots) &&
       acceptNewTasks);
    localMinSpaceStart = minSpaceStart;
  }
  .....
  // Send the heartbeat to the JobTracker: an RPC call that invokes JobTracker.heartbeat()
  HeartbeatResponse heartbeatResponse = jobClient.heartbeat(status,
                                                            justStarted,
                                                            justInited,
                                                            askForNewTask,
                                                            heartbeatResponseId);
  .....
  return heartbeatResponse;
}
The JobTracker Assigns Tasks to the TaskTracker
At a regular interval the TaskTracker reports its current state to the JobTracker through the RPC call JobTracker.heartbeat(). Based on the TaskTracker's state, the JobTracker decides which actions to return for the TaskTracker to execute:
/**
 * The periodic heartbeat mechanism between the {@link TaskTracker} and
 * the {@link JobTracker}.
 *
 * The {@link JobTracker} processes the status information sent by the
 * {@link TaskTracker} and responds with instructions to start/stop
 * tasks or jobs, and also 'reset' instructions during contingencies.
 */
public synchronized HeartbeatResponse heartbeat(TaskTrackerStatus status,
                                                boolean restarted,
                                                boolean initialContact,
                                                boolean acceptNewTasks,
                                                short responseId) throws IOException {
  .....
  // First check if the last heartbeat response got through
  String trackerName = status.getTrackerName();
  .....
  HeartbeatResponse prevHeartbeatResponse =
    trackerToHeartbeatResponseMap.get(trackerName);
  boolean addRestartInfo = false;
  .....
  // Initialize the response to be sent for the heartbeat
  HeartbeatResponse response = new HeartbeatResponse(newResponseId, null);
  List<TaskTrackerAction> actions = new ArrayList<TaskTrackerAction>();
  boolean isBlacklisted = faultyTrackers.isBlacklisted(status.getHost());
  // Check for new tasks to be executed on the tasktracker
  if (recoveryManager.shouldSchedule() && acceptNewTasks && !isBlacklisted) {
    TaskTrackerStatus taskTrackerStatus = getTaskTrackerStatus(trackerName);
    if (taskTrackerStatus == null) {
      LOG.warn("Unknown task tracker polling; ignoring: " + trackerName);
    } else {
      List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus);
      if (tasks == null) {
        // Call the task scheduler's assignTasks() to assign tasks to this TaskTracker
        tasks = taskScheduler.assignTasks(taskTrackers.get(trackerName));
      }
      if (tasks != null) {
        for (Task task : tasks) {
          // Add the tasks assigned to this TaskTracker to the actions list
          expireLaunchingTasks.addNewTask(task.getTaskID());
          if (LOG.isDebugEnabled()) {
            LOG.debug(trackerName + " -> LaunchTask: " + task.getTaskID());
          }
          actions.add(new LaunchTaskAction(task));
        }
      }
    }
  }
  .....
  // calculate next heartbeat interval and put in heartbeat response
  int nextInterval = getNextHeartbeatInterval();
  response.setHeartbeatInterval(nextInterval);
  response.setActions(actions.toArray(new TaskTrackerAction[actions.size()]));
  .....
  return response;
}
The Task Scheduler Assigns Tasks: TaskScheduler.assignTasks()
Three task schedulers are commonly used in Hadoop: JobQueueTaskScheduler, FairScheduler and CapacityScheduler.
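Which scheduler is used is controlled by the mapred.jobtracker.taskScheduler property mentioned above. Here is a minimal sketch of selecting the default scheduler explicitly; the fair and capacity schedulers ship as separate contrib jars and are selected the same way, but their exact class names depend on the Hadoop build, so treat that part as an assumption.

import org.apache.hadoop.mapred.JobConf;

public class SchedulerConfigDemo {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // The JobTracker constructor reads this property and instantiates the named
    // class reflectively; JobQueueTaskScheduler is the default when it is not set.
    conf.set("mapred.jobtracker.taskScheduler",
             "org.apache.hadoop.mapred.JobQueueTaskScheduler");
    System.out.println(conf.get("mapred.jobtracker.taskScheduler"));
  }
}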
Let's look at how the default scheduler, JobQueueTaskScheduler, assigns tasks:
@Override
public synchronized List<Task> assignTasks(TaskTracker taskTracker)
    throws IOException {
  TaskTrackerStatus taskTrackerStatus = taskTracker.getStatus();
  ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
  final int numTaskTrackers = clusterStatus.getTaskTrackers();
  final int clusterMapCapacity = clusterStatus.getMaxMapTasks();
  final int clusterReduceCapacity = clusterStatus.getMaxReduceTasks();

  Collection<JobInProgress> jobQueue =
    jobQueueJobInProgressListener.getJobQueue();

  //
  // Get map + reduce counts for the current tracker.
  //
  final int trackerMapCapacity = taskTrackerStatus.getMaxMapSlots();
  final int trackerReduceCapacity = taskTrackerStatus.getMaxReduceSlots();
  final int trackerRunningMaps = taskTrackerStatus.countMapTasks();
  final int trackerRunningReduces = taskTrackerStatus.countReduceTasks();

  // Assigned tasks
  List<Task> assignedTasks = new ArrayList<Task>();

  //
  // Compute (running + pending) map and reduce task numbers across pool
  //
  int remainingReduceLoad = 0;
  int remainingMapLoad = 0;
  synchronized (jobQueue) {
    for (JobInProgress job : jobQueue) {
      if (job.getStatus().getRunState() == JobStatus.RUNNING) {
        remainingMapLoad += (job.desiredMaps() - job.finishedMaps());
        if (job.scheduleReduces()) {
          remainingReduceLoad += (job.desiredReduces() - job.finishedReduces());
        }
      }
    }
  }

  // Compute the 'load factor' for maps and reduces,
  // i.e. the remaining cluster-wide load relative to the cluster capacity
  double mapLoadFactor = 0.0;
  if (clusterMapCapacity > 0) {
    mapLoadFactor = (double)remainingMapLoad / clusterMapCapacity;
  }
  double reduceLoadFactor = 0.0;
  if (clusterReduceCapacity > 0) {
    reduceLoadFactor = (double)remainingReduceLoad / clusterReduceCapacity;
  }

  // Map tasks are assigned first, then reduce tasks.
  //
  // In the below steps, we allocate first map tasks (if appropriate),
  // and then reduce tasks if appropriate. We go through all jobs
  // in order of job arrival; jobs only get serviced if their
  // predecessors are serviced, too.
  //

  //
  // We assign tasks to the current taskTracker if the given machine
  // has a workload that's less than the maximum load of that kind of
  // task.
  // However, if the cluster is close to getting loaded i.e. we don't
  // have enough _padding_ for speculative executions etc., we only
  // schedule the "highest priority" task i.e. the task from the job
  // with the highest priority.
  //

  final int trackerCurrentMapCapacity =
    Math.min((int)Math.ceil(mapLoadFactor * trackerMapCapacity),
             trackerMapCapacity);
  int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;
  boolean exceededMapPadding = false;
  if (availableMapSlots > 0) {
    exceededMapPadding = exceededPadding(true, clusterStatus, trackerMapCapacity);
  }

  int numLocalMaps = 0;
  int numNonLocalMaps = 0;
  scheduleMaps:
  for (int i = 0; i < availableMapSlots; ++i) {
    synchronized (jobQueue) {
      for (JobInProgress job : jobQueue) {
        if (job.getStatus().getRunState() != JobStatus.RUNNING) {
          continue;
        }

        Task t = null;

        // Try to schedule a node-local or rack-local Map task
        t = job.obtainNewNodeOrRackLocalMapTask(taskTrackerStatus,
                    numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts());
        if (t != null) {
          assignedTasks.add(t);
          ++numLocalMaps;

          // Don't assign map tasks to the hilt!
          // Leave some free slots in the cluster for future task-failures,
          // speculative tasks etc. beyond the highest priority job
          if (exceededMapPadding) {
            break scheduleMaps;
          }

          // Try all jobs again for the next Map task
          break;
        }

        // Try to schedule a non-local Map task
        t = job.obtainNewNonLocalMapTask(taskTrackerStatus,
                    numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts());
        if (t != null) {
          assignedTasks.add(t);
          ++numNonLocalMaps;

          // We assign at most 1 off-switch or speculative task
          // This is to prevent TaskTrackers from stealing local-tasks
          // from other TaskTrackers.
          break scheduleMaps;
        }
      }
    }
  }
  int assignedMaps = assignedTasks.size();

  //
  // Same thing, but for reduce tasks
  // However we _never_ assign more than 1 reduce task per heartbeat
  //
  final int trackerCurrentReduceCapacity =
    Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity),
             trackerReduceCapacity);
  final int availableReduceSlots =
    Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
  boolean exceededReducePadding = false;
  if (availableReduceSlots > 0) {
    exceededReducePadding = exceededPadding(false, clusterStatus,
                                            trackerReduceCapacity);
    synchronized (jobQueue) {
      for (JobInProgress job : jobQueue) {
        if (job.getStatus().getRunState() != JobStatus.RUNNING ||
            job.numReduceTasks == 0) {
          continue;
        }

        Task t = job.obtainNewReduceTask(taskTrackerStatus, numTaskTrackers,
                    taskTrackerManager.getNumberOfUniqueHosts());
        if (t != null) {
          assignedTasks.add(t);
          break;
        }

        // Don't assign reduce tasks to the hilt!
        // Leave some free slots in the cluster for future task-failures,
        // speculative tasks etc. beyond the highest priority job
        if (exceededReducePadding) {
          break;
        }
      }
    }
  }
  .....
  return assignedTasks;
}
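To make the load-factor arithmetic concrete, here is a small made-up example: suppose clusterMapCapacity = 100 and the running jobs still need remainingMapLoad = 50 maps, so mapLoadFactor = 50 / 100 = 0.5. A TaskTracker with trackerMapCapacity = 4 map slots then gets trackerCurrentMapCapacity = min(ceil(0.5 * 4), 4) = 2; if it is already running trackerRunningMaps = 1 map, it has availableMapSlots = 2 - 1 = 1, so at most one map task can be assigned to it in this heartbeat (and, regardless of slots, at most one reduce task).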
From the code above we can see that:
1) JobInProgress.obtainNewNodeOrRackLocalMapTask() and obtainNewNonLocalMapTask() are used to assign map tasks. They call findNewMapTask(), which searches from near to far, one level at a time: first the same node (node-local), then nodes in the same rack (rack-local), then nodes in the same data center (off-switch), up to maxLevel levels. Starting from the Node where the TaskTracker runs, it looks up candidate TaskInProgress objects in nonRunningMapCache.
2) JobInProgress.obtainNewReduceTask() is used to assign reduce tasks. It calls findNewReduceTask(), which picks a TaskInProgress from nonRunningReduces.
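The cache lookup in 1) is essentially "walk up the topology tree from the TaskTracker's node and take the first pending task found at each level". The following self-contained sketch uses made-up Node and cache types (not the real Hadoop classes) purely to illustrate that walk:

import java.util.*;

public class LocalityLookupDemo {
  // Minimal stand-in for Hadoop's topology Node: host -> rack -> data center.
  static class Node {
    final String name;
    final Node parent;
    Node(String name, Node parent) { this.name = name; this.parent = parent; }
  }

  // Walk up from the TaskTracker's node for at most maxLevel levels and return
  // the first pending task cached at that level (node-local, then rack-local, ...).
  static String pickTask(Node trackerNode, int maxLevel,
                         Map<Node, List<String>> nonRunningCache) {
    Node key = trackerNode;
    for (int level = 0; level < maxLevel && key != null; ++level) {
      List<String> tasks = nonRunningCache.get(key);
      if (tasks != null && !tasks.isEmpty()) {
        System.out.println("hit at level " + level + " (" + key.name + ")");
        return tasks.remove(0);
      }
      key = key.parent;
    }
    return null; // nothing close enough; the real code falls back to non-local tasks
  }

  public static void main(String[] args) {
    Node dc = new Node("/dc1", null);
    Node rack = new Node("/dc1/rack1", dc);
    Node host1 = new Node("/dc1/rack1/host1", rack);
    Node host2 = new Node("/dc1/rack1/host2", rack);

    // map task "m0" has its split on host2, so it is cached under host2 and its ancestors
    Map<Node, List<String>> cache = new HashMap<Node, List<String>>();
    cache.put(host2, new ArrayList<String>(Arrays.asList("m0")));
    cache.put(rack, new ArrayList<String>(Arrays.asList("m0")));
    cache.put(dc, new ArrayList<String>(Arrays.asList("m0")));

    // a heartbeat from host1 finds no node-local task but does find a rack-local one
    System.out.println("assigned: " + pickTask(host1, 3, cache));
  }
}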
The findNewMapTask() and findNewReduceTask() Methods
(I haven't fully worked this part out yet; I'll come back to it later!)

/**
 * Find new map task
 * @param tts The task tracker that is asking for a task
 * @param clusterSize The number of task trackers in the cluster
 * @param numUniqueHosts The number of hosts that run task trackers
 * @param avgProgress The average progress of this kind of task in this job
 * @param maxCacheLevel The maximum topology level until which to schedule
 *                      maps.
 *                      A value of {@link #anyCacheLevel} implies any
 *                      available task (node-local, rack-local, off-switch and
 *                      speculative tasks).
 *                      A value of {@link #NON_LOCAL_CACHE_LEVEL} implies only
 *                      off-switch/speculative tasks should be scheduled.
 * @return the index in tasks of the selected task (or -1 for no task)
 */
private synchronized int findNewMapTask(final TaskTrackerStatus tts,
                                        final int clusterSize,
                                        final int numUniqueHosts,
                                        final int maxCacheLevel,
                                        final double avgProgress) {
  .....
  String taskTracker = tts.getTrackerName();
  TaskInProgress tip = null;
  .....
  // When scheduling a map task:
  //  0) Schedule a failed task without considering locality
  //  1) Schedule non-running tasks
  //  2) Schedule speculative tasks
  //  3) Schedule tasks with no location information

  // First a look up is done on the non-running cache and on a miss, a look
  // up is done on the running cache. The order for lookup within the cache:
  //   1. from local node to root [bottom up]
  //   2. breadth wise for all the parent nodes at max level
  // We fall to linear scan of the list ((3) above) if we have misses in the
  // above caches

  // 0) Schedule the task with the most failures, unless failure was on this machine
  tip = findTaskFromList(failedMaps, tts, numUniqueHosts, false);
  if (tip != null) {
    // Add to the running list
    scheduleMap(tip);
    LOG.info("Choosing a failed task " + tip.getTIPId());
    return tip.getIdWithinJob();
  }

  Node node = jobtracker.getNode(tts.getHost());

  //
  // 1) Non-running TIP :
  //

  // 1. check from local node to the root [bottom up cache lookup]
  //    i.e if the cache is available and the host has been resolved
  //    (node!=null)
  if (node != null) {
    Node key = node;
    int level = 0;
    // maxCacheLevel might be greater than this.maxLevel if findNewMapTask is
    // called to schedule any task (local, rack-local, off-switch or
    // speculative) tasks or it might be NON_LOCAL_CACHE_LEVEL (i.e. -1) if
    // findNewMapTask is to only schedule off-switch/speculative tasks
    int maxLevelToSchedule = Math.min(maxCacheLevel, maxLevel);
    for (level = 0; level < maxLevelToSchedule; ++level) {
      List<TaskInProgress> cacheForLevel = nonRunningMapCache.get(key);
      if (cacheForLevel != null) {
        tip = findTaskFromList(cacheForLevel, tts, numUniqueHosts, level == 0);
        if (tip != null) {
          // Add to running cache
          scheduleMap(tip);
          // remove the cache if its empty
          if (cacheForLevel.size() == 0) {
            nonRunningMapCache.remove(key);
          }
          return tip.getIdWithinJob();
        }
      }
      key = key.getParent();
    }

    // Check if we need to only schedule a local task (node-local/rack-local)
    if (level == maxCacheLevel) {
      return -1;
    }
  }

  // 2. Search breadth-wise across parents at max level for non-running
  //    TIP if
  //     - cache exists and there is a cache miss
  //     - node information for the tracker is missing (tracker's topology
  //       info not obtained yet)

  // collection of node at max level in the cache structure
  Collection<Node> nodesAtMaxLevel = jobtracker.getNodesAtMaxLevel();

  // get the node parent at max level
  Node nodeParentAtMaxLevel =
    (node == null) ? null : JobTracker.getParentNode(node, maxLevel - 1);

  for (Node parent : nodesAtMaxLevel) {
    // skip the parent that has already been scanned
    if (parent == nodeParentAtMaxLevel) {
      continue;
    }
    List<TaskInProgress> cache = nonRunningMapCache.get(parent);
    if (cache != null) {
      tip = findTaskFromList(cache, tts, numUniqueHosts, false);
      if (tip != null) {
        // Add to the running cache
        scheduleMap(tip);
        // remove the cache if empty
        if (cache.size() == 0) {
          nonRunningMapCache.remove(parent);
        }
        LOG.info("Choosing a non-local task " + tip.getTIPId());
        return tip.getIdWithinJob();
      }
    }
  }

  // 3. Search non-local tips for a new task
  tip = findTaskFromList(nonLocalMaps, tts, numUniqueHosts, false);
  if (tip != null) {
    // Add to the running list
    scheduleMap(tip);
    LOG.info("Choosing a non-local task " + tip.getTIPId());
    return tip.getIdWithinJob();
  }

  //
  // 2) Running TIP :
  //
  if (hasSpeculativeMaps) {
    long currentTime = jobtracker.getClock().getTime();

    // 1. Check bottom up for speculative tasks from the running cache
    if (node != null) {
      Node key = node;
      for (int level = 0; level < maxLevel; ++level) {
        Set<TaskInProgress> cacheForLevel = runningMapCache.get(key);
        if (cacheForLevel != null) {
          tip = findSpeculativeTask(cacheForLevel, tts,
                                    avgProgress, currentTime, level == 0);
          if (tip != null) {
            if (cacheForLevel.size() == 0) {
              runningMapCache.remove(key);
            }
            return tip.getIdWithinJob();
          }
        }
        key = key.getParent();
      }
    }

    // 2. Check breadth-wise for speculative tasks
    for (Node parent : nodesAtMaxLevel) {
      // ignore the parent which is already scanned
      if (parent == nodeParentAtMaxLevel) {
        continue;
      }
      Set<TaskInProgress> cache = runningMapCache.get(parent);
      if (cache != null) {
        tip = findSpeculativeTask(cache, tts, avgProgress, currentTime, false);
        if (tip != null) {
          // remove empty cache entries
          if (cache.size() == 0) {
            runningMapCache.remove(parent);
          }
          LOG.info("Choosing a non-local task " + tip.getTIPId()
                   + " for speculation");
          return tip.getIdWithinJob();
        }
      }
    }

    // 3. Check non-local tips for speculation
    tip = findSpeculativeTask(nonLocalRunningMaps, tts, avgProgress,
                              currentTime, false);
    if (tip != null) {
      LOG.info("Choosing a non-local task " + tip.getTIPId()
               + " for speculation");
      return tip.getIdWithinJob();
    }
  }
  return -1;
}
The TaskTracker Receives the JobTracker's HeartbeatResponse
After the TaskTracker sends a heartbeat to the JobTracker, if the heartbeatResponse returned by the JobTracker contains assigned tasks (LaunchTaskAction objects), the TaskTracker calls addToTaskQueue to add them to the tasksToLaunch queue of its mapLauncher or reduceLauncher object. Both mapLauncher and reduceLauncher are instances of TaskLauncher, an inner class of TaskTracker whose main data member is a queue of TaskTracker.TaskInProgress objects. Note that the TaskInProgress class referred to inside TaskTracker is TaskTracker's own inner class and must not be confused with the TaskInProgress class in the mapred package; here we write it as TaskTracker.TaskInProgress. If a task contained in the heartbeatResponse is a map task it is put into mapLauncher's tasksToLaunch queue; if it is a reduce task it goes into reduceLauncher's queue.
Let's look at TaskTracker's addToTaskQueue method:

private void addToTaskQueue(LaunchTaskAction action) {
  if (action.getTask().isMapTask()) {
    // delegate to TaskLauncher's addToTaskQueue method
    mapLauncher.addToTaskQueue(action);
  } else {
    reduceLauncher.addToTaskQueue(action);
  }
}

TaskLauncher's addToTaskQueue method:

public void addToTaskQueue(LaunchTaskAction action) {
  synchronized (tasksToLaunch) {
    TaskInProgress tip = registerTask(action, this);
    tasksToLaunch.add(tip);
    tasksToLaunch.notifyAll();
  }
}

TaskTracker's registerTask method:

private TaskInProgress registerTask(LaunchTaskAction action, TaskLauncher launcher) {
  Task t = action.getTask();
  LOG.info("LaunchTaskAction (registerTask): " + t.getTaskID() +
           " task's state:" + t.getState());
  TaskInProgress tip = new TaskInProgress(t, this.fConf, launcher);
  synchronized (this) {
    // record the task so the rest of the TaskTracker can see it
    tasks.put(t.getTaskID(), tip);
    runningTasks.put(t.getTaskID(), tip);
    boolean isMap = t.isMapTask();
    if (isMap) {
      mapTotal++;
    } else {
      reduceTotal++;
    }
  }
  return tip;
}

TaskLauncher is a subclass of Thread, and the two instances (mapLauncher and reduceLauncher) each run as an independent thread.
Both launcher threads are started during TaskTracker initialization. TaskLauncher's run() method keeps watching the tasksToLaunch queue for newly added TaskTracker.TaskInProgress objects. When one appears, it takes it off the queue and calls TaskTracker.startNewTask(TaskInProgress tip) to launch the task; startNewTask mainly calls localizeJob(TaskInProgress tip), whose job is to localize the task.
TaskLauncher's run() method:

public void run() {
  while (!Thread.interrupted()) {
    try {
      TaskInProgress tip;
      Task task;
      synchronized (tasksToLaunch) {
        // keep watching the tasksToLaunch queue
        while (tasksToLaunch.isEmpty()) {
          tasksToLaunch.wait();
        }
        // get the TIP
        tip = tasksToLaunch.remove(0);
        task = tip.getTask();
        LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
                 " which needs " + task.getNumSlotsRequired() + " slots");
      }
      // wait for free slots to run
      .....
      // got a free slot. launch the task
      startNewTask(tip);
      .....
  }
}

In startNewTask(), a launchThread is created for each task. TaskTracker's startNewTask() method:

void startNewTask(final TaskInProgress tip) throws InterruptedException {
  Thread launchThread = new Thread(new Runnable() {
    @Override
    public void run() {
      try {
        RunningJob rjob = localizeJob(tip);
        tip.getTask().setJobFile(rjob.getLocalizedJobConf().toString());
        // Localization is done. Neither rjob.jobConf nor rjob.ugi can be null
        launchTaskForJob(tip, new JobConf(rjob.getJobConf()), rjob);
      } catch (Throwable e) {
        .....
      }
  });
  launchThread.start();
}

Task localization: TaskTracker's localizeJob() method does its work by calling TaskTracker.initializeJob(), which copies the files related to the task from HDFS to the TaskTracker's local file system: job.split, job.xml and job.jar. Once the task has been localized, launchTaskForJob() is called to start the task, and it in turn calls the launchTask() method of TaskTracker.TaskInProgress.

public synchronized void launchTask(RunningJob rjob) throws IOException {
  if (this.taskStatus.getRunState() == TaskStatus.State.UNASSIGNED ||
      this.taskStatus.getRunState() == TaskStatus.State.FAILED_UNCLEAN ||
      this.taskStatus.getRunState() == TaskStatus.State.KILLED_UNCLEAN) {
    // localizeTask(task);
    if (this.taskStatus.getRunState() == TaskStatus.State.UNASSIGNED) {
      this.taskStatus.setRunState(TaskStatus.State.RUNNING);
    }
    // Create a TaskRunner object. TaskRunner is an abstract class that extends Thread;
    // at run time it is actually a MapTaskRunner or a ReduceTaskRunner.
    setTaskRunner(task.createRunner(TaskTracker.this, this, rjob));
    // Call the run() method of MapTaskRunner or ReduceTaskRunner
    this.runner.start();
    .....
}

TaskRunner's run() method; what comes next is the creation of the child JVM:

@Override
public final void run() {
  // Prepare the parameters (environment) needed at run time,
  // including the classpath, the workdir, the child JVM args, etc.
  .....
  // Start a child process: create a new JVM to execute the new program (a map or a reduce)
  launchJvmAndWait(setupCmds, vargs, stdout, stderr, logSize, workDir);
  .....
}

void launchJvmAndWait(List<String> setup, Vector<String> vargs, File stdout,
                      File stderr, long logSize, File workDir)
    throws InterruptedException, IOException {
  jvmManager.launchJvm(this,
      jvmManager.constructJvmEnv(setup, vargs, stdout, stderr, logSize,
                                 workDir, conf));
  synchronized (lock) {
    while (!done) {
      lock.wait();
    }
  }
JvmManager.launchJvm() -----> JvmManagerForType.reapJvm() -----> JvmManagerForType.spawnNewJvm() -----> JvmManagerForType.JvmRunner.run() -----> JvmManagerForType.JvmRunner.runChild() -----> TaskTracker.getTaskController().launchTask() -----> ..... -----> Child.main()
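The end of this chain is a brand-new java process running Child.main(). The real JvmManager/TaskController code is fairly involved; the sketch below only illustrates the general idea of assembling a command line and forking a separate child JVM, with a made-up child main class and arguments (it is not the actual Hadoop implementation):

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ChildJvmDemo {
  public static void main(String[] args) throws Exception {
    // Assemble a java command line roughly the way JvmManager assembles vargs:
    // JVM options, a classpath, the child's main class and its arguments.
    List<String> cmd = new ArrayList<String>();
    cmd.add(new File(System.getProperty("java.home"), "bin/java").getAbsolutePath());
    cmd.add("-Xmx200m");                         // child heap size (illustrative only)
    cmd.add("-cp");
    cmd.add(System.getProperty("java.class.path"));
    cmd.add("org.example.FakeChild");            // hypothetical child main class (does not exist)
    cmd.add("127.0.0.1"); cmd.add("0");          // hypothetical umbilical host/port arguments

    ProcessBuilder pb = new ProcessBuilder(cmd);
    pb.directory(new File("."));                 // the task's working directory
    pb.redirectErrorStream(true);                // merge the child's stdout and stderr
    Process child = pb.start();
    int exitCode = child.waitFor();              // the parent waits, much like launchJvmAndWait
    System.out.println("child JVM exited with " + exitCode);
  }
}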
Child.main() keeps fetching tasks from the TaskTracker over the umbilical protocol and runs them:

public static void main(String[] args) throws Throwable {
  .....
  try {
    while (true) {
      taskid = null;
      currentJobSegmented = true;
      // fetch a JvmTask object from the TaskTracker over the network
      JvmTask myTask = umbilical.getTask(context);
      .....
      // Create a final reference to the task for the doAs block
      final Task taskFinal = task;
      childUGI.doAs(new PrivilegedExceptionAction<Object>() {
        @Override
        public Object run() throws Exception {
          try {
            // use job-specified working directory
            FileSystem.get(job).setWorkingDirectory(job.getWorkingDirectory());
            // call the run() method of the MapTask or ReduceTask object
            taskFinal.run(job, umbilical);  // run the task
          } finally {
            .....
          }
          return null;
        }
      });
      .....
}

MapTask.run() method:

@Override
public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, ClassNotFoundException, InterruptedException {
  // used to communicate with the TaskTracker
  this.umbilical = umbilical;
  // Create a TaskReporter thread that reports the task's progress to the TaskTracker
  TaskReporter reporter = new TaskReporter(getProgress(), umbilical);
  reporter.startCommunicationThread();
  boolean useNewApi = job.getUseNewMapper();
  // Initialize the task, mainly output-related setup such as creating the
  // committer and setting the working directory.
  initialize(job, getJobID(), reporter, useNewApi);
  // Check the task type; if it is none of the following three kinds,
  // run the real map task.
  if (jobCleanup) {
    runJobCleanupTask(umbilical, reporter);
    return;
  }
  if (jobSetup) {
    runJobSetupTask(umbilical, reporter);
    return;
  }
  if (taskCleanup) {
    runTaskCleanupTask(umbilical, reporter);
    return;
  }
  if (useNewApi) {
    runNewMapper(job, splitMetaInfo, umbilical, reporter);
  } else {
    runOldMapper(job, splitMetaInfo, umbilical, reporter);
  }
  done(umbilical, reporter);
}
Below we take the new MapReduce API as an example. MapTask.runNewMapper():

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewMapper(final JobConf job,
                  final TaskSplitIndex splitIndex,
                  final TaskUmbilicalProtocol umbilical,
                  TaskReporter reporter
                  ) throws IOException, ClassNotFoundException,
                           InterruptedException {
  // make a task context so we can get the classes
  // TaskAttemptContext extends JobContext and adds task-related information.
  // Through taskContext we can obtain many classes related to task execution,
  // such as the user-defined Mapper class, the InputFormat class, and so on.
  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
    new org.apache.hadoop.mapreduce.TaskAttemptContext(job, getTaskID());
  // create the user-defined Mapper object
  org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
    (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
      ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
  // make the input format
  org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
    (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
      ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
  // rebuild the input split
  org.apache.hadoop.mapreduce.InputSplit split = null;
  split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
                          splitIndex.getStartOffset());
  // Create the RecordReader from the InputFormat object; the default is
  // org.apache.hadoop.mapreduce.lib.input.LineRecordReader.
  org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
    new NewTrackingRecordReader<INKEY,INVALUE>
      (split, inputFormat, reporter, job, taskContext);

  job.setBoolean("mapred.skip.on", isSkipping());
  org.apache.hadoop.mapreduce.RecordWriter output = null;
  org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
       mapperContext = null;
  try {
    Constructor<org.apache.hadoop.mapreduce.Mapper.Context> contextConstructor =
      org.apache.hadoop.mapreduce.Mapper.Context.class.getConstructor
        (new Class[]{org.apache.hadoop.mapreduce.Mapper.class,
                     Configuration.class,
                     org.apache.hadoop.mapreduce.TaskAttemptID.class,
                     org.apache.hadoop.mapreduce.RecordReader.class,
                     org.apache.hadoop.mapreduce.RecordWriter.class,
                     org.apache.hadoop.mapreduce.OutputCommitter.class,
                     org.apache.hadoop.mapreduce.StatusReporter.class,
                     org.apache.hadoop.mapreduce.InputSplit.class});

    // get an output object
    if (job.getNumReduceTasks() == 0) {
      output = new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

    mapperContext = contextConstructor.newInstance(mapper, job, getTaskID(),
                                                   input, output, committer,
                                                   reporter, split);
    // Initialize the input; by default this calls LineRecordReader's initialize()
    // method, which mainly opens the input file and positions the reader at the
    // start of its split.
    input.initialize(split, mapperContext);
    // run the run()/map() methods of the user-defined Mapper
    mapper.run(mapperContext);
    input.close();
    output.close(mapperContext);
  } catch (NoSuchMethodException e) {
    throw new IOException("Can't find Context constructor", e);
  } catch (InstantiationException e) {
    throw new IOException("Can't create Context", e);
  } catch (InvocationTargetException e) {
    throw new IOException("Can't invoke Context constructor", e);
  } catch (IllegalAccessException e) {
    throw new IOException("Can't invoke Context constructor", e);
  }
}

Mapper.run() method:

/**
 * Called once at the beginning of the task.
 */
protected void setup(Context context) throws IOException, InterruptedException {
  // NOTHING
}

/**
 * Called once for each key/value pair in the input split. Most applications
 * should override this, but the default is the identity function.
 */
@SuppressWarnings("unchecked")
protected void map(KEYIN key, VALUEIN value,
                   Context context) throws IOException, InterruptedException {
  context.write((KEYOUT) key, (VALUEOUT) value);
}

/**
 * Called once at the end of the task.
 */
protected void cleanup(Context context) throws IOException, InterruptedException {
  // NOTHING
}

/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    // the map() function of the user-defined Mapper
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}

Mapper.run() first calls setup(), which in the base Mapper class does nothing at all.
Users can override setup() to do extra work before the map function is executed. The run() method then keeps fetching key/value pairs and hands them to the map function, which performs whatever processing the user wants. Finally, run() calls cleanup(). Like setup(), it is a Mapper method that does nothing by default, and users can override it to do some finishing work. At this point the map side's work is done.
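As a concrete illustration of overriding setup(), map() and cleanup(), here is a small word-count style Mapper written against the new API; this is ordinary user code, not part of the framework:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();
  private long lines = 0;

  @Override
  protected void setup(Context context) {
    // Called once before any map() call; a good place to read configuration.
    lines = 0;
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Called once per key/value pair produced by the RecordReader
    // (with TextInputFormat: byte offset -> line of text).
    lines++;
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }

  @Override
  protected void cleanup(Context context) {
    // Called once after the last map() call; useful for flushing state or logging.
    System.out.println("processed " + lines + " input lines");
  }
}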
How a ReduceTask Runs
ReduceTask.run() method:

@Override
@SuppressWarnings("unchecked")
public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, InterruptedException, ClassNotFoundException {
  this.umbilical = umbilical;
  job.setBoolean("mapred.skip.on", isSkipping());

  // a reduce task goes through three phases: copy, sort and reduce
  if (isMapOrReduce()) {
    copyPhase = getProgress().addPhase("copy");
    sortPhase = getProgress().addPhase("sort");
    reducePhase = getProgress().addPhase("reduce");
  }
  // start thread that will handle communication with parent
  TaskReporter reporter = new TaskReporter(getProgress(), umbilical);
  reporter.startCommunicationThread();
  boolean useNewApi = job.getUseNewReducer();
  initialize(job, getJobID(), reporter, useNewApi);

  // check if it is a cleanupJobTask
  if (jobCleanup) {
    runJobCleanupTask(umbilical, reporter);
    return;
  }
  if (jobSetup) {
    runJobSetupTask(umbilical, reporter);
    return;
  }
  if (taskCleanup) {
    runTaskCleanupTask(umbilical, reporter);
    return;
  }

  // Initialize the codec
  codec = initCodec();

  boolean isLocal = "local".equals(job.get("mapred.job.tracker", "local"));
  if (!isLocal) {
    // The ReduceCopier is responsible for copying the mappers' output
    // to the machine where the reducer runs
    reduceCopier = new ReduceCopier(umbilical, job, reporter);
    if (!reduceCopier.fetchOutputs()) {
      if (reduceCopier.mergeThrowable instanceof FSError) {
        throw (FSError) reduceCopier.mergeThrowable;
      }
      throw new IOException("Task: " + getTaskID() +
          " - The reduce copier failed", reduceCopier.mergeThrowable);
    }
  }
  copyPhase.complete();                         // copy is already complete
  setPhase(TaskStatus.Phase.SORT);
  statusUpdate(umbilical);

  final FileSystem rfs = FileSystem.getLocal(job).getRaw();
  // choose the merge/sort path depending on whether the JobTracker is local
  RawKeyValueIterator rIter = isLocal
    ? Merger.merge(job, rfs, job.getMapOutputKeyClass(),
        job.getMapOutputValueClass(), codec, getMapFiles(rfs, true),
        !conf.getKeepFailedTaskFiles(), job.getInt("io.sort.factor", 100),
        new Path(getTaskID().toString()), job.getOutputKeyComparator(),
        reporter, spilledRecordsCounter, null)
    : reduceCopier.createKVIterator(job, rfs, reporter);

  // free up the data structures
  mapOutputFilesOnDisk.clear();

  sortPhase.complete();                         // sort is complete
  setPhase(TaskStatus.Phase.REDUCE);
  statusUpdate(umbilical);
  Class keyClass = job.getMapOutputKeyClass();
  Class valueClass = job.getMapOutputValueClass();
  RawComparator comparator = job.getOutputValueGroupingComparator();

  if (useNewApi) {
    runNewReducer(job, umbilical, reporter, rIter, comparator,
                  keyClass, valueClass);
  } else {
    runOldReducer(job, umbilical, reporter, rIter, comparator,
                  keyClass, valueClass);
  }
  done(umbilical, reporter);
}
ReduceTask.runNewReducer():

@SuppressWarnings("unchecked")
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewReducer(JobConf job,
                   final TaskUmbilicalProtocol umbilical,
                   final TaskReporter reporter,
                   RawKeyValueIterator rIter,
                   RawComparator<INKEY> comparator,
                   Class<INKEY> keyClass,
                   Class<INVALUE> valueClass
                   ) throws IOException, InterruptedException,
                            ClassNotFoundException {
  // wrap value iterator to report progress.
  final RawKeyValueIterator rawIter = rIter;
  rIter = new RawKeyValueIterator() {
    public void close() throws IOException {
      rawIter.close();
    }
    public DataInputBuffer getKey() throws IOException {
      return rawIter.getKey();
    }
    public Progress getProgress() {
      return rawIter.getProgress();
    }
    public DataInputBuffer getValue() throws IOException {
      return rawIter.getValue();
    }
    public boolean next() throws IOException {
      boolean ret = rawIter.next();
      reducePhase.set(rawIter.getProgress().get());
      reporter.progress();
      return ret;
    }
  };
  // make a task context so we can get the classes
  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
    new org.apache.hadoop.mapreduce.TaskAttemptContext(job, getTaskID());
  // make a reducer
  org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
    (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
      ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
  org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
    new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(reduceOutputCounter,
      job, reporter, taskContext);
  job.setBoolean("mapred.skip.on", isSkipping());
  org.apache.hadoop.mapreduce.Reducer.Context reducerContext =
    createReduceContext(reducer, job, getTaskID(), rIter,
                        reduceInputKeyCounter, reduceInputValueCounter,
                        trackedRW, committer, reporter, comparator,
                        keyClass, valueClass);
  reducer.run(reducerContext);
  trackedRW.close(reducerContext);
}
This works just like Mapper.run(). Reducer.run() also calls setup() first; by default it does nothing, but users can override it to do some initialization before the reduce function runs.

/**
 * Called once at the start of the task.
 */
protected void setup(Context context) throws IOException, InterruptedException {
  // NOTHING
}

/**
 * This method is called once for each key. Most applications will define
 * their reduce class by overriding this method. The default implementation
 * is an identity function.
 */
@SuppressWarnings("unchecked")
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                      ) throws IOException, InterruptedException {
  for (VALUEIN value : values) {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}

/**
 * Called once at the end of the task.
 */
protected void cleanup(Context context) throws IOException, InterruptedException {
  // NOTHING
}

/**
 * Advanced application writers can use the
 * {@link #run(org.apache.hadoop.mapreduce.Reducer.Context)} method to
 * control how the reduce task works.
 */
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKey()) {
    reduce(context.getCurrentKey(), context.getValues(), context);
  }
  cleanup(context);
}
The run() method then keeps reading input data and handing it to the reduce function, which is the user-defined reduce function. Finally it calls cleanup(), which does nothing by default; users can override it to do some finishing work.
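For completeness, here is a matching word-count style Reducer, again plain user code written against the new API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Called once per key with all of that key's values; here we just sum them.
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}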